RENs: Relevance Encoding Networks

by Krithika Iyer, et al.

The manifold assumption for high-dimensional data assumes that the data is generated by varying a set of parameters obtained from a low-dimensional latent space. Deep generative models (DGMs) are widely used to learn data representations in an unsupervised way. DGMs parameterize the underlying low-dimensional manifold in the data space using bottleneck architectures such as variational autoencoders (VAEs). The bottleneck dimension for VAEs is treated as a hyperparameter that depends on the dataset and is fixed at design time after extensive tuning. As the intrinsic dimensionality of most real-world datasets is unknown, often, there is a mismatch between the intrinsic dimensionality and the latent dimensionality chosen as a hyperparameter. This mismatch can negatively contribute to the model performance for representation learning and sample generation tasks. This paper proposes relevance encoding networks (RENs): a novel probabilistic VAE-based framework that uses the automatic relevance determination (ARD) prior in the latent space to learn the data-specific bottleneck dimensionality. The relevance of each latent dimension is directly learned from the data along with the other model parameters using stochastic gradient descent and a reparameterization trick adapted to non-Gaussian priors. We leverage the concept of DeepSets to capture permutation invariant statistical properties in both data and latent spaces for relevance determination. The proposed framework is general and flexible and can be used for the state-of-the-art VAE models that leverage regularizers to impose specific characteristics in the latent space (e.g., disentanglement). With extensive experimentation on synthetic and public image datasets, we show that the proposed model learns the relevant latent bottleneck dimensionality without compromising the representation and generation quality of the samples.



1 Introduction

Due to rapidly evolving computational technologies, large amounts of unlabeled data are continuously generated. Considerable time, labor, and resources are dedicated to labeling, pre-processing, and transforming unlabeled data for real-world supervised machine learning applications. As an alternative, unsupervised representation learning algorithms extract meaningful and discriminative representations that are amenable to the downstream task in the absence of labels. Unsupervised representation learning has found applications in several domains, such as computer vision Kim et al. (2019b); Wang et al. (2020); Lin et al. (2017); Jahanian et al. (2021); Kim et al. (2021), medical image analysis Tang et al. (2017); Yadav et al. (2021); Kolyvakis et al. (2018), and natural language processing Han et al. (2021); Radford et al. (2018).

Deep generative models are widely used in unsupervised representation learning to learn informative representations of data and parameterize the underlying data manifold, enabling them to generate new samples from the distribution. Widely used methods include density estimation using flow-based models Dinh et al. (2016); Zang and Wang (2020); Stypułkowski et al. (2019), generative adversarial networks (GANs) Tanaka and Aranha (2019); Goodfellow et al. (2014), and variational autoencoders (VAEs) Kingma and Welling (2013); Rezende et al. (2014). This paper focuses on autoencoder-type architectures like VAEs that provide both representation and sample generation capabilities.

Algorithms and models for unsupervised representation learning are based on the manifold assumption, where the data is assumed to be generated by varying a set of parameters. Usually, the number of parameters is much smaller than the dimensionality of the data. For example, different facial images can be generated using a finite set of parameters such as lighting, skin color, expressions, facial features, hair, etc. According to the manifold assumption, a set of samples $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N}$ in a $D$-dimensional data space, where $\mathbf{x}_n \in \mathbb{R}^D$, is said to lie on or near a low-dimensional manifold with intrinsic dimensionality $d \ll D$, which refers to the minimum number of parameters necessary to capture the entire information content present in the representation Gong et al. (2019).

Principal component analysis (PCA) was one of the earliest methods to establish a relation between the data and the low-dimensional latent space. PCA provides a closed-form solution for determining the optimum latent space dimensionality. However, PCA is limited by linearity and imposes oversimplifying assumptions of Gaussianity on the data distribution. Moreover, traditional approaches such as PCA are not scalable to large datasets and require algorithmic treatments (e.g., online PCA Cardot and Degras (2018); Boutsidis et al. (2014)) for big data. Deep generative models overcome the limitations of PCA by providing non-linearity and scalability, using deep networks to parameterize the mapping to the latent space. In the case of deep generative models, the dimensionality of the latent space is typically defined upfront for each dataset at design time. The design process may under- or over-provision the number of dimensions for the application at hand. If the dimensionality is not predefined, this parameter is usually determined using time- and resource-consuming cross-validation. A mismatch between latent dimensionality and intrinsic dimensionality affects the quality of data representation and sample generation Rubenstein et al. (2018a); Mondal et al. (2021); Rubenstein et al. (2018b). Studying and understanding the effects of dimensionality mismatch in the latent space thus becomes imperative to using deep generative models effectively.

Figure 1: Larger latent dimensions are not always better. Fréchet inception distance (FID) scores (lower is better) and reconstruction mean square error (MSE, lower is better) of VAE and dpVAE Bhalodia et al. (2020) models with varying latent dimensionality for MNIST.

Multiple studies have targeted improving reconstruction quality, but very few have tried to analyze the impact of latent dimensionality mismatch. Work on Wasserstein autoencoders highlights that a mismatch between the latent dimensionality and the true intrinsic dimensionality leads to an infeasible optimization objective Mondal et al. (2021, 2019). For deterministic encoders and decoders, a high-capacity bottleneck can cause curling of the manifold Tolstikhin et al. (2017), whereas using a smaller bottleneck can cause lossy compression of data and deteriorate representation quality De Boom et al. (2020). For deterministic encoders, studies have concluded that larger bottleneck dimensions are not always better Mondal et al. (2019); Tolstikhin et al. (2017). For VAEs, it has been theoretically shown that increasing the bottleneck capacity beyond the intrinsic dimensionality does not improve the reconstruction quality Dai and Wipf (2019).

We propose a principled framework grounded in probabilistic modeling to identify the optimal data-specific latent dimensionality without adding new hyperparameters. To empirically motivate the proposed model, we performed experiments with vanilla VAE Kingma and Welling (2019) and dpVAE Bhalodia et al. (2020) to analyze the impact of latent dimensionality on the representation learning and sample generation tasks (see Figure 1). The two models were trained on the MNIST dataset with varying latent dimensions, while other parameters were kept the same. The Fréchet inception distance (FID) score was used as the generation metric, and mean squared error (MSE) as the representation metric. The FID and MSE curves indicate inferior performance when the bottleneck size is under-provisioned. MSE performance improves with the bottleneck size, but FID scores suffer at higher dimensions such as 64 and 128. The lower MSE and the increase in FID score at a large bottleneck size indicate over-fitting and loss of generalization. We can conclude that larger dimensions are not guaranteed to produce the best-performing models. Hence, we need automated ways of identifying the dimensionality mismatch and informing the model about the intrinsic dimensionality of a given dataset. Our contributions are as follows:

  1. Introduce relevance encoding networks (RENs): a framework that facilitates the training of VAEs using a unified formulation to parameterize the data distribution and detect latent-intrinsic dimensionality mismatch. The formulation is general and can be adapted to state-of-the-art VAE-based methods such as regularized VAEs. The framework also provides a PCA-like ordering of the latent dimensions that conveys the variance of each latent dimension supported by the data manifold.

  2. Derive the evidence lower bound (ELBO) for RENs in the case of vanilla VAEs and decoupled-prior VAEs Bhalodia et al. (2020), which leverage an invertible bottleneck to improve the matching of the aggregate posterior with the latent prior.

  3. Use σ-VAE Rybkin et al. (2021) to calibrate the RENs decoder and reduce the need to tune the weight on the likelihood term of the VAE ELBO.

  4. Demonstrate the ability of relevance encoding networks to detect the relevant bottleneck dimensionality for three public image datasets without compromising representation and generation quality, and with no additional hyperparameter tuning.

2 Related Work

VAE Kingma and Welling (2019) is a latent variable model specified by an encoder, a decoder, and a prior distribution on the latent space. The encoder maps the input to the latent space (inference), while the decoder reconstructs the original input from the latent space (representation). The prior enables sample generation from a tractable probabilistic distribution Doersch (2016). Several studies have suggested that using a learnable prior can improve the performance of VAEs and reduce the impact of dimensionality mismatch Xu et al. (2020, 2019); Tomczak and Welling (2018); Bhalodia et al. (2020). Dai and Wipf rigorously analyzed the VAE objective under various scenarios of dimensionality mismatch Dai and Wipf (2019). A critical conclusion (see Theorem 5 in Dai and Wipf (2019)) states that optimal reconstruction can be achieved when the latent bottleneck dimensionality matches the intrinsic dimensionality, and that increasing the bottleneck capacity may negatively impact the generation process.

Very few methods have demonstrated ways of identifying and handling latent-intrinsic dimensionality mismatch. De Boom et al. illustrated the use of Generalized ELBO with Constrained Optimization (GECO) and the L0-augment-REINFORCE-merge (L0-ARM) gradient estimator Li and Ji (2019); Yin and Zhou (2018) to automatically shrink the latent bottleneck dimensionality of VAEs De Boom et al. (2020). The L0 norm was applied to a global binary gating vector that controlled the latent dimensionality. GECO was used to define a constraint on the reconstruction error, giving it more weight during optimization until the desired level of accuracy is reached; once the threshold is reached, narrowing the bottleneck is given priority. Kim et al. proposed relevance factor VAE Kim et al. (2019a), which infers relevance and disentanglement using total correlation. Although the models proposed by De Boom et al. and Kim et al. identify the inactive latent dimensions and eliminate them in the variational posterior distribution, the prior is still an isotropic standard normal, which can result in poor representation and generation quality.

Heim et al. proposed the relevance determination in ODEs (Rodent) model Heim et al. (2019), which showed the use of automatic relevance determination (ARD) priors Tipping (1999); Bishop and Tipping (2013) to minimize the state size of an ordinary differential equation (ODE) and the number of nonzero parameters required to solve the problem using partial observations. They used a VAE-like architecture, where the encoder was a neural network and the decoder was an ODE solver. An isotropic Gaussian was used with an ARD prior in the latent space, and a point estimate was used for the variance of the prior distribution. The Rodent model formulation is not fully probabilistic and focuses only on solving ODEs.

For autoencoders, several studies (e.g., Rubenstein et al. (2018a, b)) have analyzed how deterministic and random encoder-decoder pairs perform in the presence of latent-intrinsic dimensionality mismatch. Studies by Rubenstein et al. revealed that deterministic encoders start curling the manifold in the latent space when the latent dimensionality is higher than the intrinsic dimensionality, while random encoders fill the irrelevant dimensions with noise while encoding useful information in the latent space. Random encoders start behaving like deterministic encoders if the dimensionality is increased further. Both deterministic and random encoders exhibit poor sample generation performance with an increase in the volume of the holes in the latent space. Mondal et al. studied the effect of dimensionality mismatch in the case of deterministic autoencoders Mondal et al. (2019, 2021). Mathematically and empirically, Mondal et al. show that having a fixed prior distribution, oblivious to the dimensionality of the true latent space, leads to optimization infeasibility, and propose masked adversarial autoencoders (MAAE) Mondal et al. (2019) as a potential solution. MAAE introduces modifications to the autoencoder architecture to infer a mask at the end of the encoder that suppresses noisy latent dimensions.

Existing approaches that identify the relevant dimensions introduce more hyperparameters that have to be tuned to identify the bottleneck size. Hence, the complexity of finding the optimum latent dimension remains the same. Moreover, these methods treat relevance determination as a separate task, agnostic to the probabilistic formulation of deep generative models, making the solution less interpretable. The RENs framework facilitates training of VAEs using a unified probabilistic formulation to parameterize the data distribution and detect latent-intrinsic dimensionality mismatch without adding new hyperparameters.

3 Background

Notation: We denote a set of $N$ observations in a $D$-dimensional data space as $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N}$ and their corresponding latent representations as $\mathbf{Z} = \{\mathbf{z}_n\}_{n=1}^{N}$. A representation learning model maps an observation $\mathbf{x}_n \in \mathbb{R}^D$ to an unobserved latent representation $\mathbf{z}_n \in \mathbb{R}^L$ in an $L$-dimensional latent space, where $L \ll D$. Hereafter, we use boldface lowercase letters to denote vectors, boldface uppercase letters to denote matrices, and non-bold lowercase letters to denote scalars.

Variational Autoencoders (VAEs): VAEs are latent variable models that learn data representations in an unsupervised way by matching the learned model distribution $p_\theta(\mathbf{x})$ to the true data distribution $p(\mathbf{x})$. The generative (i.e., decoder) and inference (i.e., encoder) models in VAEs are jointly trained to maximize a tractable lower bound on the marginal log-likelihood of the training data. The structure of the learned latent representation is controlled by imposing a prior distribution on the latent space, such as $p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$.

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - \mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$$

where $\theta$ denotes the generative model parameters, $\phi$ denotes the inference model parameters, and $q_\phi(\mathbf{z} \mid \mathbf{x})$ is the variational posterior distribution that approximates the true posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$, with $p_\theta(\mathbf{x} \mid \mathbf{z})$ the decoding distribution and $\mathrm{KL}(\cdot \,\|\, \cdot)$ the Kullback-Leibler divergence.
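For a diagonal Gaussian posterior and a standard normal prior, the KL term in this bound has a familiar closed form. As an illustrative sketch (toy vectors stand in for encoder outputs; this is not the paper's code):

```python
import math

def kl_diag_gaussian_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form:
    0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )

# A posterior that exactly matches the prior incurs zero KL penalty.
print(kl_diag_gaussian_to_standard([0.0, 0.0], [0.0, 0.0]))  # 0.0
```

The KL penalty grows as the posterior mean drifts from zero or its variance departs from one, which is the pressure the prior exerts on the latent code.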

dpVAE: Decoupled Prior for VAE: Maximizing the ELBO does not guarantee good representations. With expressive generative models $p_\theta(\mathbf{x} \mid \mathbf{z})$, a VAE can ignore the latent representation and not encode any information about the data, yet still maximize the ELBO Hoffman and Johnson (2016); Alemi et al. (2018); Chen et al. (2016); this is the information preference phenomenon of VAEs. It has been shown that data-driven (i.e., learned during training) priors help mitigate the information preference of VAEs Hoffman and Johnson (2016); Rosca et al. (2018); Xu et al. (2019). Specifically, dpVAE Bhalodia et al. (2020) decouples the latent space that performs the representation from the space that drives sample generation using a bijective mapping $\mathbf{z} = f_\beta(\mathbf{z}_0)$ parameterized by the network parameters $\beta$, where $f_\beta: \mathbb{R}^L \rightarrow \mathbb{R}^L$. Affine coupling layers Dinh et al. (2016) are used to build a flexible bijection by stacking a sequence of simple bijection blocks. The $k$-th block maps its input $\mathbf{x}$ to an output $\mathbf{y}$ as

$$y_l = m^{(k)}_l x_l + \left(1 - m^{(k)}_l\right)\left(x_l \exp\!\left(s^{(k)}(\mathbf{m}^{(k)} \odot \mathbf{x})_l\right) + t^{(k)}(\mathbf{m}^{(k)} \odot \mathbf{x})_l\right)$$

where $s^{(k)}$ and $t^{(k)}$ are the scaling and translation networks of the $k$-th block, $\mathbf{m}^{(k)}$ is the binary mask used to partition the input of the $k$-th block, and the subscript $l$ denotes the $l$-th element of a vector.
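A minimal sketch of one such coupling block (the `s` and `t` callables below are toy stand-ins for the learned scaling and translation networks, not the dpVAE implementation). Because the masked dimensions pass through unchanged and condition the rest, the block is analytically invertible:

```python
import math

def affine_coupling_forward(x, mask, s, t):
    """One RealNVP-style affine coupling block: dimensions with mask=1 pass
    through unchanged; the rest are scaled and translated conditioned on them."""
    x_masked = [m * xi for m, xi in zip(mask, x)]
    return [
        m * xi + (1 - m) * (xi * math.exp(s(x_masked)[i]) + t(x_masked)[i])
        for i, (m, xi) in enumerate(zip(mask, x))
    ]

def affine_coupling_inverse(y, mask, s, t):
    """Exact inverse: the masked part of y equals the masked part of x,
    so s and t can be re-evaluated on it to undo the transformation."""
    y_masked = [m * yi for m, yi in zip(mask, y)]
    return [
        m * yi + (1 - m) * (yi - t(y_masked)[i]) * math.exp(-s(y_masked)[i])
        for i, (m, yi) in enumerate(zip(mask, y))
    ]

# Toy scaling/translation "networks" standing in for the learned ones.
s = lambda v: [0.5 * v[0], 0.5 * v[0]]
t = lambda v: [v[0] + 1.0, v[0] - 1.0]
mask = [1, 0]  # the first dimension conditions the second
x = [2.0, 3.0]
y = affine_coupling_forward(x, mask, s, t)
x_rec = affine_coupling_inverse(y, mask, s, t)  # recovers x up to float error
```

Stacking blocks with alternating masks lets every dimension be transformed while keeping both the inverse and the log-determinant of the Jacobian cheap to compute.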

Sigma Variational Autoencoders: It is common practice to consider the decoding distribution as a Gaussian with constant variance representing the data noise (tuned as a hyperparameter). When using a fixed variance, a model with high variance will not retain enough information in the latent space to faithfully reconstruct samples, and a model with low variance will generate poor samples as the divergence term becomes weaker Alemi et al. (2018); Lucas et al. (2019). The σ-VAE Rybkin et al. (2021) model is a simple yet effective solution for calibrating the decoder variance using a single learnable parameter σ: the variance of the decoder is trainable and is learned along with the rest of the model parameters Rybkin et al. (2021). This formulation reduces the time and resources required to tune the variance for each model.
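The effect of a single shared, learnable decoder variance can be seen directly in the Gaussian negative log-likelihood: for a fixed reconstruction, the NLL is minimized analytically at σ² equal to the MSE, which is the balance a calibrated decoder learns. A small sketch with made-up numbers (illustrative, not the σ-VAE implementation):

```python
import math

def gaussian_nll(x, x_hat, sigma):
    """Per-sample negative log-likelihood of a decoder N(x_hat, sigma^2 I)
    with a single shared standard deviation sigma."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return (0.5 * sq_err / sigma**2
            + d * math.log(sigma)
            + 0.5 * d * math.log(2 * math.pi))

x, x_hat = [1.0, 2.0, 3.0], [1.1, 1.8, 3.3]
mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
sigma_star = math.sqrt(mse)  # analytic minimizer of the NLL in sigma

# The calibrated sigma is never worse than an arbitrary fixed guess.
assert gaussian_nll(x, x_hat, sigma_star) <= gaussian_nll(x, x_hat, 1.0)
```

Setting σ by gradient descent (as σ-VAE does) converges toward this balance point, which is why the fixed likelihood weight no longer needs manual tuning.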

Considering the performance merits of dpVAE and σ-VAE, all our experiments implement the VAE model with the decoupling architecture of dpVAE and the σ-VAE formulation. This dovetails with the RENs objective of reducing VAE hyperparameters that require extensive tuning for each dataset while improving sample generation and representation.

4 Relevance Encoding Networks

Larger latent bottleneck sizes do not guarantee better VAE performance (Figure 1). Therefore, to inform the model about intrinsic dimensionality, we introduce an automatic relevance determination (ARD) Bishop and Tipping (2013); Tipping (1999) hyperprior over the latent space prior $p(\mathbf{z})$.

Figure 2: Graphical models for (a) REN with VAE (b) REN with dpVAE. The solid lines indicate the variational inference flow and the dotted lines indicate the generative flow for all the models. The block arrows indicate the invertible flow network.

4.1 RENs Formulation

The ARD hyperprior regularizes the latent space to discover relevant latent dimensions that are supported by the data, hence reducing the contribution of redundant dimensions. The ARD hyperprior provides the relevance of each dimension in the latent representation, and this relevance is defined via precision (i.e., the inverse of variance). The ARD hyperprior pushes the precision of the spurious dimensions to infinity; thus, the variance of these dimensions is pushed to zero in the latent space. The latent prior is given by:

$$p(\mathbf{z} \mid \boldsymbol{\alpha}) = \mathcal{N}\!\left(\mathbf{z}; \mathbf{0}, \mathrm{diag}(\boldsymbol{\alpha})^{-1}\right)$$

with the ARD hyperprior given as:

$$p(\boldsymbol{\alpha}) = \mathrm{Gamma}(\boldsymbol{\alpha}; a\mathbf{1}, b\mathbf{1}) = \prod_{l=1}^{L} \mathrm{Gamma}(\alpha_l; a, b)$$

Here, $\boldsymbol{\alpha} \in \mathbb{R}_{+}^{L}$ is the relevance of the latent dimensions, and $\mathbf{1}$ is an $L$-dimensional vector of ones. The concentration parameter $a$ and the rate parameter $b$ of the Gamma distribution are shared across all latent dimensions. The VAE prior now becomes:

$$p(\mathbf{z}) = \int p(\mathbf{z} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha})\, d\boldsymbol{\alpha}$$
We introduce a relevance encoder to the VAE architecture that learns the variational posterior $q_\eta(\boldsymbol{\alpha} \mid \mathbf{X}, \mathbf{Z})$, which approximates the true posterior $p(\boldsymbol{\alpha} \mid \mathbf{X}, \mathbf{Z})$. Here, $\eta$ denotes the parameters of the relevance encoder network. The relevance of a latent dimension is a statistical property of the underlying latent distribution that is induced by the data distribution. Consequently, relevance cannot be estimated from a single sample, but instead requires access to a finite set of representative samples from the data and latent distributions. Hence, we formulate a set-input problem, where a set of instances is given as an input and the relevance encoder parameterizes the relevance for the entire set with permutation invariance. Taking the complete dataset into consideration, the joint probability of the training data can be expressed as

$$p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\alpha}) = p(\boldsymbol{\alpha}) \prod_{n=1}^{N} p_\theta(\mathbf{x}_n \mid \mathbf{z}_n)\, p(\mathbf{z}_n \mid \boldsymbol{\alpha})$$

and the probability of the training data is $p(\mathbf{X}) = \iint p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\alpha})\, d\mathbf{Z}\, d\boldsymbol{\alpha}$. See Figure 2 and Figure 3 for the plate notation of the graphical model and the block diagrams of the architectures, respectively.
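Ancestral sampling from this hierarchical prior is straightforward: draw a precision per dimension from the Gamma hyperprior, then draw latents with the corresponding variances. A small illustrative sketch (the hyperparameter values are arbitrary, not the paper's):

```python
import random

def sample_ard_prior(L, a, b, n_samples, rng):
    """Ancestral sampling from the hierarchical ARD prior:
    alpha_l ~ Gamma(concentration=a, rate=b) per latent dimension,
    then z ~ N(0, diag(alpha)^-1)."""
    # random.gammavariate takes (shape, scale); scale = 1 / rate.
    alpha = [rng.gammavariate(a, 1.0 / b) for _ in range(L)]
    z = [[rng.gauss(0.0, (1.0 / al) ** 0.5) for al in alpha]
         for _ in range(n_samples)]
    return alpha, z

rng = random.Random(0)
alpha, z = sample_ard_prior(L=4, a=2.0, b=1.0, n_samples=1000, rng=rng)
# Dimensions that draw a large precision alpha_l get correspondingly
# small variance, i.e. they are squeezed toward zero.
```

The empirical variance of each latent coordinate tracks $1/\alpha_l$, which is exactly the mechanism RENs exploit to suppress spurious dimensions.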

Samples of the Gamma distribution are reparameterized to enable gradient flow and network training in the presence of probabilistic layers. The derivatives are computed using the implicit reparameterization approach Figurnov et al. (2018). This reparameterization is implemented in the TensorFlow Probability library.
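TensorFlow Probability's Gamma distribution supplies these implicit gradients. For intuition, the explicit counterpart for a Gaussian, the standard reparameterization trick, can be sketched in a few lines as a Monte-Carlo check (illustrative only, not the paper's code):

```python
import random

def pathwise_grad_mu(mu, sigma, n, rng):
    """Monte-Carlo estimate of d/dmu E[z^2] for z ~ N(mu, sigma^2), using the
    explicit reparameterization z = mu + sigma * eps with eps ~ N(0, 1).
    Differentiating through the sample gives d(z^2)/dmu = 2z, averaged over
    draws; the analytic gradient is 2 * mu."""
    return sum(2 * (mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n)) / n

rng = random.Random(0)
est = pathwise_grad_mu(mu=1.5, sigma=0.3, n=200_000, rng=rng)
# est converges to the analytic value 2 * 1.5 = 3.0
```

For distributions like the Gamma, no such explicit standardization exists, which is why the implicit approach of Figurnov et al. (2018) is needed to differentiate through the relevance samples.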


Figure 3: Block diagram of (a) relevance encoding networks (REN) with VAE and (b) REN with dpVAE. The model consists of an encoder ($\phi$), a decoder ($\theta$), and a relevance encoder ($\eta$). The relevance encoding network infers the variational posterior $q_\eta(\boldsymbol{\alpha} \mid \mathbf{X}, \mathbf{Z})$. (c) The relevance encoder broken down into its constituent parts. It consists of two feature extractors: one for the data and the other for the latent representation. The features are combined and passed to the DeepSets aggregator. The aggregated feature is fed to the final relevance encoder that approximates the concentration and rate of the Gamma distribution.

Derivations for the ELBO can be found in Appendix A.1. The ELBO for VAE is:

$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi, \eta) = \mathbb{E}_{q_\eta(\boldsymbol{\alpha} \mid \mathbf{X}, \mathbf{Z})}\!\left[\sum_{n=1}^{N}\left(\mathbb{E}_{q_\phi(\mathbf{z}_n \mid \mathbf{x}_n)}\!\left[\log p_\theta(\mathbf{x}_n \mid \mathbf{z}_n)\right] - \mathrm{KL}\!\left(q_\phi(\mathbf{z}_n \mid \mathbf{x}_n) \,\|\, p(\mathbf{z}_n \mid \boldsymbol{\alpha})\right)\right)\right] - \mathrm{KL}\!\left(q_\eta(\boldsymbol{\alpha} \mid \mathbf{X}, \mathbf{Z}) \,\|\, p(\boldsymbol{\alpha})\right)$$

The ELBO for dpVAE is:

$$\mathcal{L}_{\mathrm{dpVAE}}(\theta, \phi, \eta, \beta) = \mathbb{E}_{q_\eta(\boldsymbol{\alpha} \mid \mathbf{X}, \mathbf{Z})}\!\left[\sum_{n=1}^{N}\left(\mathbb{E}_{q_\phi(\mathbf{z}_n \mid \mathbf{x}_n)}\!\left[\log p_\theta(\mathbf{x}_n \mid \mathbf{z}_n) + \log p\!\left(f_\beta^{-1}(\mathbf{z}_n) \mid \boldsymbol{\alpha}\right) + \log\left|\det \frac{\partial f_\beta^{-1}(\mathbf{z}_n)}{\partial \mathbf{z}_n}\right|\right] + \mathbb{H}\!\left[q_\phi(\mathbf{z}_n \mid \mathbf{x}_n)\right]\right)\right] - \mathrm{KL}\!\left(q_\eta(\boldsymbol{\alpha} \mid \mathbf{X}, \mathbf{Z}) \,\|\, p(\boldsymbol{\alpha})\right)$$

where $\mathbb{H}[\cdot]$ denotes differential entropy.
4.2 Network and Training Strategies

The ideal scenario would be feeding the relevance encoder all the training samples at once to generate a robust estimate of the relevance. However, using the entire dataset to estimate relevance negatively impacts the scalability provided by stochastic gradient descent training for large datasets. We therefore train RENs via stochastic gradient descent and alternating optimization with two batch sizes. Each batch in the training dataset is broken down into smaller sub-batches. The alternating optimization proceeds in two steps: (i) the relevance encoder is kept fixed (i.e., not trainable), and the smaller sub-batches are used to update the VAE encoder $\phi$ and decoder $\theta$ while keeping the $\boldsymbol{\alpha}$ obtained from the previous iteration fixed; (ii) the original large batch is used to update the entire network end-to-end (i.e., encoder $\phi$, decoder $\theta$, and relevance encoder $\eta$), and the $\boldsymbol{\alpha}$ value is updated. As mentioned in Section 4.1, we use a set formulation for the relevance encoder, wherein the relevance encoder is fed a set of data samples and their latent representations, and a single $\boldsymbol{\alpha}$ is inferred. The response of the relevance encoder should be invariant to the ordering of the samples in the given batch. Therefore, we use DeepSets Zaheer et al. (2017) to make the relevance encoder permutation invariant for a given batch (see Figure 3(c)). Using this alternating optimization and the DeepSets aggregator, REN is encouraged to learn the global statistics of latent representations induced by the data distribution. See Algorithm 1 for more details.
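The permutation invariance supplied by DeepSets comes from pooling per-element embeddings with a symmetric operation (here, a mean). A toy sketch (the `phi` and `rho` lambdas stand in for the learned embedding and head networks, and the pairs stand in for (data, latent) features):

```python
def deepsets_relevance_features(pairs, phi, rho):
    """DeepSets-style aggregator: embed each (data, latent) pair with phi,
    mean-pool across the set, then map the pooled feature with rho.
    Mean pooling makes the output invariant to the batch ordering."""
    embedded = [phi(p) for p in pairs]
    pooled = [sum(col) / len(embedded) for col in zip(*embedded)]
    return rho(pooled)

# Toy per-element embedding and head standing in for the learned networks.
phi = lambda p: [p[0] + p[1], p[0] * p[1]]
rho = lambda v: [2 * v[0], v[1] - 1]

batch = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
shuffled = [batch[2], batch[0], batch[1]]
assert deepsets_relevance_features(batch, phi, rho) == \
       deepsets_relevance_features(shuffled, phi, rho)
```

Any symmetric pooling (sum, mean, max) preserves this invariance; the aggregated feature then parameterizes a single concentration and rate for the whole set, matching the set-input formulation above.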

5 Experiments

5.1 Toy Datasets

We use the circle and one-moon datasets for proof-of-concept experiments, both of which exhibit an intrinsic dimensionality of one. We generated three different datasets for the circle and one-moon distributions by varying the standard deviation of the zero-mean additive Gaussian noise to mimic data with different noise levels. For all our experiments, we implemented the relevance encoder with dpVAE Bhalodia et al. (2020) and the σ-VAE framework Rybkin et al. (2021), hereafter referenced as R-dpVAE. We tested the ability of the models to identify the intrinsic dimensionality along with their sample reconstruction capacity and realistic sample generation capability. The R-dpVAE model was compared with relevance factor VAE (RF-VAE) Kim et al. (2019a) and masked adversarial autoencoders (MAAE) Mondal et al. (2019).

Figure 4 shows the results for the one-moon and circle datasets with additive Gaussian noise of zero mean and 10% of the radius of the data manifold as standard deviation. Compared to RF-VAE and MAAE, R-dpVAE can regularize the latent space to discover the latent dimension relevance supported by the data while achieving the lowest mean square errors on the testing samples and suppressing the spurious dimension (Figures 4.2 and 4.6). For R-dpVAE, the variance in the latent space is indicative of the relevance shown in the plots (Figures 4.1c and 4.5c). For one-moon, the relevance estimated by R-dpVAE for zdim1 and zdim2 corresponds to the x- and y-axes in the latent space. Therefore, the x-axis with low relevance has a larger variance, and the y-axis with high relevance has a low variance, correctly capturing a low-dimensional manifold where latent dimensionality equals the intrinsic dimensionality.

Although the performance of RF-VAE (Figures 4.3 and 4.7) comes close to R-dpVAE in terms of the estimated relevance, it fails to generate good-quality samples, as the relevance is not factored into the aggregate posterior and the latent prior is still a standard normal distribution. MAAE (Figures 4.4 and 4.8) identifies the latent manifold but has a higher reconstruction error and generates poor-quality samples due to the weak nature of the regularization in the latent space; holes can be seen in the latent space (Figures 4.4b and 4.8b). Across all experiments, R-dpVAE models provide a tighter distribution of the reconstruction error than the other methods. R-dpVAE is consistently the best-performing model, even in the presence of higher noise levels that make learning the underlying one-dimensional manifold more challenging. Results with different noise levels can be found in Appendix A.3.

Figure 4: Reconstruction and sample generation for one-moon and circle at 10% noise level.

5.2 Image Dataset

We experimented with three image datasets: MNIST, Fashion MNIST, and dSprites. Similar to the toy experiments, we compared the performance of R-dpVAE with relevance factor VAE (RF-VAE) Kim et al. (2019a) and masked adversarial autoencoders (MAAE) Mondal et al. (2019), with the addition of VAE regularized with L0-ARM and GECO (henceforth referenced as GECO) De Boom et al. (2020). To set baselines for comparison, we implemented vanilla VAE with the σ-VAE framework and dpVAE with the σ-VAE framework, sans the relevance. The supplementary material includes details on all the models' implementations, architectures, and hyperparameters. We consider the following quantitative metrics to evaluate and compare the models: (1) Fréchet Inception Distance (FID): this metric calculates the distance between feature vectors of the real and generated images Heusel et al. (2017); lower FID scores are better. (2) Mean Squared Error (MSE): the MSE values reported for reconstructed images are averaged over dimensions and sample size. (3) Latent Dimensionality ($\hat{L}$): the estimated latent bottleneck size $\hat{L} \leq L$, where $L$ is the provisioned bottleneck size of the model.

Although the ground truth intrinsic dimensionality of image datasets is not known, we use previous studies Kim et al. (2019a); Mondal et al. (2019, 2021); De Boom et al. (2020) as points of reference. The design choices of latent dimensions in the experiments are also motivated by the latent dimensions reported in the relevant literature. For MNIST, studies reported latent dimensionalities between 7 and 10. Hence, we chose 16 as the base dimensionality to provide the models with enough degrees of freedom to discover the relevant latent dimensions and 32 as the over-provisioned model to assess the impact of a significant mismatch with the intrinsic dimensionality. Similarly, for Fashion MNIST, the choices were 32 and 64 for base and over-provisioned dimensionalities, and for dSprites (known to have 6 factors of variation Matthey et al. (2017)), the choices were 10 and 15. The MAAE and GECO models provide the number of active dimensions, whereas the RF-VAE model provides a relevance vector with values from 0 to 1. For RF-VAE, we estimate the dimensionality by calculating the number of dimensions with relevance values higher than the average of the vector. In the case of R-dpVAE, the relevance estimated by the relevance encoding network is the inverse variance as per Eq. 3. We compute the explained variance as the ratio of the variance in a single dimension to the sum of all the variances. The number of dimensions required to explain 95% of the variability is considered the detected latent bottleneck dimensionality.
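This 95%-of-variance rule can be sketched directly from a relevance vector (one precision per latent dimension); the relevance values below are made up for illustration:

```python
def detect_bottleneck_dim(relevance, threshold=0.95):
    """Given per-dimension relevance (precision, i.e. inverse variance),
    convert to variances and count how many dimensions, taken in decreasing
    order of variance, are needed to explain `threshold` of the total."""
    variances = sorted((1.0 / r for r in relevance), reverse=True)
    total = sum(variances)
    cumulative, k = 0.0, 0
    for v in variances:
        cumulative += v
        k += 1
        if cumulative / total >= threshold:
            return k
    return len(variances)

# Two data-supported dimensions (small precision, large variance) and two
# suppressed ones (huge precision, near-zero variance):
print(detect_bottleneck_dim([0.1, 0.125, 200.0, 500.0]))  # 2
```

Dimensions whose precision has been driven toward infinity by the ARD hyperprior contribute negligible variance and therefore never enter the count.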

Table 1 summarizes the FID scores of randomly generated images and the latent dimensionality estimated by each model for all three datasets. The proposed model (R-dpVAE) achieves lower FID scores on all three datasets compared to the other models while estimating bottleneck dimensionality in the same range as the other methods. R-dpVAE also achieves better FID scores than the baseline models that do not perform relevance determination. This performance boost via relevance encoding bolsters our argument for the necessity of identifying and fixing the latent-intrinsic dimensionality mismatch. Similar to the findings from the toy experiments, the models that do not modify the posterior based on the learned relevance of the latent space exhibit inferior sample generation, reflected in the higher FID scores of RF-VAE, GECO, and MAAE. The MAAE and GECO models show consistency in determining the effective bottleneck size for MNIST and dSprites, irrespective of the provisioned bottleneck size of the model. Although the R-dpVAE model estimates the bottleneck size in the same range as MAAE and GECO and shows better generation capabilities, R-dpVAE does not estimate the same size across provisioned bottleneck sizes for the same dataset. This behavior could be attributed to the use of random samples in the mini-batch used for updating the relevance encoder, which may not capture the entire data variation. Using a stratified sampling approach to generate the mini-batches could be a potential solution.

Table 2 summarizes the MSE on the testing (i.e., held-out) images for each model for all three datasets. While R-dpVAE does not achieve as low an MSE as the other models, the overall performance in conjunction with the FID score indicates that R-dpVAE models generalize well with the use of calibrated decoders and relevance encoders. The use of a fixed weight on the reconstruction term in models such as RF-VAE, MAAE, and GECO is hypothesized to cause the lower MSE and higher FID scores. Figure 5 shows the reconstructed and generated images for the best-performing models across different $L$ values for each method.

MNIST | Fashion MNIST | dSprites
Model FID $\hat{L}$ | Model FID $\hat{L}$ | Model FID $\hat{L}$
R-dpVAE 6.57 6 | R-dpVAE 48.61 10 | R-dpVAE 67.93 7
R-dpVAE 4.43 11 | R-dpVAE 52.68 26 | R-dpVAE 52.15 9
RF-VAE 184.38 9 | RF-VAE 258.37 27 | RF-VAE 144.89 8
RF-VAE 208.27 21 | RF-VAE 272.87 52 | RF-VAE 167.24 12
GECO 29.21 11 | GECO 69.67 8 | GECO 97.86 5
GECO 30.29 9 | GECO 67.21 10 | GECO 94.61 5
MAAE 13.81 11 | MAAE 104.15 6 | MAAE 105.89 7
MAAE 13.04 11 | MAAE 79.18 5 | MAAE 90.85 11
dpVAE 4.62 16 | dpVAE 52.08 32 | dpVAE 61.99 10
dpVAE 5.53 32 | dpVAE 52.38 64 | dpVAE 48.68 15
VAE 8.83 16 | VAE 74.40 32 | VAE 82.667 10
VAE 9.65 32 | VAE 70.18 64 | VAE 81.86 15
Table 1: Generative metric FID (lower is better) for MNIST, Fashion MNIST, and dSprites, and the bottleneck dimensionality identified by each model. FID = Fréchet Inception Distance, $\hat{L}$ = identified latent dimensionality.
MNIST              Fashion MNIST      dSprites
Model      MSE     Model      MSE     Model          MSE
R-dpVAE    3.28    R-dpVAE    1.21    R-dpVAE (10)   1.79
R-dpVAE    3.55    R-dpVAE    1.17    R-dpVAE (15)   1.35
RF-VAE     1.82    RF-VAE     1.04    RF-VAE         3.66
RF-VAE     1.83    RF-VAE     1.00    RF-VAE         3.65
GECO       1.22    GECO       1.06    GECO           2.45
GECO       1.24    GECO       1.17    GECO           3.83
MAAE       1.84    MAAE       2.10    MAAE           1.13
MAAE       1.80    MAAE       2.56    MAAE           1.28
dpVAE      3.15    dpVAE      1.24    dpVAE          1.83
dpVAE      3.22    dpVAE      1.19    dpVAE          9.67
VAE        4.59    VAE        1.36    VAE            2.95
VAE        2.63    VAE        1.35    VAE            1.14
Table 2: Reconstruction error MSE (lower is better) of the testing samples for MNIST, Fashion MNIST, and dSprites. MSE = Mean Squared Error; parenthesized values denote the provisioned bottleneck size.
Figure 5: Results of the best-performing model configuration from Table 1 for each method. We show (a) original images, (b) reconstructed images, and (c) randomly generated images (no cherry-picking).

6 Conclusion

Latent dimensionality mismatch can have a detrimental effect on the performance of deep generative models such as VAEs. We have introduced relevance encoding networks (RENs) to identify this mismatch and inform the model about the relevant bottleneck size. The RENs framework facilitates training VAEs using a unified probabilistic formulation that parameterizes the data distribution and detects the latent-intrinsic dimensionality mismatch. A key feature of the RENs framework is that it requires no extra hyperparameter tuning for relevance determination, and it provides a PCA-like ranking of the latent dimensions based on the learned, data-specific relevance. The proposed model is general and flexible enough to be incorporated into state-of-the-art VAE-based models, including regularized variants of VAEs. Future directions include extending the formulation of RENs toward explainable VAEs.


  • A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy (2018) Fixing a broken elbo. In International Conference on Machine Learning, pp. 159–168. Cited by: §3, §3.
  • R. Bhalodia, I. Lee, and S. Elhabian (2020) DpVAEs: fixing sample generation for regularized vaes. In Proceedings of the Asian Conference on Computer Vision, Cited by: §A.1, Figure 1, item 2, §1, §2, §3, §5.1.
  • C. M. Bishop and M. Tipping (2013) Variational relevance vector machines. arXiv preprint arXiv:1301.3838. Cited by: §2, §4.
  • C. Boutsidis, D. Garber, Z. Karnin, and E. Liberty (2014) Online principal components analysis. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pp. 887–901. Cited by: §1.
  • H. Cardot and D. Degras (2018) Online principal component analysis in high dimension: which algorithm to choose?. International Statistical Review 86 (1), pp. 29–50. Cited by: §1.
  • X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel (2016) Variational lossy autoencoder. arXiv preprint arXiv:1611.02731. Cited by: §3.
  • B. Dai and D. Wipf (2019) Diagnosing and enhancing vae models. arXiv preprint arXiv:1903.05789. Cited by: §1, §2, §5.1.
  • C. De Boom, S. Wauthier, T. Verbelen, and B. Dhoedt (2020) Dynamic narrowing of vae bottlenecks using geco and l0 regularization. arXiv preprint arXiv:2003.10901. Cited by: §1, §2, §5.2, §5.2.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §1, §3.
  • C. Doersch (2016) Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. Cited by: §2.
  • M. Figurnov, S. Mohamed, and A. Mnih (2018) Implicit reparameterization gradients. Advances in Neural Information Processing Systems 31. Cited by: §4.1.
  • S. Gong, V. N. Boddeti, and A. K. Jain (2019) On the intrinsic dimensionality of image representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3987–3996. Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27. Cited by: §1.
  • J. M. Han, I. Babuschkin, H. Edwards, A. Neelakantan, T. Xu, S. Polu, A. Ray, P. Shyam, A. Ramesh, A. Radford, et al. (2021) Unsupervised neural machine translation with generative language models only. arXiv preprint arXiv:2110.05448. Cited by: §1.
  • N. Heim, V. Šmídl, and T. Pevnỳ (2019) Rodent: relevance determination in differential equations. arXiv preprint arXiv:1912.00656. Cited by: §2.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30. Cited by: §5.2.
  • M. D. Hoffman and M. J. Johnson (2016) Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS. Cited by: §3.
  • A. Jahanian, X. Puig, Y. Tian, and P. Isola (2021) Generative models as a data source for multiview representation learning. arXiv preprint arXiv:2106.05258. Cited by: §1.
  • M. Kim, Y. Wang, P. Sahu, and V. Pavlovic (2019a) Relevance factor vae: learning and identifying disentangled factors. arXiv preprint arXiv:1902.01568. Cited by: §2, §5.1, §5.2, §5.2.
  • S. Kim, S. Kim, and J. Lee (2021) Hybrid generative-contrastive representation learning. arXiv preprint arXiv:2106.06162. Cited by: §1.
  • T. Kim, M. Jeong, S. Kim, S. Choi, and C. Kim (2019b) Diversify and match: a domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12456–12465. Cited by: §1.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.
  • D. P. Kingma and M. Welling (2019) An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691. Cited by: §1, §2.
  • P. Kolyvakis, A. Kalousis, B. Smith, and D. Kiritsis (2018) Biomedical ontology alignment: an approach based on representation learning. Journal of biomedical semantics 9 (1), pp. 1–20. Cited by: §1.
  • Y. Li and S. Ji (2019) L0 -arm: network sparsification via stochastic binary optimization. arXiv preprint arXiv:1904.04432. Cited by: §2.
  • D. Lin, K. Fu, Y. Wang, G. Xu, and X. Sun (2017) MARTA gans: unsupervised representation learning for remote sensing image classification. IEEE Geoscience and Remote Sensing Letters 14 (11), pp. 2092–2096. Cited by: §1.
  • J. Lucas, G. Tucker, R. B. Grosse, and M. Norouzi (2019) Don’t blame the elbo! a linear vae perspective on posterior collapse. Advances in Neural Information Processing Systems 32. Cited by: §3.
  • L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner (2017) DSprites: disentanglement testing sprites dataset. Cited by: §5.2.
  • A. K. Mondal, H. Asnani, P. Singla, and A. Prathosh (2021) FlexAE: flexibly learning latent priors for wasserstein auto-encoders. In Uncertainty in Artificial Intelligence, pp. 525–535. Cited by: §1, §2, §5.2.
  • A. K. Mondal, S. P. Chowdhury, A. Jayendran, P. Singla, H. Asnani, and P. AP (2019) MaskAAE: latent space optimization for adversarial auto-encoders. arXiv preprint arXiv:1912.04564. Cited by: §1, §2, §5.1, §5.2, §5.2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pp. 1278–1286. Cited by: §1.
  • M. Rosca, B. Lakshminarayanan, and S. Mohamed (2018) Distribution matching in variational inference. arXiv preprint arXiv:1802.06847. Cited by: §3.
  • P. K. Rubenstein, B. Schoelkopf, and I. Tolstikhin (2018a) On the latent space of wasserstein auto-encoders. arXiv preprint arXiv:1802.03761. Cited by: §1, §2.
  • P. K. Rubenstein, B. Schoelkopf, and I. Tolstikhin (2018b) Wasserstein auto-encoders: latent dimensionality and random encoders. Cited by: §1, §2.
  • O. Rybkin, K. Daniilidis, and S. Levine (2021) Simple and effective vae training with calibrated decoders. In International Conference on Machine Learning, pp. 9179–9189. Cited by: item 3, §3.
  • M. Stypułkowski, M. Zamorski, M. Zięba, and J. Chorowski (2019) Conditional invertible flow for point cloud generation. arXiv preprint arXiv:1910.07344. Cited by: §1.
  • F. H. K. d. S. Tanaka and C. Aranha (2019) Data augmentation using gans. arXiv preprint arXiv:1904.09135. Cited by: §1.
  • Q. Tang, Y. Liu, and H. Liu (2017) Medical image classification via multiscale representation learning. Artificial Intelligence in Medicine 79, pp. 71–78. Cited by: §1.
  • M. Tipping (1999) The relevance vector machine. Advances in neural information processing systems 12. Cited by: §2, §4.
  • I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf (2017) Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558. Cited by: §1.
  • J. Tomczak and M. Welling (2018) VAE with a vampprior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223. Cited by: §2.
  • J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al. (2020) Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence 43 (10), pp. 3349–3364. Cited by: §1.
  • H. Xu, W. Chen, J. Lai, Z. Li, Y. Zhao, and D. Pei (2019) On the necessity and effectiveness of learning the prior of variational auto-encoder. arXiv preprint arXiv:1905.13452. Cited by: §2, §3.
  • H. Xu, W. Chen, J. Lai, Z. Li, Y. Zhao, and D. Pei (2020) Shallow vaes with realnvp prior can perform as well as deep hierarchical vaes. In International Conference on Neural Information Processing, pp. 650–659. Cited by: §2.
  • P. Yadav, N. Menon, V. Ravi, and S. Vishvanathan (2021) Lung-gans: unsupervised representation learning for lung disease classification using chest ct and x-ray images. IEEE Transactions on Engineering Management. Cited by: §1.
  • M. Yin and M. Zhou (2018) ARM: augment-reinforce-merge gradient for stochastic binary networks. arXiv preprint arXiv:1807.11143. Cited by: §2.
  • M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. Advances in neural information processing systems 30. Cited by: §4.2.
  • C. Zang and F. Wang (2020) MoFlow: an invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 617–626. Cited by: §1.

Appendix A Appendix

a.1 Evidence Lower Bound for RENs

From Equation 3 and Equation 4,


Considering the graphical model in Figure 3(d), the variational posterior is specified by

We begin by defining the marginal likelihood of the training data,

The ELBO lower-bounds the marginal log-likelihood of the training data:

Applying Jensen's inequality,
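For a plain VAE with encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, and prior $p(z)$, these steps take the familiar form; this is only a sketch of the standard derivation, with the REN-specific relevance variables omitted:

```latex
\log p_\theta(x)
  = \log \int p_\theta(x \mid z)\, p(z)\, dz
  = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right]
  \geq \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]}_{\text{reconstruction}}
     \;-\; \underbrace{\mathrm{KL}\!\left( q_\phi(z \mid x) \,\middle\|\, p(z) \right)}_{\text{regularization}},
```

where the inequality follows from Jensen's inequality applied to the concave logarithm.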

The first term is the reconstruction loss:

The second term can be further simplified as follows,


The ELBO for REN with VAEs:


For the basic dpVAE formulation, please refer to [2]. For dpVAE with REN, the joint likelihood changes to,

The second term for dpVAE with REN formulation becomes:


The ELBO for REN with dpVAE:


a.2 Training Algorithm

Input: Training dataset
Networks: Encoder, Decoder, Relevance encoder
Hyper-parameters: VAE learning rate, REN learning rate, total number of epochs, number of burn-in epochs, relevance prior
Initialization: network parameters

for e in range(epochs) do
       for each batch in the training dataset do
             Divide the batch into smaller sub-batches b
             for each b in the batch do
                   Update the encoder and decoder using Equation 1 for VAE (Equation 2 for dpVAE)
             end for
             if e ≥ number of burn-in epochs then
                   Get a new estimate of the relevance
                   Update the relevance encoder using Equation 6 for VAE (Equation 7 for dpVAE)
             end if
       end for
end for
Algorithm 1: Training algorithm for RENs
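The alternating schedule of Algorithm 1 can be summarized in a short Python sketch; the update and relevance-estimation routines below are placeholder callbacks, not the paper's implementation:

```python
def train_rens(data_batches, num_epochs, burnin_epochs,
               update_vae, update_relevance, estimate_relevance):
    """Alternating training schedule of Algorithm 1 (sketch).

    update_vae:         one gradient step on encoder/decoder (Eq. 1 or 2)
    estimate_relevance: new relevance estimate from the current batch
    update_relevance:   one gradient step on the relevance encoder (Eq. 6 or 7)
    """
    relevance = None
    for epoch in range(num_epochs):
        for batch in data_batches:
            # inner loop over smaller sub-batches for the VAE update
            for sub_batch in batch:
                update_vae(sub_batch, relevance)
            # the relevance encoder is only updated after the burn-in period
            if epoch >= burnin_epochs:
                relevance = estimate_relevance(batch)
                update_relevance(batch, relevance)
    return relevance

# Toy usage with counting stubs in place of the real network updates
calls = {"vae": 0, "rel": 0}
batches = [[[1, 2], [3, 4]], [[5, 6]]]  # 2 batches split into sub-batches
rel = train_rens(
    batches, num_epochs=3, burnin_epochs=1,
    update_vae=lambda b, r: calls.__setitem__("vae", calls["vae"] + 1),
    update_relevance=lambda b, r: calls.__setitem__("rel", calls["rel"] + 1),
    estimate_relevance=lambda b: "alpha",
)
```

The stubs make the schedule auditable: with 3 sub-batches per epoch over 3 epochs the VAE is updated 9 times, while the relevance encoder is touched only in the 2 post-burn-in epochs (once per batch).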

a.3 Toy Dataset Results

Figure 6 and Figure 7 show the results for additional noise levels. The observations are consistent with Section 5.2: R-dpVAE detects the relevant latent dimensionality while maintaining a low MSE even in the presence of noisy data.

Figure 6: Sample reconstruction and sample generation outputs of R-dpVAE, RF-VAE, MaskAAE for one-moon and circle at 5% and 7% noise level.
Figure 7: Sample reconstruction and sample generation outputs of R-dpVAE, RF-VAE, MaskAAE for one-moon and circle at 1% noise level.