Autoregressive Energy Machines
Neural density estimators are flexible families of parametric models which have seen widespread use in unsupervised machine learning in recent years. Maximum-likelihood training typically dictates that these models be constrained to specify an explicit density. However, this limitation can be overcome by instead using a neural network to specify an energy function, or unnormalized density, which can subsequently be normalized to obtain a valid distribution. The challenge with this approach lies in accurately estimating the normalizing constant of the high-dimensional energy function. We propose the Autoregressive Energy Machine, an energy-based model which simultaneously learns an unnormalized density and computes an importance-sampling estimate of the normalizing constant for each conditional in an autoregressive decomposition. The Autoregressive Energy Machine achieves state-of-the-art performance on a suite of density-estimation tasks.READ FULL TEXT VIEW PDF
Autoregressive models are among the best performing neural density
We propose to train a non-autoregressive machine translation model to
Autoregressive (AR) models have become a popular tool for unsupervised
The fundamental task of general density estimation has been of keen inte...
Media is generally stored digitally and is therefore discrete. Many
Free energy perturbation (FEP) was proposed by Zwanzig more than six dec...
We address the problem of building theoretical models that help elucidat...
Autoregressive Energy Machines
Neural networks are flexible function approximators, and promising candidates to learn a probability density function. Typically, neural density models are normalized a priori, but this can hinder flexibility and expressiveness. For instance, many flow-based density estimators(Dinh et al., 2017; Papamakarios et al., 2017; Huang et al., 2018) rely on invertible transformations with tractable Jacobian which map data to a simple base density, so that the log probability of an input point can be evaluated using a change of variables. Autoregressive density estimators (Uria et al., 2013; Germain et al., 2015) often rely on mixtures of parametric distributions to model each conditional. Such families can make it difficult to model the low-density regions or sharp transitions characterized by multi-modal or discontinuous densities, respectively.
The contributions of this work are shaped by two main observations.
An energy function, or unnormalized density, fully characterizes a probability distribution, and neural networks may be better suited to learning such an energy function rather than an explicit density.
Decomposing the density estimation task in an autoregressive manner makes it possible to train such an energy-based model by maximum likelihood, since it is easier to obtain reliable estimates of normalizing constants in low dimensions.
Based on these observations, we present a scalable and efficient learning algorithm for an autoregressive energy-based model, which we term the Autoregressive Energy Machine (AEM). Figure 3 provides a condensed overview of how an AEM approximates the density of an input point.
. The output is a pair of vectorswhich correspond to the proposal parameters and context vector, respectively, for the conditional distribution. (*fig:enn) The context vector and input are passed through the energy network to compute an unnormalized log probability for the conditional. (*fig:equations) The parameters define a tractable proposal distribution, such as a mixture of location-scale family distributions, which can be used to compute an estimate of the normalizing constant for the conditional by importance sampling.
A probability density function assigns a non-negative scalar value to each vector-valued input , with the property that over its support. Given a dataset of i.i.d. samples drawn from some unknown -dimensional distribution , the density estimation task is to determine a model such that . Neural density estimators are parametric models that make use of neural network components to increase their capacity to fit complex distributions, and autoregressive neural models are among the best performing of these.
The product rule of probability allows us to decompose any joint distribution into a product of conditional distributions:
Autoregressive density estimators model each conditional using parameters which are computed as a function of the preceding variables in a given ordering. In this paper, we use the term ARNN to describe any autoregressive neural network which computes an autoregressive function of a -dimensional input , where the output is denoted . The vector is often interpreted as the parameters of a tractable parametric density for the conditional, such as a mixture of location-scale distributions or the probabilities of a categorical distribution, but it is not restricted to this typical use-case.
Certain architectures, such as those found in recurrent models, perform the autoregressive computation sequentially, but more recent architectures exploit masking or causal convolution in order to output each conditional in a single pass of the network. Both types of architecture have found domain-specific (Sundermeyer et al., 2012; Theis & Bethge, 2015; Parmar et al., 2018), as well as general-purpose (Uria et al., 2013; Germain et al., 2015) use. In particular we highlight MADE (Germain et al., 2015), an architecture that masks weight matrices in fully connected layers to achieve causal structure. It is a building block in many models with autoregressive components (Kingma et al., 2016; Papamakarios et al., 2017; Huang et al., 2018), and we make use of a similar architecture in this work.
In addition to an autoregressive decomposition, we may also write any density as
where is the unnormalized density, is known as the energy function, and is the normalizing constant. Assuming is finite, specifying an energy function is equivalent to specifying a probability distribution, since the normalizing constant is also defined in terms of
. Models described in this way are known as energy-based models. Classic examples include Boltzmann machines(Hinton, 2002; Salakhutdinov & Hinton, 2009; Hinton, 2012), products of experts (Hinton, 2002) and Markov random fields (Osindero & Hinton, 2007; Köster et al., 2009).
In order to do maximum-likelihood estimation of the parameters of an energy-based model, we must be able to evaluate or estimate the normalizing constant
. This is problematic, as it requires the estimation of a potentially high-dimensional integral. As such, a number of methods have been proposed to train energy-based models, which either use cheap approximations of the normalizing constant, or side-step the issue entirely. These include contrastive divergence(Hinton, 2002)2010; Ceylan & Gutmann, 2018), and score matching (Hyvärinen, 2005). In our case, phrasing the density estimation problem in an autoregressive manner allows us to make productive use of importance sampling, a stochastic approximation method for integrals.
Importance sampling Given a proposal distribution which is non-zero whenever the target is non-zero, we can approximate the normalizing constant by
and this expression is an unbiased estimate of the normalizing constant for. The quotients in the summand are known as the importance weights. When does not closely match
, the importance weights will have high variance, and the importance sampling estimate will be dominated by those terms with largest weight. Additionally, whendoes not adequately cover regions of high density under , importance sampling underestimates the normalizing constant (Salakhutdinov & Murray, 2008). Finding a suitable distribution which closely matches is problematic, since estimating the potentially complex distribution is the original problem under consideration. This issue is exacerbated in higher dimensions, and importance sampling estimates may be unreliable in such cases. Figure 2
demonstrates how the accuracy of an importance sampling estimate of the normalizing constant for a standard normal distribution deteriorates as dimensionality increases.
The ability to specify a probability distribution using an energy function is enticing, since now a neural network can take on the more general role of an energy function in a neural density estimator. Further decomposing the task in an autoregressive manner means that importance sampling offers a viable method for maximum likelihood training, generally yielding reliable normalizing constant estimates in the one-dimensional case when the proposal distribution is reasonably well matched to the target (fig. 2). As such, the main contribution of this paper is to combine eq. (1) and eq. (2) in the context of a neural density estimator. We model a density function as a product of energy terms
where is the normalizing constant for the conditional. If we also specify an autoregressive proposal distribution , we can estimate the normalizing constant for each of the terms in the product by importance sampling:
This setup allows us to make use of arbitrarily complex energy functions, while relying on importance sampling to estimate normalizing constants in one dimension; a much more tractable problem than estimation of the full -dimensional integral.
We implement the energy function as a neural network ENN that takes as input a scalar as well as a context vector that summarizes the dependence of on the preceding variables . The ENN directly outputs the negative energy, so that
. In our experiments, the ENN is a fully-connected network with residual connections(He et al., 2016a). To incorporate the context vector , we found concatenation to the input to work well in practice. We share ENN parameters across dimensions, which reduces the total number of parameters, and allows us to learn features of densities which are common across dimensions. We also constrain the output of the ENN to be non-positive using a softplus non-linearity, so that the unnormalized density is bounded by one, since this improved training stability.
We denote by the vector of parameters for the proposal conditional, which, like the context vector , is computed as a function of . In our case, this quantity consists of the mixture coefficients, locations, and scales of a tractable parametric distribution, and we find that a mixture of Gaussians works well across a range of tasks. The normalizing constant for the conditional can thus be approximated by
leading to an expression for the approximate log density of an input data point
Estimation of log densities therefore requires the energy network to be evaluated times for each conditional; times for the importance samples, and once for the data point . In practice, we perform these evaluations in parallel, by passing large batches consisting of input data and importance samples for all conditionals to the energy net along with the relevant context vectors, and found to be sufficient. Although the importance sampling estimates of the normalizing constants are unbiased, by taking the logarithm of in eq. (9) we bias our estimates of . However, as we will show in Section 4.2, our estimates are well-calibrated, and can be made more accurate by increasing the number of importance samples. For the purposes of training, we did not find this bias to be an issue.
As illustrated in fig. 2(a), we obtain both context vectors and proposal parameters in parallel using an autoregressive network ARNN. The ARNN outputs proposal parameters and context vectors for each of the dimensions in a single forward pass. These quantities are used both to estimate as in eq. (9), as well as to evaluate , and we form a maximum-likelihood training objective
where refers collectively to the trainable parameters in both the ARNN and the ENN. We fit the AEM by maximizing eq. (10) across a training set using stochastic gradient ascent, and find that a warm-up period where the proposal is initially optimized without the energy model can improve stability, allowing the proposal to cover the data sufficiently before importance sampling begins.
It is important to note that we do not optimize the proposal distribution parameters with respect to the importance sampling estimate. This means that the proposal is trained independently of the energy-model, and estimates of treat the proposal samples and proposal density evaluations as constant values for the purposes of optimization. In practice, this is implemented by stopping gradients on variables connected to the proposal distribution in the computational graph. We find maximum-likelihood training of to be effective as a means to obtain a useful proposal distribution, but other objectives, such as minimization of the variance of the importance sampling estimate (Kuleshov & Ermon, 2017; Müller et al., 2018), might also be considered, although we do not investigate this avenue in our work.
Although it is not possible to sample analytically from our model, we can obtain approximate samples by first drawing samples from the proposal distribution and then resampling from this collection using importance weights computed by the energy model (eq. (4)). This method is known as sampling importance resampling (Rubin, 1988), and is consistent in the limit of infinite proposal samples, but we found results to be satisfactory using just 100 proposal samples.
For general purpose density estimation of tabular data, we present a modified version of the MADE architecture (Germain et al., 2015) that incorporates residual connections (He et al., 2016a). In the standard MADE architecture, causal structure is maintained by masking certain weights in the network layers. We observe that in consecutive hidden layers with the same number of units and shared masks, the connectivity of the hidden units with respect to the inputs is preserved. As such, incorporating skip connections will maintain the architecture’s autoregressive structure.
Following the documented success of residual blocks (He et al., 2016a, b) as a component in deep network architectures, we implement a basic residual block for MADE-style models that can be used as a drop-in replacement for typical masked layers. Each block consists of two masked-dense layers per residual block, and uses pre-activations following He et al. (2016b). We use the term ResMADE to describe an autoregressive architecture featuring these blocks, and make use of the ResMADE as an ARNN component across our experiments. For a more detailed description see Appendix A.
For our experiments, we use a ResMADE with four residual blocks for the ARNN, as well as a fully-connected residual architecture for the ENN, also with four residual blocks. The number of hidden units in the ResMADE is varied per task. We use the Adam optimizer (Kingma & Ba, 2014), and anneal the learning rate to zero over the course of training using a cosine schedule (Loshchilov & Hutter, 2016). For some tasks, we find regularization by dropout (Srivastava et al., 2014) to be beneficial. Full experimental details are available in Appendix B, and code is available at https://github.com/conormdurkan/autoregressive-energy-machines.
We first demonstrate that an AEM is capable of fitting complex two-dimensional densities. For each task, we generate data points for training. Results are displayed in fig. 3(a). We plot each AEM density by estimating the normalizing constant with 1000 importance samples for each conditional. AEM samples are obtained by resampling 100 proposal samples as described in section 3.3.
Spirals The spirals dataset is adapted from Grathwohl et al. (2019). Though capable of representing the spiral density, the ResMADE proposal fails to achieve the same quality of fit as an AEM, with notable regions of non-uniform density.
Checkerboard The checkerboard dataset is also adapted from Grathwohl et al. (2019). This task illustrates that in some cases a learned proposal distribution is not required; with a fixed, uniform proposal, an AEM is capable of accurately modeling the data, including the discontinuities at the boundary of each square.
Diamond The diamond task is adapted from the 100-mode square Gaussian grid of Huang et al. (2018), by expanding to 225 modes, and rotating 45 degrees. Although the ResMADE proposal does not have the capacity to represent all modes of the target data, an AEM is able to recover these lost modes.
Einstein We generate the Einstein data by sampling co-ordinates proportional to the pixel intensities in an image of Albert Einstein (Müller et al., 2018), before adding uniform noise, and re-scaling the resulting points to . This task in particular highlights the benefits of an energy-based approach. The distribution of light in an image features sharp transitions and edges, high-frequency components, and broad regions of near-constant density. Where a ResMADE proposal struggles with these challenges, an AEM is able to retain much more fine detail. In addition, samples generated by the AEM are difficult to distinguish from the original dataset.
Finally, fig. 3(b) presents an alternative visualization of the Einstein task. Each row corresponds to an unnormalized conditional proportional to for a fixed value of . While the ResMADE proposal, consisting of a mixture of 10 Gaussians for each conditional, achieves a good overall fit to the true pixel intensities, it is ultimately constrained by the smoothness of its mixture components.
Though importance sampling in one dimension is much less unwieldy than in high-dimensional space, it may still be the case that the proposal distributions do not adequately cover the support of the conditionals being modeled, leading to underestimates of the normalizing constants. Here we demonstrate that the normalizing constants learned by an AEM are well-calibrated by comparing to ‘true’ values computed with explicit numerical integration. In particular, we use a log-modified trapezoidal rule (Pitkin, 2017) to integrate the unnormalized log density output by the energy network over each dimension. This approach exploits the fast parallel computation of the energy net, allowing us to saturate the domain of integration, and compensate for the shortcomings of the trapezoidal rule compared to more advanced adaptive quadrature methods (Gander & Gautschi, 2000). We increase the number of integrand evaluations until the integral converges to seven significant figures.
For each trained model on the Einstein, Power, Hepmass, and BSDS300 tasks, we first randomly select 1000 dimensions with replacement according to the data dimensionality, disregarding the marginal so as to not repeatedly compute the same value. Then, we compute the integral of the log unnormalized density corresponding to that one-dimensional conditional, using a log-trapezoidal rule and context vectors generated from a held out-validation set of 1000 samples. This procedure results in 1000 ‘true’ integrals for each task. We then test the AEM by comparing this true value with estimates generated using increasing numbers of importance samples. Note that in each task, the AEM has never estimated the log normalizing constant for the conditional densities under consideration, since these conditionals are specified using the validation set, and not the training set. Figure 5 visualizes the results of our calibration experiments.
Test log likelihood (in nats) for UCI datasets and BSDS300, with error bars corresponding to two standard deviations. AEMresults are estimated with 20,000 importance samples. The best performing model for each dataset is shown in bold, as well the best performing model for which the exact log likelihood can be obtained. Results for non-AEM models taken from existing literature. MAF-DDSF report error bars across five repeated runs rather than across the test set.
We follow the experimental setup of Papamakarios et al. (2017) in using a selection of pre-processed datasets from the UCI machine learning repository (Dheeru & Karra Taniskidou, 2017), and BSDS300 datasets of natural images (Martin et al., 2001). AEM log likelihoods are estimated using 20000 importance samples from the proposal distribution.
As a normalized approximation to the AEM, we use a kernel density estimate in which Gaussian kernels are centered at proposal samples and weighted by the importance weights. We include the proposal itself as a mixture component in order to provide probability density in regions not well-covered by the samples. The KDE bandwidth and proposal distribution mixture weighting are optimized on the validation set. We call the model under this evaluation scheme AEM-KDE. KDE estimates also use 20000 samples from the proposal distribution for each conditional.
Table 1 shows the test-set log likelihoods obtained by our models and by other state-of-the-art models. We first note that the AEM proposal distribution (ResMADE-MoG) provides a strong baseline relative to previous work. In particular, it improves substantially on the the MADE-MoG results reported by Papamakarios et al. (2017), and improves on the state-of-the-art results reported by NAF (Huang et al., 2018) and TAN (Oliva et al., 2018). This demonstrates the benefits of adding residual connections to the MADE architecture. As such, practitioners may find ResMADE a useful component in many applications which require autoregressive computation.
AEM and AEM-KDE outperform both the proposal distribution and existing state-of-the-art methods on the Power, Gas and Hepmass datasets, demonstrating the potential benefit of flexible energy-based conditionals. Despite regularization, overfitting was an issue for Miniboone due to the size of the training set () relative to the data dimension (). This highlights the challenges associated with using very expressive models on domains with limited data, and the need for stronger regularization methods in these cases. On BSDS300, our models achieve the second highest scores relative to previous work. On this dataset we found that the validation scores () were substantially higher than test-set scores (), indicating differences between their empirical distributions. Overall, the AEM-KDE obtains improved scores relative to the proposal distribution, and these scores are close to those of the AEM.
We evaluate the AEM in the context of deep latent-variable models, where it can be used as an expressive prior. We train a convolutional variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014), making use of residual blocks in the encoder and decoder in a similar architecture to previous work (Huang et al., 2018; Kingma et al., 2016). We initially train the encoder and decoder with a standard Gaussian prior, and then train an AEM post-hoc to maximize the likelihood of samples from the aggregate approximate posterior
This training method avoids an issue where maximum-likelihood training of the proposal distribution interferes with the lower bound objective. During pre-training, we set in the modified variational objective
This weighting of the KL-divergence term has the effect of boosting the reconstruction log probability at the expense of an aggregate posterior that is less well-matched to the standard Gaussian prior. This provides an opportunity for the AEM to improve on the original prior as a model of the aggregate posterior. Weighting of the KL-divergence has been used in previous work to reduce the occurrence of unused latent variables (Sønderby et al., 2016), and to control the types of representations encoded by the latent variables (Higgins et al., 2017).
Table 2 shows that the AEM-VAE improves substantially on the standard Gaussian prior, and that the results are competitive with existing approaches on dynamically-binarized MNIST (Burda et al., 2016). The AEM does not improve on the proposal distribution scores, possibly because the aggregate latent posterior is a mixture of Gaussians, which is well-modeled by the ResMADE-MoG. Figure 6 shows data examples and samples from the trained model.
Flow-based neural density estimation Together with autoregressive approaches, flow-based models have also seen widespread use in neural density estimation. Flow-based models consist of invertible transformations for which Jacobian determinants can be efficiently computed, allowing exact density evaluation through the change-of-variables formula. Multiple transformations are often stacked to enable more complex models.
Efficiently invertible flows Flows exploiting a particular type of transformation, known as a coupling layer, not only allow for one-pass density evaluation, but also one-pass sampling. Examples of such models include NICE (Dinh et al., 2014) and RealNVP (Dinh et al., 2017), and the approach has recently been extended to image (Kingma & Dhariwal, 2018) and audio (Prenger et al., 2018; Kim et al., 2018)
data. The case of continuous flows based on ordinary differential equations has also recently been explored by FFJORD(Grathwohl et al., 2019)
. However, efficient sampling for this class of models comes at the cost of density estimation performance, with autoregressive models generally achieving better log likelihood scores.
Autoregressive flows Originally proposed by Kingma et al. (2016) for variational inference, autoregressive flows were adapted for efficient density estimation by Papamakarios et al. (2017) with MAF. Subsequent models such as NAF (Huang et al., 2018) and TAN (Oliva et al., 2018) have developed on this idea, reporting state-of-the-art results for density estimation. Sampling in these models is expensive, since autoregressive flow inversion is inherently sequential. In some cases, such as Huang et al. (2018), the flows do not have an analytic inverse, and must be inverted numerically for sampling. Despite these caveats, autoregressive density estimators remain the best performing neural density estimators for general density estimation tasks.
Energy-based models In this work we describe energy-based models as unnormalized densities that define a probability distribution over random variables. However, there exist multiple notions of energy-based learning in the machine learning literature, including non-probabilistic interpretations (LeCun et al., 2006; Zhao et al., 2017). We focus here on recent work which includes applications to density estimation with neural energy functions. Deep energy estimator networks (Saremi et al., 2018) use an energy function implemented as a neural network, and train using the score-matching framework. This objective avoids the need to estimate the normalizing constant, but also makes it challenging to compare log-likelihood scores with other density estimators. Bauer & Mnih (2018) propose an energy-based approach for increasing the flexibility of VAE priors, in which a neural network energy function is used to mask a pre-specified proposal function. As in our work, training is performed using importance sampling, but due to the larger dimensionality of the problem, samples were used in the estimates during training.
Other related work Müller et al. (2018) propose neural importance sampling, in which a flow-based neural sampler is optimized in order to perform low-variance Monte Carlo integration of a given target function. This is similar to the goal of the proposal distribution in the AEM, but in our case the proposal is trained jointly with an energy model, and we do not assume that the target function is known a priori.
We proposed the Autoregressive Energy Machine, a neural density estimator that addresses the challenges of energy-based modeling in high dimensions through a scalable and efficient autoregressive estimate of the normalizing constant. While exact density evaluation is intractable for an AEM, we have demonstrated that the flexibility of an energy-based model enables us to model challenging synthetic data, as well as achieve state-of-the-art results on a suite of benchmark datasets.
The authors thank George Papamakarios, Iain Murray, and Chris Williams for helpful discussion. This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
A practical guide to training restricted Boltzmann machines.In Neural networks: Tricks of the trade, pp. 599–619. Springer, 2012.
Stochastic backpropagation and approximate inference in deep generative models.International Conference on Machine Learning, 2014.
On the quantitative analysis of deep belief networks.International Conference on Machine Learning, 2008.
Pixel Recurrent Neural Networks.International Conference on Machine Learning, 2016b.
Residual connections (He et al., 2016a) are widely used in deep neural networks, and have demonstrated favourable performance relative to standard networks. Residual networks typically consist of many stacked transformations of the form
where is a residual block. In this work, we observe that it is possible to equip a MADE (Germain et al., 2015) with residual connections when certain conditions on its masking structure are met.
At initialization, each unit in each hidden layer of a MADE is assigned a positive integer, termed its degree. The degree specifies the number of input dimensions to which that particular unit is connected. For example, writing for unit of layer , a value means that the unit depends only on the first dimensions of the input, when the input is prescribed a particular ordering (we always assume the ordering given by the data). In successive layers, this means that a unit may only be connected to units in the previous layer whose degrees strictly do not exceed its own. In other words, the degree assignment defines a binary mask matrix which multiplies the weight matrix of each layer in a MADE elementwise, maintaining autoregressive structure. Indeed, the mask for layer is given in terms of the degrees for layer and layer :
Through this masking process, it can be guaranteed that output units associated with data dimension only depend on the previous inputs . Though the original MADE paper considered a number of ways in which to assign degrees, we focus here on fixed sequential degree assignment, used also by MAF (Papamakarios et al., 2017). For an input of dimension , we define the degree
We also assume the dimension of each hidden layer is at least , so that no input information is lost. The result of sequential degree assignment for and is illustrated in fig. 7.
If all hidden layers in the MADE are of the same dimensionality , sequential degree assignment means that each layer will also have the same degree structure. In this way, two vectors of hidden units in the MADE may be combined using any binary element-wise operation, also shown in Figure fig. 7. In particular, a vector computed from a fully-connected block with masked layers can be added to the input to that block while maintaining the same autoregressive structure, allowing for the traditional residual connection to be added to the MADE architecture.
In all experiments, we use the same ResMADE and ENN architectures, each with 4 pre-activation residual blocks (He et al., 2016b). The number of hidden units and the dimensionality of the context vector for the ENN are fixed across all tasks at 128 and 64 respectively. The number of hidden units in the ResMADE is tuned per experiment. For the exact experimental settings used in each experiment, see Tables 3 and 4.
For the proposal distributions, we use a mixture of Gaussians in all cases except for the checkerboard experiment, where we use a fixed uniform distribution. We use 10 mixture components for the synthetic experiments, and 20 components for all other experiments. We use a minimum scale of
for the Gaussian distributions in order to prevent numerical issues.
For optimization we use Adam (Kingma & Ba, 2014) with a cosine annealing schedule (Loshchilov & Hutter, 2016) and an initial learning rate of . The number of training steps is adjusted per task. For Miniboone, we use only 6000 training updates, as we found overfitting to be a significant problem for this dataset. Early-stopping is used to select models, although in most cases the best models are obtained at the end of training. Normalizing constants are estimated using 20 importance samples, and dropout is applied to counter overfitting, with a rate that is tuned per task. Dropout is applied between the two layers of each residual block in both the ENN and ResMADE.
. For the encoder, we alternate between residual blocks that preserve spatial resolution, and downsampling residual blocks with strided convolutions. The input is projected to 16 channels using aconvolution, and the number of channels is doubled at each downsampling layer. After three downsampling layers we flatten the feature maps and use a linear layer to regress the means and log-variances of the approximate posterior with 32 latent units. The decoder mirrors the encoder, with the input latents being linearly projected and reshaped to a
spatial block, which is then upsampled using transpose-convolutions in order to output pixel-wise Bernoulli logits.
|ResMADE hidden dim.||256||256||256||256||512|
|ENN hidden dim.||128||128||128||128||128|
|Mixture comp. scale min.||1E-3||-||1E-3||1E-3||1E-3|
|ResMADE hidden dim.||512||512||512||512||1024|
|ENN hidden dim.||128||128||128||128||128|
|Mixture comp. scale min.||1E-3||1E-3||1E-3||1E-3||1E-3|