Autoregressive Energy Machines
Neural density estimators are flexible families of parametric models which have seen widespread use in unsupervised machine learning in recent years. Maximum-likelihood training typically dictates that these models be constrained to specify an explicit density. However, this limitation can be overcome by instead using a neural network to specify an energy function, or unnormalized density, which can subsequently be normalized to obtain a valid distribution. The challenge with this approach lies in accurately estimating the normalizing constant of the high-dimensional energy function. We propose the Autoregressive Energy Machine, an energy-based model which simultaneously learns an unnormalized density and computes an importance-sampling estimate of the normalizing constant for each conditional in an autoregressive decomposition. The Autoregressive Energy Machine achieves state-of-the-art performance on a suite of density-estimation tasks.
Modeling the joint distribution of high-dimensional random variables is a key task in unsupervised machine learning. In contrast to other unsupervised approaches such as variational autoencoders (Kingma & Welling, 2013; Rezende et al., 2014) or generative adversarial networks (Goodfellow et al., 2014), neural density estimators allow for exact density evaluation, and have enjoyed success in modeling natural images (van den Oord et al., 2016b; Dinh et al., 2017; Salimans et al., 2017; Kingma & Dhariwal, 2018), audio data (van den Oord et al., 2016a; Prenger et al., 2018; Kim et al., 2018), and also in variational inference (Rezende & Mohamed, 2015; Kingma et al., 2016). Neural density estimators are particularly useful where the focus is on accurate density estimation rather than sampling, and these models have seen use as surrogate likelihoods (Papamakarios et al., 2019) and approximate posterior distributions (Papamakarios & Murray, 2016; Lueckmann et al., 2017) for likelihood-free inference.

Neural networks are flexible function approximators, and promising candidates to learn a probability density function. Typically, neural density models are normalized a priori, but this can hinder flexibility and expressiveness. For instance, many flow-based density estimators (Dinh et al., 2017; Papamakarios et al., 2017; Huang et al., 2018) rely on invertible transformations with tractable Jacobian which map data to a simple base density, so that the log probability of an input point can be evaluated using a change of variables. Autoregressive density estimators (Uria et al., 2013; Germain et al., 2015) often rely on mixtures of parametric distributions to model each conditional. Such families can make it difficult to model the low-density regions or sharp transitions characterized by multimodal or discontinuous densities, respectively.

The contributions of this work are shaped by two main observations.
An energy function, or unnormalized density, fully characterizes a probability distribution, and neural networks may be better suited to learning such an energy function rather than an explicit density.
Decomposing the density estimation task in an autoregressive manner makes it possible to train such an energybased model by maximum likelihood, since it is easier to obtain reliable estimates of normalizing constants in low dimensions.
Based on these observations, we present a scalable and efficient learning algorithm for an autoregressive energybased model, which we term the Autoregressive Energy Machine (AEM). Figure 3 provides a condensed overview of how an AEM approximates the density of an input point.



The output is a pair of vectors for each dimension, which correspond to the proposal parameters and context vector, respectively, for that conditional distribution. The context vector and input are passed through the energy network to compute an unnormalized log probability for the conditional. The proposal parameters define a tractable proposal distribution, such as a mixture of location-scale family distributions, which can be used to compute an estimate of the normalizing constant for the conditional by importance sampling.

A probability density function p assigns a non-negative scalar value p(x) to each vector-valued input x, with the property that \int p(x) \, dx = 1 over its support. Given a dataset \{x^{(n)}\}_{n=1}^{N} of i.i.d. samples drawn from some unknown D-dimensional distribution p^*(x), the density estimation task is to determine a model p(x; \theta) such that p(x; \theta) \approx p^*(x). Neural density estimators are parametric models that make use of neural network components to increase their capacity to fit complex distributions, and autoregressive neural models are among the best performing of these.
The product rule of probability allows us to decompose any joint distribution into a product of conditional distributions:
(1) p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d}).
Autoregressive density estimators model each conditional p(x_d \mid x_{<d}) using parameters which are computed as a function of the preceding variables in a given ordering. In this paper, we use the term ARNN to describe any autoregressive neural network which computes an autoregressive function of a D-dimensional input x, where the output for dimension d depends only on x_{<d}. This output vector is often interpreted as the parameters of a tractable parametric density for the conditional, such as a mixture of location-scale distributions or the probabilities of a categorical distribution, but it is not restricted to this typical use-case.
Certain architectures, such as those found in recurrent models, perform the autoregressive computation sequentially, but more recent architectures exploit masking or causal convolution in order to output each conditional in a single pass of the network. Both types of architecture have found domainspecific (Sundermeyer et al., 2012; Theis & Bethge, 2015; Parmar et al., 2018), as well as generalpurpose (Uria et al., 2013; Germain et al., 2015) use. In particular we highlight MADE (Germain et al., 2015), an architecture that masks weight matrices in fully connected layers to achieve causal structure. It is a building block in many models with autoregressive components (Kingma et al., 2016; Papamakarios et al., 2017; Huang et al., 2018), and we make use of a similar architecture in this work.
In addition to an autoregressive decomposition, we may also write any density as
(2) p(x) = \frac{\tilde{p}(x)}{Z} = \frac{\exp\{-E(x)\}}{Z}, \qquad Z = \int \exp\{-E(x)\} \, dx,
where \tilde{p}(x) = \exp\{-E(x)\} is the unnormalized density, E(x) is known as the energy function, and Z is the normalizing constant. Assuming Z is finite, specifying an energy function is equivalent to specifying a probability distribution, since the normalizing constant is also defined in terms of E. Models described in this way are known as energy-based models. Classic examples include Boltzmann machines
(Hinton, 2002; Salakhutdinov & Hinton, 2009; Hinton, 2012), products of experts (Hinton, 2002), and Markov random fields (Osindero & Hinton, 2007; Köster et al., 2009).

In order to do maximum-likelihood estimation of the parameters of an energy-based model, we must be able to evaluate or estimate the normalizing constant Z. This is problematic, as it requires the estimation of a potentially high-dimensional integral. As such, a number of methods have been proposed to train energy-based models, which either use cheap approximations of the normalizing constant, or sidestep the issue entirely. These include contrastive divergence (Hinton, 2002), noise-contrastive estimation (Gutmann & Hyvärinen, 2010; Ceylan & Gutmann, 2018), and score matching (Hyvärinen, 2005). In our case, phrasing the density estimation problem in an autoregressive manner allows us to make productive use of importance sampling, a stochastic approximation method for integrals.

Importance sampling. Given a proposal distribution q(x) which is non-zero whenever the target \tilde{p}(x) is non-zero, we can approximate the normalizing constant by
(3) Z = \int \tilde{p}(x) \, dx = \mathbb{E}_{q(x)}\left[\frac{\tilde{p}(x)}{q(x)}\right]
(4) \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\tilde{p}(x_s)}{q(x_s)}, \qquad x_s \sim q(x),
and this expression is an unbiased estimate of the normalizing constant Z for any number of samples S. The quotients \tilde{p}(x_s) / q(x_s) in the summand are known as the importance weights. When q does not closely match \tilde{p}, the importance weights will have high variance, and the importance sampling estimate will be dominated by those terms with largest weight. Additionally, when q does not adequately cover regions of high density under \tilde{p}, importance sampling underestimates the normalizing constant (Salakhutdinov & Murray, 2008). Finding a suitable distribution q which closely matches \tilde{p} is problematic, since estimating the potentially complex distribution is the original problem under consideration. This issue is exacerbated in higher dimensions, and importance sampling estimates may be unreliable in such cases. Figure 2 demonstrates how the accuracy of an importance sampling estimate of the normalizing constant for a standard normal distribution deteriorates as dimensionality increases.
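This dimensionality effect is easy to reproduce. Below is a minimal numpy sketch (our own illustration, not the paper's code) that importance-samples the normalizing constant of an unnormalized standard Gaussian using a wider Gaussian proposal: the log-scale error is negligible in one dimension but severe in 64 dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_logZ_error(dim, n_samples=10_000):
    """Importance-sampling estimate of the normalizing constant of the
    unnormalized target exp(-||x||^2 / 2) in `dim` dimensions, using a
    wider Gaussian proposal q = N(0, 2^2 I). The true value is
    Z = (2*pi)^(dim/2); we return the log-scale error of the estimate."""
    x = rng.normal(scale=2.0, size=(n_samples, dim))
    log_p_tilde = -0.5 * np.sum(x**2, axis=1)
    log_q = (-0.5 * np.sum((x / 2.0) ** 2, axis=1)
             - dim * np.log(2.0 * np.sqrt(2.0 * np.pi)))
    # log of (1/S) * sum_s p_tilde(x_s) / q(x_s), computed stably in log space
    log_w = log_p_tilde - log_q
    log_Z_hat = np.logaddexp.reduce(log_w) - np.log(n_samples)
    log_Z_true = 0.5 * dim * np.log(2.0 * np.pi)
    return log_Z_hat - log_Z_true  # 0 would be a perfect estimate

err_1d = abs(is_logZ_error(1))
err_64d = abs(is_logZ_error(64))
```

In one dimension the estimate is accurate to a few hundredths of a nat; in 64 dimensions the importance weights are so heavy-tailed that the estimate typically undershoots the true log normalizer by tens of nats.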
The ability to specify a probability distribution using an energy function is enticing, since now a neural network can take on the more general role of an energy function in a neural density estimator. Further decomposing the task in an autoregressive manner means that importance sampling offers a viable method for maximum-likelihood training, generally yielding reliable normalizing constant estimates in the one-dimensional case when the proposal distribution is reasonably well matched to the target (fig. 2). As such, the main contribution of this paper is to combine eq. (1) and eq. (2) in the context of a neural density estimator. We model a density function as a product of energy terms
(5) p(x) = \prod_{d=1}^{D} \frac{\exp\{-E_d(x_d; x_{<d})\}}{Z_d(x_{<d})},
where Z_d(x_{<d}) = \int \exp\{-E_d(x_d; x_{<d})\} \, dx_d is the normalizing constant for the d-th conditional. If we also specify an autoregressive proposal distribution q(x) = \prod_{d=1}^{D} q(x_d \mid x_{<d}), we can estimate the normalizing constant for each of the terms in the product by importance sampling:
(6) Z_d(x_{<d}) = \mathbb{E}_{q(x_d \mid x_{<d})}\left[\frac{\exp\{-E_d(x_d; x_{<d})\}}{q(x_d \mid x_{<d})}\right]
(7) \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\exp\{-E_d(x_s; x_{<d})\}}{q(x_s \mid x_{<d})}, \qquad x_s \sim q(x_d \mid x_{<d}).
This setup allows us to make use of arbitrarily complex energy functions, while relying on importance sampling to estimate normalizing constants in one dimension, a much more tractable problem than estimation of the full D-dimensional integral.
We implement the energy function as a neural network ENN that takes as input a scalar x_d as well as a context vector c_d that summarizes the dependence of x_d on the preceding variables x_{<d}. The ENN directly outputs the negative energy, so that ENN(x_d; c_d) = -E_d(x_d; x_{<d}). In our experiments, the ENN is a fully-connected network with residual connections (He et al., 2016a). To incorporate the context vector c_d, we found concatenation to the input to work well in practice. We share ENN parameters across dimensions, which reduces the total number of parameters, and allows us to learn features of densities which are common across dimensions. We also constrain the output of the ENN to be non-positive using a softplus non-linearity, so that the unnormalized density is bounded by one, since this improved training stability.

We denote by \phi_d the vector of parameters for the d-th proposal conditional, which, like the context vector c_d, is computed as a function of x_{<d}. In our case, this quantity consists of the mixture coefficients, locations, and scales of a tractable parametric distribution, and we find that a mixture of Gaussians works well across a range of tasks. The normalizing constant for the conditional can thus be approximated by
(8) \hat{Z}_d(x_{<d}) = \frac{1}{S} \sum_{s=1}^{S} \frac{\exp\{-E_d(x_s; x_{<d})\}}{q(x_s \mid x_{<d}; \phi_d)}, \qquad x_s \sim q(x_d \mid x_{<d}; \phi_d),
leading to an expression for the approximate log density of an input data point
(9) \log \hat{p}(x) = \sum_{d=1}^{D} \left[ -E_d(x_d; x_{<d}) - \log \hat{Z}_d(x_{<d}) \right].
Estimation of log densities therefore requires the energy network to be evaluated S + 1 times for each conditional: S times for the importance samples, and once for the data point itself. In practice, we perform these evaluations in parallel, by passing large batches consisting of input data and importance samples for all conditionals to the energy net along with the relevant context vectors, and found S = 20 to be sufficient during training. Although the importance sampling estimates of the normalizing constants are unbiased, by taking the logarithm of \hat{Z}_d(x_{<d}) in eq. (9) we bias our estimates of \log p(x). However, as we will show in Section 4.2, our estimates are well-calibrated, and can be made more accurate by increasing the number of importance samples. For the purposes of training, we did not find this bias to be an issue.
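The bias from taking the logarithm of an unbiased estimate can be sketched numerically. The toy example below (our own, using a one-dimensional normalized target with Z = 1, not an AEM conditional) shows that the average of log Ẑ lies below log Z = 0, as Jensen's inequality dictates, and that the bias shrinks as the number of importance samples grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_log_Z_hat(n_importance, n_repeats=2000):
    """Average of log Z_hat, where Z_hat is an unbiased importance-sampling
    estimate of Z = 1 for a 1-d standard-normal target with a N(0, 1.5^2)
    proposal. Jensen's inequality gives E[log Z_hat] <= log Z = 0, with a
    bias that shrinks roughly as 1 / n_importance."""
    x = rng.normal(scale=1.5, size=(n_repeats, n_importance))
    log_p = -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)        # normalized target
    log_q = -0.5 * (x / 1.5) ** 2 - np.log(1.5 * np.sqrt(2.0 * np.pi))
    log_w = log_p - log_q
    log_Z_hat = np.logaddexp.reduce(log_w, axis=1) - np.log(n_importance)
    return log_Z_hat.mean()

bias_small = mean_log_Z_hat(2)     # few samples: noticeable downward bias
bias_large = mean_log_Z_hat(200)   # many samples: bias close to zero
```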
As illustrated in fig. 2(a), we obtain both context vectors and proposal parameters in parallel using an autoregressive network ARNN. The ARNN outputs proposal parameters and context vectors for each of the D dimensions in a single forward pass. These quantities are used both to estimate \log \hat{p}(x) as in eq. (9), and to evaluate \log q(x), and we form a maximum-likelihood training objective
(10) \mathcal{L}(\theta) = \sum_{n=1}^{N} \left[ \log \hat{p}(x^{(n)}) + \log q(x^{(n)}) \right],
where \theta refers collectively to the trainable parameters in both the ARNN and the ENN. We fit the AEM by maximizing eq. (10) across a training set using stochastic gradient ascent, and find that a warm-up period, in which the proposal is initially optimized without the energy model, can improve stability by allowing the proposal to cover the data sufficiently before importance sampling begins.
It is important to note that we do not optimize the proposal distribution parameters with respect to the importance sampling estimate. This means that the proposal is trained independently of the energy model, and estimates of \hat{Z}_d treat the proposal samples and proposal density evaluations as constant values for the purposes of optimization. In practice, this is implemented by stopping gradients on variables connected to the proposal distribution in the computational graph. We find maximum-likelihood training of q to be effective as a means to obtain a useful proposal distribution. Other objectives, such as minimization of the variance of the importance sampling estimate (Kuleshov & Ermon, 2017; Müller et al., 2018), might also be considered, although we do not investigate this avenue in our work.


Although it is not possible to sample analytically from our model, we can obtain approximate samples by first drawing samples from the proposal distribution and then resampling from this collection using importance weights computed by the energy model (eq. (4)). This method is known as sampling importance resampling (Rubin, 1988), and is consistent in the limit of infinite proposal samples, but we found results to be satisfactory using just 100 proposal samples.
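Sampling importance resampling fits in a few lines. The helper below is our own sketch (the function name and the toy target are assumptions, not the paper's code): draw proposal samples, weight each by exp(-E(x)) / q(x), and resample according to the normalized weights.

```python
import numpy as np

rng = np.random.default_rng(2)

def sir_sample(energy, proposal_sample, proposal_logpdf,
               n_proposal=100, n_out=1):
    """Sampling importance resampling: draw proposal samples, weight them by
    exp(-E(x)) / q(x), and resample from the collection in proportion to the
    normalized importance weights."""
    xs = proposal_sample(n_proposal)
    log_w = -energy(xs) - proposal_logpdf(xs)
    w = np.exp(log_w - np.logaddexp.reduce(log_w))   # normalized weights
    idx = rng.choice(n_proposal, size=n_out, p=w)
    return xs[idx]

# toy target: standard normal energy E(x) = x^2 / 2, uniform proposal on [-5, 5]
samples = sir_sample(
    energy=lambda x: 0.5 * x**2,
    proposal_sample=lambda n: rng.uniform(-5.0, 5.0, size=n),
    proposal_logpdf=lambda x: np.full_like(x, -np.log(10.0)),
    n_proposal=5000,
    n_out=2000,
)
```

With a reasonable proposal the resampled points follow the target closely; here their mean and standard deviation approach 0 and 1.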
For general-purpose density estimation of tabular data, we present a modified version of the MADE architecture (Germain et al., 2015) that incorporates residual connections (He et al., 2016a). In the standard MADE architecture, causal structure is maintained by masking certain weights in the network layers. We observe that in consecutive hidden layers with the same number of units and shared masks, the connectivity of the hidden units with respect to the inputs is preserved. As such, incorporating skip connections maintains the architecture's autoregressive structure.
Following the documented success of residual blocks (He et al., 2016a, b) as a component in deep network architectures, we implement a basic residual block for MADE-style models that can be used as a drop-in replacement for typical masked layers. Each residual block consists of two masked dense layers, and uses pre-activations following He et al. (2016b). We use the term ResMADE to describe an autoregressive architecture featuring these blocks, and we use the ResMADE as the ARNN component across our experiments. For a more detailed description, see Appendix A.
For our experiments, we use a ResMADE with four residual blocks for the ARNN, as well as a fully-connected residual architecture for the ENN, also with four residual blocks. The number of hidden units in the ResMADE is varied per task. We use the Adam optimizer (Kingma & Ba, 2014), and anneal the learning rate to zero over the course of training using a cosine schedule (Loshchilov & Hutter, 2016). For some tasks, we find regularization by dropout (Srivastava et al., 2014) to be beneficial. Full experimental details are available in Appendix B, and code is available at https://github.com/conormdurkan/autoregressiveenergymachines.
We first demonstrate that an AEM is capable of fitting complex two-dimensional densities. For each task, we generate a dataset of points for training. Results are displayed in fig. 3(a). We plot each AEM density by estimating the normalizing constant with 1000 importance samples for each conditional. AEM samples are obtained by resampling 100 proposal samples as described in section 3.3.
Spirals. The spirals dataset is adapted from Grathwohl et al. (2019). Though capable of representing the spiral density, the ResMADE proposal fails to achieve the same quality of fit as an AEM, with notable regions of non-uniform density.
Checkerboard. The checkerboard dataset is also adapted from Grathwohl et al. (2019). This task illustrates that in some cases a learned proposal distribution is not required; with a fixed, uniform proposal, an AEM is capable of accurately modeling the data, including the discontinuities at the boundary of each square.
Diamond. The diamond task is adapted from the 100-mode square Gaussian grid of Huang et al. (2018), by expanding to 225 modes, and rotating 45 degrees. Although the ResMADE proposal does not have the capacity to represent all modes of the target data, an AEM is able to recover these lost modes.
Einstein. We generate the Einstein data by sampling coordinates proportional to the pixel intensities in an image of Albert Einstein (Müller et al., 2018), before adding uniform noise and rescaling the resulting points. This task in particular highlights the benefits of an energy-based approach. The distribution of light in an image features sharp transitions and edges, high-frequency components, and broad regions of near-constant density. Where a ResMADE proposal struggles with these challenges, an AEM is able to retain much more fine detail. In addition, samples generated by the AEM are difficult to distinguish from the original dataset.
Finally, fig. 3(b) presents an alternative visualization of the Einstein task. Each row corresponds to an unnormalized conditional for a fixed value of the first coordinate. While the ResMADE proposal, consisting of a mixture of 10 Gaussians for each conditional, achieves a good overall fit to the true pixel intensities, it is ultimately constrained by the smoothness of its mixture components.
Though importance sampling in one dimension is much less unwieldy than in high-dimensional space, it may still be the case that the proposal distributions do not adequately cover the support of the conditionals being modeled, leading to underestimates of the normalizing constants. Here we demonstrate that the normalizing constants learned by an AEM are well-calibrated by comparing to 'true' values computed with explicit numerical integration. In particular, we use a log-modified trapezoidal rule (Pitkin, 2017) to integrate the unnormalized log density output by the energy network over each dimension. This approach exploits the fast parallel computation of the energy net, allowing us to saturate the domain of integration, and compensate for the shortcomings of the trapezoidal rule compared to more advanced adaptive quadrature methods (Gander & Gautschi, 2000). We increase the number of integrand evaluations until the integral converges to seven significant figures.
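A log-space trapezoidal rule of this kind can be implemented as a logsumexp over trapezoid terms, which avoids underflow when the unnormalized log density is very negative over most of the domain. The sketch below is our own minimal version, not Pitkin's implementation; as a check, it integrates a standard-normal energy, whose log normalizer is 0.5 log(2π).

```python
import numpy as np

def log_trapezoid(log_f, x):
    """Return log of the trapezoidal-rule integral of exp(log_f) over the
    grid x, computed entirely in log space. Each trapezoid term
    0.5 * dx_i * (f_i + f_{i+1}) becomes a logsumexp term, so very negative
    log densities never underflow to zero."""
    dx = np.diff(x)
    log_terms = np.logaddexp(log_f[:-1], log_f[1:]) + np.log(0.5 * dx)
    return np.logaddexp.reduce(log_terms)

# sanity check on the energy E(x) = x^2 / 2: log Z should be 0.5 * log(2*pi)
x_grid = np.linspace(-10.0, 10.0, 20_001)
log_Z = log_trapezoid(-0.5 * x_grid**2, x_grid)
```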
For each trained model on the Einstein, Power, Hepmass, and BSDS300 tasks, we first randomly select 1000 dimensions with replacement according to the data dimensionality, disregarding the first marginal so as not to repeatedly compute the same value. Then, we compute the integral of the unnormalized log density corresponding to that one-dimensional conditional, using a log-trapezoidal rule and context vectors generated from a held-out validation set of 1000 samples. This procedure results in 1000 'true' integrals for each task. We then test the AEM by comparing each true value with estimates generated using increasing numbers of importance samples. Note that in each task, the AEM has never estimated the log normalizing constant for the conditional densities under consideration, since these conditionals are specified using the validation set, and not the training set. Figure 5 visualizes the results of our calibration experiments.
Model  POWER  GAS  HEPMASS  MINIBOONE  BSDS300

MADE-MoG
MAF
MAF-DDSF
TAN (various)
ResMADE-MoG (proposal)
AEM-KDE
AEM
Test log likelihood (in nats) for UCI datasets and BSDS300, with error bars corresponding to two standard deviations. AEM results are estimated with 20,000 importance samples. The best performing model for each dataset is shown in bold, as well as the best performing model for which the exact log likelihood can be obtained. Results for non-AEM models are taken from the existing literature. MAF-DDSF reports error bars across five repeated runs rather than across the test set.

We follow the experimental setup of Papamakarios et al. (2017) in using a selection of preprocessed datasets from the UCI machine learning repository (Dheeru & Karra Taniskidou, 2017), as well as the BSDS300 dataset of natural images (Martin et al., 2001). AEM log likelihoods are estimated using 20,000 importance samples from the proposal distribution.
As a normalized approximation to the AEM, we use a kernel density estimate in which Gaussian kernels are centered at proposal samples and weighted by the importance weights. We include the proposal itself as a mixture component in order to provide probability density in regions not well-covered by the samples. The KDE bandwidth and proposal distribution mixture weighting are optimized on the validation set. We call the model under this evaluation scheme AEM-KDE. KDE estimates also use 20,000 samples from the proposal distribution for each conditional.
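A minimal sketch of such an importance-weighted KDE for a single conditional is given below. The function and the toy check are our own illustration (the bandwidth, mixture weighting, and test target are assumptions; in the paper these hyperparameters are tuned on the validation set).

```python
import numpy as np

rng = np.random.default_rng(3)

def weighted_kde_logpdf(y, samples, log_w, bandwidth, proposal_logpdf,
                        proposal_weight=0.1):
    """Normalized KDE approximation to an energy-based conditional: Gaussian
    kernels at proposal samples, weighted by self-normalized importance
    weights, mixed with the proposal itself so that regions not covered by
    the samples still receive probability mass."""
    w = np.exp(log_w - np.logaddexp.reduce(log_w))       # normalize weights
    kernel = (np.exp(-0.5 * ((y - samples) / bandwidth) ** 2)
              / (bandwidth * np.sqrt(2.0 * np.pi)))
    kde = np.sum(w * kernel)
    return np.log((1.0 - proposal_weight) * kde
                  + proposal_weight * np.exp(proposal_logpdf(y)))

# toy check: standard-normal target (energy x^2 / 2), N(0, 1.5^2) proposal
samples = rng.normal(scale=1.5, size=20_000)
log_q = lambda t: -0.5 * (t / 1.5) ** 2 - np.log(1.5 * np.sqrt(2.0 * np.pi))
log_w = -0.5 * samples**2 - log_q(samples)
log_p0 = weighted_kde_logpdf(0.0, samples, log_w, bandwidth=0.1,
                             proposal_logpdf=log_q)
```

At y = 0 the estimate should sit close to the true standard-normal log density, up to kernel smoothing and the small proposal mixture component.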
Table 1 shows the test-set log likelihoods obtained by our models and by other state-of-the-art models. We first note that the AEM proposal distribution (ResMADE-MoG) provides a strong baseline relative to previous work. In particular, it improves substantially on the MADE-MoG results reported by Papamakarios et al. (2017), and improves on the state-of-the-art results reported by NAF (Huang et al., 2018) and TAN (Oliva et al., 2018). This demonstrates the benefits of adding residual connections to the MADE architecture. As such, practitioners may find ResMADE a useful component in many applications which require autoregressive computation.
Model  ELBO

IAF-DSF
VAE
ResMADE-VAE
AEM-KDE-VAE
AEM-VAE
AEM and AEM-KDE outperform both the proposal distribution and existing state-of-the-art methods on the Power, Gas, and Hepmass datasets, demonstrating the potential benefit of flexible energy-based conditionals. Despite regularization, overfitting was an issue for Miniboone due to the small size of the training set relative to the data dimension. This highlights the challenges associated with using very expressive models on domains with limited data, and the need for stronger regularization methods in these cases. On BSDS300, our models achieve the second highest scores relative to previous work. On this dataset we found that the validation scores were substantially higher than test-set scores, indicating differences between their empirical distributions. Overall, the AEM-KDE obtains improved scores relative to the proposal distribution, and these scores are close to those of the AEM.
We evaluate the AEM in the context of deep latent-variable models, where it can be used as an expressive prior. We train a convolutional variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014), making use of residual blocks in the encoder and decoder, in an architecture similar to previous work (Huang et al., 2018; Kingma et al., 2016). We initially train the encoder and decoder with a standard Gaussian prior, and then train an AEM post hoc to maximize the likelihood of samples from the aggregate approximate posterior
(11) \bar{q}(z) = \frac{1}{N} \sum_{n=1}^{N} q(z \mid x^{(n)}).
This training method avoids an issue where maximum-likelihood training of the proposal distribution interferes with the lower-bound objective. During pre-training, we set \beta < 1 in the modified variational objective
(12) \mathcal{L} = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \beta \, \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right).
This down-weighting of the KL-divergence term has the effect of boosting the reconstruction log probability at the expense of an aggregate posterior that is less well-matched to the standard Gaussian prior. This provides an opportunity for the AEM to improve on the original prior as a model of the aggregate posterior. Weighting of the KL-divergence has been used in previous work to reduce the occurrence of unused latent variables (Sønderby et al., 2016), and to control the types of representations encoded by the latent variables (Higgins et al., 2017).


Table 2 shows that the AEM-VAE improves substantially on the standard Gaussian prior, and that the results are competitive with existing approaches on dynamically-binarized MNIST (Burda et al., 2016). The AEM does not improve on the proposal distribution scores, possibly because the aggregate latent posterior is a mixture of Gaussians, which is well-modeled by the ResMADE-MoG. Figure 6 shows data examples and samples from the trained model.
Flow-based neural density estimation. Together with autoregressive approaches, flow-based models have also seen widespread use in neural density estimation. Flow-based models consist of invertible transformations whose Jacobian determinants can be efficiently computed, allowing exact density evaluation through the change-of-variables formula. Multiple transformations are often stacked to enable more complex models.
Efficiently invertible flows. Flows exploiting a particular type of transformation, known as a coupling layer, allow not only one-pass density evaluation, but also one-pass sampling. Examples of such models include NICE (Dinh et al., 2014) and RealNVP (Dinh et al., 2017), and the approach has recently been extended to image (Kingma & Dhariwal, 2018) and audio (Prenger et al., 2018; Kim et al., 2018) data. The case of continuous flows based on ordinary differential equations has also recently been explored by FFJORD (Grathwohl et al., 2019). However, efficient sampling for this class of models comes at the cost of density-estimation performance, with autoregressive models generally achieving better log-likelihood scores.
Autoregressive flows. Originally proposed by Kingma et al. (2016) for variational inference, autoregressive flows were adapted for efficient density estimation by Papamakarios et al. (2017) with MAF. Subsequent models such as NAF (Huang et al., 2018) and TAN (Oliva et al., 2018) have developed this idea further, reporting state-of-the-art results for density estimation. Sampling in these models is expensive, since autoregressive flow inversion is inherently sequential. In some cases, such as Huang et al. (2018), the flows do not have an analytic inverse, and must be inverted numerically for sampling. Despite these caveats, autoregressive density estimators remain the best performing neural density estimators for general density-estimation tasks.
Energy-based models. In this work we describe energy-based models as unnormalized densities that define a probability distribution over random variables. However, there exist multiple notions of energy-based learning in the machine learning literature, including non-probabilistic interpretations (LeCun et al., 2006; Zhao et al., 2017). We focus here on recent work which includes applications to density estimation with neural energy functions. Deep energy estimator networks (Saremi et al., 2018) use an energy function implemented as a neural network, and train using the score-matching framework. This objective avoids the need to estimate the normalizing constant, but also makes it challenging to compare log-likelihood scores with other density estimators. Bauer & Mnih (2018) propose an energy-based approach for increasing the flexibility of VAE priors, in which a neural network energy function is used to mask a pre-specified proposal function. As in our work, training is performed using importance sampling, but due to the larger dimensionality of the problem, a larger number of samples was required for the estimates during training.
Other related work. Müller et al. (2018) propose neural importance sampling, in which a flow-based neural sampler is optimized in order to perform low-variance Monte Carlo integration of a given target function. This is similar to the goal of the proposal distribution in the AEM, but in our case the proposal is trained jointly with an energy model, and we do not assume that the target function is known a priori.
We proposed the Autoregressive Energy Machine, a neural density estimator that addresses the challenges of energy-based modeling in high dimensions through a scalable and efficient autoregressive estimate of the normalizing constant. While exact density evaluation is intractable for an AEM, we have demonstrated that the flexibility of an energy-based model enables us to model challenging synthetic data, as well as achieve state-of-the-art results on a suite of benchmark datasets.
The authors thank George Papamakarios, Iain Murray, and Chris Williams for helpful discussion. This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.
Gutmann, M. & Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
He, K., Zhang, X., Ren, S., & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016a.
Hinton, G. E. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade, pp. 599–619. Springer, 2012.
Rezende, D. J., Mohamed, S., & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.
Salakhutdinov, R. & Murray, I. On the quantitative analysis of deep belief networks. International Conference on Machine Learning, 2008.
van den Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. Pixel recurrent neural networks. International Conference on Machine Learning, 2016b.

Residual connections (He et al., 2016a) are widely used in deep neural networks, and have demonstrated favourable performance relative to standard networks. Residual networks typically consist of many stacked transformations of the form
(13) h_{l+1} = h_l + F(h_l),
where F is a residual block. In this work, we observe that it is possible to equip a MADE (Germain et al., 2015) with residual connections when certain conditions on its masking structure are met.
At initialization, each unit in each hidden layer of a MADE is assigned a positive integer, termed its degree. The degree specifies the number of input dimensions to which that particular unit is connected. For example, writing d^l_k for the degree of unit k in layer l, a value d^l_k = j means that the unit depends only on the first j dimensions of the input, when the input is prescribed a particular ordering (we always assume the ordering given by the data). In successive layers, this means that a unit may only be connected to units in the previous layer whose degrees do not exceed its own. In other words, the degree assignment defines a binary mask matrix which multiplies the weight matrix of each layer in a MADE elementwise, maintaining autoregressive structure. Indeed, the mask M^l for layer l is given in terms of the degrees for layer l and layer l-1:
(14) M^l_{kj} = \mathbb{1}\left[d^l_k \ge d^{l-1}_j\right].
Through this masking process, it can be guaranteed that output units associated with data dimension d depend only on the previous inputs x_{<d}. Though the original MADE paper considered a number of ways in which to assign degrees, we focus here on fixed sequential degree assignment, also used by MAF (Papamakarios et al., 2017). For an input of dimension D, we define the degree of hidden unit k as
(15) d_k = ((k - 1) \bmod (D - 1)) + 1.
We also assume the dimension of each hidden layer is at least $D-1$, so that no input information is lost. The result of sequential degree assignment is illustrated in fig. 7.
If all hidden layers in the MADE have the same dimensionality $H$, sequential degree assignment means that each layer will also have the same degree structure. In this way, two vectors of hidden units in the MADE may be combined using any elementwise binary operation, as also shown in fig. 7. In particular, the output of a fully-connected block with masked layers can be added to the input of that block while maintaining the same autoregressive structure, allowing traditional residual connections to be added to the MADE architecture.
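This invariant admits a minimal numerical check: because equal-width hidden layers share the same degree pattern under sequential assignment, the skip connection and the two masked layers of a residual block all respect the same degree constraint.

```python
import numpy as np

D, H = 3, 4
hid_deg = (np.arange(H) % (D - 1)) + 1  # identical degrees in every hidden layer
# Mask shared by both masked layers inside a residual block.
M = (hid_deg[None, :] <= hid_deg[:, None]).astype(np.float64)

# Connectivity pattern of h + W2(W1 h): the identity from the skip
# connection plus the composition of the two masked layers.
block_conn = np.eye(H) + M @ M

# The block still respects the degree constraint: unit i receives no
# signal from any unit whose degree exceeds its own.
allowed = hid_deg[None, :] <= hid_deg[:, None]
assert np.all(block_conn[~allowed] == 0)
```

Since composing masks can only tighten connectivity and the identity connects each unit to itself (equal degrees), the residual sum never violates the autoregressive structure.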
In all experiments, we use the same ResMADE and ENN architectures, each with 4 pre-activation residual blocks (He et al., 2016b). The number of hidden units and the dimensionality of the context vector for the ENN are fixed across all tasks at 128 and 64 respectively. The number of hidden units in the ResMADE is tuned per experiment. For the exact experimental settings used in each experiment, see Tables 3 and 4.
For the proposal distributions, we use a mixture of Gaussians in all cases except for the checkerboard experiment, where we use a fixed uniform distribution. We use 10 mixture components for the synthetic experiments, and 20 components for all other experiments. We use a minimum scale of $10^{-3}$ for the Gaussian distributions in order to prevent numerical issues.
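One simple way to enforce such a floor is to add the minimum to a softplus-transformed raw parameter; the parameterization below is an illustrative assumption, not necessarily the one used in our code.

```python
import math

MIN_SCALE = 1e-3  # floor on component standard deviations (see Tables 3 and 4)

def component_scale(raw):
    # softplus(raw) keeps the scale positive; adding MIN_SCALE bounds it
    # away from zero so mixture log-densities remain numerically stable.
    softplus = math.log1p(math.exp(-abs(raw))) + max(raw, 0.0)  # stable softplus
    return MIN_SCALE + softplus
```

Even for very negative raw values the resulting scale never drops below the floor, so component log-densities cannot overflow.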
For optimization we use Adam (Kingma & Ba, 2014) with a cosine annealing schedule (Loshchilov & Hutter, 2016) and an initial learning rate of $5 \times 10^{-4}$. The number of training steps is adjusted per task. For Miniboone, we use only 6000 training updates, as we found overfitting to be a significant problem for this dataset. Early stopping is used to select models, although in most cases the best models are obtained at the end of training. Normalizing constants are estimated using 20 importance samples, and dropout is applied to counter overfitting, with a rate that is tuned per task. Dropout is applied between the two layers of each residual block in both the ENN and ResMADE.
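The schedule can be sketched as follows; treating the "Warmup steps" rows in Tables 3 and 4 as a linear warmup phase is our reading, and the function name is illustrative.

```python
import math

def learning_rate(step, total_steps, init_lr=5e-4, warmup_steps=0):
    """Cosine annealing without restarts (Loshchilov & Hutter, 2016),
    preceded by linear warmup on the tasks that use it."""
    if step < warmup_steps:
        return init_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * init_lr * (1.0 + math.cos(math.pi * progress))
```

In PyTorch, the same behaviour is available by wrapping `torch.optim.Adam` with `torch.optim.lr_scheduler.CosineAnnealingLR`.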
For the VAE, we use an architecture similar to those used in IAF and NAF (Kingma et al., 2016; Huang et al., 2018). For the encoder, we alternate between residual blocks that preserve spatial resolution and downsampling residual blocks with strided convolutions. The input is projected to 16 channels by an initial convolution, and the number of channels is doubled at each downsampling layer. After three downsampling layers we flatten the feature maps and use a linear layer to regress the means and log-variances of the approximate posterior over 32 latent units. The decoder mirrors the encoder: the latents are linearly projected and reshaped to a spatial block, which is then upsampled using transposed convolutions to output pixelwise Bernoulli logits.
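As a sanity check on the description above, the channel bookkeeping follows mechanically; the initial 16 channels, three doublings, and 32 latent units are from the text, while everything else about layer shapes is left unspecified here.

```python
# Encoder channel bookkeeping: project to 16 channels, then double at
# each of the three downsampling layers before flattening.
channels = 16
for _ in range(3):
    channels *= 2  # one doubling per strided downsampling block

latent_dim = 32
posterior_params = 2 * latent_dim  # means and log-variances regressed jointly
```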
Table 3: Experimental settings for the synthetic and VAE experiments.

| Hyperparameter | SPIRALS | CHECKERBOARD | DIAMOND | EINSTEIN | VAE |
|---|---|---|---|---|---|
| Batch size | 256 | 256 | 256 | 256 | 256 |
| ResMADE hidden dim. | 256 | 256 | 256 | 256 | 512 |
| ResMADE activation | ReLU | ReLU | ReLU | ReLU | ReLU |
| ResMADE dropout | 0 | 0 | 0 | 0 | 0.5 |
| Context dim. | 64 | 64 | 64 | 64 | 64 |
| ENN hidden dim. | 128 | 128 | 128 | 128 | 128 |
| ENN activation | ReLU | ReLU | ReLU | ReLU | ReLU |
| ENN dropout | 0 | 0 | 0 | 0 | 0.5 |
| Mixture comps. | 10 | — | 10 | 10 | 20 |
| Mixture comp. scale min. | 1e-3 | — | 1e-3 | 1e-3 | 1e-3 |
| Learning rate | 5e-4 | 5e-4 | 5e-4 | 5e-4 | 5e-4 |
| Total steps | 400,000 | 400,000 | 400,000 | 3,000,000 | 100,000 |
| Warmup steps | 5,000 | 0 | 0 | 0 | 0 |
Table 4: Experimental settings for the UCI and BSDS300 experiments.

| Hyperparameter | POWER | GAS | HEPMASS | MINIBOONE | BSDS300 |
|---|---|---|---|---|---|
| Batch size | 512 | 512 | 512 | 512 | 512 |
| ResMADE hidden dim. | 512 | 512 | 512 | 512 | 1024 |
| ResMADE activation | ReLU | ReLU | ReLU | ReLU | ReLU |
| ResMADE dropout | 0.1 | 0 | 0.2 | 0.5 | 0.2 |
| Context dim. | 64 | 64 | 64 | 64 | 64 |
| ENN hidden dim. | 128 | 128 | 128 | 128 | 128 |
| ENN activation | ReLU | Tanh | ReLU | ReLU | ReLU |
| ENN dropout | 0.1 | 0 | 0.2 | 0.5 | 0.2 |
| Mixture comps. | 20 | 20 | 20 | 20 | 20 |
| Mixture comp. scale min. | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3 |
| Learning rate | 5e-4 | 5e-4 | 5e-4 | 5e-4 | 5e-4 |
| Total steps | 800,000 | 400,000 | 400,000 | 6,000 | 400,000 |
| Warmup steps | 5,000 | 5,000 | 5,000 | 0 | 5,000 |