Today, virtually all media is handled digitally. As such, it is stored in bits and is therefore discrete. Deep distribution models (larochelle2011nade; kingma2013autoencoding) aim to learn a distribution model for high-dimensional data. Many of these models are density models (uria2013rnade; oord2014factoring; dinh2017density; papamakarios2017masked), meaning they learn a distribution of a continuous random variable. Problematically, the naïve maximum likelihood solution for a continuous density model on discrete data may place arbitrarily high likelihood on the discrete locations (theis2016anote) (for an example see Figure 1(a)). Since discrete and continuous spaces are topologically different, a probability density does not necessarily approximate a probability mass. After all, the total probability at a single point under a density is always zero.
To deal with this issue, it has become common practice to add noise to datapoints, which dequantizes the data. theis2016anote show that if noise is added in a particular way, the likelihood of the continuous model is a lower bound on the likelihood of the discrete model (for an example see Figure 1(b)). This is important as it allows discrete and continuous models to be compared directly using the likelihood. Recently, ho2019flowpp showed that improving the flexibility of the noise distribution allows tighter bounds, which improves modelling performance.
Although the benefits of learned dequantization have been demonstrated in a specific case, the effects of dequantization are not yet fully understood. How do dequantization and density model interact? What is the effect of increased dequantization flexibility? Are there more sophisticated optimization objectives?
In this paper, we present a general framework for dequantization via latent variable modelling. In this framework, we are able to recover existing dequantization schemes as special cases, and we derive two new objectives: importance-weighted dequantization, and Rényi dequantization. In addition, we propose autoregressive flows to learn dequantization distributions. Although autoregressive flows are computationally expensive to invert, in this particular case the dequantization noise does not have to be inverted. Experimentally we show how density and dequantization distributions with varying flexibility interact on a 2-dimensional problem. In addition, we find that our methods improve likelihood modelling on binary MNIST and CIFAR10.
The contribution of the paper is threefold:
We outline a latent variable framework for dequantization. We recover variational inference (vi) dequantization (ho2019flowpp) and propose two new dequantization approaches based on weighted importance sampling (iw dequantization) and the variational Rényi divergence (Rényi dequantization).
We outline different dequantization distributions. We opt for using autoregressive flow dequantization (ARD) which consistently improves variational inference and log-likelihood evaluation of density models. Even though ARD utilizes autoregressive modules, it is possible to sample from the model without computing the inverse of autoregressive modules.
We evaluate iw, Rényi and vi dequantization and test different dequantization distributions on image datasets (binary MNIST and CIFAR10) quantitatively. Furthermore, we analyze the learned densities for a 2d data problem qualitatively. Using experimental results, we describe recommendations for which dequantization methods to utilize.
2 Related Work
or directly by applying autoregressive models (ARMs) (oord2016pixel). VAEs are rather easy to train and can be parameterized using different neural network architectures; however, they only provide a lower bound to the log-likelihood. On the other hand, ARMs provide an exact value of the marginal likelihood in a fast manner, but they are typically slow to sample from.
Further, flow-based models have recently also been applied to discrete variable modelling (tran2019discrete; hoogeboom2019integer). tran2019discrete consider binary and categorical variables in text analysis, but their performance on image data is currently unknown. hoogeboom2019integer show competitive performance on image data.
However, a large number of distribution models learn a density, a distribution over a continuous variable (uria2013rnade; oord2014factoring; dinh2017density; papamakarios2017masked; kingma2018glow; huang2018neural; cao2019block; grathwohl2019ffjord; hoogeboom2019emerging; ho2019flowpp; chen2019residual; song2019mintnet; ma2019macow). A standard approach adds uniform noise to discrete values (theis2016anote; uria2013rnade; oord2014factoring). Very recently, it was proposed to consider a learnable dequantization treated as a variational posterior over latent variables (i.e., continuous variables) (ho2019flowpp; winkler2019learning). In this paper, we derive a new framework for dequantization using latent variable modelling and we present two new dequantization objectives. We provide more in-depth analysis and aim at understanding how different choices of dequantization objectives and dequantization distributions affect the final performance in the log-likelihood.
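Let $\mathbf{x}$ denote a vector of $D$ observable discrete random variables and $P_{\mathrm{data}}(\mathbf{x})$ be its (unknown) distribution. We assume there is a set of data given, or, equivalently, an empirical distribution is provided. The likelihood-based approach to learning a distribution is about finding values of parameters of a model $P_{\mathrm{model}}(\mathbf{x})$ that maximize the log-likelihood function:
$$\mathbb{E}_{\mathbf{x} \sim P_{\mathrm{data}}}\left[\log P_{\mathrm{model}}(\mathbf{x})\right]. \tag{1}$$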
3.1 Dequantization as a latent variable model
Frequently, a discrete distribution models a proxy of a continuous variable in the physical world. For instance, a digital photograph of an observed scene represents the light that is reflected from observed objects, quantized to a certain precision. In other words, we can consider a latent variable model where continuous latent variables correspond to a continuous representation of the world and observable discrete variables are measured quantities. This suggests the following model:
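$$P_{\mathrm{model}}(\mathbf{x}) = \int P(\mathbf{x}|\mathbf{v})\, p_{\mathrm{model}}(\mathbf{v})\, d\mathbf{v}, \tag{2}$$
where $P(\mathbf{x}|\mathbf{v}) = \mathbb{1}[\mathbf{v} \in B(\mathbf{x})]$ is an indicator function of $\mathbf{v}$ being contained in a volume $B(\mathbf{x})$, and $p_{\mathrm{model}}(\mathbf{v})$ is a continuous distribution, which may be modeled using a flexible density model (mackay1999density; dinh2017density; rippel2013high). We refer to $P(\mathbf{x}|\mathbf{v})$ as a quantizer. Note that in principle the volumes $B(\mathbf{x})$ can be constructed to induce any type of partition of the continuous space, where care should be taken that the volumes $B(\mathbf{x})$ for different $\mathbf{x}$ do not overlap. When we set $B(\mathbf{x})$ to a half-infinite interval for each value of a binary variable, we recover half-infinite dequantization for binary variables from winkler2019learning. In this paper, since image data is often represented on a square grid, we will focus on hypercubes, namely, $B(\mathbf{x}) = \{\mathbf{x} + \mathbf{u} : \mathbf{u} \in [0, 1)^D\}$.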
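The hypercube quantizer can be illustrated with a minimal sketch. It assumes unit-width bins, so that the volume belonging to a discrete vector x is [x, x + 1) in every dimension; the function names `quantize` and `in_volume` are illustrative and do not come from the paper.

```python
import math

def quantize(v):
    """Map a continuous vector v to the discrete x whose unit hypercube contains it."""
    return [math.floor(vi) for vi in v]

def in_volume(v, x):
    """Indicator of v lying in the (assumed) hypercube [x, x + 1)^D: returns 1 or 0."""
    return int(all(xi <= vi < xi + 1 for vi, xi in zip(v, x)))
```

Because the bins are half-open, every continuous point belongs to exactly one discrete value, which is the non-overlap requirement stated above.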
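Calculating the integral in (2) is troublesome, and thus learning is infeasible, especially in high-dimensional cases. To alleviate this issue, we introduce a new distribution $q(\mathbf{v}|\mathbf{x})$ with parameters $\phi$, a dequantizing distribution or dequantizer. The dequantizer should have the same support as the quantizer $P(\mathbf{x}|\mathbf{v})$, otherwise it would assign probability mass to regions outside the volume $B(\mathbf{x})$. Therefore, we will use $\mathbf{u}$ instead of $\mathbf{v}$ in the dequantizing distribution to highlight the fact that the dequantizer $q(\mathbf{u}|\mathbf{x})$ only places mass inside $B(\mathbf{x})$, where we define $\mathbf{v} = \mathbf{x} + \mathbf{u}$. Including the dequantizer in our model yields:
$$P_{\mathrm{model}}(\mathbf{x}) = \int q(\mathbf{v}|\mathbf{x})\, \frac{P(\mathbf{x}|\mathbf{v})\, p_{\mathrm{model}}(\mathbf{v})}{q(\mathbf{v}|\mathbf{x})}\, d\mathbf{v}.$$
Existing methods in the literature define $P_{\mathrm{model}}(\mathbf{x}) := \int_{[0,1)^D} p_{\mathrm{model}}(\mathbf{x} + \mathbf{u})\, d\mathbf{u}$ to relate a discrete and a continuous model. Important differences with our method are that ours can be derived directly without this definition, and that the quantizer volume generalizes to any volumetric partition.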
Introducing the dequantizer allows us to derive three approaches to approximate the integral using i) variational inference, ii) weighted importance sampling and iii) variational Rényi approximation.
3.2 Variational Dequantization
We can interpret the dequantizing distribution as a variational distribution and apply Jensen’s inequality to obtain the lower-bound on the log-likelihood function:
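$$\log P_{\mathrm{model}}(\mathbf{x}) \geq \mathbb{E}_{\mathbf{v} \sim q(\mathbf{v}|\mathbf{x})}\left[\log \frac{P(\mathbf{x}|\mathbf{v})\, p_{\mathrm{model}}(\mathbf{v})}{q(\mathbf{v}|\mathbf{x})}\right]. \tag{5}$$
The dequantizing distribution must be restricted to assign probability mass to $B(\mathbf{x})$ only, otherwise the lower bound is undefined. Thus, for our choice of $B(\mathbf{x})$ being a hypercube, we can apply the sigmoid function to the output of the dequantizer to ensure that it has the appropriate support. As a result, we can re-write (5) as follows:
$$\log P_{\mathrm{model}}(\mathbf{x}) \geq \mathbb{E}_{\mathbf{u} \sim q(\mathbf{u}|\mathbf{x})}\left[\log p_{\mathrm{model}}(\mathbf{x} + \mathbf{u})\right] + \mathbb{H}\left[q(\mathbf{u}|\mathbf{x})\right], \tag{6}$$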
which recovers the variational dequantization from ho2019flowpp. Note that $\mathbb{H}[q(\mathbf{u}|\mathbf{x})]$ is the entropy of the dequantizing distribution for a given $\mathbf{x}$, which prevents the dequantizer from collapsing to a delta peak. We refer to this dequantization scheme as vi dequantization.
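To make the bound concrete, the following minimal sketch compares the exact log-mass of one bin with the variational bound obtained from a uniform dequantizer. It rests on assumptions that are not in the paper: a one-dimensional problem, a Gaussian standing in for the continuous model, and a midpoint rule replacing the expectation. All function names are illustrative.

```python
import math

def normal_logpdf(v, mu, sigma):
    """Log-density of a 1D Gaussian, standing in for the continuous model."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((v - mu) / sigma) ** 2

def normal_cdf(v, mu, sigma):
    return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))

def exact_log_mass(x, mu, sigma):
    """log P(x): the density integrated over the bin [x, x + 1)."""
    return math.log(normal_cdf(x + 1, mu, sigma) - normal_cdf(x, mu, sigma))

def uniform_vi_bound(x, mu, sigma, n=10_000):
    """E_{u ~ U[0,1)}[log p(x + u)] via the midpoint rule.

    The entropy of U[0, 1) is zero, so this expectation is the whole bound.
    """
    h = 1.0 / n
    return h * sum(normal_logpdf(x + (i + 0.5) * h, mu, sigma) for i in range(n))
```

The gap between the two quantities grows as the density varies more strongly inside the bin, which is the situation in which a learned dequantizer has the most room to help.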
3.3 Importance-Weighted Dequantization
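Alternatively, we can interpret the dequantizing distribution as a proposal distribution and, instead of using Jensen's inequality, sample $K$ times from $q(\mathbf{v}|\mathbf{x})$, which directly approximates the log-likelihood:
$$\log P_{\mathrm{model}}(\mathbf{x}) \approx \log \frac{1}{K} \sum_{k=1}^{K} \frac{P(\mathbf{x}|\mathbf{v}_k)\, p_{\mathrm{model}}(\mathbf{v}_k)}{q(\mathbf{v}_k|\mathbf{x})}, \tag{7}$$
where $\mathbf{v}_k = \mathbf{x} + \mathbf{u}_k$ and $\mathbf{u}_k \sim q(\mathbf{u}|\mathbf{x})$ for $k = 1, \ldots, K$. If we constrain the proposal distribution (the dequantizer) in the same manner as we did in the case of the variational dequantization (i.e., the probability mass should be assigned only to $B(\mathbf{x})$), we obtain:
$$\log P_{\mathrm{model}}(\mathbf{x}) \approx \log \frac{1}{K} \sum_{k=1}^{K} w_k, \tag{8}$$
where $w_k = p_{\mathrm{model}}(\mathbf{x} + \mathbf{u}_k) / q(\mathbf{u}_k|\mathbf{x})$ is an importance weight.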
In general, if $K \to \infty$, then we obtain an equality in (7) and (8). But since we take a finite sample, the approximation gives (in expectation) a lower bound to the log-likelihood function (iw-bound). Importantly, the iw-bound is tighter than the variational lower bound (burda2015importance; domke2018importance). Hence, importance weighting is preferable over variational inference, and in practice it leads to better log-likelihood performance. We refer to this dequantization scheme as iw dequantization.
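As a minimal sketch of how the two estimates relate, assume the log importance weights (log-density of the continuous model at the dequantized point minus the log-density of the dequantizer at the sampled noise) have already been computed for K samples. The helper names below are illustrative; the log-mean-exp trick is a standard way to average weights stably in log space.

```python
import math

def log_mean_exp(log_w):
    """Numerically stable log of the average of exp(log_w)."""
    m = max(log_w)
    return m + math.log(sum(math.exp(lw - m) for lw in log_w) / len(log_w))

def vi_estimate(log_w):
    """Variational estimate: the average of the log-weights."""
    return sum(log_w) / len(log_w)

def iw_estimate(log_w):
    """Importance-weighted estimate: the log of the averaged weights."""
    return log_mean_exp(log_w)
```

By Jensen's inequality, `iw_estimate` is never below `vi_estimate` for the same set of weights, and the two coincide when all weights are equal.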
3.4 Rényi Dequantization
Variational inference and importance-weighted sampling for a latent variable model can be generalized by noticing that both approaches are special cases of the variational Rényi bounds. It has been shown in (li2016renyi) that the log-likelihood function can be lower-bounded by the Rényi divergence approximated with a sample from $q(\mathbf{u}|\mathbf{x})$ of size $K$, namely:
$$\mathcal{L}_{\alpha, K}(\mathbf{x}) = \frac{1}{1 - \alpha} \log \frac{1}{K} \sum_{k=1}^{K} w_k^{1 - \alpha},$$
where $w_k = p_{\mathrm{model}}(\mathbf{x} + \mathbf{u}_k) / q(\mathbf{u}_k|\mathbf{x})$ are the importance weights and $\alpha$ is a hyperparameter. Interestingly, for $\alpha \to 1$ we obtain the variational lower bound and for $\alpha = 0$ we get the iw-bound.
li2016renyi have further shown that it is advantageous to consider $\alpha < 0$, because it may give tighter bounds than the iw-bound when the sample size is low. (To be precise, if we consider the infinite sample for $\alpha < 0$, we get an upper bound on the log-likelihood function. However, taking $K < \infty$ results in a tight lower bound according to Corollary 1 in (li2016renyi).) Setting $\alpha = -\infty$ corresponds to picking the largest importance weight value. Using the notation introduced in (8), we can obtain the variational Rényi max approximation (VR-max):
$$\log P_{\mathrm{model}}(\mathbf{x}) \approx \log \max_{k} w_k.$$
The maximum weight dominates the contributions of all the gradients (li2016renyi). Therefore, the VR-max approach can be seen as a fast approximation to importance weighting. The VR-max approximation speeds up computations by considering only one example instead of $K$ when calculating gradients. We refer to this whole dequantization scheme as Rényi dequantization.
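The family of estimates can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: it takes precomputed log importance weights and evaluates the Monte Carlo Rényi estimate of li2016renyi, with the limits at alpha = 1 (average log-weight) and alpha = -infinity (largest weight, VR-max) handled explicitly. The function name is hypothetical.

```python
import math

def renyi_estimate(log_w, alpha):
    """(1 / (1 - alpha)) * log mean_k w_k^(1 - alpha), computed in log space.

    alpha = 1 is treated as the limit (mean log-weight, the vi estimate),
    alpha = 0 gives the importance-weighted estimate, and
    alpha = -inf gives VR-max (the largest log-weight).
    """
    if alpha == 1.0:
        return sum(log_w) / len(log_w)
    if alpha == float("-inf"):
        return max(log_w)
    s = 1.0 - alpha
    scaled = [s * lw for lw in log_w]
    m = max(scaled)
    return (m + math.log(sum(math.exp(t - m) for t in scaled) / len(scaled))) / s
```

For a fixed set of weights the estimate does not decrease as alpha decreases, so the vi, iw and VR-max values are ordered from smallest to largest. VR-max only needs the single sample with the largest weight, which is why it is cheap to differentiate.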
3.5 Dequantizing distributions
The dequantizing distribution plays an important role in the framework, and its flexibility allows us to obtain better log-likelihood scores. As already noticed by ho2019flowpp, replacing a simple uniform distribution with a more sophisticated bipartite flow gives much better results. Importantly, the dequantizing distribution is a conditional distribution and we use it for sampling instead of calculating probabilities. Therefore, we can utilize models that are more powerful, but typically slow for evaluating probabilities, e.g., autoregressive flows (kingma2016improving).
The special case in which the dequantizer $q(\mathbf{u}|\mathbf{x})$ is a uniform distribution over $[0, 1)^D$ is equivalent to the setting introduced in (theis2016anote; uria2013rnade; oord2014factoring), termed uniform dequantization.
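Instead of using a certain family of distributions, we can define the dequantizer by applying the change of variables formula, that is:
$$q(\mathbf{u}|\mathbf{x}) = p(\boldsymbol{\epsilon}) \left|\det \mathbf{J}\right|, \qquad \boldsymbol{\epsilon} = f\left(\sigma^{-1}(\mathbf{u}); \mathbf{x}\right),$$
where $f$ is a bijective map to a simple base distribution $p(\boldsymbol{\epsilon})$, and $\mathbf{J} = \partial \boldsymbol{\epsilon} / \partial \mathbf{u}$ denotes a Jacobian matrix. Notice that we highlight the need to use the (inverse) sigmoid function $\sigma^{-1}$ on top of the bijective map in order to ensure that $\mathbf{u} \in [0, 1]^D$.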
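A one-dimensional sketch may help to see how the sigmoid enters the density. It assumes, purely for illustration, a standard normal base distribution and a single affine map with hypothetical parameters `a` and `b` followed by a sigmoid; in a real dequantizer these parameters would be produced by networks conditioned on the data.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def logit(u):
    return math.log(u) - math.log1p(-u)

def std_normal_logpdf(e):
    return -0.5 * math.log(2 * math.pi) - 0.5 * e * e

def sample_and_log_q(eps, a, b):
    """Push base noise eps through y = a * eps + b and u = sigmoid(y).

    Returns u in (0, 1) and log q(u), using d(sigmoid)/dy = u * (1 - u).
    """
    u = sigmoid(a * eps + b)
    log_q = std_normal_logpdf(eps) - math.log(abs(a)) - math.log(u) - math.log1p(-u)
    return u, log_q

def log_q_of_u(u, a, b):
    """Evaluate the same density at a given u by inverting the map."""
    eps = (logit(u) - b) / a
    return std_normal_logpdf(eps) - math.log(abs(a)) - math.log(u) - math.log1p(-u)
```

During training only `sample_and_log_q` is needed: the noise is sampled, transformed, and its log-density is obtained in the same pass, without ever inverting the map.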
There are two important parts of a flow-based model, namely, the choice of a base distribution $p(\boldsymbol{\epsilon})$ and the form of the bijective map $f$. Here we decide to use a diagonal Gaussian base distribution (dinh2017density) and we present two common choices for constructing $f$: i) bipartite bijective maps, and ii) autoregressive bijective maps.
Bipartite Dequantization The idea behind the bipartite bijective maps is to ensure invertibility by splitting an input into two parts (e.g., along channels), $\mathbf{z} = [\mathbf{z}_1, \mathbf{z}_2]$, and processing only the second part (dinh2017density), namely:
$$\mathbf{z}'_1 = \mathbf{z}_1, \qquad \mathbf{z}'_2 = s(\mathbf{z}_1, \mathbf{x}) \odot \mathbf{z}_2 + t(\mathbf{z}_1, \mathbf{x}), \tag{13}$$
where $s$ is a scaling transformation, $t$ is a translation, and $\odot$ denotes an element-wise multiplication. We explicitly write the dependency on $\mathbf{x}$ to indicate how we use the conditioning in the dequantizer. The transformation in (13) is called a coupling layer.
Further, in order to ensure that all random variables are processed, the outputs of a coupling layer are permuted and another coupling layer is applied.
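A coupling layer of this kind can be sketched as follows. The scale and translation functions here are toy stand-ins (hypothetical, chosen only so that the example runs without a neural network library), and the conditioning on the data is reduced to a single scalar context.

```python
import math

def toy_nets(z1, x):
    """Hypothetical stand-ins for the scale and translation networks."""
    ctx = sum(x) / len(x)
    log_s = [math.tanh(0.5 * a + 0.1 * ctx) for a in z1]
    t = [0.3 * a - 0.2 * ctx for a in z1]
    return log_s, t

def coupling_forward(z, x):
    """Keep the first half, scale and shift the second half; z must have even length."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    log_s, t = toy_nets(z1, x)
    out2 = [math.exp(ls) * b + tt for ls, b, tt in zip(log_s, z2, t)]
    return z1 + out2, sum(log_s)  # output and log |det Jacobian|

def coupling_inverse(out, x):
    """Exact inverse: the first half is unchanged, so the nets can be re-evaluated."""
    half = len(out) // 2
    z1, out2 = out[:half], out[half:]
    log_s, t = toy_nets(z1, x)
    return z1 + [(o - tt) * math.exp(-ls) for ls, o, tt in zip(log_s, out2, t)]
```

Both directions cost one evaluation of the networks, which is what makes bipartite maps cheap to invert. The price is that half of the variables pass through each layer unchanged, hence the permutation between layers.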
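Autoregressive Dequantization We can model the dequantizer $q(\mathbf{u}|\mathbf{x})$ with an 'expensive to invert' bijective map. In this paper, we find that an autoregressive model as proposed for variational autoencoders (kingma2016improving) is an appealing choice for dequantization. The model can be formulated as follows:
$$[\boldsymbol{\mu}, \boldsymbol{\sigma}] = \mathrm{ARM}(\mathbf{z}, \mathbf{h}), \qquad \mathbf{z}' = \boldsymbol{\sigma} \odot \mathbf{z} + \boldsymbol{\mu},$$
where $\mathrm{ARM}$ is an autoregressive model (an autoregressive neural network) and $\mathbf{h}$ is a context variable that is calculated based on the conditioning $\mathbf{x}$ using a neural network, $\mathbf{h} = \mathrm{NN}(\mathbf{x})$. We refer to this dequantization scheme as Autoregressive Dequantization (ARD).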
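The asymmetry between the two directions of an autoregressive map can be seen in a small sketch. The parameter function `ar_params` is a hypothetical stand-in for the autoregressive network: the scale and shift of dimension d depend only on the preceding input dimensions and on a scalar context `h`.

```python
import math

def ar_params(prefix, h):
    """Toy autoregressive parameters from the preceding inputs and the context h."""
    a = sum(prefix) + h
    return 0.5 * math.tanh(a), 0.1 * a  # (log-scale, shift)

def ard_forward(eps, h):
    """Sampling direction: every parameter depends only on the known input eps,
    so a masked network could compute all dimensions in a single pass."""
    out, log_det = [], 0.0
    for d in range(len(eps)):
        log_s, t = ar_params(eps[:d], h)
        out.append(math.exp(log_s) * eps[d] + t)
        log_det += log_s
    return out, log_det

def ard_inverse(out, h):
    """Inverse direction: dimension d needs the already recovered inputs before d,
    so the loop is inherently sequential."""
    eps = []
    for d in range(len(out)):
        log_s, t = ar_params(eps, h)
        eps.append((out[d] - t) * math.exp(-log_s))
    return eps
```

Dequantization only ever calls the forward direction (sample noise, transform it, read off the log-determinant), so the sequential inverse is never needed during training or evaluation.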
3.6 Continuous distributions
The continuous model is the crucial component in the presented framework, since performance depends on the flexibility of this model. In principle, any continuous density model could be used as $p_{\mathrm{model}}(\mathbf{v})$, e.g., any model mentioned in subsection 3.5. In practice, however, $p_{\mathrm{model}}(\mathbf{v})$ has to be evaluated during training and we are also interested in sampling from it. Hence, utilizing models with autoregressive components would be prohibitively slow. Therefore, in our experiments, we consider a Gaussian distribution with diagonal covariance, a Gaussian with full covariance, and a bipartite flow-based model (a series of coupling layers and a factored-out base distribution) as the continuous distribution.
To understand and evaluate different dequantization schemes, they are tested on three different data problems: i) a 2-dimensional binary problem, ii) (statically) binarized MNIST (larochelle2011nade), which is derived directly from MNIST, and iii) CIFAR10 (krizhevsky2009learning) (8 bit and 5 bit). Generally we find that for problems with lower bit depths dequantization matters more for performance, as the range of dequantization noise is relatively large with respect to the range of the data. For MNIST data we use the given split of train, validation and test images. For CIFAR10 we split the 50000 training images into the first for train and the last for validation, and we use the test images as provided.
Performance is evaluated on a held-out test-set using negative log-likelihood. This method of evaluation is common in deep distribution learning literature because it allows for an information theoretic interpretation: the negative log-likelihood is expressed in bits or bits per dimension (bpd), where the latter is an average over dimensions. Interestingly, this number represents the theoretical lossless compression limit when this model is used to compress the data.
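The conversion to bits per dimension is a simple rescaling. The sketch below assumes the negative log-likelihood is reported in nats (natural logarithm), as is typical when it comes out of a deep learning framework; the function name is illustrative.

```python
import math

def nats_to_bpd(nll_nats, num_dims):
    """Convert a per-example negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))
```

For a CIFAR10 image, `num_dims` is 32 * 32 * 3 = 3072, so a model at 3.0 bpd assigns each image a negative log-likelihood of 3.0 * 3072 bits.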
In the experiments we consider diagonal Gaussian, covariance Gaussian and flows as distribution models, since these models admit exact likelihood evaluation. The diagonal Gaussian is parametrized straightforwardly using parameters for the mean and log scale. The covariance Gaussian is parametrized using a Cholesky decomposition, i.e., the precision $\boldsymbol{\Lambda} = \mathbf{L}\mathbf{L}^{\top}$, where the lower-triangular matrix $\mathbf{L}$ is the learnable parameter. The diagonal of $\mathbf{L}$ is modelled separately using a log diagonal parameter, which ensures positive-definiteness of $\boldsymbol{\Lambda}$. The covariance matrix is then defined as $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}^{-1}$. Further, flows have an architecture as described in (kingma2018glow) using the coupling networks from (hoogeboom2019integer). For more details regarding architecture and training, see Appendix A.
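The Cholesky parametrization of the precision can be sketched in plain Python. This is an illustration under the assumption that the precision is the product of a lower-triangular factor with its transpose and that the diagonal of the factor is stored in log space; the function names are hypothetical.

```python
import math

def build_cholesky(strict_lower, log_diag):
    """Lower-triangular factor with exp(log_diag) on the diagonal (always positive)."""
    dim = len(log_diag)
    factor = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(i):
            factor[i][j] = strict_lower[i][j]
        factor[i][i] = math.exp(log_diag[i])
    return factor

def gaussian_logpdf_precision(v, mean, factor):
    """Gaussian log-density with precision = factor @ factor^T, without any inversion."""
    dim = len(v)
    diff = [a - b for a, b in zip(v, mean)]
    # y = factor^T diff, so that diff^T (factor factor^T) diff = |y|^2
    y = [sum(factor[i][j] * diff[i] for i in range(dim)) for j in range(dim)]
    quad = sum(c * c for c in y)
    log_det_precision = 2.0 * sum(math.log(factor[i][i]) for i in range(dim))
    return -0.5 * dim * math.log(2 * math.pi) + 0.5 * log_det_precision - 0.5 * quad
```

Working with the precision keeps likelihood evaluation free of matrix inversions: the log-determinant is read off the diagonal and the quadratic form is a single matrix-vector product.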
4.1 Analysis in 2d
First, we analyze different dequantization methods and objectives in two dimensions, where the learned dequantization and model distributions can be visualized. We construct a binary problem named the binary checkerboard, which places uniform probability over two of the four states in the binary space $\{0, 1\}^2$:
The theoretical likelihood limit of a dataset is typically unknown, however, the binary checkerboard is artificially constructed. Hence, the theoretical limit of the average negative log-likelihood is known and it is exactly 1 bit for this problem, because there is an equal probability over two events. In particular:
where the first equality becomes an inequality when approximated with variational inference ($K = 1$) or finite-sample importance weighting ($K < \infty$). The optimum of the objective is reached when $P_{\mathrm{model}} = P_{\mathrm{data}}$; in that case the cross-entropy equals the entropy $\mathbb{H}[P_{\mathrm{data}}]$. For the binary checkerboard specifically, $\mathbb{H}[P_{\mathrm{data}}] = 1$ bit.
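The 1-bit limit can be checked directly. The sketch assumes that the two occupied states are the diagonal ones, (0, 0) and (1, 1); this choice is an assumption made for illustration, and the entropy is the same for any two equally likely states.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed layout of the binary checkerboard: mass on the two diagonal states.
checkerboard = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}
limit_bits = entropy_bits(checkerboard.values())
```

No model can reach an average negative log-likelihood below this value on the checkerboard, so it serves as a reference point for the bounds reported in the experiments.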
Since the problem is two dimensional, the learned distributions can be visualized. Figure 3 depicts the probability density of the dequantizer and of the density model, for models trained using vi dequantization. Since by construction the dequantizer only places density on the bin corresponding to the observed discrete value, the dequantizer densities for all values can be visualized without overlap in a single marginal distribution.
When the model is a flow and the dequantizer is uniform (Figure 2(a)), the model struggles to adequately model the boundaries of the dequantized density regions. When the model is a simple diagonal Gaussian and the dequantizer is a flow (Figure 2(b)), the flexible dequantizer compensates for the limitations of the density model by shaping itself to the simple distribution. A variant where the density model is slightly more flexible can be seen when the density model is a Gaussian with covariance and the dequantizer is still a flow (Figure 2(c)). Aided by the dequantizer, the model aims to place density on the diagonal line, which improves the performance to 1.08 bits, already close to the theoretical limit. The best performance is achieved when both the density model and the dequantizer are flexible (Figure 2(d)). For this problem we observe that the density contracts somewhat away from the boundaries, and that the center has relatively high density.
The effects seen in Figure 3 are also reflected quantitatively in the likelihood performance of these models (Table 1). Note that the more flexible the distributions, the better the performance. Another interesting observation is that when the density model is a flow, a Gaussian dequantizer and a flow dequantizer have equal performance. Presumably, the flexibility of the density model means that a more complicated dequantizer is not required for this relatively simple problem, where performance is already close to the theoretical limit of 1 bit.
Next to vi dequantization, we study the effects when models are optimized using iw and Rényi dequantization (Table 2). We find that uniformly dequantized models that are trained using Rényi dequantization are considerably better in terms of likelihood than those trained using vi. Furthermore, when trained using iw dequantization the model achieves performance close to the theoretical limit. For more complicated dequantizers, though, we find that improvements are negligible. Therefore, these sophisticated objectives appear to be particularly useful when the dequantization distribution is simple. An interesting remark specific to the binary checkerboard is that Rényi dequantization for values larger than $K = 2$ tends to diverge from iw dequantization. The model fits to this divergence, which results in a worse likelihood score. Presumably, this occurs because the binary checkerboard is low-dimensional, as this effect is not seen on binary MNIST and CIFAR10 (see the following subsections). Additionally, we note that ARD and bipartite dequantization are equivalent in two dimensions, and hence a comparison is only meaningful on higher dimensional problems.
vi is equivalent to iw or Rényi with $K = 1$.
4.2 Image distribution modelling
In this section different dequantizer distributions and objectives are tested on binary MNIST and CIFAR10 (8 bit and 5 bit). Problems with lower bit depths are interesting, because dequantization noise is relatively large with respect to the range of values that the data takes.
Importance weighted and Rényi dequantization
Similar to the 2d example, more sophisticated training objectives are most advantageous when dequantization distributions are simple, which can be seen in Table 3. On binary MNIST, training using iw dequantization improves negative log-likelihood performance from 0.162 bpd to 0.159 bpd. Again, for more expressive learnable dequantizers, we find that the added benefit of these objectives is negligible. For 5 and 8 bit CIFAR10 we train only the last 100 epochs with the sophisticated objectives and the earlier epochs with vi to reduce the computational cost. We find that for CIFAR10 the performance gains are minimal, possibly due to the higher bit depth. For simple dequantizers on data with low bit depth, iw dequantization considerably improves performance. Rényi dequantization achieves similar but slightly worse performance, which is acceptable since it is a faster approximation.
|bit depth||1 bit||5 bit||8 bit|
vi is equivalent to iw or Rényi with $K = 1$.
Experiments show that ARD outperforms all other dequantization distributions when trained using comparable architectures. On binary MNIST we consider two density models, a Gaussian with covariance and a flow-based model. Both of these models benefit from ARD, as presented in Table 4. Similar to the findings on the 2d binary checkerboard, when the density model is a Gaussian and therefore less flexible, more flexible dequantizers can compensate. Consider for instance the performance improvement in log-likelihood from uniform dequantization to ARD: the improvement for a flow is 0.015 bpd, whereas the improvement for the Gaussian is approximately 0.3 bpd.
On CIFAR10, our proposed ARD again outperforms other dequantization methods in likelihood modelling (Table 5). Notice that the dequantization distribution seems to matter more when bit depths are smaller. To see this, consider the log-likelihood improvement when comparing uniform dequantization and ARD: for the 8 bit data the improvement is 0.12 bpd, which is about 3.7% relative to the total bpd, whereas for 5 bit data the improvement is already 0.20 bpd, which is about 12% in relative terms. We observe that log-likelihood modelling of lower bit depth data may especially benefit from more expressive dequantizers.
In this experiment the model using ARD is compared with other methods in the literature. Experiments show that our model outperforms other methods in the literature on both the variational inference objective and negative log-likelihood (Table 6). In general we compare to models that do not require an autoregressive inverse to sample from, where the exception is marked . In particular, we report vi evaluation, also referred to as the evidence lower bound (ELBO), and we report the approximate negative likelihood - using importance weighted samples following maaloe2019biva. Note that ho2019flowpp use samples, which skews the experiment in their favour for vi and - , but against them for . Architecturally, the density model in ARD is most similar to IDF (hoogeboom2019integer), where 1×1 convolutions from Glow (kingma2018glow) and scale transformations from RealNVP (dinh2017density) are added. Flow++ (ho2019flowpp) has additional attention layers and MintNet (song2019mintnet) has autoregressive transformations instead of coupling layers in the density model. Note that even though our model utilizes autoregressive components similar to MintNet (song2019mintnet), our model is computationally cheap to invert since it does not require the solution to autoregressive inverses. Residual Flow (chen2019residual) utilizes invertible ResNets instead of coupling layers.
|Residual Flow (chen2019residual)||n/a||3.28||n/a|
Sampling from model requires autoregressive inverse.
Sampling from model requires other iterative procedures.
n/a not available, this value exists but was not reported in the literature.
This section aims to give the reader recommendations on which dequantization methods to use and what gains are to be expected. When the dequantization noise is fixed, iw or Rényi dequantization objectives generally improve log-likelihood performance. When these objectives are too expensive to use for the complete procedure, it is generally enough to train the first epochs or iterations with vi dequantization, and then finetune the last epochs using iw or Rényi. Note that by design of these objectives, the approximate posterior will diverge more from the (unknown) true posterior. Therefore, a downside of these objectives is that a single-sample iw evaluation (equivalent to vi) will be a poor approximation to the log-likelihood, and multiple samples are required instead.
When dequantization noise can be learned, the vi dequantization objective is generally sufficient. If the reader either has a simple density model or is interested in obtaining the highest log-likelihood performance possible, we recommend using ARD, as its flexibility improves the modelling performance. However, when computational resources are scarce and some performance decrease is acceptable, Gaussian dequantization might be a good simple alternative.
In this paper we propose two dequantization objectives: importance-weighted (iw) dequantization and Rényi dequantization. In addition, we improve the flexibility of dequantization distributions with autoregressive dequantization (ARD). Empirically we demonstrate improved likelihood modelling for models trained with iw and Rényi dequantization when dequantization distributions are simple. Furthermore we demonstrate that ARD achieves a negative log-likelihood of 3.06 bits per dimension on CIFAR10, which to the best of our knowledge is state-of-the-art among distribution models that do not require autoregressive inverses for sampling.
Appendix A Architecture and Optimization details
|Experiment||levels||subflows||net. depth||net. channels||context channels||levels||subflows||batch size|
|CIFAR10 (Literature comparison)||2||18||12||768||16||1||2||128|
Models were all optimized using Adam (kingma2015adam) with a learning rate of and standard parameters. Furthermore, during the initial 10 epochs the learning rate is multiplied by the epoch number divided by 10, referred to as warmup (kingma2018glow). All our code was implemented in PyTorch (paszke2017automatic). The basic architecture was built following (kingma2018glow): the flow is divided into multiple levels with a decreasing number of dimensions. At the end of every level, half of the representation is modelled using a factor-out layer (splitprior) (dinh2017density; kingma2018glow). Every level consists of subflows, i.e., a coupling layer followed by a 1×1 convolution (kingma2018glow). The coupling layers utilize neural networks as described in (hoogeboom2019integer). For the autoregressive transformation, we utilize the masking described in (song2019mintnet). In terms of autoregressive order, this is equivalent to reshaping a C × H × W image to a vector and applying the autoregressive mask. This is opposed to the masking in (kingma2016improving), which is equivalent to a mask on a reshaped H × W × C image. In practice, the autoregressive transformation is obtained by masking convolutions.
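The warmup schedule described above amounts to a linear ramp of the learning rate. The sketch below assumes epochs are counted from 1 and leaves the base learning rate to the caller; both the function name and the counting convention are assumptions.

```python
def warmup_lr(base_lr, epoch, warmup_epochs=10):
    """Scale base_lr by epoch / warmup_epochs during warmup, then keep it constant."""
    return base_lr * min(1.0, epoch / warmup_epochs)
```

With this convention the first epoch runs at one tenth of the base learning rate and the full rate is reached at epoch 10.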
Appendix B Samples from trained model
Samples from a density model and from the quantizer are depicted in Figure 4. The quantizer is simply a Kronecker delta peak and amounts to applying a floor function. The density model is a flow trained with autoregressive dequantization on standard 8 bit CIFAR10. Notice that although the model is trained using autoregressive dequantization, the density model uses bipartite transformations and does not require the solution to autoregressive inverses.
Appendix C Visualizations on Binary Checkerboard
This section gives a comprehensive visual overview of the dequantizer and density model pairs. The models trained using variational inference are displayed in Table 8. In general, the dequantizer and the density model try to compensate for each other where they are lacking flexibility. This effect can be seen when the dequantizer is a flow and the density model is a diagonal Gaussian, a covariance Gaussian and lastly a flow. When the density model is a flow, it is generally difficult to capture the edges of the squares when the dequantization noise is uniform. However, both Gaussian and flow dequantization perform equally well when the model is a flow. In this simple problem, Gaussian dequantization is sufficiently flexible when combined with a flow.
The models trained using iw and Rényi dequantization objectives are depicted in Table 9. An important difference with vi dequantization is that it is much less important for the dequantizer and the density model to match completely. Rather, more emphasis is placed on the density model placing mass somewhere in the appropriate bin, where the exact location in the bin matters less. As a result, when the dequantizer is uniform the model is not forced to learn the uniform square and retracts somewhat away from the edges.