In this paper we focus on the issue of quantization. Practical lossy compression schemes rely on quantization to compute a discrete representation which can be transmitted digitally. But quantization is a non-differentiable operation and as such prevents us from optimizing encoders directly via backpropagation werbos1974bp. A common workaround is to replace quantization with a differentiable approximation during training but to use quantization at test time [e.g., toderici2016rnn, balle2016end, agustsson2017soft]. However, it is unclear how much this mismatch between training and test phases hurts performance.
A promising alternative is to get rid of quantization altogether havasi2018miracle. That is, to communicate information in a differentiable manner both at training and at test time. At the heart of this approach is the insight that we can communicate a sample from a possibly continuous distribution using a finite number of bits, also known as the reverse Shannon theorem bennett2002. However, existing realizations of this approach tend to be either computationally costly or statistically inefficient, that is, they require more bits than they transmit information.
Here, we bridge the gap between the two approaches of dealing with quantization. A popular approximation for quantization is additive uniform noise balle2016end, balle2017end. In Section 3.2, we show that additive uniform noise can be viewed as an instance of compression without quantization and describe a technique for implementing it at test time. Unlike other approaches to quantizationless compression, this technique is both statistically and computationally efficient. In Section 4.1, we show how to smoothly interpolate between uniform noise and hard quantization while maintaining differentiability. We further show that it is possible to analytically integrate out noise when calculating gradients and in some cases drastically reduce their variance (Section 4.2). Finally, we evaluate our approach empirically in Section 5 and find that a better match between training and test phases leads to improved performance, especially in models of lower complexity.
2 Related work
Most prior work on end-to-end trained lossy compression optimizes a rate-distortion loss of the form

$$-\log_2 P(\lfloor f(x) \rceil) + \lambda \, d(x, g(\lfloor f(x) \rceil)). \tag{1}$$

Here, $f$ is an encoder, $g$ is a decoder, $P$ is a probability mass function and they may all depend on parameters we want to optimize. The distortion $d$ measures the discrepancy between inputs and reconstructions and the parameter $\lambda$ controls the trade-off between it and the number of bits. The rounding function $\lfloor \cdot \rceil$ used for quantization and the discreteness of $P$ pose challenges for optimizing the encoder.
Several papers have proposed methods to deal with quantization for end-to-end trained lossy compression. toderici2016rnn replaced rounding with stochastic rounding to the nearest integer. theis2017cae applied hard quantization during both training and inference but used straight-through gradient estimates to obtain a training signal for the encoder. agustsson2017soft used a smooth approximation of vector quantization that was annealed towards hard quantization during training.
Most relevant for our work is the approach taken by balle2016end, who proposed to add uniform noise during training,

$$-\log_2 p(f(x) + u) + \lambda \, d(x, g(f(x) + u)), \tag{2}$$

as an approximation to rounding at test time. Here, $p$ is a density and $u$ is a sample of uniform noise drawn from $U([-0.5, 0.5)^D)$, where $D$ is the number of coefficients. If the distortion is a mean-squared error, then this approach is equivalent to a variational autoencoder kingma2014vae with a uniform encoder balle2017end, theis2017cae.
Another line of research studies the simulation of noisy channels using a noiseless channel, that is, the reverse of channel coding. In particular, how can we communicate a sample from a conditional distribution (the noisy channel), $q(z \mid x)$, using as few bits as possible (the noiseless channel)? The reverse Shannon theorem of bennett2002 shows that it is possible to communicate a sample using a number of bits not much larger than the mutual information between $X$ and $Z$, $I[X, Z]$.
Existing implementations of reverse channel coding operate on the same principle. First, a large number of samples $z_n$ is generated from a fixed distribution $p$. Importantly, this distribution does not depend on $x$ and the same samples can be generated on both the sender’s and the receiver’s side using a shared source of randomness (for our purposes this would be a pseudorandom number generator with a fixed seed). One of these samples is then selected and its index communicated digitally. The various methods differ in how this index is selected.
cuff2008 provided a constructive achievability proof for the mutual information bound using an approach which was later dubbed the likelihood encoder cuff2013. In this approach the index is picked stochastically with a probability proportional to . An equivalent approach dubbed MIRACLE was later derived by havasi2018miracle using importance sampling. In contrast to cuff2013, havasi2018miracle considered communication of a single sample from instead of a sequence of samples. MIRACLE also represents the first application of quantizationless compression in the context of neural networks. Originally designed for model compression, it was recently adapted to the task of lossy image compression flamich2020cwq. An earlier but computationally more expensive method based on rejection sampling was described by harsha2007.
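The shared-randomness principle described above can be sketched in a few lines. The example below is a minimal illustration and not the implementation of any of the cited methods: it assumes a standard normal candidate distribution $p$, a Gaussian target $q$, and index selection with probability proportional to the importance weight $q(z_n) / p(z_n)$. All function names are hypothetical.

```python
import math
import random


def normal_pdf(z, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def shared_candidates(seed, num_candidates):
    """Candidates z_1..z_N from the fixed distribution p = N(0, 1) (an assumption).

    Sender and receiver call this with the same seed and obtain the same list.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_candidates)]


def send(target_mean, target_std, seed, num_candidates, selection_rng):
    """Sender: draw an index with probability proportional to q(z_n) / p(z_n)."""
    candidates = shared_candidates(seed, num_candidates)
    weights = [
        normal_pdf(z, target_mean, target_std) / normal_pdf(z, 0.0, 1.0)
        for z in candidates
    ]
    return selection_rng.choices(range(num_candidates), weights=weights, k=1)[0]


def receive(index, seed, num_candidates):
    """Receiver: regenerate the candidates and look up the transmitted index."""
    return shared_candidates(seed, num_candidates)[index]


if __name__ == "__main__":
    selection_rng = random.Random(0)
    index = send(1.0, 0.5, seed=123, num_candidates=256, selection_rng=selection_rng)
    print(index, receive(index, seed=123, num_candidates=256))
```

Only the index crosses the channel, so the cost is at most $\log_2 N$ bits per sample; the number of candidates $N$ needed for a good approximation is what makes these schemes expensive.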
li2017 described a simple yet efficient approach. The authors proved that it uses at most

$$I[X, Z] + \log_2(I[X, Z] + 1) + 4 \tag{3}$$

bits on average. To our knowledge, this is the lowest known upper bound on the bits required to communicate a single sample. The overhead is still significant if we want to communicate a small amount of information but becomes negligible as the mutual information increases.
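To make the size of this overhead concrete, the following sketch evaluates the bound for a few values of the mutual information. It assumes the bound of li2017 takes the form $I + \log_2(I + 1) + 4$ bits; the function names are illustrative.

```python
import math


def bits_upper_bound(mutual_information_bits):
    """Assumed bound: I + log2(I + 1) + 4 bits on average."""
    return mutual_information_bits + math.log2(mutual_information_bits + 1.0) + 4.0


def relative_overhead(mutual_information_bits):
    """Bits spent beyond the mutual information, as a fraction of it."""
    bound = bits_upper_bound(mutual_information_bits)
    return (bound - mutual_information_bits) / mutual_information_bits


if __name__ == "__main__":
    for info in (1.0, 10.0, 100.0, 1000.0):
        print(info, bits_upper_bound(info), relative_overhead(info))
```

At 1 bit of information the bound allows 6 bits, a fivefold overhead, whereas at 1000 bits the overhead is below 2%.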
Finally, we will rely heavily on results related to universal quantization ziv1985universal, zamir1992universal to communicate a sample from a uniform distribution (Section 3.2). choi2019uq used universal quantization as a relaxation of hard quantization. However, universal quantization was used in a manner that still produced a non-differentiable loss, which the authors dealt with by using straight-through gradient estimates bengio2013straight. In contrast, here we will use fully differentiable losses during training and use the same method of encoding at training and at test time.
3 Compression without quantization
Instead of approximating quantization or relying on straight-through gradient estimates, we would like to use a differentiable channel and thus eliminate any need for approximations during training. Existing methods to simulate a noisy channel $q(z \mid x)$ require simulating a number of random variables which is exponential in the Kullback-Leibler divergence $D_\text{KL}[q(z \mid x) \,\|\, p(z)]$ for every $x$ we wish to communicate [e.g., havasi2018miracle]. Since the mutual information $I[X, Z]$ is a lower bound on the average Kullback-Leibler divergence, this creates a dilemma. On the one hand, we would like to keep the divergence small to limit the computational cost, for example, by encoding blocks of coefficients (sometimes also referred to as “latents”) separately havasi2018miracle, flamich2020cwq. On the other hand, the information transmitted should be large to keep the statistical overhead small (Equation 3).
One might hope that more efficient algorithms exist which can quickly identify an index without having to explicitly generate all samples. However, such an algorithm would allow us to efficiently sample distributions which are known to be hard to simulate even approximately (in terms of total variation distance, $D_\text{TV}$) long2010rbm, leading to the following lemma.
Lemma 1. Consider an algorithm which receives a description of an arbitrary probability distribution as input and is also given access to an unlimited number of i.i.d. random variables . It outputs such that its distribution is approximately in the sense that . If , then there is no such algorithm whose time complexity is polynomial in .
A proof and details are provided in Appendix B. In order to design efficient algorithms for communicating samples, the lemma implies we need to make assumptions about the distributions involved.
3.1 Uniform noise channel
A particularly simple channel is the additive uniform noise channel,

$$Z = f(x) + U, \quad U \sim U([-0.5, 0.5)^D). \tag{4}$$
Replacing quantization with uniform noise during training is a popular strategy for end-to-end trained compression [e.g., balle2016end, balle2017end, zhou2018tucodec]. In the following, however, we are no longer going to view this as an approximation to quantization but as a differentiable channel for communicating information. The uniform noise channel turns out to be easy to simulate computationally and statistically efficiently.
3.2 Universal quantization
For a fixed $y \in \mathbb{R}$, universal quantization is quantization with a random offset,

$$\lfloor y - U \rceil + U, \quad U \sim U([-0.5, 0.5)). \tag{5}$$

ziv1985universal showed that this form of quantization has the remarkable property of being equal in distribution to adding uniform noise directly. That is,

$$\lfloor y - U \rceil + U \sim y + U', \tag{6}$$

where $U'$ is another source of identical uniform noise. This property has made universal quantization a useful tool for studying quantization, especially in settings where quantization noise is roughly uniform. Here, we are interested in it not as an approximation but as a way to simulate a differentiable channel for communicating information. At training time, we will add uniform noise as in prior work balle2016end, balle2017end. For deployment, we propose to use universal quantization instead of switching to hard quantization, thereby eliminating the mismatch between training and test phases.
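The test-time procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the noise is drawn from Python's `random` module with a seed shared by encoder and decoder, rounding sends ties upward, and the function names are hypothetical. Entropy coding of the integers is omitted.

```python
import math
import random


def round_nearest(x):
    """Round to the nearest integer (ties go up)."""
    return math.floor(x + 0.5)


def shared_noise(seed, count):
    """Uniform noise on [-0.5, 0.5), reproducible from a shared seed."""
    rng = random.Random(seed)
    return [rng.random() - 0.5 for _ in range(count)]


def encode(coefficients, seed):
    """Encoder: integers k_i = round(y_i - u_i), which would be entropy coded."""
    noise = shared_noise(seed, len(coefficients))
    return [round_nearest(y - u) for y, u in zip(coefficients, noise)]


def decode(integers, seed):
    """Decoder: regenerate u_i from the seed and output k_i + u_i."""
    noise = shared_noise(seed, len(integers))
    return [k + u for k, u in zip(integers, noise)]


if __name__ == "__main__":
    y = [0.3, -1.7, 2.49]
    z = decode(encode(y, seed=42), seed=42)
    print([zi - yi for zi, yi in zip(z, y)])  # each error lies in [-0.5, 0.5]
```

For a fixed input, the reconstruction error behaves like a fresh draw of uniform noise, which is what the training-time channel assumes.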
If $Y$ is a random variable representing a coefficient produced by a transform, the encoder calculates discrete $K = \lfloor Y - U \rceil$ and transmits it to the decoder. The decoder has access to $U$ and computes $Z = K + U$. How many bits are required to encode $K$? zamir1992universal showed that the conditional entropy of $K$ given $U$ is

$$H[K \mid U] = I[Z, Y] = h[Y + U]. \tag{7}$$
This bound on the coding cost has two important properties. First, being equivalent to the differential entropy of means it is differentiable if the density of is differentiable. Second, the cost of transmitting is equivalent to the amount of information gained by the decoder. In contrast to other methods for compression without quantization (Equation 3), the number of bits required is only bounded by the amount of information transmitted. In practice, we will use a model to approximate the distribution of from which the distribution of can be derived, . Here, is the same density that occurs in the loss in Equation 2.
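As an illustration of how a continuous density yields code lengths for the transmitted integer, the sketch below assumes a logistic model for the coefficient $Y$ (an assumption chosen for convenience) and uses the identity $P(K = k \mid U = u) = c_Y(k + u + 0.5) - c_Y(k + u - 0.5)$, which follows from $K = \lfloor Y - u \rceil$ and equals the density of $Y + U$ at $k + u$. Names and default parameters are hypothetical.

```python
import math


def logistic_cdf(y, loc=0.0, scale=1.0):
    """Cumulative distribution of a logistic model for Y (an assumption)."""
    return 1.0 / (1.0 + math.exp(-(y - loc) / scale))


def noisy_density(z, loc=0.0, scale=1.0):
    """Density of Y + U: the density of Y convolved with a unit-width box."""
    return logistic_cdf(z + 0.5, loc, scale) - logistic_cdf(z - 0.5, loc, scale)


def code_length_bits(k, u, loc=0.0, scale=1.0):
    """-log2 P(K = k | U = u), using P(K = k | U = u) = p_{Y+U}(k + u)."""
    return -math.log2(noisy_density(k + u, loc, scale))


if __name__ == "__main__":
    u = 0.3
    print(sum(noisy_density(k + u) for k in range(-60, 61)))  # close to 1
    print(code_length_bits(0, u), code_length_bits(5, u))
```

Because the same function gives both the training-time density and the test-time probability mass, the rate optimized during training is the rate paid at test time.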
Another advantage of universal quantization over the more general reverse channel coding schemes is that it is much more computationally efficient. Its computational complexity grows only linearly with the number of coefficients to be transmitted instead of exponentially with the number of bits.
Universal quantization has previously been applied to neural networks using the same shift for all coefficients, choi2019uq. We note that this form of universal quantization is not equivalent to adding either dependent or independent noise during training. Adding dependent noise would not create an information bottleneck, since a single coefficient which is always zero could be used by the decoder to recover the noise and therefore the exact values of the other coefficients. In the following, we will always assume independent noise as in Equation 4.
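The argument about dependent noise can be checked with a toy example. The sketch below is illustrative only: it assumes the decoder knows that one coefficient (index 0 here) is always zero, and the function names are hypothetical.

```python
import random


def add_shared_noise(coefficients, rng):
    """Dependent noise: one offset u is added to every coefficient."""
    u = rng.random() - 0.5
    return [y + u for y in coefficients]


def add_independent_noise(coefficients, rng):
    """Independent noise: each coefficient receives its own offset."""
    return [y + rng.random() - 0.5 for y in coefficients]


def undo_shared_noise(noisy, zero_index=0):
    """Decoder side: the coefficient known to be zero reveals u exactly."""
    u = noisy[zero_index]
    return [z - u for z in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    y = [0.0, 1.37, -2.81, 0.5]
    print(undo_shared_noise(add_shared_noise(y, rng)))       # recovers y
    print(undo_shared_noise(add_independent_noise(y, rng)))  # does not
```

With a shared offset the remaining coefficients pass through the channel unharmed, so no information bottleneck is created; with independent offsets the same trick fails.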
Generalizations to other forms of noise such as Gaussian noise are possible and are discussed in Appendix C. Here, we will focus on a simple uniform noise channel (Section 3.2) as frequently used in the neural compression literature balle2016end, balle2017end, minnen2018joint, zhou2018tucodec.
4 Compression with quantization
While the uniform noise channel has the advantage of being differentiable, there are still scenarios where we may want to use quantization. For instance, under some conditions universal quantization is known to be suboptimal with respect to mean squared error (MSE) [zamir2014book, Theorem 5.5.1]. However, this assumes a fixed encoder and decoder. In the following, we show that quantization is a limiting case of universal quantization if we allow flexible encoders and decoders. Hence it is possible to recover any benefits quantization might have while maintaining a differentiable loss function.
4.1 Simulating quantization with uniform noise
We first observe that applying rounding as the last step of an encoder and again as the first step of a decoder would eliminate the effects of any offset $u \in [-0.5, 0.5)$,

$$\lfloor \lfloor y \rceil + u \rceil = \lfloor y \rceil. \tag{8}$$

This suggests that we may be able to recover some of the benefits of hard quantization without sacrificing differentiability by using a smooth approximation to rounding, $s$,

$$s(s(y) + u) \approx \lfloor y \rceil. \tag{9}$$
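We are going to use the following function which is differentiable everywhere (Appendix D):

$$s_\alpha(y) = \lfloor y \rfloor + \frac{1}{2} \frac{\tanh(\alpha r)}{\tanh(\alpha / 2)} + \frac{1}{2}, \quad \text{where} \quad r = y - \lfloor y \rfloor - \frac{1}{2}. \tag{10}$$

The function is visualized in Figure 1A. Its parameter $\alpha$ controls the fidelity of the approximation:

$$\lim_{\alpha \to 0} s_\alpha(y) = y, \qquad \lim_{\alpha \to \infty} s_\alpha(y) = \lfloor y \rceil. \tag{11}$$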
After observing a value for random variable , we can do slightly better if our goal is to minimize the MSE of . Instead of soft rounding twice, the optimal reconstruction is obtained with
It is not difficult to see that
where evaluates to if its argument is true and otherwise. That is, the posterior over is a truncated version of the prior distribution. If we assume that the prior is smooth enough to be approximately uniform in each interval, we have
where we have used that . We will assume this form for going forward for which we still have that
that is, we recover hard quantization as a limiting case. Thus in cases where quantization is desirable, we can anneal towards hard quantization during training while still having a differentiable loss.
Smooth approximations to quantization have been used previously though without the addition of noise agustsson2017soft. Note that soft rounding without noise does not create a bottleneck since the function is invertible and the input coefficients can be fully recovered by the decoder. Thus, Equation 15 offers a more principled approach to approximating quantization.
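The behavior of soft rounding can be explored numerically. The sketch below assumes the tanh-based form $s_\alpha(y) = \lfloor y \rfloor + \tfrac{1}{2} \tanh(\alpha r) / \tanh(\alpha / 2) + \tfrac{1}{2}$ with $r = y - \lfloor y \rfloor - \tfrac{1}{2}$, and a reconstruction of the form $s_\alpha^{-1}(z - 0.5) + 0.5$, that is, the midpoint of the posterior interval under a locally uniform prior. Function names are hypothetical, and the inverse is only numerically safe for moderate $\alpha$ (roughly below 30 in double precision).

```python
import math


def soft_round(y, alpha):
    """Assumed soft-rounding function; close to y for small alpha, round(y) for large."""
    n = math.floor(y)
    r = y - n - 0.5
    return n + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0) + 0.5


def soft_round_inverse(v, alpha):
    """Inverse of soft_round on each unit interval [n, n + 1)."""
    n = math.floor(v)
    t = v - n - 0.5
    return n + 0.5 + math.atanh(2.0 * t * math.tanh(alpha / 2.0)) / alpha


def reconstruct(z, alpha):
    """Midpoint of the interval of inputs y consistent with z = s(y) + u."""
    return soft_round_inverse(z - 0.5, alpha) + 0.5


if __name__ == "__main__":
    for alpha in (1e-3, 2.0, 12.0):
        z = soft_round(2.3, alpha) + 0.2  # 0.2 plays the role of the noise u
        print(alpha, soft_round(2.3, alpha), reconstruct(z, alpha))
```

For small $\alpha$ the pipeline reduces to the plain uniform noise channel, while for large $\alpha$ the reconstruction approaches the hard-quantized value regardless of the noise.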
4.2 Reducing the variance of gradients
When is large, the derivatives of and tend to be close to zero with high probability and very large with low probability. This leads to gradients for the encoder with potentially large variance. To compensate we propose to analytically integrate out the uniform noise as follows.
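Let $h$ be a differentiable function and, as before, let $U \sim U([-0.5, 0.5))$ be a uniform random variable. We are interested in computing the following derivative:

$$\frac{d}{dy} \mathbb{E}[h(y + U)] = \mathbb{E}\left[\frac{d}{dy} h(y + U)\right]. \tag{16}$$

To get a low-variance estimate of the expectation’s derivative we could average over many samples of $U$. However, note that we also have

$$\frac{d}{dy} \mathbb{E}[h(y + U)] = \frac{d}{dy} \int_{y - 0.5}^{y + 0.5} h(y') \, dy' = h(y + 0.5) - h(y - 0.5). \tag{17}$$

That is, the gradient of the expectation can be computed analytically with finite differences. Furthermore, Equation 17 allows us to evaluate the derivative of the expectation even when $h$ is not differentiable.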
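The finite-difference identity is easy to verify numerically. The sketch below compares $h(y + 0.5) - h(y - 0.5)$ with a Monte Carlo average of the derivative for the test function $h(v) = v^3$, which is chosen here purely for illustration; the function names are hypothetical.

```python
import math
import random


def expected_derivative(h, y):
    """d/dy E[h(y + U)] for U uniform on [-0.5, 0.5), via finite differences."""
    return h(y + 0.5) - h(y - 0.5)


def sampled_derivative(dh, y, num_samples, rng):
    """Monte Carlo estimate of E[h'(y + U)] from the derivative dh of h."""
    total = sum(dh(y + rng.random() - 0.5) for _ in range(num_samples))
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print(expected_derivative(lambda v: v ** 3, 0.7))               # exact
    print(sampled_derivative(lambda v: 3.0 * v ** 2, 0.7, 10000, rng))  # noisy
    # Hard rounding has no useful derivative, yet its expectation does:
    print(expected_derivative(lambda v: math.floor(v + 0.5), 0.7))
```

The exact value costs two function evaluations, whereas the sampled estimate needs many draws to reach comparable accuracy.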
Now consider the case where we apply pointwise to a vector with followed by a multivariable function . Then
where the approximation in (19) is obtained by assuming the partial derivative is uncorrelated with . This would hold, for example, if were locally linear around such that its derivative is the same for any possible perturbed value .
Equation 20 corresponds to the following modification of backpropagation: the forward pass is computed in a standard manner (that is, evaluating for a sampled instance ), but in the backward pass we replace the derivative with its expected value, .
Consider a model where soft-rounding follows the encoder, , and a factorial entropy model is used. The rate-distortion loss becomes
We can apply Equation 17 directly to the rate term to calculate the gradient of (Figure 1B). For the distortion term we use Equation 20 where takes the role of . Interestingly, for the soft-rounding function and its inverse the expected derivative takes the form of a straight-through gradient estimate bengio2013straight. That is, the expected derivative is always 1.
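The claim that the expected derivative of soft rounding equals 1 can be checked directly, since $s_\alpha(y + 0.5) - s_\alpha(y - 0.5) = 1$ whenever $s_\alpha(y + 1) = s_\alpha(y) + 1$. The sketch below assumes the tanh-based soft-rounding function and also estimates the variance of the sampled derivative for comparison; function names are hypothetical.

```python
import math
import random


def soft_round(y, alpha):
    """Assumed tanh-based soft-rounding function."""
    n = math.floor(y)
    r = y - n - 0.5
    return n + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2.0) + 0.5


def soft_round_derivative(y, alpha):
    """Pointwise derivative of soft_round with respect to y."""
    r = y - math.floor(y) - 0.5
    return alpha * (1.0 - math.tanh(alpha * r) ** 2) / (2.0 * math.tanh(alpha / 2.0))


def expected_soft_round_derivative(y, alpha):
    """E[s'(y + U)] computed exactly as a finite difference."""
    return soft_round(y + 0.5, alpha) - soft_round(y - 0.5, alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [soft_round_derivative(0.3 + rng.random() - 0.5, 10.0) for _ in range(10000)]
    mean = sum(draws) / len(draws)
    variance = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(expected_soft_round_derivative(0.3, 10.0), mean, variance)
```

The sampled derivatives average to 1 but fluctuate strongly for large $\alpha$, which is the variance that replacing them by their expectation removes.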
Given a cumulative distribution for , the density of can be shown to be
We use this result to parametrize the density of (see Appendix E for details). Figure 1B illustrates such a model where is assumed to have a logistic distribution.
We conduct experiments with two models: (a) a simple linear model and (b) a more complex model based on the hyperprior architecture proposed by balle2018variational and extended by minnen2018joint.
The linear model operates on 8x8 blocks similar to JPEG/JFIF itu1992jpeg. It is implemented by setting the encoder to be a convolution with a kernel size of 8, a stride of 8, and 192 output channels. The decoder is set to the corresponding transposed convolution. Both are initialized independently with random orthogonal matrices saxe2014orthogonal. For the density model we use the non-parametric model of balle2018variational but adjusted for soft-rounding (Appendix E).
The hyperprior model is a much stronger model and is based on the (non-autoregressive) "Mean & Scale Hyperprior" architecture described by minnen2018joint. Here, the coefficients produced by a neural encoder , , are mapped by a second encoder to “hyper latents” . Uniform noise is then applied to both sets of coefficients. A sample is first transmitted and subsequently used to conditionally encode a sample . Finally, a neural decoder computes the reconstruction as . Following previous work, the conditional distribution is assumed to be Gaussian,
where and are two neural networks. When integrating soft quantization into the architecture, we center the quantizer around the mean prediction ,
and adjust the conditional density accordingly. This corresponds to transmitting the residual between and the mean prediction (soft rounded) across the uniform noise channel. As for the linear model, we use a non-parametric model for the density of .
We consider the following three approaches for each model:
Uniform Noise + Quantization: The model is trained with uniform noise but uses quantization for inference. This is the approach that is widely used in neural compression [e.g., balle2016end, balle2017end, minnen2018joint, zhou2018tucodec]. We refer to this setting as UN + Q or as the "test-time quantization baseline".
Uniform Noise + Universal Quantization: Here the models use the uniform noise channel during training as well as for inference, eliminating the train-test mismatch. We refer to this setting as UN + UQ. As these models have the same training objective as UN + Q, we can train a single model and evaluate it for both settings.
Uniform Noise + Universal Quantization + Soft Rounding: Here we integrate a soft quantizer (Section 4.1) into the uniform noise channel (both during training and at test time), recovering the potential benefits of quantization while maintaining the match between training and test phases using universal quantization. We refer to this setting as UN + UQ + SR.
The training examples are 256x256 pixel crops extracted from a set of 1M high-resolution JPEG images collected from the internet. The images’ initial height and width range from 3,000 to 5,000 pixels, but images were randomly resized such that the smaller dimension is between 533 and 1,200 pixels before taking crops.
We optimized all models for mean squared error (MSE). The Adam optimizer kingma2014adam was applied for 2M steps with a batch size of 8 and a learning rate of which is reduced to after 1.6M steps. For the first 5,000 steps only the density models were trained and the learning rates of the encoder and decoder transforms were kept at zero. The training time was about hours for the linear models and about hours for the hyperprior models on an Nvidia V100 GPU.
For the hyperprior models we set for and decayed it by a factor of after 200k steps. For the linear models we use slightly smaller and reduced it by a factor of after 100k steps and again after 200k steps.
For soft rounding we linearly annealed the parameter from to over the full 2M steps. At the end of training, is large enough that soft rounding gives near identical results to rounding.
We evaluate all models on the Kodak kodak dataset by computing the rate-distortion (RD) curve in terms of bits-per-pixel (bpp) versus peak signal-to-noise ratio (PSNR).
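For reference, the two axes of an RD curve can be computed as sketched below. The sketch assumes 8-bit images with a peak value of 255 and PSNR derived from the MSE in decibels; the source does not spell out these conventions, and the function names are hypothetical.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def bits_per_pixel(num_bits, height, width):
    """Total number of coded bits divided by the number of pixels."""
    return num_bits / (height * width)


if __name__ == "__main__":
    # Kodak images have 768x512 pixels.
    print(bits_per_pixel(393216, 512, 768), psnr(30.0))
```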
In Figure 5A we show results for the linear model. When comparing the UN + UQ model, which uses universal quantization, to the test-time quantization baseline UN + Q, we see that, despite the train-test mismatch, using quantization improves the RD performance at test time (hatched area). However, looking at UN + UQ + SR, we obtain an improvement in terms of RD performance (shaded area) over the test-time quantization baseline.
In Figure 3A we can observe similar albeit weaker effects for the hyperprior model. There is again a performance gap between UN + Q and UN + UQ. Introducing soft rounding again improves the RD performance, outperforming the test-time quantization baseline at low bitrates. The smaller difference can be explained by the deeper networks’ ability to imitate functionality otherwise performed by soft-rounding. For example, has a denoising effect which a powerful enough decoder can absorb.
In Figure 5B we illustrate the effect of using expected gradients on the linear model. We did not observe big differences when using the same with and (not shown). However, using and we saw significant speedups in convergence and gaps in performance at high bitrates.
For the hyperprior model we find expected gradients beneficial both in terms of performance and stability of training. In Figure 3B, we consider the UN + UQ + SR setting using either the linear schedule or alternatively a fixed , with and without expected gradients. We found that for , the models would not train stably (especially at the higher bitrates) without expected gradients (both when annealing and for fixed ) and obtained poorer performance.
In summary, for both the linear model and the hyperprior we observe that despite a train-test mismatch, the effect of the quantization is positive (UN + Q vs UN + UQ), but that further improvements can be gained by introducing soft rounding (UN + UQ + SR) into the uniform noise channel. Furthermore we find that expected gradients are helpful to speed up convergence and stabilize training.
The possibility to efficiently communicate samples has only recently been studied in information theory cover2007capacity, cuff2008 and even more recently been recognized in machine learning havasi2018miracle, flamich2020cwq. We connected this literature to an old idea from rate-distortion theory, universal quantization ziv1985universal, zamir1992universal, which allows us to efficiently communicate a sample from a uniform distribution. Unlike more general approaches, universal quantization is also computationally efficient. This is only possible because it considers a constrained class of distributions, as discussed in Lemma 1.
Intriguingly, universal quantization makes it possible to implement an approach at test time which was already popular for training neural networks balle2016end. This allowed us to study and eliminate existing gaps between training and test losses. Furthermore, we showed that interpolating between the two approaches in a principled manner is possible using soft-rounding functions.
For ease of training and evaluation our empirical findings were based on MSE. We found that already here a simple change can lead to improved performance, especially for models of low complexity. However, we expect that generative compression rippel2017waveone, agustsson2019extreme will benefit more strongly from compression without quantization. theis2017cae showed that uniform noise and quantization can be perceptually very different, suggesting that adversarial and other perceptual training losses may be more sensitive to a mismatch between training and test phases.
Finally, here we only studied one-dimensional uniform noise. Two generalizations are discussed in Appendix C and may provide additional advantages. However, we also hope that our paper will inspire work into richer classes of distributions which are easy to communicate in a computationally efficient manner.
Poor internet connectivity and high traffic costs are still a reality in many developing countries akamai2017. But also in developed countries internet connections are often poor due to congestion in crowded areas or insufficient mobile network coverage. By improving compression rates, neural compression has the potential to make information more broadly available. About 79% of global IP traffic is currently made up of videos barnett2018cisco. This means that work on image and video compression in particular has the potential to impact a lot of people.
Assigning fewer bits to one image is only possible by simultaneously assigning more bits to other images. Care needs to be taken to make sure that training sets are representative. Generative compression in particular bears the risk of misrepresenting content but is beyond the scope of this paper.
We would like to thank Johannes Ballé for thoughtful discussions and valuable comments on this manuscript.
Appendix A: Notation
Appendix B: Computational complexity of reverse channel coding
Existing algorithms for lossy compression without quantization communicate a sample by simulating a large number of random variables and then identifying an index such that is distributed according to , at least approximately in a total variation sense [e.g., cuff2008, havasi2018miracle]. Here we show that no polynomial time algorithm exists which achieves this, assuming $RP \neq NP$, where $RP$ is the class of randomized polynomial time algorithms.
Our result depends on the results of long2010rbm, who showed that simulating restricted Boltzmann machines (RBMs) [smolensky1987rbm] approximately is computationally hard. For completeness, we repeat it in slightly weakened but simpler form here:
If , then there is no polynomial-time algorithm with the following property: Given as input such that is an matrix, the algorithm outputs an efficiently evaluatable representation of a distribution whose total variation distance from an RBM with parameters is at most .
Here, an efficiently evaluatable representation of a distribution is defined as a Boolean function
with the following two properties. First, if is a random vector of uniformly random bits. Second, and the function’s computational complexity are bounded by a polynomial in .
One might hope that having access to samples from a similar distribution would help to efficiently simulate an RBM. The following lemma shows that additional samples quickly become unhelpful as the Kullback-Leibler divergence between the two distributions increases.
Consider an algorithm which receives a description of an arbitrary probability distribution as input and is also given access to an unlimited number of i.i.d. random variables . It outputs such that its distribution is approximately in the sense that . If , then there is no such algorithm whose time complexity is polynomial in .
Let be a binary vector and let
be the probability distribution of an RBM with normalization constant and parameters and . Further, let be the uniform distribution. Then
If there is an algorithm which generates an approximate sample from an RBM’s distribution in a number of steps which is polynomial in , then its computational complexity is also bounded by a polynomial in . In that time the algorithm can take into account at most random variables where is polynomial in , that is, for some polynomial . Since the input random variables are independent and identical, we can assume without loss of generality that the algorithm simply uses the first random variables. The random variables correspond to an input of uniformly random bits. Note that is still polynomial in .
However, Theorem 1 states that there is no such polynomial time algorithm if $RP \neq NP$. ∎
Appendix C: Generalizations of universal quantization
While the approach discussed in the main text is statistically and computationally efficient, it only allows us to communicate samples from a simple uniform distribution. We briefly discuss two possible avenues for generalizing this approach.
One such generalization to lattice quantizers was already discussed by ziv1985universal. Let be a lattice and be the nearest neighbor of in the lattice. Further let be a Voronoi cell of the lattice and be a random vector which is uniformly distributed over the Voronoi cell. Then [zamir2014book, Theorem 4.1.1]
An example is visualized in Figure 4. For certain lattices and in high dimensional spaces, will be distributed approximately like a Gaussian [zamir2014book]. This means universal quantization could be used to approximately simulate an additive white Gaussian noise channel.
Another possibility to obtain Gaussian noise would be the following. Let be a positive random variable independent of and . We assume that like is known to both the encoder and the decoder. It follows that
for another uniform random variable . If and , then has a Gaussian distribution with variance [qin2003scalemixture]. More generally, this approach allows us to implement any noise which can be represented as a uniform scale mixture. However, the average number of bits required for transmitting can be shown to be (Appendix B)
where . This means we require more bits than we would like to if all we want to transmit is . However, if we consider to be the message, then again we are using only as many bits as we transmit information.
Appendix D: Differentiability of soft-rounding
For , we defined a soft rounding function as
The soft-rounding function is differentiable everywhere. First, we show that the derivative exists at . The right derivative of at exists and is given by
Similarly, the left derivative at exists and is given by
Since the left and right derivatives are equal, is differentiable at . Since , the derivative also exists for other integers and it is easy to see that is differentiable for . Hence, is differentiable everywhere.
Appendix E: Adapting density models
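For the rate term we need to model the density of $Y + U$. When $Y$ is assumed to have independent components, we only need to model individual components $Y_i + U_i$. Following balle2018variational, we parameterize the model through the cumulative distribution $c_Y$ of $Y$, as we have

$$p_{Y + U}(z) = c_Y(z + 0.5) - c_Y(z - 0.5).$$

We can generalize this to model the density of $s(Y) + U$, where $s$ is an invertible function. Since $c_{s(Y)}(z) = c_Y(s^{-1}(z))$, we have

$$p_{s(Y) + U}(z) = c_Y(s^{-1}(z + 0.5)) - c_Y(s^{-1}(z - 0.5)).$$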
This means we can easily adjust a model for the density of to model the density of . In addition to being a suitable density, creating an explicit dependency on has the added advantage of automatically adapting the density if we choose to change during training.
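The adjustment can be sketched for a single coefficient as follows. The example assumes a logistic cumulative distribution for the coefficient and the tanh-based soft-rounding function, neither of which is prescribed by this appendix, and evaluates the density of the soft-rounded, noisy coefficient as a difference of cumulative distribution values at $s^{-1}(z \pm 0.5)$. Function names are hypothetical.

```python
import math


def logistic_cdf(y, loc=0.0, scale=1.0):
    """Cumulative distribution of a logistic model for Y (an assumption)."""
    return 1.0 / (1.0 + math.exp(-(y - loc) / scale))


def soft_round_inverse(v, alpha):
    """Inverse of the assumed tanh-based soft-rounding function."""
    n = math.floor(v)
    t = v - n - 0.5
    return n + 0.5 + math.atanh(2.0 * t * math.tanh(alpha / 2.0)) / alpha


def soft_rounded_noisy_density(z, alpha, loc=0.0, scale=1.0):
    """Density of s(Y) + U: difference of c_Y evaluated at s^{-1}(z +/- 0.5)."""
    upper = logistic_cdf(soft_round_inverse(z + 0.5, alpha), loc, scale)
    lower = logistic_cdf(soft_round_inverse(z - 0.5, alpha), loc, scale)
    return upper - lower


if __name__ == "__main__":
    step = 0.01
    grid = [-30.0 + (j + 0.5) * step for j in range(6000)]
    print(sum(soft_rounded_noisy_density(z, 5.0) for z in grid) * step)  # close to 1
```

Because the soft-rounding parameter enters only through the inverse, changing it during training reshapes the density without any change to the underlying model of the coefficient.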
Appendix F: Additional experimental results
Appendix G: Qualitative results
Below we include reconstructions of images from the Kodak dataset [kodak] for the three approaches UN + UQ, UN + Q, and UN + UQ + SR trained with the same trade-off parameter . We chose a low bit-rate to make the differences more easily visible.
For the linear model (Figures 7-9), reconstructions using UN + Q and UN + UQ + SR have visible blocking artefacts as would be expected given their similarity to JPEG/JFIF [itu1992jpeg]. UN + UQ masks the blocking artefacts almost completely at the expense of introducing grain.
| Example | Setting | bpp | Rate relative to UN + Q | PSNR (dB) |
|---|---|---|---|---|
| 1 | UN + UQ | 0.762 | 113% | 32.79 |
| 1 | UN + Q | 0.672 | 100% | 34.33 |
| 1 | UN + UQ + SR | 0.562 | 83% | 33.60 |
| 2 | UN + UQ | 0.836 | 111% | 32.22 |
| 2 | UN + Q | 0.750 | 100% | 33.09 |
| 2 | UN + UQ + SR | 0.623 | 83% | 32.54 |
| 3 | UN + UQ | 0.778 | 115% | 33.05 |
| 3 | UN + Q | 0.674 | 100% | 34.71 |
| 3 | UN + UQ + SR | 0.572 | 84% | 33.98 |
| 4 | UN + UQ | 0.091 | 143% | 29.36 |
| 4 | UN + Q | 0.063 | 100% | 28.78 |
| 4 | UN + UQ + SR | 0.059 | 93% | 29.55 |
| 5 | UN + UQ | 0.099 | 137% | 29.31 |
| 5 | UN + Q | 0.073 | 100% | 28.82 |
| 5 | UN + UQ + SR | 0.072 | 99% | 29.56 |
| 6 | UN + UQ | 0.156 | 130% | 27.11 |
| 6 | UN + Q | 0.120 | 100% | 26.75 |
| 6 | UN + UQ + SR | 0.123 | 102% | 27.08 |