Code for paper "Variable-Bitrate Neural Compression via Bayesian Arithmetic Coding"
Deep Bayesian latent variable models have enabled new approaches to both model and data compression. Here, we propose a new algorithm for compressing latent representations in deep probabilistic models, such as variational autoencoders, in post-processing. The approach thus separates model design and training from the compression task. Our algorithm generalizes arithmetic coding to the continuous domain, using adaptive discretization accuracy that exploits estimates of posterior uncertainty. A consequence of the "plug and play" nature of our approach is that various rate-distortion trade-offs can be achieved with a single trained model, eliminating the need to train multiple models for different bit rates. Our experimental results demonstrate the importance of taking into account posterior uncertainties, and show that image compression with the proposed algorithm outperforms JPEG over a wide range of bit rates using only a single machine learning model. Further experiments on Bayesian neural word embeddings demonstrate the versatility of the proposed method.
Probabilistic latent-variable models have become a mainstay of modern machine learning. Scalable approximate Bayesian inference methods, in particular Black Box Variational Inference (Ranganath et al., 2014; Rezende et al., 2014), have spurred the development of increasingly large and expressive probabilistic models, including deep generative probabilistic models such as variational autoencoders (Kingma and Welling, 2014a) and Bayesian neural networks (MacKay, 1992; Blundell et al., 2015). One natural application of deep latent variable modeling is data compression, and recent work has focused on end-to-end procedures that optimize a model for a particular compression objective. Here, we study a related but different problem: given a trained model, what is the best way to encode the information contained in its continuous latent variables?
As we demonstrate, our proposed solution provides an entirely new “plug & play” approach to lossy compression that separates the compression task from modeling and training. Our method can be applied to both model and data compression, and allows tuning the trade-off between bitrate and reconstruction quality without the need to retrain the model.
Compression aims to best describe some data in as few bits as possible. For continuous-valued data like natural images, videos, or distributed representations, digital compression is necessarily lossy, as arbitrary real numbers cannot be perfectly represented by a finite number of bits. Lossy compression algorithms therefore typically find a discrete approximation of some semantic representation of the data, which is then encoded with a lossless compression method.
In classical lossy compression methods such as JPEG or MP3, the semantic representation is carefully designed to support compression at variable bitrates. By contrast, state-of-the-art deep learning based approaches to lossy data compression (Ballé et al., 2017, 2018; Rippel and Bourdev, 2017; Mentzer et al., 2018; Lombardo et al., 2019) are trained to minimize a distortion metric at a fixed bitrate. To support variable-bitrate compression, one has to train several models for different bitrates. While training several models may be viable in many cases, a bigger issue of this approach is the increase in decoder size, as the decoder has to store the parameters of not one but several deep neural networks, one for each bitrate setting. In applications like video streaming under fluctuating connectivity, the decoder further has to load a new deep learning model into memory every time a change in bandwidth requires adjusting the bitrate.
By contrast, we propose a lossy neural compression method that decouples training from compression, and that enables variable-bitrate compression with a single model. We generalize a classical entropy coding algorithm, Arithmetic Coding (Witten et al., 1987; MacKay, 2003), from discrete data to the continuous domain. At the heart of the proposed Bayesian Arithmetic Coding algorithm is an adaptive quantization scheme that exploits posterior uncertainty estimates to automatically reduce the accuracy of latent variables for which the model is uncertain anyway. This strategy is analogous to the way humans communicate quantitative information. For example, Wikipedia lists the population of Rome in 2017 with the specific number . By contrast, its population in the year AD is estimated by the rounded number because the high uncertainty would make a more precise number meaningless. Our ablation studies show that this posterior-informed quantization scheme is crucial to obtaining competitive performance.
In detail, our contributions are as follows:
A new algorithm. We present a fundamentally novel approach to compressing latent variables in a variational inference framework. Our approach generalizes arithmetic coding from discrete to continuous distributions and takes posterior uncertainty into account.
Single-model compression at variable bitrates. The decoupling of modeling and compression allows us to adjust the trade-off between bitrate and distortion in post-processing. This is in contrast to existing approaches to both data and model compression, which often require specialized models for each bitrate.
Automatic self-pruning. Deep latent variable models often exhibit posterior collapse, i.e., the variational posterior collapses to the model prior. In our approach, latent dimensions with collapsed posteriors require close to zero bits and thus do not require manual pruning.
Competitive experimental performance. We show that our method outperforms JPEG over a wide range of bitrates using only a single model. We also show that we can successfully compress word embeddings with minimal loss, as evaluated on a semantic reasoning task.
Compressing continuous-valued data is a classical problem in the signal processing community. Typically, a distortion measure (often the squared error) and a source distribution are assumed, and the goal is to design a quantizer that optimizes the rate-distortion (R-D) performance (Lloyd, 1982; Berger, 1972; Chou et al., 1989). Optimal vector quantization, although theoretically well-motivated (Gallager, 1968), is not tractable in high-dimensional spaces (Gersho and Gray, 2012) and not scalable in practice. Therefore most classical lossy compression algorithms map data to a suitably designed semantic representation, in such a way that coordinate-wise scalar quantization can be fruitfully applied.
Recent machine-learning-based data compression methods learn such representations from data instead of designing them by hand, but similar to classical methods, most such ML methods directly take quantization into account in the generative model design or training. Various approaches replace the non-differentiable quantization operation with either stochastic binarization (Toderici et al., 2016, 2017), additive uniform noise (Ballé et al., 2017, 2018; Habibian et al., 2019), or other differentiable approximations (Agustsson et al., 2017; Theis et al., 2017; Mentzer et al., 2018; Rippel and Bourdev, 2017); many such schemes result in uniform quantization of the latent variables, with the exception of (Agustsson et al., 2017), which optimizes for quantization grid points.
We depart from such approaches by considering quantization as a post-processing step that decouples quantization from model design and training. An important feature of our algorithm is a new quantization scheme that automatically adapts to different length scales in the representation space by exploiting posterior uncertainty estimates. To the best of our knowledge, the only prior work that uses posterior uncertainty for compression is in the context of bits-back coding (Honkela and Valpola, 2004; Townsend et al., 2019), but these works focus on lossless compression.
Most existing neural image compression methods require training a separate machine learning model for each desired bitrate setting (Ballé et al., 2017, 2018; Mentzer et al., 2018; Theis et al., 2017; Lombardo et al., 2019). In fact, Alemi et al. (2018) showed that any particular fitted VAE model only targets one specific point on the rate-distortion curve. Our approach has the same goal of variable-bitrate single-model compression in mind as methods based on recurrent VAEs (Gregor et al., 2016; Toderici et al., 2016, 2017; Johnston et al., 2018), which use dedicated model architectures for progressive image reconstruction; we instead focus more broadly on lossy compression for any given generative model, designed and trained for specific application purposes (possibly other than compression).
We now propose an algorithm for compressing latent variables in trained models. After describing the problem setup and assumptions (Subsection 3.1), we briefly review Arithmetic Coding (Subsection 3.2). Subsection 3.3 describes our proposed lossy compression algorithm, which generalizes Arithmetic Coding to the continuous domain.
We consider a wide class of generative probabilistic models with data $x$ and unknown (or “latent”) variables $z \in \mathbb{R}^K$ from some continuous latent space with dimension $K$. The generative model is defined by a joint probability distribution,
$$p(x, z) = p(z)\, p(x \mid z), \tag{1}$$
with a prior $p(z)$ and a likelihood $p(x \mid z)$. Although our presentation focuses on unsupervised representation learning, our framework also captures the supervised setup. [1]
[1] For supervised learning with labels $y$, we would consider a conditional generative model with conditional likelihood $p(y \mid z, x)$, where $z$ are the model parameters, treated as a Bayesian latent variable with associated prior $p(z)$.
Our proposed compression method uses $z$ as a proxy to describe the data $x$. This requires “solving” Eq. 1 for $z$ given $x$, i.e., inferring the posterior $p(z \mid x)$. Since exact Bayesian inference is often intractable, we resort to Variational Inference (VI) (Jordan et al., 1999; Blei et al., 2017; Zhang et al., 2019), which approximates the posterior by a so-called variational distribution
$$q_\phi(z \mid x) \approx p(z \mid x) \tag{2}$$
by minimizing the Kullback-Leibler divergence $D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z \mid x)\big)$ over a set of variational parameters $\phi$.
We assume that both the prior and the variational distribution are fully factorized (mean-field assumption). For concreteness, our examples use a Gaussian variational distribution. Thus,
$$p(z) = \prod_{i=1}^{K} p(z_i), \qquad q_\phi(z \mid x) = \prod_{i=1}^{K} \mathcal{N}\big(z_i;\, \mu_i, \sigma_i^2\big), \tag{3}$$
where $p(z_i)$ is a prior for the $i$th component of $z$, and the means $\mu_i$ and standard deviations $\sigma_i$ together comprise the variational parameters $\phi$ over which VI optimizes. [2]
[2] These parameters are often amortized by a neural network (in which case $\mu_i$ and $\sigma_i$ depend on $x$), but they do not have to be (in which case $\mu_i$ and $\sigma_i$ do not depend on $x$ and are directly optimized).
Prominently, the model class defined by Eqs. 1-3 includes variational autoencoders (VAEs) (Kingma and Welling, 2014b) for data compression, but we stress that the class is much wider, capturing also Bayesian neural nets (MacKay, 2003), probabilistic word embeddings (Barkan, 2017; Bamler and Mandt, 2017), matrix factorization (Mnih and Salakhutdinov, 2008), and topic models (Blei et al., 2003).
We consider two parties in communication, a sender and a receiver. Given a probabilistic model (Eq. 1), the goal is to transmit a data sample $x$ as efficiently as possible. Both parties have access to the model, but only the sender has access to $x$, which it uses to fit a variational distribution $q_\phi(z \mid x)$. It then uses the algorithm proposed below to select a latent variable vector $\hat{z}$ that has high probability under $q_\phi(z \mid x)$, and that can be encoded into a compressed bitstring, which gets transmitted to the receiver. The receiver losslessly decodes the compressed bitstring back into $\hat{z}$ and uses the likelihood $p(x \mid \hat{z})$ to generate a reconstructed data point $\hat{x}$, typically setting $\hat{x} = \arg\max_x p(x \mid \hat{z})$.
The rest of this section describes how the proposed algorithm selects $\hat{z}$ and encodes it into a compressed bitstring.
Our lossy compression algorithm, introduced in Section 3.3 below, generalizes a lossless compression algorithm, arithmetic coding (AC) (Witten et al., 1987; MacKay, 2003), from discrete data to the continuous space of latent variables $z \in \mathbb{R}^K$. To get there, we first review the main idea of AC that our proposed algorithm borrows.
AC is an instance of so-called entropy coding. It uniquely maps messages $m$ from a discrete set to a compressed bitstring of some length $R(m)$ (the “bitrate”). Entropy coding exploits prior knowledge of the distribution $p(m)$ of messages to map probable messages to short bitstrings while spending more bits on improbable messages. This way, entropy coding algorithms aim to minimize the expected rate $\mathbb{E}_{p(m)}[R(m)]$. For lossless compression, the expected rate has a fundamental lower bound, the entropy $H = \mathbb{E}_{p(m)}[h(m)]$, where $h(m) = -\log_2 p(m)$ is the Shannon information content of $m$. AC provides near optimal lossless compression as it maps each message $m$ to a bitstring of length $R(m) \le \lceil h(m) \rceil$, where $\lceil \cdot \rceil$ denotes the ceiling function.
AC is usually discussed in the context of streaming compression where $m$ is a sequence of symbols from a finite alphabet, as AC improves on this task over the more widely known Huffman coding (Huffman, 1952). In our work, we focus on a different aspect of AC: its use of a cumulative probability distribution function to map a nonuniformly distributed random variable $m$ to a number $\xi$ that is nearly uniformly distributed over the interval $[0, 1)$.
Figure 1 (left) illustrates AC for a binomial-distributed message $m$ (the number of ‘heads’ in a sequence of ten coin flips). The solid and dashed orange lines show the left and right sided cumulative distribution function, $F_<(m) = \sum_{m' < m} p(m')$ and $F_\le(m) = \sum_{m' \le m} p(m')$, respectively. [3] They define a partitioning of the interval $[0, 1)$ (vertical axis in Figure 1 (left)) into pairwise disjoint subintervals $I_m = [F_<(m), F_\le(m))$ (orange squares). Since the intervals $I_m$ are disjoint for all $m$, any number $\xi \in I_m$ uniquely identifies a given message $m$. AC picks such a number and encodes it into a string of $R$ bits, $b_1 b_2 \dots b_R$, by writing it in binary representation,
$$\xi = (0.b_1 b_2 \dots b_R)_2 = \sum_{r=1}^{R} b_r\, 2^{-r}. \tag{4}$$
[3] If $m$ is a sequence of symbols, $F_<$ and $F_\le$ are defined by lexicographical order and can be constructed in a streaming manner.
Since any $\xi$ in the interval $I_m$ associated with the message $m$ may be used to identify $m$, we can interpret the interval $I_m$ as an uncertainty region in $\xi$-space. AC picks the number $\xi \in I_m$ with the shortest binary representation. This requires at most $\lceil h(m) \rceil$ bits because the numbers that can be represented by Eq. 4 with $R = \lceil h(m) \rceil$ form a uniform grid with spacing $2^{-R}$, which is at most as wide as the size of the interval, $|I_m| = p(m)$. The red arrows in Figure 1 (left) illustrate how AC would encode a message from the toy example into a bitstring. Decoding works in the opposite direction and maps $\xi$ back to $m$.
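The interval construction can be made concrete with a small Python sketch. It is an illustration only, not the authors' implementation: it assumes a fair coin for the ten-flip toy example (the coin bias is not stated here), uses exact fractions to avoid rounding issues, and the helper names `binomial_pmf`, `encode`, and `decode` are hypothetical.

```python
from fractions import Fraction
from math import ceil, comb, log2

def binomial_pmf(n, p):
    """Exact probabilities of observing 0..n heads in n flips."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def encode(message, pmf):
    """Shortest bitstring whose binary fraction lies in the message's interval."""
    lo = sum(pmf[:message], Fraction(0))  # left-sided CDF
    hi = lo + pmf[message]                # right-sided CDF
    rate = 1
    while True:
        k = ceil(lo * 2**rate)            # smallest grid point k / 2**rate >= lo
        if Fraction(k, 2**rate) < hi:
            return format(k, "b").zfill(rate)
        rate += 1

def decode(bits, pmf):
    """Find the message whose interval contains the encoded number."""
    xi = Fraction(int(bits, 2), 2 ** len(bits))
    lo = Fraction(0)
    for message, prob in enumerate(pmf):
        if lo <= xi < lo + prob:
            return message
        lo += prob
    raise ValueError("encoded number lies outside [0, 1)")

pmf = binomial_pmf(10, Fraction(1, 2))    # assumption: fair coin
for m in range(11):
    bits = encode(m, pmf)
    bound = ceil(-log2(pmf[m]))           # ceiling of the information content
    print(m, bits, len(bits), "<=", bound, "decoded:", decode(bits, pmf))
```

Probable messages near the center of the distribution receive one-bit codes, while messages in the tails need up to ten bits. Note that this sketch encodes a single message; the variable-length codes are not prefix-free, so a stream of several messages would need additional delimiting.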
In the next section, we generalize AC to the continuous domain. As we will show, the concept of an “uncertainty region” in $\xi$-space again becomes crucial.
We now present our proposed algorithm, Bayesian Arithmetic Coding (Bayesian AC), which generalizes standard AC from the domain of discrete messages to the domain of continuous latent variables $z \in \mathbb{R}^K$. Similar to AC, Bayesian AC exploits knowledge of a prior probability distribution $p(z)$ in combination with a (soft) uncertainty region to encode probable values of $z$ into short bitstrings.
The main ideas that Bayesian AC borrows from standard AC are as follows: (1) the use of a cumulative distribution function to map any non-uniformly distributed random variable to a uniformly distributed random variable over the interval $[0, 1)$, and (2) the use of an “uncertainty region” to select a number on this interval to encode the message with as few bits as possible. While AC was characterized by an interval with hard boundaries, Bayesian AC softens this uncertainty region by drawing on posterior uncertainty.
We consider a single continuous latent variable $z \in \mathbb{R}$ with arbitrary prior $p(z)$. The cumulative distribution function (CDF) of the prior,
$$F(z) := \int_{-\infty}^{z} p(z')\, \mathrm{d}z', \tag{5}$$
maps $z$ to a quantile $\xi = F(z)$ in the unit interval. Note that the CDF of a continuous random variable maps real numbers to real numbers.
Since $\xi$ is almost surely an irrational number, its binary representation is infinitely long, and a practical encoding method has to truncate it to some finite length. We find an optimal truncation by generalizing the idea of the uncertainty region to the continuous space. To this end, we consider the posterior uncertainty in $z$-space and map it to $\xi$-space. Approximating the posterior by the variational distribution $q(z \mid x)$, see Eq. 3, we thus consider the function
$$g(\xi) := q\big(F^{-1}(\xi) \,\big|\, x\big), \tag{6}$$
where $F^{-1}$ is the inverse CDF (the quantile function), which maps $\xi$ back to $z$. Note that $g$ is not a normalized probability distribution, as Eq. 6 deliberately does not include the Jacobian because the final objective will be to maximize $g$ at a single point (see Eq. 7 below).
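The mapping between latent space and quantile space can be sketched with Python's standard library. The snippet assumes a standard normal prior and a Gaussian variational distribution with hypothetical parameters (mean 0.8, standard deviation 0.3); the function names are illustrative.

```python
from statistics import NormalDist

prior = NormalDist(0.0, 1.0)      # assumed prior over one latent variable
posterior = NormalDist(0.8, 0.3)  # assumed variational distribution (hypothetical values)

def to_quantile(z):
    """Prior CDF: maps a latent value to a quantile in (0, 1)."""
    return prior.cdf(z)

def from_quantile(xi):
    """Inverse prior CDF (quantile function): maps a quantile back to latent space."""
    return prior.inv_cdf(xi)

def g(xi):
    """Variational density at the latent value identified by xi, without Jacobian."""
    return posterior.pdf(from_quantile(xi))

xi_mean = to_quantile(0.8)        # quantile of the variational mean
print("quantile of the mean:", xi_mean)
print("g at the mean:", g(xi_mean), " g at xi = 0.5:", g(0.5))
```

Because the Jacobian is omitted, `g` peaks exactly at the quantile of the variational mean, which is the property the later optimization relies on.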
The solid and dashed purple curves in Figure 1 (right) plot $q(z \mid x)$ and $g(\xi)$ on the horizontal and vertical axis, respectively. The red arrows illustrate how a finite uncertainty region in $z$-space is mapped to a finite width of $g$ in $\xi$-space. Bayesian AC encodes a quantile $\hat{\xi}$ that has high value under $g$ while at the same time having a short binary representation. The two purple arrowheads on the vertical axis point to two viable candidates that both lie within the uncertainty region. The choice between these two points poses a rate-distortion trade-off: while one candidate has higher value under $g$ (i.e., it identifies a point $\hat{z} = F^{-1}(\hat{\xi})$ with higher approximate posterior probability), the alternative can be encoded in fewer bits.
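Rather than considering a hard uncertainty region, Bayesian AC simply tries to find a point $\hat{\xi}$ that identifies latent variables with high probability under the variational distribution while being expressible in few bits. We thus express $g$ in terms of the coordinates $\xi_i$ using Eq. 3,
$$g(\xi) = \prod_{i=1}^{K} \mathcal{N}\big(F_i^{-1}(\xi_i);\, \mu_i, \sigma_i^2\big), \tag{7}$$
where $F_i$ denotes the CDF of the prior $p(z_i)$.
For each dimension $i$, we restrict the quantile $\hat{\xi}_i$ to the set of code points that can be represented in binary via Eq. 4 with a finite but arbitrary bitlength $R(\hat{\xi}_i)$. We define the total bitlength $R(\hat{\xi}) = \sum_{i=1}^{K} R(\hat{\xi}_i)$, i.e., the length of the concatenation of all codes $\hat{\xi}_i$, neglecting, for now, an overhead for delimiters (see below). Using a rate penalty parameter $\lambda > 0$ that is shared across all dimensions $i$, we minimize the rate-distortion objective
$$\mathcal{L}_\lambda(\hat{\xi}) = -\log g(\hat{\xi}) + \lambda R(\hat{\xi}) = \sum_{i=1}^{K} \left[ \frac{\big(F_i^{-1}(\hat{\xi}_i) - \mu_i\big)^2}{2\sigma_i^2} + \lambda R(\hat{\xi}_i) \right] + \text{const.} \tag{8}$$
The optimization thus decouples across all latent dimensions $i$, and can be solved efficiently and in parallel by minimizing the independent objective functions
$$\ell_i(\hat{\xi}_i) = \big(F_i^{-1}(\hat{\xi}_i) - \mu_i\big)^2 + 2\lambda\sigma_i^2 R(\hat{\xi}_i). \tag{9}$$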
Although the bitlength is discontinuous (it counts the number of binary digits, see Eq. 4), can be efficiently minimized over using Algorithm 1. The algorithm iterates over all rates and searches for the code point that minimizes . For each , the algorithm only needs to consider the two code points and with rate at most that enclose the optimum and are closest to it; these two code points can be easily computed in constant time. The iteration terminates as soon as the maximally possible remaining increase in is smaller than the minimum penalty for an increasing bitlength (in practice, the iteration rarely exceeds ).
After finding the optimal code points , they have to be encoded into a single bitstring. Simply concatenating the binary representations (Eq. 4) of all would be ambiguous due to their variable lengths (see detailed discussion in the Supplementary Material). Instead, we treat the code points as symbols from a discrete vocabulary and encode them via lossless entropy coding, e.g., standard Arithmetic Coding. The entropy coder requires a probabilistic model over all code points; here we simply use their empirical distribution. When using our method for model compression, this empirical distribution has to be transmitted to the receiver as additional header information that counts towards the total bitrate. For data compression, by contrast, we obtain the empirical distribution of code points on training data and include it in the decoder.
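The cost of this entropy coding step can be estimated from the empirical distribution alone. The following sketch computes the idealized code length, i.e., the sum of the information contents of all code points under their empirical distribution. A real arithmetic coder adds a small constant overhead, and the function name is hypothetical.

```python
from collections import Counter
from math import log2

def empirical_code_length(code_points):
    """Idealized total bits for entropy coding symbols under their empirical distribution."""
    counts = Counter(code_points)
    total = len(code_points)
    return sum(-log2(counts[symbol] / total) for symbol in code_points)

# Code points are treated as opaque symbols, here written as bitstrings.
symbols = ["1", "1", "01", "11"]
print(empirical_code_length(symbols))  # half the symbols are "1", the rest are unique
```

Treating code points as opaque symbols resolves the ambiguity of concatenating variable-length binary strings, because the entropy coder assigns a separate probability to each distinct code point regardless of its binary length.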
Algorithm 1.
Input: prior CDF $F$, rate penalty $\lambda$, variational mean $\mu$ and standard deviation $\sigma$.
Output: optimal code point $\hat{\xi}^*$.
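The per-dimension search can be sketched as follows. This is a minimal reading of the description above rather than the authors' reference code: it assumes a Gaussian variational distribution, so that the per-dimension loss is the squared distance to the variational mean plus a rate penalty scaled by `2 * lam * sigma**2`, and it assumes that the search stops once the current best loss is no larger than the penalty of any longer code. The names `bitlength`, `bayesian_ac_point`, and `max_rate` are hypothetical.

```python
from math import floor
from statistics import NormalDist

def bitlength(k, r):
    """Binary digits needed for k / 2**r with 0 < k < 2**r (trailing zeros dropped)."""
    while k % 2 == 0:
        k //= 2
        r -= 1
    return r

def bayesian_ac_point(prior, mu, sigma, lam, max_rate=32):
    """Return (code point, bitlength) minimizing squared error plus rate penalty."""
    eps = 2.0 ** -max_rate
    xi_star = min(max(prior.cdf(mu), eps), 1.0 - eps)  # quantile of the mean, kept inside (0, 1)
    best = None                                        # (loss, code point, bitlength)
    for r in range(1, max_rate + 1):
        k_lo = floor(xi_star * 2**r)
        for k in (k_lo, k_lo + 1):                     # the two grid points enclosing xi_star
            if 0 < k < 2**r:                           # quantiles 0 and 1 map to infinity
                rate = bitlength(k, r)
                z = prior.inv_cdf(k / 2**r)
                loss = (z - mu) ** 2 + 2 * lam * sigma**2 * rate
                if best is None or loss < best[0]:
                    best = (loss, k / 2**r, rate)
        # Any code with more than r bits costs at least the penalty for r + 1 bits.
        if best[0] <= 2 * lam * sigma**2 * (r + 1):
            break
    return best[1], best[2]

prior = NormalDist(0.0, 1.0)                           # assumed standard normal prior
print(bayesian_ac_point(prior, mu=0.7, sigma=0.01, lam=0.1))  # confident posterior
print(bayesian_ac_point(prior, mu=0.7, sigma=2.0, lam=0.1))   # uncertain posterior
```

With the same mean and rate penalty, the confident posterior receives a longer code than the uncertain one, which illustrates how the accuracy adapts to posterior uncertainty.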
The proposed algorithm adjusts the accuracy for each latent variable based on two factors: (i) a global rate setting $\lambda$ that is shared across all dimensions $i$; and (ii) a per-dimension posterior uncertainty estimate $\sigma_i$. Point (i) allows tuning the rate-distortion trade-off whereas (ii) takes the anisotropy of the latent space into account.
Figure 2 illustrates the effect of anisotropy in latent space. The right panel plots the posterior of a toy Bayesian linear regression model (see left panel) with only two latent variables . Due to the elongated shape of the posterior, Bayesian AC uses a higher accuracy for than for . As a result, the algorithm encodes a point (orange dot in right panel) that is closer to the optimal (MAP) solution (green dot) along the -axis than along the -axis.
The purple dot in Figure 2 (right) compares to a more common quantization method, which simply rounds the MAP solution to the nearest point (which is then entropy coded) from a fixed grid with spacing . We tuned so that the resulting encoded point (purple dot) has the same distance to the optimum as our proposed solution (orange dot). Despite the equal distance to the optimum, Bayesian AC encodes model parameters with higher posterior probability. The resulting model fits the data better (left panel).
This concludes the description of the proposed Bayesian Arithmetic Coding algorithm. In the next section, we analyze the algorithm’s behaviour experimentally and demonstrate its performance for variable-bitrate compression on both word embeddings and images.
We tested our approach in two very different domains: word embeddings and images. For word embeddings, we measured the performance drop on a semantic reasoning task due to lossy compression. Our proposed Bayesian AC method significantly improves model performance over uniform discretization and compression with either Arithmetic Coding (AC), gzip, bzip2, or lzma at equal bitrate. For image compression, we show that a single standard VAE, compressed with Bayesian AC, outperforms JPEG and other baselines at a wide range of bitrates, both quantitatively and visually.
We consider the Bayesian Skip-gram model for neural word embeddings (Barkan, 2017), a probabilistic generative formulation of word2vec (Mikolov et al., 2013b) which interprets word and context embedding vectors as latent variables and associates them with Gaussian approximate posterior distributions. Point estimating the latent variables would result in classical word2vec. Even though the model was not specifically designed or trained with model compression taken into consideration, the proposed algorithm can successfully compress it in post-processing.
We implemented the Black Box VI version of the Bayesian Skip-gram model proposed in (Bamler and Mandt, 2017), [4] and trained the model on books published between and from the Google Books corpus (Michel et al., 2011), following the preprocessing described in (Bamler and Mandt, 2017) with a vocabulary of words and embedding dimension .
[4] See Supplement for hyperparameters. Our code is available at https://github.com/mandt-lab/bayesian-ac/.
In the trained model, we observed that the distribution of posterior modes across all words and all dimensions of the embedding space was quite different from the prior. To improve the bitrate of our method, we used an “empirical prior” for encoding that is shared across all and ; we chose a Gaussian where is the empirical variance of all variational means .
We compare our method’s performance to a baseline that quantizes to a uniform grid and then uses the empirical distribution of quantized coordinates for lossless entropy coding. We also compare to uniform quantization baselines that replace the entropy coding step with the standard compression libraries gzip, bzip2, and lzma. These methods are not restricted by a factorized distribution of code points and could therefore detect and exploit correlations between quantized code points across words or dimensions.
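For reference, the uniform-grid baseline can be sketched in a few lines. The grid spacing and the example values are hypothetical, and the resulting integer indices would subsequently be passed to an entropy coder or a general-purpose compressor.

```python
from math import floor

def uniform_quantize(values, spacing):
    """Round each value to the nearest multiple of `spacing`."""
    indices = [floor(v / spacing + 0.5) for v in values]  # integer grid indices
    reconstruction = [k * spacing for k in indices]
    return indices, reconstruction

def mean_squared_error(a, b):
    """Average squared difference between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

means = [0.12, 0.26, -0.31]                               # hypothetical posterior means
indices, rounded = uniform_quantize(means, spacing=0.25)
print(indices, rounded, mean_squared_error(means, rounded))
```

Unlike the posterior-informed scheme, the spacing here is the same for every coordinate, so a coordinate with a broad posterior is stored as accurately as one with a narrow posterior.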
We evaluate performance on the semantic and syntactic reasoning task proposed in (Mikolov et al., 2013a), a popular dataset of semantic relations like “Japan : yen = Russia : ruble” and syntactic relations like “amazing : amazingly = lucky : luckily”, where the goal is to predict the last word given the first three words. We report Hits@10, i.e., the fraction of challenges for which the compressed model ranks the correct prediction among the top ten.
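The Hits@10 metric itself is simple to state in code. The sketch below assumes that the model's ranked candidate words for each challenge are already available as lists (how the ranking is produced is not covered here); the function name and the example words beyond those quoted above are hypothetical.

```python
def hits_at_k(ranked_predictions, targets, k=10):
    """Fraction of challenges whose correct answer is among the top-k predictions."""
    hits = sum(
        target in predictions[:k]
        for predictions, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)

# Two toy challenges with ranked candidate lists (best candidate first).
ranked = [["ruble", "rouble", "kopeck"], ["luckier", "lucky", "luck"]]
answers = ["ruble", "luckily"]
print(hits_at_k(ranked, answers, k=10))
```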
Figure 3 shows the model performance on the semantic and syntactic reasoning tasks as a function of compression rate. Our proposed Bayesian AC significantly outperforms all baselines and reaches the same Hits@10 at less than half the bitrate over a wide range. [5]
[5] The uncompressed model performance (dotted gray line in Figure 3) is not state of the art. This is not a shortcoming of the compression method but merely of the model, and can be attributed to the smaller vocabulary and training set used compared to (Mikolov et al., 2013b) due to hardware constraints.
While Section 4.1 demonstrated the proposed Bayesian AC method for model compression, we now apply the same method to data compression using a variational autoencoder (VAE). We first provide a qualitative evaluation on MNIST, and then quantitative results on full resolution color images.
For simplicity, we consider regular VAEs with a standard normal prior and Gaussian variational posterior. The generative network parameterizes a factorized Bernoulli or Gaussian likelihood model in the two experiments, respectively. Network architectures are described below and in more detail in Supplementary Material.
We consider the following baselines:
Uniform quantization: for a given image $x$, we quantize each dimension of the posterior mean vector to a uniform grid. We report the bitrate for encoding the resulting quantized latent representation via standard entropy coding (e.g., arithmetic coding). Entropy coding requires prior knowledge of the probabilities of each grid point. Here, we use the empirical frequencies of grid points over a subset of the training set;
$k$-means quantization: similar to “uniform quantization”, but with the placement of grid points optimized via $k$-means clustering on a subset of the training data;
JPEG: we used the libjpeg implementation packaged with the Python Pillow library, using default configurations (e.g., 4:2:0 subsampling), and we adjust the quality parameter to vary the rate-distortion trade-off;
Deep learning baseline: we compare to Ballé et al. (2017), who directly optimized for the rate and distortion, training a separate model for each point on the R-D curve. In our large-scale experiment, we adopt their model architecture, so their performance essentially represents the end-to-end optimized performance upper bound for our method (which uses a single model).
We trained a simple VAE on MNIST digits, and compared our method to uniform quantization and JPEG.
Figure 4 shows example compressed digits, ranging from the highest to the lowest bitrate that each method allows. As the rate decreases, we see that unlike JPEG, which introduces pixel-level artifacts, both VAE-based methods were able to preserve semantic aspects of the original image. It is interesting to see how the performance degrades in the strongly compressed regime for Bayesian AC. With aggressive decrease in bitrate (as ), our method gradually “confuses” the original image with a generic image (8 being in the center of the embedding space), while preserving approximately the same level of sharpness.
We apply our Bayesian AC method to a VAE trained on color images and demonstrate its practical image compression performance rivaling JPEG.
The inference and generative networks of the VAE are identical to the analysis and synthesis networks of Ballé et al. (2017), using 3 layers of 256 filters each in a convolutional architecture. We used a diagonal Gaussian likelihood model, whose mean is computed by the generative net and whose variance is fixed as a hyper-parameter, similar to a $\beta$-VAE (Higgins et al., 2017) approach (this hyper-parameter was tuned to 0.001 to ensure the VAE achieved overall good R-D trade-off; see (Alemi et al., 2018)). We trained the model on the same subset of the ImageNet dataset as used in (Ballé et al., 2017). We evaluated performance on the standard Kodak dataset (35), a separate set of 24 uncompressed color images. As in the word embedding experiment, we also observed that using an empirical prior for our method improved the bitrate. We used the same generic density model as in (Ballé et al., 2018), fitting a different distribution for each latent channel, on samples of posterior means (treating spatial dimensions as i.i.d.).
As is common in image compression work, we measure the distortion between the original and compressed image under two quality metrics (the higher the better): Peak Signal-to-Noise Ratio (PSNR) and MS-SSIM (Wang et al., 2003), over all RGB channels. Figure 6 shows rate-distortion performance, where we averaged both the bits per pixel (BPP) and quality measure across all images in the Kodak dataset, for each fixed R-D trade-off setting (we obtained similar results when averaging only over the quality metrics for fixed bitrates). The results for Ballé et al. (2017) are taken from the paper and the authors’ website.
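As a reminder of how the two axes of such a rate-distortion plot are computed, here is a minimal sketch of PSNR and bits per pixel. It assumes 8-bit pixel values stored in flat lists; the function names and example numbers are hypothetical, and MS-SSIM is omitted because it requires a multi-scale image pipeline.

```python
from math import inf, log10

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return inf                      # identical images
    return 10.0 * log10(max_value**2 / mse)

def bits_per_pixel(num_bits, height, width):
    """Compressed size in bits divided by the number of pixels."""
    return num_bits / (height * width)

print(psnr([100, 100], [110, 90]))      # mean squared error of 100
print(bits_per_pixel(num_bits=98304, height=512, width=768))
```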
We found that our method generally produced images with higher quality, both in terms of PSNR and perceptual quality, compared to JPEG and uniform quantization. Similar to (Ballé et al., 2017), our method avoids jarring artifacts, and introduces blurriness at low bitrate. See Figure 5 for example image reconstructions. For more examples and R-D curves on individual images, see Supplementary Material.
Although our results fall short of the end-to-end optimized rate-distortion performance of Ballé et al. (2017), it is worth emphasizing that our method allows operating anywhere on the R-D curve with a single trained VAE model, unlike Ballé et al. (2017), which requires costly optimization and storage of individual models for each point on the R-D curve.
A known issue in deep generative models such as VAEs is the phenomenon of posterior collapse, where the model ignores some subset of latent variables, and the corresponding variational posterior distributions collapse to closely match the prior. Since such collapsed dimensions do not contribute to the model’s performance, they constitute an overhead in regular neural compression approaches and may need to be pruned.
One curious consequence of our approach is that it spends close to zero bits encoding the collapsed latent dimensions. As an illustration, we trained a VAE as used in the color image compression experiment with a high $\beta$ setting to purposefully induce posterior collapse, and examined the average number of bits spent on various latent channels.
Figure 7 shows the prior $p(z)$, aggregated (approximate) posterior, and histograms of posterior means $\mu$ for the first six channels of the VAE; all the quantities were averaged over an image batch and across latent spatial dimensions. We observe that channels 2, 3, and 5 appear to exhibit posterior collapse, as the aggregated posteriors closely match the prior while the posterior means tightly cluster at zero; this is also reflected by low average KL-divergence between the variational posterior $q(z \mid x)$ and prior $p(z)$, see text inside each panel. We observe that, for these collapsed channels, our method spends fewer bits on average than uniform quantization (baseline) at the same total bitrate, and more bits instead on channels 1, 4, and 6, which do not exhibit posterior collapse. The explanation is that a collapsed posterior has unusually high variance $\sigma^2$, causing our method to refrain from long code words due to the high penalty per bitrate in Eq. 9.
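The KL diagnostic mentioned above has a closed form for a Gaussian variational posterior and a standard normal prior. The sketch below uses that closed form to flag latent dimensions that have collapsed; the threshold of 0.01 nats and the example parameters are hypothetical choices for illustration.

```python
from math import log

def kl_to_standard_normal(mu, sigma):
    """KL divergence from N(mu, sigma^2) to N(0, 1), in nats."""
    return 0.5 * (mu**2 + sigma**2 - 1.0) - log(sigma)

def collapsed_dimensions(mus, sigmas, threshold=0.01):
    """Indices of dimensions whose posterior is (nearly) identical to the prior."""
    return [
        i for i, (mu, sigma) in enumerate(zip(mus, sigmas))
        if kl_to_standard_normal(mu, sigma) < threshold
    ]

mus = [1.3, 0.01, -0.7, 0.0]        # hypothetical posterior means
sigmas = [0.1, 0.99, 0.2, 1.0]      # hypothetical posterior standard deviations
print([round(kl_to_standard_normal(m, s), 4) for m, s in zip(mus, sigmas)])
print(collapsed_dimensions(mus, sigmas))
```

Dimensions flagged this way have a standard deviation close to that of the prior, which is exactly the regime in which the rate penalty dominates the per-dimension objective.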
We proposed a novel algorithm for lossy compression, based on a new quantization scheme that automatically adapts encoding accuracy to posterior uncertainty estimates. This decouples the task of compression from model design and training, and enables variable-bitrate compression for probabilistic generative models with mean-field variational distributions, in post-processing.
We empirically demonstrated the effectiveness of our approach for both model and data compression. Our proposed algorithm can be readily applied to many existing models. In particular, we believe it holds promise for compressing Bayesian neural networks.
Stephan Mandt acknowledges funding from DARPA (HR001119S0038), NSF (FW-HTF-RM), and Qualcomm.
Ballé et al. (2018). Variational image compression with a scale hyperprior. In ICLR.
Association for the Advancement of Artificial Intelligence, pp. 3135–3143.
Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393.
MacKay (1992). A practical Bayesian framework for backpropagation networks. Neural Computation 4 (3), pp. 448–472.
Toderici et al. (2016). Variable rate image compression with recurrent neural networks. International Conference on Learning Representations.