miracle
This repository contains the code for our recent paper `Minimal Random Code Learning: Getting Bits Back from Compressed Model Parameters'
While deep neural networks are a highly successful model class, their large memory footprint puts considerable strain on energy consumption, communication bandwidth, and storage requirements. Consequently, model size reduction has become an utmost goal in deep learning. A typical approach is to train a set of deterministic weights, while applying certain techniques such as pruning and quantization, in order that the empirical weight distribution becomes amenable to Shannon-style coding schemes. However, as shown in this paper, relaxing weight determinism and using a full variational distribution over weights allows for more efficient coding schemes and consequently higher compression rates. In particular, following the classical bits-back argument, we encode the network weights using a random sample, requiring only a number of bits corresponding to the Kullback-Leibler divergence between the sampled variational distribution and the encoding distribution. By imposing a constraint on the Kullback-Leibler divergence, we are able to explicitly control the compression rate, while optimizing the expected loss on the training set. The employed encoding scheme can be shown to be close to the optimal information-theoretical lower bound, with respect to the employed variational family. Our method sets new state-of-the-art in neural network compression, as it strictly dominates previous approaches in a Pareto sense: On the benchmarks LeNet-5/MNIST and VGG-16/CIFAR-10, our approach yields the best test performance for a fixed memory budget, and vice versa, it achieves the highest compression rates for a fixed test performance.
With the celebrated success of deep learning models and their ever increasing presence, it has become a key challenge to increase their efficiency. In particular, the rather substantial memory requirements in neural networks can often conflict with storage and communication constraints, especially in mobile applications. Moreover, as discussed in Han et al. (2015), memory accesses are up to three orders of magnitude more costly than arithmetic operations in terms of energy consumption. Thus, compressing deep learning models has become a priority goal with a beneficial economic and ecological impact.
Traditional approaches to model compression usually rely on three main techniques: pruning, quantization and coding. For example, Deep Compression (Han et al., 2016) proposes a pipeline employing all three of these techniques in a systematic manner. From an information-theoretic perspective, the central routine is coding, while pruning and quantization can be seen as helper heuristics to reduce the entropy of the empirical weight-distribution, leading to shorter encoding lengths (Shannon, 1948). Also, the recently proposed Bayesian Compression (Louizos et al., 2017) falls into this scheme, despite being motivated by the so-called bits-back argument (Hinton & Van Camp, 1993) which theoretically allows for higher compression rates. (Footnote 1: Recall that the bits-back argument states that, assuming a large dataset and a neural network equipped with a weight-prior $p$, the effective coding cost of the network weights is $\mathrm{KL}(q \| p) = \mathbb{E}_q[\log \frac{q}{p}]$, where $q$ is a variational posterior. However, in order to realize this effective cost, one needs to encode both the network weights and the training targets, while it remains unclear whether it can also be achieved for network weights alone.) While the bits-back argument certainly motivated the use of variational inference in Bayesian Compression, the downstream encoding is still akin to Deep Compression (and other approaches). In particular, the variational distribution is merely used to derive a deterministic set of weights, which is subsequently encoded with Shannon-style coding. This approach, however, does not fully exploit the coding efficiency postulated by the bits-back argument.

In this paper, we step aside from the pruning-quantization pipeline and propose a novel coding method which approximately realizes bits-back efficiency. In particular, we refrain from constructing a deterministic weight-set but rather encode a random weight-set from the full variational posterior. This is fundamentally different from first drawing a weight-set and subsequently encoding it – this would be no more efficient than previous approaches. Rather, the coding scheme developed here is allowed to pick a random weight-set which can be cheaply encoded. By using results from Harsha et al. (2010), we show that such a coding scheme always exists and that the bits-back argument indeed represents a theoretical lower bound for its coding efficiency. Moreover, we propose a practical scheme which produces an approximate sample from the variational distribution and which can indeed be encoded with this efficiency. Since our algorithm learns a distribution over weight-sets and derives a random message from it, while minimizing the resulting code length, we dub it Minimal Random Code Learning (MIRACLE).
From a practical perspective, MIRACLE has the advantage that it offers explicit control over the expected loss and the compression size. This is distinct from previous techniques, which require tedious tuning of various hyper-parameters and/or thresholds in order to achieve a certain coding goal. In our method, we can simply control the KL-divergence using a penalty factor, which directly reflects the achieved code length (plus a small overhead), while simultaneously optimizing the expected training loss. As a result, we were able to trace the trade-off curve for compression size versus classification performance (Figure 1). We clearly outperform the previous state of the art in a Pareto sense: for any desired compression rate, our encoding achieves better performance on the test set; vice versa, for a given performance on the test set, our method achieves the highest compression. To summarize, our main contributions are:
We introduce MIRACLE, an innovative compression algorithm that exploits the noise resistance of deep learning models by training a variational distribution and efficiently encodes a random set of weights.
Our method is easy to implement and offers explicit control over the loss and the compression size.
We provide theoretical justification that our algorithm gets close to the theoretical lower bound on the encoding length.
The potency of MIRACLE is demonstrated on two common compression tasks, where it clearly outperforms previous state-of-the-art methods for compressing neural networks.
There is an ample amount of research on compressing neural networks, so we only discuss the most prominent approaches and those most closely related to our work. An early approach is Optimal Brain Damage (LeCun et al., 1990), which employs the Hessian of the network weights in order to determine whether weights can be pruned without significantly impacting training performance. A related but simpler approach was proposed in Han et al. (2015), where small weights are truncated to zero, alternated with re-training. This simple approach yielded – somewhat surprisingly – networks which are one order of magnitude smaller, without impairing performance. The approach was refined into a systematic pipeline called Deep Compression, where magnitude-based weight pruning is followed by weight quantization (clustering weights) and Huffman coding (Huffman, 1952). While its compression ratio has been surpassed since, many of the subsequent works took lessons from this paper.
HashNet, proposed by Chen et al. (2015), also follows a simple and surprisingly effective approach: it exploits the fact that training of neural networks is resistant to imposing random constraints on the weights. In particular, hashing is used to force groups of weights to share the same value, yielding memory reductions with gracefully degrading performance. Weightless encoding by Reagen et al. (2018) demonstrates that neural networks are resilient to weight noise, and exploits this fact for a lossy compression algorithm. The recently proposed Bayesian Compression (Louizos et al., 2017) uses a Bayesian variational framework and is motivated by the bits-back argument (Hinton & Van Camp, 1993). Since this work is the closest to ours, albeit with important differences, we discuss Bayesian Compression and the bits-back argument in more detail.
The basic approach is to equip the network weights $w$ with a prior $p$ and to approximate the posterior using the standard variational framework, i.e. maximize the evidence lower bound (ELBO) for a given dataset $\mathcal{D}$

$$\mathbb{E}_{q_\phi}\left[\log \frac{p(\mathcal{D} \mid w)\, p(w)}{q_\phi(w)}\right] = \mathbb{E}_{q_\phi}\left[\log p(\mathcal{D} \mid w)\right] - \mathrm{KL}(q_\phi \| p) \qquad (1)$$

w.r.t. the variational distribution $q_\phi$, parameterized by $\phi$. The bits-back argument (Hinton & Van Camp, 1993) establishes a connection between the Bayesian variational framework and the Minimum Description Length (MDL) principle (Grünwald, 2007). Assuming a large dataset $\mathcal{D}$ of input-target pairs, we aim to use the neural network to transmit the targets with a minimal message, while the inputs are assumed to be public. To this end, we draw a weight-set $w^*$ from $q_\phi$, which has been obtained by maximizing (1); note that knowing a particular weight-set drawn from $q_\phi$ conveys a message of length $H[q_\phi]$ ($H$ refers to the Shannon entropy of the distribution). The weight-set $w^*$ is used to encode the residual of the targets, and is itself encoded with the prior distribution $p$, yielding a message of length $\mathbb{E}_{q_\phi}[-\log p(\mathcal{D} \mid w)] + \mathbb{E}_{q_\phi}[-\log p(w)]$. This message allows the receiver to perfectly reconstruct the original targets, and consequently the variational distribution $q_\phi$, by running the same (deterministic) algorithm as used by the sender. Consequently, with $q_\phi$ at hand, the receiver is able to retrieve an auxiliary message encoded in $w^*$. When subtracting the length $H[q_\phi]$ of this “free message” from the original $\mathbb{E}_{q_\phi}[-\log p(w)]$ nats, we obtain a net cost of $\mathrm{KL}(q_\phi \| p)$ for encoding the weights, i.e. we recover the ELBO (1) as negative MDL. (Footnote 2: Unless otherwise stated, we refer to information-theoretic measures in nats. For reference, 1 nat $= 1/\ln 2$ bits $\approx 1.44$ bits.) For further details, see Hinton & Van Camp (1993).
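The bookkeeping behind this argument can be written out in three lines. The block below is only a restatement under assumed notation: $q_\phi$ for the variational distribution, $p$ for the prior, and $\mathcal{D}$ for the dataset. The labels "sent", "refunded" and "net" are illustrative names, not terminology from Hinton & Van Camp (1993).

```latex
\begin{align*}
\text{sent}     &= \underbrace{\mathbb{E}_{q_\phi}[-\log p(w)]}_{\text{weights, coded with } p}
                 + \underbrace{\mathbb{E}_{q_\phi}[-\log p(\mathcal{D} \mid w)]}_{\text{target residuals}} \\
\text{refunded} &= H[q_\phi] = \mathbb{E}_{q_\phi}[-\log q_\phi(w)] \\
\text{net}      &= \text{sent} - \text{refunded}
                 = \mathrm{KL}(q_\phi \,\|\, p) + \mathbb{E}_{q_\phi}[-\log p(\mathcal{D} \mid w)]
                 = -\,\text{ELBO}
\end{align*}
```

The weight-related part of the net cost is exactly $\mathrm{KL}(q_\phi \| p)$, which is the quantity the rest of the paper tries to realize as an actual code length.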
In (Hinton & Zemel, 1994; Frey & Hinton, 1997) coding schemes were proposed which practically exploited the bits-back argument for the purpose of coding data. However, it is not clear how these free bits can be spent solely for the purpose of model compression, as we only want to store a representation of our model, while discarding the training data. Therefore, while Bayesian Compression is certainly motivated by the bits-back argument, it actually does not strive for the postulated coding efficiency $\mathrm{KL}(q_\phi \| p)$. Rather, this method imposes a sparsity-inducing prior distribution to aid the pruning process. Moreover, high posterior variance is translated into reduced precision, which constitutes a heuristic for quantization. In the end, Bayesian Compression merely produces a deterministic weight-set $w^*$ which is encoded similarly to preceding works.

In particular, all previous approaches essentially use the following coding scheme, or a (sometimes sub-optimal) variant of it. After a deterministic weight-set $w^*$ has been obtained, involving potential pruning and quantization techniques, one interprets $w^*$ as a sequence of $N$ i.i.d. variables and assumes the coding distribution (i.e. a dictionary) $p'(w) = \frac{1}{N} \sum_{i=1}^{N} \delta_{w_i^*}(w)$, where $\delta_x$ denotes the Dirac delta at $x$. According to Shannon’s source coding theorem (Shannon, 1948), $w^*$ can be coded with no less than $N\,H[p']$ nats, which is asymptotically achieved by Huffman coding, as in Han et al. (2016). However, note that the Shannon lower bound can also be written as

$$N\,H[p'] = -\sum_{i=1}^{N} \log p'(w_i^*) = -\int \delta_{w^*}(w) \log p'(w)\, dw = \mathrm{KL}(\delta_{w^*} \| p') \qquad (2)$$

where we set $p'(w) = \prod_{i=1}^{N} p'(w_i)$. Thus, these Shannon-style coding schemes are in some sense optimal when the variational family is restricted to point-measures, i.e. deterministic weights. By extending the variational family to comprise more general distributions $q$, the coding length $\mathrm{KL}(q \| p)$ could potentially be drastically reduced. In the following, we develop one such method which exploits the uncertainty represented by $q$ in order to encode a random weight-set with short coding length.
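The identity behind the Shannon bound for a quantized weight-set can be checked numerically. The following is a minimal sketch, assuming a toy list of quantized weights; the function names and the example values are hypothetical and only serve to show that $N$ times the empirical entropy equals the negative log-likelihood of the weights under their own empirical distribution.

```python
import math
from collections import Counter


def shannon_bound_nats(weights):
    """N * H[p'] in nats, where p' is the empirical distribution of the weights."""
    n = len(weights)
    counts = Counter(weights)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return n * entropy


def negative_log_likelihood_nats(weights):
    """-sum_i log p'(w_i): cost of coding each weight with the dictionary p'."""
    n = len(weights)
    counts = Counter(weights)
    return -sum(math.log(counts[w] / n) for w in weights)


# Toy quantized (and partly pruned) weight-set, purely illustrative.
toy_weights = [0.0, 0.0, 0.0, 0.5, -0.5, 0.5, 0.0, 0.0]
bound = shannon_bound_nats(toy_weights)
nll = negative_log_likelihood_nats(toy_weights)
print(f"N*H[p'] = {bound:.4f} nats, -sum log p' = {nll:.4f} nats")
```

Both quantities agree, so pruning and quantization help only insofar as they lower the entropy of the empirical distribution.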
Consider the scenario where we want to train a neural network but our memory budget is constrained to $C$. As illustrated in the previous section, a variational approach offers – in principle – a simple and elegant solution. Before we proceed, we note that we do not consider our approach to be a strictly Bayesian one, but rather based on the MDL principle, although these two are of course highly related (Grünwald, 2007). In particular, we refer to $p$ as an encoding distribution rather than a prior, and moreover we will use a framework akin to the $\beta$-VAE (Higgins et al., 2017) which better reflects our goal of efficient coding. The crucial difference to the $\beta$-VAE is that we encode parameters rather than data.
Now, similar to Louizos et al. (2017), we first fix a suitable network architecture, select an encoding distribution $p$ and a parameterized variational family $q_\phi$ for the network weights $w$. We consider, however, a slightly different variational objective related to the $\beta$-VAE:

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi}\left[\log p(\mathcal{D} \mid w)\right] - \beta\, \mathrm{KL}(q_\phi \| p) \qquad (3)$$

This objective directly reflects our goal of achieving both a good training performance (loss term) and being able to represent our model with a short code (model complexity), at least according to the bits-back argument. After obtaining $q_\phi$ by maximizing (3), a weight-set drawn from $q_\phi$ will perform comparably to a deterministically trained network, since the variance of the negative loss term will be small compared to its mean, and since the KL term regularizes the model. Thus, our declared goal is to draw a sample from $q_\phi$ such that this sample can be encoded as efficiently as possible. This problem can be formulated as the following communication problem.
Alice observes a training data set $\mathcal{D}$ drawn from an unknown distribution $p(\mathcal{D})$. She trains a variational distribution $q_\phi(w)$ by optimizing (3) for a given $\beta$ using a deterministic algorithm. Subsequently, she wishes to send a message $M$ to Bob, which allows him to generate a sample distributed according to $q_\phi$. How long does this message need to be?
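The answer to this question depends on the unknown data distribution $p(\mathcal{D})$, so we need to make an assumption about it. Since the variational parameters $\phi$ depend on the realized dataset $\mathcal{D}$, we can interpret the variational distribution as a conditional distribution $q(w \mid \mathcal{D}) := q_\phi(w)$, giving rise to the joint $q(w, \mathcal{D}) = q(w \mid \mathcal{D})\, p(\mathcal{D})$. Now, our assumption about $p(\mathcal{D})$ is that $\int q(w \mid \mathcal{D})\, p(\mathcal{D})\, d\mathcal{D} = p(w)$, that is, the variational distribution $q_\phi$ yields the assumed encoding distribution $p(w)$ when averaged over all possible datasets. Note that this is a similarly strong assumption as in a Bayesian setting, where we assume that the data distribution is given as $p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw$. In this setting, it follows immediately from the data processing inequality (Harsha et al., 2010) that in expectation the message length $|M|$ cannot be smaller than $\mathrm{KL}(q_\phi \| p)$:

$$\mathbb{E}_{\mathcal{D}}[|M|] \geq H[M] \geq I[\mathcal{D} : M] \geq I[\mathcal{D} : w] = \mathbb{E}_{\mathcal{D}}\left[\mathrm{KL}(q_\phi \| p)\right] \qquad (4)$$

where $I$ refers to the mutual information and in the third inequality we applied the data processing inequality for the Markov chain $\mathcal{D} \to M \to w$. As discussed by Harsha et al. (2010), the inequality can be very loose. However, as they further show, the message length can be brought close to the lower bound if Alice and Bob are allowed to share a source of randomness:

Theorem 3.1 (Harsha et al., 2010). Given random variables $\mathcal{D}$, $w$ and a random string $R$, let a protocol $\Pi$ be defined via a message function $M = f(\mathcal{D}, R)$ and a decoder function $w = g(M, R)$, i.e. $\mathcal{D} \to (M, R) \to w$. Let $T_\Pi[\mathcal{D} : w] := \mathbb{E}_{\mathcal{D}}[|M|]$ be the expected message length for data $\mathcal{D}$, and let the minimal expected message length be defined as

$$T[\mathcal{D} : w] := \min_{\Pi} T_\Pi[\mathcal{D} : w] \qquad (5)$$

where $\Pi$ ranges over all protocols such that $(\mathcal{D}, g(M, R))$ and $(\mathcal{D}, w)$ have the same distribution. Then

$$I[\mathcal{D} : w] \leq T[\mathcal{D} : w] \leq I[\mathcal{D} : w] + 2 \log\left(I[\mathcal{D} : w] + 1\right) + O(1) \qquad (6)$$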
The results of Harsha et al. (2010) establish a characterization of the mutual information in terms of the minimal coding length of a conditional sample. For our purposes, Theorem 3.1 guarantees that in principle there is an algorithm which realizes near bits-back efficiency. Furthermore, the theorem shows that this is indeed a fundamental lower bound, i.e. that such an algorithm is optimal for the considered setting. To this end, we need to refer to a “common ground”, i.e. a shared random source $R$, where w.l.o.g. we can assume that this source is an infinite list of samples from our encoding distribution $p$. In practice, this can be realized via a pseudo-random generator with a public seed.
While Harsha et al. (2010) provide a constructive proof using a variant of rejection sampling (see Appendix A), this algorithm is in fact intractable, because it requires keeping track of the acceptance probabilities over the whole sample domain. Therefore, we propose an alternative method to produce an approximate sample from $q_\phi$, depicted in Algorithm 1. This algorithm takes as inputs the trained variational distribution $q_\phi$ and the encoding distribution $p$. We first draw $K = \exp(\mathrm{KL}(q_\phi \| p))$ samples $w_1, \dots, w_K$ from $p$, using the shared random generator. Subsequently, we craft a discrete proxy distribution $\tilde{q}$, which has support only on these $K$ samples, and where the probability mass for each sample is proportional to the importance weights $a_k = q_\phi(w_k) / p(w_k)$. Finally, we draw a sample from $\tilde{q}$ and return its index $k^*$ and the sample $w_{k^*}$ itself. Since any number $0 \leq k^* < K$ can be easily encoded with $\log K = \mathrm{KL}(q_\phi \| p)$ nats, we achieve our aimed coding efficiency. Decoding the sample is easy: simply draw the $k^*$-th sample $w_{k^*}$ from the shared random generator (e.g. by resetting the random seed).

While this algorithm is remarkably simple and easy to implement, there is of course the question of whether it is a correct thing to do. Moreover, an immediate caveat is that the number $K$ of required samples grows exponentially in $\mathrm{KL}(q_\phi \| p)$, which is clearly infeasible for encoding a practical neural network. The first point is addressed in the next section, while the latter is discussed in Section 3.3, together with other practical considerations.
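The encoding and decoding steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes diagonal Gaussians with a zero-mean encoding distribution, and the function names, the toy parameters and the use of `numpy.random.default_rng` with a public seed as the shared random source are all assumptions made for the example.

```python
import numpy as np


def kl_diag_gauss(mu_q, sigma_q, sigma_p):
    """KL(q || p) in nats for q = N(mu_q, sigma_q^2) and p = N(0, sigma_p^2), diagonal."""
    return float(np.sum(np.log(sigma_p / sigma_q)
                        + (sigma_q ** 2 + mu_q ** 2) / (2.0 * sigma_p ** 2) - 0.5))


def log_importance_weights(w, mu_q, sigma_q, sigma_p):
    """log q(w) - log p(w) per candidate row; the 2*pi constants cancel."""
    log_q = -0.5 * ((w - mu_q) / sigma_q) ** 2 - np.log(sigma_q)
    log_p = -0.5 * (w / sigma_p) ** 2 - np.log(sigma_p)
    return np.sum(log_q - log_p, axis=-1)


def encode_sample(mu_q, sigma_q, sigma_p, shared_seed, private_rng):
    """Return (index, K, sample): the index is the message, K = ceil(exp(KL))."""
    num_candidates = max(1, int(np.ceil(np.exp(kl_diag_gauss(mu_q, sigma_q, sigma_p)))))
    shared = np.random.default_rng(shared_seed)           # shared random source
    candidates = shared.normal(0.0, sigma_p, size=(num_candidates, mu_q.size))
    log_a = log_importance_weights(candidates, mu_q, sigma_q, sigma_p)
    probs = np.exp(log_a - log_a.max())                   # proxy distribution q~
    probs /= probs.sum()
    index = int(private_rng.choice(num_candidates, p=probs))
    return index, num_candidates, candidates[index]


def decode_sample(index, num_candidates, dim, sigma_p, shared_seed):
    """Regenerate the same candidates from the public seed and pick row `index`."""
    shared = np.random.default_rng(shared_seed)
    return shared.normal(0.0, sigma_p, size=(num_candidates, dim))[index]


# Toy two-dimensional example (values are illustrative only).
mu_q = np.array([0.8, -0.5])
sigma_q = np.array([0.3, 0.4])
sigma_p = 1.0
index, K, w = encode_sample(mu_q, sigma_q, sigma_p, shared_seed=0,
                            private_rng=np.random.default_rng(1))
w_decoded = decode_sample(index, K, mu_q.size, sigma_p, shared_seed=0)
print(f"K = {K}, message = index {index} ({np.log2(K):.2f} bits)")
```

Only the index is transmitted; the decoder is assumed to know the seed, the encoding distribution and the number of candidates, all of which are public in this setup.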
The proxy distribution $\tilde{q}$ in Algorithm 1 is based on an importance sampling scheme, as its probability masses are defined to be proportional to the usual importance weights $a_k = q_\phi(w_k) / p(w_k)$. Under mild assumptions ($q_\phi$, $p$ continuous; $a_k < \infty$) it is easy to verify that $\tilde{q}$ converges to $q_\phi$ in total variation distance for $K \to \infty$; thus in the limit, Algorithm 1 samples from the correct distribution. However, since we collect only $K = \exp(\mathrm{KL}(q_\phi \| p))$ samples in order to achieve a short coding length, $\tilde{q}$ will be biased. Fortunately, it turns out that $K$ is just in the right order for this bias to be small.
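Theorem 3.2. Let $p$, $q$ be distributions over $w$. Let $t \geq 0$ and $\tilde{q}$ be a discrete distribution constructed by drawing $K = \exp(\mathrm{KL}(q \| p) + t)$ samples $w_1, \dots, w_K$ from $p$ and defining $\tilde{q}(w_k) := \frac{q(w_k)/p(w_k)}{\sum_{k'=1}^{K} q(w_{k'})/p(w_{k'})}$. Furthermore, let $f(w)$ be a measurable function and $\|f\|_q = \sqrt{\mathbb{E}_q[f^2]}$ be its 2-norm under $q$. Then it holds that

$$\mathbb{P}\left( \left| \mathbb{E}_{\tilde{q}}[f] - \mathbb{E}_{q}[f] \right| \geq \frac{2 \|f\|_q\, \epsilon}{1 - \epsilon} \right) \leq 2 \epsilon \qquad (7)$$

where

$$\epsilon = \left( e^{-t/4} + 2 \sqrt{ \mathbb{P}\left( \log \frac{q}{p} > \mathrm{KL}(q \| p) + \frac{t}{2} \right) } \right)^{1/2} \qquad (8)$$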
Theorem 3.2 is a corollary of Chatterjee & Diaconis (2018), Theorem 1.2, by noting that

$$\mathbb{E}_{\tilde{q}}[f] = \sum_{k=1}^{K} \tilde{q}(w_k)\, f(w_k) = \frac{\sum_{k=1}^{K} f(w_k)\, \frac{q(w_k)}{p(w_k)}}{\sum_{k=1}^{K} \frac{q(w_k)}{p(w_k)}} \qquad (9)$$

which is precisely the importance sampling estimator for unnormalized distributions (denoted as $J_n$ in (Chatterjee & Diaconis, 2018)), i.e. their Theorem 1.2 directly yields Theorem 3.2. Note that the term $e^{-t/4}$ decays quickly with $t$, and, since $\log \frac{q}{p}$ is typically concentrated around its expected value $\mathrm{KL}(q \| p)$, the second term in (8) also quickly becomes negligible. Thus, roughly speaking, Theorem 3.2 establishes that $\mathbb{E}_{\tilde{q}}[f] \approx \mathbb{E}_{q}[f]$ with high probability, for any measurable function $f$. This is in particular true for the function $f(w) = \log p(\mathcal{D} \mid w) - \beta \log \frac{q_\phi(w)}{p(w)}$. Note that the expectation of this function is just the variational objective (3) we optimized to yield $q_\phi$ in the first place. Thus, since $\mathbb{E}_{\tilde{q}}[f] \approx \mathbb{E}_{q}[f]$, replacing $q$ by $\tilde{q}$ is well justified. Thereby, any sample of $\tilde{q}$ can trivially be encoded with $\mathrm{KL}(q \| p) + t$ nats, and decoded by simple reference to a pseudo-random generator.

Note that according to Theorem 3.2 we should actually take a number of samples somewhat larger than $\exp(\mathrm{KL}(q \| p))$ in order to make $\epsilon$ sufficiently small. In particular, the results in (Chatterjee & Diaconis, 2018) also imply that with too small a number of samples the estimate will typically be quite far off the targeted expectation (for the worst-case $f$). However, although our choice of the number of samples is at a critical point, in our experiments this number of samples yielded very good results.
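The role of the slack $t$ can be illustrated with a small numerical experiment. The sketch below is an assumption-laden toy: it uses a one-dimensional Gaussian $q$ and a standard normal $p$, the test function $f(w) = w$, and hypothetical names throughout. It computes the self-normalized importance sampling estimate of $\mathbb{E}_q[f]$ from $K = \exp(\mathrm{KL}(q \| p) + t)$ samples drawn from $p$.

```python
import numpy as np


def proxy_expectation(f, mu_q, sigma_q, t, rng):
    """Estimate E_q[f] under the proxy distribution built from K = exp(KL + t) draws of p = N(0, 1)."""
    kl = np.log(1.0 / sigma_q) + (sigma_q ** 2 + mu_q ** 2) / 2.0 - 0.5
    num_samples = int(np.ceil(np.exp(kl + t)))
    w = rng.normal(size=num_samples)                       # samples from p
    # log q(w) - log p(w); the 2*pi constants cancel.
    log_a = -0.5 * ((w - mu_q) / sigma_q) ** 2 - np.log(sigma_q) + 0.5 * w ** 2
    a = np.exp(log_a - log_a.max())                        # unnormalized proxy masses
    return float(np.sum(a * f(w)) / np.sum(a)), num_samples


# Toy setting: q = N(1, 0.5^2), so the true value of E_q[w] is 1.
rng = np.random.default_rng(0)
estimate, num_samples = proxy_expectation(lambda w: w, mu_q=1.0, sigma_q=0.5, t=8.0, rng=rng)
print(f"K = {num_samples}, proxy estimate of E_q[w] = {estimate:.3f}")
```

Rerunning with smaller $t$ (for example $t = 0$) uses only a handful of samples, and the estimate then fluctuates much more from seed to seed, which matches the remark that $\exp(\mathrm{KL}(q \| p))$ sits at a critical point.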
In this section, we describe the application of Algorithm 1 within a practical learning algorithm – Minimal Random Code Learning (MIRACLE) – depicted in Algorithm 2. For both $q_\phi$ and $p$ we used Gaussians with diagonal covariance matrices. For $q_\phi$, all means and standard deviations constituted the variational parameters $\phi$. The mean of $p$ was fixed to zero, and the standard deviation was shared within each layer of the encoded network. These shared parameters of $p$ were learned jointly with $q_\phi$, i.e. the encoding distribution was also adapted to the task. This choice of distributions allowed us to use the reparameterization trick for effective variational training and, furthermore, $\mathrm{KL}(q_\phi \| p)$ can be computed analytically.

Since generating $K = \exp(\mathrm{KL}(q_\phi \| p))$ samples is infeasible for any reasonable $\mathrm{KL}(q_\phi \| p)$, we divided the overall problem into sub-problems. To this end, we set a global coding goal of $C$ and a local coding goal of $C_{loc}$. We randomly split the weight vector $w$ into $B$ equally sized blocks, and assigned each block an allowance of $C_{loc}$. Fixing $C_{loc}$ corresponds to $K = \exp(C_{loc})$ samples which need to be drawn per block. We imposed block-wise KL constraints using block-wise penalty factors $\beta_b$, which were automatically annealed via multiplication/division by an annealing factor during the variational updates (see Algorithm 2). Note that the random splitting into blocks can be efficiently coded via the shared random generator, and only the number $B$ needs to be communicated.

Before encoding any weights, we made sure that variational learning had converged by training for a large number of iterations $I_0$. After that, we alternated between encoding single blocks and updating the variational distribution of the not-yet coded weights, by spending $I$ intermediate variational iterations. To this end, we define a variational objective w.r.t. the blocks which have not been coded yet, while weights of already encoded blocks were fixed to their encoded value. Intuitively, this allows us to compensate for poor choices in earlier encoded blocks, and was crucial for good performance. Theoretically, this amounts to a rich auto-regressive variational family, as the blocks which remain to be updated are effectively conditioned on the weights which have already been encoded. We also found that the hashing trick (Chen et al., 2015) further improves performance (not depicted in Algorithm 2 for simplicity). The hashing trick randomly constrains weights to share the same value. While Chen et al. (2015) apply it to reduce the entropy, in our case it helps to restrict the optimization space and reduces the dimensionality of both $q_\phi$ and $p$. We found that this typically improves the compression rate.
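The block-wise bookkeeping can be sketched as follows. This is a hedged illustration and not the published Algorithm 2: the number of blocks is assumed to be the global goal divided by the local goal (rounded up), the annealing factor `eps` is a placeholder value, the 16-bit local goal is an illustrative choice, and all function names are hypothetical.

```python
import math
import numpy as np


def split_into_blocks(num_weights, global_goal, local_goal, shared_seed):
    """Randomly partition weight indices into ceil(C / C_loc) near-equal blocks.

    The permutation comes from a seeded generator, so a decoder that knows
    the seed and the number of blocks can reproduce the same split.
    """
    num_blocks = math.ceil(global_goal / local_goal)
    permutation = np.random.default_rng(shared_seed).permutation(num_weights)
    return np.array_split(permutation, num_blocks)


def block_kls(mu_q, sigma_q, sigma_p, blocks):
    """Analytic KL(q || p) per block for diagonal Gaussians with zero-mean p."""
    per_weight = (np.log(sigma_p / sigma_q)
                  + (sigma_q ** 2 + mu_q ** 2) / (2.0 * sigma_p ** 2) - 0.5)
    return np.array([per_weight[block].sum() for block in blocks])


def anneal_betas(betas, kls, local_goal, eps=0.05):
    """Raise the penalty of blocks over budget, lower it for blocks under budget."""
    return np.where(kls > local_goal, betas * (1.0 + eps), betas / (1.0 + eps))


# Illustrative numbers: a 16-bit local goal expressed in nats.
local_goal = 16 * math.log(2)                 # about 11.09 nats
samples_per_block = round(math.exp(local_goal))
global_goal = 20 * local_goal

rng = np.random.default_rng(0)
mu_q = rng.normal(0.0, 0.1, size=1000)
sigma_q = np.full(1000, 0.5)
blocks = split_into_blocks(1000, global_goal, local_goal, shared_seed=42)
kls = block_kls(mu_q, sigma_q, 1.0, blocks)
betas = np.full(len(blocks), 1e-3)
new_betas = anneal_betas(betas, kls, local_goal)
print(f"{len(blocks)} blocks, {samples_per_block} candidate samples per block")
```

In a full training loop these penalty updates would be interleaved with gradient steps on the variational parameters, so that each block's KL settles near its allowance before the block is encoded.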
The experiments (the code is publicly available at https://github.com/cambridge-mlg/miracle) were conducted on two common benchmarks: LeNet-5 on MNIST and VGG-16 on CIFAR-10. As baselines we used three recent state-of-the-art methods, namely Deep Compression (Han et al., 2016), Weightless encoding (Reagen et al., 2018) and Bayesian Compression (Louizos et al., 2017). The performance of the baseline methods is quoted from their respective source materials.
For training MIRACLE, we used Adam (Kingma & Ba, 2014) with the default learning rate () and we set and . For VGG, the means of the weights were initialized using a pretrained model. (For preprocessing the data and pretraining, we followed an open source implementation that can be found at https://github.com/chengyangfu/pytorch-vgg-cifar10.) We recommend applying the hashing trick mainly to reduce the size of the largest layers. In particular, we applied the hashing trick to layers 2 and 3 in LeNet-5 to reduce their sizes by and respectively and to layers 10-16 in VGG to reduce their sizes . The local coding goal $C_{loc}$ was fixed at for LeNet-5 and it was varied between and for VGG ( was kept constant). For the number of intermediate variational updates $I$, we used for LeNet-5 and for VGG, in order to keep training time reasonable ( 1 day on a single NVIDIA P100 for VGG).
The performance trade-offs (test error rate and compression size) of MIRACLE along with the baseline methods and the uncompressed model are shown in Figure 1 and Table 1. For MIRACLE we can easily construct the Pareto frontier, by starting with a large coding goal (i.e. allowing a large coding length) and successively reducing it. Constructing such a Pareto frontier for other methods is delicate, as it requires re-tuning hyper-parameters which are often only indirectly related to the compression size – for MIRACLE it is directly reflected via the KL-term. We see that MIRACLE is Pareto-better than the competitors: for a given test error rate, we achieve better compression, while for a given model size we achieve lower test error.
Model | Compression | Size | Ratio | Test error
LeNet-5 on MNIST | Uncompressed model | 1720 kB | 1 | 0.7 %
LeNet-5 on MNIST | Deep Compression | 44 kB | 39 | 0.8 %
LeNet-5 on MNIST | Weightless (see note) | 4.52 kB | 382 | 1.0 %
LeNet-5 on MNIST | Bayesian Compression | 2.3 kB | 771 | 1.0 %
LeNet-5 on MNIST | MIRACLE (Lowest error) | 3.03 kB | 555 | 0.69 %
LeNet-5 on MNIST | MIRACLE (Highest compression) | 1.52 kB | 1110 | 0.96 %
VGG-16 on CIFAR-10 | Uncompressed model | 60 MB | 1 | 6.5 %
VGG-16 on CIFAR-10 | Bayesian Compression | 642 kB | 95 | 8.6 %
VGG-16 on CIFAR-10 | Bayesian Compression | 525 kB | 116 | 9.2 %
VGG-16 on CIFAR-10 | MIRACLE (Lowest error) | 384 kB | 159 | 6.57 %
VGG-16 on CIFAR-10 | MIRACLE (Highest compression) | 135 kB | 452 | 10.0 %
Note: Weightless encoding only reports the size of the two largest layers, so we assumed that the size of the rest of the network is negligible in this case.
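The Pareto claim can be checked directly against the sizes and error rates reported in the table. The snippet below is a small sketch with a hypothetical helper `pareto_front`; it treats a method as dominated if another row is at least as small and at least as accurate, and strictly better in one of the two. The 60 MB entry is entered as 60000 kB, which is an assumption about the unit convention that does not affect the outcome.

```python
def pareto_front(rows):
    """Return names of rows not dominated in (size, error); lower is better for both."""
    front = []
    for name, size, error in rows:
        dominated = any(
            other_size <= size and other_error <= error
            and (other_size < size or other_error < error)
            for _, other_size, other_error in rows
        )
        if not dominated:
            front.append(name)
    return front


# (method, size in kB, test error in %) as reported in the table.
lenet_rows = [
    ("Uncompressed model", 1720, 0.7),
    ("Deep Compression", 44, 0.8),
    ("Weightless", 4.52, 1.0),
    ("Bayesian Compression", 2.3, 1.0),
    ("MIRACLE (Lowest error)", 3.03, 0.69),
    ("MIRACLE (Highest compression)", 1.52, 0.96),
]
vgg_rows = [
    ("Uncompressed model", 60000, 6.5),
    ("Bayesian Compression (642 kB)", 642, 8.6),
    ("Bayesian Compression (525 kB)", 525, 9.2),
    ("MIRACLE (Lowest error)", 384, 6.57),
    ("MIRACLE (Highest compression)", 135, 10.0),
]
lenet_front = pareto_front(lenet_rows)
vgg_front = pareto_front(vgg_rows)
print("LeNet-5 front:", lenet_front)
print("VGG-16 front:", vgg_front)
```

On these numbers every baseline compression method is dominated by one of the two MIRACLE operating points; only the uncompressed VGG-16 model remains on the front, because of its slightly lower error.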
In this paper we followed through the philosophy of the bits-back argument for the goal of coding model parameters. The basic insight here is that restricting to a single deterministic weight-set and aiming to code it in a classic Shannon-style is greedy and in fact sub-optimal. Neural networks – and other deep learning models – are highly overparameterized, and consequently there are many “good” parameterizations. Thus, rather than focusing on a single weight-set, we showed that this fact can be exploited for coding, by selecting a “cheap” weight-set out of the set of “good” ones. Our algorithm is backed by solid recent information-theoretic insights, yet it is simple to implement. We demonstrated that the presented coding algorithm clearly outperforms the previous state of the art. An important question remaining for future work is how efficient MIRACLE can be made in terms of memory accesses and consequently energy consumption and inference time. There lies clear potential in this direction, as any single weight can be recovered by its block-index and relative index within each block. By smartly keeping track of these addresses, and using pseudo-random generators as algorithmic lookup-tables, we could design an inference machine which is able to directly run our compressed models, which might lead to considerable savings in memory accesses.
We want to thank Christian Steinruecken, Olivér Janzer, Kris Stensbo-Smidt and Siddharth Swaroop for their helpful comments. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 797223 — HYBSPN. Furthermore, we acknowledge EPSRC and Intel for their financial support.
Frey, B. J. and Hinton, G. E. Efficient stochastic source coding and an application to a Bayesian network source model. The Computer Journal, 40(2_and_3):157–165, 1997.
Hinton, G. E. and Van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.
Reagen, B. et al. Weightless: Lossy weight encoding for deep neural network compression. In International Conference on Machine Learning, 2018.

In order to prove the upper bound, to which Harsha et al. (2010) refer as the ‘one-shot reverse Shannon theorem’, they exhibit a rejection sampling procedure. However, instead of using classical rejection sampling with acceptance probabilities $\frac{q(x)}{M p(x)}$ where $M = \max_x \frac{q(x)}{p(x)}$, they propose a greedier version. The core idea is that every sample should be accepted with as high a probability as possible while keeping the overall acceptance probability of each element below the target distribution.
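For this algorithm we assume discrete distributions $q$ and $p$ over the set $\mathcal{X}$.
Let $x_i \sim p$ with $i \in \mathbb{N}$ and let $\alpha_i(x)$ be the probability that the procedure outputs the $i$th sample with $x_i = x$. For the sampling method to be unbiased, we have to ensure that

$$\sum_{i=1}^{\infty} \alpha_i(x) = q(x) \qquad (10)$$

Let $p_i(x) = \sum_{j=1}^{i} \alpha_j(x)$ be the probability that the procedure halts within $i$ iterations and outputs $x$. Let $p_i^* = \sum_{x \in \mathcal{X}} p_i(x)$ be the probability that the procedure halts within $i$ iterations. Let

$$\alpha_i(x) = \min\left\{ q(x) - p_{i-1}(x),\ (1 - p_{i-1}^*)\, p(x) \right\}, \qquad p_0(x) = 0 \qquad (11)$$

Since $p_i(x) \leq q(x)$ is required, $\alpha_i(x)$ can be at most $q(x) - p_{i-1}(x)$. The proposed strategy is greedy because it accepts the $i$th sample with as high a probability as possible under the constraint that $p_i(x) \leq q(x)$.
Under the proposed formula for $\alpha_i(x)$, the acceptance probability for the $i$th sample is

$$\beta_i(x_i) = \frac{\alpha_i(x_i)}{(1 - p_{i-1}^*)\, p(x_i)} \qquad (12)$$

The pseudo code is shown in Algorithm 3. Note that the algorithm requires computing $\alpha_i(x)$ for the whole set $\mathcal{X}$ in every iteration, which makes it intractable for large $\mathcal{X}$.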
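For small discrete alphabets the greedy procedure can be run directly. The following sketch follows the recursion of Harsha et al. (2010) as it is commonly stated: $\alpha_i(x) = \min\{q(x) - p_{i-1}(x), (1 - p_{i-1}^*)\,p(x)\}$ with acceptance probability $\alpha_i(x_i) / ((1 - p_{i-1}^*)\,p(x_i))$. The function names, the split into a seeded shared generator for proposals and a private generator for acceptance decisions, and the toy distributions are assumptions for illustration; it is not the paper's Algorithm 3 verbatim.

```python
import numpy as np


def greedy_rejection_sample(q, p, shared_seed, private_rng, max_iters=10_000):
    """Return (i, x): the 1-based index of the accepted proposal and its value.

    Proposals x_1, x_2, ... are drawn from p with a seeded (shared) generator;
    accept/reject decisions use the sender's private randomness.
    Assumes the support of q is contained in the support of p.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    shared = np.random.default_rng(shared_seed)
    halted = np.zeros_like(q)        # p_{i-1}(x): mass already assigned to x
    halted_total = 0.0               # p*_{i-1}: probability of having halted
    for i in range(1, max_iters + 1):
        remaining = 1.0 - halted_total
        if remaining <= 0.0:
            break
        alpha = np.clip(np.minimum(q - halted, remaining * p), 0.0, None)
        x = int(shared.choice(len(p), p=p))
        if private_rng.random() < alpha[x] / (remaining * p[x]):
            return i, x
        halted = halted + alpha
        halted_total = float(halted.sum())
    raise RuntimeError("no proposal accepted within max_iters")


def decode_greedy(index, p, shared_seed):
    """Replay the shared proposals and return the one with the given index."""
    p = np.asarray(p, dtype=float)
    shared = np.random.default_rng(shared_seed)
    proposals = [int(shared.choice(len(p), p=p)) for _ in range(index)]
    return proposals[-1]


# Toy target and proposal distributions over three symbols.
q = np.array([0.7, 0.2, 0.1])
p = np.array([1 / 3, 1 / 3, 1 / 3])
index, symbol = greedy_rejection_sample(q, p, shared_seed=0,
                                        private_rng=np.random.default_rng(1))
print(f"accepted proposal number {index}, symbol {symbol}")
```

The vector `alpha` is recomputed over the entire alphabet in every iteration, which is the step that becomes intractable when the alphabet is a high-dimensional weight space.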
For the details of the proof, please refer to the source material (Harsha et al., 2010).
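To show that the procedure is unbiased, one has to prove that

$$\lim_{i \to \infty} p_i(x) = \sum_{i=1}^{\infty} \alpha_i(x) = q(x) \qquad (13)$$

This is shown by proving that $q(x) - p_i(x) \leq q(x)\,(1 - p(x))^i$ for $i \in \mathbb{N}$.
In order to bound the encoding length, one has to first show that if the accepted sample has index $i^*$, then

$$\mathbb{E}[\log i^*] \leq \mathrm{KL}(q \| p) + O(1) \qquad (14)$$

Following this, one can employ the prefix-free binary encoding of Vitanyi & Li (1997). Let $l(n)$ be the length of the encoding for $n \in \mathbb{N}$ using the encoding scheme proposed by Vitanyi & Li (1997). Their method is proven to have $|l(n)| = \log n + 2 \log \log (n + 1) + O(1)$, from which the upper bound follows:

$$\mathbb{E}\left[|l(i^*)|\right] \leq \mathrm{KL}(q \| p) + 2 \log\left(\mathrm{KL}(q \| p) + 1\right) + O(1) \qquad (15)$$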
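To make the code-length statement concrete, the sketch below uses the Elias delta code as a stand-in. This is an assumption: Elias delta is a well-known prefix-free integer code whose length grows as $\log n + 2 \log \log n + O(1)$ bits, and it is not necessarily the exact construction of Vitanyi & Li (1997). The function names are hypothetical.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer as a bit string."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_delta(n):
    """Elias delta code: gamma-code the bit length of n, then append n without its leading 1."""
    binary = bin(n)[2:]
    return elias_gamma(len(binary)) + binary[1:]


def decode_elias_delta(bits):
    """Decode one integer from the front of a bit string; return (n, bits_consumed)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    length = int(bits[zeros:2 * zeros + 1], 2)      # bit length of n
    start = 2 * zeros + 1
    n = int("1" + bits[start:start + length - 1], 2)
    return n, start + length - 1


# Because the code is prefix-free, codewords can be concatenated and still parsed.
stream = elias_delta(10) + elias_delta(3)
first, used = decode_elias_delta(stream)
second, _ = decode_elias_delta(stream[used:])
print(first, second, len(elias_delta(10)))
```

An accepted index $i^*$ from the rejection sampler would be transmitted with such a code, so the expected message length is governed by $\mathbb{E}[\log i^*]$ plus a doubly logarithmic overhead.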