1 Introduction
In recent years, deep neural networks (DNNs) have led to many breakthrough results in machine learning and computer vision krizhevsky2012imagenet; silver2016mastering; esteva2017dermatologist, and are now widely deployed in industry. Modern DNN models often have millions or tens of millions of parameters, leading to highly redundant structures, both in the intermediate feature representations they generate and in the model itself. Although overparametrization of DNN models can have a favorable effect on training, in practice it is often desirable to compress DNN models for inference, e.g., when deploying them on mobile or embedded devices with limited memory. The ability to learn compressible feature representations, on the other hand, has a large potential for the development of (data-adaptive) compression algorithms for various data types such as images, audio, video, and text, for all of which various DNN architectures are now available.
DNN model compression and lossy image compression using DNNs have both independently attracted a lot of attention lately. In order to compress a set of continuous model parameters or features, we need to approximate each parameter or feature by one representative from a set of quantization levels (or vectors, in the multi-dimensional case), each associated with a symbol, and then store the assignments (symbols) of the parameters or features, as well as the quantization levels. Representing each parameter of a DNN model or each feature in a feature representation by the corresponding quantization level will come at the cost of a distortion $D$, i.e., a loss in performance (e.g., in classification accuracy for a classification DNN with quantized model parameters, or in reconstruction error in the context of autoencoders with quantized intermediate feature representations). The rate $R$, i.e., the entropy of the symbol stream, determines the cost of encoding the model or features in a bitstream.
To learn a compressible DNN model or feature representation we need to minimize $D + \beta R$, where $\beta > 0$ controls the rate-distortion trade-off. Including the entropy in the learning cost function can be seen as adding a regularizer that promotes a compressible representation of the network or feature representation. However, two major challenges arise when minimizing $D + \beta R$ for DNNs: i) coping with the non-differentiability (due to quantization operations) of the cost function $D + \beta R$, and ii) obtaining an accurate and differentiable estimate of the entropy (i.e., $R$). To tackle i), various methods have been proposed. Among the most popular ones are stochastic approximations williams1992simple; krizhevsky2011using; courbariaux2015binaryconnect; toderici2015variable; balle2016end and rounding with a smooth derivative approximation hubara2016quantized; theis2017lossy. To address ii), a common approach is to assume the symbol stream to be i.i.d. and to model the marginal symbol distribution with a parametric model, such as a Gaussian mixture model theis2017lossy; ullrich2017soft, a piecewise linear model balle2016end, or a Bernoulli distribution toderici2016full (in the case of binary symbols).
In this paper, we propose a unified end-to-end learning framework for learning compressible representations, jointly optimizing the model parameters, the quantization levels, and the entropy of the resulting symbol stream to compress either a subset of feature representations in the network or the model itself (see inset figure). We address both challenges i) and ii) above with methods that are novel in the context of DNN model and feature compression. Our main contributions are:


We provide the first unified view on end-to-end learned compression of feature representations and DNN models. These two problems have been studied largely independently in the literature so far.

Our method is simple and intuitively appealing, relying on soft assignments of a given scalar or vector to be quantized to quantization levels. A parameter $\sigma$ controls the “hardness” of the assignments and allows a gradual transition from soft to hard assignments during training. In contrast to rounding-based or stochastic quantization schemes, our coding scheme is directly differentiable, thus trainable end-to-end.

Our method does not force the network to adapt to specific (given) quantization outputs (e.g., integers) but learns the quantization levels jointly with the weights, enabling application to a wider set of problems. In particular, we explore vector quantization for the first time in the context of learned compression and demonstrate its benefits over scalar quantization.

Unlike essentially all previous works, we make no assumption on the marginal distribution of the features or model parameters to be quantized by relying on a histogram of the assignment probabilities rather than the parametric models commonly used in the literature.

We apply our method to DNN model compression for a 32-layer ResNet model He_2016_CVPR and full-resolution image compression using a variant of the compressive autoencoder proposed recently in theis2017lossy. In both cases, we obtain performance competitive with the state of the art, while making fewer model assumptions and significantly simplifying the training procedure compared to the original works theis2017lossy; choi2016towards.

The remainder of the paper is organized as follows. Section 2 reviews related work, before our soft-to-hard vector quantization method is introduced in Section 3. Then we apply it to a compressive autoencoder for image compression and to ResNet for DNN compression in Sections 4 and 5, respectively. Section 6 concludes the paper.
2 Related Work
There has been a surge of interest in DNN models for full-resolution image compression, most notably toderici2015variable; toderici2016full; balle2016code; balle2016end; theis2017lossy, all of which outperform JPEG jpeg1992wallace and some even JPEG 2000 jpeg2000taubman. The pioneering work toderici2015variable; toderici2016full showed that progressive image compression can be learned with convolutional recurrent neural networks (RNNs), employing a stochastic quantization method during training. balle2016code; theis2017lossy both rely on convolutional autoencoder architectures. These works are discussed in more detail in Section 4.
In the context of DNN model compression, the line of works han2015learning; han2015deep; choi2016towards adopts a multi-step procedure in which the weights of a pretrained DNN are first pruned and the remaining parameters are quantized using a $k$-means-like algorithm, the DNN is then retrained, and finally the quantized DNN model is encoded using entropy coding. A notably different approach is taken by ullrich2017soft, where the DNN compression task is tackled using the minimum description length principle, which has a solid information-theoretic foundation.
It is worth noting that many recent works target quantization of the DNN model parameters and possibly the feature representation to speed up DNN evaluation on hardware with low-precision arithmetic, see, e.g., hubara2016quantized; rastegari2016xnor; wen2016learning; zhou2017incremental. However, most of these works do not specifically train the DNN such that the quantized parameters are compressible in an information-theoretic sense.
Gradually moving from an easy (convex or differentiable) problem to the actual harder problem during optimization, as done in our soft-to-hard quantization framework, has been studied in various contexts and falls under the umbrella of continuation methods (see allgower2012numerical for an overview). Formally related but motivated from a probabilistic perspective are deterministic annealing methods for maximum entropy clustering/vector quantization, see, e.g., rose1992vector; yair1992competitive. Arguably most related to our approach is Wohlhart_2013_CVPR, which also employs continuation for nearest neighbor assignments, but in the context of learning a supervised prototype classifier. To the best of our knowledge, continuation methods have not been employed before in an end-to-end learning framework for neural network-based image compression or DNN compression.
3 Proposed Soft-to-Hard Vector Quantization
3.1 Problem Formulation
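Preliminaries and Notations. We consider the standard model for DNNs, where we have an architecture $F: \mathbb{R}^{d_1} \to \mathbb{R}^{d_{K+1}}$ composed of $K$ layers $F = F_K \circ \cdots \circ F_1$, where layer $F_i$ maps $\mathbb{R}^{d_i} \to \mathbb{R}^{d_{i+1}}$, and has parameters $w_i \in \mathbb{R}^{m_i}$. We refer to $W = [w_1, \cdots, w_K]$ as the parameters of the network and we denote the intermediate layer outputs of the network as $x^{(0)} := x$ and $x^{(i)} := F_i(x^{(i-1)})$, such that $F(x) = x^{(K)}$ and $x^{(i)}$ is the feature vector produced by layer $F_i$.
The parameters of the network are learned w.r.t. training data $\mathcal{X} = \{x_1, \cdots, x_N\} \subset \mathbb{R}^{d_1}$ and labels $\mathcal{Y} = \{y_1, \cdots, y_N\} \subset \mathbb{R}^{d_{K+1}}$, by minimizing a real-valued loss $\mathcal{L}(\mathcal{X}, \mathcal{Y}; F)$. Typically, the loss can be decomposed as a sum over the training data plus a regularization term,
$$\mathcal{L}(\mathcal{X}, \mathcal{Y}; F) = \frac{1}{N} \sum_{i=1}^{N} \ell(F(x_i), y_i) + \lambda R(W), \tag{1}$$
where $\ell(F(x), y)$ is the sample loss, $\lambda > 0$ sets the regularization strength, and $R(W)$ is a regularizer (e.g., $R(W) = \sum_i \|w_i\|^2$ for $l_2$ regularization). In this case, the parameters of the network can be learned using stochastic gradient descent over mini-batches. Assuming that the data $\mathcal{X}, \mathcal{Y}$ on which the network is trained is drawn from some distribution $P_{X,Y}$, the loss (1) can be thought of as an estimator of the expected loss $\mathbb{E}[\ell(F(X), Y) + \lambda R(W)]$. In the context of image classification, $\mathbb{R}^{d_1}$ would correspond to the input image space and $\mathbb{R}^{d_{K+1}}$ to the classification probabilities, and $\ell$ would be the categorical cross entropy.
We say that the deep architecture is an autoencoder when the network maps back into the input space, with the goal of reproducing the input. In this case, $d_1 = d_{K+1}$ and $F(x)$ is trained to approximate $x$, e.g., with a mean squared error loss $\ell(F(x), y) = \|F(x) - y\|^2$. Autoencoders typically condense the dimensionality of the input into some smaller dimensionality inside the network, i.e., the layer with the smallest output dimension, $x^{(b)} \in \mathbb{R}^{d_b}$, has $d_b \ll d_1$, which we refer to as the “bottleneck”.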
Compressible representations. We say that a weight parameter $w_i$ or a feature $x^{(i)}$ has a compressible representation if it can be serialized to a binary stream using few bits. For DNN compression, we want the entire network parameters $W$ to be compressible. For image compression via an autoencoder, we just need the features in the bottleneck, $x^{(b)}$, to be compressible.
Suppose we want to compress a feature representation $z \in \mathbb{R}^d$ in our network (e.g., $x^{(b)}$ of an autoencoder) given an input $x$. Assuming that the data $\mathcal{X}, \mathcal{Y}$ is drawn from some distribution $P_{X,Y}$, $z$ will be a sample from a continuous random variable $Z$.
To store $z$ with a finite number of bits, we need to map it to a discrete space. Specifically, we map $z$ to a sequence of $m$ symbols using a (symbol) encoder $E: \mathbb{R}^d \to [L]^m$, where each symbol is an index ranging from $1$ to $L$, i.e., $[L] := \{1, \dots, L\}$. The reconstruction of $z$ is then produced by a (symbol) decoder $D: [L]^m \to \mathbb{R}^d$, which maps the symbols back to $\hat{z} = D(E(z)) \in \mathbb{R}^d$. Since $z$ is a sample from $Z$, the symbol stream $E(z)$ is drawn from the discrete probability distribution $P_{E(Z)}$. Thus, given the encoder $E$, according to Shannon’s source coding theorem cover2012elements, the correct metric for compressibility is the entropy of $E(Z)$:
$$H(E(Z)) = -\sum_{e \in [L]^m} P(E(Z) = e) \log\big(P(E(Z) = e)\big). \tag{2}$$
Our generic goal is hence to optimize the rate-distortion trade-off between the expected loss and the entropy of $E(Z)$:
$$\min_{E, D, W} \; \mathbb{E}_{X,Y}\big[\ell(\hat{F}(X), Y) + \lambda R(W)\big] + \beta H(E(Z)), \tag{3}$$
where $\hat{F}$ is the architecture where $z$ has been replaced with $\hat{z}$, and $\beta > 0$ controls the trade-off between compressibility of $z$ and the distortion it imposes on $\hat{F}$.
However, we cannot optimize (3) directly. First, we do not know the distribution of $X$ and $Y$. Second, the distribution of $Z$ depends in a complex manner on the network parameters $W$ and the distribution of $X$. Third, the encoder $E$ is a discrete mapping and thus not differentiable. For our first approximation we consider the sample entropy instead of $H(E(Z))$. That is, given the data $\mathcal{X}$ and some fixed network parameters $W$, we can estimate the probabilities $P(E(Z) = e)$ for $e \in [L]^m$ via a histogram. For this estimate to be accurate, however, we would need $|\mathcal{X}| \gg L^m$. If $z$ is the bottleneck of an autoencoder, this would correspond to trying to learn a single histogram for the entire discretized data space. We relax this by assuming the entries of $E(Z)$ are i.i.d. such that we can instead compute the histogram over the $L$ distinct values. More precisely, we assume that for $e = (e_1, \cdots, e_m) \in [L]^m$ we can approximate $P(E(Z) = e) \approx \prod_{l=1}^{m} p_{e_l}$, where $p_j$ is the histogram estimate
$$p_j := \frac{\big|\{ e_l(z_i) \mid l \in [m],\ i \in [N],\ e_l(z_i) = j \}\big|}{mN}, \tag{4}$$
where we denote the entries of $E(z) = (e_1(z), \cdots, e_m(z))$ and $z_i$ is the output feature $z$ for training data point $x_i \in \mathcal{X}$. We then obtain an estimate of the entropy of $E(Z)$ by substituting the i.i.d. product approximation into (2),
$$H(E(Z)) \approx -\sum_{e \in [L]^m} \Big(\prod_{l=1}^{m} p_{e_l}\Big) \log\Big(\prod_{l=1}^{m} p_{e_l}\Big) = -m \sum_{j=1}^{L} p_j \log p_j = m H(p), \tag{5}$$
where the first (exact) equality is due to cover2012elements, Thm. 2.6.6, and $H(p) := -\sum_{j=1}^{L} p_j \log p_j$ is the sample entropy for the (i.i.d., by assumption) components of $E(Z)$ (see Footnote 1).
Footnote 1: In fact, from cover2012elements, Thm. 2.6.6, it follows that if the histogram estimates $p_j$ are exact, (5) is an upper bound for the true $H(E(Z))$ (i.e., without the i.i.d. assumption).
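As a small illustration of the histogram estimate (4) and the resulting rate estimate $m H(p)$ in (5), the following sketch computes both for a toy symbol array. It is not taken from the paper: the function names are hypothetical, symbols are 0-based (the text indexes them from 1 to $L$), and the entropy is measured in bits (base-2 logarithm), which is an assumption since the text leaves the base open.

```python
import numpy as np

def symbol_histogram(symbols, num_levels):
    """Histogram estimate p_j of Eq. (4): relative frequency of each symbol.

    symbols: integer array of shape (N, m) with entries in {0, ..., L-1}
             (0-based here; the text uses indices 1, ..., L).
    """
    counts = np.bincount(symbols.ravel(), minlength=num_levels)
    return counts / symbols.size          # divide by m * N

def sample_entropy_bits(p):
    """Sample entropy H(p) in bits, with the convention 0 * log 0 = 0."""
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log2(nonzero)))

# Toy symbol streams: N = 2 data points, m = 4 symbols each, L = 4 levels.
symbols = np.array([[0, 0, 1, 2],
                    [0, 0, 1, 3]])
p = symbol_histogram(symbols, num_levels=4)
rate_per_sample = symbols.shape[1] * sample_entropy_bits(p)   # m * H(p)
print(p, rate_per_sample)
```

Under the i.i.d. assumption, `rate_per_sample` is the estimated number of bits needed per data point; an entropy coder fed with the same histogram would approach this cost.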
We can now simplify the ideal objective of (3) by replacing the expected loss with the sample mean over $\ell$ and the entropy with the sample entropy $H(p)$, obtaining
$$\frac{1}{N} \sum_{i=1}^{N} \ell(\hat{F}(x_i), y_i) + \lambda R(W) + \beta m H(p). \tag{6}$$
We note that so far we have assumed that $z$ is a feature output in $F$, i.e., $z = x^{(k)}$ for some $k \in [K]$. However, the above treatment would stay the same if $z$ is the concatenation of multiple feature outputs. One can also obtain a separate sample entropy term for separate feature outputs and add them to the objective in (6).
In case $z$ is composed of one or more parameter vectors, such as in DNN compression where $z = W$, $Z$ and $\hat{Z}$ cease to be random variables, since $W$ is a parameter of the model. That is, as opposed to the case where we have a source $X$ that produces another source $\hat{Z}$ which we want to be compressible, we want the discretization of a single parameter vector $W$ to be compressible. This is analogous to compressing a single document, instead of learning a model that can compress a stream of documents. In this case, (3) is not the appropriate objective, but our simplified objective in (6) remains appropriate. This is because a standard technique in compression is to build a statistical model of the (finite) data, which has a small sample entropy. The only difference is that now the histogram probabilities in (4) are taken over $W$ instead of the dataset $\mathcal{X}$, i.e., $N = 1$ and $z_i = W$ in (4), and they count towards storage as well as the encoder $E$ and decoder $D$.
Challenges. Eq. (6) gives us a unified objective that describes the trade-off between compressible representations in a deep architecture and the original training objective of the architecture.
However, the problem of finding a good encoder $E$, a corresponding decoder $D$, and parameters $W$ that minimize the objective remains. First, we need to impose a form for the encoder and decoder, and second we need an approach that can optimize (6) w.r.t. the parameters $W$. Independently of the choice of $E$, (6) is challenging since $E$ is a mapping to a finite set and, therefore, not differentiable. This implies that neither $H(p)$ nor $\hat{F}$ is differentiable w.r.t. the parameters of $z$ and the layers that feed into $z$. For example, if $\hat{F}$ is an autoencoder and $z = x^{(b)}$, the output of the network will not be differentiable w.r.t. $w_1, \cdots, w_b$ and $x^{(1)}, \cdots, x^{(b-1)}$.
These challenges motivate the design decisions of our soft-to-hard annealing approach, described in the next section.
3.2 Our Method
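Encoder and decoder form. For the encoder $E$ we assume that we have $L$ center vectors $\mathcal{C} = \{c_1, \cdots, c_L\} \subset \mathbb{R}^{d/m}$. The encoding of $z \in \mathbb{R}^d$ is then performed by reshaping it into a matrix $Z = [\bar{z}^{(1)}, \cdots, \bar{z}^{(m)}] \in \mathbb{R}^{(d/m) \times m}$, and assigning each column $\bar{z}^{(l)}$ to the index of its nearest neighbor in $\mathcal{C}$. That is, we assume the feature $z \in \mathbb{R}^d$ can be modeled as a sequence of $m$ points in $\mathbb{R}^{d/m}$, which we partition into the Voronoi tessellation over the centers $\mathcal{C}$. The decoder $D$ then simply constructs $\hat{z} \in \mathbb{R}^d$ from a symbol sequence $(e_1, \cdots, e_m)$ by picking the corresponding centers $\hat{Z} = [c_{e_1}, \cdots, c_{e_m}]$, from which $\hat{z}$ is formed by reshaping $\hat{Z}$ back into $\mathbb{R}^d$. We will interchangeably write $\hat{z} = D(E(z))$ and $\hat{Z} = D(E(Z))$.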
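The nearest-neighbor encoder and lookup decoder described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, symbols are 0-based, and the layout (consecutive entries of the vector form one point, stored as a row rather than a column) is an assumption made for convenience.

```python
import numpy as np

def encode(z, centers):
    """Hard symbol encoder: index of the nearest center for each point.

    z:       feature vector of length d
    centers: array of shape (L, d/m), one center per row
    """
    dim = centers.shape[1]
    points = z.reshape(-1, dim)            # m points of dimension d/m
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)          # m symbols in {0, ..., L-1}

def decode(symbols, centers):
    """Symbol decoder: look up the centers and flatten back to length d."""
    return centers[symbols].reshape(-1)

centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])   # L = 3, d/m = 2
z = np.array([0.1, -0.2, 0.9, 1.2, -0.8, 1.7])              # d = 6, m = 3
symbols = encode(z, centers)
z_hat = decode(symbols, centers)
print(symbols, z_hat)
```

Only `symbols` and `centers` need to be stored to reconstruct `z_hat`, which is why the entropy of the symbol stream determines the rate.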
The idea is then to relax $E$ and $D$ into continuous mappings via soft assignments instead of the hard nearest neighbor assignment of $E$.
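Soft assignments. We define the soft assignment of $\bar{z} \in \mathbb{R}^{d/m}$ to $\mathcal{C}$ as
$$\phi(\bar{z}) := \mathrm{softmax}\big(-\sigma \big[\|\bar{z} - c_1\|^2, \dots, \|\bar{z} - c_L\|^2\big]\big) \in \mathbb{R}^L, \tag{7}$$
where $\mathrm{softmax}(y_1, \cdots, y_L)_j := \frac{e^{y_j}}{e^{y_1} + \cdots + e^{y_L}}$ is the standard softmax operator, such that $\phi(\bar{z})$ has positive entries and $\|\phi(\bar{z})\|_1 = 1$. We denote the $j$-th entry of $\phi(\bar{z})$ with $\phi_j(\bar{z})$ and note that
$$\lim_{\sigma \to \infty} \phi_j(\bar{z}) = \begin{cases} 1 & \text{if } j = \arg\min_{j' \in [L]} \|\bar{z} - c_{j'}\|, \\ 0 & \text{otherwise,} \end{cases}$$
such that $\hat{\phi}(\bar{z}) := \lim_{\sigma \to \infty} \phi(\bar{z})$ converges to a one-hot encoding of the nearest center to $\bar{z}$ in $\mathcal{C}$. We therefore refer to $\hat{\phi}(\bar{z})$ as the hard assignment of $\bar{z}$ to $\mathcal{C}$ and the parameter $\sigma > 0$ as the hardness of the soft assignment $\phi(\bar{z})$.
Using soft assignment, we define the soft quantization of $\bar{z}$ as
$$\tilde{Q}(\bar{z}) := \sum_{j=1}^{L} c_j \phi_j(\bar{z}) = C \phi(\bar{z}),$$
where we write the centers as a matrix $C = [c_1, \cdots, c_L] \in \mathbb{R}^{(d/m) \times L}$. The corresponding hard assignment is taken with $\hat{Q}(\bar{z}) := \lim_{\sigma \to \infty} \tilde{Q}(\bar{z}) = c_{e(\bar{z})}$, where $c_{e(\bar{z})}$ is the center in $\mathcal{C}$ nearest to $\bar{z}$. Therefore, we can now write:
$$\hat{Z} = D(E(Z)) = \big[\hat{Q}(\bar{z}^{(1)}), \cdots, \hat{Q}(\bar{z}^{(m)})\big] = C \big[\hat{\phi}(\bar{z}^{(1)}), \cdots, \hat{\phi}(\bar{z}^{(m)})\big].$$
Now, instead of computing $\hat{Z}$ via hard nearest neighbor assignments, we can approximate it with a smooth relaxation $\tilde{Z} := C \big[\phi(\bar{z}^{(1)}), \cdots, \phi(\bar{z}^{(m)})\big]$ by using the soft assignments instead of the hard assignments. Denoting the corresponding vector form by $\tilde{z}$, this gives us a differentiable approximation $\tilde{F}$ of the quantized architecture $\hat{F}$, by replacing $z$ in the network with $\tilde{z}$.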
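To make the role of the hardness parameter concrete, the sketch below evaluates the soft assignments of (7) and the resulting soft quantization for increasing $\sigma$. It is a minimal NumPy illustration with hypothetical function names; points and centers are stored as rows, and the subtraction of the row maximum is only a standard numerical-stability step that leaves the softmax unchanged.

```python
import numpy as np

def soft_assign(points, centers, sigma):
    """Soft assignments of Eq. (7): softmax over -sigma * squared distances.

    points: (m, d/m), centers: (L, d/m); returns an (m, L) array whose rows
    are positive and sum to one.
    """
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -sigma * sq_dist
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability only
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def soft_quantize(points, centers, sigma):
    """Soft quantization: convex combination of the centers per point."""
    return soft_assign(points, centers, sigma) @ centers

centers = np.array([[-1.0], [0.0], [1.0]])   # scalar quantization, L = 3
points = np.array([[0.3], [-0.9]])
for sigma in (1.0, 10.0, 1000.0):
    # As sigma grows, each output moves towards the nearest center.
    print(sigma, soft_quantize(points, centers, sigma).ravel())
```

For small `sigma` the outputs are blends of all centers; for large `sigma` they coincide with the hard nearest-neighbor quantization, which is the behavior the annealing procedure exploits.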
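Entropy estimation. Using the soft assignments, we can similarly define a soft histogram, by summing up the partial assignments to each center instead of counting as in (4):
$$q_j := \frac{1}{mN} \sum_{i=1}^{N} \sum_{l=1}^{m} \phi_j(\bar{z}_i^{(l)}).$$
This gives us a valid probability mass function $q = (q_1, \cdots, q_L)$, which is differentiable but converges to $p = (p_1, \cdots, p_L)$ as $\sigma \to \infty$.
We can now define the “soft entropy” as the cross entropy between $p$ and $q$:
$$\tilde{H}(\phi) := H(p, q) = -\sum_{j=1}^{L} p_j \log q_j = H(p) + D_{KL}(p \| q),$$
where $D_{KL}(p \| q) = \sum_{j} p_j \log(p_j / q_j)$ denotes the Kullback–Leibler divergence. Since $D_{KL}(p \| q) \geq 0$, this establishes $\tilde{H}(\phi)$ as an upper bound for $H(p)$, where equality is obtained when $p = q$.
We have therefore obtained a differentiable “soft entropy” loss (w.r.t. $q$), which is an upper bound on the sample entropy $H(p)$. Hence, we can indirectly minimize $H(p)$ by minimizing $\tilde{H}(\phi)$, treating the histogram probabilities of $p$ as constants for gradient computation. However, we note that while $q_j$ is additive over the training data and the symbol sequence, $\log(q_j)$ is not. This prevents the use of mini-batch gradient descent on $\tilde{H}(\phi)$, which can be an issue for large scale learning problems. In this case, we can instead re-define the soft entropy $\tilde{H}(\phi)$ as $H(q, p)$. As before, $\tilde{H}(\phi) \to H(p)$ as $\sigma \to \infty$, but $\tilde{H}(\phi)$ ceases to be an upper bound for $H(p)$. The benefit is that now $\tilde{H}(\phi)$ can be decomposed as
$$\tilde{H}(\phi) := H(q, p) = -\sum_{j=1}^{L} q_j \log p_j = -\sum_{i=1}^{N} \sum_{l=1}^{m} \frac{1}{mN} \sum_{j=1}^{L} \phi_j(\bar{z}_i^{(l)}) \log p_j, \tag{8}$$
such that we get an additive loss over the samples $x_i \in \mathcal{X}$ and the components $l \in [m]$.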
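The additive form of the soft entropy can be checked numerically. The sketch below, with hypothetical function names, computes the hard histogram from the arg-max of the soft assignments, the soft histogram as their mean, and the cross entropy of the soft histogram against the hard one as a mean of per-point terms. The base-2 logarithm and the small constant `eps` (guarding against empty histogram bins) are assumptions of this sketch; in training, the hard histogram would be treated as a constant when computing gradients.

```python
import numpy as np

def soft_assign(points, centers, sigma):
    """Softmax over -sigma * squared distances, one row per point."""
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -sigma * sq_dist
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def hard_histogram(phi):
    """Histogram p of hard (nearest-center) assignments."""
    symbols = phi.argmax(axis=1)
    return np.bincount(symbols, minlength=phi.shape[1]) / phi.shape[0]

def soft_entropy(phi, p, eps=1e-12):
    """Cross entropy of the soft histogram against p, as a mean of per-point terms."""
    per_point = -(phi * np.log2(p + eps)).sum(axis=1)   # one term per point
    return float(per_point.mean())

centers = np.array([[-1.0], [0.0], [1.0]])
points = np.array([[-0.9], [-1.1], [0.1], [0.8]])
phi = soft_assign(points, centers, sigma=5.0)
p = hard_histogram(phi)          # constant w.r.t. gradients in training
q = phi.mean(axis=0)             # soft histogram
loss = soft_entropy(phi, p)
print(p, q, loss)
```

Because `soft_entropy` is a plain mean over points, it can be evaluated on a mini-batch, which is the practical reason given in the text for this variant.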
Soft-to-hard deterministic annealing. Our soft assignment scheme gives us differentiable approximations $\tilde{F}$ and $\tilde{H}(\phi)$ of the discretized network $\hat{F}$ and the sample entropy $H(p)$, respectively. However, our objective is to learn network parameters $W$ that minimize (6) when using the encoder $E$ and decoder $D$ with hard assignments, such that we obtain a compressible symbol stream $E(z)$ which we can compress using, e.g., arithmetic coding Witten1987.
To this end, we anneal $\sigma$ from some initial value $\sigma_0$ to infinity during training, such that the soft approximation gradually becomes a better approximation of the final hard quantization we will use. Choosing the annealing schedule is crucial, as annealing too slowly may allow the network to invert the soft assignments (resulting in large weights), and annealing too fast leads to vanishing gradients too early, thereby preventing learning. In practice, one can either parametrize $\sigma$ as a function of the iteration, or tie it to an auxiliary target such as the difference between the network losses incurred by soft quantization and hard quantization (see Section 4 for details).
For a simple initialization of $\sigma_0$ and the centers $\mathcal{C}$, we can sample the centers from the set $\mathcal{Z} := \{\bar{z}_i^{(l)} \mid i \in [N], l \in [m]\}$ and then cluster $\mathcal{Z}$ by minimizing the cluster energy $\sum_{\bar{z} \in \mathcal{Z}} \|\bar{z} - \tilde{Q}(\bar{z})\|^2$ using SGD.
4 Image Compression
We now show how we can use our framework to realize a simple image compression system. For the architecture, we use a variant of the convolutional autoencoder proposed recently in theis2017lossy (see Appendix A.1 for details). We note that while we use the architecture of theis2017lossy, we train it using our soft-to-hard entropy minimization method, which differs significantly from their approach, see below.
Our goal is to learn a compressible representation of the features in the bottleneck of the autoencoder. Because we do not expect the features from different bottleneck channels to be identically distributed, we model each channel’s distribution with a different histogram and entropy loss, adding each entropy term to the total loss using the same $\beta$ parameter. To encode a channel into symbols, we separate the channel matrix into a sequence of $p_w \times p_h$-dimensional patches. These patches (vectorized) form the columns of $Z \in \mathbb{R}^{(d/m) \times m}$, where $m = d/(p_w p_h)$, such that $Z$ contains $m$ points of dimension $p_w p_h$. Having $p_h$ or $p_w$ greater than one allows symbols to capture local correlations in the bottleneck, which is desirable since we model the symbols as i.i.d. random variables for entropy coding. At test time, the symbol encoder $E$ then determines the symbols in the channel by performing a nearest neighbor assignment over a set of $L$ centers $\mathcal{C} \subset \mathbb{R}^{p_w p_h}$, resulting in $\hat{Z}$, as described above. During training we instead use the soft quantized $\tilde{Z}$, also w.r.t. the centers $\mathcal{C}$.
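The patch-based arrangement of a bottleneck channel into points can be sketched as follows. This is an illustration under assumptions not stated in the text: patches are taken to be non-overlapping, each patch is vectorized in row-major order, points are stored as rows, and the function names are hypothetical.

```python
import numpy as np

def channel_to_points(channel, patch_h, patch_w):
    """Split an (H, W) channel into non-overlapping patches, one row per patch."""
    h, w = channel.shape
    assert h % patch_h == 0 and w % patch_w == 0, "patches must tile the channel"
    blocks = channel.reshape(h // patch_h, patch_h, w // patch_w, patch_w)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, patch_h * patch_w)

def points_to_channel(points, shape, patch_h, patch_w):
    """Inverse of channel_to_points: reassemble the (H, W) channel."""
    h, w = shape
    blocks = points.reshape(h // patch_h, w // patch_w, patch_h, patch_w)
    return blocks.transpose(0, 2, 1, 3).reshape(h, w)

channel = np.arange(16.0).reshape(4, 4)
points = channel_to_points(channel, patch_h=2, patch_w=2)   # 4 points of dim 4
restored = points_to_channel(points, channel.shape, 2, 2)
print(points)
```

Each row of `points` is then quantized as one symbol, so neighboring bottleneck entries inside a patch are coded jointly.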
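[Figure: visual comparison of one example image compressed by each method; values as printed under each panel.]
SHA (ours): 0.20bpp / 0.91 / 0.69 / 23.88dB
BPG: 0.20bpp / 0.90 / 0.67 / 24.19dB
JPEG 2000: 0.20bpp / 0.88 / 0.63 / 23.01dB
JPEG: 0.22bpp / 0.77 / 0.48 / 19.77dB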
We trained different models using Adam kingmaB14, see Appendix A.2. Our training set is composed similarly to that described in balle2016code. We used a subset of 90,000 images from ImageNET imagenet_cvpr09, which we downsampled by a factor 0.7 and trained on image crops, with a batch size of 15. To estimate the probability distribution $p$ for optimizing (8), we maintain a histogram over 5,000 images, which we update every 10 iterations with the images from the current batch. Details about other hyperparameters can be found in Appendix A.2.
The training of our autoencoder network takes place in two stages, where we move from an identity function in the bottleneck to hard quantization. In the first stage, we train the autoencoder without any quantization. Similar to theis2017lossy we gradually unfreeze the channels in the bottleneck during training (this gives a slight improvement over learning all channels jointly from the start). This yields an efficient weight initialization and enables us to then initialize $\sigma_0$ and $\mathcal{C}$ as described above. In the second stage, we minimize (6), jointly learning network weights and quantization levels. We anneal $\sigma$ by letting the gap between soft and hard quantization error go to zero as the number of iterations $t$ goes to infinity. Let $e_S = \|\tilde{F}(x) - x\|^2$ be the soft error and $e_H = \|\hat{F}(x) - x\|^2$ the hard error. With $\mathrm{gap}(t) = e_H - e_S$ we can denote the error between the actual and the desired gap with $e_G(t) = \mathrm{gap}(t) - \frac{T}{T+t}\,\mathrm{gap}(0)$, such that the gap is halved after $T$ iterations. We update $\sigma$ according to $\sigma(t+1) = \sigma(t) + K_G\, e_G(t)$, where $\sigma(t)$ denotes $\sigma$ at iteration $t$. Fig. 3 in Appendix A.4 shows the evolution of the gap, soft and hard loss as $\sigma$ grows during training. We observed that both vector quantization and entropy loss lead to higher compression rates at a given reconstruction MSE compared to scalar quantization and training without entropy loss, respectively (see Appendix A.3 for details).
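The gap-driven annealing described above behaves like a simple proportional controller on $\sigma$. The sketch below illustrates one update step under stated assumptions: the target gap is taken to decay as $\frac{T}{T+t}$ times the initial gap (one schedule that is halved after $T$ iterations), the increment is a gain times the deviation from that target, and all names and numeric values are hypothetical.

```python
def desired_gap(initial_gap, t, horizon):
    """Assumed target gap: horizon / (horizon + t) * initial_gap.

    It equals initial_gap at t = 0 and half of it at t = horizon.
    """
    return horizon / (horizon + t) * initial_gap

def update_sigma(sigma, soft_error, hard_error, initial_gap, t, horizon, gain):
    """One controller step: raise sigma when the actual gap exceeds the target."""
    gap = hard_error - soft_error
    gap_error = gap - desired_gap(initial_gap, t, horizon)
    return sigma + gain * gap_error

# Hypothetical numbers: the measured gap (0.7) is above the target (0.5),
# so sigma is increased to harden the assignments.
sigma_next = update_sigma(sigma=1.0, soft_error=0.2, hard_error=0.9,
                          initial_gap=1.0, t=100, horizon=100, gain=2.0)
print(sigma_next)
```

If the measured gap falls below the target, the same rule lowers `sigma`, so the hardness tracks the schedule instead of growing blindly.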
Evaluation.
To evaluate the image compression performance of our Soft-to-Hard Autoencoder (SHA) method we use four datasets, namely Kodak kodakurl, B100 Timofte2014, Urban100 HuangCVPR2015, and ImageNET100 (100 randomly selected images from ImageNET ImageNet_ILSVRC15), and three standard quality measures, namely peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) SSIM, and multi-scale SSIM (MS-SSIM), see Appendix A.5 for details. We compare our SHA with the standard JPEG, JPEG 2000, and BPG bpgurl, focusing on low bit rates, measured in bits per pixel (bpp) (i.e., the regime where traditional integral transform-based compression algorithms are most challenged). As shown in Fig. 1, for high compression rates, our SHA outperforms JPEG and JPEG 2000 in terms of MS-SSIM and is competitive with BPG. A similar trend can be observed for SSIM (see Fig. 4 in Appendix A.6 for plots of SSIM and PSNR as a function of bpp). SHA performs best on ImageNET100 and is most challenged on Kodak when compared with JPEG 2000. Visually, SHA-compressed images have fewer artifacts than those compressed by JPEG 2000 (see Fig. 1, and Appendix A.7).
Related methods and discussion.
JPEG 2000 jpeg2000taubman uses wavelet-based transformations and adaptive EBCOT coding. BPG bpgurl, based on a subset of the HEVC video compression standard, is the current state of the art for image compression. It uses context-adaptive binary arithmetic coding (CABAC) marpe2003context.
| | SHA (ours) | Theis et al. theis2017lossy |
| --- | --- | --- |
| Quantization | vector quantization | rounding to integers |
| Backpropagation | grad. of soft relaxation | grad. of identity mapping |
| Entropy estimation | (soft) histogram | Gaussian scale mixtures |
| Training material | ImageNET | high quality Flickr images |
| Operating points | single model | ensemble |
The recent works of theis2017lossy; balle2016end also showed competitive performance with JPEG 2000. While we use the architecture of theis2017lossy, there are stark differences between the works, summarized in the inset table. The work of balle2016end builds a deep model using multiple generalized divisive normalization (GDN) layers and their inverses (IGDN), which are specialized layers designed to capture local joint statistics of natural images. Furthermore, they model marginals for entropy estimation using linear splines and also use CABAC marpe2003context coding. Concurrent to our work, the method of google_newpaper builds on the architecture proposed in toderici2016full, and shows that impressive performance in terms of the MS-SSIM metric can be obtained by incorporating it into the optimization (instead of just minimizing the MSE).
In contrast to the domain-specific techniques adopted by these state-of-the-art methods, our framework for learning compressible representations can realize a competitive image compression system, using only a convolutional autoencoder and simple entropy coding.
5 DNN Compression
For DNN compression, we investigate the ResNet He_2016_CVPR architecture for image classification. We adopt the same setting as choi2016towards and consider a 32-layer architecture trained for CIFAR-10 krizhevsky2009learning. As in choi2016towards, our goal is to learn a compressible representation for all 464,154 trainable parameters of the model.
We concatenate the parameters into a vector $W \in \mathbb{R}^{464{,}154}$ and employ scalar quantization ($m = d$), such that $Z^T = z = W$. We started from the pre-trained original model, which obtains a 92.6% accuracy on the test set. We implemented the entropy minimization by using $L$ centers and chose $\beta$ such that the converged entropy would give the targeted compression factor, i.e., the targeted number of bits per weight. The training was performed with the same learning parameters as the original model was trained with (SGD with momentum). The annealing schedule used was a simple exponential one. After 4 epochs of training with increasing $\sigma$, we switched to hard assignments and continued fine-tuning at a lower learning rate (see Footnote 2).
Footnote 2: We switch to hard assignments since we can get large gradients for weights that are equally close to two centers as $\tilde{Q}$ converges to hard nearest neighbor assignments. One could also employ simple gradient clipping.
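An exponential annealing schedule of the kind mentioned above multiplies $\sigma$ by a constant factor at every iteration. The sketch below shows such a schedule and how many iterations it takes to scale $\sigma$ by a given factor. The initial value and growth rate used here are placeholders chosen for illustration, not the values used in the experiments, and the function names are hypothetical.

```python
import math

def exponential_sigma(sigma_0, growth, t):
    """Exponential schedule: sigma(t) = sigma_0 * growth ** t."""
    return sigma_0 * growth ** t

def iterations_to_reach(factor, growth):
    """Smallest t with growth ** t >= factor (requires growth > 1)."""
    return math.ceil(math.log(factor) / math.log(growth))

# Placeholder values for illustration only.
sigma_0, growth = 1.0, 1.01
t_double = iterations_to_reach(2.0, growth)
print(t_double, exponential_sigma(sigma_0, growth, t_double))
```

A helper like `iterations_to_reach` is a convenient way to pick the growth rate so that the switch to hard assignments happens after a planned number of epochs.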
Adhering to the benchmark of choi2016towards; han2015learning; han2015deep, we obtain the compression factor by dividing the bit cost of storing the uncompressed weights as floats ($464{,}154 \times 32$ bits) by the total encoding cost of the compressed weights (i.e., $L \times 32$ bits for the centers plus the size of the compressed index stream).

Table 1

| Method | Acc [%] | Comp. ratio |
| --- | --- | --- |
| Original model | 92.6 | 1.00 |
| Pruning + ft. + index coding + H. Coding han2015learning | 92.6 | 4.52 |
| Pruning + ft. + k-means + ft. + I.C. + H.C. han2015deep | 92.6 | 18.25 |
| Pruning + ft. + Hessian-weighted k-means + ft. + I.C. + H.C. | 92.7 | 20.51 |
| Pruning + ft. + Uniform quantization + ft. + I.C. + H.C. | 92.7 | 22.17 |
| Pruning + ft. + Iterative ECSQ + ft. + I.C. + H.C. | 92.7 | 21.01 |
| Soft-to-Hard Annealing + ft. + H. Coding (ours) | 92.1 | 19.15 |
| Soft-to-Hard Annealing + ft. + A. Coding (ours) | 92.1 | 20.15 |
Our compressible model achieves a comparable test accuracy of 92.1% while compressing the DNN by a factor of 19.15 with Huffman coding and 20.15 using arithmetic coding. Table 1 compares our results with state-of-the-art approaches reported by choi2016towards. We note that while the top methods from the literature achieve slightly higher accuracies (92.7%) and compression factors (up to 22.17), they employ a considerable number of hand-designed steps, such as pruning, retraining, various types of weight clustering, and special encoding of the sparse weight matrices into an index-difference based format, and only then use entropy coding. In contrast, we directly minimize the entropy of the weights during training, obtaining a highly compressible representation using standard entropy coding.
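The compression factor used in this benchmark is the ratio of the uncompressed weight storage to the cost of the centers plus the entropy-coded index stream. A minimal sketch of that bookkeeping is given below; the parameter count comes from the text, whereas the number of centers and the bits per weight are hypothetical placeholders, as is the function name.

```python
def compression_factor(num_weights, num_centers, index_stream_bits, float_bits=32):
    """Uncompressed float storage divided by (centers + coded index stream)."""
    uncompressed_bits = num_weights * float_bits
    compressed_bits = num_centers * float_bits + index_stream_bits
    return uncompressed_bits / compressed_bits

num_weights = 464_154              # trainable parameters, as stated in the text
num_centers = 64                   # hypothetical number of quantization levels
bits_per_weight = 1.6              # hypothetical entropy-coded cost per weight
factor = compression_factor(num_weights, num_centers,
                            index_stream_bits=bits_per_weight * num_weights)
print(round(factor, 2))
```

With so many weights, the cost of storing the centers is negligible, so the factor is essentially 32 divided by the average number of bits per coded weight.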
In Fig. 5 in Appendix A.8, we show how the sample entropy $H(p)$ decays and the index histograms develop during training, as the network learns to condense most of the weights to a couple of centers when optimizing (6). In contrast, the methods of han2015learning; han2015deep; choi2016towards manually impose $0$ as the most frequent center by pruning the network weights. We note that the recent work of ullrich2017soft also manages to tackle the problem in a single training procedure, using the minimum description length principle. In contrast to our framework, they take a Bayesian perspective and rely on a parametric assumption on the symbol distribution.
6 Conclusions
In this paper we proposed a unified framework for end-to-end learning of compressed representations for deep architectures. By training with a soft-to-hard annealing scheme, gradually transferring from a soft relaxation of the sample entropy and network discretization process to the actual non-differentiable quantization process, we manage to optimize the rate-distortion trade-off between the original network loss and the entropy. Our framework can elegantly capture diverse compression tasks, obtaining results competitive with the state of the art for both image compression and DNN compression. The simplicity of our approach opens up various directions for future work, since our framework can be easily adapted for other tasks where a compressible representation is desired.
References
 [1] Kodak PhotoCD dataset. http://r0k.us/graphics/kodak/, 1999.
 [2] Eugene L Allgower and Kurt Georg. Numerical continuation methods: an introduction, volume 13. Springer Science & Business Media, 2012.
 [3] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. Endtoend optimization of nonlinear transform codes for perceptual quality. arXiv preprint arXiv:1607.05006, 2016.
 [4] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. Endtoend optimized image compression. arXiv preprint arXiv:1611.01704, 2016.
 [5] Yoojin Choi, Mostafa ElKhamy, and Jungwon Lee. Towards the limit of network quantization. arXiv preprint arXiv:1612.01543, 2016.
 [6] Matthieu Courbariaux, Yoshua Bengio, and JeanPierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
 [7] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 [8] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR09, 2009.
 [9] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologistlevel classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
 [10] Bellard Fabrice. BPG Image format. https://bellard.org/bpg/, 2014.
 [11] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [12] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[13]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition.
In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, June 2016. 
[14]
JiaBin Huang, Abhishek Singh, and Narendra Ahuja.
Single image superresolution from transformed selfexemplars.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.  [15] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran ElYaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
 [16] Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Troy Chinen, Sung Jin Hwang, Joel Shor, and George Toderici. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. arXiv preprint arXiv:1703.10114, 2017.
 [17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 [19] Alex Krizhevsky and Geoffrey E Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.
 [20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [21] Detlev Marpe, Heiko Schwarz, and Thomas Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on circuits and systems for video technology, 13(7):620–636, 2003.
 [22] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
 [23] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [24] Kenneth Rose, Eitan Gurewitz, and Geoffrey C Fox. Vector quantization by deterministic annealing. IEEE Transactions on Information theory, 38(4):1249–1257, 1992.
 [25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [26] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
 [27] Wenzhe Shi, Jose Caballero, Lucas Theis, Ferenc Huszar, Andrew Aitken, Christian Ledig, and Zehan Wang. Is the deconvolution layer the same as a convolutional layer? arXiv preprint arXiv:1609.07009, 2016.
 [28] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [29] David S. Taubman and Michael W. Marcellin. JPEG 2000: Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, Norwell, MA, USA, 2001.
 [30] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszar. Lossy image compression with compressive autoencoders. In ICLR 2017, 2017.
 [31] Radu Timofte, Vincent De Smet, and Luc Van Gool. A+: Adjusted Anchored Neighborhood Regression for Fast Super-Resolution, pages 111–126. Springer International Publishing, Cham, 2015.
 [32] George Toderici, Sean M O’Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085, 2015.
 [33] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full resolution image compression with recurrent neural networks. arXiv preprint arXiv:1608.05148, 2016.
 [34] Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.
 [35] Gregory K Wallace. The JPEG still picture compression standard. IEEE transactions on consumer electronics, 38(1):xviii–xxxiv, 1992.
 [36] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402, Nov 2003.
 [37] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, April 2004.
 [38] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 [39] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
 [40] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30(6):520–540, June 1987.
 [41] Paul Wohlhart, Martin Kostinger, Michael Donoser, Peter M. Roth, and Horst Bischof. Optimizing 1-nearest prototype classifiers. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2013.
 [42] Eyal Yair, Kenneth Zeger, and Allen Gersho. Competitive learning and soft competition for vector quantizer design. IEEE transactions on Signal Processing, 40(2):294–309, 1992.
 [43] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
Appendix A Image Compression Details
A.1 Architecture
We rely on a variant of the compressive autoencoder recently proposed in theis2017lossy , using convolutional neural networks for the image encoder and image decoder. Here, the image encoder (decoder) refers to the left (right) part of the autoencoder, which encodes (decodes) the data to (from) the bottleneck; it should not be confused with the symbol encoder (decoder) in Section 3. The first two convolutional layers in the image encoder each downsample the input image by a factor of 2 and collectively increase the number of channels from 3 to 128. This is followed by three residual blocks, each with 128 filters. Another convolutional layer then downsamples again by a factor of 2 and decreases the number of channels to a value set by a hyperparameter (theis2017lossy use 64 and 96 channels). The output of the image encoder is the “bottleneck tensor”; with three factor-2 downsampling layers, each of its spatial dimensions is 1/8 of that of the input image.
The image decoder then mirrors the image encoder, using upsampling instead of downsampling and deconvolutions instead of convolutions, mapping the bottleneck tensor back to an output image. In contrast to the “sub-pixel” layers shi2016real ; shi2016deconvolution used in theis2017lossy , we use standard deconvolutions for simplicity.
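As a quick sanity check on the encoder described above, the following minimal sketch computes the shape of the bottleneck tensor from the three factor-2 downsampling layers. The function name `bottleneck_shape`, the example image size, and the assumption that the image sides are divisible by 8 (so that no padding or cropping is involved) are illustrative choices and are not taken from the paper.

```python
def bottleneck_shape(height, width, channels, num_downsamplings=3, stride=2):
    """Return (height, width, channels) of the bottleneck tensor.

    Assumes each downsampling layer divides both spatial dimensions
    exactly by `stride` (hypothetical simplification: no padding effects).
    """
    factor = stride ** num_downsamplings  # 2 * 2 * 2 = 8 for the encoder above
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by %d" % factor)
    return (height // factor, width // factor, channels)


# Example: a 512 x 768 image with 96 bottleneck channels
# (96 is one of the channel counts mentioned for theis2017lossy).
print(bottleneck_shape(512, 768, 96))
```

The decoder would invert this mapping, upsampling each spatial dimension by the same overall factor.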
A.2 Hyperparameters
We do vector quantization to centers, using , i.e., . We trained different combinations of and to explore different rate-distortion tradeoffs (measuring distortion in MSE). As controls the extent to which the network minimizes entropy, directly controls bpp (see top left plot in Fig. 3). We evaluated all pairs with and , and selected 5 representative pairs (models) with average bpps roughly corresponding to uniformly spread points in the interval bpp. This defines a “quality index” for our model family, analogous to the JPEG quality factor.
We experimented with the other training parameters on a setup with , which we chose as follows. In the first stage we train for iterations using a learning rate of . In the second stage, we use an annealing schedule with , over iterations using a learning rate of . In both stages, we use a weak regularizer over all learnable parameters, with .
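To make the link between quantization and rate concrete, the sketch below shows hard nearest-center assignment of feature vectors followed by the empirical entropy of the resulting symbol stream, which is the quantity the entropy term is meant to reduce. This is only an illustration under stated assumptions: it covers the hard assignment, not the soft relaxation or annealing used during training, and the function names and toy centers are hypothetical.

```python
import math
from collections import Counter


def quantize(vectors, centers):
    """Map each vector to the index (symbol) of its nearest center,
    using squared Euclidean distance."""
    symbols = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
        symbols.append(dists.index(min(dists)))
    return symbols


def empirical_entropy_bits(symbols):
    """Empirical entropy (bits per symbol) of a symbol sequence."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


# Toy example with two hypothetical 2-dimensional centers.
centers = [(0.0, 0.0), (1.0, 1.0)]
vectors = [(0.1, 0.0), (0.9, 1.2), (0.2, -0.1), (1.0, 1.0)]
symbols = quantize(vectors, centers)
print(symbols, empirical_entropy_bits(symbols))
```

A symbol distribution concentrated on fewer centers gives a lower entropy and therefore a cheaper bitstream, at the cost of higher distortion.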
A.3 Effect of Vector Quantization and Entropy Loss
To investigate the effect of vector quantization, we trained models as described in Section 4, but instead of using vector quantization, we set and quantized to dimensional (scalar) centers, i.e., . Again, we chose 5 representative pairs . We chose to get approximately the same number of unique symbol assignments as for patches, i.e., .
To investigate the effect of the entropy loss, we trained models using centers for (as described above), but used .
Fig. 2 shows how both vector quantization and entropy loss lead to higher compression rates at a given reconstruction MSE compared to scalar quantization and training without entropy loss, respectively.
A.4 Effect of Annealing
A.5 Data Sets and Quality Measure Details
Kodak kodakurl is the most frequently employed dataset for analyzing image compression performance in recent years. It contains 24 color images covering a variety of subjects, locations and lighting conditions.
B100 Timofte2014 is a set of 100 content diverse color test images from the Berkeley Segmentation Dataset MartinFTM01 .
Urban100 HuangCVPR2015 has 100 color images selected from Flickr with labels such as urban, city, architecture, and structure. The images are larger than those from B100 or Kodak, in that the longer side of an image is always bigger than 992 pixels. Both B100 and Urban100 are commonly used to evaluate image super-resolution methods.
ImageNET100 contains 100 images randomly selected by us from ImageNET ImageNet_ILSVRC15 , also downsampled and cropped, see above.
Quality measures.
PSNR (peak signal-to-noise ratio) is a standard measure that is a monotonically decreasing function of the mean squared error (MSE) computed between two signals. SSIM and MS-SSIM are the structural similarity index SSIM and its multi-scale variant SSIMMS , proposed to measure the similarity of two images. They correlate better with human perception than PSNR.
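The relation between PSNR and MSE mentioned above can be sketched with the standard definition PSNR = 10 log10(peak^2 / MSE). The snippet assumes 8-bit images (peak value 255) given as flat sequences of pixel values; the function names are illustrative.

```python
import math


def mse(reference, distorted):
    """Mean squared error between two equally long pixel sequences."""
    if len(reference) != len(distorted):
        raise ValueError("signals must have the same length")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)


def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical signals."""
    err = mse(reference, distorted)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


original = [10, 20, 30, 40]
print(psnr(original, [11, 19, 31, 39]))  # small error, higher PSNR
print(psnr(original, [20, 10, 40, 30]))  # larger error, lower PSNR
```

Because PSNR is a decreasing function of MSE, ranking two reconstructions of the same image by either measure gives the same order.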
We compute quantitative similarity scores between each compressed image and the corresponding uncompressed image and average them over each dataset. For comparison with JPEG we used libjpeg (http://libjpeg.sourceforge.net/), and for JPEG 2000 we used the Kakadu implementation (http://kakadusoftware.com/), in both cases subtracting the size of the header from the file size to compute the compression rate. For comparison with BPG we used the reference implementation (https://bellard.org/bpg/) and took the value reported in the picture_data_length header field as the file size.
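The rate computation described above can be illustrated as follows: the header size is subtracted from the file size, and the remaining payload is expressed in bits per pixel (bpp). This is a minimal sketch; the function name and the example file and header sizes are hypothetical, and how the header size is obtained depends on the codec.

```python
def bits_per_pixel(file_size_bytes, header_size_bytes, width, height):
    """Compression rate in bits per pixel, excluding the file header."""
    payload_bytes = file_size_bytes - header_size_bytes
    if payload_bytes < 0:
        raise ValueError("header cannot be larger than the file")
    return 8.0 * payload_bytes / (width * height)


def dataset_average(values):
    """Average a per-image quantity (rate or similarity score) over a dataset."""
    return sum(values) / len(values)


# Hypothetical example: a 768 x 512 image stored in 49,252 bytes,
# of which 100 bytes are header.
rate = bits_per_pixel(49252, 100, 768, 512)
print(rate)
print(dataset_average([rate, 0.5]))
```

For a codec that reports the payload length directly, as with the picture_data_length field mentioned above, the header size would simply be set to zero and the reported value used as the file size.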
A.6 Image Compression Performance
A.7 Image Compression Visual Examples
An online supplement with visual examples is available at http://www.vision.ee.ethz.ch/~aeirikur/compression/visuals2.pdf, showing the output of compressing the first four images of each of the four datasets with our method, BPG, JPEG, and JPEG 2000 at low bitrates.