Efficient and Effective Context-Based Convolutional Entropy Modeling for Image Compression

by   Mu Li, et al.
City University of Hong Kong

It has long been understood that precisely estimating the probabilistic structure of natural visual images is crucial for image compression. Despite the remarkable success of recent end-to-end optimized image compression, the latent code representation is assumed to be fully statistically factorized such that the entropy modeling is feasible. Here we describe context-based convolutional networks (CCNs) that exploit statistical redundancies in the codes for improved entropy modeling. We introduce a 3D zigzag coding order together with a 3D code dividing technique to define proper context and to achieve parallel entropy decoding, both of which boil down to placing translation-invariant binary masks on convolution filters of CCNs. We demonstrate the power of CCNs for entropy modeling in both lossless and lossy image compression. For the former, we directly apply a CCN to binarized image planes for estimating the Bernoulli distribution of each code. For the latter, the categorical distribution of each code is represented by a discretized mixture of Gaussian distributions, whose parameters are estimated by three CCNs. We jointly optimize the CCN-based entropy model with analysis and synthesis transforms for rate-distortion performance. Experiments on two image datasets show that the proposed lossless and lossy image compression methods based on CCNs generally exhibit better compression performance than existing methods with manageable computational complexity.





I Introduction

Data compression has played a significant role in engineering for centuries [1]. Compression can be either lossless or lossy. Lossless compression allows perfect data reconstruction from compressed bitstreams with the goal of assigning shorter codewords to more “probable” codes. Typical examples include Huffman coding [2], arithmetic coding [3], and range coding [4]. Lossy compression discards “unimportant” information, and the definition of importance is application-dependent. For example, if the data (such as images and videos) are meant to be consumed by the human visual system, importance should be measured in accordance with human perception, discarding data that are perceptually redundant, while keeping those that are most visually noticeable. In lossy compression, one must face the rate-distortion trade-off, where the rate is computed by the entropy of the discrete codes [5] and the distortion is measured by a signal fidelity metric. A prevailing scheme in the context of lossy image compression is transform coding, which consists of three operations - transformation, quantization, and entropy coding. Transforms are designed to be better-suited for exploiting aspects of human perception, and map an image to a latent code space with several statistical advantages. Early transforms [6, 7] are linear, invertible, and fixed for all bit rates; errors arise only from quantization. Recent transforms take the form of deep neural networks (DNNs) [8], aiming for more comprehensible and nonlinear representations. DNN-based transforms are mostly non-invertible, which, however, may encourage discarding perceptually unimportant image features during transformation. This gives us an opportunity to learn different transforms at different bit rates for better rate-distortion performance. Entropy coding is responsible for losslessly compressing the quantized codes into bitstreams for storage and transmission.

In either lossless or lossy image compression, a discrete probability distribution of the latent codes shared by the encoder and the decoder (i.e., the entropy model) is essential in determining the compression performance. According to Shannon’s source coding theorem [5], given a vector of code intensities $\boldsymbol{y} = (y_1, \ldots, y_N)$, the optimal code length of $\boldsymbol{y}$ should be $\lceil -\log_2 P(\boldsymbol{y}) \rceil$, where binary symbols are assumed to construct the codebook. Without further constraints, the general problem of estimating $P(\boldsymbol{y})$ in high-dimensional spaces is intractable, a problem commonly known as the curse of dimensionality. For this reason, most entropy coding schemes assume $\boldsymbol{y}$ is fully statistically factorized with the same marginal distribution, leading to a code length of $\lceil -\sum_{i=1}^{N} \log_2 P(y_i) \rceil$. Alternatively, the chain rule in probability theory offers a more accurate approximation

$$P(\boldsymbol{y}) \approx \prod_{i=1}^{N} P\big(y_i \mid \mathrm{PTX}(y_i, \boldsymbol{y})\big), \tag{1}$$

where $\mathrm{PTX}(y_i, \boldsymbol{y}) \subset \{y_1, \ldots, y_{i-1}\}$ represents the partial context of $y_i$ coded before it in $\boldsymbol{y}$. A representative example is the context-based adaptive binary arithmetic coding (CABAC) [9] in H.264/AVC, which considers the two nearest codes as the context and obtains noticeable improvements over previous image/video compression standards. As the size of $\mathrm{PTX}(y_i, \boldsymbol{y})$ becomes large, it is difficult to estimate this conditional probability by constructing histograms. Recent methods such as PixelRNN [10] and PixelCNN [11] take advantage of DNNs in modeling long range relations among pixels, making the estimation of $P(y_i \mid \mathrm{PTX}(y_i, \boldsymbol{y}))$ with larger partial contexts feasible. Nevertheless, these methods are computationally intensive, taking hours for medium-resolution images.
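To make the gap between a fully factorized model and a context-based model concrete, the following minimal sketch compares the two ideal code lengths on a toy sequence. It is only an illustration under stated assumptions: probabilities are estimated by histograms built from the sequence itself (so the cost of transmitting the model is ignored), the context is simply the previous code, and the function names `factorized_code_length` and `context_code_length` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def factorized_code_length(codes):
    """Ideal code length (bits) when every code shares one marginal distribution."""
    counts = Counter(codes)
    n = len(codes)
    return -sum(math.log2(counts[c] / n) for c in codes)


def context_code_length(codes, order=1):
    """Ideal code length (bits) when each code is conditioned on the
    `order` codes that precede it (a very small partial context)."""
    ctx_counts = defaultdict(Counter)
    for i, c in enumerate(codes):
        ctx = tuple(codes[max(0, i - order):i])
        ctx_counts[ctx][c] += 1
    bits = 0.0
    for i, c in enumerate(codes):
        ctx = tuple(codes[max(0, i - order):i])
        total = sum(ctx_counts[ctx].values())
        bits -= math.log2(ctx_counts[ctx][c] / total)
    return bits


codes = [0, 1] * 16  # strongly structured toy sequence of 32 binary codes
print(factorized_code_length(codes))  # 32.0 bits: the marginal is uniform
print(context_code_length(codes))     # 0.0 bits: each code is determined by its predecessor
```

The histogram approach stops scaling once the context grows, since the number of distinct contexts explodes; this is the situation in which a learned estimator becomes attractive.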

We describe context-based convolutional networks (CCNs) for parallel entropy modeling. Given a code block, we specify a 3D zigzag coding order so that the most relevant codes of each code can be included in its context. Parallel computation during entropy encoding is straightforward as the context of each code is known and readily available. However, this is not always the case during entropy decoding because the context of a code is not ready for probability estimation until all codes in that context have been decoded sequentially, which is prohibitively slow. To address this issue, we introduce a 3D code dividing technique, which partitions the code block into multiple groups in compliance with the proposed coding order. The codes within each group are assumed to be conditionally independent given their respective contexts, and therefore can be decoded in parallel. In the context of CCNs, this amounts to applying properly designed translation-invariant masks to convolutional filters.

To validate the proposed CCNs, we combine them with arithmetic coding [3] for entropy coding. For lossless image compression, we convert the input grayscale image into eight binary planes and train a CCN to predict the Bernoulli distribution of each binary code, optimizing for the entropy loss in information theory [12]. For lossy image compression, we parameterize the categorical distribution of each code with a discretized mixture of Gaussian (MoG) distributions, whose parameters (i.e., mixture weights, means, and variances) are estimated by three CCNs, depending on its context. The CCN-based entropy model is jointly optimized with analysis and synthesis transforms (i.e., mappings between raw pixel space and latent code space) over a database of training images for rate-distortion performance. Across two independent sets of images, we find that the optimized methods for lossless and lossy image compression perform favorably against image compression standards and DNN-based methods, especially at low bit rates.

II Related Work

In this section, we provide a detailed overview of DNN-based entropy models and lossy image compression methods. For traditional image compression techniques with separately optimized components, we refer the interested readers to [13, 14, 15].

II-A DNN-Based Entropy Models

The accuracy of entropy coding largely affects rate-distortion performance of image compression systems. In this process, the first and the most important step is to estimate the probability of the codes. In most image compression techniques, the codes are assumed to be fully statistically factorized, so that their entropy can be easily computed through the marginal distributions [8, 16, 17, 18]. Arguably, natural images that undergo a highly nonlinear analysis transform still exhibit strong statistical redundancies [19], which suggests that incorporating context into probability estimation has great potential for improving entropy modeling performance.

Considerable progress has been made in DNN-based context modeling for natural languages and images. In natural language processing, recurrent neural networks (RNNs) and long short-term memories [21] are two popular tools to model long-range dependencies. In image processing, PixelRNN [10] and PixelCNN [11] are among the first attempts to exploit long-range pixel dependencies for image generation. However, the above-mentioned methods are computationally inefficient, requiring one forward propagation to generate a single pixel or to estimate its probability. To speed up PixelCNN, Salimans et al. [22] proposed PixelCNN++, which is able to sample a larger intermediate image conditioned on an initial image in a single forward pass. This process may be iterated to generate the final result. When viewing PixelCNN++ and its variants as entropy models, we must losslessly compress the initial image into bitstreams as side information, and send them to the decoder for entropy decoding.

Only recently have DNNs for context-based entropy modeling become an active research topic. Ballé et al. [19] introduced a scale prior, which stores a variance parameter for each code as side information. Richer side information generally leads to more accurate entropy modeling. However, as discussed before, this type of information should be quantized, compressed and considered part of the codes, and it is difficult to trade off the bits saved by the improved entropy model and the bits introduced by storing this information. Li et al. [23] extracted a small code block for each code as its context, and adopted a simple DNN for entropy modeling. The method suffers from heavy computational complexity similar to PixelRNN [10]. Li et al. [24] and Mentzer et al. [25] achieved parallel entropy encoding with masked DNNs. However, sequential entropy decoding has to be performed due to the context dependence of the codes, which is painfully slow. In contrast, our CCN-based entropy model equipped with a 3D zigzag coding order and a 3D code dividing technique permits parallel entropy coding, and is a more practical choice for industrial usage.

II-B DNN-Based Lossy Image Compression Methods

A major problem in end-to-end lossy image compression is that the gradients of the quantization function are zeros almost everywhere, making gradient descent-based optimization ineffective. Based on the strategies of alleviating the zero-gradient problem of quantization, DNN-based lossy image compression methods can be divided into different categories.

From a signal processing perspective, the quantizer can be approximated with an additive i.i.d. uniform noise, which has the same width as the quantization bin [26]. A desired property of this approximation is that the resulting density function is a continuous relaxation of the probability mass function of the quantized codes [8]. Another line of research introduced more continuous functions (without the zero-gradient problem) to approximate the quantization function. The step quantizer is used in the forward pass, while its continuous proxy is used in the backward pass. Toderici et al. [27] learned an RNN to compress small-size images to binary maps in a progressive manner. They later tested their models on large-size images [28]. Johnston et al. [29] exploited adaptive bit allocations and perceptual losses to boost the compression performance especially in terms of MS-SSIM [30]. These methods are not explicitly optimized for rate-distortion performance, but treat entropy coding of the binary codes as a post-processing step. Ballé et al. [8] explicitly formulated DNN-based image compression under the framework of rate-distortion optimization. Assuming the codes are statistically factorized, they learned piece-wise linear density functions to compute differential entropy as an approximation to discrete entropy. In a subsequent work [19], each code is assumed to be zero-mean Gaussian with its own standard deviation, which is separately predicted with side information. Minnen et al. [31] combined the autoregressive and hierarchical priors, which led to improved rate-distortion performance. Theis et al. [16] introduced a continuous upper bound of the discrete entropy with a Gaussian scale mixture. Rippel et al. [18] described pyramid-based analysis and synthesis transforms with adaptive code length regularization for real-time image compression. An adversarial loss [32] is incorporated to generate visually realistic results at low bit rates [18]. Agustsson et al. [17] introduced a soft-to-hard relaxation scheme by approximating quantization with a parametric softmax function.

III CCNs for Entropy Modeling

In this section, we present in detail the construction of CCNs for entropy modeling. We start with a fully convolutional network, consisting of layers of convolutions followed by point-wise nonlinear activation functions. In order to perform efficient context-based entropy modeling, three assumptions are made on the network architecture:

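  • For a code block $\boldsymbol{y} \in \mathbb{Q}^{M \times H \times W}$, where $M$, $H$, and $W$ denote channel, height, and width, respectively, the output of the $t$th layer convolution is $\boldsymbol{v}^{(t)} \in \mathbb{R}^{M \times H \times W \times N_t}$, where $N_t$ denotes the number of feature blocks to represent $\boldsymbol{y}$. By doing so, we are able to associate the feature point $v^{(t)}_{i,j}(p,q)$ in the $i$th channel and $j$th feature block at spatial location $(p,q)$ with $y_i(p,q)$ uniquely.

  • Let $\mathrm{CTX}(y_i(p,q), \boldsymbol{y})$ be the set of codes encoded before $y_i(p,q)$ (full context), and $\mathrm{SS}(v^{(t)}_{i,j}(p,q))$ be the set of codes in the receptive field of $v^{(t)}_{i,j}(p,q)$ that contributes to its computation (support set), respectively. Then, $\mathrm{SS}(v^{(t)}_{i,j}(p,q)) \subset \mathrm{CTX}(y_i(p,q), \boldsymbol{y})$.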

  • For and , .

Assumption I establishes a one-to-many correspondence between the input code block and the output feature representation . Assumption II ensures that the computation of depends only on a subset of . Together, the two assumptions guarantee the legitimacy of context-based entropy modeling in fully convolutional networks, which can be achieved by placing binary masks to convolution filters. Then, can be computed from , which specify the parameters of the conditional distribution. Assumption III allows the masks to be translation-invariant, which can be achieved by properly designed coding orders. Specifically, we start with the case of a 2D code block, where , and write mask convolution at the th layer as


where and denote 2D convolution and Hadamard product, respectively. is a 2D convolution filter and is the corresponding 2D binary mask. According to Assumption I, the input and the output are of the same size as . The input code block corresponds to .
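The masked convolution described above can be sketched in a few lines for the 2D raster-order case. This is a simplified illustration rather than the authors' implementation: it assumes a single channel and a single feature block, zero padding, the cross-correlation convention common in deep learning frameworks, and hypothetical helper names (`raster_mask`, `masked_conv2d`). The mask keeps only filter taps that point at codes preceding the center in raster order; the center tap is dropped at the input layer and kept in hidden layers.

```python
def raster_mask(k, include_center):
    """k x k binary mask for a raster coding order (left to right, top to bottom).

    Taps strictly before the center in raster order are kept (1). The center
    tap is kept only when include_center is True (hidden layers); it is
    dropped at the input layer so that a code never sees itself.
    """
    c = k // 2
    mask = [[0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            before = (a < c) or (a == c and b < c)
            if before or (include_center and a == c and b == c):
                mask[a][b] = 1
    return mask


def masked_conv2d(x, w, mask):
    """Zero-padded, same-size 2D correlation of x with the Hadamard product w * mask."""
    h, wd, k = len(x), len(x[0]), len(w)
    c = k // 2
    out = [[0.0] * wd for _ in range(h)]
    for p in range(h):
        for q in range(wd):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    pp, qq = p + a - c, q + b - c
                    if 0 <= pp < h and 0 <= qq < wd:
                        acc += w[a][b] * mask[a][b] * x[pp][qq]
            out[p][q] = acc
    return out


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
w = [[1.0] * 3 for _ in range(3)]
m_in = raster_mask(3, include_center=False)

y1 = masked_conv2d(x, w, m_in)
x[1][1] = 100.0                      # change the code at the center position
y2 = masked_conv2d(x, w, m_in)

print(y1[1][1], y2[1][1])            # 10.0 10.0: the output at (1, 1) ignores its own code
print(y1[1][2], y2[1][2])            # 10.0 105.0: the next position in raster order does see it
```

Because the same mask is applied at every spatial location, the causality constraint is enforced without any per-position bookkeeping, which is what makes parallel encoding straightforward.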

Fig. 1: Illustration of 2D mask convolution in the input layer of the proposed CCN for entropy modeling. A raster coding order (left to right, top to bottom) and a filter size of are assumed here. The orange and blue dashed regions indicate the full context of the orange and blue codes, respectively. In the right panel, we highlight the support sets of the two codes in corresponding colors, which share the same mask.
Fig. 2: Illustration of code dividing techniques in conjunction with different coding orders for a 2D code block. The orange and blue dots represent two nearby codes. The gray dots denote codes that have already been encoded, while the white circles represent codes yet to be encoded. (a) Raster coding order adopted in many compression methods. (b) Support sets of the orange and blue codes, respectively. It is clear that the orange code is in the support set of the blue one, and therefore should be decoded first. (c) Code dividing scheme for the raster coding order. By removing the dependencies among codes in each row, the orange and blue codes can be decoded in parallel. However, the orange code is excluded from the support set of the blue one, which may hinder entropy estimation accuracy. (d) Zigzag coding order and its corresponding code dividing scheme. The two codes in the orange squares that are important for the orange code in entropy prediction are retained in its partial context. (e) Support sets of the orange and blue codes in compliance with the zigzag coding order.
Fig. 3: Illustration of the proposed 3D zigzag coding order and 3D code dividing technique. (a) Each group in the shape of a slant is highlighted in green. Specifically, are encoded first, then . Within , we first process codes along the line by gradually decreasing . We then process codes along the line with the same order. The procedure continues until we sweep codes along the last line in . (b) Support sets of the orange and blue codes with a spatial filter size of .

For the input layer of a fully convolutional network, the codes to produce is , where is the set of local indexes centered at . We choose


which can be achieved by setting


Fig. 1 illustrates the concepts of full context , support set , and translation-invariant mask , respectively. At the th layer, if a code , we have


where the first line follows by induction and the second line follows from the definition of context. That is, as long as is in the context of , we are able to compute from without violating Assumption II. Unlike the input layer, with is also generated from , and can be used to compute . Therefore, we update the mask at the th layer

Fig. 4: Illustration of masked codes with , , and a filter size of . Blue dots represent codes activated by the mask and red dots indicate the opposite. The only difference lies in the green slant. (a) Input layer. (b) Hidden layer.
Fig. 5: The proposed CCN-based entropy model for lossless image compression. The grayscale image is first converted to bit-plane representation , which is fed to the network to predict the mean estimates of Bernoulli distributions . The size of convolution filters and the number of feature blocks in intermediate layers are set to and , respectively. Each convolution layer is followed by a parametric ReLU nonlinearity, except for the last layer, where a sigmoid function is applied. From the mean estimates, we find that for most significant bit-planes, our model makes more confident predictions that closely follow local image structures. For least significant bit-planes, our model is less confident, producing mean estimates close to 0.5.


With the proposed CCN with translation-invariant masks in Eq. (4) and Eq. (6), the codes can be encoded in parallel. However, this is not the case in the entropy decoding phase. As shown in Fig. 2 (a) and (b), the two nearby codes in the same row (highlighted in orange and blue, respectively) cannot be decoded simultaneously because the orange code is in the support set (or context) of the blue code given the raster coding order. To speed up entropy decoding with parallel computation, we may further remove dependencies among codes at some cost in model accuracy. Specifically, we partition the code block into groups, namely, , and remove dependencies among the codes within the same group, resulting in a partial context for each code. As a result, all codes in the th group share the same partial context, and can be decoded in parallel. Note that code dividing schemes are largely constrained by the pre-specified coding order. For example, if we use a raster coding order, it is straightforward to divide the code block by row. In this case, ( and index vertical and horizontal directions, respectively), which is extremely important in predicting the probability of according to CABAC [9], has been excluded from its partial context. To make a good trade-off between modeling efficiency and accuracy, we adopt a zigzag coding order as shown in Fig. 2 (d), where and . As such, we retain the most relevant codes in the partial context for better entropy modeling (see Fig. 9 for quantitative results). Accordingly, the mask at the th layer becomes


Now, we extend our discussion to a 3D code block, where . Fig. 3 (a) shows the proposed 3D zigzag coding order and 3D code dividing technique (zoom in for improved visibility). Specifically, is divided into groups in the shape of slants, where the th one is specified by . The partial context of is defined as . We then write mask convolution in the 3D case as


where and are indexes for the feature block and channel dimensions, respectively. For the 2D case, each layer shares the same mask (). When extending to 3D code blocks, each channel in a layer shares a mask, and there are a total of 3D masks. For the input layer, the codes to produce is , based on which we define the mask


For the th layer, we modify the mask to include the current slant


as shown in Fig. 4, where we highlight the difference in the green slant.
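The code dividing idea can be illustrated with a short sketch. It rests on an assumption that is not spelled out in the surviving text: each slant-shaped group is taken to contain all positions whose channel, row, and column indices sum to the same value, so a block with M channels, height H, and width W yields M + H + W - 2 groups. The function name `slant_groups` is hypothetical, and the ordering of codes inside a group is arbitrary here.

```python
from itertools import product


def slant_groups(M, H, W):
    """Partition an M x H x W code block into groups with a constant index sum.

    Assumption of this sketch: group k holds every position (r, p, q) with
    r + p + q == k. Groups are decoded one after another, while all codes
    inside one group are decoded in parallel.
    """
    groups = [[] for _ in range(M + H + W - 2)]
    for r, p, q in product(range(M), range(H), range(W)):
        groups[r + p + q].append((r, p, q))
    return groups


groups = slant_groups(2, 3, 3)
print(len(groups))   # 6 sequential decoding steps instead of 2 * 3 * 3 = 18
print(groups[1])     # [(0, 0, 1), (0, 1, 0), (1, 0, 0)]

# The nearest neighbours (previous channel, row above, column to the left)
# always fall in the immediately preceding group, so they stay available
# as context when the current group is decoded.
index = {pos: k for k, group in enumerate(groups) for pos in group}
r, p, q = 1, 1, 1
print(index[(r, p, q)], index[(r - 1, p, q)], index[(r, p - 1, q)], index[(r, p, q - 1)])  # 3 2 2 2
```

Setting M = 1 recovers the 2D case, where the groups are the anti-diagonals of the code block. Contrast this with dividing by row under a raster order, where the left neighbour lands in the same group as the current code and is therefore lost from its partial context.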

IV CCN-Based Entropy Models for Lossless Image Compression

In this section, we combine our CCN-based entropy model with the arithmetic coding algorithm for lossless image compression.

As a starting point, we binarize the grayscale image to obtain a 3D code block


where we index the most significant bit-plane with . Our CCN takes as input and produces a feature block (the superscript is omitted for notation convenience) of the same size to compute the mean estimates of Bernoulli distributions . Fig. 5 shows the network architecture, which consists of eleven mask convolution layers with parametric ReLU nonlinearities in between. The last convolution responses undergo a sigmoid nonlinearity to constrain the dynamic range within $(0, 1)$. We make four residual connections as suggested in [33] to accelerate training. We will experiment with two hyper-parameters in CCN: the size of convolution filters for all layers and the number of feature blocks in hidden layers .
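The binarization step that turns a grayscale image into a 3D code block can be sketched as follows. This is an illustration only: it assumes 8-bit pixel values stored as nested lists, and it assumes that plane index 0 is the most significant bit-plane (the exact index is not recoverable from the text above). The names `to_bit_planes` and `from_bit_planes` are hypothetical.

```python
def to_bit_planes(image, bits=8):
    """Split a grayscale image (rows of ints in [0, 2**bits)) into binary planes.

    Assumption: planes[0] is the most significant bit-plane.
    The result is indexed as planes[r][p][q], i.e. plane, row, column.
    """
    return [[[(px >> (bits - 1 - r)) & 1 for px in row] for row in image]
            for r in range(bits)]


def from_bit_planes(planes):
    """Inverse of to_bit_planes; shows that the binarization is lossless."""
    bits = len(planes)
    h, w = len(planes[0]), len(planes[0][0])
    return [[sum(planes[r][p][q] << (bits - 1 - r) for r in range(bits))
             for q in range(w)] for p in range(h)]


image = [[0, 255], [128, 5]]
planes = to_bit_planes(image)
print(planes[0])                         # [[0, 1], [1, 0]]  most significant plane
print(planes[7])                         # [[0, 1], [0, 1]]  least significant plane
print(from_bit_planes(planes) == image)  # True
```

The eight planes play the role of the channel dimension of the code block, so the 3D coding order and code dividing technique apply to them directly.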

Fig. 6: The architecture of the proposed lossy image compression method, which consists of an analysis transform , a non-uniform and trainable quantizer , a CCN-based entropy model, and a synthesis transform . Conv: regular convolution with filter support () and number of channels (output × input). Down-upsampling: implemented jointly with the adjacent convolution (also referred to as strided convolution). DenseBlock: matches the input channel number of the preceding convolution. is the channel number in DenseBlock set empirically. MConv: mask convolution used in our CCNs with filter support () and number of feature blocks (output × input). Note that the number of channels is fixed in MConv, and is determined by that of .

To optimize the network parameters, which are collectively denoted by , we apply the expected code length as the empirical loss


where is an indicator function and the expectation may be approximated by averaging over a mini-batch of training images. Finally, we implement our own arithmetic coding with the learned CCN-based entropy model to compress to bitstreams, and report performance using actual bit rates. This facilitates comparison against widely used image compression standards.
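For binary codes, the expected code length used as the training loss is the familiar cross-entropy measured in bits. The sketch below computes it for a handful of codes; it assumes the network output is the predicted probability that each code equals 1, clips probabilities for numerical safety (the `eps` value is an arbitrary choice of this sketch), and uses the hypothetical name `bernoulli_code_length`.

```python
import math


def bernoulli_code_length(bits, probs, eps=1e-12):
    """Total ideal code length (in bits) of binary codes under predicted P(code = 1)."""
    total = 0.0
    for y, p in zip(bits, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log2 finite
        total -= math.log2(p) if y == 1 else math.log2(1.0 - p)
    return total


codes = [1, 0, 1, 1]
uniform = bernoulli_code_length(codes, [0.5, 0.5, 0.5, 0.5])
confident = bernoulli_code_length(codes, [0.9, 0.1, 0.9, 0.9])
print(uniform)              # 4.0 bits: one bit per code without any knowledge
print(confident < uniform)  # True: accurate, confident predictions shorten the code
```

An arithmetic coder driven by the same probabilities approaches this ideal length, which is why minimizing the loss translates into lower actual bit rates.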

V CCN-Based Entropy Models for Lossy Image Compression

In lossy image compression, our objective is to minimize a weighted sum of rate and distortion, , where governs the trade-off between the two terms. As illustrated in Fig. 6, our compression method consists of four components: an analysis transform , a quantizer , a CCN-based entropy model, and a synthesis transform . The analysis transform takes a color image as input and produces the latent code representation . consists of three convolutions, each of which is followed by down-sampling with a factor of two. A dense block [34] comprised of seven convolutions is employed after each down-sampling. After the last dense block, we add another convolution layer with filters to produce . Empirically, the parameter sets the upper bound of the bit rate that a general DNN-based compression method can achieve. The parameters of constitute the parameter vector to be optimized.

The synthesis transform is a mirror of the analysis transform. Particularly, the depth-to-space reshaping [28, 35] is adopted to up-sample the feature maps. The last convolution layer with three filters is responsible for producing the compressed image in RGB space. The parameters of constitute the parameter vector to be optimized.

For the quantizer , we parameterize its quantization centers for the th channel by , where is the number of quantization centers and . The monotonicity of can be enforced by a simple re-parameterization based on cumulative functions. Given a fixed set of , we perform quantization by mapping to its nearest center that minimizes the quantization error


The quantizer has zero gradients almost everywhere, which hinders training via back-propagation. Taking inspiration from binarized neural networks [36, 37, 38], we make use of an identity mapping as a more continuous proxy to the step quantization function. During training, we use the step quantizer and its identity proxy in the forward and backward passes, respectively.
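A minimal sketch of the quantizer may help fix ideas. It makes several assumptions that go beyond the text: the monotone re-parameterization is realized as a first center plus a cumulative sum of exponentiated (hence positive) steps, the quantization error is measured by absolute difference, and the backward pass is written out by hand as a plain identity instead of relying on an automatic differentiation framework. All function names are hypothetical.

```python
import math


def monotone_centers(first, raw_steps):
    """Build strictly increasing centers: first + cumulative sum of positive steps.

    Using exp() to keep the steps positive is an assumption of this sketch;
    the text only states that the re-parameterization is based on cumulative
    functions.
    """
    centers, c = [first], first
    for s in raw_steps:
        c += math.exp(s)
        centers.append(c)
    return centers


def quantize(value, centers):
    """Forward pass: map a value to the center with the smallest absolute error."""
    return min(centers, key=lambda c: abs(value - c))


def quantize_backward(upstream_grad):
    """Backward pass: identity proxy, so the gradient passes through unchanged."""
    return upstream_grad


centers = monotone_centers(-1.0, [0.0, 0.0])
print(centers)                 # [-1.0, 0.0, 1.0]
print(quantize(0.3, centers))  # 0.0
print(quantize(0.7, centers))  # 1.0
print(quantize_backward(0.25)) # 0.25
```

Whatever unconstrained values the raw steps take during training, the resulting centers remain ordered, so the nearest-center rule always partitions the real line into contiguous bins.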

The quantization centers should be optimized by minimizing the mean squared error (MSE),


which is essentially a $k$-means clustering problem, and can be solved efficiently by Lloyd’s algorithm [39]. Specifically, we initialize the quantization centers using uniform quantization, which appears to work well in all experiments. To make parameter learning of the entire model smoother, we adjust the quantization centers using stochastic gradient descent instead of a closed-form update.

Without prior knowledge of the categorical distributions of the quantized codes , we choose to work with discretized MoG distributions, whose parameters are predicted by the proposed CCNs. We write the differentiable MoG distribution with components as


where , and are the mixture weight, mean, and variance of the -th component, respectively. Then,


where is the quantization bin that lies in.
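The discretization of the mixture can be sketched as the mixture's cumulative distribution function evaluated at the two edges of each quantization bin. The sketch assumes that bin edges are the midpoints between adjacent quantization centers and that the two outer bins extend to infinity, so that the probabilities sum to one; the text above does not confirm these choices. The names `gaussian_cdf` and `discretized_mog_pmf` are hypothetical, and the parameters are standard deviations rather than variances.

```python
import math


def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian with the given mean and std."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def discretized_mog_pmf(centers, weights, means, stds):
    """Probability mass of each quantization center under a mixture of Gaussians.

    Assumption: bin edges are midpoints between adjacent (sorted) centers,
    with the first and last bins extended to minus and plus infinity.
    """
    mids = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    edges = [-math.inf] + mids + [math.inf]
    pmf = []
    for lower, upper in zip(edges, edges[1:]):
        mass = sum(w * (gaussian_cdf(upper, m, s) - gaussian_cdf(lower, m, s))
                   for w, m, s in zip(weights, means, stds))
        pmf.append(mass)
    return pmf


centers = [-1.0, 0.0, 1.0]
pmf = discretized_mog_pmf(centers, weights=[0.5, 0.5], means=[-1.0, 1.0], stds=[0.3, 0.3])
print(abs(sum(pmf) - 1.0) < 1e-9)  # True: a valid categorical distribution
print(pmf[1] < pmf[0])             # True: the bimodal mixture puts little mass on the middle bin
```

A single Gaussian could not place most of its mass on the two outer centers at once, which is the kind of multimodal shape the mixture is meant to capture.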

Now we describe the proposed entropy model in lossy image compression, which is comprised of three CCNs with the same structure, as shown in Fig. 6. Each CCN consists of nine mask convolutions with three residual connections to produce feature blocks, matching the number of components in MoG. They separately output mixture weights, means and variances to build the discretized MoG distributions. The network parameters of our CCNs constitute the parameter vector to be optimized.

Fig. 7: Bit rates (in terms of bpp) of different DNN-based entropy models for lossless image compression on the Kodak dataset. SIN() refers to a side information network that allocates output channels to represent side information. The orange and gray bars represent the bit rates from the image and the side information, respectively.

Now we are able to write the empirical rate-distortion objective for the parameters as


The distortion term is preferably assessed in a perceptual space. In this paper, we optimize and evaluate our lossy image compression methods using standard MSE and a perceptual metric MS-SSIM [30]. As in lossless image compression, we combine the optimized entropy model with arithmetic coding, and measure the rate using actual bit rates.

VI Experiments

In this section, we test the proposed CCN-based entropy models in lossless and lossy image compression by comparing to state-of-the-art image coding standards and recent deep image compression algorithms. We first collect high-quality and high-definition images from Flickr, and down-sample them to further reduce possibly visible artifacts. We crop grayscale patches of size and color patches of size as the training sets for lossless and lossy image compression, respectively. We test our models on two independent datasets - Kodak and Tecnick [40], which are widely used to benchmark compression performance. The Caffe implementations along with the pre-trained models are made available at


VI-A Lossless Image Compression

We train our CCN-based entropy model using the Adam stochastic optimization package [41] by minimizing the objective in Eq. (12). We start with a learning rate of , and subsequently lower it by a factor of when the loss plateaus, until . The (actual) bit rate in terms of bits per pixel (bpp) is used to quantify the compression performance, which is defined as the ratio between the total amount of bits used to code the image and the number of pixels in the image. A smaller bpp indicates better performance. For example, an uncompressed grayscale image has eight bpp.
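The bpp measure defined above amounts to a one-line computation. The sketch below uses a hypothetical helper `bits_per_pixel` and an illustrative image size of 512 by 768 pixels, chosen only for the example.

```python
def bits_per_pixel(num_bytes, height, width):
    """Bit rate in bits per pixel: total coded bits divided by the pixel count."""
    return 8 * num_bytes / (height * width)


# An uncompressed 8-bit grayscale image stores one byte per pixel.
print(bits_per_pixel(512 * 768, 512, 768))  # 8.0

# A bitstream half that size corresponds to 4 bpp.
print(bits_per_pixel(196608, 512, 768))     # 4.0
```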

We first compare the proposed CCNs with mask convolutional networks (MCNs) [25, 24], PixelCNN++ [22], and side information networks (SINs) [19] for entropy modeling. As a special case of CCNs, MCNs specify the raster coding order without using any code dividing technique (see Fig. 2). We implement our own version of MCN that inherits the network architecture from the CCN with (number of feature blocks) and (filter support). PixelCNN++ [22] is originally designed for image inpainting as a generative model. Here we adapt it for entropy modeling. Starting from a down-sampled grayscale image with a factor of four, we use a DNN of similar model complexity compared with our CCN to predict the categorical distributions from the down-sampled image. SINs summarize the side information of the codes with a separate DNN, which is helpful in probability estimation. We adopt a DNN-based autoencoder of similar model complexity as our CCN (including three strided convolutions and two residual connections) to generate the side information, which is further quantized and compressed with arithmetic coding for performance evaluation. We test five variants of SINs with different amounts of side information by changing the number of output channels . All competing models are trained on the same dataset described at the beginning of Section VI. We also introduce a fast version of our method, which we name CCN, by making the network architecture lighter (with and ) and by dividing the input image into non-overlapping patches for parallel processing.

Fig. 8: Bit rates of CCN in comparison with lossless image compression standards on the Kodak and Tecnick datasets.
TABLE I: Running time in seconds of different DNN-based entropy models on the Kodak dataset with image size of
| | SIN | MCN | PixelCNN++ | CCN | CCN (fast version) |
| Encoding | 0.155 | 0.323 | 0.121 | 0.323 | 0.074 |
| Decoding | 0.155 | 3079.68 | 0.121 | 35.28 | 0.984 |
Fig. 9: Ablation study of CCN on the Kodak and Tecnick datasets. CCN(,) denotes the CCN with feature blocks and filter size. CCN represents the CCN with the raster coding order and the corresponding code dividing technique (see Fig. 2).
Fig. 10: Visualization of the learned continuous MoG distributions of sample codes before discretization. It is clear that most of them are multimodal and therefore cannot be well fit using a single Gaussian.

Fig. 7 shows the bit rates of the competing methods on the Kodak dataset. The proposed CCN matches the best performing model MCN, which suggests that with the proposed zigzag coding order and code dividing technique, CCN does not exclude the most important codes from the partial context of the current code being processed. The bit rates of SINs come from two parts - the image itself and the side information. It is clear from the figure that increasing the amount of side information leads to bit savings of the image, at the cost of additional bits introduced to code the side information. In general, it is difficult to determine the amount of side information for optimal compression performance. PixelCNN++ can also be regarded as a special case of SINs, whose side information is a small image without further processing, leading to the worst performance.

We also compare CCN with the widely used lossless image compression standards, including TIFF, GIF, PNG, JPEG-LS, JPEG2000-LS, and BPG. All test results are generated by MATLAB2017. From Fig. 8, we find that CCN (along with its fast version CCN) outperforms all competing methods on the Kodak and Tecnick datasets, achieving more than bit savings compared to the best lossless image compression standard, JPEG2000-LS.

The running time of the four types of DNN-based entropy models is measured on a machine with an NVIDIA TITAN Xp GPU; the results on the Kodak dataset are listed in Table I. For encoding, CCN enjoys the fastest speed, followed by PixelCNN++ and SIN (the best performing variant). Despite similar encoding time, they have substantially different decoding complexities. PixelCNN++ has the fastest decoder, followed by SIN. Due to its sequential decoding nature, MCN is the slowest, taking nearly one hour to decode a grayscale image of size . Our CCN achieves a significant improvement upon MCN with the proposed code dividing technique, while maintaining nearly the same bit rates. Moreover, CCN speeds up CCN more than times, striking a good balance between efficiency and accuracy.
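As a quick check of the decoding figures, the ratios can be recomputed from Table I. The snippet below is a minimal sketch: the mapping of the table's five columns to SIN, MCN, PixelCNN++, CCN, and the fast CCN variant is an assumption inferred from the discussion above (the extracted table carries no header), and the dictionary keys are illustrative labels only.

```python
# Decoding times in seconds, copied from Table I.
# ASSUMPTION: the column-to-model mapping is inferred from the text,
# not stated in the table itself.
decoding_seconds = {
    "SIN": 0.155,
    "MCN": 3079.68,
    "PixelCNN++": 0.121,
    "CCN": 35.28,
    "CCN (fast)": 0.984,
}


def speedup(slow: str, fast: str) -> float:
    """Return how many times faster `fast` decodes than `slow`."""
    return decoding_seconds[slow] / decoding_seconds[fast]


mcn_minutes = decoding_seconds["MCN"] / 60.0
print(f"MCN decoding: {mcn_minutes:.1f} min")
print(f"CCN vs. MCN: {speedup('MCN', 'CCN'):.1f}x")
print(f"fast CCN vs. CCN: {speedup('CCN', 'CCN (fast)'):.1f}x")
```

Under that assumed mapping, the MCN decoding time is a little over 51 minutes, which is consistent with the "nearly one hour" noted above.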

We conduct thorough ablation experiments to analyze the impact of individual components on the final compression performance. Fig. 9 shows the bit rates of CCNs with three different numbers of feature blocks () and two filter sizes (). When the filter size and the network depth are fixed, adding more feature blocks effectively increases the model capability and thus boosts the compression performance. Similarly, using a larger filter size with fixed network depth and feature block number increases the partial context, leading to better entropy modeling. Moreover, we replace the proposed zigzag coding order in CCN with the raster coding order, whose model is denoted by CCN. From Fig. 9, we observe that the performance of CCN drops significantly, only comparable to the CCN with four feature blocks, which verifies the advantages of the proposed coding order.

Fig. 11: Rate-distortion curves of different compression methods on the Kodak dataset. (a) PSNR. (b) MS-SSIM. Baseline denotes our method with separately optimized transforms and entropy model for MSE.
Fig. 12: Rate-distortion curves of different compression methods on the Tecnick dataset. (a) PSNR. (b) MS-SSIM.
Average bpp 0.100 0.209 0.362 0.512 0.671 0.794
Encoding 0.013 0.025 0.044 0.066 0.085 0.103
Decoding 0.116 0.227 0.457 0.735 1.150 1.232
TABLE II: Running time in seconds of our CCN-based entropy model at six bit rates on the Kodak dataset
Fig. 13: Compressed images by different compression methods on the Kodak dataset. The quantitative measures are in the format of “bpp / PSNR / MS-SSIM”. (a) Uncompressed “Sailboat” image. (b) Ballé17 [8]. 0.209 / 31.81 / 0.962. (c) Li18 [23]. 0.244 / 31.97 / 0.966. (d) BPG. 0.220 / 33.19 / 0.963. (e) Ours optimized for MS-SSIM. 0.209 / 31.01 / 0.978. (f) Uncompressed “Statue” image. (g) Ballé17. 0.143 / 29.48 / 0.942. (h) Li18. 0.115 / 29.35 / 0.938. (i) BPG. 0.119 / 29.77 / 0.935. (j) Ours optimized for MS-SSIM. 0.116 / 28.05 / 0.954.
Fig. 14: Compressed images by different compression methods on the Tecnick dataset. The quantitative measures are in the format of “bpp / PSNR / MS-SSIM”. (a) Uncompressed “Bus” image. (b) JPEG2K. 0.199 / 24.41 / 0.914. (c) Li18 [23]. 0.224 / 23.41 / 0.908. (d) BPG. 0.208 / 25.36 / 0.928. (e) Ours(MS-SSIM). 0.198 / 23.71 / 0.951. (f) Uncompressed “Toy train” image. (g) JPEG2K. 0.201 / 28.29 / 0.917. (h) Li18. 0.189 / 26.83 / 0.899. (i) BPG. 0.210 / 29.25 / 0.933. (j) Ours(MS-SSIM). 0.198 / 28.08 / 0.949.

VI-B Lossy Image Compression

In lossy image compression, the analysis transform, the non-uniform quantizer, the CCN-based entropy model, and the synthesis transform are jointly optimized for rate-distortion performance. In the early stages of training, the probability distribution of the codes may change rapidly, which makes it difficult to track and causes instability in learning the entropy model. We find that this issue can be alleviated by a simple warmup strategy. Specifically, the analysis and synthesis transforms are trained using only the distortion term for the first epoch. We then fix the transforms and train the CCN-based entropy model until it reasonably fits the current distribution of the codes. After that, we optimize the entire method end to end for the remaining epochs. We use Adam with a learning rate of and gradually lower it by a factor of , until . The number of quantization centers is and the number of Gaussian components in MoG is . Fig. 10 shows the learned continuous distributions of some codes, which are typically complex and multimodal. This optimization is performed separately for each and for each distortion measure. We optimize twelve models for six bit rates and two distortion metrics (MSE and MS-SSIM). MSE is converted to peak signal-to-noise ratio (PSNR) for quantitative analysis.
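The three-stage warmup described above can be sketched as a small schedule function. This is only an illustration under stated assumptions: the names `Phase` and `warmup_phase` are hypothetical, the first stage is assumed to update the analysis and synthesis transforms, the second stage is assumed to use the rate term alone, and `entropy_warmup_epochs` is a made-up fixed budget standing in for "until the entropy model reasonably fits the codes", which the text does not quantify.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Phase:
    name: str
    train_transforms: bool        # update the analysis/synthesis transforms?
    train_entropy_model: bool     # update the CCN-based entropy model?
    loss_terms: Tuple[str, ...]   # which loss terms are active


def warmup_phase(epoch: int, entropy_warmup_epochs: int = 1) -> Phase:
    """Return the training phase for a 0-based epoch index.

    `entropy_warmup_epochs` is a hypothetical fixed budget; the paper only
    says the entropy model is trained until it reasonably fits the codes.
    """
    if epoch < 0 or entropy_warmup_epochs < 1:
        raise ValueError("epoch must be >= 0 and entropy_warmup_epochs >= 1")
    if epoch == 0:
        # Stage 1: transforms only, distortion term only.
        return Phase("distortion-only", True, False, ("distortion",))
    if epoch <= entropy_warmup_epochs:
        # Stage 2: transforms frozen, entropy model fitted (rate term assumed).
        return Phase("entropy-model-only", False, True, ("rate",))
    # Stage 3: everything is optimized jointly for rate-distortion performance.
    return Phase("joint", True, True, ("rate", "distortion"))


for epoch in range(4):
    print(epoch, warmup_phase(epoch).name)
```

A training loop could query `warmup_phase(epoch)` at the start of each epoch to decide which parameter groups receive gradient updates and which loss terms are summed.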

We compare our methods with existing image coding standards and recent DNN-based compression models. These include JPEG [14], JPEG2000 [15], BPG [42], Agustsson17 [17], Theis17 [16], Toderici17 [28], Rippel17 [18], Mentzer18 [25], Johnston17 [29], Ballé17 [8], and Li18 [23]. Both JPEG (with 4:2:0 chroma subsampling) and JPEG2000 are based on the optimized implementations in MATLAB2017. For BPG, we adopt the latest version from its official website with the default setting. When it comes to DNN-based models for lossy image compression, the implementations are generally not available. Therefore, we report the results from their respective papers.

Fig. 11 shows the rate-distortion curves on the Kodak dataset. We find that our method optimized for MSE outperforms all competing methods at low bit rates ( bpp), except for BPG. When optimized for MS-SSIM, our method performs on par with Rippel17 and is much better than the rest. Fig. 12 shows the rate-distortion curves on the Tecnick dataset, where we observe similar trends for both PSNR and MS-SSIM. An interesting observation is that when we continue increasing the bit rate, PSNR/MS-SSIM starts to plateau, which may be due to the limited model capability. Without any constraint on rate () and quantization (), our method optimized for MSE only reaches dB on the Kodak dataset, which may be considered as an empirical upper-bound for our network structure. Preliminary results indicate that increasing the depth and width of the network leads to performance improvements at high bit rates.
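For reference, the standard conversion from MSE to PSNR used in such comparisons can be written in a few lines. This is a minimal sketch: the function name `mse_to_psnr` is hypothetical, a peak value of 255 (8-bit images) is assumed, and whether PSNR is averaged per image or computed from the average MSE is not specified here.

```python
import math


def mse_to_psnr(mse: float, max_value: float = 255.0) -> float:
    """Convert mean squared error to PSNR in dB.

    PSNR = 10 * log10(max_value^2 / mse). An 8-bit peak of 255 is assumed.
    """
    if mse < 0:
        raise ValueError("mse must be non-negative")
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Illustrative value only (not a result from this paper).
print(f"{mse_to_psnr(65.025):.2f} dB")
```

Because the mapping is logarithmic, halving the MSE raises PSNR by about 3 dB regardless of the starting point.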

We visually compare the compressed images by our method against Ballé17, Li18, JPEG2K, and BPG. Fig. 13 and Fig. 14 show sample compressed results on the Kodak and Tecnick datasets, respectively. JPEG2K and BPG exhibit artifacts (such as blocking, ringing, blurring, and aliasing) that are common to all linear transform coding methods, reflecting the underlying linear basis functions. Ballé17 is effective at suppressing ringing artifacts at the cost of over-smoothing fine structures. Li18 allocates more bits to preserve large-scale strong edges, while tending to eliminate small-scale localized features (e.g., edges, contours, and textures). In contrast, our method generates compressed images with more faithful details and less visible distortions. In addition, we investigate the effectiveness of joint optimization of the transforms and the CCN-based entropy model. A baseline method is introduced, which first optimizes the transforms for MSE (), and then trains the CCN-based entropy model with the learned (and fixed) transforms. As shown in Fig. 11 (a), the baseline method underperforms the jointly optimized one by 1 dB.

We report the running time of our method at six bit rates on the Kodak dataset using the same machine. It takes second to generate and second to reconstruct the image. The entropy coding time is listed in Table II, from which we see that more time is needed to encode and decode images at higher bit rates. With the help of the proposed code dividing technique, our method performs entropy decoding in around one second for images of size .

VII Conclusion and Discussion

We have introduced CCNs for context-based entropy modeling. Parallel entropy encoding and decoding are achieved with the proposed coding order and code dividing technique, which can be efficiently implemented using masked convolutions. We test the CCN-based entropy model (combined with arithmetic coding) in both lossless and lossy image compression. For the lossless case, our method achieves the best compression performance, which we believe arises from the more accurate estimation of the Bernoulli distributions of the binary codes. For the lossy case, our method offers improvements both visually and in terms of rate-distortion performance over image compression standards and recent DNN-based models.
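To illustrate why a code dividing technique enables parallel decoding, the sketch below groups the codes of a 3D block by the sum of their indices. It rests on an assumption: that codes lying on the same plane (channel index plus row index plus column index equal to a constant) form one group and depend only on groups with a smaller sum. The function name `divide_codes` and the block size are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Index = Tuple[int, int, int]  # (channel r, row p, column q), all 0-based


def divide_codes(channels: int, height: int, width: int) -> Dict[int, List[Index]]:
    """Group code positions by k = r + p + q (assumed grouping rule)."""
    groups: Dict[int, List[Index]] = defaultdict(list)
    for r in range(channels):
        for p in range(height):
            for q in range(width):
                groups[r + p + q].append((r, p, q))
    return dict(groups)


# Hypothetical block size, for illustration only.
groups = divide_codes(channels=4, height=8, width=8)
sequential_steps = len(groups)                       # one step per group
total_codes = sum(len(g) for g in groups.values())   # every code appears once
print(f"{total_codes} codes decoded in {sequential_steps} sequential steps")
```

Under this assumption, the number of sequential decoding steps grows with the sum of the three block dimensions rather than with their product, since all codes within one group can be processed at once.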

The application scope of the proposed CCN is far beyond building the entropy model in image compression. As a general probability model, CCN appears promising for a number of image processing applications. For example, we may use CCN to learn a probability model for natural images, and use it as a prior in Bayesian inference to solve various vision tasks such as image restoration [43, 44], image quality assessment [45, 46], and image generation [10, 47].
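As an illustration of how a learned probability model could serve as a prior, a generic maximum a posteriori (MAP) formulation for image restoration is given below. The symbols are assumptions introduced here rather than notation from this paper: y is the degraded observation, x is the unknown clean image, p(y | x) is the degradation likelihood, and P_theta(x) is a probability model of natural images such as one learned by a CCN.

```latex
\hat{\mathbf{x}}
  = \arg\max_{\mathbf{x}} \; p(\mathbf{x} \mid \mathbf{y})
  = \arg\max_{\mathbf{x}} \; \bigl[ \log p(\mathbf{y} \mid \mathbf{x}) + \log P_{\theta}(\mathbf{x}) \bigr]
```

The second equality follows from Bayes' rule, since the evidence p(y) does not depend on x and can be dropped from the maximization.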


The authors would like to thank the NVIDIA Corporation for donating a TITAN Xp GPU used in this research.


  • [1] Wikipedia, “Morse code,” 2019. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Morse_code&oldid=895323877
  • [2] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proc. IRE, vol. 40, no. 9, pp. 1098–1101, 1952.
  • [3] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Commun. ACM, vol. 30, no. 6, pp. 520–540, 1987.
  • [4] G. Martin, “Range encoding: An algorithm for removing redundancy from a digitised message,” in Video & Data Recording Conf., 1979, pp. 24–27.
  • [5] C. E. Shannon, “A mathematical theory of communication,” Bell System Tech. J., vol. 27, no. 3, pp. 379–423, 1948.
  • [6] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Trans. Comput., vol. C-23, no. 1, pp. 90–93, 1974.
  • [7] R. A. DeVore, B. Jawerth, and B. J. Lucier, “Image compression through wavelet transform coding,” IEEE Trans. Inform. Theory, vol. 38, no. 2, pp. 719–746, 1992.
  • [8] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Int. Conf. Learning Representations, 2017.
  • [9] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 620–636, 2003.
  • [10] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Int. Conf. Mach. Learning, 2016, pp. 1747–1756.
  • [11] A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with PixelCNN decoders,” in Neural Inf. Process. Syst, 2016, pp. 4797–4805.
  • [12] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).   New York, NY, USA: Wiley-Interscience, 2006.
  • [13] R. Sudhakar, R. Karthiga, and S. Jayaraman, “Image compression using coding of wavelet coefficients–a survey,” ICGST-GVIP J., vol. 5, no. 6, pp. 25–38, 2005.
  • [14] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Trans. Consumer Electron., vol. 38, no. 1, pp. 18–34, 1992.
  • [15] A. Skodras, C. Christopoulos, and T. Ebrahimi, “The JPEG 2000 still image compression standard,” IEEE Signal Process. Mag., vol. 18, no. 5, pp. 36–58, 2001.
  • [16] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in Int. Conf. Learning Representations, 2017.
  • [17] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Neural Inf. Process. Syst, 2017, pp. 1141–1151.
  • [18] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in Int. Conf. Mach. Learning, 2017, pp. 2922–2930.
  • [19] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in Int. Conf. Learning Representations, 2018.
  • [20] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Int. Speech Commun. Assoc., 2010.
  • [21] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling,” in Int. Speech Commun. Assoc., 2012.
  • [22] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “PixelCNN++: A PixelCNN implementation with discretized logistic mixture likelihood and other modifications,” in Int. Conf. Learning Representations, 2017.
  • [23] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 3214–3223.
  • [24] M. Li, S. Gu, D. Zhang, and W. Zuo, “Efficient trimmed convolutional arithmetic encoding for lossless image compression,” arXiv:1801.04662, 2018.
  • [25] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 4394–4402.
  • [26] R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2325–2383, 1998.
  • [27] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” arXiv:1511.06085, 2015.
  • [28] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 5435–5443.
  • [29] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 4385–4393.
  • [30] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in 37th Asilomar Conf. Signals, Syst. and Comput., 2003, pp. 1398–1402.
  • [31] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Neural Inf. Process. Syst, 2018, pp. 10 794–10 803.
  • [32] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Neural Inf. Process. Syst, 2014, pp. 2672–2680.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 770–778.
  • [34] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conf. Comput. Vis. Pattern Recog., vol. 1, no. 2, 2017, pp. 4700–4708.
  • [35] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016.
  • [36] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv:1602.02830, 2016.
  • [37] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv:1606.06160, 2016.
  • [38] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in Eur. Conf. Comput. Vis., 2016, pp. 525–542.
  • [39] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inform. Theory, vol. 28, no. 2, pp. 129–137, 1982.
  • [40] N. Asuni and A. Giachetti, “TESTIMAGES: A large data archive for display and algorithm testing,” Journal of Graphics Tools, vol. 17, no. 4, pp. 113–125, 2013.
  • [41] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int. Conf. Learning Representations, 2015.
  • [42] F. Bellard, “BPG image format,” 2019. [Online]. Available: https://bellard.org/bpg/
  • [43] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, 2017.
  • [44] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli, “Image denoising using scale mixtures of Gaussians in the wavelet domain,” IEEE Trans. Image Process., vol. 12, no. 11, 2003.
  • [45] Z. Wang and A. Bovik, Modern Image Quality Assessment, 1st ed.   Morgan & Claypool Publishers, 2006.
  • [46] K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang, and W. Zuo, “End-to-end blind image quality assessment using deep neural networks,” IEEE Trans. Image Process., vol. 27, no. 3, pp. 1202–1213, 2018.
  • [47] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in Neural Inf. Process. Syst, 2018, pp. 6571–6583.