It has long been understood that precisely estimating the probabilistic structure of natural visual images is crucial for image compression. Despite the remarkable success of recent end-to-end optimized image compression, the latent code representation is assumed to be fully statistically factorized such that the entropy modeling is feasible. Here we describe context-based convolutional networks (CCNs) that exploit statistical redundancies in the codes for improved entropy modeling. We introduce a 3D zigzag coding order together with a 3D code dividing technique to define proper context and to achieve parallel entropy decoding, both of which boil down to placing translation-invariant binary masks on convolution filters of CCNs. We demonstrate the power of CCNs for entropy modeling in both lossless and lossy image compression. For the former, we directly apply a CCN to binarized image planes for estimating the Bernoulli distribution of each code. For the latter, the categorical distribution of each code is represented by a discretized mixture of Gaussian distributions, whose parameters are estimated by three CCNs. We jointly optimize the CCN-based entropy model with analysis and synthesis transforms for rate-distortion performance. Experiments on two image datasets show that the proposed lossless and lossy image compression methods based on CCNs generally exhibit better compression performance than existing methods with manageable computational complexity.
Data compression has played a significant role in engineering for centuries [1]. Compression can be either lossless or lossy. Lossless compression allows perfect data reconstruction from compressed bitstreams with the goal of assigning shorter codewords to more “probable” codes. Typical examples include Huffman coding [2], arithmetic coding [3], and range coding [4]. Lossy compression discards “unimportant” information, and the definition of importance is application-dependent. For example, if the data (such as images and videos) are meant to be consumed by the human visual system, importance should be measured in accordance with human perception, discarding data that are perceptually redundant, while keeping those that are most visually noticeable. In lossy compression, one must face the rate-distortion trade-off, where the rate is computed by the entropy of the discrete codes [5] and the distortion is measured by a signal fidelity metric. A prevailing scheme in the context of lossy image compression is transform coding, which consists of three operations: transformation, quantization, and entropy coding. Transforms are designed to be better-suited for exploiting aspects of human perception, and map an image to a latent code space with several statistical advantages. Early transforms [6, 7] are linear, invertible, and fixed for all bit rates; errors arise only from quantization. Recent transforms take the form of deep neural networks (DNNs) [8], aiming for more comprehensible and nonlinear representations. DNN-based transforms are mostly non-invertible, which, however, may encourage discarding perceptually unimportant image features during transformation. This gives us an opportunity to learn different transforms at different bit rates for better rate-distortion performance. Entropy coding is responsible for losslessly compressing the quantized codes into bitstreams for storage and transmission.

In either lossless or lossy image compression, a discrete probability distribution of the latent codes shared by the encoder and the decoder (i.e., the entropy model) is essential in determining the compression performance. According to Shannon’s source coding theorem [5], given a vector of code intensities $\boldsymbol{y}$ with entries $y_i$, the optimal code length of $\boldsymbol{y}$ should be $-\log_2 P(\boldsymbol{y})$, where binary symbols are assumed to construct the codebook. Without further constraints, the general problem of estimating $P(\boldsymbol{y})$ in high-dimensional spaces is intractable, a problem commonly known as the curse of dimensionality. For this reason, most entropy coding schemes assume $\boldsymbol{y}$ is fully statistically factorized with the same marginal distribution, leading to a code length of $-\sum_i \log_2 P(y_i)$. Alternatively, the chain rule in probability theory offers a more accurate approximation

$$P(\boldsymbol{y}) \approx \prod_i P\big(y_i \mid \mathrm{PTX}(y_i, \boldsymbol{y})\big), \tag{1}$$

where $\mathrm{PTX}(y_i, \boldsymbol{y})$ represents the partial context of $y_i$ coded before it in $\boldsymbol{y}$. A representative example is the context-based adaptive binary arithmetic coding (CABAC) [9] in H.264/AVC, which considers the two nearest codes as the context and obtains noticeable improvements over previous image/video compression standards. As the size of $\mathrm{PTX}(y_i, \boldsymbol{y})$ becomes large, it is difficult to estimate this conditional probability by constructing histograms. Recent methods such as PixelRNN [10] and PixelCNN [11] take advantage of DNNs in modeling long-range relations among pixels, making the estimation of $P\big(y_i \mid \mathrm{PTX}(y_i, \boldsymbol{y})\big)$ with larger partial contexts feasible. Nevertheless, these methods are computationally intensive, taking hours for medium-resolution images.
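The gap between a fully factorized model and a context-based one can be illustrated with a toy sequence. The following is a minimal sketch, not part of the proposed method: the function names are hypothetical, the context is limited to the single preceding code, and the probabilities are empirical histograms estimated from the very sequence being coded, so the cost of transmitting the model is ignored.

```python
import math
from collections import Counter, defaultdict

def factorized_bits(codes):
    """Ideal code length when every code shares one marginal distribution."""
    counts = Counter(codes)
    n = len(codes)
    return -sum(math.log2(counts[c] / n) for c in codes)

def context_bits(codes):
    """Ideal code length when each code is conditioned on its predecessor."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(codes, codes[1:]):
        pair_counts[prev][cur] += 1
    # The first code has no context, so it falls back to the marginal.
    counts = Counter(codes)
    bits = -math.log2(counts[codes[0]] / len(codes))
    for prev, cur in zip(codes, codes[1:]):
        total = sum(pair_counts[prev].values())
        bits -= math.log2(pair_counts[prev][cur] / total)
    return bits

# A toy sequence with strong local dependencies (runs of equal codes).
codes = [0, 0, 0, 0, 1, 1, 1, 1] * 8
print(factorized_bits(codes), context_bits(codes))
```

On such a sequence the one-code context already shortens the ideal code length, and histograms of this kind stop scaling once the context grows, which is where learned models become attractive.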
We describe context-based convolutional networks (CCNs) for parallel entropy modeling. Given , we specify a 3D zigzag coding order so that the most relevant codes of can be included in its context. Parallel computation during entropy encoding is straightforward as the context of each code is known and readily available. However, this is not always the case during entropy decoding because the context of is not ready for probability estimation until all codes in have been decoded sequentially, which is prohibitively slow. To address this issue, we introduce a 3D code dividing technique, which partitions into multiple groups in compliance with the proposed coding order. The codes within each group are assumed to be conditionally independent given their respective contexts, and therefore can be decoded in parallel. In the context of CCNs, this amounts to applying properly designed translation-invariant masks to convolutional filters.
To validate the proposed CCNs, we combine them with arithmetic coding [3] for entropy coding. For lossless image compression, we convert the input grayscale image into eight binary planes and train a CCN to predict the Bernoulli distribution of each binary code, optimizing for the entropy loss in information theory [12]. For lossy image compression, we parameterize the categorical distribution of each code with a discretized mixture of Gaussian (MoG) distributions, whose parameters (i.e., mixture weights, means, and variances) are estimated by three CCNs, depending on its context. The CCN-based entropy model is jointly optimized with analysis and synthesis transforms (i.e., mappings between raw pixel space and latent code space) over a database of training images for rate-distortion performance. Across two independent sets of images, we find that the optimized methods for lossless and lossy image compression perform favorably against image compression standards and DNN-based methods, especially at low bit rates.

In this section, we provide a detailed overview of DNN-based entropy models and lossy image compression methods. For traditional image compression techniques with separately optimized components, we refer the interested readers to [13, 14, 15].
The accuracy of entropy coding largely affects the rate-distortion performance of image compression systems. In this process, the first and most important step is to estimate the probability of the codes. In most image compression techniques, the codes are assumed to be fully statistically factorized, so that their entropy can be easily computed through the marginal distributions [8, 16, 17, 18]. Arguably, natural images that undergo a highly nonlinear analysis transform still exhibit strong statistical redundancies [19], which suggests that incorporating context into probability estimation has great potential for improving entropy modeling performance.
Considerable progress has been made in DNN-based context modeling for natural languages and images. In natural language processing, recurrent neural networks (RNNs) [20] and long short-term memories [21] are two popular tools to model long-range dependencies. In image processing, PixelRNN [10] and PixelCNN [11] are among the first attempts to exploit long-range pixel dependencies for image generation. However, the above-mentioned methods are computationally inefficient, requiring one forward propagation to generate a single pixel or to estimate its probability. To speed up PixelCNN, Salimans et al. [22] proposed PixelCNN++, which is able to sample a larger intermediate image conditioned on an initial image in a single forward pass. This process may be iterated to generate the final result. When viewing PixelCNN++ and its variants as entropy models, we must losslessly compress the initial image into bitstreams as side information, and send them to the decoder for entropy decoding.

Only recently have DNNs for context-based entropy modeling become an active research topic. Ballé et al. [19] introduced a scale prior, which stores a variance parameter for each code as side information. Richer side information generally leads to more accurate entropy modeling. However, as discussed before, this type of information should be quantized, compressed and considered part of the codes, and it is difficult to trade off the bits saved by the improved entropy model and the bits introduced by storing this information. Li et al. [23] extracted a small code block for each code as its context, and adopted a simple DNN for entropy modeling. The method suffers from heavy computational complexity similar to PixelRNN [10]. Li et al. [24] and Mentzer et al. [25] achieved parallel entropy encoding with mask DNNs. However, sequential entropy decoding has to be performed due to the context dependence of the codes, which is painfully slow. In contrast, our CCN-based entropy model equipped with a 3D zigzag coding order and a 3D code dividing technique permits parallel entropy coding, and is a more practical choice for industrial usage.
A major problem in end-to-end lossy image compression is that the gradients of the quantization function are zeros almost everywhere, making gradient descent-based optimization ineffective. Based on the strategies of alleviating the zero-gradient problem of quantization, DNN-based lossy image compression methods can be divided into different categories.
From a signal processing perspective, the quantizer can be approximated with an additive i.i.d. uniform noise, which has the same width as the quantization bin [26]. A desired property of this approximation is that the resulting density function is a continuous relaxation of the probability mass function of the quantized codes [8]. Another line of research introduced more continuous functions (without the zero-gradient problem) to approximate the quantization function. The step quantizer is used in the forward pass, while its continuous proxy is used in the backward pass. Toderici et al. [27] learned an RNN to compress small-size images to binary maps in a progressive manner. They later tested their models on large-size images [28]. Johnston et al. [29] exploited adaptive bit allocations and perceptual losses to boost the compression performance, especially in terms of MS-SSIM [30]. These methods are not explicitly optimized for rate-distortion performance, but treat entropy coding of the binary codes as a post-processing step. Ballé et al. [8] explicitly formulated DNN-based image compression under the framework of rate-distortion optimization. Assuming the codes are statistically factorized, they learned piece-wise linear density functions to compute differential entropy as an approximation to discrete entropy. In a subsequent work [19], each code is assumed to be zero-mean Gaussian with its own standard deviation, which is separately predicted with side information. Minnen et al. [31] combined the autoregressive and hierarchical priors, which leads to improved rate-distortion performance. Theis et al. [16] introduced a continuous upper bound of the discrete entropy with a Gaussian scale mixture. Rippel et al. [18] described pyramid-based analysis and synthesis transforms with adaptive code length regularization for real-time image compression. An adversarial loss [32] is incorporated to generate visually realistic results at low bit rates [18]. Agustsson et al. [17] introduced a soft-to-hard relaxation scheme by approximating quantization with a parametric softmax function.

In this section, we present in detail the construction of CCNs for entropy modeling. We start with a fully convolutional network, consisting of layers of convolutions followed by point-wise nonlinear activation functions. In order to perform efficient context-based entropy modeling, three assumptions are made on the network architecture:
For a code block , where , , and denote channel, height, and width, respectively, the output of the th layer convolution , where denotes the number of feature blocks to represent . By doing so, we are able to associate the feature point in th channel and th feature block at spatial location with uniquely.
Let be the set of codes encoded before (full context), and be the set of codes in the receptive field of that contributes to its computation (support set), respectively. Then, .
For and , .
Assumption I establishes a one-to-many correspondence between the input code block and the output feature representation . Assumption II ensures that the computation of depends only on a subset of . Together, the two assumptions guarantee the legitimacy of context-based entropy modeling in fully convolutional networks, which can be achieved by placing binary masks on convolution filters. Then, can be computed from , which specify the parameters of the conditional distribution. Assumption III allows the masks to be translation-invariant, which can be achieved by properly designed coding orders. Specifically, we start with the case of a 2D code block, where , and write mask convolution at the th layer as
(2) |
where and denote 2D convolution and Hadamard product, respectively. is a 2D convolution filter and is the corresponding 2D binary mask. According to Assumption I, the input and the output are of the same size as . The input code block corresponds to .
For the input layer of a fully convolutional network, the codes to produce is , where is the set of local indexes centered at . We choose
(3) |
which can be achieved by setting
(4) |
Fig. 1 illustrates the concepts of full context , support set , and translation-invariant mask , respectively. At the th layer, if a code , we have
(5) |
where the first line follows by induction and the second line follows from the definition of context. That is, as long as is in the context of , we are able to compute from without violating Assumption II. Unlike the input layer, with is also generated from , and can be used to compute . Therefore, we update the mask at the th layer
(6) |
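The effect of such masks under a raster coding order can be sketched in plain Python. This is a minimal illustration rather than the released Caffe implementation: the function names `raster_mask` and `masked_conv2d` are hypothetical, the filter is applied as a zero-padded correlation on a single feature block, and the rule that the input-layer mask excludes the current position while later layers include it is an assumption drawn from the discussion around Eq. (4) and Eq. (6).

```python
def raster_mask(size, include_center):
    """Binary mask for a size x size filter under a raster coding order.

    Entries above the center row, or left of the center within the center
    row, refer to codes that were coded earlier and are set to 1.
    """
    c = size // 2
    mask = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < c or (i == c and j < c):
                mask[i][j] = 1
    if include_center:
        # Assumed rule for hidden layers: the feature at the current
        # position was itself computed from the context only.
        mask[c][c] = 1
    return mask

def masked_conv2d(x, w, mask):
    """Zero-padded, same-size 2D correlation of x with the masked filter."""
    height, width, k = len(x), len(x[0]), len(w)
    c = k // 2
    out = [[0.0] * width for _ in range(height)]
    for p in range(height):
        for q in range(width):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    pp, qq = p + i - c, q + j - c
                    if 0 <= pp < height and 0 <= qq < width:
                        acc += mask[i][j] * w[i][j] * x[pp][qq]
            out[p][q] = acc
    return out

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1.0] * 3 for _ in range(3)]
first_layer = masked_conv2d(x, w, raster_mask(3, include_center=False))
print(first_layer[1][1])  # depends only on the codes 1, 2, 3 and 4
```

Because the same mask is applied at every spatial location, the whole output can be computed in one pass during encoding, when all codes are available.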
With the proposed CCN with translation-invariant masks in Eq. (4) and Eq. (6), all codes can be encoded in parallel. However, this is not the case in the entropy decoding phase. As shown in Fig. 2 (a) and (b), the two nearby codes in the same row (highlighted in orange and blue, respectively) cannot be decoded simultaneously because the orange code is in the support set (or context) of the blue code given the raster coding order. To speed up entropy decoding with parallel computation, we may further remove dependencies among codes at the risk of reduced model accuracy. Specifically, we partition into groups, namely, , and remove dependencies among the codes within the same group, resulting in a partial context for . As a result, all codes in the th group share the same partial context, and can be decoded in parallel. Note that code dividing schemes are largely constrained by the pre-specified coding order. For example, if we use a raster coding order, it is straightforward to divide the codes by row. In this case, ( and index vertical and horizontal directions, respectively), which is extremely important in predicting the probability of according to CABAC [9], has been excluded from its partial context. To make a good trade-off between modeling efficiency and accuracy, we adopt a zigzag coding order as shown in Fig. 2 (d), where and . As such, we retain the most relevant codes in the partial context for better entropy modeling (see Fig. 9 for quantitative results). Accordingly, the mask at the th layer becomes
(7) |
Now, we extend our discussion to a 3D code block, where . Fig. 3 (a) shows the proposed 3D zigzag coding order and 3D code dividing technique (zoom in for improved visibility). Specifically, is divided into groups in the shape of slants, where the th one is specified by . The partial context of is defined as . We then write mask convolution in the 3D case as
(8) |
where and are indexes for the feature block and channel dimensions, respectively. For the 2D case, each layer shares the same mask (). When extending to 3D code blocks, each channel in a layer shares a mask, and there are a total of 3D masks. For the input layer, the codes to produce is , based on which we define the mask
(9) |
For the th layer, we modify the mask to include the current slant
(10) |
as shown in Fig. 4, where we highlight the difference in the green slant.
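The code dividing idea can be sketched as an index computation. The snippet below is an illustration under a stated assumption: each slant is taken to be the set of codes whose channel, row, and column indexes sum to the same value, so that codes in one group depend only on codes from earlier groups. The function name `zigzag_groups` is hypothetical.

```python
def zigzag_groups(channels, height, width):
    """Group 3D code indexes (r, p, q) by the index sum r + p + q."""
    groups = [[] for _ in range(channels + height + width - 2)]
    for r in range(channels):
        for p in range(height):
            for q in range(width):
                groups[r + p + q].append((r, p, q))
    return groups

groups = zigzag_groups(3, 4, 4)
# Number of parallel decoding steps versus number of codes.
print(len(groups), 3 * 4 * 4)
```

Under this assumption the number of sequential decoding steps grows with the sum of the block dimensions rather than their product, which is the source of the decoding speed-up.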
In this section, we combine our CCN-based entropy model with the arithmetic coding algorithm for lossless image compression.
As a starting point, we binarize the grayscale image to obtain a 3D code block
(11) |
where we index the most significant bit-plane with . Our CCN takes as input and produces a feature block (the superscript is omitted for notational convenience) of the same size to compute the mean estimates of Bernoulli distributions . Fig. 5 shows the network architecture, which consists of eleven mask convolution layers with parametric ReLU nonlinearities in between. The last convolution responses undergo a sigmoid nonlinearity to constrain the dynamic range within $[0, 1]$. We make four residual connections as suggested in [33] to accelerate training. We will experiment with two hyper-parameters in CCN: the size of convolution filters for all layers and the number of feature blocks in hidden layers .

To optimize the network parameters, which are collectively denoted by , we apply the expected code length as the empirical loss
(12) |
where is an indicator function and the expectation may be approximated by averaging over a mini-batch of training images. Finally, we implement our own arithmetic coding with the learned CCN-based entropy model to compress to bitstreams, and report performance using actual bit rates. This facilitates comparison against widely used image compression standards.
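The binarization step and the code-length loss can be sketched as follows. This is a minimal illustration with hypothetical function names: the image is a nested list of 8-bit integers, the most significant bit-plane is assumed to have index 0, and `probs` stands in for the per-code Bernoulli means that a trained CCN would output.

```python
import math

def to_bit_planes(image):
    """Split an 8-bit grayscale image into eight binary planes, MSB first."""
    return [[[(pixel >> (7 - r)) & 1 for pixel in row] for row in image]
            for r in range(8)]

def from_bit_planes(planes):
    """Invert to_bit_planes, confirming that the binarization is lossless."""
    height, width = len(planes[0]), len(planes[0][0])
    return [[sum(planes[r][p][q] << (7 - r) for r in range(8))
             for q in range(width)] for p in range(height)]

def bernoulli_code_length(planes, probs):
    """Total ideal code length in bits, where probs[r][p][q] is P(bit = 1)."""
    bits = 0.0
    for plane, prob_plane in zip(planes, probs):
        for row, prob_row in zip(plane, prob_plane):
            for bit, p1 in zip(row, prob_row):
                bits -= math.log2(p1 if bit == 1 else 1.0 - p1)
    return bits

image = [[0, 255], [128, 37]]
planes = to_bit_planes(image)
# An uninformative model that predicts 0.5 everywhere spends 1 bit per code.
uniform = [[[0.5] * 2 for _ in range(2)] for _ in range(8)]
print(bernoulli_code_length(planes, uniform))
```

Any model whose predictions are better than uniform on average lowers this quantity, which is what minimizing the loss over training images aims for.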
In lossy image compression, our objective is to minimize a weighted sum of rate and distortion, , where governs the trade-off between the two terms. As illustrated in Fig. 6, our compression method consists of four components: an analysis transform , a quantizer , a CCN-based entropy model, and a synthesis transform . The analysis transform takes a color image as input and produces the latent code representation . consists of three convolutions, each of which is followed by down-sampling with a factor of two. A dense block [34] comprised of seven convolutions is employed after each down-sampling. After the last dense block, we add another convolution layer with filters to produce . Empirically, the parameter sets the upper bound of the bit rate that a general DNN-based compression method can achieve. The parameters of constitute the parameter vector to be optimized.
The synthesis transform is a mirror of the analysis transform. Particularly, the depth-to-space reshaping [28, 35] is adopted to up-sample the feature maps. The last convolution layer with three filters is responsible for producing the compressed image in RGB space. The parameters of constitute the parameter vector to be optimized.
For the quantizer , we parameterize its quantization centers for the th channel by , where is the number of quantization centers and . The monotonicity of can be enforced by a simple re-parameterization based on cumulative functions. Given a fixed set of , we perform quantization by mapping to its nearest center that minimizes the quantization error
(13) |
The quantizer has zero gradients almost everywhere, which hinders training via back-propagation. Taking inspiration from binarized neural networks [36, 37, 38], we make use of an identity mapping as a more continuous proxy to the step quantization function. During training, we use the step quantizer and the identity mapping in the forward and backward passes, respectively.
The quantization centers should be optimized by minimizing the mean squared error (MSE),
(14) |
which is essentially a $K$-means clustering problem, and can be solved efficiently by Lloyd’s algorithm [39]. Specifically, we initialize the quantization centers using uniform quantization, which appears to work well in all experiments. To make parameter learning of the entire model smoother, we adjust the quantization centers using stochastic gradient descent instead of a closed-form update.
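The quantizer and the gradient-based update of its centers can be sketched for scalar codes. This is an illustrative sketch with hypothetical function names: the backward pass is written out by hand as an identity, the gradient step uses the full batch of values for simplicity, and ties between two centers are resolved in favor of the lower one.

```python
def quantize(value, centers):
    """Forward pass: map a value to its nearest quantization center."""
    return min(centers, key=lambda c: abs(value - c))

def quantize_backward(upstream_grad):
    """Backward pass: the identity proxy passes the gradient through."""
    return upstream_grad

def quantization_mse(values, centers):
    """Mean squared quantization error for the given centers."""
    return sum((v - quantize(v, centers)) ** 2 for v in values) / len(values)

def sgd_step(values, centers, lr):
    """One gradient step on the MSE with respect to the centers."""
    grads = [0.0] * len(centers)
    for v in values:
        k = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        grads[k] += -2.0 * (v - centers[k]) / len(values)
    return [c - lr * g for c, g in zip(centers, grads)]

values = [0.0, 0.2, 1.0]
centers = [0.0, 1.0]
updated = sgd_step(values, centers, lr=0.5)
print(quantization_mse(values, centers), quantization_mse(values, updated))
```

Each step nudges a center toward the mean of the values assigned to it, so repeated steps approach the closed-form Lloyd update while changing the centers more gradually.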
Without prior knowledge of the categorical distributions of the quantized codes , we choose to work with discretized MoG distributions, whose parameters are predicted by the proposed CCNs. We write the differentiable MoG distribution with components as
(15) |
where , and are the mixture weight, mean, and variance of the -th component, respectively. Then,
(16) |
where is the quantization bin that lies in.
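The discretization of the mixture can be sketched by integrating each Gaussian component over the quantization bins. This is a minimal sketch under stated assumptions: the bin edges are taken to be the midpoints between adjacent quantization centers, the outermost bins extend to infinity, and the components are parameterized by standard deviations rather than variances. The function names are hypothetical.

```python
import math

def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def discretized_mog_pmf(centers, weights, means, stds):
    """Probability mass of each quantization center under a Gaussian mixture."""
    midpoints = [(a + b) / 2.0 for a, b in zip(centers, centers[1:])]
    edges = [-math.inf] + midpoints + [math.inf]
    pmf = []
    for low, high in zip(edges, edges[1:]):
        mass = 0.0
        for w, m, s in zip(weights, means, stds):
            upper = 1.0 if high == math.inf else gaussian_cdf(high, m, s)
            lower = 0.0 if low == -math.inf else gaussian_cdf(low, m, s)
            mass += w * (upper - lower)
        pmf.append(mass)
    return pmf

pmf = discretized_mog_pmf(
    centers=[-1.0, 0.0, 1.0],
    weights=[0.5, 0.5],
    means=[-1.0, 1.0],
    stds=[0.5, 0.5],
)
print(pmf, sum(pmf))
```

Since the bins tile the real line, the masses sum to one whenever the mixture weights do, so the result is a valid categorical distribution for arithmetic coding.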
Now we describe the proposed entropy model in lossy image compression, which is comprised of three CCNs with the same structure, as shown in Fig. 6. Each CCN consists of nine mask convolutions with three residual connections to produce feature blocks, matching the number of components in MoG. They separately output mixture weights, means and variances to build the discretized MoG distributions. The network parameters of our CCNs constitute the parameter vector to be optimized.
Now we are able to write the empirical rate-distortion objective for the parameters as
(17) |
Here, the distortion term is preferably assessed in a perceptual space. In this paper, we optimize and evaluate our lossy image compression methods using standard MSE and a perceptual metric MS-SSIM [30]. As in lossless image compression, we combine the optimized entropy model with arithmetic coding, and measure the rate using actual bit rates.
In this section, we test the proposed CCN-based entropy models in lossless and lossy image compression by comparing them to state-of-the-art image coding standards and recent deep image compression algorithms. We first collect high-quality and high-definition images from Flickr, and down-sample them to further reduce possibly visible artifacts. We crop grayscale patches of size and color patches of size as the training sets for lossless and lossy image compression, respectively. We test our models on two independent datasets, Kodak and Tecnick [40], which are widely used to benchmark compression performance. The Caffe implementations along with the pre-trained models are made available at https://github.com/limuhit/SCAE.

We train our CCN-based entropy model using the Adam stochastic optimization package [41] by minimizing the objective in Eq. (12). We start with a learning rate of , and subsequently lower it by a factor of when the loss plateaus, until . The (actual) bit rate in terms of bits per pixel (bpp) is used to quantify the compression performance, which is defined as the ratio between the total amount of bits used to code the image and the number of pixels in the image. A smaller bpp indicates better performance. For example, an uncompressed grayscale image has eight bpp.
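The bit rate definition above translates directly into a one-line computation. The helper name and the image size used here are illustrative assumptions.

```python
def bits_per_pixel(num_bytes, height, width):
    """Bit rate in bpp: total coded bits divided by the number of pixels."""
    return 8.0 * num_bytes / (height * width)

# An uncompressed 8-bit grayscale image stores one byte per pixel.
print(bits_per_pixel(num_bytes=512 * 768, height=512, width=768))
```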
We first compare the proposed CCNs with mask convolutional networks (MCNs) [25, 24], PixelCNN++ [22], and side information networks (SINs) [19] for entropy modeling. As a special case of CCNs, MCNs specify the raster coding order without using any code dividing technique (see Fig. 2). We implement our own version of MCN that inherits the network architecture from the CCN with (number of feature blocks) and (filter support). PixelCNN++ [22] is originally designed for image inpainting as a generative model. Here we adapt it for entropy modeling. Starting from a down-sampled grayscale image with a factor of four, we use a DNN of similar model complexity compared with our CCN to predict the categorical distributions from the down-sampled image. SINs summarize the side information of the codes with a separate DNN, which is helpful in probability estimation. We adopt a DNN-based autoencoder of similar model complexity as our CCN (including three stride convolutions and two residual connections) to generate the side information, which is further quantized and compressed with arithmetic coding for performance evaluation. We test five variants of SINs with different amounts of side information by changing the number of output channels. All competing models are trained on the same dataset described at the beginning of Section VI. We also introduce a fast version of our method, which we name CCN, by making the network architecture lighter (with and ) and by dividing the input image into non-overlapping patches for parallel processing.

Table I: Running time in seconds of the DNN-based entropy models on the Kodak dataset.

 | SIN | MCN | PixelCNN++ | CCN | CCN (fast version) |
---|---|---|---|---|---|
Encoding | 0.155 | 0.323 | 0.121 | 0.323 | 0.074 |
Decoding | 0.155 | 3079.68 | 0.121 | 35.28 | 0.984 |
Fig. 7 shows the bit rates of the competing methods on the Kodak dataset. The proposed CCN matches the best performing model MCN, which suggests that with the proposed zigzag coding order and code dividing technique, CCN does not exclude the most important codes from the partial context of the current code being processed. The bit rates of SINs come from two parts - the image itself and the side information. It is clear from the figure that increasing the amount of side information leads to bit savings of the image, at the cost of additional bits introduced to code the side information. In general, it is difficult to determine the amount of side information for optimal compression performance. PixelCNN++ can also be regarded as a special case of SINs, whose side information is a small image without further processing, leading to the worst performance.
We also compare CCN with the widely used lossless image compression standards, including TIFF, GIF, PNG, JPEG-LS, JPEG2000-LS, and BPG. All test results are generated by MATLAB2017. From Fig. 8, we find that CCN (along with its fast version) outperforms all competing methods on the Kodak and Tecnick datasets, achieving more than bit savings compared to the best lossless image compression standard, JPEG2000-LS.
The running time of the four types of DNN-based entropy models is tested on an NVIDIA TITAN Xp machine, and the results on the Kodak dataset are listed in Table I. For encoding, the fast version of CCN enjoys the fastest speed, followed by PixelCNN++ and SIN (the best performing variant). Despite similar encoding time, they have substantially different decoding complexities. PixelCNN++ has the fastest decoder, followed by SIN. Due to the sequential decoding nature, MCN is the slowest, taking nearly one hour to decode a grayscale image of size . Our CCN achieves a significant improvement upon MCN with the proposed code dividing technique, while maintaining nearly the same bit rates. Moreover, the fast version speeds up CCN more than times, striking a good balance between model efficiency and accuracy.
We conduct thorough ablation experiments to analyze the impact of individual components on the final compression performance. Fig. 9 shows the bit rates of CCNs with three different numbers of feature blocks () and two filter sizes (). When the filter size and the network depth are fixed, adding more feature blocks effectively increases the model capability and thus boosts the compression performance. Similarly, using a larger filter size with fixed network depth and feature block number increases the partial context, leading to better entropy modeling. Moreover, we replace the proposed zigzag coding order in CCN with the raster coding order, whose model is denoted by CCN. From Fig. 9, we observe that the performance of CCN drops significantly, only comparable to the CCN with four feature blocks, which verifies the advantages of the proposed coding order.
Average bpp | 0.100 | 0.209 | 0.362 | 0.512 | 0.671 | 0.794 |
---|---|---|---|---|---|---|
Encoding | 0.013 | 0.025 | 0.044 | 0.066 | 0.085 | 0.103 |
Decoding | 0.116 | 0.227 | 0.457 | 0.735 | 1.150 | 1.232 |
In lossy image compression, the analysis transform , the non-uniform quantizer , the CCN-based entropy model, and the synthesis transform are jointly optimized for rate-distortion performance. In early stages of training, the probability may change rapidly, which makes it difficult to track and causes instability in learning the entropy model. We find that this issue can be alleviated by a simple warmup strategy. Specifically, the analysis and synthesis transforms are trained using only the distortion term for the first epoch. We then fix the transforms and train the CCN-based entropy model until it reasonably fits the current distribution of the codes. After that, we optimize the entire method end-to-end for the remaining epochs. We use Adam with a learning rate of and gradually lower it by a factor of , until . The number of quantization centers is and the number of Gaussian components in MoG is . Fig. 10 shows the learned continuous distributions of some codes, which are typically complex and multimodal. This optimization is performed separately for each and for each distortion measure. We optimize twelve models for six bit rates and two distortion metrics (MSE and MS-SSIM). MSE is converted to peak signal-to-noise ratio (PSNR) for quantitative analysis.

We compare our methods with existing image coding standards and recent DNN-based compression models. These include JPEG [14], JPEG2000 [15], BPG [42], Agustsson17 [17], Theis17 [16], Toderici17 [28], Rippel17 [18], Mentzer18 [25], Johnston17 [29], Ballé17 [8], and Li18 [23]. Both JPEG (with 4:2:0 chroma subsampling) and JPEG2000 are based on the optimized implementations in MATLAB2017. For BPG, we adopt the latest version from its official website with the default setting. When it comes to DNN-based models for lossy image compression, the implementations are generally not available. Therefore, we report the results from their respective papers.
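The MSE-to-PSNR conversion mentioned above follows the standard definition. The sketch below assumes 8-bit images with a peak value of 255; the function name is illustrative.

```python
import math

def psnr_from_mse(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)

# Halving the MSE raises the PSNR by about 3 dB.
print(psnr_from_mse(100.0), psnr_from_mse(50.0))
```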
Fig. 11 shows the rate-distortion curves on the Kodak dataset. We find that our method optimized for MSE outperforms all competing methods at low bit rates ( bpp), except for BPG. When optimized for MS-SSIM, our method performs on par with Rippel17 and is much better than the rest. Fig. 12 shows the rate-distortion curves on the Tecnick dataset, where we observe similar trends for both PSNR and MS-SSIM. An interesting observation is that when we continue increasing the bit rate, PSNR/MS-SSIM starts to plateau, which may be due to the limited model capability. Without any constraint on rate () and quantization (), our method optimized for MSE only reaches dB on the Kodak dataset, which may be considered as an empirical upper-bound for our network structure. Preliminary results indicate that increasing the depth and width of the network leads to performance improvements at high bit rates.
We visually compare the compressed images by our method against Ballé17, Li18, JPEG2K, and BPG. Fig. 13 and Fig. 14 show sample compressed results on the Kodak and Tecnick datasets, respectively. JPEG2K and BPG exhibit artifacts (such as blocking, ringing, blurring, and aliasing) that are common to all linear transform coding methods, reflecting the underlying linear basis functions. Ballé17 is effective at suppressing ringing artifacts at the cost of over-smoothing fine structures. Li18 allocates more bits to preserve large-scale strong edges, while tending to eliminate small-scale localized features (e.g., edges, contours, and textures). In contrast, our method generates compressed images with more faithful details and less visible distortions. In addition, we investigate the effectiveness of joint optimization of the transforms and the CCN-based entropy model. A baseline method is introduced, which first optimizes the transforms for MSE (), and trains the CCN-based entropy model with learned (and fixed) transforms. As shown in Fig. 11 (a), the baseline method underperforms the jointly optimized one by 1 dB.

We report the running time of our method at six bit rates on the Kodak dataset using the same machine. It takes second to generate and second to reconstruct the image. The entropy coding time is listed in Table II, from which we see that more time is needed to encode and decode images at higher bit rates. With the help of the proposed code dividing technique, our method performs entropy decoding in around one second for images of size .
We have introduced CCNs for context-based entropy modeling. Parallel entropy encoding and decoding are achieved with the proposed coding order and code dividing technique, which can be efficiently implemented using mask convolutions. We test the CCN-based entropy model (combined with arithmetic coding) in both lossless and lossy image compression. For the lossless case, our method achieves the best compression performance, which we believe arises from the more accurate estimation of the Bernoulli distributions of the binary codes. For the lossy case, our method offers improvements both visually and in terms of rate-distortion performance over image compression standards and recent DNN-based models.
The application scope of the proposed CCN is far beyond building the entropy model in image compression. As a general probability model, CCN appears promising for a number of image processing applications. For example, we may use CCN to learn a probability model for natural images, and use it as a prior in Bayesian inference to solve various vision tasks such as image restoration [43, 44], image quality assessment [45, 46], and image generation [10, 47].

The authors would like to thank the NVIDIA Corporation for donating a TITAN Xp GPU used in this research.
Int. Conf. Machine Learning, 2016, pp. 1747–1756.
J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in Int. Conf. Learning Representations, 2018.
W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016.
M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in Eur. Conf. Comput. Vis., 2016, pp. 525–542.
T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in Neural Inf. Process. Syst., 2018, pp. 6571–6583.