Most recently proposed machine-learning-based image compression algorithms [5, 20, 17]
leverage the autoencoder structure, which transforms raw pixels into compressible latent features via stacked convolutional neural networks (CNNs). These latent features are subsequently entropy coded by exploiting their statistical redundancy. Recent works have revealed that compression efficiency can be improved by exploring conditional probabilities via the contexts of spatial neighbors and hyperpriors [17, 11, 5]. Typically, rate-distortion optimization is fulfilled by minimizing the Lagrangian cost J = R + λD during end-to-end training. Here, R is the entropy rate, and D is the distortion measured by either mean squared error (MSE) or multiscale structural similarity (MS-SSIM).
However, existing methods still present several limitations. For example, most operations, such as stacked convolutions, are performed locally with a limited receptive field, even with pyramidal decomposition. Furthermore, most works treat latent features with equal importance in both the spatial and channel dimensions, without considering the diverse visual sensitivities to various contents (such as texture and edge). Thus, attempts have been made in [11, 17] to exploit importance maps on top of latent feature vectors for adaptive bit allocation. But these methods require extra explicit signaling overhead to carry the importance maps.
In this paper, we introduce the non-local operation blocks proposed in  into the variational autoencoder (VAE) structure to capture both local and global correlations among pixels, and to generate attention masks that help yield more compact distributions of latent features and hyperpriors. Different from the existing methods in [11, 17], we use non-local processing to generate attention masks at different layers (not only for quantized features), allocating bits intelligently through end-to-end training. We also improve the context modeling of the entropy engine for better latent feature compression, by using a masked 3D CNN (i.e., 5×5×5) on the latent features to generate their conditional statistics.
Two different model implementations are provided: one is the “NLAIC joint” model, which uses both hyperpriors and spatial-channel neighbors of the latent features for context modeling; the other is the “NLAIC baseline”, with contexts derived only from hyperpriors. Our joint model outperforms all existing learned and traditional image compression methods in rate-distortion efficiency, with distortion measured by both MS-SSIM and PSNR.
To further verify the efficiency of our framework, we also conduct ablation studies on model variants, such as removing non-local operations and attention mechanisms layer by layer, as well as visual comparisons. These additional experiments provide further evidence of the superior performance of our proposed NLAIC framework over a broad dataset.
The main contributions of this paper are highlighted as follows:
We are the first to introduce non-local operations into the image compression framework to capture both local and global correlations among pixels in the original image and in the feature maps.
We apply an attention mechanism together with the aforementioned non-local operations to generate implicit importance masks that guide the adaptive processing of latent features. These masks essentially allocate more bits to the important features that are critical for reducing image distortion.
We employ a one-layer masked 3D CNN to exploit the spatial and cross-channel correlations in the latent features; its output is then concatenated with the hyperpriors to estimate the conditional statistics of the latent features, enabling more efficient entropy coding.
|Main Encoder|Main Decoder|Hyperprior Encoder|Hyperprior Decoder|Conditional Context Model|
|Conv: 5×5×192 s2|NLAM|ResBlock(×3): 3×3×192|NLAM|Masked: 5×5×5×24 s1|
|ResBlock(×3): 3×3×192|Deconv: 5×5×192 s2|Conv: 5×5×192 s2|Deconv: 5×5×192 s2|Conv: 1×1×1×48 s1|
|Conv: 5×5×192 s2|ResBlock(×3): 3×3×192|ResBlock(×3): 3×3×192|ResBlock(×3): 3×3×192|ReLU|
|NLAM|Deconv: 5×5×192 s2|Conv: 5×5×192 s2|Deconv: 5×5×192 s2|Conv: 1×1×1×96 s1|
|Conv: 5×5×192 s2|NLAM|NLAM|ResBlock(×3): 3×3×192|ReLU|
|ResBlock(×3): 3×3×192|Deconv: 5×5×192 s2|Conv: 5×5×384 s1||Conv: 1×1×1×2 s1|
|Conv: 5×5×192 s2|ResBlock(×3): 3×3×192||||
|NLAM|Conv: 5×5×3 s2||||
: “Conv” denotes a convolution layer with its kernel size and number of output channels (e.g., “Conv: 5×5×192” is a 5×5 convolution with 192 output channels). “s” is the stride (e.g., s2 means down-/up-sampling with stride 2). “NLAM” represents the non-local attention module. “(×3)” means cascading 3 residual blocks (ResBlock).
2 Related Work
Non-local Operations. Most traditional filters (such as Gaussian and mean filters) process data locally, using a weighted average of spatially neighboring pixels, which usually produces over-smoothed reconstructions. Classical non-local methods for image restoration (e.g., low-rank modeling , joint sparsity  and non-local means ) have shown superior efficiency for quality improvement by exploiting non-local correlations. Recently, non-local operations have been included in deep neural networks (DNNs) for video classification , image restoration (e.g., denoising, artifact removal and super-resolution) [13, 27], etc., with significant performance improvements reported. It is also worth pointing out that non-local operations have been applied in other scenarios, such as intra block copy in the screen content extension of High-Efficiency Video Coding (HEVC) .
Self Attention. Self-attention has been widely exploited in deep learning [15, 8, 23]. It can be described as a mapping strategy that queries a set of key-value pairs to produce an output. For example, Vaswani et al.  proposed multi-headed attention, which is extensively used for machine translation. For low-level vision tasks [27, 11, 17], the self-attention mechanism produces features with spatially adaptive activations and enables adaptive information allocation with emphasis on more challenging areas (i.e., rich textures, saliency, etc.).
In image compression, quantized attention masks are commonly used for adaptive bit allocation; e.g., Li et al.  use three layers of local convolutions and Mentzer et al.  select one of the quantized feature maps. Unfortunately, these methods require extra explicit signaling overhead. Our model adopts an attention mechanism close to [11, 17], but applies multiple layers of non-local as well as convolutional operations to automatically generate attention masks from the input image. The attention masks are applied directly to the intermediate latent features to generate the final latent features to be coded. Thus, no extra bits are needed to code the masks.
Image Compression Architectures. DNN-based image compression generally relies on well-known autoencoders. Their back-propagation scheme requires all steps to be differentiable in an end-to-end manner. Several methods (e.g., adding uniform noise , replacing the direct derivative with the derivative of the expectation , and soft-to-hard quantization ) have been developed to approximate the non-differentiable quantization process. On the other hand, entropy rate modeling of the quantized latent features is another critical issue for learned image compression. PixelCNNs  and VAEs are commonly used for entropy estimation following Bayesian generative rules. Recently, conditional probability estimation based jointly on autoregressive neighbors of the latent feature maps and hyperpriors has shown significant improvement in entropy coding.
3 Non-Local Attention Implementation
3.1 General Framework
Fig. 1 illustrates our NLAIC framework. It is built on a variational autoencoder structure , with non-local attention modules (NLAM) as basic units in both the main and hyperprior encoder-decoder pairs. The main encoder, followed by quantization, generates the quantized latent features, and the main decoder reconstructs the image from these features. The hyperprior encoder and decoder generate much smaller side information as hyperpriors. The hyperpriors, together with autoregressive neighbors of the latent features, are then processed by the conditional context model to produce the conditional probability estimates for entropy coding of the quantized latent features.
3.2 Non-local Module
Our NLAM adopts the non-local network proposed in  as a basic block. As shown in Fig. 2, the non-local module (NLM) computes the output at position i, y_i, using a weighted average of the transformed feature values over all positions j, as below:

    y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j),   (1)

where i is the location index of the output vector y and j enumerates all accessible positions of the input x. x and y share the same size. The function f(x_i, x_j) computes the correlation between x_i and x_j, and g(x_j) computes a representation of the input at position j. C(x) is a normalizing factor used to generate the final response, which is set as C(x) = Σ_{∀j} f(x_i, x_j). Note that a variety of forms for f(·) have already been discussed in . Thus, in this work, we directly use the embedded Gaussian function for f, i.e.,

    f(x_i, x_j) = e^{θ(x_i)ᵀ φ(x_j)}.   (2)

Here, θ(x_i) = W_θ x_i and φ(x_j) = W_φ x_j, where W_θ and W_φ denote cross-channel transforms implemented as 1×1 convolutions in our framework. The weights are further normalized by a softmax operation. The operation defined in Eq. (1) can then be written in matrix form  as:

    Y = softmax(θ(X)ᵀ φ(X)) g(X).   (3)

In addition, a residual connection can be applied for better convergence, as suggested in  and shown in Fig. 2, i.e.,

    Z = W_z Y + X,   (4)

where W_z is also a linear 1×1 convolution across all channels, and Z is the final output.
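To make the computation concrete, here is a minimal NumPy sketch of the embedded-Gaussian non-local operation described above, on a flattened feature map. The weight matrices stand in for the 1×1 convolutions; all names are illustrative, not the paper's implementation.

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Embedded-Gaussian non-local operation on a flattened feature map.

    x: (N, C) array with N spatial positions and C channels.
    Once spatial dims are flattened, the 1x1 convolutions theta, phi,
    g and z reduce to per-position matrix multiplies.
    """
    theta = x @ w_theta          # (N, C') queries
    phi = x @ w_phi              # (N, C') keys
    g = x @ w_g                  # (N, C') value representations
    # f(x_i, x_j) = exp(theta_i . phi_j); softmax over j plays the
    # role of the normalizing factor C(x)
    logits = theta @ phi.T       # (N, N) pairwise correlations
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    y = weights @ g              # weighted average over ALL positions j
    z = y @ w_z + x              # residual connection, w_z: (C', C)
    return z, weights
```

Every output position attends over the entire input, which is exactly what distinguishes this block from a local convolution.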
3.3 Non-local Attention Module
An importance map has been adopted in [11, 17] to adaptively allocate information to quantized latent features. For instance, we can give more bits to textured areas but fewer bits elsewhere, resulting in better visual quality at a similar bit rate. Such adaptive allocation can be implemented with an explicit mask, which must be signaled with additional bits. As aforementioned, the existing mask generation methods in [11, 17] are too simple to handle areas with more complex content characteristics.
Inspired by , we propose to use a cascade of a non-local module and regular convolutional layers to generate the attention masks, as shown in Fig. 2. The NLAM consists of two branches. The main branch uses conventional stacked networks to generate features, and the mask branch applies the NLM followed by three residual blocks , one 1×1 convolution, and a sigmoid activation to produce a joint spatial-channel attention mask M, i.e.,

    M = sigmoid(f_NLM(X)),   (5)

where M denotes the attention mask and X is the input feature tensor. f_NLM(·) represents the NLM with its subsequent three residual blocks and 1×1 convolution, as shown in Fig. 2. This attention mask M, with each element m ∈ (0, 1), is element-wise multiplied with the feature maps from the main branch to perform adaptive processing. Finally, a residual connection is added for faster convergence.
We avoid any batch normalization (BN) layers and use only one ReLU in our residual blocks, a choice justified by our experimental observations.
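The mask-then-modulate structure of the NLAM can be sketched in a few lines; the two branch functions here are placeholders for the actual stacked networks, not the paper's layers.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def nlam(x, main_branch, mask_branch):
    """Non-local attention module skeleton:
    out = main(x) * sigmoid(mask(x)) + x."""
    m = sigmoid(mask_branch(x))     # joint spatial-channel mask in (0, 1)
    return main_branch(x) * m + x   # adaptive scaling + residual shortcut
```

Because the mask is produced inside the network and multiplied into the features before quantization, it never has to be transmitted.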
Note that in existing learned image compression methods, particularly those with superior performance [4, 5, 18, 14], GDN activation has proven more efficient than ReLU, tanh, sigmoid, leakyReLU, etc. This may be because GDN captures the global information across all feature channels at the same pixel location. However, we use only the simple ReLU function and rely on our proposed NLAM to capture both local and global correlations. We also find through experiments that inserting two NLAMs in each of the main encoder and decoder, and one NLAM in each of the hyperprior encoder and decoder, provides the best performance. As will be shown in Section 4, our NLAIC demonstrates state-of-the-art coding efficiency.
3.4 Entropy Rate Modeling
Previous sections present our novel NLAM scheme to transform the input pixels into more compact latent features. This section details the entropy rate modeling part that is critical for the overall rate-distortion efficiency.
3.4.1 Context Modeling Using Hyperpriors
Similar to , a non-parametric, fully factorized density model is used for the hyperpriors ẑ, described as:

    p_{ẑ|ψ}(ẑ|ψ) = Π_i p_{ẑ_i|ψ^(i)}(ẑ_i | ψ^(i)),   (6)

where ψ^(i) represents the parameters of each univariate distribution p_{ẑ_i|ψ^(i)}.
For the quantized latent features x̂, each element x̂_i can be modeled as a conditional Gaussian distribution:

    p_{x̂|ẑ}(x̂|ẑ) = Π_i N(μ_i, σ_i²)(x̂_i),   (7)

where μ_i and σ_i are predicted using the distribution of ẑ. We evaluate the bits of x̂ and ẑ using:

    R_x̂ = −Σ_i log₂ p_{x̂_i|ẑ}(x̂_i|ẑ),   R_ẑ = −Σ_i log₂ p_{ẑ_i|ψ^(i)}(ẑ_i|ψ^(i)).   (8)

Usually, we take ẑ as side information for estimating μ_i and σ_i; it occupies only a very small fraction of the total bits, as shown in Fig. 3.
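Under such a Gaussian entropy model, the rate of each quantized element can be estimated as the probability mass its value receives. The NumPy sketch below integrates the Gaussian over a unit-width quantization bin, a common convention in learned compression that we adopt here as an assumption; names are illustrative.

```python
import numpy as np
from math import erf

def gaussian_cdf(v):
    # standard normal CDF, vectorized via math.erf
    return 0.5 * (1.0 + np.vectorize(erf)(v / np.sqrt(2.0)))

def feature_bits(x_hat, mu, sigma):
    """Estimated bits for quantized latents under N(mu, sigma^2),
    integrating the density over a unit-width quantization bin."""
    upper = gaussian_cdf((x_hat + 0.5 - mu) / sigma)
    lower = gaussian_cdf((x_hat - 0.5 - mu) / sigma)
    p = np.clip(upper - lower, 1e-9, 1.0)   # avoid log(0)
    return -np.log2(p).sum()
```

Sharper predictions (small sigma centered on the true value) yield near-zero bits; uncertain predictions cost more, which is what drives the encoder toward compact latents.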
3.4.2 Context Modeling Using Neighbors
PixelCNNs and PixelRNNs  have been proposed for effective modeling of the probabilistic distribution of images using local neighbors in an autoregressive way. They have been further extended for adaptive context modeling in compression frameworks with noticeable improvement. For example, Minnen et al.  proposed to extract autoregressive information with a 2D 5×5 masked convolution, combined with hyperpriors via stacked 1×1 convolutions, for probability estimation. It is the first deep-learning-based method to achieve better PSNR than BPG444 at the same bit rate.
In our NLAIC, we use a one-layer 5×5×5 3D masked convolution to exploit the spatial and cross-channel correlation. For simplicity, a 3×3×3 example is shown in Fig. 5.
Traditional 2D PixelCNNs need to search for a well-structured channel order to exploit the conditional probability efficiently. Instead, our proposed 3D masked convolutions implicitly exploit the correlation among adjacent channels. Compared to the 2D masked CNN used in , our 3D CNN approach significantly reduces the number of network parameters for conditional context modeling. Leveraging the additional contexts from neighbors in an autoregressive fashion, we obtain a better conditional Gaussian distribution to model the entropy:

    p_{x̂|ẑ}(x̂|ẑ) = Π_i p_{x̂_i | x̂_{i−1}, …, x̂_{i−k}, ẑ}(x̂_i) = Π_i N(μ_i, σ_i²)(x̂_i),   (10)

where x̂_{i−1}, …, x̂_{i−k} denote the causal (and possibly reconstructed) pixels prior to the current pixel x̂_i.
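A 3D causal mask of the kind described above can be built by keeping only kernel positions that strictly precede the center in (channel, row, column) raster order. This is a small illustrative sketch; the paper's exact masking convention may differ.

```python
import numpy as np

def causal_mask_3d(k=5):
    """Binary mask for a k x k x k 3D masked convolution: only positions
    strictly before the center in (channel, row, col) raster-scan order
    are kept, so the current element never sees itself or its future."""
    mask = np.zeros((k, k, k), dtype=np.float32)
    c = k // 2
    for ch in range(k):
        for r in range(k):
            for col in range(k):
                # tuple comparison == lexicographic raster-scan order
                if (ch, r, col) < (c, c, c):
                    mask[ch, r, col] = 1.0
    return mask
```

Multiplying the convolution weights by this mask before each forward pass enforces the autoregressive dependency across both space and channels, with no hand-crafted channel ordering.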
4.1 Training
We use the COCO  and CLIC  datasets to train our NLAIC framework. We randomly crop images into 192×192×3 patches for subsequent learning. Rate-distortion optimization (RDO) is applied for end-to-end training at various bit rates, i.e.,

    L = λ·D(x, x̂) + R_x̂ + R_ẑ,   (11)

where D(x, x̂) is the distortion between the reconstructed image x̂ and the original image x. Both negative MS-SSIM and MSE are used in our work as distortion losses, marked as “MS-SSIM opt.” and “MSE opt.”, respectively. R_x̂ and R_ẑ represent the estimated bit rates of the latent features and the hyperpriors, respectively. Note that all components of our NLAIC are trained together. We set the same initial learning rate (LR) for all modules; the LR of the conditional context model is reduced after 30 epochs. The batch size is set to 16, and the entire model is trained on 4 GPUs in parallel.
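The RDO objective can be sketched as a simple Lagrangian combining MSE distortion with the total rate in bits per pixel; the function name and the bpp normalization here are illustrative, not the training code.

```python
import numpy as np

def rd_loss(x, x_rec, bits_latent, bits_hyper, lam, num_pixels):
    """Rate-distortion Lagrangian: lambda * D + R, with D = MSE and the
    two rate terms (latents + hyperpriors) expressed in bits per pixel."""
    mse = np.mean((x - x_rec) ** 2)
    bpp = (bits_latent + bits_hyper) / num_pixels
    return lam * mse + bpp
```

Sweeping lambda trades distortion against rate: a larger lambda penalizes reconstruction error more, producing higher-rate, higher-quality models.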
To understand the contribution of context modeling using spatial-channel neighbors, we offer two different implementations: one is the “NLAIC baseline”, which only uses the hyperpriors to estimate the means and variances of the latent features (see Eq. (7)); the other is the “NLAIC joint”, which uses both hyperpriors and previously coded pixels in the latent feature maps (see Eq. (10)). In this work, we first train the “NLAIC baseline” models. To train the “NLAIC joint” model, one way is to fix the main and hyperprior encoders and decoders of the baseline model and update only the conditional context model. Compared with the “NLAIC baseline”, such a transfer-learning-based “NLAIC joint” provides a 3% bit rate reduction at the same distortion. Alternatively, we can use the baseline models as the starting point and refine all modules of the “NLAIC joint” system. In this way, “NLAIC joint” offers more than 9% bit rate reduction over the “NLAIC baseline” at the same quality. Thus, we choose the latter for better performance.
4.2 Performance Efficiency
We evaluate our NLAIC models by comparing rate-distortion performance averaged over the publicly available Kodak dataset. Fig. 4 shows the performance when distortion is measured by MS-SSIM and PSNR, respectively, both widely used in image and video compression tasks. Here, PSNR represents pixel-level distortion while MS-SSIM describes structural similarity; MS-SSIM is reported to correlate better with human perception, particularly at low bit rates . As we can see, our NLAIC provides state-of-the-art performance with a noticeable margin over the existing leading methods, such as Ballé2019  and Ballé2018 .
Specifically, as shown in Fig. 4 using MS-SSIM as both the loss and the final distortion measurement, the “NLAIC baseline” outperforms the existing methods, while the “NLAIC joint” presents an even larger performance margin. When MSE is used as the loss and PSNR as the distortion measurement, “NLAIC joint” still offers the best performance, as illustrated in Fig. 4. The “NLAIC baseline” is slightly worse than the model in  that, like our “NLAIC joint”, uses contexts from both hyperpriors and neighbors, but it is better than the work  that only uses hyperpriors for context modeling, which is the fairer comparison. Fig. 6 compares the average BD-Rate reductions of various methods over the legacy JPEG encoder. Our “NLAIC joint” model shows 64.39% and 12.26% BD-Rate  reductions against JPEG420 and BPG444, respectively.
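BD-Rate summarizes the average bitrate difference between two rate-distortion curves. Below is a compact NumPy sketch of the standard Bjontegaard calculation (cubic fit of log-rate versus quality, averaged over the overlapping quality range); argument names are illustrative.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard-delta rate: average % bitrate change of the test codec
    vs. the anchor, via cubic fits of log-rate as a function of quality."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    pa = np.polyfit(psnr_anchor, lr_a, 3)
    pt = np.polyfit(psnr_test, lr_t, 3)
    # integrate over the overlapping quality range only
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(pa), np.polyint(pt)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_t = (np.polyval(it, hi) - np.polyval(it, lo)) / (hi - lo)
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0
```

A negative value means the test codec needs fewer bits than the anchor at the same quality, which is the sense in which the reductions above are reported.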
4.3 Ablation Studies
We further analyze our NLAIC in the following aspects:
Impacts of NLAM:
To further examine the efficiency of the newly introduced NLAM, we gradually remove the mask branches of the NLAM pairs and retrain our framework for performance evaluation. For this study, we use the baseline context modeling in all cases, with MSE as the loss function and PSNR as the final distortion measurement, as shown in Fig. 7. For illustrative purposes, we also provide two anchors, “Ballé2018”  and “NLAIC joint”. However, to see the degradation caused by gradually removing the mask branches in the NLAMs, one should compare against the “NLAIC baseline” curve.
Removing the mask branches of the first NLAM pair in the main encoder-decoder (referred to as “remove_first”) yields a PSNR drop of about 0.1 dB compared to the “NLAIC baseline” at the same bit rate. The PSNR drop is enlarged noticeably when removing the mask branches of all NLAM pairs in the main encoder-decoder (“remove_main”). The worst performance occurs when further disabling the mask branches of the NLAM pair in the hyperprior encoder-decoder, which reduces the model to a traditional variational autoencoder without non-local characteristics (“remove_all”).
Impacts of Joint Contexts Modeling: We further compare conditional context modeling efficiency of the model variants in Fig. 8. As we can see, with embedded NLAM and joint contexts modeling, our “NLAIC joint” could provide more powerful latent features, and more compact normalized feature prediction error, both contributing to its leading coding efficiency.
Impacts of Hyperpriors: Hyperpriors make a noticeable contribution to the overall compression performance [18, 5]. Their bit percentage decreases as the overall bit rate increases, as shown in Fig. 3. The percentage of hyperprior bits for the MSE-optimized model is higher than for the MS-SSIM-optimized one. Another interesting observation is that the joint and baseline models exhibit opposite hyperprior bit distributions under the MSE- and MS-SSIM-based schemes, respectively. More exploration is desired in this aspect to understand the bit allocation of hyperpriors in our future study.
4.4 Visual Comparison
We also evaluate our method on the BSD500  dataset, which is widely used in image restoration problems. Fig. 9 shows the results of different image codecs at similar bit rates. Our NLAIC provides the best subjective quality at a relatively smaller bit rate. (In practice, some bit rate points cannot be reached by BPG and JPEG; thus we choose the closest ones to match our NLAIC bit rate.)
Considering that MS-SSIM-optimized results show much lower PSNR at high bit rates in Fig. 4, we also compare our models optimized for the PSNR and MS-SSIM losses in the high-bit-rate scenario. We find that MS-SSIM-optimized results exhibit worse details than PSNR-optimized models at high bit rates, as shown in Fig. 10. This may be because pixel distortion becomes more significant at high bit rates, whereas structural similarity carries more weight at low bit rates. It would be interesting to explore a metric that combines the advantages of PSNR at high bit rates and MS-SSIM at low bit rates for overall optimal efficiency.
5 Conclusion
In this paper, we proposed a non-local attention optimized deep image compression (NLAIC) method and achieved state-of-the-art performance. Specifically, we introduced the non-local operation to capture both local and global correlations for more compact latent feature representations. Together with the attention mechanism, we enable adaptive processing of latent features by allocating more bits to important areas using the attention maps generated by non-local operations. Joint contexts from autoregressive spatial-channel neighbors and hyperpriors are leveraged to improve entropy coding efficiency.
Our NLAIC outperforms the existing image compression methods, including well known BPG, JPEG2000, JPEG as well as the most recent learning based schemes [18, 5, 20], in terms of both MS-SSIM and PSNR evaluation at the same bit rate.
For future study, we can make our context model deeper to improve compression performance. Parallelization and acceleration are important for deploying the model in practice, particularly on mobile platforms. In addition, it is also worthwhile to extend our framework to end-to-end video compression with more priors acquired from spatial and temporal information.
-  The Berkeley segmentation dataset and benchmark.
-  Challenge on learned image compression 2018.
-  E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pages 1141–1151, 2017.
-  J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.
-  J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.
-  G. Bjontegaard. Calculation of average PSNR differences between RD-curves. VCEG-M33, 2001.
-  A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In , volume 2, pages 60–65. IEEE, 2005.
-  O. Firat, K. Cho, and Y. Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073, 2016.
-  S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2862–2869, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang. Learning convolutional networks for content-weighted image compression. arXiv preprint arXiv:1703.10553, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang. Non-local recurrent network for image restoration. In Advances in Neural Information Processing Systems, pages 1680–1689, 2018.
-  H. Liu, T. Chen, P. Guo, Q. Shen, and Z. Ma. Gated context model with embedded priors for deep image compression. arXiv preprint arXiv:1902.10480, 2019.
-  M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
-  J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In 2009 IEEE 12th International Conference on Computer Vision (ICCV), pages 2272–2279. IEEE, 2009.
-  F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool. Conditional probability models for deep image compression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2018.
-  D. Minnen, J. Ballé, and G. D. Toderici. Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pages 10794–10803, 2018.
-  A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
-  O. Rippel and L. Bourdev. Real-time adaptive image compression. arXiv preprint arXiv:1705.05823, 2017.
-  G. J. Sullivan, T. Wiegand, et al. Rate-distortion optimization for video compression. IEEE signal processing magazine, 15(6):74–90, 1998.
-  G. Toderici, D. Vincent, N. Johnston, S.-J. Hwang, D. Minnen, J. Shor, and M. Covell. Full resolution image compression with recurrent neural networks. CoRR, abs/1608.05148, 2016.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
-  X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
-  X. Xu, S. Liu, T. Chuang, Y. Huang, S. Lei, K. Rapaka, C. Pang, V. Seregin, Y. Wang, and M. Karczewicz. Intra block copy in hevc screen content coding extensions. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 6(4):409–419, Dec 2016.
-  Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu. Residual non-local attention networks for image restoration. 2018.