1 Introduction
Most recently proposed machine learning based image compression algorithms [5, 20, 17] leverage the autoencoder structure, which transforms raw pixels into compressible latent features via stacked convolutional neural networks (CNNs). These latent features are subsequently entropy coded by exploiting their statistical redundancy. Prior works have shown that compression efficiency can be improved by exploiting conditional probabilities via the contexts of spatial neighbors and hyperpriors [17, 11, 5]. Typically, rate-distortion optimization [21] is fulfilled by minimizing the Lagrangian cost $J = R + \lambda D$ during end-to-end training. Here, $R$ is the entropy rate, and $D$ is the distortion measured by either mean squared error (MSE) or multi-scale structural similarity (MS-SSIM) [25].
However, existing methods still present several limitations. For example, most operations, such as stacked convolutions, are performed locally with a limited receptive field, even with pyramidal decomposition. Furthermore, most works treat latent features with equal importance in both the spatial and channel dimensions, without considering the diverse visual sensitivities to various contents (such as textures and edges). Attempts have therefore been made in [11, 17] to exploit importance maps on top of latent feature vectors for adaptive bit allocation. But these methods require extra explicit signaling overhead to carry the importance maps.
In this paper, we introduce the non-local operation blocks proposed in [24] into the variational autoencoder (VAE) structure to capture both local and global correlations among pixels, and to generate attention masks that help yield more compact distributions of latent features and hyperpriors.
Different from the existing methods in [11, 17], we use non-local processing to generate attention masks at different layers (not only for quantized features), so that bits are allocated intelligently through end-to-end training. We also improve the context modeling of the entropy engine for better latent feature compression, by applying a masked 3D CNN (i.e., 5×5×5) to the latent features to generate their conditional statistics.
Two different model implementations are provided: “NLAIC joint”, which uses both hyperpriors and spatial-channel neighbors in the latent features for context modeling, and “NLAIC baseline”, which uses contexts only from hyperpriors. Our joint model outperforms all existing learned and traditional image compression methods in terms of rate-distortion efficiency, for distortion measured by both MS-SSIM and PSNR.
To further verify the efficiency of our framework, we also conduct ablation studies on model variants, such as removing non-local operations and attention mechanisms layer by layer, and provide visual comparisons. These additional experiments provide further evidence of the superior performance of our proposed NLAIC framework over a broad dataset.
The main contributions of this paper are highlighted as follows:

We are the first to introduce non-local operations into a compression framework to capture both local and global correlations among the pixels in the original image and feature maps.

We apply an attention mechanism together with the aforementioned non-local operations to generate implicit importance masks that guide the adaptive processing of latent features. These masks essentially allocate more bits to the more important features that are critical for reducing the image distortion.

We employ a one-layer masked 3D CNN to exploit the spatial and cross-channel correlations in the latent features; its output is then concatenated with hyperpriors to estimate the conditional statistics of the latent features, enabling more efficient entropy coding.
Main Encoder | Main Decoder | Hyperprior Encoder | Hyperprior Decoder | Conditional Context Model
--- | --- | --- | --- | ---
Conv: 5×5×192 s2 | NLAM | ResBlock(×3): 3×3×192 | NLAM | Masked: 5×5×5×24 s1
ResBlock(×3): 3×3×192 | Deconv: 5×5×192 s2 | Conv: 5×5×192 s2 | Deconv: 5×5×192 s2 | Conv: 1×1×1×48 s1
Conv: 5×5×192 s2 | ResBlock(×3): 3×3×192 | ResBlock(×3): 3×3×192 | ResBlock(×3): 3×3×192 | ReLU
NLAM | Deconv: 5×5×192 s2 | Conv: 5×5×192 s2 | Deconv: 5×5×192 s2 | Conv: 1×1×1×96 s1
Conv: 5×5×192 s2 | NLAM | NLAM | ResBlock(×3): 3×3×192 | ReLU
ResBlock(×3): 3×3×192 | Deconv: 5×5×192 s2 | | Conv: 5×5×384 s1 | Conv: 1×1×1×2 s1
Conv: 5×5×192 s2 | ResBlock(×3): 3×3×192 | | |
NLAM | Conv: 5×5×3 s2 | | |
Table 1: “Conv” denotes a convolution layer, given as kernel size followed by the number of output channels. “s” is the stride (e.g., s2 means down/upsampling with stride 2). NLAM denotes the non-local attention module. “×3” means cascading 3 residual blocks (ResBlock).
2 Related Work
Non-local Operations. Most traditional filters (such as Gaussian and mean filters) process data locally, using a weighted average of spatially neighboring pixels, which usually produces over-smoothed reconstructions. Classical non-local methods for image restoration (e.g., low-rank modeling [9], joint sparsity [16] and non-local means [7]) have shown superior efficiency for quality improvement by exploiting non-local correlations. Recently, non-local operations have been included in deep neural networks (DNNs) for video classification [24] and image restoration (e.g., denoising, artifact removal and super-resolution) [13, 27], with significant performance improvement reported. It is also worth pointing out that non-local operations have been applied in other scenarios, such as intra block copy in the screen content coding extension of High-Efficiency Video Coding (HEVC) [26].
Self Attention. The self-attention mechanism is widely used in deep learning based natural language processing (NLP) [15, 8, 23]. It can be described as a mapping strategy that maps a query and a set of key-value pairs to an output. For example, Vaswani et al. [23] have proposed multi-headed attention methods which are extensively used for machine translation. For low-level vision tasks [27, 11, 17], the self-attention mechanism generates features with spatially adaptive activation and enables adaptive information allocation with emphasis on more challenging areas (e.g., rich textures and salient regions).
In image compression, quantized attention masks are commonly used for adaptive bit allocation, e.g., Li et al. [11] use 3 layers of local convolutions and Mentzer et al. [17] select one of the quantized features. Unfortunately, these methods require extra explicit signaling overhead. Our model adopts an attention mechanism that is close to [11, 17] but applies multiple layers of non-local as well as convolutional operations to automatically generate attention masks from the input image. The attention masks are applied directly to the temporary latent features to generate the final latent features to be coded. Thus, there is no need to spend extra bits on coding the masks.
Image Compression Architectures. DNN based image compression generally relies on well-known autoencoders. Back propagation requires every step to be differentiable in an end-to-end manner. Several methods (e.g., adding uniform noise [4], replacing the direct derivative with the derivative of the expectation [22] and soft-to-hard quantization [3]) have been developed to approximate the non-differentiable quantization process. On the other hand, entropy rate modeling of the quantized latent features is another critical issue for learned image compression. PixelCNNs [19] and VAEs are commonly used for entropy estimation following the Bayesian generative rules. Recently, conditional probability estimation based jointly on autoregressive neighbors of the latent feature maps and hyperpriors has shown significant improvement in entropy coding.
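As a minimal sketch of the additive-noise approximation mentioned above (in the spirit of [4]), the following NumPy snippet contrasts hard rounding, as used at inference, with a uniform-noise proxy that keeps the training path differentiable. The function names and the toy latent array are illustrative assumptions and are not taken from any cited implementation.

```python
import numpy as np

def quantize_hard(latent):
    """Inference-time quantization: round every latent value to the nearest integer."""
    return np.round(latent)

def quantize_noisy(latent, rng):
    """Training-time proxy: add i.i.d. uniform noise drawn from [-0.5, 0.5)."""
    noise = rng.uniform(-0.5, 0.5, size=latent.shape)
    return latent + noise

rng = np.random.default_rng(0)
latent = rng.normal(scale=3.0, size=(4, 4))  # toy latent features (assumed values)

hard = quantize_hard(latent)
soft = quantize_noisy(latent, rng)

# Both variants perturb each latent value by at most half a quantization step.
max_hard_err = float(np.max(np.abs(hard - latent)))
max_soft_err = float(np.max(np.abs(soft - latent)))
```

Because both perturbations are bounded by half a quantization step, the noisy proxy exposes the network to errors of the same magnitude as true rounding while remaining differentiable with respect to the latent values.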
3 Non-Local Attention Implementation
3.1 General Framework
Fig. 1 illustrates our NLAIC framework. It is built on a variational autoencoder structure [5], with non-local attention modules (NLAM) as basic units in both main and hyperprior encoder-decoder pairs (i.e., $E_M$, $D_M$, $E_h$ and $D_h$). $E_M$ with quantization $Q$ is used to generate the quantized latent features, and $D_M$ decodes the features into the reconstructed image. $E_h$ and $D_h$ generate much smaller side information as hyperpriors. The hyperpriors as well as autoregressive neighbors of the latent features are then processed through the conditional context model to generate the conditional probability estimates for entropy coding of the quantized latent features.
3.2 Non-local Module
Our NLAM adopts the non-local network proposed in [24] as a basic block. As shown in Fig. 2, the non-local module (NLM) computes the output at pixel $i$, $Y_i$, using a weighted average of the transformed feature values at pixel $j$, $X_j$, as below:
$$Y_i = \frac{1}{C(X)} \sum_{\forall j} f(X_i, X_j)\, g(X_j) \tag{1}$$
where $i$ is the location index of the output vector $Y$ and $j$ is the index that enumerates all accessible positions of the input $X$. $X$ and $Y$ share the same size. The function $f(\cdot)$ computes the correlations between $X_i$ and $X_j$, and $g(\cdot)$ computes the representation of the input at position $j$. $C(X)$ is a normalizing factor to generate the final response, which is set as $C(X) = \sum_{\forall j} f(X_i, X_j)$. Note that a variety of function forms of $f(\cdot)$ have already been discussed in [24]. Thus in this work, we directly use the embedded Gaussian function for $f(\cdot)$, i.e.,
$$f(X_i, X_j) = e^{\theta(X_i)^{T} \phi(X_j)} \tag{2}$$
Here, $\theta(X_i) = W_\theta X_i$ and $\phi(X_j) = W_\phi X_j$, where $W_\theta$ and $W_\phi$ denote the cross-channel transforms using 1×1 convolution in our framework. The weights $f(X_i, X_j)$ are further modified by a softmax operation. The operation defined in Eq. (1) can be written in matrix form [24] as:
$$Y = \mathrm{softmax}\left(X^{T} W_\theta^{T} W_\phi X\right) g(X) \tag{3}$$
In addition, a residual connection can be applied for better convergence as suggested in [24], as shown in Fig. 2, i.e.,
$$Z = W_z Y + X \tag{4}$$
where $W_z$ is also a linear 1×1 convolution across all channels, and $Z$ is the final output vector.
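The non-local module with embedded Gaussian weighting and a residual connection can be sketched in NumPy as follows. This is only an illustration under stated assumptions: features are flattened to a (channels, positions) matrix, the 1×1 convolutions are written as plain matrix products, and the weight matrices, the reduced inner channel count and all names are hypothetical rather than the trained layers of NLAIC.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def non_local_module(x, w_theta, w_phi, w_g, w_z):
    """x has shape (C, N) with N = H * W positions; returns (z, attention)."""
    theta = w_theta @ x                     # (C', N): embedding of query positions
    phi = w_phi @ x                         # (C', N): embedding of key positions
    g = w_g @ x                             # (C', N): transformed feature values
    attn = softmax(theta.T @ phi, axis=1)   # (N, N): row i holds the weights over all j
    y = g @ attn.T                          # (C', N): y[:, i] = sum_j attn[i, j] * g[:, j]
    z = w_z @ y + x                         # residual connection back to C channels
    return z, attn

rng = np.random.default_rng(0)
channels, inner, height, width = 8, 4, 5, 6  # toy sizes (assumed)
x = rng.normal(size=(channels, height * width))
w_theta = rng.normal(size=(inner, channels)) * 0.1
w_phi = rng.normal(size=(inner, channels)) * 0.1
w_g = rng.normal(size=(inner, channels)) * 0.1
w_z = rng.normal(size=(channels, inner)) * 0.1

z, attn = non_local_module(x, w_theta, w_phi, w_g, w_z)
```

Each row of `attn` sums to one, so every output position is a convex combination of the transformed features at all positions, which is what gives the module its global receptive field. The attention matrix grows quadratically with the number of positions, which is one practical reason to apply such a module on downsampled feature maps.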
3.3 Non-local Attention Module
Importance maps have been adopted in [11, 17] to adaptively allocate information to quantized latent features. For instance, we can give more bits to textured areas but fewer bits elsewhere, resulting in better visual quality at a similar bit rate. Such adaptive allocation can be implemented by using an explicit mask, which must be signaled with additional bits. As mentioned above, the existing mask generation methods in [11, 17] are too simple to handle areas with more complex content characteristics.
Inspired by [27], we propose to use a cascade of a non-local module and regular convolutional layers to generate the attention masks, as shown in Fig. 2. The NLAM consists of two branches. The main branch uses conventional stacked networks to generate features, and the mask branch applies the NLM followed by three residual blocks [10], one 1×1 convolution and a sigmoid activation to produce a joint spatial-channel attention mask $M$, i.e.,
$$M = \mathrm{sigmoid}\left(F_{NLM}(X)\right) \tag{5}$$
where $M$ denotes the attention mask and $X$ is the input features. $F_{NLM}(\cdot)$ represents the operations of the NLM with the subsequent three residual blocks and 1×1 convolution, which are shown in Fig. 2. This attention mask $M$, whose elements satisfy $0 < M_k < 1$, $M_k \in \mathbb{R}$, is element-wise multiplied with the feature maps from the main branch to perform adaptive processing. Finally, a residual connection is added for faster convergence.
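The two-branch combination described above can be sketched as follows. The two branches are replaced here by hypothetical stand-ins (a single 1×1 channel-mixing step each) purely to keep the example short; in the actual NLAM the mask branch is the NLM followed by residual blocks and a 1×1 convolution, and the main branch is a stack of residual blocks.

```python
import numpy as np

def sigmoid(values):
    return 1.0 / (1.0 + np.exp(-values))

def nlam(x, main_branch, mask_branch):
    """Combine the branches: output = x + sigmoid(mask_branch(x)) * main_branch(x)."""
    mask = sigmoid(mask_branch(x))        # joint spatial-channel mask, elements in (0, 1)
    return x + mask * main_branch(x), mask

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 6))            # toy features with layout (C, H, W)
w_main = rng.normal(size=(8, 8)) * 0.1    # hypothetical stand-in weights
w_mask = rng.normal(size=(8, 8)) * 0.1

def main_branch(features):
    # Stand-in for the stacked residual blocks: one 1x1 channel-mixing step.
    return np.einsum("oc,chw->ohw", w_main, features)

def mask_branch(features):
    # Stand-in for the NLM, residual blocks and 1x1 convolution of the mask branch.
    return np.einsum("oc,chw->ohw", w_mask, features)

out, mask = nlam(x, main_branch, mask_branch)
```

Since the mask is produced from the input features themselves, the decoder can regenerate it from what it has already decoded, which is why no extra bits are needed to signal it.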
We avoid any batch normalization (BN) layers and only use one ReLU in our residual blocks, justified through our experimental observations.
Note that in existing learned image compression methods, particularly those with superior performance [4, 5, 18, 14], generalized divisive normalization (GDN) activation has proven more efficient than ReLU, tanh, sigmoid, leaky ReLU, etc. This may be because GDN captures global information across all feature channels at the same pixel location. However, we use only the simple ReLU function, and rely on our proposed NLAM to capture both local and global correlations. We also find through experiments that inserting two pairs of two layers of NLAM for the main encoder-decoder, and one layer of NLAM in the hyperprior encoder-decoder, provides the best performance. As will be shown in Section 4, our NLAIC demonstrates state-of-the-art coding efficiency.
3.4 Entropy Rate Modeling
Previous sections present our novel NLAM scheme, which transforms the input pixels into more compact latent features. This section details the entropy rate modeling, which is critical for the overall rate-distortion efficiency.
3.4.1 Context Modeling Using Hyperpriors
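Similar to [5], a non-parametric, fully factorized density model is used for the hyperpriors $\hat{z}$, which is described as:
$$p_{\hat{z}|\psi}(\hat{z} \mid \psi) = \prod_i \left( p_{z_i|\psi^{(i)}}(\psi^{(i)}) * \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)(\hat{z}_i) \tag{6}$$
where $\psi^{(i)}$ represents the parameters of each univariate distribution $p_{z_i|\psi^{(i)}}$.
For quantized latent features $\hat{x}$, each element $\hat{x}_i$ can be modeled as a conditional Gaussian distribution:
$$p_{\hat{x}|\hat{z}}(\hat{x} \mid \hat{z}) = \prod_i \left( \mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)(\hat{x}_i) \tag{7}$$
where $\mu_i$ and $\sigma_i$ are predicted using the distribution of $\hat{z}$. We evaluate the bits of $\hat{x}$ and $\hat{z}$ using:
$$R_{\hat{x}} = -\sum_i \log_2\left(p_{\hat{x}_i|\hat{z}_i}(\hat{x}_i \mid \hat{z}_i)\right) \tag{8}$$
$$R_{\hat{z}} = -\sum_i \log_2\left(p_{\hat{z}_i|\psi^{(i)}}(\hat{z}_i \mid \psi^{(i)})\right) \tag{9}$$
Usually, we take $\hat{z}$ as side information for estimating $\mu_i$ and $\sigma_i$, and $\hat{z}$ occupies only a very small fraction of the bits, as shown in Fig. 3.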
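To make the rate estimate for the latent features concrete, the sketch below evaluates a Gaussian convolved with a unit-width uniform at integer symbols, which reduces to a difference of two Gaussian CDF values, and sums the negative base-2 logarithms. The symbols, means and scales are made-up toy values, and the small probability floor is an assumption added only to avoid taking the logarithm of zero.

```python
import math

def gaussian_cdf(value, mean, scale):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((value - mean) / (scale * math.sqrt(2.0))))

def symbol_probability(symbol, mean, scale):
    """Probability mass of an integer symbol under N(mean, scale^2) * U(-1/2, 1/2)."""
    upper = gaussian_cdf(symbol + 0.5, mean, scale)
    lower = gaussian_cdf(symbol - 0.5, mean, scale)
    return max(upper - lower, 1e-9)  # floor (assumed) keeps log2 finite

def estimated_bits(symbols, means, scales):
    """Total estimated code length in bits for a sequence of quantized symbols."""
    return -sum(
        math.log2(symbol_probability(s, m, sd))
        for s, m, sd in zip(symbols, means, scales)
    )

# Toy quantized latents with predicted per-element means and scales (assumed values).
symbols = [0, 1, -2, 3]
means = [0.1, 0.8, -1.5, 0.0]
scales = [1.0, 0.5, 2.0, 1.0]

bits = estimated_bits(symbols, means, scales)

# Sanity check: the masses over a wide range of integers sum to about one.
total_mass = sum(symbol_probability(k, 0.1, 1.0) for k in range(-20, 21))
```

A symbol close to its predicted mean with a small predicted scale receives a high probability and therefore few bits, which is how better predictions of the mean and scale translate directly into rate savings.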
3.4.2 Context Modeling Using Neighbors
PixelCNNs and PixelRNNs [19] have been proposed for effective modeling of the probabilistic distribution of images using local neighbors in an autoregressive way. This idea has been further extended to adaptive context modeling in compression frameworks with noticeable improvement. For example, Minnen et al. [18] have proposed to extract autoregressive information with a 2D 5×5 masked convolution, which is combined with hyperpriors using stacked 1×1 convolutions, for probability estimation. It is the first deep-learning based method with better PSNR than BPG-444 at the same bit rate.
In our NLAIC, we use a one-layer 5×5×5 3D masked convolution to exploit the spatial and cross-channel correlations. For simplicity, a 3×3×3 example is shown in Fig. 5.
Traditional 2D PixelCNNs need to search for a well-structured channel order to exploit the conditional probability efficiently. Instead, our proposed 3D masked convolutions implicitly exploit the correlation among adjacent channels. Compared to the 2D masked CNN used in [18], our 3D CNN approach significantly reduces the network parameters for the conditional context modeling. Leveraging the additional contexts from neighbors in an autoregressive fashion, we can obtain a better conditional Gaussian distribution to model the entropy as:
$$p_{\hat{x}}(\hat{x}_i \mid \hat{x}_1, \ldots, \hat{x}_{i-1}, \hat{z}) = \left( \mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \right)(\hat{x}_i) \tag{10}$$
where $\hat{x}_1, \ldots, \hat{x}_{i-1}$ denote the causal (and possibly reconstructed) pixels prior to the current pixel $\hat{x}_i$, and $\hat{z}$ denotes the hyperpriors.
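One plausible way to build the causal mask for such a 3D convolution is sketched below. It assumes a (channel, row, column) raster-scan coding order and excludes the current position itself, so that only previously coded elements contribute; the exact ordering used in NLAIC is not spelled out in the text, so this layout and the helper name are assumptions.

```python
import numpy as np

def causal_mask_3d(kernel_size):
    """Binary mask for a cubic kernel: 1 for positions strictly before the
    centre in (channel, row, column) raster order, 0 for the centre and after."""
    total = kernel_size ** 3
    flat = np.zeros(total, dtype=np.float32)
    flat[: total // 2] = 1.0  # the centre has flat index total // 2
    return flat.reshape(kernel_size, kernel_size, kernel_size)

mask3 = causal_mask_3d(3)  # small example of the kind illustrated for 3x3x3 kernels
mask5 = causal_mask_3d(5)  # kernel size of the one-layer context model

# Applying the mask to (hypothetical) kernel weights zeroes every non-causal tap.
rng = np.random.default_rng(0)
kernel = rng.normal(size=(5, 5, 5)).astype(np.float32)
masked_kernel = kernel * mask5
```

With this layout, all taps in earlier channel slices are kept, taps in the current slice are kept only above or to the left of the centre, and all taps in later channel slices are removed, so a single kernel exploits spatial and cross-channel context at once.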
4 Experiments
4.1 Training
We use the COCO [12] and CLIC [2] datasets to train our NLAIC framework. We randomly crop images into 192×192×3 patches for subsequent learning. Rate-distortion optimization (RDO) is applied for end-to-end training at various bit rates, i.e.,
$$L = \lambda \cdot d(\hat{I}, I) + R_{\hat{x}} + R_{\hat{z}} \tag{11}$$
Here, $d(\cdot)$ is a distortion measurement between the reconstructed image $\hat{I}$ and the original image $I$. Both negative MS-SSIM and MSE are used as the distortion loss in our work, marked as “MS-SSIM opt.” and “MSE opt.”, respectively. $R_{\hat{x}}$ and $R_{\hat{z}}$ represent the estimated bit rates of the latent features and hyperpriors, respectively. Note that all components of our NLAIC are trained together. We set the learning rates (LR) for the five modules (main encoder, main decoder, hyperprior encoder, hyperprior decoder, and conditional context model) at […] in the beginning. But for […], its LR is clipped to […] after 30 epochs. Batch size is set to 16 and the entire model is trained on 4 GPUs in parallel.
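The training objective, a distortion term weighted by a Lagrange multiplier plus the two rate terms, can be illustrated with the short sketch below. It assumes MSE as the distortion, pixel values in [0, 1], and rates normalized to bits per pixel; the bit counts and the multiplier value are placeholders chosen for the example, not settings from our experiments.

```python
import numpy as np

def rd_loss(original, reconstructed, bits_latent, bits_hyper, lam):
    """Return (loss, mse, bpp) for loss = lam * MSE + (R_latent + R_hyper) per pixel."""
    height, width = original.shape[:2]
    mse = float(np.mean((original - reconstructed) ** 2))
    bpp = (bits_latent + bits_hyper) / (height * width)  # bits per pixel
    return lam * mse + bpp, mse, bpp

rng = np.random.default_rng(0)
patch = rng.uniform(0.0, 1.0, size=(192, 192, 3))  # one training patch (toy data)
noisy = np.clip(patch + rng.normal(scale=0.02, size=patch.shape), 0.0, 1.0)

# Placeholder rate estimates in bits and a placeholder multiplier.
loss, mse, bpp = rd_loss(patch, noisy, bits_latent=4000.0, bits_hyper=400.0, lam=100.0)
```

Sweeping the multiplier traces out the rate-distortion curve: a larger value penalizes distortion more strongly and yields models that operate at higher bit rates.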
To understand the contribution of context modeling using spatial-channel neighbors, we offer two different implementations: one is “NLAIC baseline”, which uses only the hyperpriors to estimate the means and variances of the latent features (see Eq. (7)), while the other is “NLAIC joint”, which uses both hyperpriors and previously coded pixels in the latent feature maps (see Eq. (10)). In this work, we first train the “NLAIC baseline” models. To train the “NLAIC joint” model, one way is to fix the main and hyperprior encoders and decoders from the baseline model, and update only the conditional context model. Compared with “NLAIC baseline”, this transfer learning based “NLAIC joint” provides a 3% bit rate reduction at the same distortion. Alternatively, we can use the baseline models as the starting point and refine all the modules in the “NLAIC joint” system. In this way, “NLAIC joint” offers more than 9% bit rate reduction over “NLAIC baseline” at the same quality. Thus, we choose the latter for better performance.
4.2 Performance Efficiency
We evaluate our NLAIC models by comparing the rate-distortion performance averaged over the publicly available Kodak dataset. Fig. 4 shows the performance when distortion is measured by MS-SSIM and PSNR, respectively, both of which are widely used in image and video compression tasks. Here, PSNR represents pixel-level distortion while MS-SSIM describes structural similarity. MS-SSIM is reported to correlate better with human visual perception, particularly at low bit rates [25]. As we can see, our NLAIC provides state-of-the-art performance with a noticeable margin over the existing leading methods, such as Ballé2019 [18] and Ballé2018 [5].
Specifically, as shown in Fig. 4, when MS-SSIM is used for both the loss and the final distortion measurement, “NLAIC baseline” outperforms the existing methods while “NLAIC joint” presents an even larger performance margin. When MSE is used as the loss and PSNR as the distortion measurement, “NLAIC joint” still offers the best performance, as illustrated in Fig. 4. “NLAIC baseline” is slightly worse than the model in [18], which, like our “NLAIC joint”, uses contexts from both hyperpriors and neighbors, but better than the work in [5], which uses only hyperpriors for context modeling and is therefore the fair comparison. Fig. 6 compares the average BD-Rate reductions of various methods over the legacy JPEG encoder. Our “NLAIC joint” model shows 64.39% and 12.26% BD-Rate [6] reduction against JPEG-420 and BPG-444, respectively.
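For readers unfamiliar with the BD-Rate metric [6], the sketch below follows the commonly used formulation: fit a cubic polynomial to log-rate as a function of PSNR for each codec, integrate both fits over the overlapping PSNR range, and convert the average log-rate difference into a percentage. The rate-distortion points are synthetic and chosen only so that the behavior is easy to check; they are not measurements from our experiments, and the function name is hypothetical.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit rate difference (in percent) of the test codec versus the anchor."""
    log_anchor = np.log10(rate_anchor)
    log_test = np.log10(rate_test)

    # Cubic fit of log-rate as a function of quality for each codec.
    poly_anchor = np.polyfit(psnr_anchor, log_anchor, 3)
    poly_test = np.polyfit(psnr_test, log_test, 3)

    # Integrate both fits over the common PSNR interval.
    low = max(min(psnr_anchor), min(psnr_test))
    high = min(max(psnr_anchor), max(psnr_test))
    int_anchor = np.polyval(np.polyint(poly_anchor), [low, high])
    int_test = np.polyval(np.polyint(poly_test), [low, high])

    avg_anchor = (int_anchor[1] - int_anchor[0]) / (high - low)
    avg_test = (int_test[1] - int_test[0]) / (high - low)
    return float((10.0 ** (avg_test - avg_anchor) - 1.0) * 100.0)

# Synthetic RD points: the test codec spends 20% fewer bits at every PSNR level.
psnr_points = [30.0, 33.0, 36.0, 39.0]
anchor_rates = [0.2, 0.4, 0.8, 1.6]
test_rates = [0.8 * rate for rate in anchor_rates]

saving = bd_rate(anchor_rates, psnr_points, test_rates, psnr_points)
```

By construction the synthetic test curve uses 20% fewer bits at every quality level, so the returned value should be close to −20; a negative BD-Rate means the test codec needs fewer bits than the anchor for the same quality.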
4.3 Ablation Studies
We further analyze our NLAIC in the following aspects:
Impacts of NLAM: To further study the efficiency of the newly introduced NLAM, we gradually remove the mask branches in the NLAM pairs and retrain our framework for performance evaluation. For this study, we use the baseline context modeling in all cases, with MSE as the loss function and PSNR as the final distortion measurement, as shown in Fig. 7. For reference, we also provide two anchors, i.e., “Ballé2018” [5] and “NLAIC joint”. However, to see the degradation caused by gradually removing the mask branches in the NLAMs, one should compare with the “NLAIC baseline” curve.
Removing the mask branches of the first NLAM pair in the main encoder-decoder (referred to as “remove_first”) yields a PSNR drop of about 0.1 dB compared to “NLAIC baseline” at the same bit rate. The PSNR drop is enlarged noticeably when removing the mask branches of all NLAM pairs in the main encoder-decoder (“remove_main”). Performance is worst when the mask branches of the NLAM pair in the hyperprior encoder-decoder are also disabled, which results in a traditional variational autoencoder that does not exploit non-local characteristics (“remove_all”).
Impacts of Joint Contexts Modeling: We further compare the conditional context modeling efficiency of the model variants in Fig. 8. As we can see, with embedded NLAM and joint contexts modeling, our “NLAIC joint” provides more powerful latent features and a more compact normalized feature prediction error, both contributing to its leading coding efficiency.
Hyperpriors: Hyperpriors make a noticeable contribution to the overall compression performance [18, 5]. Their share of the total bits decreases as the overall bit rate increases, as shown in Fig. 3. The percentage of hyperprior bits for the MSE loss optimized model is higher than for the MS-SSIM loss optimized model. Another interesting observation is that the hyperprior bit percentage shows opposite trends between the joint and baseline models for the MSE and MS-SSIM loss based schemes, respectively. Further exploration is needed in our future study to understand the bit allocation of hyperpriors.
4.4 Visual Comparison
We also evaluate our method on the BSD500 [1] dataset, which is widely used in image restoration problems. Fig. 9 shows the results of different image codecs at similar bit rates. Our NLAIC provides the best subjective quality at a relatively smaller bit rate. (In practice, some bit rate points cannot be reached for BPG and JPEG, so we choose the closest one to match our NLAIC bit rate.)
Considering that the MS-SSIM loss optimized results show much lower PSNR at high bit rates in Fig. 4, we also compare our models optimized for PSNR and MS-SSIM loss, respectively, in the high bit rate scenario. We find that the MS-SSIM loss optimized results exhibit worse details than the PSNR loss optimized models at high bit rates, as shown in Fig. 10. This may be because pixel distortion becomes more significant at high bit rates, while structural similarity carries more weight at fairly low bit rates. It will be interesting to explore a better metric that combines the advantages of PSNR at high bit rates and MS-SSIM at low bit rates for overall optimal efficiency.
5 Conclusion
In this paper, we proposed a non-local attention optimized deep image compression (NLAIC) method that achieves state-of-the-art performance. Specifically, we introduced the non-local operation to capture both local and global correlations for more compact latent feature representations. Together with the attention mechanism, it enables adaptive processing of latent features by allocating more bits to important areas using the attention maps generated by non-local operations. Joint contexts from autoregressive spatial-channel neighbors and hyperpriors are leveraged to improve the entropy coding efficiency.
Our NLAIC outperforms the existing image compression methods, including the well-known BPG, JPEG2000 and JPEG as well as the most recent learning based schemes [18, 5, 20], in terms of both MS-SSIM and PSNR at the same bit rate.
For future study, we can make our context model deeper to improve image compression performance. Parallelization and acceleration are important for deploying the model in practice, particularly on mobile platforms. In addition, it is also meaningful to extend our framework to end-to-end video compression, with more priors acquired from spatial and temporal information.
References
 [1] The Berkeley segmentation dataset and benchmark.
 [2] Challenge on learned image compression 2018.
 [3] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems, pages 1141–1151, 2017.
 [4] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.
 [5] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.
 [6] G. Bjontegaard. Calculation of average PSNR differences between RD-curves. VCEG-M33, 2001.
 [7] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65. IEEE, 2005.
 [8] O. Firat, K. Cho, and Y. Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073, 2016.
 [9] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [11] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang. Learning convolutional networks for content-weighted image compression. arXiv preprint arXiv:1703.10553, 2017.
 [12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
 [13] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang. Non-local recurrent network for image restoration. In Advances in Neural Information Processing Systems, pages 1680–1689, 2018.
 [14] H. Liu, T. Chen, P. Guo, Q. Shen, and Z. Ma. Gated context model with embedded priors for deep image compression. arXiv preprint arXiv:1902.10480, 2019.
 [15] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
 [16] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In 2009 IEEE 12th International Conference on Computer Vision (ICCV), pages 2272–2279. IEEE, 2009.
 [17] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool. Conditional probability models for deep image compression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2018.
 [18] D. Minnen, J. Ballé, and G. D. Toderici. Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pages 10794–10803, 2018.
 [19] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 [20] O. Rippel and L. Bourdev. Real-time adaptive image compression. arXiv preprint arXiv:1705.05823, 2017.
 [21] G. J. Sullivan, T. Wiegand, et al. Rate-distortion optimization for video compression. IEEE Signal Processing Magazine, 15(6):74–90, 1998.
 [22] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell. Full resolution image compression with recurrent neural networks. CoRR, abs/1608.05148, 2016.
 [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 [24] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
 [25] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
 [26] X. Xu, S. Liu, T. Chuang, Y. Huang, S. Lei, K. Rapaka, C. Pang, V. Seregin, Y. Wang, and M. Karczewicz. Intra block copy in HEVC screen content coding extensions. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 6(4):409–419, Dec 2016.
 [27] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu. Residual non-local attention networks for image restoration. 2018.