Visual information accounts for more than 70% of all sensory input to the human body (https://antranik.org/the-eye-and-vision/). A representative example of visual information is an image, where objects (e.g., faces, car plates, buildings), the background, and their relative layered associations are presented to convey information. Ideally, an image is a composite model of relevant objects, on which visual perception (e.g., depth, geometry, photometric attributes) and content understanding (e.g., segmentation, classification) applications are performed.
Therefore, a straightforward application-driven image restoration is to reconstruct the embedded, meaningful objects rather than the entire scene, which promises great potential for both low-bitrate and high-level semantic applications. This led to extensive explorations of object-based image coding (OBIC) about two decades ago. However, OBIC was rarely applied in practice, even with an industry standard (e.g., MPEG-4) concluded at that time, mainly due to inefficient compact object representation and insufficient computing capability back in the early 2000s. For example, although shape-adaptive discrete wavelet and cosine transforms [2, 3] were dedicatedly developed for arbitrary shape support, they introduced a great amount of computational cost in implementation.
Recently, thanks to advances in deep learning algorithms and hardware, we envision an emerging breakthrough for this classical OBIC problem. The key issue is how to efficiently segment and compress objects from an image at the element (pixel) granularity, so as to support arbitrary-shaped representation. We therefore propose an element-wise masking mechanism that activates or deactivates pixels for arbitrary-shaped object processing, which can be easily coupled with regular stacked convolutions to efficiently exploit local spatial correlation, without resorting to dedicated shape-adaptive tool designs as in [2, 3].
Towards this goal, we use a segmentation network (e.g., DeepLab) to derive the element-wise mask, which is utilized to produce masked image layers (e.g., object, background) for subsequent parallel compression and reconstruction using the stacked convolutional neural network (CNN) based NLAIC, which efficiently leverages the properties of masking and convolutional operations for high-efficiency shape-adaptive object processing. (For simplicity, we devise a binary mask, yielding a foreground object layer and a background layer; multiple image layers can be supported easily.) All components are connected and optimized in an end-to-end learning framework, as shown in Fig. 1, where the masked image layers are processed in parallel. This system is referred to as LearntOBIC. Rate-distortion optimization (RDO) is conducted to maximize visual quality at a given bitrate budget.
Our LearntOBIC is then evaluated on the public PASCAL VOC 2012 and Kodak datasets for low-bitrate application scenarios, in comparison to the existing JPEG 2000, High-Efficiency Video Coding (HEVC)-based image compression (aka BPG), and NLAIC, offering significant visual quality improvement for object reconstruction. Ablation studies are also offered to further discuss the capacity of the proposed system.
To the best of our knowledge, this is the first work to revisit the OBIC via an end-to-end learning approach. Key novelties of this paper include: 1) End-to-end learnable framework for intelligent object-based compression by parallel processing masked image layers; 2) Element-wise masking mechanism to support arbitrary-shaped object processing that leverages the advances in object segmentation and CNN-based image compression; 3) Modularized functional components for supporting parallel processing and future extension (e.g., multiple image layers, different utility loss, sub-stream extraction, etc).
2 Related Work
We categorize relevant research activities into three major classes, i.e., image/object segmentation, learnt image compression, and non-uniform image encoding, with a focus on recent learning-driven approaches.
Image Semantic Segmentation aims to classify image pixels into different (object) instances. Long et al. first introduced an end-to-end fully convolutional network (FCN) for image segmentation, which was then improved by follow-up works such as U-Net, the feature pyramid network (FPN), DeepLab, etc. DeepLab is used in this work to produce accurate element-wise object segmentation masks, due to its superior performance from atrous convolution-based field-of-view enlargement, spatial pyramid pooling-based multiscale context combination, and fully connected conditional random field (CRF)-enhanced localization of object boundaries. In the subsequent experimental studies, we utilize ResNet-34 as the backbone for object mask generation.
Learnt Image Compression.
Deep CNN-based image compression methods have been studied extensively and have shown promising progress in coding efficiency. These improvements are mainly attributed to the variational autoencoder (VAE) with autoregressive neighbors and hyperpriors [13, 7], nonlinear transforms (e.g., generalized divisive normalization (GDN), non-local operations), differentiable quantization, embedded attention mechanisms [16, 17, 7], etc. Recently, learnt image compression methods have even surpassed HEVC-based BPG, with image quality measured by both PSNR and MS-SSIM [13, 7].
Object-Based Non-uniform Image Encoding. Images are used to convey visual information, and both humans and machines (e.g., autonomous driving) care about the embedded objects with high-quality reconstruction for better perception and understanding. Thus, non-uniform image encoding has been investigated to allocate different bitrate budgets to objects or regions of interest [18, 19, 20], yielding non-uniform reconstruction quality. However, performance suffered because it was difficult to implement pixel-wise object representation in such conventional block-based systems, and to jointly optimize segmentation and compression.
3 LearntOBIC: Object-Based Image Compression via End-to-End Learning
Leveraging the advances in learning-based image segmentation, compression, and quality control (as discussed in Section 2), we propose to integrate the segmentation network with the image compression network in a fully end-to-end learnable framework for efficient object-based image coding. We exemplify our LearntOBIC using a two-layer decomposition structure consisting of the (foreground) object and the background, as shown in Fig. 1.
3.1 Object Mask Generation
We use DeepLab to classify the input image for element-wise mask derivation. As revealed in our simulations, object segmentation accuracy is not as crucial in the compression task as in other vision tasks (e.g., recognition). Thus, we use ResNet-34 as the backbone of DeepLab to generate masks, instead of the original ResNet-101, for a balanced tradeoff between accuracy and computational complexity. As shown in Table 1, applying ResNet-34 incurs negligible compression efficiency loss despite a drop in segmentation accuracy (measured by mean intersection over union, mIoU), while offering a 2× speedup with half the GFLOPs compared to ResNet-101. Our LearntOBIC can be easily extended to support masks generated by other segmentation networks.
For the two-layer structure, we use a binary mask m to indicate the feature elements (e.g., pixels) corresponding to the foreground object and the background, respectively. We refer to m as the object mask, while (1 − m) is the background mask. Masks can be applied to the image in the pixel domain or in the feature domain via element-wise multiplication. Here, we apply the mask in the feature domain to derive the activated fMaps of the object and background, respectively:

F_o = m ⊙ F,  F_b = (1 − m) ⊙ F,

where F denotes the latent fMaps generated by the NLAIC encoder in Fig. 1 and ⊙ is element-wise multiplication. Note that m has the same height and width as F, which is of size H × W × C with C the number of channels; the identical mask m (or 1 − m) is applied to all channels. The masked fMaps are then processed in parallel for independent quantization and context modeling using hyper encoder-decoder pairs.
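The feature-domain masking above can be sketched in a few lines of NumPy. This is a minimal illustration of broadcasting one binary mask across all channels; the shapes and function name are illustrative assumptions, not the paper's code:

```python
import numpy as np

def mask_fmaps(fmaps, mask):
    """Split latent fMaps into object/background layers via element-wise masking.

    fmaps: latent features of shape (C, H, W) from the encoder.
    mask:  binary object mask of shape (H, W); the identical mask is
           broadcast to all C channels, and (1 - mask) selects the background.
    """
    m = mask[None, :, :]          # add a channel axis so broadcasting applies
    f_obj = fmaps * m             # activated object elements
    f_bg = fmaps * (1.0 - m)      # activated background elements
    return f_obj, f_bg
```

Because the two masks are complementary, the object and background fMaps sum back to the original fMaps, which is what lets the two layers be quantized and entropy-coded in parallel without losing information at the split.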
3.2 Parallel Image Layer Compression
We use NLAIC as the basic codec unit to process the object and background layers. NLAIC uses the popular stacked CNN-based VAE structure with both hyperpriors and autoregressive neighbors for context modeling.
The masked fMaps, i.e., F_o and F_b, are fed into hyper encoder-decoder pairs for accurate context modeling. The conditional probabilities p(F̂_o | ẑ_o) and p(F̂_b | ẑ_b) of the quantized latent fMaps, and the factorized probabilities p(ẑ_o) and p(ẑ_b) of the hyper fMaps, are applied to the object and background layers, respectively, for accurate entropy rate estimation, which is used for both RDO and actual binary bit generation.

The rate is estimated for the object and background layers individually, each dissipating bits at both the latent and hyper fMaps, i.e.,

R_o = −∑ log₂ p(F̂_o | ẑ_o) − ∑ log₂ p(ẑ_o),  R_b = −∑ log₂ p(F̂_b | ẑ_b) − ∑ log₂ p(ẑ_b),

with ∑ indicating the entropy sum traversing all feature elements. In contrast, we use the total distortion between the compressed result and the input image, e.g., D = d(x, x̂), which can be measured using PSNR, MS-SSIM, or even a feature loss for end-to-end learning.
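The per-layer rate estimate above is just a negative log-likelihood summed over all elements. A minimal NumPy sketch, assuming the entropy model has already produced per-element probabilities (names are illustrative):

```python
import numpy as np

def layer_rate(p_latent, p_hyper):
    """Estimated bits for one layer: entropy summed over latent and hyper
    fMap elements, i.e. -sum(log2 p(F_hat | z_hat)) - sum(log2 p(z_hat))."""
    return float(-np.sum(np.log2(p_latent)) - np.sum(np.log2(p_hyper)))
```

For example, if every element has probability 0.5, each element costs exactly one bit, so the estimated rate equals the total number of latent plus hyper elements.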
For implementation, the compressed object and background layers can be placed in separate sub-streams for subsequent multiplexing, so that an individual layer can be reconstructed by extracting only its specific sub-stream. This generally enables object-based tasks without streaming and decoding the entire image.
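One simple way to realize such extractable sub-streams is a length-prefixed container. This is a hypothetical format sketched for illustration, not the paper's bitstream syntax:

```python
import struct

def mux(substreams):
    """Concatenate layer bitstreams, each prefixed by a 4-byte big-endian
    length, so any single layer can be located without decoding the rest."""
    out = b""
    for s in substreams:
        out += struct.pack(">I", len(s)) + s
    return out

def demux_layer(stream, index):
    """Extract only the sub-stream at `index` (e.g., 0 = object layer)."""
    off = 0
    for i in range(index + 1):
        (n,) = struct.unpack_from(">I", stream, off)
        off += 4
        if i == index:
            return stream[off:off + n]
        off += n  # skip this layer's payload
```

A decoder interested only in the object layer reads the first length prefix and its payload, skipping the background bytes entirely.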
4 Experimental Studies
We choose PASCAL VOC as our dataset, given that it is widely used for object detection and segmentation. Our LearntOBIC is trained using the training sets of PASCAL VOC 2007 and PASCAL VOC 2012. Input images are sized at 320×320×3, while the segmentation mask has the same spatial size as the latent fMaps. LearntOBIC is tested using the validation set of PASCAL VOC 2012.
We first use pre-trained DeepLab with the ResNet-34 backbone and pre-trained NLAIC to initialize the LearntOBIC model, and then fine-tune it using the loss function

L = λ·D + β_o·R_o + β_b·R_b.

Here, D is the total distortion defined above; we choose MS-SSIM as the distortion metric, which is reported to correlate better with human visual perception, especially at low bit rates. The distortion and rate terms are defined in (3) and (2), respectively. By adjusting λ we achieve rate-distortion trade-offs over a variety of bitrates, while β_o and β_b adapt the bit consumption of the object and background layers; here, we set these weights to shift bits from the background to the object. The learning rate is set to a larger value at the beginning and clipped to a smaller value after 10 epochs. We keep the total bitrate below 0.1 bits per pixel (bpp) to experiment with the low-bitrate application.
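The training objective above is a plain weighted sum, which can be sketched as follows. All weight values here are illustrative defaults, not the paper's settings:

```python
def rd_loss(distortion, rate_obj, rate_bg, lam=1.0, beta_obj=1.0, beta_bg=1.0):
    """Rate-distortion objective sketch: lam trades total distortion against
    rate, while beta_obj/beta_bg shift bits between the object and background
    layers (a larger weight penalizes that layer's rate more heavily)."""
    return lam * distortion + beta_obj * rate_obj + beta_bg * rate_bg
```

For instance, raising the background weight makes background bits more expensive during training, nudging the optimizer to spend the budget on the object layer instead.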
4.2 Performance Evaluation
Objective Efficiency. Table 2 lists the averaged objective results over all test image samples in PASCAL VOC 2012, in terms of bit rate (avg. bpp), MS-SSIM, and PSNR. At such a low bit rate (about 340× compression ratio), MS-SSIM offers a more meaningful quality measurement, closer to subjective sensation. The learnt image compression methods, i.e., NLAIC and LearntOBIC, exhibit quite close perceptual indices measured by MS-SSIM, and both are better than the traditional JPEG2K and BPG.
Subjective Evaluation. We further visualize the reconstructed images processed using JPEG2K, BPG, NLAIC, and LearntOBIC in Fig. 2. Snapshots in rows #1 to #3 are from PASCAL VOC 2012. Our LearntOBIC offers significantly better visual results than the others (at even slightly smaller bit rates) by intelligently distributing bits between the object and background layers. At such a low bitrate, the traditional JPEG2K and BPG lose the capacity for fine reconstruction: severe artifacts are presented in columns one and two, clearly impairing subjective perception. Though the default NLAIC outperforms JPEG2K and BPG subjectively, noticeable artifacts (e.g., over-smoothed texture and color distortion) are still perceivable.
4.3 Ablation Studies
Masking. Note that we can also perform the masking in the pixel domain, separating the object and background layers before they are fed into the compression networks. However, pixel-domain masking produces blurrier object boundaries, as shown in Fig. 3, because convolutions in the compression network span both deactivated (i.e., zeroed after masking) and activated pixels across the object boundary. Boundary extension or padding may resolve this issue to some extent, which we leave for future study.
Bits Allocation. We further explore the relationship between bit distribution and subjective quality in LearntOBIC, as shown in Fig. 4. Given the same total bit rate, allocating more bits to the foreground object, e.g., from Fig. 4(b) to Fig. 4(a), reconstructs its texture with finer details, but deteriorates the background with slight color distortion. This is generally of interest for applications focusing on the object rather than the entire image. An interesting exploration avenue is how to automatically shift bits for task-oriented applications.
Model Generalization. As in Fig. 2, e.g., rows #4 and #5, we apply the PASCAL-trained model to the Kodak dataset directly. Our LearntOBIC still provides better visual reconstruction for Kodak images, with clearly distinguishable objects. As revealed in our work, a more robust segmentation method is highly desired for reliable performance.
5 Conclusion
We proposed an object-based image compression method, referred to as LearntOBIC, integrating a segmentation network and a compression network in an end-to-end learnable framework. With this learning-based approach, element-wise operations (e.g., masking, convolution-based transforms) efficiently support the processing of arbitrary-shaped objects. Compared with the traditional JPEG2K, HEVC-based BPG, as well as the recent learning-based NLAIC, our LearntOBIC offers much improved visual quality in application scenarios at very low bitrates.
OBIC itself is an interesting problem, since both humans and machines pay more attention to particular or salient objects within an image, rather than the entire scene, in task-oriented applications. This work is our preliminary attempt to revisit the classic OBIC defined almost two decades ago, and many interesting problems remain for further investigation. For example, segmentation is generally content dependent; how to make it more robust within the LearntOBIC framework is worth deeper study. On the other hand, how to use the sub-stream corresponding to an image object layer, and how to distribute bits intelligently, are crucial for object-based visual tasks.
-  A. Puri and A. Eleftheriadis, “MPEG-4: An object-based multimedia coding standard supporting mobile applications,” Mobile Networks and Applications, vol. 3, no. 1, pp. 5–32, 1998.
-  T. Sikora and B. Makai, “Shape-adaptive DCT for generic coding of video,” IEEE Trans. Circuits and Systems for Video Technol., vol. 5, no. 1, pp. 59–62, 1995.
-  S. Li and W. Li, “Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual object coding,” IEEE Trans. Circuits and Systems for Video Technol., vol. 10, no. 5, pp. 725–743, 2000.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE CVPR, 2016, pp. 770–778.
-  T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “Neural image compression via non-local attention optimization and improved context modeling,” arXiv preprint arXiv:1910.06244, 2019.
-  A. Skodras, C. Christopoulos, and T. Ebrahimi, “The JPEG 2000 still image compression standard,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36–58, 2001.
-  G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits and Systems for video technol., vol. 22, no. 12, pp. 1649–1668, 2012.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE CVPR, June 2015.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
-  T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE CVPR, 2017, pp. 2117–2125.
-  D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in NIPS, 2018, pp. 10771–10780.
-  J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” arXiv:1611.01704, 2016.
-  J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” arXiv:1802.01436, 2018.
-  M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in Proceedings of the IEEE CVPR, 2018, pp. 3214–3223.
-  E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool, “Generative adversarial networks for extreme learned image compression,” in ICCV, 2019, pp. 221–231.
-  S. Han and N. Vasconcelos, “Image compression using object-based regions of interest,” in IEEE ICIP, 2006, pp. 3097–3100.
-  ——, “Object-based regions of interest for image compression,” in IEEE DCC, 2008, pp. 132–141.
-  Z. Chen, J. Han, and K. N. Ngan, “Dynamic bit allocation for multiple video object coding,” IEEE Tran. Multimedia, vol. 8, no. 6, pp. 1117–1124, 2006.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The 37-th Asilomar Conf. Signals, Systems & Computers, 2003, 2003, pp. 1398–1402.