Learning a Single Tucker Decomposition Network for Lossy Image Compression with Multiple Bits-Per-Pixel Rates

07/10/2018 ∙ by Jianrui Cai, et al.

Lossy image compression (LIC), which aims to utilize inexact approximations to represent an image more compactly, is a classical problem in image processing. Recently, deep convolutional neural networks (CNNs) have achieved interesting results in LIC by learning an encoder-quantizer-decoder network from a large amount of data. However, existing CNN-based LIC methods usually can only train a network for a specific bits-per-pixel (bpp). Such a "one network per bpp" problem limits the generality and flexibility of CNNs to practical LIC applications. In this paper, we propose to learn a single CNN which can perform LIC at multiple bpp rates. A simple yet effective Tucker Decomposition Network (TDNet) is developed, where there is a novel Tucker decomposition layer (TDL) to decompose a latent image representation into a set of projection matrices and a core tensor. By changing the rank of the core tensor and its quantization, we can easily adjust the bpp rate of the latent image representation within a single CNN. Furthermore, an iterative non-uniform quantization scheme is presented to optimize the quantizer, and a coarse-to-fine training strategy is introduced to reconstruct the decompressed images. Extensive experiments demonstrate the state-of-the-art compression performance of TDNet in terms of both PSNR and MS-SSIM indices.


I Introduction

As an indispensable step in many image processing applications, lossy image compression (LIC) is a classical yet still active topic. The goal of LIC is to reduce the image storage space without sacrificing much image quality, and thus provide an economic solution for image storage and transmission systems. Recently, with the development of portable imaging devices and social media (e.g., Facebook, Instagram and Flickr), billions of images are transmitted and stored daily on social networks [1]. The explosive growth of the number of shared images on the Internet raises higher requirements on LIC for more effective visual communication systems.

A typical LIC system contains mainly three modules: transformation (e.g., an encoder and a corresponding decoder), quantization (e.g., a quantizer), and encoding. To compress an image into a bitstream, conventional LIC methods first apply predefined transformations to map the image into a sparse domain, then perform lossy quantization on the transformed coefficients, followed by entropy coding [2]. Notwithstanding their demonstrated success, conventional LIC methods suffer from three major drawbacks. First, they generally employ a series of cascaded modules to compress an image, which may introduce cumulative errors because there are few interactions between these modules. Second, the transformations employed in these LIC methods are generally designed in a hand-crafted manner (e.g., the discrete cosine transform (DCT) for JPEG [3] and the discrete wavelet transform (DWT) for JPEG 2000 [4]), and are limited in representing the various complex structures of natural images. Third, traditional LIC methods perform poorly at low bits-per-pixel (bpp) rates, often generating severe visual artifacts (e.g., blocking, blurring and ringing).

Deep convolutional neural networks (CNNs) have recently led to a series of breakthroughs in many vision problems [5, 6, 7, 8, 9]. The flexible non-linear modelling capability and powerful end-to-end training paradigm of CNNs also make them a promising new approach to LIC. In the last several years, a flurry of CNN-based LIC methods have been proposed, including studies of network structures [10, 11, 12, 13, 14] as well as loss functions [15, 16, 17, 18]. Firstly, the end-to-end training manner enables CNN-based LIC systems to adaptively learn an effective encoder-decoder pair from a large amount of image data and in a larger context to represent more complex image structures, reducing the artifacts in the decompressed image. Secondly, by adopting specific loss functions (i.e., perceptual metrics) in the training, the CNN compressors are able to strengthen certain desired aspects (i.e., perceptual quality) of the decompressed image.

Despite the advantages of employing CNNs for compression, there are still some challenges which limit the performance of CNN-based compressors. First, existing CNN-based LIC methods can only change the number of latent feature maps and/or quantized values to adjust the bpp rate. As a result, a dedicated network has to be trained for each specific bpp rate. Such a “one network per bpp” problem limits the flexibility and applicability of CNNs to practical image compression systems. Second, because of the non-differentiable nature of discrete operations, the quantizer is hard to update during end-to-end CNN training. Therefore, the optimal decision boundaries of the quantization levels are almost unreachable. Third, existing CNN-based LIC methods usually adopt fixed quantization bins to discretize the latent image representation and treat each element of the latent image representation equally. Such a quantization scheme, however, ignores the prior knowledge that the local content of an image is spatially variant, and restricts the capability of CNNs in compressing complex image structures.

To address the aforementioned issues, in this work we propose a new paradigm for deep LIC. More specifically, we propose a deep Tucker Decomposition Network (TDNet) which takes the sparsity/low-rankness of latent image representations into consideration. The key component of TDNet is a novel Tucker decomposition layer (TDL), which decomposes the latent image representation into a set of projection matrices and a compact core tensor. By changing the rank of the core tensor and its quantization levels, we can easily adjust the bpp rate of the latent image representation, and thus a single CNN model can be trained to compress and reconstruct images at multiple bpp rates. Besides, we propose an iterative non-uniform quantization strategy to obtain the optimal quantization boundaries based on the distribution of the encoding coefficients. A coarse-to-fine training strategy is introduced to train a stable TDNet and reconstruct the decompressed images. Extensive experiments demonstrate that our proposed TDNet, trained with the mean-squared error (MSE) loss or the multi-scale structural similarity index (MS-SSIM) [19] loss, yields results competitive with state-of-the-art CNN-based LIC schemes while using only a single network.

The contributions of this work are summarized as follows:

  1. We propose an end-to-end trainable deep Tucker decomposition network, namely TDNet, which, for the first time to the best of our knowledge, enables a single network to perform LIC at multiple bpp rates.

  2. We present an iterative non-uniform quantization scheme to obtain the quantization boundaries of the tensor decomposition coefficients, and adopt a variable-bits quantization scheme to discretize the latent image representation. The proposed methods demonstrate state-of-the-art PSNR/SSIM indices and visual quality.

The remainder of this paper is organized as follows. Section II provides a brief survey of related work. Section III introduces our proposed TDNet model. Section IV presents in detail the Tucker decomposition layer. Section V presents the all-in-one training strategy. In Section VI, extensive experiments are conducted to evaluate TDNet. Finally, several concluding remarks are given in Section VII.

II Related Work

II-A Traditional Lossy Image Compression

The most prevalent LIC method is JPEG (Joint Photographic Experts Group, https://jpeg.org/), which first applies the discrete cosine transform (DCT) to non-overlapping image blocks, then quantizes the transformed DCT coefficients in the frequency domain using a predefined quantization table, followed by entropy coding such as Huffman coding or arithmetic coding [20]. As a significantly improved version of JPEG, JPEG 2000 adopts the more powerful discrete wavelet transform (DWT), instead of the DCT, to perform time-frequency analysis on images. More specifically, JPEG 2000 adopts the Cohen-Daubechies-Feauveau (CDF) 9/7 wavelet to decompose an image into multiple bands, and performs scalar quantization on the DWT coefficients, followed by Embedded Block Coding with Optimal Truncation (EBCOT) [21]. Another powerful LIC scheme is the so-called Better Portable Graphics (BPG) method (https://bellard.org/bpg/), which is built upon the intra-frame encoding scheme of the High Efficiency Video Coding (HEVC) video compression standard (https://www.itu.int/rec/T-REC-H.265). BPG has been shown to produce smaller files than JPEG and JPEG 2000 at a given quality.

Although these traditional LIC approaches have demonstrated great success, they all adopt hand-crafted transformations to map the image into some sparse domain for quantization. Hand-crafted transformations are limited in adaptively and effectively decomposing complex image structures, resulting in visual artifacts around image edges and textures, especially when the bpp rates are low. Deep neural network based LIC methods have been proposed to address these problems.

II-B Deep Lossy Image Compression

Recently, deep neural networks have been investigated for LIC and achieved promising results. As a pioneering work, Toderici et al. adopted the recurrent neural network (RNN) to encode and decode images of size 32×32 [10], and they further extended the network to compress full-resolution images [11]. Built upon the architecture proposed in [10, 11], Johnston et al. [17] modified the recurrent architecture by introducing hidden-state priming to improve spatial diffusion, and replaced the MSE loss with the MS-SSIM loss [19] to increase the visual quality of reconstructed images.

Different from the above methods which employ RNNs, the methods in [12, 13, 14, 15, 18, 16] rely on CNN-based auto-encoder architectures. Ballé et al. [12] used generalized divisive normalization as a joint nonlinearity to implement local gain control. Li et al. [14] learned a content-weighted importance map, according to which more bits are allocated to regions with rich content to preserve image edge and texture details. Rippel et al. [15] aggregated image information across different scales by exploiting a pyramidal decomposition strategy, and introduced generative adversarial networks (GANs) [22] to sharpen the edges of reconstructed images. To alleviate the vanishing gradient caused by the non-differentiable quantization operation, Theis et al. [13] introduced a smooth approximation of the derivative of the rounding function. A soft-to-hard scheme is adopted in [18] to relax the quantizer assignments.

Fig. 1: Illustration of conventional lossy image compression network architecture.

For all the aforementioned deep LIC methods, the bpp rate of the latent image representation can only be adjusted by changing the number of latent feature maps and/or quantized values, since the output of the encoder should have the same size as the input of the decoder. Thus, one network can only be trained to deal with a specific bpp rate, making these deep CNN-based LIC methods less flexible. In this work, we introduce a novel Tucker decomposition layer into the CNN, and present a TDNet scheme which enables a single network to handle multiple bpp rates for LIC.

III Deep Lossy Image Compression Model

In this section, we first summarize the pipeline of conventional CNN-based LIC methods, and then present the pipeline of our proposed TDNet. Finally, we present in detail the network architecture.

III-A Overview of Conventional LIC Network Pipeline

Existing deep LIC networks can be generally formulated as a joint rate-distortion optimization process to learn an encoder, a quantizer, and a decoder. The architecture of those networks is shown in Figure 1. Given a set of training images $\{x_i\}_{i=1}^{N}$, where $N$ is the total number of training images, deep LIC methods aim to learn a nonlinear analysis transformation encoder $E$, a quantizer $Q$, and a nonlinear synthesis transformation decoder $D$. The encoder first converts an input image $x$ into a latent feature representation $f = E(x)$. Then, the quantizer quantizes the features into discrete values $\hat{f} = Q(f)$, which can be losslessly encoded into a bitstream for transmission or storage. Once the bitstream is received by the decoder $D$, an approximation of the original image is obtained as $\hat{x} = D(\hat{f})$. Overall, the deep image compression pipeline can be formulated as:

\hat{x} = D(Q(E(x; \theta_E)); \theta_D),    (1)

where $\theta_E$ and $\theta_D$ are the parameters of the encoder $E$ and the decoder $D$, respectively.

Given a certain compression ratio, the network is expected to learn the parameters $\theta_E$ and $\theta_D$ to minimize the distortion of the reconstructed image. Note that the compression ratio can be defined as $r = b(x)/b(\hat{f})$, where $b(\cdot)$ is the function to calculate the average number of bits to store a pixel of an image. Since $b(x)$ is usually a constant for the original image without compression, we can adjust $b(\hat{f})$ to change the compression ratio $r$. For example, for an 8-bit RGB image $b(x) = 24$ bpp, so compressing the latent representation to $b(\hat{f}) = 0.3$ bpp corresponds to $r = 80$. For most of the existing CNN-based LIC methods, one can only change the number of feature maps and quantization levels to adjust $b(\hat{f})$ of the latent image representation $\hat{f}$. As a result, usually a specific network has to be trained for a certain compression ratio, or bpp rate. For a new bpp rate, a new network has to be trained by adjusting the number of latent representation feature maps and quantization levels.

III-B Proposed LIC Network Pipeline

Fig. 2: Illustration of our proposed TDNet architecture.

Our proposed TDNet is designed to achieve the objective of multiple bpp rates with a single network. The pipeline of TDNet is shown in Figure 2. Instead of directly quantizing the latent image representation into a bitstream as in conventional deep LIC methods, we introduce a novel Tucker decomposition layer (TDL) to process the latent image representation. Denote by $T$ the decomposition (and quantization) operation of the TDL, and by $T^{-1}$ the inverse operation. Given the latent image representation $f$, we use $T$ to decompose the features into three orthogonal matrices $\{U^{(n)}\}_{n=1}^{3}$ and a core tensor $\mathcal{C}$, and then quantize the decomposed components to generate the bitstream. Once the bitstream is received, with $T^{-1}$ we can de-quantize and reproduce the features by back-projecting the core tensor and the three orthogonal matrices into the approximation $\hat{f}$.

By changing the rank of the core tensor in the TDL, we can easily adjust the bpp rate, and hence the compression ratio, while keeping the size of the latent image representation unchanged. Once $\hat{f}$ is received, we can obtain an approximation of the original input image by $\hat{x} = R(D_d(\hat{f}))$, where $R$ is the reconstruction network to reproduce the decompressed image, and $D_d$ is the deconvolutional process to up-sample the latent image representation to the size of the original image. Together, the decoder can be presented as $D = R \circ D_d$. The pipeline of the proposed TDNet can be formulated as:

\hat{x} = R(D_d(T^{-1}(T(E(x; \theta_E))); \theta_{D_d}); \theta_R),    (2)

where $\theta_E$ is the parameter of the encoder $E$, and $\theta_{D_d}$ and $\theta_R$ are the parameters of the decoder components $D_d$ and $R$, respectively. Being optimized in an end-to-end manner, the network is expected to learn the parameters to minimize the distortion of the reconstructed image.
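
To make the pipeline in Eq. (2) concrete, the following is a minimal PyTorch-style sketch of how the four components fit together; the module names and the (rank, levels) interface of the TDL are our own illustrative assumptions, not the authors' released code.

    import torch.nn as nn

    class TDNet(nn.Module):
        """Sketch of Eq. (2): x -> E -> T/Q/T^{-1} -> D_d -> R -> x_hat."""
        def __init__(self, encoder, tdl, deconv, recon):
            super().__init__()
            self.encoder = encoder  # E: image -> latent tensor f
            self.tdl = tdl          # TDL: decompose, quantize, reproject f
            self.deconv = deconv    # D_d: latent -> coarse image x_tilde
            self.recon = recon      # R: residual refinement -> final x_hat

        def forward(self, x, rank, levels):
            f = self.encoder(x)
            f_hat = self.tdl(f, rank, levels)  # rank/levels select the bpp rate
            x_tilde = self.deconv(f_hat)       # intermediate output used in the loss
            x_hat = self.recon(x_tilde)
            return x_tilde, x_hat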

III-C Architecture of TDNet

Loss Function: The loss function of a LIC network defines how close or how similar the decompressed image is to the original image. Many existing deep CNN based LIC methods [16, 18] use a perceptual loss such as the MS-SSIM loss [19, 23] to strengthen the perceptual quality of the compressed image. The MS-SSIM loss can also be adopted in our TDNet to learn the LIC network. Referring to Figure 2, we require both the deconvolution output $\tilde{x}$ and the final output $\hat{x}$ of the network to be similar to the original image, resulting in the following loss function:

\mathcal{L}_{MS\text{-}SSIM} = \sum_{i=1}^{N} \big[ \big(1 - \text{MS-SSIM}(x_i, \hat{x}_i)\big) + \lambda \big(1 - \text{MS-SSIM}(x_i, \tilde{x}_i)\big) \big],    (3)

where $x_i$ refers to the $i$-th image in the training set, $\tilde{x}_i$ and $\hat{x}_i$ are the corresponding intermediate and final outputs, and $\lambda$ is a parameter to balance the loss between the intermediate deconvolution output and the final reconstruction.

Considering that most classical LIC methods such as JPEG and JPEG2000 take MSE as the objective to optimize, it is also important to validate whether a deep LIC network can achieve good MSE, or equivalently PSNR, measures. By minimizing the MSE of both the deconvolution output and the final output of the network, the MSE-oriented loss function of the proposed network can be formulated as:

\mathcal{L}_{MSE} = \sum_{i=1}^{N} \big[ \|x_i - \hat{x}_i\|_2^2 + \lambda \|x_i - \tilde{x}_i\|_2^2 \big].    (4)
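
As a quick illustration, here is a minimal sketch of the MSE-oriented loss of Eq. (4), computed per mini-batch rather than over the whole training set; the placement of the balancing parameter on the intermediate term and its default value are our assumptions.

    import torch.nn.functional as F

    def tdnet_mse_loss(x, x_tilde, x_hat, lam=0.1):
        # Joint MSE on the final output and the intermediate deconvolution
        # output, following Eq. (4); lam balances the two terms.
        return F.mse_loss(x_hat, x) + lam * F.mse_loss(x_tilde, x)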

Encoder: Our encoder network consists of 3 types of layers, which are shown with 3 different colors in Figure 2. Instead of using the Rectified Linear Unit (ReLU), we adopt Parametric Rectified Linear Units (PReLU) [24] as the activation function, since it improves the model fitting capability with little extra computational cost. Several convolution layers with a stride of 2 are utilized to downsample the feature maps, and the sigmoid function is used to project the data into the range of $[0, 1]$. Besides, Concat operations are adopted to concatenate the feature maps of two layers to ensure maximum information flow. By stacking several convolutional layers, PReLU and Concat layers, the encoder network can transform the images into a compact domain with reduced redundancy. The detailed settings of the encoder network are summarized in Table I.

Layer            Parameters
Input            —
Conv + PReLU     stride 2, pad 1
Concat blocks    stride 1, pad 1
Conv + PReLU     stride 2, pad 1
Concat blocks    stride 1, pad 1
Conv + PReLU     stride 2, pad 1
Concat blocks    stride 1, pad 1
Conv + Sigmoid   stride 1, pad 1
TABLE I: Encoder network architecture.

Tucker Decomposition Layer (TDL): Our TDL consists of two operations: $T$, which decomposes and quantizes the features into three orthogonal matrices $\{U^{(n)}\}_{n=1}^{3}$ and a core tensor $\mathcal{C}$, and $T^{-1}$, which de-quantizes and projects the core tensor and the three orthogonal matrices back into the features. By setting the ranks $(r_1, r_2, r_3)$ of the three matrices and setting the quantization level of the core tensor, the compression ratio of the network can be calculated as:

r = \frac{H W \cdot b(x)}{\text{bits}(\hat{\mathcal{C}}) + \sum_{n=1}^{3} \text{bits}(U^{(n)})},    (5)

where $H \times W$ is the spatial size of the input image and $\text{bits}(\cdot)$ counts the bits used to store the quantized core tensor and the orthogonal matrices.

To change the compression ratio, we can adjust the ranks of the decomposition matrices and the quantization levels of the core tensor, instead of retraining the network, and consequently achieve the goal of multiple bpp rates with a single network. More details of the proposed TDL can be found in Section IV.

To ensure the end-to-end training of the network, the gradient of each component should be calculated for back-propagation. Though the Tucker decomposition operation is non-differentiable and we cannot differentiate it with respect to its argument, based on the straight-through estimator on the gradient in [13], we can set the derivative of the Tucker decomposition layer as:

\frac{\partial T^{-1}(T(f))}{\partial f} := 1.    (6)

By setting the derivative to 1, the network can back-propagate the loss from the decoder to the encoder. Thus, the whole network can be trained in an end-to-end manner.
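
In an autograd framework, Eq. (6) is commonly realized with the "detach trick" below; this is a standard straight-through implementation written under our assumptions, since the paper does not show its own code.

    def straight_through(f, tdl_fn):
        # Forward: use the quantized reconstruction T^{-1}(T(f)).
        # Backward: the gradient w.r.t. f is the identity, as in Eq. (6),
        # because the non-differentiable residual is detached.
        f_hat = tdl_fn(f)
        return f + (f_hat - f).detach()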

Layer            Parameters
Input            —
Conv + PReLU     stride 1, pad 1
Concat blocks    stride 1, pad 1
Sub-pixel        upsampling
Concat blocks    stride 1, pad 1
Sub-pixel        upsampling
Concat blocks    stride 1, pad 1
Sub-pixel        upsampling
Conv             stride 1, pad 1
Conv + PReLU     stride 1, pad 1
Conv             stride 1, pad 1
Residual Sum     —
TABLE II: Decoder network architecture.

Decoder: Our decoder network consists of a deconvolutional sub-network and a reconstruction sub-network, as shown in Figure 2. The deconvolutional sub-network basically mirrors the architecture of the encoder, and the stride of all convolutional layers is set to 1 since there is no need to downsample the feature maps. To ensure that the output image has the same size as the input one, the sub-pixel layer [25] is adopted to reshape and upsample feature maps. Usually, the deconvolutional sub-network can deliver a rough approximation of the original image. The reconstruction sub-network aims to further enhance the deconvolution output by reproducing the details and textures lost in the encoding and TDL quantization process. Inspired by [8, 26], we propose to use the residual learning framework for the design of the reconstruction sub-network. It consists of five convolutional layers, PReLUs and an element-wise addition operation. With the reconstruction sub-network, the quality of the final output image can be greatly refined. The detailed settings of the decoder network are summarized in Table II.

Fig. 3: Flowchart of the proposed Tucker decomposition layer.

IV Tucker Decomposition Layer

In this section, we first introduce some necessary notations and preliminaries of tensor decomposition, and then present in detail the proposed TDL.

IV-A Notations and Preliminaries

Denote by $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ an $N$-order tensor, and denote by $x_{i_1 i_2 \cdots i_N}$ its elements, where $1 \le i_n \le I_n$. Let $U \in \mathbb{R}^{J \times I_n}$ denote a matrix. The mode-$n$ product of a tensor $\mathcal{X}$ and a matrix $U$ can be defined as [27]:

\mathcal{Y} = \mathcal{X} \times_n U \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N},    (7)

where the symbol $\times_n$ denotes the tensor-times-matrix operation, and the mode-$n$ product output $\mathcal{Y}$ is a tensor of order $N$. The elementwise representation of Eq. (7) can be written as:

y_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n = 1}^{I_n} x_{i_1 i_2 \cdots i_N}\, u_{j i_n}.    (8)

The mode-$n$ product can also be calculated by matrix multiplication:

Y_{(n)} = U X_{(n)},    (9)

where $X_{(n)} = \text{unfold}_n(\mathcal{X})$ and $Y_{(n)} = \text{unfold}_n(\mathcal{Y})$, $n = 1, \dots, N$, are called mode-$n$ matrices. Note that the $\text{unfold}_n(\cdot)$ operator is the process of reordering the elements of an $N$-way data array into a matrix. Conversely, the unfolding matrices along mode $n$ can be transformed back to the tensor by the $\text{fold}_n(\cdot)$ operation.
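
The unfold/fold pair and the mode-$n$ product of Eq. (9) can be sketched in a few lines of NumPy; the column ordering of this particular unfolding is self-consistent but may differ from the convention in [27].

    import numpy as np

    def unfold(X, n):
        # Mode-n unfolding: bring axis n to the front, flatten the rest.
        return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

    def fold(M, n, shape):
        # Inverse of unfold for a target tensor of the given shape.
        rest = [s for i, s in enumerate(shape) if i != n]
        return np.moveaxis(M.reshape([shape[n]] + rest), 0, n)

    def mode_n_product(X, U, n):
        # Eq. (9): Y_(n) = U @ X_(n), then fold back into a tensor.
        shape = list(X.shape)
        shape[n] = U.shape[0]
        return fold(U @ unfold(X, n), n, shape)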

For convenience, we define $U^{(-n)}$ as [27]:

U^{(-n)} = U^{(N)} \otimes \cdots \otimes U^{(n+1)} \otimes U^{(n-1)} \otimes \cdots \otimes U^{(1)},    (10)

where $\otimes$ denotes the Kronecker product. The SVD of a matrix $A$ is defined as:

A = U \Sigma V^{T},    (11)

and the leading $r$-dimensional left singular subspace of $A$ is defined as the subspace spanned by the first $r$ columns of $U$.

IV-B Tucker Decomposition Layer

As a powerful low-rank approximation approach, tensor decomposition, e.g., Tucker decomposition [28] and CP decomposition [29], has been successfully used in various tasks, such as multispectral image restoration [30, 31], 3D image reconstruction [32], and higher-order web link analysis [33]. Inspired by the success of tensor decomposition methods, we introduce a novel TDL into the network architecture to achieve the goal that a single LIC network can perform image compression with multiple bpp rates. The flowchart of the proposed TDL is illustrated in Figure 3. It consists of 4 major components: 1) Tucker decomposition; 2) quantization; 3) de-quantization; and 4) Tucker reconstruction. The details of each component are described as follows.

Tucker decomposition: Tucker decomposition aims to decompose an $N$-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ into a combination of orthogonal bases $\{U^{(n)} \in \mathbb{R}^{I_n \times r_n}\}_{n=1}^{N}$ and the associated core tensor $\mathcal{C} \in \mathbb{R}^{r_1 \times \cdots \times r_N}$, where $r_n \le I_n$. It can be formulated as:

\mathcal{X} \approx \mathcal{C} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}, \quad \mathcal{C} = \mathcal{X} \times_1 U^{(1)T} \times_2 U^{(2)T} \cdots \times_N U^{(N)T}.    (12)

To find the optimal orthogonal matrices $\{U^{(n)}\}$ and the core tensor $\mathcal{C}$, we could minimize the error between the original data tensor $\mathcal{X}$ and its approximation, leading to the following optimization problem:

\min_{\mathcal{C}, \{U^{(n)}\}} \big\| \mathcal{X} - \mathcal{C} \times_1 U^{(1)} \cdots \times_N U^{(N)} \big\|_F^2, \quad \text{s.t. } U^{(n)T} U^{(n)} = I.    (13)

Since $\|\mathcal{X}\|_F^2$ is a constant, according to [34, 27, 35, 36], Eq. (13) can be recast as an optimization problem to maximize $\|\mathcal{C}\|_F^2$. We have:

\max_{\{U^{(n)}\}} \big\| \mathcal{X} \times_1 U^{(1)T} \times_2 U^{(2)T} \cdots \times_N U^{(N)T} \big\|_F^2, \quad \text{s.t. } U^{(n)T} U^{(n)} = I.    (14)

To solve Eq. (14), we first employ the higher order singular value decomposition (HOSVD) [34] to initialize a set of basis factor matrices $\{U^{(n)}_0\}$, then utilize the higher order orthogonal iteration (HOOI) [37] to iteratively update the orthogonal matrices $\{U^{(n)}_t\}$ until convergence, where $t$ is the index of the loop. With the obtained set of optimal orthogonal matrices $\{U^{(n)}\}$, we can easily obtain the corresponding core tensor $\mathcal{C}$ by Eq. (12).

Specifically, for our TDNet the order of the feature tensor is $N = 3$. Given a 3-order tensor $\mathcal{X}$ and the desired output rank $(r_1, r_2, r_3)$, we first compute the leading $r_n$-dimensional left singular subspace of $X_{(n)}$ to initialize the basis factor matrices $U^{(n)}_0$, where $n = 1, 2, 3$. Then, we rewrite Eq. (14) as follows to solve the $n$-th component matrix $U^{(n)}$:

U^{(n)}_{t} = \arg\max_{U:\, U^{T} U = I} \big\| \mathcal{X} \times_1 U^{(1)T}_{t} \cdots \times_{n-1} U^{(n-1)T}_{t} \times_n U^{T} \times_{n+1} U^{(n+1)T}_{t-1} \cdots \times_3 U^{(3)T}_{t-1} \big\|_F^2.    (15)

The optimal solution of Eq. (15) can be set as the leading $r_n$-dimensional left singular vectors of the mode-$n$ unfolding of the partially projected tensor. By iteratively updating $\{U^{(n)}_t\}$, we can obtain a set of final orthogonal matrices $\{U^{(n)}\}_{n=1}^{3}$. With these final basis factors, the corresponding core tensor $\mathcal{C}$ can be easily solved by Eq. (12).
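
A compact NumPy sketch of this HOSVD-initialized HOOI procedure (Eqs. (12)-(15)) is given below; it builds on the unfold and mode_n_product helpers from the sketch in Section IV-A, and a fixed iteration count stands in for the convergence test.

    import numpy as np

    def hooi(X, ranks, n_iter=10):
        # HOSVD initialization: leading left singular vectors per mode.
        U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
             for n, r in enumerate(ranks)]
        for _ in range(n_iter):                  # HOOI updates, Eq. (15)
            for n in range(len(ranks)):
                Y = X
                for m in range(len(ranks)):
                    if m != n:                   # project all modes but n
                        Y = mode_n_product(Y, U[m].T, m)
                U[n] = np.linalg.svd(unfold(Y, n),
                                     full_matrices=False)[0][:, :ranks[n]]
        C = X                                    # core tensor, Eq. (12)
        for n in range(len(ranks)):
            C = mode_n_product(C, U[n].T, n)
        return C, U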

Quantization and de-quantization: Since the core tensor has both positive and negative values, we take one bit to represent the sign of each original value. Let $\mathcal{C}^{+}$ denote the absolute value of the core tensor. With a set of training images, we can easily compute $p(c)$, the probability density function (PDF) of the positive core tensor values. The optimal quantizer can be solved as follows by minimizing the quantization error:

\min_{\{b_j\}, \{q_j\}} \sum_{j=1}^{K} \int_{b_{j-1}}^{b_j} (c - q_j)^2\, p(c)\, \mathrm{d}c.    (16)

Given a number $K$ of decision intervals, the optimal quantizer is expected to find the set of decision boundaries $\{b_j\}_{j=0}^{K}$ and quantized values $\{q_j\}_{j=1}^{K}$. Setting the partial derivatives of Eq. (16) to zero, we have:

b_j = \frac{q_j + q_{j+1}}{2}, \qquad q_j = \frac{\int_{b_{j-1}}^{b_j} c\, p(c)\, \mathrm{d}c}{\int_{b_{j-1}}^{b_j} p(c)\, \mathrm{d}c}.    (17)

The optimal solution of Eq. (17) can be easily obtained by Lloyd's algorithm [38], outputting the optimal quantizer with decision boundaries $\{b_j\}$ and quantized values $\{q_j\}$.
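
For reference, here is a minimal empirical version of Lloyd's algorithm, alternating the two conditions of Eq. (17) over samples of the absolute core tensor values; the quantile initialization and fixed iteration count are our own choices.

    import numpy as np

    def lloyd_max(samples, k, n_iter=50):
        # Initialize the k quantized values at evenly spaced quantiles.
        q = np.quantile(samples, (np.arange(k) + 0.5) / k)
        for _ in range(n_iter):
            b = (q[:-1] + q[1:]) / 2          # boundaries: midpoints, Eq. (17)
            idx = np.digitize(samples, b)     # assign samples to intervals
            q = np.array([samples[idx == j].mean() if np.any(idx == j) else q[j]
                          for j in range(k)]) # values: interval centroids
        return (q[:-1] + q[1:]) / 2, q        # final boundaries and values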

Considering that the range of core tensor values is spatially variant for an input image, we adopt a variable-bits quantization scheme to allocate different numbers of quantization bits to the core tensor, which is useful in preserving the major edges and textures. More specifically, we first scan the positive core tensor in raster order, then utilize the decision boundaries $\{b_j\}$ to divide the core tensor into non-overlapping chunks. For each chunk $c_j$ ($j = 1, \dots, K$), instead of using $q_j$ as the quantized value, we define a new quantizer for symbol $s_i$ (the $i$-th element of $c_j$):

Q(s_i) = \mathrm{round}\Big( \frac{s_i - \min(c_j)}{\max(c_j) - \min(c_j)} \cdot (2^{t_j} - 1) \Big),    (18)

where $\max(c_j)$ and $\min(c_j)$ are the maximum and minimum values of chunk $c_j$, respectively, and $t_j$ is the number of quantization bits in chunk $c_j$. In this way, each chunk takes $t_j$ bits per symbol for the quantization.

Conversely, the de-quantization process can be readily formulated as:

\hat{s}_i = \min(c_j) + \frac{Q(s_i)}{2^{t_j} - 1} \cdot \big( \max(c_j) - \min(c_j) \big).    (19)
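
A small sketch of this per-chunk quantize/de-quantize pair (Eqs. (18)-(19)) follows; the exact rounding and range handling in the paper may differ, so treat this as one consistent realization.

    import numpy as np

    def quantize_chunk(v, n_bits):
        # Eq. (18): uniform quantization over the chunk's own [min, max] range.
        lo, hi = float(v.min()), float(v.max())
        levels = 2 ** n_bits - 1
        s = np.round((v - lo) / max(hi - lo, 1e-12) * levels).astype(np.int64)
        return s, lo, hi

    def dequantize_chunk(s, lo, hi, n_bits):
        # Eq. (19): map integer symbols back into the chunk's value range.
        levels = 2 ** n_bits - 1
        return lo + s / levels * (hi - lo)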

Tucker reconstruction: With the de-quantized core tensor $\hat{\mathcal{C}}$ and the orthogonal matrices $\{U^{(n)}\}_{n=1}^{3}$, we can easily obtain an approximation $\hat{\mathcal{X}}$ of the original feature data by:

\hat{\mathcal{X}} = \hat{\mathcal{C}} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}.    (20)

The overall TDL is summarized in Algorithm 1. With the proposed TDL, we can easily adjust the compression ratio while keeping the size of the latent image representation unchanged. Finally, we are able to train a single network to achieve LIC with multiple bpp rates.

Algorithm 1: Tucker Decomposition Layer (TDL)
Input: a 3-order tensor \mathcal{X},
         the desired output rank (r_1, r_2, r_3),
         a number of decision intervals K.
Output: an approximation \hat{\mathcal{X}} of the original data,
           a 3-order quantized core tensor \hat{\mathcal{C}},
           the orthogonal matrices \{U^{(n)}\}_{n=1}^{3}.
1. U^{(n)}_0 <- leading r_n left singular vectors of X_{(n)}, for n = 1, 2, 3;
2. For t = 1, 2, ... (until converged), do:
      For n = 1, 2, 3, do:
         update U^{(n)}_t by Eq. (15);
      End for;
   End for;
3. Let U^{(n)} = U^{(n)}_{t*}, where t* is the index of the final result of step 2, and compute the core tensor \mathcal{C} by Eq. (12);
4. Compute the decision boundaries \{b_j\} by Eq. (17);
5. Divide \mathcal{C} into non-overlapping chunks;
6. Quantize the chunks into \hat{\mathcal{C}} by Eq. (18);
7. De-quantize \hat{\mathcal{C}} by Eq. (19);
8. Reconstruct \hat{\mathcal{X}} by Eq. (20);
return \hat{\mathcal{X}}, \hat{\mathcal{C}}, \{U^{(n)}\}_{n=1}^{3}.
Algorithm 2: All-in-One Training of TDNet
Input: a set of training data \{x_i\}_{i=1}^{N},
          G groups of desired ranks \{(r_1^g, r_2^g, r_3^g)\}_{g=1}^{G},
          G groups of decision intervals \{K^g\}_{g=1}^{G}.
Output: the optimal network parameters \Theta.
1. Train the encoder-decoder (without the TDL) with Eq. (4) / (3);
2. For t = 1, 2, ... (until converged), do:
3.    compute the latent representations f_i = E(x_i);
4.    decompose f_i by Algorithm 1;
5.    For g = 1, ..., G, do:
6.       compute the g-th group of decision boundaries \{b_j^g\} by Eq. (17);
7.    End for;
8.    Fix the TDL;
9.    For epoch = 1, 2, ... (until converged), do:
10.      g = mod(epoch, G);
11.      use the g-th group of ranks and decision boundaries and Eq. (4) / (3) to update the encoder-TDL-decoder;
12.   End for;
13.   Let \Theta_t = \Theta_{e*}, where e* is the index of the final result of step 9;
14. End for;
15. Let \Theta = \Theta_{t*}, where t* is the index of the final result of step 2;
16. Compute \hat{f}_i by Algorithm 1;
17. Update the decoder network with Eq. (4) / (3) and \hat{f}_i;
return \Theta.

V All-in-One Training

Instead of training a specific network for a certain compression ratio as in previous deep LIC methods [10, 11, 12, 13, 14, 15, 16, 17, 18], the proposed TDNet allows an all-in-one training strategy that enables a single network to compress an image at multiple compression ratios. More specifically, we first train an encoder-decoder network without the TDL to learn a set of initial parameters. We can then calculate a set of latent image representations $\{f_i\}$ of the input images by $f_i = E(x_i)$. Given $G$ groups of desired output ranks and quantization levels and the obtained latent image representations $\{f_i\}$, we can use the proposed TDL to calculate $G$ groups of decision boundaries $\{b_j^g\}_{g=1}^{G}$. The TDL can then be initialized after obtaining these decision boundaries.

With the initialized TDL, we can use Eq. (4) or Eq. (3) to jointly fine-tune the encoder-TDL-decoder network by minimizing the loss function. The latent image representations at multiple bpp rates will be taken into consideration during the training process. In each training epoch, we first decide which group of ranks and decision boundaries will be used by calculating $g = \mathrm{mod}(\mathrm{epoch}, G)$, then take this group of desired output ranks and decision boundaries to update the TDL and fine-tune the parameters of the whole network. To obtain the $G$ groups of optimal decision boundaries and the network parameters $\Theta$, an iterative training scheme can be used, i.e., fix the encoder-decoder network to update the TDL decision boundaries by solving Eq. (17), and fix the TDL to update the network parameters $\Theta$. Such an alternating optimization process continues until the loss function in Eq. (4) or Eq. (3) converges.
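
One epoch of this joint fine-tuning stage could look like the sketch below, which reuses the TDNet and tdnet_mse_loss sketches from Section III; the cyclic group schedule mirrors Algorithm 2 under our assumptions.

    def all_in_one_epoch(model, loader, groups, opt, epoch):
        # Cycle through the G (rank, levels) groups so that the single
        # network is fine-tuned at every target bpp rate (Algorithm 2).
        rank, levels = groups[epoch % len(groups)]
        for x in loader:
            x_tilde, x_hat = model(x, rank, levels)
            loss = tdnet_mse_loss(x, x_tilde, x_hat)
            opt.zero_grad()
            loss.backward()
            opt.step()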

After the TDNet converges, we can use the optimal decision boundaries and network parameters to compress and reconstruct images at different bpp rates. The overall all-in-one training scheme is summarized in Algorithm 2.

(a) Original (b) Single network (BPP: 0.352 PSNR: 30.36) (c) Individual network (BPP: 0.348 PSNR: 30.45)
Fig. 4: Visual comparison on image “” by the proposed single network and individual network.

VI Experimental Results

In this section, we first present the experimental settings, including training and testing datasets, as well as parameter settings. We then discuss the performance of TDNet using a single network and multiple networks. Finally, we compare TDNet with state-of-the-art LIC methods.

VI-A Experimental Settings

Datasets: It is generally agreed that a large-scale training dataset which covers various image contents and structures is beneficial for training a stable deep LIC network. Therefore, we mix the MS COCO test2017 dataset [39], the DIV2K dataset [40] and the Waterloo Exploration dataset [41] together as the training dataset. The MS COCO test2017 dataset contains images which cover a great diversity of objects and scenes. The DIV2K dataset has high-resolution images with complex structures and texture patterns. The Waterloo Exploration dataset contains elaborately selected high quality natural images. We first crop these images into patches (we experimentally found that, using the same network architecture, the larger the training patches are, the better the results would be; the patch size was chosen to trade off between GPU memory and compression results), and then randomly flip them. With the resulting set of image patches, we could train a robust model for LIC with multiple bpp rates.

For testing, we use two different datasets for comprehensive evaluation: the Kodak PhotoCD dataset (http://r0k.us/graphics/kodak/), which contains 24 natural images, and the McMaster dataset [42], which contains 18 high quality images. Note that all these images are widely used for the evaluation of image processing methods, and none of them is included in the training dataset.

Parameter Settings: In the training phase, we set $G$ groups of desired ranks and decision intervals to train the TDNet. Training is performed in mini-batches. We initialize the network weights by the method in [24] and adopt the Adam solver [43] to optimize the network parameters $\Theta$. The learning rate starts from $10^{-4}$ and is then fixed to $10^{-5}$ when the training error stops decreasing. The training is terminated when the training error does not decrease for several sequential epochs. For the other hyper-parameters of Adam, we use the default settings. We employ context-based adaptive binary arithmetic coding (CABAC) [45] for lossless entropy coding.

We experimentally found that the alternating optimization process of our TDNet compressor (see Algorithm 2) converges within a few iterations. The parameter $\lambda$ in our loss function Eq. (4) or Eq. (3) is set by experience. The network is trained in CAFFE [44] with an Nvidia Titan Xp GPU. On our PC with an Intel(R) Core(TM) i9-7900X CPU @ 3.3GHz and 96GB RAM, the training process takes about 3 days.

Though we use $G$ groups of ranks and decision intervals to train the TDNet, to validate the generality of the trained network, in the testing phase we test TDNet with additional groups of ranks and decision intervals. As we will see in the following sections, our TDNet achieves highly competitive performance.

Fig. 5: Rate-distortion curves by the single network and multiple networks on the Kodak dataset.
Fig. 6: Comparison of the rate-distortion curves on the Kodak and McMaster datasets.

VI-B Single Network vs. Multiple Networks

The proposed TDNet allows an "all-in-one" training strategy to learn a single network that performs LIC at multiple bpp rates. To validate the effectiveness of this "single network, multiple bpp" scheme, in this section we also train six individual TDNet compressors, one for each of the six groups of ranks and decision intervals (i.e., six bpp rates), and compare the performance of the single TDNet and the multiple TDNets at different bpp rates.

Figure 4 compares the visual quality of images compressed by the two training strategies at around 0.35 bpp. We also show the zoom-in images of a smooth background area, a texture area and a large edge area. It can be seen that both training strategies preserve image edges and details well, and their results show very small visual differences.

Figure 5 shows the PSNR-based rate-distortion curves of the single network and multiple networks on the Kodak dataset. Note that each of the six points on the curve of multiple networks is obtained by a TDNet trained at a specific bpp. As one can see, the rate-distortion curve of the trained single TDNet is very close to the curve obtained by multiple networks; on average, its PSNR is only slightly lower than that of multiple networks.

(a) Original (b) JPEG    BPP: 0.308 (c) JPEG 2000    BPP: 0.279
(d) BPG    BPP: 0.264 (e) Ballé    BPP: 0.288 (f) Li    BPP: 0.301
(g) Theis    BPP: 0.375 (h) Proposed (MS-SSIM)    BPP: 0.301 (i) Proposed (MSE)    BPP: 0.296
Fig. 7: Visual comparison on image “” by different methods at a compression rate around 0.3 bpp (image from the Kodak dataset).
(a) Original (b) JPEG    BPP: 0.339 (c) JPEG 2000     BPP: 0.324
(d) BPG    BPP: 0.294 (e) Proposed (MS-SSIM)      BPP: 0.296 (f) Proposed (MSE)     BPP: 0.291
Fig. 8: Visual comparison on image “” by different methods at a compression rate around 0.3 bpp (image from the McMaster dataset).

VI-C Results

We compare our proposed TDNet with both traditional LIC codecs and CNN-based LIC methods.

Traditional LIC Codecs: The compared traditional LIC codecs include JPEG (implemented by libjpeg, http://libjpeg.sourceforge.net/), JPEG 2000 (implemented in Matlab) and the state-of-the-art compression format Better Portable Graphics (BPG, https://bellard.org/bpg/). Following [15], we use BPG with the same chroma format setting.

Deep LIC Methods: The compared CNN-based LIC methods include Ballé et al. [12] (http://www.cns.nyu.edu/lcv/iclr2017/), Theis et al. [13] (http://theis.io/compressive_autoencoder/), Li et al. [14] (http://www2.comp.polyu.edu.hk/~15903062r/index.html), Johnston et al. [17], Rippel & Bourdev [15] and Mentzer et al. [16]. Note that since the source codes of the above deep compressors [12, 13, 14, 17, 15, 16] are not available, we either digitize their rate-distortion curves on the Kodak dataset from the original papers or copy the results from their websites.

Quantitative Evaluation: Most of the existing CNN-based LIC models [15, 16, 17] are optimized with the MS-SSIM loss [19], while traditional LIC methods (i.e., JPEG, JPEG2000 and BPG) and some of the deep LIC methods [12, 13, 14] are optimized in terms of PSNR. Therefore, we conduct experiments to quantitatively evaluate the competing methods in terms of both PSNR and MS-SSIM indices for a more comprehensive comparison.

The PSNR and MS-SSIM based rate-distortion curves on the Kodak and McMaster datasets are summarized in Figure 6. As in previous works [10, 11, 12, 13, 14, 15, 16, 17, 18], the curves are interpolated from a set of [bpp, PSNR] and [bpp, MS-SSIM] points for each method. Note that for some methods, only the [bpp, PSNR] or [bpp, MS-SSIM] points are available on the Kodak dataset, and no existing deep LIC method reports results on the McMaster dataset. Therefore, not all methods have all four curves in Figure 6.

Figures 6(a) and 6(c) show the PSNR-based rate-distortion curves on the Kodak and McMaster datasets, respectively. One can see that on the Kodak dataset, the proposed TDNet (trained with the MSE loss) achieves better results than JPEG2000 and the recently developed deep LIC methods, including Ballé et al. [12], Theis et al. [13] and Li et al. [14], and significantly outperforms the prevalent compressor JPEG. Although the proposed TDNet does not show an advantage over BPG in terms of PSNR on the Kodak dataset, it achieves a much better PSNR index than BPG on the McMaster dataset (see Figure 6(c)). Meanwhile, it is no surprise that TDNet trained with MSE has much higher PSNR indices than TDNet trained with MS-SSIM.

Figures 6(b) and 6(d) show the MS-SSIM based rate-distortion curves on the Kodak and McMaster datasets, respectively. One can see that our TDNet largely outperforms the traditional codecs BPG, JPEG2000 and JPEG. It also significantly outperforms the methods of Johnston et al. [17], Ballé et al. [12] and Theis et al. [13], and achieves comparable performance to the state-of-the-art deep LIC methods of Rippel & Bourdev [15] and Mentzer et al. [16].

Again, we would like to stress that all the competing CNN-based LIC methods here train a specific network for a certain bpp, while our proposed TDNet trains a single network to deal with multiple bpp rates.

Visual Quality Evaluation: We further compare the visual quality of images compressed by JPEG, JPEG 2000, BPG, Ballé et al. [12], Li et al. [14], Theis et al. [13] and our proposed TDNet (trained with MSE and MS-SSIM). Note that since the source codes of all existing deep LIC compressors are not available, we can only download the results of Ballé et al. [12], Li et al. [14] and Theis et al. [13] from their websites. The compressed images of other deep LIC methods are not available and thus cannot be compared.

Figure 7 shows the images compressed by the competing methods at a compression rate around 0.3 bpp (note that Theis et al. only provide the image at 0.375 bpp). One can see that noticeable blocky and ringing artifacts are inevitable in the reconstructed images of the traditional JPEG and JPEG 2000 compression formats. While BPG, Theis et al. [13] and Ballé et al. [12] produce much better visual quality, they still noticeably blur the edges and over-smooth the textures (see the zoom-in areas). The method of Li et al. [14] preserves sharp edges and detailed textures better, but still generates some noticeable artifacts. Compared with these methods, the image compressed by our TDNet is visually more pleasing, with sharper edges and far fewer artifacts.

Figure 8 presents the visual comparison results on an image from the McMaster dataset at a compression rate around 0.3 bpp. Note that since results on this dataset are not available for the existing deep LIC methods, we only compare TDNet with JPEG, JPEG2000 and BPG. Again, one can see noticeable artifacts in the zoom-in areas for the traditional LIC methods. In contrast, the result produced by our proposed TDNet is visually much more pleasing.

VII Conclusion and Future Work

In this paper, we presented a simple yet effective Tucker Decomposition Network (TDNet) with a novel Tucker decomposition layer (TDL), which decomposes a latent image representation into a set of matrices and one small core tensor for lossy image compression (LIC). By changing the rank of the core tensor and its quantization levels, we can easily adjust the bits-per-pixel (bpp) rate of the latent image representation, and consequently achieve the goal of using a single CNN model to cover a range of bpp rates. An iterative non-uniform quantization scheme was presented to optimize the quantizer, and an all-in-one training strategy was employed to train the TDNet. Compared with traditional LIC schemes and previous deep LIC compressors, which use different networks to compress images at different bpp rates, our TDNet exhibits very competitive results on benchmark datasets while using a single network.

Acknowledgment

We gratefully acknowledge the support from NVIDIA Corporation for providing us the Titan X GPU used in this research.

References

  • [1] X. Liu, G. Cheung, C. W. Lin, D. Zhao, and W. Gao, “Prior-based quantization bin matching for cloud storage of jpeg images,” IEEE Transactions on Image Processing, 2018.
  • [2] X. Liu, G. Cheung, X. Wu, and D. Zhao, “Random walk graph laplacian-based smoothness prior for soft decoding of jpeg images,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 509–524, 2017.
  • [3] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Transactions on signal processing, vol. 41, no. 12, pp. 3445–3462, 1993.
  • [4] M. Rabbani and R. Joshi, “An overview of the jpeg 2000 still image compression standard,” Signal processing: Image communication, vol. 17, no. 1, pp. 3–48, 2002.
  • [5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [8] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
  • [9] J. Cai, S. Gu, and L. Zhang, “Learning a deep single image contrast enhancer from multi-exposure images,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 2049–2062, 2018.
  • [10] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” arXiv preprint arXiv:1511.06085, 2015.
  • [11] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2017, pp. 5435–5443.
  • [12] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” International Conference on Learning Representations, 2017.
  • [13] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” International Conference on Learning Representations, 2017.
  • [14] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2018.
  • [15] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in International Conference on Machine Learning, 2017, pp. 2922–2930.
  • [16] E. Agustsson, F. Mentzer, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2018.
  • [17] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” arXiv preprint arXiv:1703.10114, 2017.
  • [18] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
  • [19] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
  • [20] G. K. Wallace, “The jpeg still picture compression standard,” IEEE transactions on consumer electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
  • [21] A. Skodras, C. Christopoulos, and T. Ebrahimi, “The jpeg 2000 still image compression standard,” IEEE Signal processing magazine, vol. 18, no. 5, pp. 36–58, 2001.
  • [22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [23] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47–57, 2017.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [25] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
  • [26] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [27] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
  • [28] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
  • [29] F. L. Hitchcock, “The expression of a tensor or a polyadic as a sum of products,” Studies in Applied Mathematics, vol. 6, no. 1-4, pp. 164–189, 1927.
  • [30] Q. Xie, Q. Zhao, D. Meng, Z. Xu, S. Gu, W. Zuo, and L. Zhang, “Multispectral images denoising by intrinsic tensor sparsity regularization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1692–1700.
  • [31] Q. Xie, Q. Zhao, D. Meng, and Z. Xu, “Kronecker-basis-representation based tensor sparsity and its applications to tensor recovery,” IEEE transactions on pattern analysis and machine intelligence, 2017.
  • [32] A. C. Sauve, A. Hero, W. L. Rogers, S. Wilderman, and N. Clinthorne, “3d image reconstruction for a compton spect camera model,” IEEE Transactions on Nuclear Science, vol. 46, no. 6, pp. 2075–2084, 1999.
  • [33] T. G. Kolda, B. W. Bader, and J. P. Kenny, “Higher-order web link analysis using multilinear algebra,” in Fifth IEEE International Conference on Data Mining. IEEE, 2005, 8 pp.
  • [34] L. De Lathauwer, B. De Moor, and J. Vandewalle, “A multilinear singular value decomposition,” SIAM journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000.
  • [35] C. A. Andersson and R. Bro, “Improving the speed of multi-way algorithms: Part I. Tucker3,” Chemometrics and Intelligent Laboratory Systems, vol. 42, no. 1-2, pp. 93–103, 1998.
  • [36] T. G. Kolda, “Multilinear operators for higher-order decompositions.” Sandia National Laboratories, Tech. Rep., 2006.
  • [37] L. De Lathauwer, B. De Moor, and J. Vandewalle, “On the best rank-1 and rank-(R1, R2, …, RN) approximation of higher-order tensors,” SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1324–1342, 2000.
  • [38] S. Lloyd, “Least squares quantization in pcm,” IEEE transactions on information theory, vol. 28, no. 2, pp. 129–137, 1982.
  • [39] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [40] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, vol. 3, 2017, p. 2.
  • [41] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo exploration database: New challenges for image quality assessment models,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 1004–1016, 2017.
  • [42] L. Zhang, X. Wu, A. Buades, and X. Li, “Color demosaicking by local directional interpolation and nonlocal adaptive thresholding,” Journal of Electronic imaging, vol. 20, no. 2, p. 023016, 2011.
  • [43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [44] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia.   ACM, 2014, pp. 675–678.
  • [45] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the h. 264/avc video compression standard,” IEEE Transactions on circuits and systems for video technology, vol. 13, no. 7, pp. 620–636, 2003.