Generalized Octave Convolutions for Learned Multi-Frequency Image Compression

02/24/2020 ∙ Mohammad Akbari, et al. ∙ Simon Fraser University, Google, Tencent

Learned image compression has recently shown the potential to outperform all standard codecs. The state-of-the-art rate-distortion performance has been achieved by context-adaptive entropy approaches in which hyperprior and autoregressive models are jointly utilized to effectively capture the spatial dependencies in the latent representations. However, the latents contain a mixture of high and low frequency information, which has been inefficiently represented by feature maps of the same spatial resolution in previous works. In this paper, we propose the first learned multi-frequency image compression approach that uses the recently developed octave convolutions to factorize the latents into high and low frequencies. Since the low frequency information is represented at a lower resolution, its spatial redundancy is reduced, which improves the compression rate. Moreover, octave convolutions impose effective high and low frequency communication, which can improve the reconstruction quality. We also develop novel generalized octave convolution and octave transposed-convolution architectures with internal activation layers to preserve the spatial structure of the information. Our experiments show that the proposed scheme outperforms all standard codecs and learning-based methods in both PSNR and MS-SSIM metrics, and establishes the new state of the art for learned image compression.


1 Introduction

Deep learning-based image compression [19, 1, 3, 5, 13, 16, 18, 14, 21, 23, 24] has shown the potential to outperform standard codecs such as JPEG2000, the H.265/HEVC-based BPG image codec [7], and the new versatile video coding test model (VTM) [11]. Learned image compression was first used in [23] to compress thumbnail images using long short-term memory (LSTM)-based recurrent neural networks (RNNs), in which better SSIM results than JPEG and WebP were reported. This approach was generalized in [13], which utilized spatially adaptive bit allocation to further improve the performance.

In [5], a scheme based on generalized divisive normalization (GDN) and inverse GDN (IGDN) was proposed, which outperformed JPEG2000 in both PSNR and SSIM. A compressive autoencoder framework with residual connections as in ResNet was proposed in [21], where the quantization was replaced by a smooth approximation, and a scaling approach was used to obtain different rates. In [1], a soft-to-hard vector quantization approach was introduced, and a unified framework was developed for image compression. In order to take the spatial variation of image content into account, a content-weighted framework was introduced in [16], where an importance map was employed for locally adaptive bit rate allocation. A learned channel-wise quantization along with arithmetic coding was also used to reduce the quantization error. There have also been some efforts to take advantage of other computer vision tasks in image compression frameworks. For example, in [3], a deep semantic segmentation-based layered image compression (DSSLIC) scheme was proposed, taking advantage of a generative adversarial network (GAN) and BPG-based residual coding. It outperformed the BPG codec (in RGB444 format) in both PSNR and MS-SSIM [26].

Since most learned image compression methods need to train multiple networks for multiple bit rates, variable-rate approaches have also been proposed in which a single neural network model is trained to operate at multiple bit rates. This approach was first introduced by [23], which was then generalized for full-resolution images using deep learning-based entropy coding in [24]. A CNN-based multi-scale decomposition transform was optimized for all scales in [8], which was better than BPG in MS-SSIM. In [28], a learned progressive image compression model was proposed using bit-plane decomposition and also bidirectional assembling of gated units. Another variable-rate framework was introduced in [2], which employed GDN-based shortcut connections, stochastic rounding-based scalable quantization, and a variable-rate objective function. The method in [2] outperformed previous learned variable-rate methods.

Most previous works used fixed entropy models shared between the encoder and decoder. In [6], a conditional entropy model based on a Gaussian scale mixture (GSM) was proposed, where the scale parameters were conditioned on a hyperprior learned using a hyper autoencoder. The compressed hyperprior was transmitted and added to the bit stream as side information. This model was extended in [18, 14], where a Gaussian mixture model (GMM) with both mean and scale parameters conditioned on the hyperprior was utilized. In these methods, the hyperpriors were combined with autoregressive priors generated using context models, which outperformed BPG in terms of both PSNR and MS-SSIM. The coding efficiency in [14] was further improved in [15] by a joint optimization of image compression and quality enhancement networks. Another context-adaptive approach was introduced by [30], in which multi-scale masked convolutional networks were utilized for the autoregressive model combined with hyperpriors.

The state of the art in learned image compression has been achieved by context-adaptive entropy methods in which hyperprior and autoregressive models are combined. These approaches are jointly optimized to effectively capture the spatial dependencies and probabilistic structures of the latent representations, which leads to significant rate-distortion (R-D) performance. However, similar to natural images, the latents contain a mixture of information at multiple frequencies, which is usually represented by feature maps of the same spatial resolution. Some of these maps are spatially redundant because they consist of low frequency information, which can be represented and compressed more efficiently.

In this paper, a multi-frequency entropy model is introduced in which octave convolutions [9] are utilized to factorize the latent representations into high and low frequencies. The low frequency information is then represented at a lower spatial resolution, which reduces the corresponding spatial redundancy and improves the compression ratio, similar to wavelet transforms [4]. In addition, due to the effective communication between the high and low frequencies, the reconstruction performance is also improved. In order to preserve the spatial information and structure of the latents, we develop novel generalized octave convolution and octave transposed-convolution architectures with internal activation layers. Experimental results show that the proposed multi-frequency framework outperforms all standard codecs, including BPG and VTM, as well as learning-based methods in both PSNR and MS-SSIM metrics, and establishes a new state of the art in learning-based image compression.

The paper is organized as follows. In Section 2, vanilla and octave convolutions are briefly described and compared. The proposed generalized octave convolution and transposed-convolution with internal activation layers are formulated and discussed in Section 3. Following that, the architecture of the proposed multi-frequency image compression framework as well as the multi-frequency entropy model are introduced in Section 4. Finally, in Section 5, the experimental results along with an ablation study are discussed and compared with state-of-the-art learning-based image compression methods.

2 Vanilla vs. Octave Convolution

Let $X, Y \in \mathbb{R}^{c \times h \times w}$ be the input and output feature vectors with $c$ channels of size $h \times w$. Each output feature map in the vanilla convolution is obtained as follows:

$$Y_{p,q} = \sum_{(i,j) \in \mathcal{N}_k} W_{i+\frac{k-1}{2},\, j+\frac{k-1}{2}}^{\top}\, X_{p+i,\, q+j} \qquad (1)$$

where $W \in \mathbb{R}^{c \times k \times k}$ is a $k \times k$ convolution kernel, $(p, q)$ is the location coordinate, and $\mathcal{N}_k = \{(i, j) : i, j \in \{-\frac{k-1}{2}, \dots, \frac{k-1}{2}\}\}$ is a local neighborhood.

In the vanilla convolution, all input and output feature maps are of the same spatial resolution, which represents a mixture of information at high and low frequencies. Due to the redundancy in low frequency information, the vanilla convolution is not efficient in terms of both memory and computation cost.

Figure 1: Architecture of the proposed generalized octave convolution (GoConv, left) and octave transposed-convolution (GoTConv, right). Act: the activation layer; $f$: regular vanilla convolution; $g$: regular transposed-convolution; $f^{\downarrow 2}$: regular convolution with stride 2; $f^{\uparrow 2}$: regular transposed-convolution with stride 2.

To address this problem, in the recently developed octave convolution [9], the feature maps are factorized into high and low frequencies each processed with different convolutions. As a result, the resolution of low-frequency feature maps can be spatially reduced, which saves both memory and computation.

The factorization of the input vector in octave convolutions is denoted by $X = \{X^H, X^L\}$, where $X^H$ and $X^L$ are respectively the high and low frequency maps. The ratio of channels allocated to the low frequency feature representations (i.e., at half of the spatial resolution) is defined by $\alpha \in [0, 1]$. The factorized output vector is denoted by $Y = \{Y^H, Y^L\}$, where $Y^H$ and $Y^L$ are the output high and low frequency maps given by $Y^H = Y^{H\to H} + Y^{L\to H}$ and $Y^L = Y^{L\to L} + Y^{H\to L}$, where $Y^{H\to H}$ and $Y^{L\to L}$ are intra-frequency updates, and $Y^{L\to H}$ and $Y^{H\to L}$ denote inter-frequency communication. The intra-frequency component is used to update the information within each of the high and low frequency paths, while the inter-frequency communication further enables information exchange between them.

The octave convolution kernel is given by $W = \{W^H, W^L\}$, with which the inputs $X^H$ and $X^L$ are respectively convolved. $W^H$ and $W^L$ are further divided into intra- and inter-frequency components as follows: $W^H = \{W^{H\to H}, W^{L\to H}\}$ and $W^L = \{W^{L\to L}, W^{H\to L}\}$. For the intra-frequency update, the regular vanilla convolution is used. However, up- and down-sampling interpolations are applied to compute the inter-frequency communication, formulated as:

$$Y^{L\to H} = \mathrm{upsample}\!\left(f(X^L; W^{L\to H}),\, 2\right), \qquad Y^{H\to L} = f\!\left(\mathrm{pool}(X^H,\, 2);\, W^{H\to L}\right) \qquad (2)$$

where $f(X; W)$ denotes a vanilla convolution with parameters $W$, $\mathrm{pool}(X, 2)$ is average pooling with stride 2, and $\mathrm{upsample}(X, 2)$ is nearest-neighbor up-sampling by a factor of 2.

As reported in [9], due to the effective inter-frequency communication, the octave convolution can achieve better classification and recognition performance than the vanilla convolution.
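For concreteness, a minimal PyTorch-style sketch of this original octave convolution is given below; it follows Equation 2, with average pooling for down-sampling and nearest-neighbor interpolation for up-sampling. The module name and default hyper-parameters are illustrative, not the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class OctaveConv(nn.Module):
    """Sketch of the original octave convolution [9] with low-frequency ratio alpha."""
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5):
        super().__init__()
        in_l, out_l = int(alpha * in_ch), int(alpha * out_ch)
        in_h, out_h = in_ch - in_l, out_ch - out_l
        pad = kernel_size // 2
        # intra-frequency updates (H->H and L->L)
        self.conv_hh = nn.Conv2d(in_h, out_h, kernel_size, padding=pad)
        self.conv_ll = nn.Conv2d(in_l, out_l, kernel_size, padding=pad)
        # inter-frequency communication (H->L and L->H)
        self.conv_hl = nn.Conv2d(in_h, out_l, kernel_size, padding=pad)
        self.conv_lh = nn.Conv2d(in_l, out_h, kernel_size, padding=pad)

    def forward(self, x_h, x_l):
        # Eq. (2): fixed average pooling / nearest up-sampling for the inter-frequency paths
        y_hh = self.conv_hh(x_h)
        y_ll = self.conv_ll(x_l)
        y_hl = self.conv_hl(F.avg_pool2d(x_h, 2))
        y_lh = F.interpolate(self.conv_lh(x_l), scale_factor=2, mode="nearest")
        return y_hh + y_lh, y_ll + y_hl
```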

3 Generalized Octave Convolution

In the original octave convolutions, the average pooling and nearest interpolation are respectively employed for down- and up-sampling operations in inter-frequency communication [9]. Such conventional interpolations do not preserve spatial information and structure of the input feature map. In addition, in convolutional autoencoders where sub-sampling needs to be reversed at the decoder side, fixed operations such as pooling result in a poor performance [20].

In this work, we propose a novel generalized octave convolution (GoConv) in which strided convolutions are used to sub-sample the feature vectors and compute the inter-frequency communication in a more effective way. Unlike conventional sub-sampling methods, strided convolutions can learn to preserve properties such as spatial information, which stabilizes and improves the performance [20], especially in autoencoder architectures. Moreover, as in ResNet, applying a strided convolution (i.e., convolution and down-sampling at the same time) reduces the computational cost compared to a convolution followed by a fixed down-sampling operation (e.g., average pooling). The architecture of the proposed GoConv is shown in Figure 1.

The output high and low frequency feature maps $Y = \{Y^H, Y^L\}$ in GoConv are formulated as follows:

$$Y^H = Y^{H\to H} + Y^{L\to H}, \qquad Y^L = Y^{L\to L} + Y^{H\to L},$$
$$Y^{H\to H} = f(X^H; W^{H\to H}), \quad Y^{L\to H} = f^{\uparrow 2}(X^L; W^{L\to H}), \quad Y^{L\to L} = f(X^L; W^{L\to L}), \quad Y^{H\to L} = f^{\downarrow 2}(X^H; W^{H\to L}) \qquad (3)$$

where $f^{\downarrow 2}$ and $f^{\uparrow 2}$ are respectively vanilla convolution and transposed-convolution operations with stride of 2.
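Below is a minimal sketch of how a GoConv layer could be implemented, assuming Equation 3 with each internal convolution followed by its own activation as described at the end of this section; class and argument names are ours, and the generic activation stands in for the GDN used in the compression networks.

```python
import torch.nn as nn

class GoConv(nn.Module):
    """Sketch of the generalized octave convolution (GoConv), Eq. (3)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5, act=nn.ReLU):
        super().__init__()
        in_l, out_l = int(alpha * in_ch), int(alpha * out_ch)
        in_h, out_h = in_ch - in_l, out_ch - out_l
        pad = kernel_size // 2
        # intra-frequency updates, each with an internal activation
        self.f_hh = nn.Sequential(nn.Conv2d(in_h, out_h, kernel_size, padding=pad), act())
        self.f_ll = nn.Sequential(nn.Conv2d(in_l, out_l, kernel_size, padding=pad), act())
        # inter-frequency communication via stride-2 (transposed-)convolutions
        self.f_down = nn.Sequential(  # H -> L: stride-2 convolution
            nn.Conv2d(in_h, out_l, kernel_size, stride=2, padding=pad), act())
        self.f_up = nn.Sequential(    # L -> H: stride-2 transposed-convolution
            nn.ConvTranspose2d(in_l, out_h, kernel_size, stride=2,
                               padding=pad, output_padding=1), act())

    def forward(self, x_h, x_l):
        y_h = self.f_hh(x_h) + self.f_up(x_l)    # Eq. (3), high-frequency output
        y_l = self.f_ll(x_l) + self.f_down(x_h)  # Eq. (3), low-frequency output
        return y_h, y_l
```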

Figure 2: Overall framework of the proposed image compression model. H-AE and H-AD: arithmetic encoder and decoder for the high frequency latents. L-AE and L-AD: arithmetic encoder and decoder for the low frequency latents. H-CM and L-CM: the high and low frequency context models, each composed of one 5×5 masked convolution layer with 2M filters and stride of 2. Q: additive uniform noise during training, or uniform quantizer during testing.

We also propose a generalized octave transposed-convolution denoted by GoTConv (Figure 1), which can replace the conventional transposed-convolution commonly employed in deep autoencoder (encoder-decoder) architectures. Let $\tilde{X} = \{\tilde{X}^H, \tilde{X}^L\}$ and $\tilde{Y} = \{\tilde{Y}^H, \tilde{Y}^L\}$ respectively be the factorized input and output feature vectors. The output high and low frequency maps ($\tilde{Y}^H$ and $\tilde{Y}^L$) in GoTConv are obtained as follows:

$$\tilde{Y}^H = \tilde{Y}^{H\to H} + \tilde{Y}^{L\to H}, \qquad \tilde{Y}^L = \tilde{Y}^{L\to L} + \tilde{Y}^{H\to L} \qquad (4)$$

where $\tilde{Y}^{H\to H} = g(\tilde{X}^H; W^{H\to H})$, $\tilde{Y}^{L\to L} = g(\tilde{X}^L; W^{L\to L})$, $\tilde{Y}^{L\to H} = f^{\uparrow 2}(\tilde{X}^L; W^{L\to H})$, and $\tilde{Y}^{H\to L} = f^{\downarrow 2}(\tilde{X}^H; W^{H\to L})$. Unlike GoConv, in which the regular convolution operation is used, a transposed-convolution denoted by $g$ is applied for the intra-frequency update in GoTConv. For the up- and down-sampling operations in inter-frequency communication, the same strided convolutions $f^{\uparrow 2}$ and $f^{\downarrow 2}$ as in GoConv are respectively utilized.
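A corresponding sketch of GoTConv under the same assumptions is shown below; transposed-convolutions perform the intra-frequency updates and the same stride-2 operators handle the inter-frequency communication (internal activations are omitted here for brevity; see the discussion at the end of this section).

```python
import torch.nn as nn

class GoTConv(nn.Module):
    """Sketch of the generalized octave transposed-convolution (GoTConv), Eq. (4)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5):
        super().__init__()
        in_l, out_l = int(alpha * in_ch), int(alpha * out_ch)
        in_h, out_h = in_ch - in_l, out_ch - out_l
        pad = kernel_size // 2
        # intra-frequency updates use regular transposed-convolutions
        self.g_hh = nn.ConvTranspose2d(in_h, out_h, kernel_size, padding=pad)
        self.g_ll = nn.ConvTranspose2d(in_l, out_l, kernel_size, padding=pad)
        # inter-frequency communication reuses the stride-2 operators of GoConv
        self.f_down = nn.Conv2d(in_h, out_l, kernel_size, stride=2, padding=pad)
        self.f_up = nn.ConvTranspose2d(in_l, out_h, kernel_size, stride=2,
                                       padding=pad, output_padding=1)

    def forward(self, x_h, x_l):
        y_h = self.g_hh(x_h) + self.f_up(x_l)    # high-frequency output
        y_l = self.g_ll(x_l) + self.f_down(x_h)  # low-frequency output
        return y_h, y_l
```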

Similar to the original octave convolution, the proposed GoConv and GoTConv are designed and formulated as generic, plug-and-play units. As a result, they can respectively replace the vanilla convolution and transposed-convolution units in any convolutional neural network (CNN) architecture, especially autoencoder-based frameworks such as image compression, image denoising, and semantic segmentation. When used in an autoencoder, the input image to the encoder is not represented as a multi-frequency tensor. In this case, to compute the output of the first GoConv layer in the encoder, Equation 3 is modified as follows:

$$Y^H = f(X; W^{H\to H}), \qquad Y^L = f^{\downarrow 2}(X; W^{H\to L}) \qquad (5)$$

Similarly, at the decoder side, the output of the last GoTConv is a single tensor representation, which can be formulated by modifying Equation 4 as:

$$\tilde{Y} = g(\tilde{X}^H; W^{H\to H}) + f^{\uparrow 2}(\tilde{X}^L; W^{L\to H}) \qquad (6)$$
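As a rough illustration of Equations 5 and 6, the first encoder layer can split a single-tensor input into the two frequencies, and the last decoder layer can merge them back into a single tensor; the following hypothetical sketch mirrors the modules above.

```python
import torch.nn as nn

class FirstGoConv(nn.Module):
    """Sketch of Eq. (5): a single-tensor input is split into high/low frequency outputs."""
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5):
        super().__init__()
        out_l = int(alpha * out_ch)
        out_h = out_ch - out_l
        pad = kernel_size // 2
        self.to_h = nn.Conv2d(in_ch, out_h, kernel_size, padding=pad)            # full resolution
        self.to_l = nn.Conv2d(in_ch, out_l, kernel_size, stride=2, padding=pad)  # half resolution

    def forward(self, x):
        return self.to_h(x), self.to_l(x)

class LastGoTConv(nn.Module):
    """Sketch of Eq. (6): high/low frequency inputs are merged into a single output tensor."""
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5):
        super().__init__()
        in_l = int(alpha * in_ch)
        in_h = in_ch - in_l
        pad = kernel_size // 2
        self.from_h = nn.ConvTranspose2d(in_h, out_ch, kernel_size, padding=pad)
        self.from_l = nn.ConvTranspose2d(in_l, out_ch, kernel_size, stride=2,
                                         padding=pad, output_padding=1)

    def forward(self, x_h, x_l):
        return self.from_h(x_h) + self.from_l(x_l)
```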

In the original octave convolutions, activation layers (e.g., ReLU) are applied to the output high and low frequency maps. In this work, we instead utilize activations for each internal convolution performed in our proposed GoConv and GoTConv, which ensures that an activation function is properly applied to each feature map computed by a convolution operation. Each of the inter- and intra-frequency components is thus followed by an activation layer in GoConv. This process is inverted in the proposed GoTConv, where the activation layer precedes the inter- and intra-frequency operations, as shown in Figure 1.

4 Multi-Frequency Entropy Model

The octave convolution is similar to the wavelet transform [4] in that the low frequency is represented at a lower resolution than the high frequency. Therefore, it can be used to improve the R-D performance in learning-based image compression frameworks. Octave convolutions store the features in a multi-frequency representation where the low frequency is kept at half the spatial resolution, which results in a higher compression ratio. Moreover, due to the effective high and low frequency communication as well as the receptive field enlargement in octave convolutions, they also improve the performance of the analysis (encoding) and synthesis (decoding) transforms in a compression framework.

The overall architecture of the proposed multi-frequency image compression framework is shown in Figure 2. Similar to [18], our architecture is composed of two sub-networks: the core autoencoder and the entropy sub-network. The core autoencoder is used to learn a quantized latent vector of the input image, while the entropy sub-network is responsible for learning a probabilistic model over the quantized latent representations, which is utilized for entropy coding.

We have made several improvements to the scheme in [18]. In order to handle multi-frequency entropy coding, all vanilla convolutions in the core encoder, hyper encoder, and parameter estimator are replaced by the proposed GoConv, and all vanilla transposed-convolutions in the core and hyper decoders are replaced by GoTConv. In [18], each convolution/transposed-convolution is accompanied by an activation layer (e.g., GDN/IGDN or Leaky ReLU). In our architecture, we move these layers into the GoConv and GoTConv architectures and directly apply them to the inter- and intra-frequency components as described in Section 3. GDN/IGDN transforms are respectively used for the GoConv and GoTConv employed in the proposed deep encoder and decoder, while Leaky ReLU is utilized for the hyper autoencoder and the parameter estimator. The convolution properties (i.e., size and number of filters and strides) of all networks, including the core and hyper autoencoders, context models, and parameter estimator, are the same as in [18].

Let $x$ be the input image. The multi-frequency latent representations are denoted by $y = \{y^H, y^L\}$, where $y^H$ and $y^L$ are generated using the parametric deep encoder (i.e., analysis transform) $g_a$, represented as:

$$\{y^H, y^L\} = g_a(x; \phi_g) \qquad (7)$$

where $\phi_g$ is the parameter vector to be optimized. $M$ denotes the total number of output channels in $y$, which is divided into $(1 - \alpha)M$ channels for the high frequency and $\alpha M$ channels for the low frequency (i.e., at half the spatial resolution of the high frequency part). The calculation in Equation 5 is used for the first GoConv layer, while the other encoder layers are formulated using Equation 3.

At the decoder side, the parametric decoder (i.e., synthesis transform) $g_s$ with the parameter vector $\theta_g$ reconstructs the image $\hat{x}$ as follows:

$$\hat{x} = g_s\!\left(Q(y^H), Q(y^L);\ \theta_g\right) \qquad (8)$$

where $Q$ represents the addition of uniform noise to the latent representations during training, or uniform quantization (i.e., the round function in this work) and arithmetic coding/decoding of the latents during testing. As illustrated in Figure 2, the quantized high and low frequency latents $\hat{y}^H$ and $\hat{y}^L$ are entropy-coded using two separate arithmetic encoders and decoders.
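A minimal sketch of the quantizer Q (additive uniform noise during training, rounding at test time; the arithmetic coding step is not shown):

```python
import torch

def quantize(y, training: bool):
    """Q in Eq. (8): additive uniform noise in [-0.5, 0.5) during training,
    hard rounding at test time (entropy coding/decoding is omitted here)."""
    if training:
        noise = torch.empty_like(y).uniform_(-0.5, 0.5)
        return y + noise
    return torch.round(y)
```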

The entropy sub-network in our architecture contains two models: a context model and a hyper autoencoder. The context model is an autoregressive model over the multi-frequency latent representations. Unlike the other networks in our architecture, where GoConv is incorporated for the convolutions, we use vanilla convolutions in the context model to ensure that the causality of the contexts is not spoiled by the intra-frequency communication in GoConv. The contexts of the high and low frequency latents, denoted by $\phi^H$ and $\phi^L$, are then predicted with two separate models $f_{cm}^H$ and $f_{cm}^L$ defined as follows:

$$\phi^H = f_{cm}^H(\hat{y}^H_{<i};\ \theta_{cm}^H), \qquad \phi^L = f_{cm}^L(\hat{y}^L_{<i};\ \theta_{cm}^L) \qquad (9)$$

where $\theta_{cm}^H$ and $\theta_{cm}^L$ are the parameters to be optimized. Both $f_{cm}^H$ and $f_{cm}^L$ are composed of one 5×5 masked convolution [25] with stride of 2.
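The causal context models can be realized with masked convolutions in the PixelCNN style [25]; a sketch of such a 5×5 masked convolution (with the stride used in Figure 2) is shown below. The mask zeroes out the current position and all future positions; the class name and details are ours, not the authors' code.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """5x5 masked convolution for the autoregressive context model [25]."""
    def __init__(self, in_ch, out_ch, kernel_size=5, stride=2, padding=2):
        super().__init__(in_ch, out_ch, kernel_size, stride=stride, padding=padding)
        mask = torch.ones_like(self.weight)
        k = kernel_size
        mask[:, :, k // 2, k // 2:] = 0  # current position and everything to its right
        mask[:, :, k // 2 + 1:, :] = 0   # all rows below the current one
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # enforce causality before every call
        return super().forward(x)
```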

The hyper autoencoder learns to represent side information useful for correcting the context-based predictions. The spatial dependencies of $\{y^H, y^L\}$ are then captured into the multi-frequency hyper latent representations $z = \{z^H, z^L\}$ using the parametric hyper encoder $h_a$ (with the parameter vector $\phi_h$) defined as:

$$\{z^H, z^L\} = h_a(y^H, y^L;\ \phi_h) \qquad (10)$$

Figure 3: Sample high and low frequency latent representations. Left column: original image; Middle columns: high frequency; Right column: low frequency.
Figure 4: Kodak comparison results of our approach with traditional codecs and learning-based image compression methods.

The quantized hyper latents are also part of the generated bitstream that is required to be entropy-coded and transmitted. Similar to the core latents, two separate arithmetic coders are used for the quantized high and low frequency hyper latents $\hat{z}^H$ and $\hat{z}^L$. Given the quantized hyper latents, the side information used for the entropy model estimation is reconstructed using the hyper decoder $h_s$ (with the parameter vector $\theta_h$) formulated as:

$$\{\psi^H, \psi^L\} = h_s(\hat{z}^H, \hat{z}^L;\ \theta_h) \qquad (11)$$

As shown in Figure 2, to estimate the mean and scale parameters required for a conditional Gaussian entropy model, the information from both the context model and the hyper decoder is combined by another network, denoted by $f_{ep}$ (with the parameter vector $\theta_{ep}$), represented as follows:

$$\{\mu^H, \sigma^H\} = f_{ep}(\psi^H, \phi^H;\ \theta_{ep}), \qquad \{\mu^L, \sigma^L\} = f_{ep}(\psi^L, \phi^L;\ \theta_{ep}) \qquad (12)$$

where $\mu^H$ and $\sigma^H$ are the parameters for the entropy modelling of the high frequency information, and $\mu^L$ and $\sigma^L$ are for the low frequency information.

The objective function for training is composed of two terms: the rate $R$, which is the expected length of the bitstream, and the distortion $D$, which is the expected error between the input and reconstructed images. The R-D balance is determined by a Lagrange multiplier denoted by $\lambda$. The R-D optimization problem is then defined as follows:

$$\min\ \mathbb{E}_{x \sim p_x}\!\left[\lambda D(x, \hat{x}) + R^H + R^L\right] \qquad (13)$$

where $p_x$ is the unknown distribution of natural images and $D$ can be any distortion metric such as mean squared error (MSE) or MS-SSIM. $R^H$ and $R^L$ are the rates corresponding to the high and low frequency information (bitstreams), defined as follows:

$$R^H = \mathbb{E}\!\left[-\log_2 p_{\hat{y}^H \mid \hat{z}^H}(\hat{y}^H \mid \hat{z}^H)\right] + \mathbb{E}\!\left[-\log_2 p_{\hat{z}^H}(\hat{z}^H)\right],$$
$$R^L = \mathbb{E}\!\left[-\log_2 p_{\hat{y}^L \mid \hat{z}^L}(\hat{y}^L \mid \hat{z}^L)\right] + \mathbb{E}\!\left[-\log_2 p_{\hat{z}^L}(\hat{z}^L)\right] \qquad (14)$$

where $p_{\hat{y}^H \mid \hat{z}^H}$ and $p_{\hat{y}^L \mid \hat{z}^L}$ are respectively the conditional Gaussian entropy models for the high and low frequency latent representations ($\hat{y}^H$ and $\hat{y}^L$), formulated as:

$$p_{\hat{y}^H \mid \hat{z}^H}(\hat{y}^H \mid \hat{z}^H) = \prod_i \left( \mathcal{N}(\mu_i^H, {\sigma_i^H}^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \right)\!(\hat{y}_i^H),$$
$$p_{\hat{y}^L \mid \hat{z}^L}(\hat{y}^L \mid \hat{z}^L) = \prod_i \left( \mathcal{N}(\mu_i^L, {\sigma_i^L}^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \right)\!(\hat{y}_i^L) \qquad (15)$$

where each latent is modelled as a Gaussian convolved with a unit uniform distribution, which ensures a good match between the encoder and decoder distributions of both the quantized and continuous-valued latents. The mean and scale parameters $\mu^H$, $\sigma^H$, $\mu^L$, and $\sigma^L$ are generated via the network defined in Equation 12.
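Evaluating the Gaussian-convolved-with-uniform model of Equation 15 amounts to computing the Gaussian probability mass of each unit quantization bin; a sketch under that assumption (function name and clamping constants are illustrative):

```python
import torch

def gaussian_conditional_likelihood(y_hat, mu, sigma, min_scale=1e-6):
    """Per-element likelihood of Eq. (15): N(mu, sigma^2) convolved with U(-1/2, 1/2),
    i.e. the Gaussian probability mass of the bin [y_hat - 0.5, y_hat + 0.5]."""
    sigma = sigma.clamp(min=min_scale)
    dist = torch.distributions.Normal(mu, sigma)
    upper = dist.cdf(y_hat + 0.5)
    lower = dist.cdf(y_hat - 0.5)
    return (upper - lower).clamp(min=1e-9)  # avoid log(0) in the rate term
```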

Since the compressed hyper latents $\hat{z}^H$ and $\hat{z}^L$ are also part of the generated bitstream, their transmission costs are also considered in the rate term formulated in Equation 14. As in [6, 18], a non-parametric, fully-factorized density model is used for the entropy model of the high and low frequency hyper latents as follows:

$$p_{\hat{z}^H \mid \theta_f^H}(\hat{z}^H \mid \theta_f^H) = \prod_i \left( p_{z_i^H \mid \theta_f^H} * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \right)\!(\hat{z}_i^H), \qquad
p_{\hat{z}^L \mid \theta_f^L}(\hat{z}^L \mid \theta_f^L) = \prod_i \left( p_{z_i^L \mid \theta_f^L} * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \right)\!(\hat{z}_i^L) \qquad (16)$$

where $\theta_f^H$ and $\theta_f^L$ denote the parameter vectors for the univariate distributions $p_{z_i^H \mid \theta_f^H}$ and $p_{z_i^L \mid \theta_f^L}$.
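Putting the pieces together, the R-D objective of Equation 13 can be estimated from the per-element likelihoods of Equations 14-16 as sketched below; variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rd_loss(x, x_hat, likelihoods, lam):
    """Rate-distortion objective of Eq. (13).
    likelihoods: iterable of per-element likelihood tensors for
    y_hat_H, y_hat_L, z_hat_H and z_hat_L (Eqs. 14-16)."""
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    # total rate in bits per pixel, summed over the four bitstreams
    bpp = sum(-torch.log2(p).sum() for p in likelihoods) / num_pixels
    distortion = F.mse_loss(x_hat, x)  # or an MS-SSIM based distortion
    return lam * distortion + bpp
```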

Figure 5: Kodak visual example (bits-per-pixel, PSNR, MS-SSIM in dB). Original; Ours (0.10bpp, 31.70dB, 14.57dB); BPG (0.10bpp, 30.29dB, 11.72dB); J2K (0.11bpp, 28.95dB, 10.98dB).

5 Experimental Results

The ADE20K dataset [29], restricted to images of at least 512 pixels in height or width (9272 images in total), was used for training the proposed model. All images were rescaled to a fixed size for training. We set the ratio $\alpha = 0.5$ so that 50% of the latent representations are assigned to the low frequency part with half spatial resolution. Sample high and low frequency latent representations are shown in Figure 3.

Considering four layers of strided convolutions (with stride of 2) and the output channel size in the core encoder (Figure 2), the high and low frequency latents $y^H$ and $y^L$ will respectively be of size 16×16×96 and 8×8×96 for training. As discussed in [6], the optimal number of filters increases with the R-D balance factor $\lambda$, which indicates that higher network capacity is required for the models with higher bit rates. As a result, in order to avoid $\lambda$-dependent performance saturation and to boost the network capacity, we use a larger number of filters for higher bit rates (BPPs > 0.5). All models in our framework were jointly trained for 200 epochs with mini-batch stochastic gradient descent and a batch size of 8. The Adam solver with a learning rate of 0.00005 was used for the first 100 epochs, and the learning rate was gradually decreased to zero over the next 100 epochs.
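For reference, the optimization schedule stated above (Adam at a fixed learning rate of 5e-5 for the first 100 epochs, then decayed to zero over the next 100) could be expressed as follows; a linear decay is assumed, and the model and training-step objects are placeholders.

```python
import torch

def lr_lambda(epoch):
    """Learning-rate multiplier: constant for epochs 0-99,
    then (assumed linearly) decayed to zero over epochs 100-199."""
    return 1.0 if epoch < 100 else max(0.0, (200 - epoch) / 100)

# optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# for epoch in range(200):
#     train_one_epoch(model, optimizer)  # hypothetical training step
#     scheduler.step()
```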

We compare the performance of the proposed scheme with standard codecs including JPEG, JPEG2000 [10], WebP [12], BPG (both YUV4:2:0 and YUV4:4:4 formats) [7], and VTM (version 7.1) [11], as well as the state-of-the-art learned image compression methods in [18, 16, 14, 15, 30]. We use both PSNR and MS-SSIM (in dB) as the evaluation metrics, where the MS-SSIM score in dB is defined as $-10 \log_{10}(1 - \text{MS-SSIM})$. The comparison results on the popular Kodak image set (averaged over the 24 test images) are shown in Figure 4. For the PSNR results, we optimized the model for the MSE loss as the distortion metric in Equation 13, while the perceptual MS-SSIM metric was used for the MS-SSIM results reported in Figure 4. In order to obtain the seven different bit rates on the R-D curve illustrated in Figure 4, seven models with seven different values of $\lambda$ were trained.

As shown in Figure 4, our method outperforms all standard codecs as well as the state-of-the-art learning-based image compression methods in terms of both PSNR and MS-SSIM. Compared to the new versatile video coding test model (VTM) and the method in [15], our approach provides 0.5dB better PSNR at lower bit rates (BPPs < 0.4) and almost the same performance at higher rates.


                       α = 0.5               α = 0.25              α = 0.75              Act-Out   Core-Oct   Org-Oct
BPP (high / low)       0.271 (0.217/0.054)   0.350 (0.323/0.027)   0.243 (0.104/0.139)   0.270     0.267      0.266
PSNR (dB)              32.10                 32.12                 31.11                 31.84     31.62      28.70
MS-SSIM (dB)           16.31                 16.34                 16.13                 16.04     15.34      13.23
Model Size (MB)        95.2                  110                   84                    94.9      72.1       73
Inference Time (sec)   0.447                 0.476                 0.411                 0.421     0.487      0.445
Table 1: Ablation study of different components in the proposed framework. The reported PSNR, MS-SSIM, and Inference Time are averaged over the Kodak image set. The Inference Time includes the entire encoding and decoding time. BPP: bits-per-pixel (high/low: BPPs for the high and low frequency latents). Act-Out: activation layers moved out of GoConv/GoTConv. Core-Oct: proposed GoConv/GoTConv only used for the core autoencoder. Org-Oct: GoConv/GoTConv replaced by original octave convolutions.

One visual example from the Kodak image set is given in Figure 5 in which our results are qualitatively compared with JPEG2000 and BPG (4:4:4 chroma format) at 0.1bpp. As seen in the example, our method provides the highest visual quality compared to the others. JPEG2000 has poor performance due to the ringing artifacts. The BPG result is smoother compared to JPEG2000, but the details and fine structures are not preserved in many areas, for example, in the patterns on the shirt and the colors around the eye.

5.1 Ablation Study

In order to evaluate the performance of different components of the proposed framework, ablation studies were performed, which are reported in Table 1. The results are averaged over the Kodak image set. All the models reported in this ablation study were optimized for the MSE distortion metric (at a single bit rate); however, the results are evaluated with both the PSNR and MS-SSIM metrics.

Ratio of high and low frequency: in order to study varying choices of the ratio of channels allocated to the low frequency feature representations, we evaluated our model with three different ratios $\alpha \in \{0.25, 0.5, 0.75\}$. As summarized in Table 1, compressing 50% of the latents as the low frequency part at half the resolution (i.e., $\alpha = 0.5$) results in the best R-D performance in both PSNR and MS-SSIM at 0.271bpp (where the contributions of the high and low frequency latents are 0.217bpp and 0.054bpp).

As the ratio decreases to $\alpha = 0.25$, less compression with a higher bit rate of 0.350bpp (0.323bpp for the high and 0.027bpp for the low frequency) is obtained, while no significant gain in the reconstruction quality is achieved. Although increasing the ratio to 75% provides better compression with 0.243bpp (high: 0.104bpp, low: 0.139bpp), it results in a significantly lower PSNR. As indicated by the model size and inference time in the table, a larger ratio results in a smaller and faster model since a larger portion of the latents is stored as low frequency maps at half spatial resolution.

Internal vs. external activation layers: in this scenario, we remove the internal activations (i.e., GDN/IGDN) employed in our proposed GoConv and GoTConv. Instead, as in the original octave convolution [9], we apply GDN to the output high and low frequency maps in GoConv, and IGDN before the input high and low frequency maps of GoTConv. This experiment is denoted by Act-Out in Table 1. As the comparison results indicate, the proposed architecture with internal activations (α = 0.5) provides better performance (with 0.26dB higher PSNR) since all internal feature maps corresponding to the inter- and intra-frequency components benefit from the activation function.

Octave only for core autoencoder: as described in Section 4, we utilized the proposed multi-frequency entropy model for both the latents and the hyper latents. In order to evaluate the effectiveness of multi-frequency modelling of the hyper latents, we also report results in which the proposed entropy model is only used for the core latents (denoted by Core-Oct in Table 1). To deal with the high and low frequency latents resulting from the multi-frequency core autoencoder, we used two separate networks (similar to [18], with vanilla convolutions) for each of the hyper encoder, hyper decoder, and parameter estimator. As summarized in the table, a PSNR gain of 0.48dB is achieved when both the core and hyper autoencoders benefit from the proposed multi-frequency model.

Original octave convolutions: in this experiment, we analyze the performance of the proposed GoConv and GoTConv architectures compared with the original octave convolutions (denoted by Org-Oct in Table 1). We replace all GoConv layers in the proposed framework (Figure 2) by original octave convolutions. For the octave transposed-convolutions used in the core and hyper decoders, we reverse the octave convolution operation, formulated as follows:

$$\tilde{Y}^H = g(\tilde{X}^H; W^{H\to H}) + \mathrm{upsample}\!\left(g(\tilde{X}^L; W^{L\to H}),\ 2\right), \qquad
\tilde{Y}^L = g(\tilde{X}^L; W^{L\to L}) + g\!\left(\mathrm{pool}(\tilde{X}^H,\ 2);\ W^{H\to L}\right) \qquad (17)$$

where $\tilde{X} = \{\tilde{X}^H, \tilde{X}^L\}$ and $\tilde{Y} = \{\tilde{Y}^H, \tilde{Y}^L\}$ are the input and output feature maps, and $g$ is the vanilla transposed-convolution. Similar to the octave convolution defined in [9], average pooling and nearest interpolation are respectively used for the down-sampling and up-sampling operations. As reported in Table 1, Org-Oct provides significantly lower performance than the architecture with the proposed GoConv and GoTConv, which is due to the fixed sub-sampling operations incorporated in its inter-frequency components. The PSNR and MS-SSIM of the proposed architecture are respectively 3.4dB and 3.08dB higher than Org-Oct at the same bit rate. Note that the ratio α = 0.5 was used for the Act-Out, Core-Oct, and Org-Oct models.

6 Conclusion

In this paper, we proposed a new multi-frequency image compression scheme with octave convolutions, in which the latents are factorized into high and low frequencies, and the low frequency is stored at a lower resolution to reduce its spatial redundancy. To retain the spatial structure of the input, novel generalized octave convolution and transposed-convolution architectures denoted by GoConv and GoTConv were also introduced. Our experiments showed that the proposed method significantly improves the R-D performance and achieves a new state of the art in learned image compression.

Further improvements can be achieved by a multi-resolution factorization of the latents into a sequence of high to low frequencies, as in the wavelet transform. Another potential direction of this work is to employ the proposed GoConv/GoTConv in other CNN-based architectures, particularly autoencoder-based schemes such as image denoising and semantic segmentation (please see the Appendix).

Acknowledgements

This work is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada under grant RGPIN-2015-06522.

References

  • [1] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. arXiv preprint arXiv:1704.00648.
  • [2] M. Akbari, J. Liang, J. Han, and C. Tu (2019) Learned variable-rate image compression with residual divisive normalization. arXiv preprint arXiv:1912.05688.
  • [3] M. Akbari, J. Liang, and J. Han (2019) DSSLIC: deep semantic segmentation-based layered image compression. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2042–2046.
  • [4] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies (1992) Image coding using wavelet transform. IEEE Transactions on Image Processing 1 (2), pp. 205–220.
  • [5] J. Ballé, V. Laparra, and E. P. Simoncelli (2016) End-to-end optimization of nonlinear transform codes for perceptual quality. In Picture Coding Symposium, pp. 1–5.
  • [6] J. Ballé, S. Singh, S. J. Hwang, and N. Johnston (2018) Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436.
  • [7] F. Bellard (2017) BPG image format. http://bellard.org/bpg
  • [8] C. Cai, L. Chen, X. Zhang, and Z. Gao (2018) Efficient variable rate image compression with multi-scale decomposition network. IEEE Transactions on Circuits and Systems for Video Technology.
  • [9] Y. Chen, H. Fan, B. Xu, Z. Yan, Y. Kalantidis, M. Rohrbach, S. Yan, and J. Feng (2019) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3435–3444.
  • [10] C. Christopoulos, A. Skodras, and T. Ebrahimi (2000) The JPEG2000 still image coding system: an overview. IEEE Transactions on Consumer Electronics 46 (4), pp. 1103–1127.
  • [11] Fraunhofer HHI (2019) VVC official test model VTM. https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM
  • [12] Google Inc. (2016) WebP. https://developers.google.com/speed/webp
  • [13] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici (2017) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. arXiv preprint arXiv:1703.10114.
  • [14] J. Lee, S. Cho, and S. Beack (2018) Context-adaptive entropy model for end-to-end optimized image compression. arXiv preprint arXiv:1809.10452.
  • [15] J. Lee, S. Cho, and M. Kim (2019) A hybrid architecture of jointly learning image compression and quality enhancement with improved entropy minimization. arXiv preprint arXiv:1912.12817.
  • [16] M. Li, W. Zuo, S. Gu, J. You, and D. Zhang (2019) Learning content-weighted deep image compression. arXiv preprint arXiv:1904.00664.
  • [17] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo (2018) Multi-level wavelet-CNN for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 773–782.
  • [18] D. Minnen, J. Ballé, and G. D. Toderici (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780.
  • [19] O. Rippel and L. Bourdev (2017) Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 2922–2930.
  • [20] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
  • [21] L. Theis, W. Shi, A. Cunningham, and F. Huszár (2017) Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.
  • [22] C. Tian, L. Fei, W. Zheng, Y. Xu, W. Zuo, and C. Lin (2019) Deep learning on image denoising: an overview. arXiv preprint arXiv:1912.13171.
  • [23] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar (2015) Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085.
  • [24] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell (2017) Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5306–5314.
  • [25] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798.
  • [26] Z. Wang, E. Simoncelli, and A. Bovik (2003) Multiscale structural similarity for image quality assessment. In Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Vol. 2, pp. 1398–1402.
  • [27] J. Xie, L. Xu, and E. Chen (2012) Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, pp. 341–349.
  • [28] Z. Zhang, Z. Chen, J. Lin, and W. Li (2019) Learned scalable image compression with bidirectional context disentanglement network. In IEEE International Conference on Multimedia and Expo, pp. 1438–1443.
  • [29] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 4.
  • [30] J. Zhou, S. Wen, A. Nakagawa, K. Kazui, and Z. Tan (2019) Multi-scale and context-adaptive entropy model for image compression. arXiv preprint arXiv:1910.07844.

Appendix A Appendix

A.1 Multi-Frequency Image Denoising

As explained in Section 3, since the proposed GoConv and GoTConv are designed as generic, plug-and-play units, they can be used in any autoencoder-based architecture. In this experiment, we build a simple convolutional autoencoder and use it for the image denoising problem [27, 17, 22], in which we try to denoise images corrupted by white Gaussian noise, a common artifact of many acquisition channels. The architecture of the autoencoder used in this experiment is summarized in Table 2, where the encoder and decoder are respectively composed of a sequence of vanilla convolutions and transposed-convolutions, each followed by batch normalization and ReLU.

We performed our experiments on the MNIST and CIFAR10 datasets. After 100 epochs of training, average PSNRs of 23.19dB and 23.29dB were achieved on the MNIST and CIFAR10 test sets, respectively. In order to analyze the performance of GoConv and GoTConv in this experiment, we replaced all the vanilla convolution layers with GoConv and all transposed-convolutions with GoTConv. The other properties of the encoder and decoder networks (e.g., number of layers, filters, and strides) were the same as the baseline in Table 2. We kept the same low-frequency ratio and trained the model for 100 epochs. For the MNIST dataset, the multi-frequency autoencoder achieved an average PSNR of 23.20dB (almost the same as the baseline with vanilla convolutions). However, for CIFAR10, we achieved an average PSNR of 23.54dB, which is 0.25dB higher than the baseline, due to the effective communication between the high and low frequencies. In addition, the proposed multi-frequency autoencoder is 5.5% smaller than the baseline model. The comparison results are presented in Table 3.

In Figure 7, eight visual examples from the CIFAR10 test set are given. Compared to the baseline model with vanilla convolutions and transposed-convolutions, the multi-frequency model with the proposed GoConv and GoTConv yields higher visual quality in the denoised images (e.g., the red car in the second column from the right).

Encoder                 Decoder
Conv (3×3, 32, s1)      T-Conv (3×3, 128, s2)
Conv (3×3, 32, s1)      T-Conv (3×3, 128, s1)
Conv (3×3, 64, s1)      T-Conv (3×3, 64, s1)
Conv (3×3, 64, s2)      T-Conv (3×3, 64, s2)
Conv (3×3, 128, s1)     T-Conv (3×3, 32, s1)
Conv (3×3, 128, s1)     T-Conv (3×3, 32, s1)
Conv (3×3, 256, s2)     T-Conv (3×3, 3, s1)
Table 2: Baseline convolutional autoencoder for image denoising (Conv: vanilla convolution; T-Conv: vanilla transposed-convolution).
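For reference, a sketch of the baseline autoencoder of Table 2, with vanilla convolutions/transposed-convolutions each followed by batch normalization and ReLU as stated above; the padding, output-padding, and 3-channel input are assumptions for the CIFAR10 case.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    # Conv (3x3) + BatchNorm + ReLU, as described for the baseline encoder
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

def tconv_block(in_ch, out_ch, stride):
    # T-Conv (3x3) + BatchNorm + ReLU, as described for the baseline decoder
    out_pad = 1 if stride == 2 else 0
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 3, stride=stride,
                                            padding=1, output_padding=out_pad),
                         nn.BatchNorm2d(out_ch), nn.ReLU())

# Encoder and decoder following Table 2 (input assumed to be 3-channel images)
encoder = nn.Sequential(
    conv_block(3, 32, 1), conv_block(32, 32, 1),
    conv_block(32, 64, 1), conv_block(64, 64, 2),
    conv_block(64, 128, 1), conv_block(128, 128, 1),
    conv_block(128, 256, 2))
decoder = nn.Sequential(
    tconv_block(256, 128, 2), tconv_block(128, 128, 1),
    tconv_block(128, 64, 1), tconv_block(64, 64, 2),
    tconv_block(64, 32, 1), tconv_block(32, 32, 1),
    tconv_block(32, 3, 1))
```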
              Baseline             Multi-Frequency
              MNIST     CIFAR10    MNIST     CIFAR10
PSNR (dB)     23.19     23.29      23.20     23.54
Size (MB)     4.49      4.50       4.44      4.45
Table 3: Comparison results of the baseline and multi-frequency autoencoders for image denoising on MNIST and CIFAR10 datasets.

(a) Input Images

(b) Input Noisy Images

(c) Baseline Denoised Results

(d) Multi-Frequency Denoised Results
Figure 6: Sample image denoising results from MNIST test set.

(a) Input Images

(b) Input Noisy Images

(c) Baseline Denoised Results

(d) Multi-Frequency Denoised Results
Figure 7: Sample image denoising results from CIFAR10.