1 Introduction
Deep learning-based image compression [19, 1, 3, 5, 13, 16, 18, 14, 21, 23, 24] has shown the potential to outperform standard codecs such as JPEG2000, the H.265/HEVC-based BPG image codec [7], and the new versatile video coding test model (VTM) [11]. Learned image compression was first used in [23] to compress thumbnail images using long short-term memory (LSTM)-based recurrent neural networks (RNNs), in which better SSIM results than JPEG and WebP were reported. This approach was generalized in [13], which utilized spatially adaptive bit allocation to further improve the performance. In [5], a scheme based on generalized divisive normalization (GDN) and inverse GDN (IGDN) was proposed, which outperformed JPEG2000 in both PSNR and SSIM. A compressive autoencoder framework with residual connections as in ResNet was proposed in [21], where the quantization was replaced by a smooth approximation, and a scaling approach was used to obtain different rates. In [1], a soft-to-hard vector quantization approach was introduced, and a unified framework was developed for image compression. In order to take the spatial variation of image content into account, a content-weighted framework was introduced in [16], where an importance map for locally adaptive bit rate allocation was employed. A learned channel-wise quantization along with arithmetic coding was also used to reduce the quantization error. There have also been some efforts to take advantage of other computer vision tasks in image compression frameworks. For example, in [3], a deep semantic segmentation-based layered image compression (DSSLIC) scheme was proposed, taking advantage of a generative adversarial network (GAN) and BPG-based residual coding. It outperformed the BPG codec (in RGB 4:4:4 format) in both PSNR and MS-SSIM [26].

Since most learned image compression methods need to train multiple networks for multiple bit rates, variable-rate approaches have also been proposed, in which a single neural network model is trained to operate at multiple bit rates. This approach was first introduced in [23], and was then generalized for full-resolution images using deep learning-based entropy coding in [24]. A CNN-based multi-scale decomposition transform was optimized for all scales in [8], which was better than BPG in MS-SSIM. In [28], a learned progressive image compression model was proposed using bit-plane decomposition and bidirectional assembling of gated units. Another variable-rate framework was introduced in [2], which employed GDN-based shortcut connections, stochastic rounding-based scalable quantization, and a variable-rate objective function. The method in [2] outperformed previous learned variable-rate methods.
Most previous works used fixed entropy models shared between the encoder and decoder. In [6], a conditional entropy model based on the Gaussian scale mixture (GSM) was proposed, where the scale parameters were conditioned on a hyperprior learned using a hyper autoencoder. The compressed hyperprior was transmitted and added to the bitstream as side information. This model was extended in [18, 14], where a Gaussian mixture model (GMM) with both mean and scale parameters conditioned on the hyperprior was utilized. In these methods, the hyperpriors were combined with autoregressive priors generated using context models, which outperformed BPG in terms of both PSNR and MS-SSIM. The coding efficiency in [14] was further improved in [15] by a joint optimization of the image compression and quality enhancement networks. Another context-adaptive approach was introduced in [30], in which multi-scale masked convolutional networks were utilized for the autoregressive model combined with hyperpriors.

The state of the art in learned image compression has been achieved by context-adaptive entropy methods in which hyperprior and autoregressive models are combined. These approaches are jointly optimized to effectively capture the spatial dependencies and probabilistic structures of the latent representations, leading to compression models with superior rate-distortion (RD) performance. However, similar to natural images, the latents contain a mixture of information at multiple frequencies, which is usually represented by feature maps of the same spatial resolution. Some of these maps are spatially redundant because they consist of low-frequency information, which could be represented and compressed more efficiently.
In this paper, a multi-frequency entropy model is introduced in which octave convolutions [9] are utilized to factorize the latent representations into high and low frequencies. The low-frequency information is then represented at a lower spatial resolution, which reduces the corresponding spatial redundancy and improves the compression ratio, similar to wavelet transforms [4]. In addition, due to the effective communication between high and low frequencies, the reconstruction performance is also improved. In order to preserve the spatial information and structure of the latents, we develop novel generalized octave convolution and octave transposed-convolution architectures with internal activation layers. Experimental results show that the proposed multi-frequency framework outperforms all standard codecs, including BPG and VTM, as well as learning-based methods in both PSNR and MS-SSIM, and establishes a new state of the art in learning-based image compression.
The paper is organized as follows. In Section 2, vanilla and octave convolutions are briefly described and compared. The proposed generalized octave convolution and transposed-convolution with internal activation layers are formulated and discussed in Section 3. Following that, the architecture of the proposed multi-frequency image compression framework as well as the multi-frequency entropy model are introduced in Section 4. Finally, in Section 5, the experimental results, along with an ablation study, are discussed and compared with state-of-the-art learning-based image compression methods.
2 Vanilla vs. Octave Convolution
Let $X, Y \in \mathbb{R}^{c \times h \times w}$ be the input and output feature vectors with $c$ channels of size $h \times w$. Each feature map in the vanilla convolution is obtained as follows:

    $Y_{p,q} = \sum_{(i,j) \in \mathcal{N}_k} W_{i,j}^{\top}\, X_{p+i,\, q+j}$    (1)

where $W \in \mathbb{R}^{c \times c \times k \times k}$ is a convolution kernel, $(p, q)$ is the location coordinate, and $\mathcal{N}_k = \{(i, j) : i, j \in \{-\lfloor k/2 \rfloor, \ldots, \lfloor k/2 \rfloor\}\}$ is a local neighborhood.
In the vanilla convolution, all input and output feature maps are of the same spatial resolution, which represents a mixture of information at high and low frequencies. Due to the redundancy in low frequency information, the vanilla convolution is not efficient in terms of both memory and computation cost.
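To make the vanilla convolution above concrete, here is a minimal direct implementation in Python/NumPy. This is our own illustration (the function name and the "same"-padding choice are ours, not part of the paper):

```python
import numpy as np

def vanilla_conv2d(x, w):
    """Direct 2D convolution per Eq. (1): each output position is a weighted
    sum over a k-by-k local neighborhood across all input channels.
    x: (c_in, h, w) input; w: (c_out, c_in, k, k) kernel; 'same' padding."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    y = np.zeros((c_out, h, wd))
    for p in range(h):
        for q in range(wd):
            patch = xp[:, p:p + k, q:q + k]              # local neighborhood N_k
            y[:, p, q] = np.tensordot(w, patch, axes=3)  # sum over c_in, i, j
    return y
```

Every output map here has the full spatial resolution of the input, which is exactly the inefficiency the octave factorization below addresses.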
In the following, $g(\cdot\,; W)$ denotes a regular convolution with a stride of 2, and $g^*(\cdot\,; W)$ a regular transposed-convolution with a stride of 2.

To address this problem, the recently developed octave convolution [9] factorizes the feature maps into high- and low-frequency parts, each processed with different convolutions. As a result, the resolution of the low-frequency feature maps can be spatially reduced, which saves both memory and computation.
The factorization of the input vector in octave convolutions is denoted by $X = \{X^H, X^L\}$, where $X^H$ and $X^L$ are respectively the high- and low-frequency maps. The ratio of channels allocated to the low-frequency feature representations (i.e., at half of the spatial resolution) is defined by $\alpha \in [0, 1]$. The factorized output vector is denoted by $Y = \{Y^H, Y^L\}$, where the output high- and low-frequency maps are given by $Y^H = Y^{H \to H} + Y^{L \to H}$ and $Y^L = Y^{L \to L} + Y^{H \to L}$. Here, $Y^{H \to H}$ and $Y^{L \to L}$ are intra-frequency updates, while $Y^{H \to L}$ and $Y^{L \to H}$ denote inter-frequency communication. The intra-frequency component updates the information within each of the high- and low-frequency convolutions, while the inter-frequency communication further enables information exchange between them.
The octave convolution kernel is given by $W = \{W^H, W^L\}$, with which the inputs $X^H$ and $X^L$ are respectively convolved. $W^H$ and $W^L$ are further divided into intra- and inter-frequency components as follows: $W^H = \{W^{H \to H}, W^{L \to H}\}$ and $W^L = \{W^{L \to L}, W^{H \to L}\}$. For the intra-frequency update, the regular vanilla convolution is used, while up- and down-sampling interpolations are applied to compute the inter-frequency communication, formulated as:

    $Y^{H \to L} = f(\mathrm{pool}(X^H, 2);\, W^{H \to L}), \qquad Y^{L \to H} = \mathrm{upsample}(f(X^L;\, W^{L \to H}), 2)$    (2)

where $f(\cdot\,; W)$ denotes a vanilla convolution with parameters $W$, $\mathrm{pool}(\cdot, 2)$ is average pooling with a stride of 2, and $\mathrm{upsample}(\cdot, 2)$ is nearest-neighbor up-sampling by a factor of 2.
As reported in [9], due to the effective inter-frequency communication, the octave convolution can achieve better performance in classification and recognition tasks compared to the vanilla convolution.
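The memory saving of the octave factorization is simple bookkeeping. The following sketch (our own illustration, not from the paper) counts stored feature-map elements when a fraction $\alpha$ of the channels is kept at half spatial resolution:

```python
def octave_elements(c, h, w, alpha):
    """Number of stored feature-map elements when a fraction `alpha` of the
    c channels is kept at half spatial resolution (h/2 x w/2), as in octave
    convolutions, versus the vanilla single-resolution layout."""
    high = round((1 - alpha) * c) * h * w         # full-resolution maps
    low = round(alpha * c) * (h // 2) * (w // 2)  # half-resolution maps
    return high + low

# With alpha = 0.5, the multi-frequency layout stores only 62.5% of the
# elements of the vanilla layout, since 0.5 + 0.5 * (1/4) = 0.625.
vanilla = 192 * 16 * 16
octave = octave_elements(192, 16, 16, 0.5)
ratio = octave / vanilla
```

The tensor sizes (192 channels, 16×16 maps) are arbitrary example values; the 0.625 ratio holds for any size at $\alpha = 0.5$.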
3 Generalized Octave Convolution
In the original octave convolutions, average pooling and nearest-neighbor interpolation are respectively employed for the down- and up-sampling operations in the inter-frequency communication [9]. Such conventional interpolations do not preserve the spatial information and structure of the input feature map. In addition, in convolutional autoencoders, where subsampling needs to be reversed at the decoder side, fixed operations such as pooling result in poor performance [20].
In this work, we propose a novel generalized octave convolution (GoConv) in which strided convolutions are used to subsample the feature vectors and compute the inter-frequency communication in a more effective way. Unlike conventional subsampling methods, strided convolutions can learn to preserve properties such as spatial information, which stabilizes and improves the performance [20], especially in autoencoder architectures. Moreover, as in ResNet, applying a strided convolution (i.e., convolution and down-sampling at the same time) reduces the computational cost compared to a convolution followed by a fixed down-sampling operation (e.g., average pooling). The architecture of the proposed GoConv is shown in Figure 1.
The output high- and low-frequency feature maps in GoConv are formulated as follows:

    $Y^H = Y^{H \to H} + Y^{L \to H}$, with $Y^{H \to H} = f(X^H; W^{H \to H})$ and $Y^{L \to H} = g^*(X^L; W^{L \to H})$,
    $Y^L = Y^{L \to L} + Y^{H \to L}$, with $Y^{L \to L} = f(X^L; W^{L \to L})$ and $Y^{H \to L} = g(X^H; W^{H \to L})$    (3)

where $g(\cdot\,; W)$ and $g^*(\cdot\,; W)$ are respectively vanilla convolution and transposed-convolution operations with a stride of 2.
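The four paths of GoConv can be sketched in NumPy as follows. This is our own toy illustration, not the paper's code: for brevity the intra-frequency convolutions use 1×1 kernels and the stride-2 (transposed) convolutions use 2×2 kernels, and activation layers are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # Vanilla 1x1 convolution: channel mixing at every spatial position.
    return np.einsum('oc,chw->ohw', w, x)

def conv2x2_s2(x, w):
    # Regular convolution with stride 2 (2x2 kernel): halves the resolution.
    c_out = w.shape[0]
    _, h, wd = x.shape
    y = np.zeros((c_out, h // 2, wd // 2))
    for p in range(h // 2):
        for q in range(wd // 2):
            y[:, p, q] = np.tensordot(w, x[:, 2*p:2*p+2, 2*q:2*q+2], axes=3)
    return y

def tconv2x2_s2(x, w):
    # Transposed convolution with stride 2 (2x2 kernel): doubles the resolution.
    c_out = w.shape[0]
    _, h, wd = x.shape
    y = np.zeros((c_out, 2 * h, 2 * wd))
    for p in range(h):
        for q in range(wd):
            y[:, 2*p:2*p+2, 2*q:2*q+2] += np.einsum('ocij,c->oij', w, x[:, p, q])
    return y

def goconv(xh, xl, W):
    # Eq. (3): intra-frequency updates use vanilla convolutions, while the
    # inter-frequency paths use learned stride-2 (transposed) convolutions.
    yh = conv1x1(xh, W['hh']) + tconv2x2_s2(xl, W['lh'])  # Y^H
    yl = conv1x1(xl, W['ll']) + conv2x2_s2(xh, W['hl'])   # Y^L
    return yh, yl

# Toy sizes: 8 high-frequency channels at 16x16, 8 low-frequency at 8x8 (alpha = 0.5).
xh, xl = rng.normal(size=(8, 16, 16)), rng.normal(size=(8, 8, 8))
W = {'hh': rng.normal(size=(8, 8)), 'll': rng.normal(size=(8, 8)),
     'hl': rng.normal(size=(8, 8, 2, 2)), 'lh': rng.normal(size=(8, 8, 2, 2))}
yh, yl = goconv(xh, xl, W)
```

Note how the output high-frequency tensor keeps the full resolution while the low-frequency tensor stays at half resolution; the stride-2 paths are learned, unlike the fixed pooling/interpolation of Equation 2.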
We also propose a generalized octave transposed-convolution denoted by GoTConv (Figure 1), which can replace the conventional transposed-convolution commonly employed in deep autoencoder (encoder-decoder) architectures. Let $\tilde{Y} = \{\tilde{Y}^H, \tilde{Y}^L\}$ and $\tilde{X} = \{\tilde{X}^H, \tilde{X}^L\}$ respectively be the factorized input and output feature vectors. The output high- and low-frequency maps $\tilde{X}^H$ and $\tilde{X}^L$ in GoTConv are obtained as follows:

    $\tilde{X}^H = \tilde{X}^{H \to H} + \tilde{X}^{L \to H}, \qquad \tilde{X}^L = \tilde{X}^{L \to L} + \tilde{X}^{H \to L}$    (4)

where $\tilde{X}^{H \to H} = f^*(\tilde{Y}^H; W^{H \to H})$ and $\tilde{X}^{L \to L} = f^*(\tilde{Y}^L; W^{L \to L})$. Unlike GoConv, in which the regular convolution operation is used, a transposed-convolution denoted by $f^*(\cdot\,; W)$ is applied for the intra-frequency update in GoTConv. For the up- and down-sampling operations in the inter-frequency communication, i.e., $\tilde{X}^{L \to H} = g^*(\tilde{Y}^L; W^{L \to H})$ and $\tilde{X}^{H \to L} = g(\tilde{Y}^H; W^{H \to L})$, the same strided convolutions $g$ and $g^*$ as in GoConv are respectively utilized.
Similar to the original octave convolution, the proposed GoConv and GoTConv are designed and formulated as generic, plug-and-play units. As a result, they can respectively replace the vanilla convolution and transposed-convolution units in any convolutional neural network (CNN) architecture, especially autoencoder-based frameworks such as image compression, image denoising, and semantic segmentation. When used in an autoencoder, the input image to the encoder is not represented as a multi-frequency tensor. In this case, to compute the output of the first GoConv layer in the encoder, Equation 3 is modified as follows:

    $Y^H = f(X; W^{H \to H}), \qquad Y^L = g(X; W^{H \to L})$    (5)

where $X$ is the single-tensor input.
Similarly, at the decoder side, the output of the last GoTConv is a single-tensor representation $\tilde{X}$, which can be formulated by modifying Equation 4 as:

    $\tilde{X} = f^*(\tilde{Y}^H; W^{H \to H}) + g^*(\tilde{Y}^L; W^{L \to H})$    (6)

where $f^*$ and $g^*$ are respectively the stride-1 and stride-2 transposed-convolutions of GoTConv.
In the original octave convolutions, activation layers (e.g., ReLU) are applied to the output high- and low-frequency maps. In this work, however, we utilize activations for each internal convolution performed in our proposed GoConv and GoTConv. In this way, we ensure that activation functions are properly applied to each feature map computed by the convolution operations. Each of the inter- and intra-frequency components is thus followed by an activation layer in GoConv. This process is inverted in the proposed GoTConv, where the activation layer is followed by the inter- and intra-frequency communications, as shown in Figure 1.

4 Multi-Frequency Entropy Model
The octave convolution is similar to the wavelet transform [4] in that it uses a lower resolution for the low-frequency information than for the high-frequency information. Therefore, it can be used to improve the RD performance in learning-based image compression frameworks. Octave convolutions store the features in a multi-frequency representation where the low-frequency part is kept at half the spatial resolution, which results in a higher compression ratio. Moreover, due to the effective communication between high and low frequencies as well as the receptive field enlargement in octave convolutions, they also improve the performance of the analysis (encoding) and synthesis (decoding) transforms in a compression framework.
The overall architecture of the proposed multi-frequency image compression framework is shown in Figure 2. Similar to [18], our architecture is composed of two sub-networks: the core autoencoder and the entropy sub-network. The core autoencoder is used to learn a quantized latent vector of the input image, while the entropy sub-network is responsible for learning a probabilistic model over the quantized latent representations, which is utilized for entropy coding.
We have made several improvements to the scheme in [18]. In order to handle multi-frequency entropy coding, all vanilla convolutions in the core encoder, hyper encoder, and parameter estimator are replaced by the proposed GoConv, and all vanilla transposed-convolutions in the core and hyper decoders are replaced by GoTConv. In [18], each convolution/transposed-convolution is accompanied by an activation layer (e.g., GDN/IGDN or leaky ReLU). In our architecture, we move these layers into the GoConv and GoTConv units and directly apply them to the inter- and intra-frequency components, as described in Section 3. GDN/IGDN transforms are respectively used for the GoConv and GoTConv employed in the proposed deep encoder and decoder, while leaky ReLU is utilized for the hyper autoencoder and the parameter estimator. The convolution properties (i.e., the size and number of filters and the strides) of all networks, including the core and hyper autoencoders, context models, and parameter estimator, are the same as in [18].

Let $x$ be the input image. The multi-frequency latent representations are denoted by $y = \{y^H, y^L\}$, where $y^H$ and $y^L$ are generated using the parametric deep encoder (i.e., analysis transform) $g_a$, represented as:
    $\{y^H, y^L\} = g_a(x; \phi)$    (7)

where $\phi$ is the parameter vector to be optimized. Let $M$ denote the total number of output channels in $y$, which is divided into $(1 - \alpha)M$ channels for the high-frequency part and $\alpha M$ channels for the low-frequency part (i.e., at half the spatial resolution of the high-frequency part). The calculation in Equation 5 is used for the first GoConv layer, while the other encoder layers are formulated using Equation 3.
At the decoder side, the parametric decoder (i.e., synthesis transform) $g_s$ with the parameter vector $\theta$ reconstructs the image $\tilde{x}$ as follows:

    $\tilde{x} = g_s\big(Q(y^H), Q(y^L); \theta\big)$    (8)

where $Q$ represents the addition of uniform noise to the latent representations during training, or uniform quantization (i.e., the round function in this work) and arithmetic encoding/decoding of the latents during testing. As illustrated in Figure 2, the quantized high- and low-frequency latents $\hat{y}^H$ and $\hat{y}^L$ are entropy-coded using two separate arithmetic encoders and decoders.
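The two behaviors of $Q$ can be sketched as follows (our own illustration; the arithmetic coding step that follows rounding at test time is omitted):

```python
import numpy as np

def quantize(y, training, rng=None):
    """Q in Eq. (8): additive uniform noise on U(-0.5, 0.5) during training
    (a differentiable proxy for rounding), hard rounding at test time."""
    if training:
        return y + rng.uniform(-0.5, 0.5, size=y.shape)
    return np.round(y)

rng = np.random.default_rng(0)
y = np.array([0.2, 1.7, -0.6])
y_test = quantize(y, training=False)            # hard-rounded symbols
y_train = quantize(y, training=True, rng=rng)   # noisy latents, within 0.5 of y
```

The noise proxy keeps the training objective differentiable while matching the marginal statistics of rounding, which is why the same Gaussian-plus-uniform entropy model (Equation 15) fits both phases.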
The entropy sub-network in our architecture contains two models: a context model and a hyper autoencoder. The context model is an autoregressive model over the multi-frequency latent representations. Unlike the other networks in our architecture, where GoConv is incorporated for the convolutions, we use vanilla convolutions in the context model to ensure that the causality of the contexts is not spoiled by the intra-frequency communication in GoConv. The contexts of the high- and low-frequency latents, denoted by $c^H$ and $c^L$, are then predicted with two separate models $g_{cm}^H$ and $g_{cm}^L$, defined as follows:
    $c^H = g_{cm}^H(\hat{y}^H; \theta_{cm}^H), \qquad c^L = g_{cm}^L(\hat{y}^L; \theta_{cm}^L)$    (9)

where $\theta_{cm}^H$ and $\theta_{cm}^L$ are the parameters to be optimized. Both $g_{cm}^H$ and $g_{cm}^L$ are composed of one 5×5 masked convolution [25] with a stride of 2.
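The causality that the masked convolution enforces can be sketched by constructing the mask itself. This is our own helper in the PixelCNN style of [25], assuming a raster-scan order (a type-"A" mask also hides the center so a latent never conditions on itself):

```python
import numpy as np

def causal_mask(k, mask_type='A'):
    """k-by-k binary mask for a masked convolution: positions strictly before
    the center in raster-scan order are 1, later ones 0. Type 'A' also zeroes
    the center; type 'B' keeps it (used in deeper layers)."""
    m = np.ones((k, k))
    m[k // 2, k // 2 + (mask_type == 'B'):] = 0  # center row, from (or after) center
    m[k // 2 + 1:, :] = 0                        # all rows below the center
    return m

mask = causal_mask(5, 'A')  # multiply element-wise into the 5x5 kernel before convolving
```

Multiplying the kernel by this mask before each forward pass guarantees that every context prediction depends only on already-decoded latents.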
The hyper autoencoder learns to represent side information useful for correcting the context-based predictions. The spatial dependencies of the latents are then captured into the multi-frequency hyper latent representations $z = \{z^H, z^L\}$ using the parametric hyper encoder $h_a$ (with the parameter vector $\phi_h$), defined as:

    $\{z^H, z^L\} = h_a(y^H, y^L; \phi_h)$    (10)
The quantized hyper latents are also part of the generated bitstream and must be entropy-coded and transmitted. As with the core latents, two separate arithmetic coders are used for the quantized high- and low-frequency hyper latents $\hat{z}^H$ and $\hat{z}^L$. Given the quantized hyper latents, the side information $\psi$ used for the entropy model estimation is reconstructed using the hyper decoder $h_s$ (with the parameter vector $\theta_h$), formulated as:

    $\psi = h_s(\hat{z}^H, \hat{z}^L; \theta_h)$    (11)
As shown in Figure 2, to estimate the mean and scale parameters required for a conditional Gaussian entropy model, the information from both the context model and the hyper decoder is combined by another network, denoted by $g_{ep}$ (with the parameter vector $\theta_{ep}$), represented as follows:

    $\{\mu^H, \sigma^H, \mu^L, \sigma^L\} = g_{ep}(\psi, c^H, c^L; \theta_{ep})$    (12)

where $\mu^H$ and $\sigma^H$ are the parameters for entropy modelling of the high-frequency information, and $\mu^L$ and $\sigma^L$ are for the low-frequency information.
The objective function for training is composed of two terms: the rate $R$, which is the expected length of the bitstream, and the distortion $D$, which is the expected error between the input and reconstructed images. The RD balance is determined by a Lagrange multiplier denoted by $\lambda$. The RD optimization problem is then defined as follows:

    $\mathcal{L} = R + \lambda D = R^H + R^L + \lambda\, \mathbb{E}_{x \sim p_x}\big[d(x, \tilde{x})\big]$    (13)
where $p_x$ is the unknown distribution of natural images and $d(\cdot, \cdot)$ can be any distortion metric such as mean squared error (MSE) or MS-SSIM. $R^H$ and $R^L$ are the rates corresponding to the high- and low-frequency information (bitstreams), defined as follows:

    $R^H = \mathbb{E}\big[{-\log_2 p_{\hat{y}^H \mid \hat{z}^H}(\hat{y}^H \mid \hat{z}^H)}\big] + \mathbb{E}\big[{-\log_2 p_{\hat{z}^H}(\hat{z}^H)}\big]$
    $R^L = \mathbb{E}\big[{-\log_2 p_{\hat{y}^L \mid \hat{z}^L}(\hat{y}^L \mid \hat{z}^L)}\big] + \mathbb{E}\big[{-\log_2 p_{\hat{z}^L}(\hat{z}^L)}\big]$    (14)
where $p_{\hat{y}^H \mid \hat{z}^H}$ and $p_{\hat{y}^L \mid \hat{z}^L}$ are respectively the conditional Gaussian entropy models for the high- and low-frequency latent representations ($\hat{y}^H$ and $\hat{y}^L$), formulated as:

    $p_{\hat{y}^H \mid \hat{z}^H}(\hat{y}^H \mid \hat{z}^H) = \prod_i \big(\mathcal{N}(\mu_i^H, {\sigma_i^H}^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\big)(\hat{y}_i^H)$
    $p_{\hat{y}^L \mid \hat{z}^L}(\hat{y}^L \mid \hat{z}^L) = \prod_i \big(\mathcal{N}(\mu_i^L, {\sigma_i^L}^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\big)(\hat{y}_i^L)$    (15)
where each latent is modelled as a Gaussian convolved with a unit uniform distribution, which ensures a good match between the encoder and decoder distributions of both the quantized and continuous-valued latents. The mean and scale parameters $\mu^H$, $\sigma^H$, $\mu^L$, and $\sigma^L$ are generated via the network $g_{ep}$ defined in Equation 12.

Since the compressed hyper latents $\hat{z}^H$ and $\hat{z}^L$ are also part of the generated bitstream, their transmission costs are also considered in the rate term formulated in Equation 14. As in [6, 18], a non-parametric, fully-factorized density model is used for the entropy model of the high- and low-frequency hyper latents as follows:

    $p_{\hat{z}^H \mid \Theta^H}(\hat{z}^H \mid \Theta^H) = \prod_i \big(p_{z_i^H \mid \Theta_i^H} * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\big)(\hat{z}_i^H)$
    $p_{\hat{z}^L \mid \Theta^L}(\hat{z}^L \mid \Theta^L) = \prod_i \big(p_{z_i^L \mid \Theta_i^L} * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\big)(\hat{z}_i^L)$    (16)

where $\Theta^H$ and $\Theta^L$ denote the parameter vectors for the univariate distributions $p_{z_i^H \mid \Theta_i^H}$ and $p_{z_i^L \mid \Theta_i^L}$.
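The Gaussian-convolved-with-uniform model of Equation 15 gives a simple closed form for the probability of each quantized symbol, and hence its ideal code length. A self-contained sketch (our own notation and helper names):

```python
import math

def phi(x):
    # Standard normal CDF.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def likelihood(y_hat, mu, sigma):
    """Probability mass the entropy model assigns to the integer symbol
    y_hat: a Gaussian N(mu, sigma^2) convolved with a unit uniform, i.e.
    the Gaussian mass on the interval [y_hat - 0.5, y_hat + 0.5]."""
    return phi((y_hat + 0.5 - mu) / sigma) - phi((y_hat - 0.5 - mu) / sigma)

def rate_bits(y_hat, mu, sigma):
    # Ideal code length in bits: -log2 p(y_hat), as in the rate term of Eq. (14).
    return -math.log2(likelihood(y_hat, mu, sigma))

# The masses over all integer symbols sum to 1, and symbols far from the
# predicted mean cost more bits, which is what the context model exploits.
total = sum(likelihood(k, mu=0.3, sigma=1.2) for k in range(-30, 31))
```

An accurate mean prediction from the context model and hyperprior concentrates this distribution, which directly lowers the expected code length.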
5 Experimental Results
The ADE20K dataset [29], with images of at least 512 pixels in height or width (9272 images in total), was used for training the proposed model. All images were rescaled to a fixed size for training. We set $\alpha = 0.5$ so that 50% of the latent representations are assigned to the low-frequency part with half spatial resolution. Sample high- and low-frequency latent representations are shown in Figure 3.
Considering the four layers of strided convolutions (each with a stride of 2) and the number of output channels in the core encoder (Figure 2), the high- and low-frequency latents $y^H$ and $y^L$ will respectively be of size 16×16×96 and 8×8×96 for training. As discussed in [6], the optimal number of filters increases with the RD balance factor $\lambda$, which indicates that higher network capacity is required for models operating at higher bit rates. As a result, in order to avoid $\lambda$-dependent performance saturation and to boost the network capacity, we use a larger number of filters for higher bit rates (BPPs > 0.5). All models in our framework were jointly trained for 200 epochs with mini-batch stochastic gradient descent and a batch size of 8. The Adam solver with a learning rate of 0.00005 was used for the first 100 epochs, and the rate was then gradually decreased to zero over the next 100 epochs.
We compare the performance of the proposed scheme with standard codecs, including JPEG, JPEG2000 [10], WebP [12], BPG (in both YUV4:2:0 and YUV4:4:4 formats) [7], and VTM (version 7.1) [11], as well as state-of-the-art learned image compression methods in [18, 16, 14, 15, 30]. We use both PSNR and MS-SSIM as the evaluation metrics, where the MS-SSIM scores are reported in dB, defined as: $\text{MS-SSIM (dB)} = -10 \log_{10}(1 - \text{MS-SSIM})$. The comparison results on the popular Kodak image set (averaged over its 24 test images) are shown in Figure 4. For the PSNR results, we optimized the model with the MSE loss as the distortion metric in Equation 13, while the perceptual MS-SSIM metric was used for the MS-SSIM results reported in Figure 4. In order to obtain the seven points on the RD curves illustrated in Figure 4, seven models with seven different values of $\lambda$ were trained.

As shown in Figure 4, our method outperforms all standard codecs as well as the state-of-the-art learning-based image compression methods in terms of both PSNR and MS-SSIM. Compared to the new versatile video coding test model (VTM) and the method in [15], our approach provides up to 0.5 dB better PSNR at lower bit rates (BPPs < 0.4) and almost the same performance at higher rates.
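The MS-SSIM-to-dB mapping used on the RD curves is a one-liner; for reference (the helper name is ours):

```python
import math

def msssim_db(msssim):
    """Map an MS-SSIM score in [0, 1) to the dB scale used on RD curves:
    -10 * log10(1 - MS-SSIM). Higher is better; the log scale expands the
    near-1 region where learned codecs typically operate."""
    return -10.0 * math.log10(1.0 - msssim)
```

For example, raising MS-SSIM from 0.99 to 0.999 corresponds to a 10 dB improvement on this scale, even though the raw scores differ by less than 0.01.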
Table 1: Ablation study results, averaged over the Kodak image set.

                      α = 0.5  α = 0.25  α = 0.75  ActOut  CoreOct  OrgOct
BPP                   0.271    0.350     0.243     0.270   0.267    0.266
PSNR (dB)             32.10    32.12     31.11     31.84   31.62    28.70
MS-SSIM (dB)          16.31    16.34     16.13     16.04   15.34    13.23
Model Size (MB)       95.2     110       84        94.9    72.1     73
Inference Time (sec)  0.447    0.476     0.411     0.421   0.487    0.445
One visual example from the Kodak image set is given in Figure 5, in which our results are qualitatively compared with JPEG2000 and BPG (4:4:4 chroma format) at 0.1 bpp. As seen in the example, our method provides the highest visual quality. JPEG2000 performs poorly due to ringing artifacts. The BPG result is smoother than JPEG2000, but details and fine structures are not preserved in many areas, for example, in the patterns on the shirt and the colors around the eye.
5.1 Ablation Study
In order to evaluate the contribution of different components of the proposed framework, ablation studies were performed, and the results are reported in Table 1. The results are averaged over the Kodak image set. All models reported in this ablation study were optimized for the MSE distortion metric (at a single bit rate), but the results were evaluated with both the PSNR and MS-SSIM metrics.
Ratio of high and low frequency: in order to study different choices of the ratio of channels allocated to the low-frequency feature representations, we evaluated our model with three ratios, $\alpha \in \{0.25, 0.5, 0.75\}$. As summarized in Table 1, compressing 50% of the latents to half resolution (i.e., $\alpha = 0.5$) results in the best RD performance in both PSNR and MS-SSIM at 0.271 bpp (where the contributions of the high- and low-frequency latents are 0.217 bpp and 0.054 bpp, respectively).

As the ratio decreases to $\alpha = 0.25$, less compression with a higher bit rate of 0.350 bpp (0.323 bpp for the high and 0.027 bpp for the low frequency) is obtained, while no significant gain in reconstruction quality is achieved. Although increasing the ratio to $\alpha = 0.75$ provides better compression at 0.243 bpp (high: 0.104 bpp, low: 0.139 bpp), it results in a significantly lower PSNR. As indicated by the model size and inference time in the table, a larger ratio results in a smaller and faster model, since less space is required to store the low-frequency maps at half spatial resolution.
Internal vs. external activation layers: in this scenario, we remove the internal activations (i.e., GDN/IGDN) employed in our proposed GoConv and GoTConv. Instead, as in the original octave convolution [9], we apply GDN to the output high- and low-frequency maps in GoConv, and IGDN before the input high- and low-frequency maps in GoTConv. This experiment is denoted by ActOut in Table 1. As the comparison shows, the proposed architecture with internal activations ($\alpha = 0.5$) provides better performance (0.26 dB higher PSNR), since all internal feature maps corresponding to the inter- and intra-frequency components benefit from the activation function.
Octave only for core autoencoder: as described in Section 4, we utilize the proposed multi-frequency entropy model for both the latents and hyper latents. In order to evaluate the effectiveness of multi-frequency modelling of the hyper latents, we also report results in which the proposed entropy model is only used for the core latents (denoted by CoreOct in Table 1). To deal with the high- and low-frequency latents resulting from the multi-frequency core autoencoder, we used two separate networks (similar to [18], with vanilla convolutions) for each of the hyper encoder, hyper decoder, and parameter estimator models. As summarized in the table, a PSNR gain of 0.48 dB is achieved when both the core and hyper autoencoders benefit from the proposed multi-frequency model.
Original octave convolutions: in this experiment, we analyze the performance of the proposed GoConv and GoTConv architectures compared with the original octave convolutions (denoted by OrgOct in Table 1). We replace all GoConv layers in the proposed framework (Figure 2) by original octave convolutions. For the octave transposed-convolution used in the core and hyper decoders, we reverse the octave convolution operation, formulated as follows:

    $\tilde{X}^H = f^*(\tilde{Y}^H; W^{H \to H}) + \mathrm{upsample}(f^*(\tilde{Y}^L; W^{L \to H}), 2)$
    $\tilde{X}^L = f^*(\tilde{Y}^L; W^{L \to L}) + f^*(\mathrm{pool}(\tilde{Y}^H, 2); W^{H \to L})$    (17)

where $\tilde{Y} = \{\tilde{Y}^H, \tilde{Y}^L\}$ and $\tilde{X} = \{\tilde{X}^H, \tilde{X}^L\}$ are the input and output feature maps, and $f^*(\cdot\,; W)$ is a vanilla transposed-convolution. As in the octave convolution defined in [9], average pooling and nearest-neighbor interpolation are respectively used for the down-sampling and up-sampling operations. As reported in Table 1, OrgOct provides significantly lower performance than the architecture with the proposed GoConv and GoTConv, due to the fixed subsampling operations used for its inter-frequency components. The PSNR and MS-SSIM of the proposed architecture are respectively 3.4 dB and 3.08 dB higher than OrgOct at the same bit rate. Note that the ratio $\alpha = 0.5$ was used for the ActOut, CoreOct, and OrgOct models.
6 Conclusion
In this paper, we proposed a new multi-frequency image compression scheme with octave convolutions, in which the latents are factorized into high and low frequencies, and the low frequency is stored at a lower resolution to reduce its spatial redundancy. To retain the spatial structure of the input, novel generalized octave convolution and transposed-convolution architectures denoted by GoConv and GoTConv were also introduced. Our experiments showed that the proposed method significantly improves RD performance and achieves a new state of the art in learned image compression.

Further improvements could be achieved by a multi-resolution factorization of the latents into a sequence of high to low frequencies, as in the wavelet transform. Another potential direction is to employ the proposed GoConv/GoTConv in other CNN-based architectures, particularly autoencoder-based schemes such as image denoising and semantic segmentation (please see the Appendix).
Acknowledgements
This work is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada under grant RGPIN-2015-06522.
References
[1] (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. arXiv preprint arXiv:1704.00648.
[2] (2019) Learned variable-rate image compression with residual divisive normalization. arXiv preprint arXiv:1912.05688.
[3] (2019) DSSLIC: deep semantic segmentation-based layered image compression. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2042–2046.
[4] (1992) Image coding using wavelet transform. IEEE Transactions on Image Processing 1 (2), pp. 205–220.
[5] (2016) End-to-end optimization of nonlinear transform codes for perceptual quality. In Picture Coding Symposium, pp. 1–5.
[6] (2018) Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436.
[7] (2017) BPG image format. http://bellard.org/bpg
[8] (2018) Efficient variable rate image compression with multiscale decomposition network. IEEE Transactions on Circuits and Systems for Video Technology.
[9] (2019) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3435–3444.
[10] (2000) The JPEG2000 still image coding system: an overview. IEEE Transactions on Consumer Electronics 46 (4), pp. 1103–1127.
[11] (2019) VVC official test model VTM. https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM
[12] (2016) WebP. https://developers.google.com/speed/webp
[13] (2017) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. arXiv preprint arXiv:1703.10114.
[14] (2018) Context-adaptive entropy model for end-to-end optimized image compression. arXiv preprint arXiv:1809.10452.
[15] (2019) A hybrid architecture of jointly learning image compression and quality enhancement with improved entropy minimization. arXiv preprint arXiv:1912.12817.
[16] (2019) Learning content-weighted deep image compression. arXiv preprint arXiv:1904.00664.
[17] (2018) Multi-level wavelet-CNN for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 773–782.
[18] (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780.
[19] (2017) Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 2922–2930.
[20] (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
[21] (2017) Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.
[22] (2019) Deep learning on image denoising: an overview. arXiv preprint arXiv:1912.13171.
[23] (2015) Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085.
[24] (2017) Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5306–5314.
[25] (2016) Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798.
[26] (2003) Multiscale structural similarity for image quality assessment. In Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Vol. 2, pp. 1398–1402.
[27] (2012) Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, pp. 341–349.
[28] (2019) Learned scalable image compression with bidirectional context disentanglement network. In IEEE International Conference on Multimedia and Expo, pp. 1438–1443.
[29] (2017) Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 4.
[30] (2019) Multi-scale and context-adaptive entropy model for image compression. arXiv preprint arXiv:1910.07844.
Appendix A Appendix
A.1 Multi-Frequency Image Denoising
As explained in Section 3, since the proposed GoConv and GoTConv are designed as generic, plug-and-play units, they can be used in any autoencoder-based architecture. In this experiment, we build a simple convolutional autoencoder and use it for the image denoising problem [27, 17, 22], i.e., recovering images corrupted by additive white Gaussian noise, which is a common result of many acquisition channels. The architecture of the autoencoder used in this experiment is summarized in Table 2, where the encoder and decoder are respectively composed of a sequence of vanilla convolutions and transposed-convolutions, each followed by batch normalization and ReLU.
We performed our experiments on the MNIST and CIFAR-10 datasets. After 100 epochs of training, average PSNRs of 23.19 dB and 23.29 dB were achieved on the MNIST and CIFAR-10 test sets, respectively. In order to analyze the performance of GoConv and GoTConv in this experiment, we replaced all vanilla convolution layers with GoConv and all transposed-convolutions with GoTConv. The other properties of the encoder and decoder networks (e.g., number of layers, filters, and strides) were the same as the baseline in Table 2. We set $\alpha = 0.5$ and trained the model for 100 epochs. For MNIST, the multi-frequency autoencoder achieved an average PSNR of 23.20 dB (almost the same as the baseline with vanilla convolutions). For CIFAR-10, however, we achieved an average PSNR of 23.54 dB, which is 0.25 dB higher than the baseline, due to the effective communication between high and low frequencies. In addition, the proposed multi-frequency autoencoder is 5.5% smaller than the baseline model. The comparison results are presented in Table 3.
In Figure 7, eight visual examples from the CIFAR-10 test set are given. Compared to the baseline model with vanilla convolutions and transposed-convolutions, the multi-frequency model with the proposed GoConv and GoTConv produces higher visual quality in the denoised images (e.g., the red car in the second column from the right).
Table 2: Architecture of the baseline denoising autoencoder.

Encoder              Decoder
Conv (3×3, 32, s1)   TConv (3×3, 128, s2)
Conv (3×3, 32, s1)   TConv (3×3, 128, s1)
Conv (3×3, 64, s1)   TConv (3×3, 64, s1)
Conv (3×3, 64, s2)   TConv (3×3, 64, s2)
Conv (3×3, 128, s1)  TConv (3×3, 32, s1)
Conv (3×3, 128, s1)  TConv (3×3, 32, s1)
Conv (3×3, 256, s2)  TConv (3×3, 3, s1)
Table 3: Denoising results of the baseline and the proposed multi-frequency autoencoder.

           Baseline          Multi-Frequency
           MNIST  CIFAR-10   MNIST  CIFAR-10
PSNR (dB)  23.19  23.29      23.20  23.54
Size (MB)  4.49   4.50       4.44   4.45