I Introduction
Image compression is a fundamental and well-studied problem in image processing and computer vision [8, 20, 16]. The goal is to design binary representations (i.e. bitstreams) with minimal entropy [15] that minimize the number of bits required to represent an image (i.e. rate) at a given level of fidelity (i.e. distortion) [6]. In many communication scenarios the network or storage device may impose a constraint on the maximum bitrate, which requires the image encoder to adapt to a given bitrate budget. In other scenarios that constraint may even change dynamically over time (e.g. video). In all these cases, a rate control mechanism is required, and it is available in most traditional image and video compression codecs. In general, reducing the rate causes an increase in the distortion (i.e. the rate-distortion tradeoff). Rate control is typically based on scaling the latent representation prior to quantization to obtain finer or coarser quantization, and then inverting the scaling at the decoder side (see Fig. 1a).
Recent studies show that deep image compression (DIC) achieves comparable or even better results than classical image compression techniques [18, 2, 5, 19, 17, 7, 10, 11, 12, 9]. In this paradigm, the parameters of the encoder and decoder are learned from image data by jointly minimizing rate and distortion at a particular rate-distortion (RD) tradeoff (instead of being engineered by experts). However, variable bitrate requires an independent model for every RD tradeoff. This is an obvious limitation, since each model must be stored separately, resulting in a large memory requirement.
Addressing this limitation, Theis et al. [17] use a single autoencoder whose bottleneck representation is scaled before quantization depending on the target rate (see Fig. 1b). However, this approach only considers the importance of different channels of the bottleneck representation of a learned autoencoder under an RD tradeoff constraint. In addition, the autoencoder is optimized for a single specific RD tradeoff (typically high rate). These two aspects lead to a drop in performance for low rates and a narrow effective range of bitrates.
Addressing the limitations of multiple independent models and bottleneck scaling, we formulate the problem of variable rate-distortion optimization for DIC, and propose the modulated autoencoder (MAE) framework, where the representations of a shared autoencoder at different layers are adapted to a specific rate-distortion tradeoff via a modulating network. The modulating network is conditioned on the target RD tradeoff, and synchronized with the actual tradeoff optimized to learn the parameters of the autoencoder and the modulating network. MAEs can achieve almost the same operational RD points as independent models with far fewer overall parameters (i.e. just the shared autoencoder plus the small overhead of the modulating network). Multi-layer modulation does not suffer from the main limitations of bottleneck scaling, namely, the drop in performance for low rates and the shrinkage of the effective range of rates.
II Background
Almost all lossy image and video compression approaches follow the transform coding paradigm [4]. The basic structure is a transform $z = f(x)$ that takes an input image $x$ and obtains a transformed representation $z$, followed by a quantizer $q = Q(z)$, where $q$ is a discrete-valued vector. The decoder reverses the quantization (i.e. dequantizer $\hat{z} = Q^{-1}(q)$) and the transform (i.e. inverse transform $\hat{x} = g(\hat{z})$), reconstructing the output image $\hat{x}$. Before transmission (or storage), the discrete-valued vector $q$ is binarized and serialized into a bitstream $b$. Entropy coding [21] is used to exploit the statistical redundancy in that bitstream and reduce its length.

In deep image compression [17, 18, 2], the handcrafted analysis and synthesis transforms are replaced by the encoder $z = f(x; \theta)$ and decoder $\hat{x} = g(\hat{z}; \phi)$ of a convolutional autoencoder, parametrized by $\theta$ and $\phi$. The fundamental difference is that the transforms are not designed but learned from training data.
The model is typically trained end-to-end by solving the following optimization problem

$$\arg\min_{\theta,\phi} \; R(b) + \lambda D(x, \hat{x}) \tag{1}$$

where $R(b)$ measures the rate of the bitstream $b$, $D(x, \hat{x})$ represents a distortion metric between $x$ and $\hat{x}$, and the Lagrange multiplier $\lambda$ controls the tradeoff between rate and distortion, i.e. the RD tradeoff. Note that $\lambda$ is a fixed constant in this case. The problem is solved using gradient descent and backpropagation [14].

To make the model differentiable, which is required to apply backpropagation, during training the quantizer is replaced by a differentiable proxy function [17, 18, 2]. Similarly, entropy coding is invertible, but it is necessary to compute the length of the bitstream $b$. This is usually approximated by the entropy of the distribution of the quantized vector, $R(b) \approx H[P_q]$, which is a lower bound of the actual bitstream length.
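In this paper, we use scalar quantization by (element-wise) rounding to the nearest integer, i.e. $q = \lfloor z \rceil$, which is replaced by additive uniform noise as a proxy during training [2], i.e. $\tilde{z} = z + \Delta z$, with $\Delta z \sim \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$. There is no dequantization in the decoder, and the reconstructed representation is simply $\hat{z} = q$. To estimate the entropy we use the entropy model described in [2] to approximate $P_q$ by $p_{\tilde{z}}(\tilde{z})$. Finally, we use mean squared error (MSE) as the distortion metric. With these particular choices, (1) becomes

$$\arg\min_{\theta,\phi} \; R(\tilde{z}; \theta) + \lambda D(x, \hat{x}; \theta, \phi) \tag{2}$$

with

$$R(\tilde{z}; \theta) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}} \left[ -\log_2 p_{\tilde{z}}(\tilde{z}) \right] \tag{3}$$

$$D(x, \hat{x}; \theta, \phi) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}} \left[ \lVert x - \hat{x} \rVert^2 \right] \tag{4}$$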
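As a rough illustration of the training objective described above (rounding replaced by additive uniform noise, rate estimated from a density model, MSE distortion), the following NumPy sketch may help. It is not the authors' implementation: the logistic density in `rate_bits` is an assumed stand-in for the learned entropy model of [2], and all function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def quantize(z, training, rng):
    """Round at test time; add U(-1/2, 1/2) noise as a proxy during training."""
    if training:
        return z + rng.uniform(-0.5, 0.5, size=z.shape)
    return np.round(z)

def rate_bits(z_tilde, mu=0.0, scale=1.0):
    """Estimated bits: -log2 of the mass of a unit-width bin around each value.

    Assumption: a fixed logistic density replaces the learned entropy model.
    """
    upper = sigmoid((z_tilde + 0.5 - mu) / scale)
    lower = sigmoid((z_tilde - 0.5 - mu) / scale)
    return float(-np.log2(np.maximum(upper - lower, 1e-12)).sum())

def rd_loss(x, x_hat, z_tilde, lam):
    """Rate plus lambda times MSE distortion for one sample."""
    distortion = np.mean((x - x_hat) ** 2)
    return rate_bits(z_tilde) + lam * distortion
```

A larger `lam` penalizes distortion more heavily, so a model trained with it would be expected to spend more bits, which matches the role of the Lagrange multiplier in the text.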
III Multi-rate deep image compression with modulated autoencoders
III-A Problem definition
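We are interested in deep image compression models that can operate satisfactorily on different RD tradeoffs, and adapt to the specific RD tradeoff when required. Note that Eq. (2) optimizes rate and distortion for a fixed tradeoff $\lambda$. We extend that formulation to multiple RD tradeoffs (i.e. $\lambda \in \Lambda = \{\lambda_1, \ldots, \lambda_M\}$) as the multi-rate-distortion problem

$$\arg\min_{\theta,\phi} \sum_{\lambda \in \Lambda} \left[ R(\tilde{z}; \theta, \lambda) + \lambda D(x, \hat{x}; \theta, \phi, \lambda) \right] \tag{5}$$

with

$$R(\tilde{z}; \theta, \lambda) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}} \left[ -\log_2 p_{\tilde{z}}(\tilde{z}) \right] \tag{6}$$

$$D(x, \hat{x}; \theta, \phi, \lambda) = \mathbb{E}_{x \sim p_x,\, \Delta z \sim \mathcal{U}} \left[ \lVert x - \hat{x} \rVert^2 \right] \tag{7}$$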
Here we omit the explicit dependency on $\lambda$ of the features and, implicitly, of the encoder and decoder, i.e. $f(x; \theta) = f(x; \theta, \lambda)$ and $g(\hat{z}; \phi) = g(\hat{z}; \phi, \lambda)$. In the following we may also omit the explicit dependency for conciseness. This formulation can be easily extended to a continuous range of tradeoffs. Note also that these optimization problems assume that all RD operational points are equally important. An importance function could be integrated to give more weight to certain RD operational points if the application requires it. We assume uniform importance (continuous or discrete) for simplicity.
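The multi-rate objective amounts to summing the single-tradeoff loss over the set of tradeoffs, optionally weighted by an importance function. A minimal sketch, assuming the per-tradeoff rates and distortions have already been computed (the function name and the `weights` argument are illustrative, not part of the paper):

```python
import numpy as np

def multi_rate_loss(rates, distortions, lambdas, weights=None):
    """Sum of R + lambda * D over a discrete set of RD tradeoffs.

    weights: optional per-tradeoff importance; None means uniform importance.
    """
    rates = np.asarray(rates, dtype=float)
    distortions = np.asarray(distortions, dtype=float)
    lambdas = np.asarray(lambdas, dtype=float)
    if weights is None:
        weights = np.ones_like(lambdas)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * (rates + lambdas * distortions)))
```

With uniform weights this reduces to the plain sum over tradeoffs that the text assumes for simplicity.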
III-B Bottleneck scaling
A possible way to make the encoder and decoder aware of $\lambda$ is by simply scaling the latent representation in the bottleneck before quantization (implicitly scaling the quantization bin), and then inverting the scaling in the decoder. In that case, $q = Q(z \odot s(\lambda))$ and $\hat{z} = q \odot (1 / s(\lambda))$, where $s(\lambda)$ is the scaling factor specific for the tradeoff $\lambda$. Conventional codecs use predefined tables for $s(\lambda)$ (the descaling is often implicitly subsumed in the dequantization, e.g. JPEG). Instead, [17] learns them while keeping the encoder and decoder fixed, optimized for a particular RD tradeoff (see Fig. 1b).
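Bottleneck scaling as described here can be sketched in a few lines. This is an illustrative sketch only: it assumes a bottleneck feature of shape (channels, height, width) and one positive scaling factor per channel, and the function names are hypothetical.

```python
import numpy as np

def encode_scaled(z, s):
    """Scale each bottleneck channel by s, then round to the nearest integer."""
    return np.round(z * s[:, None, None])

def decode_scaled(q, s):
    """Invert the channel-wise scaling at the decoder side."""
    return q / s[:, None, None]

# A larger scaling factor means finer quantization of the same feature.
z = np.array([[[0.26, 1.4]]])                                        # 1 channel, 1 x 2 map
coarse = decode_scaled(encode_scaled(z, np.array([1.0])), np.array([1.0]))
fine = decode_scaled(encode_scaled(z, np.array([4.0])), np.array([4.0]))
```

Since rounding errs by at most one half, the reconstruction error per element is bounded by 0.5 divided by the channel's scaling factor, which is why larger factors correspond to higher rates and lower distortion.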
We observe several limitations in this approach [17]: (1) scaling only the bottleneck feature is not flexible enough to adapt to a large range of RD tradeoffs, (2) using the inverse of the scaling factor in the decoder may also limit the flexibility of the adaptation mechanism, (3) optimizing the parameters of the autoencoder only for a single RD tradeoff leads to suboptimal parameters for other distant tradeoffs, and (4) training the autoencoder and the scaling factors separately may also be limiting. In order to overcome these limitations we propose the modulated autoencoder (MAE) framework.
III-C Modulated autoencoders
Variable rate is achieved in MAEs by modulating the internal representations in the encoder and the decoder (see Fig. 2). Given a set of internal representations in the encoder $Z = \{z^{(1)}, \ldots, z^{(K)}\}$ and in the decoder $U = \{u^{(1)}, \ldots, u^{(L)}\}$, they are replaced by the corresponding modulated and demodulated versions $z^{(k)} \odot m^{(k)}(\lambda)$ and $u^{(l)} \odot d^{(l)}(\lambda)$, where $m(\lambda)$ and $d(\lambda)$ are the modulating and demodulating functions.

Our MAE architecture extends the deep image compression architecture proposed in [2], which combines convolutional layers and GDN/IGDN layers [1]. In our experiments we choose to modulate the outputs of the convolutional layers in the encoder and decoder, which form the sets $Z$ and $U$, respectively.

The modulating function $m(\lambda)$ for the encoder is learned by a modulating network as $m(\lambda) = m(\lambda; \vartheta)$ and the demodulating function by the demodulating network as $d(\lambda) = d(\lambda; \varphi)$. As a result, the encoder has learnable parameters $\{\theta, \vartheta\}$ and the decoder $\{\phi, \varphi\}$.

Finally, the optimization problem for the MAE is

$$\arg\min_{\theta,\vartheta,\phi,\varphi} \sum_{\lambda \in \Lambda} \left[ R(\tilde{z}; \theta, \vartheta, \lambda) + \lambda D(x, \hat{x}; \theta, \vartheta, \phi, \varphi, \lambda) \right] \tag{8}$$

which extends equation (5) with the modulating/demodulating networks and their corresponding parameters. All parameters are learned jointly using gradient descent and backpropagation.
This mechanism is more flexible than bottleneck scaling since it allows multi-level modulation, decouples encoder and decoder scaling, and allows effective joint training of both the autoencoder and the modulating network, therefore optimizing jointly for all RD tradeoffs of interest.
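The modulation mechanism can be illustrated with a small NumPy sketch. It assumes, following the description in Section III-D, a modulating network with two fully connected layers, a ReLU after the first and an exponential after the second, and it assumes that modulation multiplies each channel of a convolutional output by one entry of the resulting vector. Layer sizes, weights and names below are hypothetical placeholders, not the values used in the paper.

```python
import numpy as np

def modulating_network(lam, W1, b1, W2, b2):
    """Map a scalar tradeoff to a positive per-channel modulation vector."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))  # shape (1,)
    hidden = np.maximum(0.0, W1 @ lam + b1)            # first layer + ReLU
    return np.exp(W2 @ hidden + b2)                    # second layer + exp (> 0)

def modulate(features, m):
    """Channel-wise modulation of a (C, H, W) feature map by a (C,) vector."""
    return features * m[:, None, None]

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
hidden_units, channels = 8, 4
W1, b1 = rng.normal(size=(hidden_units, 1)), np.zeros(hidden_units)
W2, b2 = 0.1 * rng.normal(size=(channels, hidden_units)), np.zeros(channels)
features = rng.normal(size=(channels, 5, 5))

m = modulating_network(0.5, W1, b1, W2, b2)
modulated = modulate(features, m)
```

In a full model one such network would be attached to each modulated layer of the encoder, with separate demodulating networks in the decoder, and all weights would be trained jointly with the autoencoder.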
III-D Modulating and demodulating networks
The modulating network is a perceptron with two fully connected layers and ReLU [13] and exponential nonlinearities (see Fig. 2). The exponential nonlinearity guarantees positive outputs, which we found beneficial in training. The output is $m(\lambda)$ directly. A small first hidden layer allows learning a meaningful nonlinear function between tradeoffs and modulation vectors, which is more flexible than simple scaling factors and allows more expressive interpolation between tradeoffs. In practice, we use normalized tradeoffs. The demodulating network follows a similar architecture.

IV Experiments
IV-A Experimental setup
We evaluated MAE on the CLIC dataset (https://www.compression.cc), with 1,633 training images (from both the “professional” and “mobile” sets) and 44 test images (from the “professional” set). Our implementation is based on the autoencoder architecture of [2], which is augmented with modulation mechanisms and modulating networks (two fully connected layers) for all the convolutional layers. We use MSE as the distortion metric. The model is trained on image crops using Adam with a minibatch size of 8 and initial learning rates of 0.0004 and 0.002 for MAE and the entropy model, respectively. After 400k iterations, the learning rates are halved for another 150k iterations. We also tested MAEs with scale hyperpriors, as described in [3]. In our experiments, we consider seven and four RD tradeoffs for the models without and with hyperprior, respectively. We consider two baselines:

Independent models. Each RD operational point is obtained by training a new model with a different RD tradeoff in (2), requiring each model to be stored separately. This provides the optimal RD performance, but also requires more memory to store all the models for different RD tradeoffs.
Bottleneck scaling [17]. The autoencoder is optimized for the highest RD tradeoff in the range. Then the autoencoder is frozen and the scaling parameters are learned for the other tradeoffs.
IV-B Results
Fig. 3 shows the RD operational curves for the proposed MAE and the two baselines, for both PSNR and MS-SSIM. We can see that the best RD performance is obtained by using independent models. Hyperprior models also have superior RD performance. Bottleneck scaling is satisfactory for high bitrates, closer to the optimal RD operational point of the autoencoder, but degrades for lower bitrates. Interestingly, bottleneck scaling cannot achieve bitrates as low as independent models since the autoencoder is optimized for high bitrate. This can be observed in the RD curve as a narrower range of bitrates. The proposed MAEs can achieve an RD performance very close to the corresponding independent models, demonstrating that multi-layer modulation with joint training is a more powerful mechanism to achieve effective variable rate compression.
The main advantage of bottleneck scaling and MAEs is that the autoencoder is shared, which results in far fewer parameters than independent models, whose total parameter count grows with the number of RD tradeoffs (see Table I). Both methods have a small overhead due to the modulating networks or the scaling factors (which is smaller in bottleneck scaling).
In order to illustrate the differences between the bottleneck scaling and MAE rate adaptation mechanisms, we consider the image in Fig. 4b and the reconstructions for high and low bitrates (see Fig. 4a). We show two of the 192 channels in the bottleneck feature before quantization (see Fig. 4a), and observe that the maps for the two bitrates are similar but the range is higher for the high-bitrate tradeoff, so the quantization will be finer. This is also what we would expect in bottleneck scaling. However, a closer look highlights the difference between both methods. We also compute the element-wise ratio between the bottleneck features at the two tradeoffs, and show the ratio image for the same channels of the example (see Fig. 4c). We can see that the MAE learns to perform a more complex adaptation of the features beyond simple channel-wise bottleneck scaling, since different parts of the image have different ratios (the ratio image would be constant in bottleneck scaling). This allows MAE to allocate bits more freely when optimizing for different RD tradeoffs, especially for low bitrates.
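The ratio-image analysis can be reproduced in spirit with a short sketch: under pure channel-wise scaling the element-wise ratio between two bottleneck features is constant within each channel, whereas a spatially varying adaptation is not. The function names and the synthetic features below are illustrative assumptions, not the paper's data.

```python
import numpy as np

def ratio_image(z_high, z_low, eps=1e-8):
    """Element-wise ratio of two (C, H, W) features; NaN where z_low is ~0."""
    out = np.full_like(z_high, np.nan, dtype=float)
    np.divide(z_high, z_low, out=out, where=np.abs(z_low) > eps)
    return out

def is_channelwise_scaling(z_high, z_low, tol=1e-6):
    """True if the ratio is (numerically) constant within every channel."""
    r = ratio_image(z_high, z_low)
    spread = np.nanmax(r, axis=(1, 2)) - np.nanmin(r, axis=(1, 2))
    return bool(np.all(spread < tol))

rng = np.random.default_rng(0)
z_low = rng.uniform(1.0, 2.0, size=(2, 4, 4))          # synthetic features
scaled = z_low * np.array([2.0, 3.0])[:, None, None]   # bottleneck-scaling-like
adapted = z_low * rng.uniform(1.0, 3.0, size=z_low.shape)  # spatially varying
```

Here `is_channelwise_scaling(scaled, z_low)` holds while the spatially varying case does not, mirroring the contrast between a constant ratio image and the non-constant one reported for MAE.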
V Conclusion
In this work, we introduce the modulated autoencoder, a novel variable rate deep image compression framework, based on multilayer feature modulation and joint learning of autoencoder parameters. MAEs can realize variable rate image compression with a single model, while keeping the performance close to the upper bound of independent models that require significantly more memory. We show that MAE outperforms bottleneck scaling [17], especially for low bitrates.
References
[1] (2015) Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281.
[2] (2016) End-to-end optimized image compression. arXiv preprint arXiv:1611.01704.
[3] (2018) Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436.
[4] (2001) Theoretical foundations of transform coding. IEEE Signal Processing Magazine 18 (5), pp. 9–21.
[5] (2016) Towards conceptual compression. In Advances in Neural Information Processing Systems, pp. 3549–3557.
[6] (1989) Fundamentals of digital image processing. Englewood Cliffs, NJ: Prentice Hall.
[7] (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393.
[8] (1992) Image compression using the 2D wavelet transform. IEEE Transactions on Image Processing 1 (2), pp. 244–250.
[9] (2018) Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3214–3223.
[10] (2018) CNN-based DCT-like transform for image compression. In International Conference on Multimedia Modeling, pp. 61–72.
[11] (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402.
[12] (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems, pp. 10771–10780.
[13] (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
[14] (1985) Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.
[15] (1948) A mathematical theory of communication. Bell System Technical Journal 27 (3), pp. 379–423.
[16] (2012) JPEG2000 image compression fundamentals, standards and practice: image compression fundamentals, standards and practice. Vol. 642, Springer Science & Business Media.
[17] (2017) Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.
[18] (2015) Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085.
[19] (2017) Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5306–5314.
[20] (1995) Wavelet filter evaluation for image compression. IEEE Transactions on Image Processing 4 (8), pp. 1053–1060.
[21] (1972) Transform picture coding. Proceedings of the IEEE 60 (7), pp. 809–820.