1 Introduction
Deep latent variable models have started to outperform conventional baselines on lossy compression of images [4, 7, 25, 14, 15, 24, 23, 33, 36], video [19, 8, 21, 31, 37, 20, 27, 6, 12], and audio [38, 36]. Nearly all of these methods use a loss function of the form L = D + βR, where D measures distortion, R measures bitrate, and β is a fixed tradeoff parameter. We refer to this approach as β-VAE [13], because this loss can be motivated from a variational perspective [12].

Despite its popularity, β-VAE has several drawbacks. Firstly, setting β to target a specific point in the rate/distortion (R/D) plane can be tricky. One can show that a model trained with a given β should end up at the point on the R/D curve where the slope equals −β [1]. However, because the shape of the R/D curve depends on the model and hyperparameters, and because the R/D curve can be very steep or flat in the low and high bitrate regimes respectively, choosing β can be difficult.

Secondly, in order to compare models it is not sufficient to train one instance of each model, because the converged models will likely differ in both rate and distortion, which yields inconclusive results unless one model dominates the other on both metrics. Instead, to compare models we need to train both at several β values to generate R/D curves that can be compared, which is computationally costly and slows down the research iteration cycle.
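To build intuition for the slope argument, here is a toy illustration (not from the paper): on a hypothetical convex R/D curve D(R) = c/R, minimizing the β-VAE objective D + βR lands exactly at the point where the curve's slope equals −β. The constant c and the grid are our choices.

```python
import numpy as np

# Hypothetical convex R/D curve: D(R) = c / R.
c = 4.0
R = np.linspace(0.1, 10.0, 100000)

def best_rate(beta):
    """Minimize the beta-VAE objective D(R) + beta * R over the curve."""
    obj = c / R + beta * R
    return R[np.argmin(obj)]

beta = 0.25
r_star = best_rate(beta)   # analytic optimum: sqrt(c / beta) = 4.0
slope = -c / r_star**2     # dD/dR at the optimum
# The minimizer sits where the curve's slope equals -beta:
assert abs(slope + beta) < 1e-2
```

Because the slope varies along the curve, the same β lands at very different points for differently shaped curves, which is the difficulty the text describes.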
A more natural way to target different regions of the R/D plane is to set a distortion constraint and find our model parameters through constrained optimization:

(1)  min_θ R(θ)  subject to  D(θ) ≤ D_t

where θ refers to the joint parameters of the encoder, decoder and prior, and D_t is a distortion target.
We can control the rate–distortion tradeoff by setting the distortion target value D_t. Setting this value is more intuitive than setting β, as it is independent of the slope of the R/D curve, and hence independent of the model and hyperparameters.
As a result, we can easily compare two different models trained with the same distortion constraint; as we have fixed the D axis, we only have to look at the rate performance of each model.
Note that one could also minimize the distortion subject to a rate constraint. This is less straightforward as putting too much emphasis on the rate loss at the beginning of training can lead to posterior collapse [3, 11, 40, 28, 32].
There is a large literature on constrained optimization, but most of it does not consider stochastic optimization and is limited to convex loss functions. In this paper we evaluate, in addition to β-VAE, two constrained optimization methods that are compatible with stochastic gradient descent training of deep networks: a simple method based on the hinge loss (free bits [17, 5, 1], but applied to distortion rather than rate), and the Lagrangian distortion-constrained optimization method of [29] (CO).

We evaluate these methods on a modern image compression system applied to a realistic compression benchmark. We report suitable hyperparameters and practical considerations that are relevant in this domain. We show that CO outperforms the hinge method and reaches a performance similar to β-VAE. At the same time, CO is easier to work with and allows for pointwise model selection.
2 Related Work
2.1 Constrained Optimization
Several works have proposed algorithms to train deep networks under equality or inequality constraints [22, 10, 9, 29]. We deploy the algorithm of [29], as its VAE context is most similar to our setup.
The focus of [29] is on generative modelling rather than data compression, and there are a number of reasons why the models trained in [29] are not directly applicable to data compression. Firstly, their models contain a stochastic encoder, which is not suitable for lossy compression, where bits-back coding is inapplicable. Secondly, [29] do not report R/D performance but instead report log-likelihood. Furthermore, their latent space is continuous, while most compression papers use a discrete latent space that allows for entropy coding of the latents under the prior. Lastly, they use a fixed Gaussian prior, whereas in lossy compression powerful learnable priors are used to decrease the bitrate as much as possible. In this paper we focus on the implementation and evaluation of constrained optimization for practical lossy image compression.
2.2 Hinge Loss
Another approach that was proposed for constrained optimization (in the context of avoiding posterior collapse) is free bits, where the rate loss is hinged [17, 5, 1]. Like constrained optimization, this loss allows us to set a target value, and as such it has been used in lossy compression [23]. However, we find that this method is inferior to constrained optimization in terms of R/D performance and has difficulty converging to the target value.
2.3 Variable Bitrate Models
A different approach to dealing with the rate–distortion tradeoff is to train a single model that can compress at different bitrates [34, 30, 7, 39]. However, some of these works do not match the performance achieved with specialized models [39], or require disjoint training of the autoencoder and prior [34]. Other methods could benefit from constrained optimization (e.g. [39] still uses multipliers that could be replaced by a distortion target), an exercise that is left for future research.

3 Method
3.1 Constrained Optimization
The Lagrangian of the primal problem in Equation 1 is:

(2)  L(θ, λ) = R(θ) + λ (D(θ) − D_t)

For a convex problem, we could find the solution by maximizing the dual function g(λ) = min_θ L(θ, λ) over λ ≥ 0.
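The duality statement can be checked numerically on a small convex toy problem (the choices of R, D, and D_t below are ours, purely for illustration):

```python
import numpy as np

# Convex toy problem: minimize R(t) = (t - 3)^2 subject to D(t) = t^2 <= 1.
theta = np.linspace(-5.0, 5.0, 20001)
R = (theta - 3.0) ** 2
D = theta ** 2
D_t = 1.0

# Primal optimum by direct search over the feasible set
primal = R[D <= D_t].min()

# Dual function g(lam) = min_theta L(theta, lam); maximize over lam >= 0
lams = np.linspace(0.0, 10.0, 201)
g = np.array([(R + lam * (D - D_t)).min() for lam in lams])
dual = g.max()

# Strong duality holds for this convex problem: both values equal 4
assert abs(primal - dual) < 1e-3
```

For nonconvex deep networks no such guarantee exists, which is why the iterative scheme described next is used instead.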
For nonconvex deep learning models, we deploy the algorithm proposed by [29] and iteratively update θ and λ using stochastic gradient descent and ascent, respectively.

Note that the β-VAE loss is the Lagrangian of a rate-constrained optimization problem. However, the multiplier β is either fixed or updated according to a heuristic schedule [3, 11, 40, 28, 32], and thus no constrained optimization is performed.

Because we found that the optimal CO hyperparameters differed depending on the target value, we normalize our constraint function by the target value. Our loss function thus becomes:

(3)  L(θ, λ) = R(θ) + λ (D(θ)/D_t − 1)
3.1.1 Weight and Multiplier Updates
For each minibatch, we update θ using the Adam optimizer and λ using SGD with momentum, to respectively minimize and maximize the batchwise Lagrangian (Eq. 3).
Like [29], we reparametrize λ in order to enforce its positivity (to satisfy the KKT [18, 16] conditions for inequality constraints). We also follow them in updating the multiplier using the constraint value directly, as this resulted in smoother updates of our multipliers than using the actual gradient with respect to the reparametrized variable.
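A minimal sketch of this update scheme on a scalar toy problem follows. Our simplifications: plain SGD without momentum, a softplus reparametrization for positivity, and hypothetical choices of R, D, and D_t; this is an illustration of the alternating descent/ascent dynamics, not the paper's implementation.

```python
import math

def softplus(x):
    """Numerically stable softplus, keeps the multiplier positive."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

# Toy problem (hypothetical):
#   rate       R(theta) = (theta - 3)^2   (minimized at theta = 3)
#   distortion D(theta) = theta^2
#   constraint D(theta) <= D_t with D_t = 1  ->  constrained optimum theta = 1
D_t = 1.0
theta, nu = 3.0, 0.0          # nu reparametrizes the multiplier: lam = softplus(nu) > 0
lr_theta, lr_nu = 1e-2, 1e-2

for _ in range(20000):
    lam = softplus(nu)
    # Normalized constraint C = D/D_t - 1 (Eq. 3)
    C = theta**2 / D_t - 1.0
    # Descent on theta: dL/dtheta = dR/dtheta + lam * dC/dtheta
    grad_theta = 2.0 * (theta - 3.0) + lam * 2.0 * theta / D_t
    theta -= lr_theta * grad_theta
    # Ascent on nu, driven by the constraint value itself
    nu += lr_nu * C

lam = softplus(nu)
# The constraint ends up (approximately) active: D(theta) ~= D_t
assert abs(theta**2 - D_t) < 1e-2
# KKT stationarity: dR/dtheta + lam * dD/dtheta ~= 0 at the solution
assert abs(2.0 * (theta - 3.0) + lam * 2.0 * theta) < 1e-2
```

The multiplier settles at the value that makes the constrained point stationary, mirroring the stable final multiplier values reported in the experiments.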
4 Experiments
We conduct a series of experiments to show that constrained optimization is more suitable for training lossy compression models than β-VAE or distortion hinge baselines.
4.1 General Setup
We use the autoencoder architecture of [23], but without the mask. Our prior is the gated PixelCNN [35], as used in [12]. Like [12], we jointly train our code model and autoencoder, without any detaching of the gradients. We use scalar quantization with a learned codebook and a straight-through estimator (hardmax during the forward pass and softmax gradient during the backward pass) [2, 23, 12].

We train our model on random 160×160 crops of the ImageNet train set, and evaluate on 160×160 center crops of the ImageNet validation set. Like [23], we resize the smallest side of all images to 256 pixels to reduce compression artifacts. We train using a rate loss expressed in bits per pixel (bpp) and a distortion loss expressed as average MSE computed on unnormalized images on a 0–255 scale.
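The forward behavior of such a straight-through quantizer can be sketched in NumPy as follows (a minimal illustration, not the paper's implementation; the function name, codebook values, and temperature are our choices). The forward pass snaps each latent to its nearest codebook entry, while in training the gradient would be routed through the softmax weights instead of the non-differentiable argmax.

```python
import numpy as np

def quantize(z, codebook, temperature=1.0):
    """Scalar quantization sketch: hardmax forward, softmax for the backward pass."""
    # Squared distance from each latent value to each codebook center
    d = (z[..., None] - codebook[None, :]) ** 2          # shape (N, K)
    soft = np.exp(-d / temperature)
    soft /= soft.sum(-1, keepdims=True)                  # softmax weights (backward)
    hard = codebook[np.argmin(d, -1)]                    # nearest center (forward)
    return hard, soft

codebook = np.array([-1.0, 0.0, 1.0])
z = np.array([0.9, -0.2, 0.1])
hard, soft = quantize(z, codebook)
assert np.allclose(hard, [1.0, 0.0, 0.0])
```

In an autodiff framework the two branches would be combined with a stop-gradient so that the hard values are used forward while gradients flow through the soft weights.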
We update our parameters using Adam, with separate learning rates for the autoencoder and the prior. We decay both learning rates every 3 epochs (120087 iterations). For the multiplier updates, we use SGD with momentum. We use a batch size of 32.

4.2 CO vs. Hinge
For this experiment we choose exponentially spaced constraint values (60, 65, 70, 80, 100, 125, 150, 200, 300 MSE) and look at how well the methods converge to the set target. We compare our CO training with the simpler hinge baselines of the form:
(4)  L(θ) = R(θ) + β max(0, D(θ)/D_t − 1)
Unlike CO, β is fixed during training, but we train models with different β values (0.01, 0.1, 1, 10, 100). In line with CO, we use the normalized constraint function, as we verified that it worked better than the unnormalized one.
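One plausible reading of this hinged objective (Eq. 4) with the normalized constraint can be sketched as (the function name is ours):

```python
def hinge_loss(rate, distortion, target, beta):
    """Distortion-hinged objective: rate plus a penalty that is active
    only while the normalized distortion constraint is violated."""
    return rate + beta * max(0.0, distortion / target - 1.0)

# Constraint satisfied (D <= D_t): the penalty vanishes, only rate remains
assert hinge_loss(0.5, 60.0, 65.0, beta=10.0) == 0.5
# Constraint violated: the fixed beta scales the normalized violation
assert abs(hinge_loss(0.5, 130.0, 65.0, beta=10.0) - 10.5) < 1e-9
```

The key contrast with CO is visible here: β never adapts, so the strength of the penalty, and hence how closely the target is met, depends on a hand-chosen constant.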
Results are shown in Figure 1(a). Observe that the CO models converge very closely to the set target (within 1 MSE point for achievable constraints). For the hinge models, the constraint is not satisfied reliably, and the overall R/D performance is worse (some models converged to R/D values outside of the chosen display range). Furthermore, the hinge models are sensitive to the value of β, and the optimal value differs per target.
Figure 3 shows the trajectories of the CO multipliers. For stricter constraints, it takes longer before the multiplier starts to drop, shifting emphasis from D to R. In the limit of an unachievable constraint, the multiplier remains constant at the clip value. All multipliers converge to a relatively stable final value, which depends on the target (as expected, since the slope of the R/D curve differs per target).
4.3 CO vs. β-VAE
In the next experiment, we compare the R/D performance of CO to the β-VAE baseline. We first train β-VAE models for exponentially spaced β values (0.1, 10, 50, 100, 200, 250, 500, 750). For each β-VAE, we use the distortion loss over the last training epoch as the target D_t for training a CO model.
Results are shown in Figure 1(b) (PSNR results in Figure A.1). The R/D performance of the CO models is similar to that of the β-VAEs. For bitrates higher than 0.4 bpp, we see a slight advantage for the β-VAE. For these target values, the CO multipliers are almost constant (see the strict constraints in Figure 3), and we thus attribute this difference to the optimization hyperparameters being fine-tuned for the scale of the β-VAE loss.
4.4 Model Selection
In the final experiment we highlight how constrained optimization can simplify the model selection process. We adapt our architecture by reducing the number of latent channels, effectively halving the maximum channel capacity from 1.29 bpp to 0.64 bpp. We train β-VAE models for the β values from Section 4.3 and CO models using the targets from Section 4.2.

Results are shown in Figure 2. For both optimization methods, the lowest achievable distortion has increased for the model with decreased channel capacity.

For the β-VAE optimization, points with the same β now end up at very different points on the R/D plane. For the half-capacity model, we cover only a narrow range of 240–128 MSE.

In contrast, CO produces two comparable R/D curves. Distortion targets below 130 MSE are unachievable for the half-capacity model and are all collapsed into a single point. However, for any achievable distortion target, both models end up with a similar distortion, which allows us to do a pointwise comparison.
5 Conclusion
We present distortion constrained optimization (CO) as an alternative to β-VAE training for lossy compression. We report suitable hyperparameters and propose to normalize the constraint function for better performance. We demonstrate that CO gives performance similar to β-VAE on a realistic image compression task, while providing a more intuitive way to balance the rate and distortion losses. Finally, we show how CO can facilitate the model selection process by allowing pointwise model comparisons.
References

[1] (2018) Fixing a broken ELBO. In International Conference on Machine Learning.
[2] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432.
[3] (2016) Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 10–21.
[4] (2019) A novel deep progressive image compression framework. In 2019 Picture Coding Symposium (PCS), pp. 1–5.
[5] (2017) Variational lossy autoencoder. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings.
[6] (2020) Learning for video compression. IEEE Transactions on Circuits and Systems for Video Technology 30 (2), pp. 566–576.
[7] (2019) Variable rate deep image compression with a conditional autoencoder. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3146–3154.
[8] (2019) Neural inter-frame compression for video coding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6421–6429.
[9] (2013) Practical methods of optimization. John Wiley & Sons.
[10] (2010) Least-squares minimization under constraints. Technical report, EPFL.
[11] (2017) PixelVAE: a latent variable model for natural images. In 5th International Conference on Learning Representations.
[12] (2019) Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042.
[13] (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations.
[14] (2019) Computationally efficient neural image compression. CoRR abs/1912.08771.
[15] (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393.
[16] (1939) Minima of functions of several variables with inequalities as side constraints. M.Sc. dissertation, Dept. of Mathematics, Univ. of Chicago.
[17] (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, pp. 4743–4751.
[18] (2014) Nonlinear programming. In Traces and Emergence of Nonlinear Programming, pp. 247–258.
[19] (2019) Learned video compression via joint spatial-temporal correlation exploration. CoRR abs/1912.06348.
[20] (2019) Deep generative video compression. In Advances in Neural Information Processing Systems 32, pp. 9287–9298.
[21] (2019) DVC: an end-to-end deep video compression framework. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] (2017) Imposing hard constraints on deep networks: promises and limitations. CoRR abs/1706.02025.
[23] (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402.
[24] (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems 31, pp. 10771–10780.
[25] (2020) Channel-wise autoregressive entropy models for learned image compression.
[26] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
[27] (2018) End-to-end learning of video compression using spatio-temporal autoencoders.
[28] (2007) Building blocks for variational Bayesian learning of latent variable models. Journal of Machine Learning Research 8 (Jan), pp. 155–201.
[29] (2018) Generalized ELBO with constrained optimization, GECO. In Workshop on Bayesian Deep Learning, NeurIPS.
[30] (2017) Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2922–2930.
[31] (2019) Learned video compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3454–3463.
[32] (2016) Ladder variational autoencoders. In Advances in Neural Information Processing Systems 29, pp. 3738–3746.
[33] (2017) Lossy image compression with compressive autoencoders. In International Conference on Learning Representations.
[34] (2015) Variable rate image compression with recurrent neural networks. CoRR abs/1511.06085.
[35] (2016) Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems 29, pp. 4790–4798.
[36] (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.
[37] (2018) Video compression through image interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 416–431.
[38] (2019) Feedback recurrent autoencoder. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[39] (2020) Variable-bitrate neural compression via Bayesian arithmetic coding. CoRR abs/2002.08158.
[40] (2017) Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3881–3890.