
Lossy Compression with Distortion Constrained Optimization

by Ties van Rozendaal, et al.

When training end-to-end learned models for lossy compression, one has to balance the rate and distortion losses. This is typically done by manually setting a tradeoff parameter β, an approach called β-VAE. Using this approach it is difficult to target a specific rate or distortion value, because the result can be very sensitive to β, and the appropriate value for β depends on the model and problem setup. As a result, model comparison requires extensive per-model β-tuning, and the production of a whole rate-distortion curve (by varying β) for each model to be compared. We argue that the constrained optimization method of Rezende and Viola (2018) is far more appropriate for training lossy compression models, because it allows us to obtain the best possible rate subject to a distortion constraint. This enables pointwise model comparisons: training two models with the same distortion target and comparing their rates. We show that the method does manage to satisfy the constraint on a realistic image compression task, outperforms a constrained optimization method based on a hinge loss, and is more practical to use for model selection than a β-VAE.





1 Introduction

Deep latent variable models have started to outperform conventional baselines on lossy compression of images [4, 7, 25, 14, 15, 24, 23, 33, 36], video [19, 8, 21, 31, 37, 20, 27, 6, 12], and audio [38, 36]. Nearly all of these methods use a loss function of the form L = D + βR, where D measures distortion, R measures bitrate, and β is a fixed tradeoff parameter. We refer to this approach as β-VAE [13], because this loss can be motivated from a variational perspective [12].

Despite its popularity, β-VAE has several drawbacks. Firstly, setting β to target a specific point in the R/D plane can be tricky. One can show that a model trained with a given β should end up at the point on the R/D curve where the slope dD/dR equals −β [1]. However, because the shape of the R/D curve depends on the model and hyperparameters, and because the R/D curve can be very steep or flat in the low or high bitrate regime, choosing β can be difficult.
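The sensitivity of the operating point to β can be illustrated numerically. The sketch below (a hypothetical convex R/D curve R(D) = 1/D, not taken from the paper) minimizes the β-VAE loss over the curve and shows how strongly the resulting distortion depends on β:

```python
import numpy as np

# Hypothetical convex rate-distortion curve R(D) = 1/D (illustration only).
D = np.arange(0.01, 2.0, 0.001)  # candidate distortion values
R = 1.0 / D                      # corresponding bitrates

def operating_point(beta):
    """Distortion of the point minimizing the beta-VAE loss D + beta * R."""
    return D[np.argmin(D + beta * R)]

# Analytically d/dD (D + beta/D) = 0 gives D* = sqrt(beta), so the target
# distortion moves with the square root of beta:
d1 = operating_point(0.01)  # ~0.10
d2 = operating_point(0.04)  # ~0.20: a 4x change in beta doubles the distortion
```

On a steeper curve the shift is even larger, which is why choosing β to hit a prescribed distortion is hard without knowing the curve's shape in advance.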

Secondly, in order to compare models it is not sufficient to train one instance of each model, because the converged models would likely differ in both rate and distortion, which yields inconclusive results unless one model dominates the other on both metrics. Instead, to compare models we need to train both at several β values to generate R/D curves that can be compared, which is computationally costly and slows down the research iteration cycle.

A more natural way to target different regions of the R/D plane is to set a distortion constraint and find our model parameters through constrained optimization:

    min_θ R(θ)  s.t.  D(θ) ≤ D₀,     (1)

where θ refers to the joint parameters of the encoder, decoder and prior, and D₀ is a distortion target.

We can control the rate-distortion tradeoff by setting the distortion target value D₀. Setting this value is more intuitive than setting β, as it is independent of the slope of the R/D curve, and hence independent of model and hyperparameters.

As a result, we can easily compare two different models trained with the same distortion constraint; as we have fixed the distortion axis, we only have to look at the rate performance of each model.

Note that one could also minimize the distortion subject to a rate constraint. This is less straightforward as putting too much emphasis on the rate loss at the beginning of training can lead to posterior collapse [3, 11, 40, 28, 32].

There is a large literature on constrained optimization, but most of it does not consider stochastic optimization and is limited to convex loss functions. In this paper we evaluate, in addition to β-VAE, two constrained optimization methods that are compatible with stochastic gradient descent training of deep networks: a simple method based on the hinge loss (free bits [17, 5, 1], but applied to distortion rather than rate), and the Lagrangian distortion-constrained optimization method of [29] (D-CO).

We evaluate these methods on a modern image compression system applied to a realistic compression benchmark. We report on suitable hyperparameters and practical considerations that are relevant in this domain. We show that D-CO outperforms the hinge method, and reaches a similar performance to β-VAE. At the same time, D-CO is easier to work with and allows for pointwise model selection.

2 Related Work

2.1 Constrained Optimization

Several works have proposed algorithms to train deep networks under equality or inequality constraints [22, 10, 9, 29]. We deploy the algorithm of [29] as the VAE context is most similar to our setup.

The focus of [29] is on generative modelling rather than data compression, and there are a number of reasons why the models trained in [29] are not directly applicable to data compression. Firstly, their models contain a stochastic encoder, which is not suitable for lossy compression, where bits-back coding is inapplicable. Secondly, [29] do not report R/D performance but instead report log-likelihood. Furthermore, their latent space is continuous, while most compression papers use a discrete latent space that allows for entropy coding of the latents under the prior. Lastly, they use a fixed Gaussian prior, whereas in lossy compression powerful learnable priors are used to decrease the bitrate as much as possible. In this paper we focus on the implementation and evaluation of constrained optimization for practical lossy image compression.

2.2 Hinge Loss

Another approach that was proposed for constrained optimization (in the context of avoiding posterior collapse) is free bits, where the rate loss is hinged [17, 5, 1]. Like constrained optimization, this loss allows us to set a target value, and as such it has been used in lossy compression [23]. However, we find that this method is inferior to constrained optimization in terms of R/D performance and has difficulty converging to the target value.

2.3 Variable Bitrate Models

A different approach to dealing with the rate-distortion tradeoff is to train a single model that can compress at different bitrates [34, 30, 7, 39]. However, some of these works do not match the performance achieved with specialized models [39] or require disjoint training of the autoencoder and prior [34]. Other methods could benefit from constrained optimization (e.g. [39] still uses multipliers that could be replaced by a distortion target), an exercise that is left for future research.

(a) D-CO: min_θ R(θ) s.t. D(θ) ≤ D₀
(b) β-VAE: min_θ D(θ) + βR(θ)
Figure 2: Model selection using (a) distortion constrained optimization and (b) β-VAE. The baseline model (dark line) is altered by halving the number of latent channels (light line). Models that are trained using the same distortion target (a) or the same value of β (b) are connected using a dotted line.

3 Method

3.1 Constrained Optimization

The Lagrangian of the primal problem in Equation 1 is:

    L(θ, λ) = R(θ) + λ (D(θ) − D₀).     (2)

For a convex problem, we would find the minimum of the primal at the maximum of the dual g(λ) = min_θ L(θ, λ) over λ ≥ 0.

For non-convex deep learning models, we deploy the algorithm proposed by [29] and iteratively update θ and λ using stochastic gradient descent and ascent, respectively.
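The descent-ascent procedure can be sketched on a toy convex problem (an illustration with made-up losses R(θ) = θ² and D(θ) = (θ − 2)², not the paper's model): gradient descent on θ and projected gradient ascent on λ drive the iterates to the constrained optimum.

```python
# Toy problem: minimize R(theta) = theta^2  s.t.  D(theta) = (theta - 2)^2 <= D0.
# The optimum is theta* = 1 (constraint tight) with multiplier lambda* = 1.
D0 = 1.0
theta, lam = 0.0, 10.0       # large initial multiplier: distortion-first
lr_theta, lr_lam = 0.05, 0.05

for _ in range(5000):
    grad_R = 2.0 * theta
    grad_D = 2.0 * (theta - 2.0)
    # Descent step on theta for the Lagrangian R + lam * (D - D0)
    theta -= lr_theta * (grad_R + lam * grad_D)
    # Ascent step on lam using the constraint value, projected to lam >= 0
    D = (theta - 2.0) ** 2
    lam = max(0.0, lam + lr_lam * (D - D0))
```

While the constraint is violated the multiplier grows and shifts the emphasis toward distortion; once it is satisfied the multiplier decays and the rate term takes over.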

Note that the β-VAE loss is the Lagrangian of a rate-constrained optimization problem. However, the multiplier β is either fixed or updated according to a heuristic schedule [3, 11, 40, 28, 32], and thus no constrained optimization is performed.

Because we found that the optimal D-CO hyperparameters were different depending on the target value, we normalize our constraint function by the target value. Our loss function thus becomes:

    L(θ, λ) = R(θ) + λ (D(θ) − D₀) / D₀.     (3)
3.1.1 Weight and Multiplier Updates

For each minibatch, we update θ using the Adam optimizer, and λ using SGD with momentum, to respectively minimize and maximize the batchwise Lagrangian (Eq. 3).

Like [29], we reparametrize the multiplier in log space in order to enforce the positivity of λ (to satisfy the KKT [18, 16] conditions for inequality constraints). We also follow them in updating log λ using the gradient with respect to λ, i.e. the constraint value, as this resulted in smoother updates of our multipliers than using the actual gradient with respect to log λ.

We use a high momentum for our multiplier updates, to ensure a smooth multiplier trajectory despite the high variance of the MSE loss. As we use the PyTorch [26] SGD implementation, we make sure to set dampening equal to momentum. We clip our log-multiplier to a fixed range for stability.

Unlike [29], we choose to set the initial value of log λ to the upper clip value. This way, we focus on training the autoencoder for distortion at the beginning of training, which we found to be essential for high performance. The final multiplier trajectories are shown in Figure 3.
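The effect of setting dampening equal to momentum can be seen in isolation. PyTorch-style SGD keeps a velocity buffer buf ← m·buf + (1 − dampening)·grad; with dampening = m the buffer becomes an exponential moving average of the gradients (bounded by their magnitude), while the default dampening = 0 amplifies a persistent gradient by a factor 1/(1 − m). A minimal pure-Python sketch (not the paper's code):

```python
def sgd_velocity(grads, momentum, dampening):
    """Velocity buffer after a sequence of PyTorch-style SGD momentum updates:
    buf <- momentum * buf + (1 - dampening) * grad."""
    buf = 0.0
    for g in grads:
        buf = momentum * buf + (1.0 - dampening) * g
    return buf

grads = [1.0] * 1000  # a persistent constraint-violation signal
m = 0.9

smooth = sgd_velocity(grads, momentum=m, dampening=m)       # EMA -> ~1.0
amplified = sgd_velocity(grads, momentum=m, dampening=0.0)  # -> ~1/(1-m) = 10.0
```

For the noisy batchwise MSE constraint, the EMA behaviour keeps the multiplier trajectory smooth instead of scaling every fluctuation up tenfold.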

Figure 3: Trajectory of Lagrangian multipliers during training. Each curve represents a model with a different constraint.

4 Experiments

We conduct a series of experiments to show how constrained optimization is more suitable for training lossy compression models than β-VAE or distortion-hinge baselines.

4.1 General Setup

We use the autoencoder architecture of [23], but without the mask. Our prior is the gated PixelCNN [35] as used in [12]. Like [12], we jointly train our code-model and autoencoder, without any detaching of the gradients. We use scalar quantization with a learned codebook and a straight-through estimator (hardmax during the forward pass and softmax gradient during the backward pass) [2, 23, 12].
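A minimal sketch of such a quantizer (illustrative, not the paper's implementation): the forward pass snaps each latent to its nearest codebook entry (hardmax), while the backward pass uses the gradient of a softmax-weighted codebook average, via the usual detach trick.

```python
import torch

def st_quantize(z, codebook, temperature=1.0):
    """Scalar quantization with a straight-through estimator:
    hardmax in the forward pass, softmax gradient in the backward pass."""
    d2 = (z.unsqueeze(-1) - codebook) ** 2            # squared distance to codes
    soft = torch.softmax(-d2 / temperature, dim=-1)   # differentiable assignment
    hard = torch.nn.functional.one_hot(d2.argmin(-1), codebook.numel()).float()
    z_soft = (soft * codebook).sum(-1)
    z_hard = (hard * codebook).sum(-1)
    # Forward value is z_hard; the gradient flows through z_soft.
    return z_soft + (z_hard - z_soft).detach()

codebook = torch.tensor([0.0, 1.0, 2.0])
z = torch.tensor([0.9], requires_grad=True)
z_q = st_quantize(z, codebook)   # snaps to the nearest entry, 1.0
z_q.sum().backward()             # z.grad is the softmax surrogate gradient
```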

We train our model on random 160×160 crops of the ImageNet train set, and evaluate on 160×160 center crops of the ImageNet validation set. Like [23], we resize the smallest side of all images to 256 pixels to reduce compression artifacts.

We train using a rate loss expressed in bits per pixel (bpp) and a distortion loss expressed as the average MSE, computed on unnormalized images on a 0-255 scale.
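Concretely, the two losses can be computed as follows (a sketch under assumed shapes; the prior probabilities latent_probs are hypothetical, not the paper's code): rate is the code length of the discrete latents under the prior in bits, normalized by the number of image pixels, and distortion is the MSE on the 0-255 scale.

```python
import numpy as np

def rate_bpp(latent_probs, image_shape):
    """Bits per pixel: total code length of the discrete latents under the
    prior, divided by the number of image pixels."""
    total_bits = -np.sum(np.log2(latent_probs))
    return total_bits / (image_shape[0] * image_shape[1])

def distortion_mse(x, x_hat):
    """Average MSE on unnormalized 0-255 images."""
    return np.mean((np.asarray(x, float) - np.asarray(x_hat, float)) ** 2)

# Example: a 10x10 latent grid, each symbol uniform over 256 codebook values,
# for a 160x160 image: 100 * 8 bits / 25600 pixels = 0.03125 bpp.
latent_probs = np.full((10, 10), 1.0 / 256.0)
bpp = rate_bpp(latent_probs, (160, 160))

x = np.full((160, 160), 128.0)
mse = distortion_mse(x, x + 10.0)  # constant error of 10 -> MSE = 100
```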

We update our parameters using Adam, with separate learning rates for the autoencoder and the prior. We decay both learning rates every 3 epochs (120087 iterations) by a constant factor. For the multiplier updates, we use SGD with momentum. We use a batch size of 32.

4.2 D-CO vs. D-hinge

For this experiment we choose exponentially spaced distortion-constraint values D₀ (60, 65, 70, 80, 100, 125, 150, 200, 300 MSE) and look at how well the methods converge to the set target. We compare our D-CO training with simpler D-hinge baselines of the form:

    L(θ) = R(θ) + λ max(0, (D(θ) − D₀) / D₀).     (4)

Unlike D-CO, λ is fixed during training, but we train models with different λ values (0.01, 0.1, 1, 10, 100). In line with D-CO, we use the normalized constraint function, as we verified that it worked better than the unnormalized one.
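The hinged objective is simple to state in code (a sketch; in practice rate and distortion are batchwise losses):

```python
def d_hinge_loss(rate, distortion, target, lam):
    """Rate plus a hinged, target-normalized distortion penalty that is
    active only while the distortion constraint is violated."""
    violation = (distortion - target) / target
    return rate + lam * max(0.0, violation)

# Constraint satisfied: the penalty vanishes and only rate is optimized.
ok = d_hinge_loss(rate=0.5, distortion=55.0, target=60.0, lam=10.0)   # 0.5
# Constraint violated: the penalty scales with the relative violation.
bad = d_hinge_loss(rate=0.5, distortion=90.0, target=60.0, lam=10.0)  # 5.5
```

Because λ is fixed, nothing adapts the penalty strength per target, so there is no force pinning the converged distortion exactly at D₀.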

Results are shown in Figure 1(a). Observe that the D-CO models converge very closely to the set target (within 1 MSE point for achievable constraints). For the hinge models, the constraint is not satisfied reliably and the overall R/D performance is worse (some models converged to R/D values outside of the chosen display range). Furthermore, the hinge models are sensitive to the value of λ, and the optimal value differs per target.

Figure 3 shows the trajectories of the D-CO multipliers. For stricter constraints, it takes longer before the multiplier starts to drop, shifting the emphasis from D to R. In the limit of an unachievable constraint, the multiplier remains constant at the clip value. All multipliers converge to a relatively stable final value, which depends on the target (as expected, since the slope of the R/D curve differs per target).

4.3 D-CO vs. β-VAE

In the next experiment, we compare the R/D performance of D-CO to the β-VAE baseline. We first train β-VAE models for exponentially spaced β values (0.1, 10, 50, 100, 200, 250, 500, 750). For each β-VAE, we use the distortion loss over the last training epoch as the target D₀ for training a D-CO model.

Results are shown in Figure 1(b) (PSNR results in Figure A.1). The R/D performance of the D-CO models is similar to that of the β-VAEs. For bitrates higher than 0.4 bpp, we see a slight advantage for the β-VAE. For these target values, the D-CO multipliers are almost constant (see the strict constraints in Figure 3), and we thus attribute this difference to the optimization hyperparameters being fine-tuned for the scale of the β-VAE loss.

4.4 Model Selection

In the final experiment we highlight how constrained optimization can simplify the model selection process. We adapt our architecture by halving the number of latent channels, which halves the maximum channel capacity from 1.29 bpp to 0.64 bpp. We train β-VAE models for the β values from Section 4.3 and D-CO models using the targets from Section 4.2.

Results are shown in Figure 2. For both optimization methods, the lowest achievable distortion has increased for the model with decreased channel capacity.

For the β-VAE optimization, models trained with the same β now end up at very different points on the R/D plane. For the half-capacity model, we only cover a narrow range of 240-128 MSE.

In contrast, D-CO produces two comparable R/D curves. Distortion targets below 130 MSE are unachievable for the half-capacity model and are all collapsed onto a single point. However, for any achievable distortion target, both models end up with a similar distortion, which allows us to do a pointwise comparison.

5 Conclusion

We present distortion constrained optimization (D-CO) as an alternative to β-VAE training for lossy compression. We report suitable hyperparameters and propose to normalize the constraint function for better performance. We demonstrate that D-CO gives similar performance to β-VAE on a realistic image compression task, while at the same time providing a more intuitive way to balance the rate and distortion losses. Finally, we show how D-CO can facilitate the model selection process by allowing pointwise model comparisons.


  • [1] A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy (2018) Fixing a broken ELBO. In International Conference on Machine Learning. Cited by: §1, §1, §2.2.
  • [2] Y. Bengio, N. Léonard, and A. C. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432. External Links: Link, 1308.3432 Cited by: §4.1.
  • [3] S. R. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio (2016-08) Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 10–21. External Links: Link, Document Cited by: §1, §3.1.
  • [4] C. Cai, L. Chen, X. Zhang, G. Lu, and Z. Gao (2019-11) A novel deep progressive image compression framework. In 2019 Picture Coding Symposium (PCS), pp. 1–5. Cited by: §1.
  • [5] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel (2017) Variational lossy autoencoder. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §1, §2.2.
  • [6] Z. Chen, T. He, X. Jin, and F. Wu (2020) Learning for video compression. IEEE Transactions on Circuits and Systems for Video Technology 30 (2), pp. 566–576. Cited by: §1.
  • [7] Y. Choi, M. El-Khamy, and J. Lee (2019) Variable rate deep image compression with a conditional autoencoder. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3146–3154. Cited by: §1, §2.3.
  • [8] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers (2019) Neural Inter-Frame compression for video coding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6421–6429. Cited by: §1.
  • [9] R. Fletcher (2013-06) Practical methods of optimization. John Wiley & Sons (en). Cited by: §2.1.
  • [10] P. Fua, A. Varol, R. Urtasun, and M. Salzmann (2010) Least-squares minimization under constraints. Technical report, EPFL. Cited by: §2.1.
  • [11] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville (2017) PixelVAE: a latent variable model for natural images. In 5th International Conference on Learning Representations, Cited by: §1, §3.1.
  • [12] A. Habibian, T. van Rozendaal, J. M. Tomczak, and T. S. Cohen (2019) Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042. Cited by: §1, §4.1.
  • [13] I. Higgins, L. Matthey, A. Pal, P. C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations. Cited by: §1.
  • [14] N. Johnston, E. Eban, A. Gordon, and J. Ballé (2019) Computationally efficient neural image compression. CoRR abs/1912.08771. External Links: Link, 1912.08771 Cited by: §1.
  • [15] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici (2018) Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4385–4393. Cited by: §1.
  • [16] W. Karush (1939) Minima of functions of several variables with inequalities as side constraints. M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago. Cited by: §3.1.1.
  • [17] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4743–4751. Cited by: §1, §2.2.
  • [18] H. W. Kuhn and A. W. Tucker (2014) Nonlinear programming. In Traces and emergence of nonlinear programming, pp. 247–258. Cited by: §3.1.1.
  • [19] H. Liu, H. Shen, L. Huang, M. Lu, T. Chen, and Z. Ma (2019-12) Learned video compression via joint Spatial-Temporal correlation exploration. CoRR abs/1912.06348. External Links: 1912.06348 Cited by: §1.
  • [20] S. Lombardo, J. Han, C. Schroers, and S. Mandt (2019) Deep generative video compression. In Advances in Neural Information Processing Systems 32, pp. 9287–9298. External Links: Link Cited by: §1.
  • [21] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao (2019-06) DVC: an end-to-end deep video compression framework. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [22] P. Márquez-Neila, M. Salzmann, and P. Fua (2017-06) Imposing hard constraints on deep networks: promises and limitations. CoRR abs/1706.02025. External Links: 1706.02025 Cited by: §2.1.
  • [23] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool (2018) Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394–4402. Cited by: §1, §2.2, §4.1, §4.1.
  • [24] D. Minnen, J. Ballé, and G. D. Toderici (2018) Joint autoregressive and hierarchical priors for learned image compression. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 10771–10780. External Links: Link Cited by: §1.
  • [25] D. Minnen and S. Singh (2020) Channel-wise autoregressive entropy models for learned image compression. Cited by: §1.
  • [26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §3.1.1.
  • [27] J. Pessoa, H. Aidos, P. Tomás, and M. A. Figueiredo (2018) End-to-end learning of video compression using spatio-temporal autoencoders. Cited by: §1.
  • [28] T. Raiko, H. Valpola, M. Harva, and J. Karhunen (2007) Building blocks for variational bayesian learning of latent variable models. J. Mach. Learn. Res. 8 (Jan), pp. 155–201. Cited by: §1, §3.1.
  • [29] D. J. Rezende and F. Viola (2018) Generalized ELBO with constrained optimization, GECO. In Workshop on Bayesian Deep Learning, NeurIPS, Cited by: Lossy Compression with Distortion Constrained Optimization, §1, §2.1, §2.1, §3.1.1, §3.1.1, §3.1.
  • [30] O. Rippel and L. Bourdev (2017) Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2922–2930. Cited by: §2.3.
  • [31] O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, and L. Bourdev (2019) Learned video compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3454–3463. Cited by: §1.
  • [32] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther (2016) Ladder variational autoencoders. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 3738–3746. Cited by: §1, §3.1.
  • [33] L. Theis, W. Shi, A. Cunningham, and F. Huszár (2017) Lossy image compression with compressive autoencoders. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [34] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar (2015) Variable rate image compression with recurrent neural networks. CoRR abs/1511.06085. Cited by: §2.3.
  • [35] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves (2016) Conditional Image Generation with PixelCNN Decoders. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4790–4798. Cited by: §4.1.
  • [36] A. van den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315. Cited by: §1.
  • [37] C. Wu, N. Singhal, and P. Krahenbuhl (2018) Video compression through image interpolation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 416–431. Cited by: §1.
  • [38] Y. Yang, G. Sautière, J. Jon Ryu, and T. S. Cohen (2019-11) Feedback Recurrent AutoEncoder. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). External Links: 1911.04018 Cited by: §1.
  • [39] Y. Yang, R. Bamler, and S. Mandt (2020) Variable-bitrate neural compression via bayesian arithmetic coding. CoRR abs/2002.08158. Cited by: §2.3.
  • [40] Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick (2017) Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3881–3890. Cited by: §1, §3.1.

Appendix A Supplementary Material

A.1 PSNR Results

Figure A.1: Rate / PSNR performance for constrained optimization and β-VAE.
Figure A.2: Validation Rate / PSNR performance for constrained optimization and β-VAE.