TACTIC: Joint Rate-Distortion-Accuracy Optimisation for Low Bitrate Compression

by Nikolina Kubiak, et al.

We present TACTIC: Task-Aware Compression Through Intelligent Coding. Our lossy compression model learns based on the rate-distortion-accuracy trade-off for a specific task. By considering what information is important for the follow-on problem, the system trades off visual fidelity for good task performance at a low bitrate. When compared against JPEG at the same bitrate, our approach is able to improve the accuracy of ImageNet subset classification by 4.5%. We also demonstrate the applicability of our approach to other problems, providing a 3.4% accuracy improvement over task-agnostic compression for semantic segmentation.





1 Introduction

The performance of image processing algorithms depends on the quality of data fed into the system. Algorithmic accuracy inevitably suffers when the data is compressed [poyser_2020, mandelli_2020]. Therefore, we ask ourselves – could this issue be resolved by creating a learnt compression scheme? More specifically, what would happen if compression was joined with the target task, and the data valuable to the machine, and not the Human Visual System (HVS), was extracted?

In TACTIC the task of data recovery and the downstream task are optimised jointly. Image compression and decompression are simulated using an information bottleneck and the latter task is performed using a convolutional neural network (CNN). In this solution, changes in the compressed latent space that improve the performance of the task head are encouraged. Simultaneously, the task head is trained to deal with compression artefacts introduced by the information bottleneck.

Typically, compression models optimise their performance based on either the rate-distortion (Fig. 1(a)) or rate-accuracy (Fig. 1(b)) trade-off. The former attempts to reconstruct the input as closely as possible, often for the HVS, without regard for whether the reconstructed information is useful for further tasks. The latter solely tries to perform well on a fixed task. In this paper we adapt the reconstruction to the task and propose TACTIC - a way of learning a compact representation that takes all three parameters into consideration at the same time (Fig. 1(c)). We show that TACTIC outperforms the ‘standard’ task-agnostic solution where a rate-distortion-optimised bottleneck is trained first and then a task head is trained separately after the compression scheme has been fixed.

Figure 1: Types of trade-offs discussed in the paper: (a) rate-distortion trade-off, (b) rate-accuracy trade-off, (c) rate-distortion-accuracy trade-off (TACTIC). $\mathcal{L}_d$ denotes the reconstruction loss, $\mathcal{L}_t$ the loss on a downstream task and $\mathcal{L}_r$ the rate loss. $x$ is the input image, $\hat{x}$ is its reconstruction after compression and decompression, and $\hat{y}$ is its quantized latent space representation. $t$ is the ground truth for the task.

What is more, our approach achieves better accuracy than training the same task head with JPEG data compressed at a comparable rate, beating the popular codec by a margin of 4.5%. We achieve the aforementioned gains with little increase in the run time or memory requirements, thanks to a simple compression architecture and small latent space size. This makes our system highly suitable for resource-limited applications such as Internet of Things or satellites. It also fits perfectly into today’s reality of machine-driven processing of large amounts of data, often performed without direct human supervision.

Finally, we believe that joint learning can be applied to other downstream tasks and produce similarly favourable results. We verify the effectiveness of TACTIC on another computer vision problem - semantic segmentation. We train the model in a task-aware and task-agnostic manner, and show that TACTIC outperforms the task-agnostic approach across multiple tasks.

In summary, the contributions of our paper are twofold:

  • We show that task-aware compression outperforms the same model trained in a task-agnostic manner, as measured by the performance on the downstream task.

  • We demonstrate that TACTIC achieves better downstream task accuracy in comparison with models trained on equivalently compressed JPEG data.

2 Literature review

In the following sections we review papers exploring the effects of compression on downstream computer vision tasks and different ways of mitigating compression-related artefacts. We also look at state-of-the-art learnt lossy compression schemes and at their contributions to learning with low bitrate compact representations.

2.1 Computer vision tasks vs JPEG compression

The quality of JPEG compressed data is regulated by the quantization (Q) tables. The Q values are psycho-visually weighted, i.e. defined in a way that preserves more of the low-frequency information salient to the HVS while quantizing higher frequencies more coarsely. Neural networks do not exhibit the HVS frequency bias, so generic JPEG quantization can distort potentially salient information [liu_2018]. Therefore, numerous works try to tackle this problem and alter the JPEG compression method to look past the HVS and instead improve performance of downstream tasks.

To this end, [liu_2018] re-design the Q-tables taking into consideration the energy associated with each frequency band and, hence, its contribution to the network feature learning. Doing so, they achieve better compression efficiency without detriment in quality. This idea is expanded by [li_optimizing_2020, luo_2020] who use a larger Q-value search space and rely on further hyper-parameter tuning. Similar joint learning strategies are exploited in QuanNet [chamain_2019] to optimise the quantization intervals of the JPEG2000 encoder, and by [brummer_2020] to tune the weights of JPEG XS.

Instead of improving the task performance by redesigning the codec, some aim to fix its artefacts [li_jpeg_2020, ehrlich_quantization_2020]. Galteri et al. [galteri_2019] focus on artefact removal and then verify the improvements on computer vision tasks. Others, such as Ehrlich et al. [ehrlich_analyzing_2020], propose a task-targeted artefact correction model, optimised using the error on the logits of the downstream task, measured as the difference between results obtained using original vs additionally compressed JPEG data.

2.2 State-of-the-art lossy compression schemes

While JPEG is de facto the standard for image compression, competitive learnt solutions have emerged in recent years. Their optimisation expands the traditional rate-distortion trade-off equation in a variety of ways.

Ballé et al. [balle_2018] introduced hyperpriors. Just like a standard autoencoder learns the representation of an image, this addition allows the model to learn the representation of the latent space and achieve better compression performance than standard entropy coding methods.

In [rippel_2017, iwai_2020, mentzer_2020] traditional autoencoders are mixed with GANs. The decoder is treated as the generator in a standard GAN, and the reconstructions are fed into a discriminator alongside real examples. [agustsson_2019, wu_2020] show that data lost during compression can be synthesised and the model can still generate visually pleasing results; the balance between reconstruction and generative performance can also vary [tschannen_2018].

In Torfason et al. [torfason_2018], the autoencoder and task network are trained separately and then finetuned together. During finetuning they share the encoder layers to skip decompression. This can be imagined as an encoder backbone with task heads for classification and regression.

3 Methodology

In TACTIC the compression model and the task network are linked: the output of the information bottleneck feeds directly into the following model. Instead of returning just the task output, the model now also outputs the reconstructed image. The two outputs are used to calculate two losses - the reconstruction loss and the downstream task loss. The bitrate of such a mapping is controlled using the rate loss. The three losses are added and optimised together. This is equivalent to the rate-distortion-accuracy trade-off pictured in Fig. 1(c).

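Let us now formalise the above description of TACTIC. Firstly, the reconstruction loss, measuring the distortion, can be expressed as

$$\mathcal{L}_d = d(x, \hat{x}) \tag{1}$$

where $x$ and $\hat{x}$ are the input image and its reconstruction, respectively, and $d(\cdot, \cdot)$ is a distortion measure. $\hat{x}$ is given by

$$\hat{x} = g_d(g_c(x; \theta_c); \theta_d) \tag{2}$$

where $g_c$ and $g_d$ denote the compressing/decompressing part of the model while $\theta_c$ and $\theta_d$ are the corresponding weights.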

The accuracy parameter is optimised using the task loss $\mathcal{L}_t$. The exact loss function ($\ell$) depends on the specific problem but it can be generalised as

$$\mathcal{L}_t = \ell(f(\hat{x}), t) \tag{3}$$

where $f$ is the task function applied to the reconstructed image and $t$ is the ground truth used for loss calculation. If the task function is implemented as a neural network with weights $\theta_t$, $\mathcal{L}_t$ could be described as

$$\mathcal{L}_t = \ell(f(\hat{x}; \theta_t), t) \tag{4}$$


Finally, we estimate the bitrate of our model. Instead of encoding the image pixels directly, we can operate on their latent representation. Compression, however, is a source of error since discarded data cannot be easily recovered. If the latent space encoding becomes part of the training process, we can learn what information to discard and what to preserve. Traditionally, the latent space is quantized and then entropy coded. Yet, such an approach does not allow for easy gradient flow. Therefore, we follow the approach of [balle_2018] and simulate the quantization by adding uniform noise to the latent representation $y$ during training:

$$\tilde{y} = y + \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \tag{5}$$

Now $\tilde{y}$ is the noisy ‘quantized’ version of $y$. During inference no gradients are needed so actual quantization is applied to $y$:

$$\hat{y} = \mathrm{round}(y) \tag{6}$$
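The two treatments of the latent representation described above can be illustrated with a few lines of plain Python. This is only a sketch: the function names and the toy latent vector are hypothetical, and the noise range of half a quantization step on either side is assumed from the approach of [balle_2018].

```python
import random

def noisy_quantize(latent, rng):
    """Training-time proxy: add independent U(-0.5, 0.5) noise to each value."""
    return [v + rng.uniform(-0.5, 0.5) for v in latent]

def hard_quantize(latent):
    """Inference-time quantization: round each value to the nearest integer."""
    return [float(round(v)) for v in latent]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
y = [0.2, 1.7, -2.4, 3.1]         # toy latent representation (hypothetical)
y_noisy = noisy_quantize(y, rng)  # used while gradients are needed
y_hat = hard_quantize(y)          # used at inference
print(y_hat)                      # [0.0, 2.0, -2.0, 3.0]
```

Both variants perturb each latent value by at most half a quantization step, which is why the noisy version can stand in for rounding while remaining differentiable with respect to the latent values.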


The rate loss and its relationship with the noisy / quantized representation can be expressed as

$$\mathcal{L}_r = \frac{r(\tilde{y})}{W \cdot H} \tag{7}$$

where $r$ refers to the latent space coding, mapping $\tilde{y}$ (or $\hat{y}$ at inference) to the number of bits, and $W$ and $H$ are image width and height, used for bits per pixel normalisation. The encoding is achieved by learning an approximation of the probability density function (PDF) of the data by fitting a parametric function $p$ to it. The coding rate is then approximated as

$$r(\tilde{y}) \approx -\sum_i \log_2 p(\tilde{y}_i) \tag{8}$$
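As a rough illustration of how a coding rate can be estimated from a fitted density, the sketch below assumes the parametric function is a zero-mean Gaussian with a hypothetical scale `sigma`; the family actually fitted is not specified here. The probability of each latent value is taken as the Gaussian mass of the unit-width bin around it, an assumption borrowed from common learned-compression practice.

```python
import math

def gaussian_cdf(v, sigma):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(v / (sigma * math.sqrt(2.0))))

def estimated_bits(latent, sigma):
    """Approximate coding cost: minus the sum of log2 bin probabilities."""
    total = 0.0
    for v in latent:
        p = gaussian_cdf(v + 0.5, sigma) - gaussian_cdf(v - 0.5, sigma)
        total -= math.log2(max(p, 1e-12))  # clamp to avoid log2(0)
    return total

def rate_loss(latent, width, height, sigma=1.0):
    """Bits per pixel: estimated bits normalised by the image area."""
    return estimated_bits(latent, sigma) / (width * height)

y_hat = [0.0, 2.0, -2.0, 3.0]              # toy quantized latent (hypothetical)
bpp = rate_loss(y_hat, width=4, height=4)  # toy 4x4 image
print(f"{bpp:.3f} bits per pixel")
```

Values far from the mode of the density are less probable and therefore cost more bits, so minimising this term pushes the latent representation towards values the fitted density predicts well.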


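Finally, to control the trade-off between the parameters, scaling was added to the losses. The complete TACTIC loss uniting Eqs. 1, 3 and 7 is thus

$$\mathcal{L} = \mathcal{L}_d + \alpha \mathcal{L}_t + \beta \mathcal{L}_r \tag{9}$$

where $\alpha$ denotes the weighting factor on the task loss $\mathcal{L}_t$ and $\beta$ the weighting factor on the rate loss $\mathcal{L}_r$.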
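A minimal PyTorch sketch of one joint optimisation step on the combined loss is given below. It is not the authors' implementation: the linear layers are hypothetical stand-ins for the bottleneck and the task head, the distortion is assumed to be mean squared error, the density is assumed to be a zero-mean Gaussian with a learnable scale, and the rate is normalised by the number of input elements in place of the image area.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the compressing part, decompressing part and task head.
encoder = nn.Linear(16, 4)
decoder = nn.Linear(4, 16)
task_head = nn.Linear(16, 3)
log_scale = nn.Parameter(torch.zeros(1))  # assumed Gaussian density parameter

def tactic_loss(x, target, alpha=1.0, beta=1.0):
    y = encoder(x)
    y_noisy = y + torch.empty_like(y).uniform_(-0.5, 0.5)  # quantization proxy
    x_hat = decoder(y_noisy)
    loss_d = F.mse_loss(x_hat, x)                          # distortion (MSE assumed)
    loss_t = F.cross_entropy(task_head(x_hat), target)     # task loss
    density = torch.distributions.Normal(0.0, log_scale.exp())
    p = density.cdf(y_noisy + 0.5) - density.cdf(y_noisy - 0.5)
    bits = -torch.log2(p.clamp_min(1e-9)).sum(dim=1)
    loss_r = (bits / x.shape[1]).mean()                    # bits per input element
    return loss_d + alpha * loss_t + beta * loss_r

params = [*encoder.parameters(), *decoder.parameters(),
          *task_head.parameters(), log_scale]
optimiser = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 16)               # toy batch (hypothetical data)
target = torch.randint(0, 3, (8,))
loss = tactic_loss(x, target)
optimiser.zero_grad()
loss.backward()                      # gradients reach encoder, decoder and task head
optimiser.step()
```

Because the task loss is computed on the reconstruction, its gradient flows back through the decoder and encoder, which is what lets the bottleneck adapt to the task rather than to visual fidelity alone.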

4 Experimental results

The compression architecture used in the experiments was fully convolutional. The compressing part was formed by 2 conv blocks (convolution - ReLU - max pooling); decompression was based on 3 steps of up-convolutions and ReLUs. The downstream task chosen for the experiments was classification, demonstrated using Inception v3 and trained with cross-entropy loss. All models used Adam for optimisation with a learning rate of 0.001. All experiments were run with a batch size of 32. The solution was implemented in PyTorch.
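One way such a bottleneck could be laid out in PyTorch is sketched below. Only the block structure (two convolution - ReLU - max pooling blocks, three up-convolution - ReLU steps) comes from the description above; the class name, channel widths, kernel sizes and strides are assumptions chosen so that the output matches the input resolution.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Hypothetical fully convolutional bottleneck; layer sizes are assumed."""

    def __init__(self, latent_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, latent_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        y = self.encoder(x)
        if self.training:
            y_q = y + torch.empty_like(y).uniform_(-0.5, 0.5)  # noise proxy
        else:
            y_q = torch.round(y)                               # actual quantization
        return self.decoder(y_q), y_q

model = Bottleneck().eval()
x = torch.rand(2, 3, 64, 64)  # toy batch of RGB images
with torch.no_grad():
    x_hat, y_q = model(x)
print(x_hat.shape, y_q.shape)
```

With these assumed sizes the latent tensor has a quarter of the input resolution along each spatial axis, and the reconstruction can be passed to any image classifier as if it were the original input.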

The experiments were performed on a subset of the ILSVRC2012 [imagenet] version of ImageNet. The first 50 classes from ILSVRC2012 were chosen, which formed a dataset of roughly 55k images. The data was split 80:20, with a fixed random seed, into train and validation sets. In the interest of fairness, three different random seeds were tested; the results achieved with the models under different data splits varied only marginally (max. 0.5%).

In the following sections we will refer to ‘compressed’ and ‘uncompressed’ data. The former describes data compressed using the information bottleneck and the latter describes data fed directly into the CNN. In reality all ImageNet data is JPEG-encoded and, hence, compressed. Therefore, all mentions of compression should be understood as additional compression applied to the files.

4.1 Task-aware vs task-agnostic compression

In the initial experiments our goal was to investigate the difference between the task-aware and task-agnostic learning approaches to the distortion and accuracy optimisation problem. The latter approach is equivalent to training the compression network first, and then fixing it and training the classification network on the reconstructed data.
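The second stage of the task-agnostic baseline, in which the compression network is fixed and only the classifier is trained, could be set up along the following lines. The modules are hypothetical linear stand-ins and the first stage (training the bottleneck on reconstruction alone) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins; in practice the bottleneck would already be trained.
bottleneck = nn.Sequential(nn.Linear(16, 4), nn.Linear(4, 16))
task_head = nn.Linear(16, 3)

# Freeze the bottleneck so that only the task head receives updates.
bottleneck.requires_grad_(False).eval()
optimiser = torch.optim.Adam(task_head.parameters(), lr=1e-3)

x = torch.randn(8, 16)               # toy batch (hypothetical data)
target = torch.randint(0, 3, (8,))
frozen_before = bottleneck[0].weight.clone()
head_before = task_head.weight.clone()

loss = F.cross_entropy(task_head(bottleneck(x)), target)
optimiser.zero_grad()
loss.backward()
optimiser.step()
```

In the task-aware setting the freezing line is dropped and the bottleneck parameters are handed to the optimiser as well, so the task loss can reshape the latent space.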

For this set of experiments, the task-loss weight $\alpha$ was set to 1 and the rate-loss weight $\beta$ to 0. The results, shown in Table 1, indicate that when input data compression is necessary, joint learning results in greater information retention. Using a separately trained autoencoder led to a drop in accuracy of roughly 10 percentage points (from 63.7% to 53.8%), while for TACTIC the drop was only about 5 percentage points.

Model version     Accuracy [%]
no compression    63.7
task-agnostic     53.8
Table 1: Best recorded accuracy of the Inception model trained with different types of compression.

4.2 Comparisons with JPEG

To quantify the benefits of our approach, we make comparisons with JPEG. Compression of varying degree was applied to inputs to the Inception network using the quality setting available for saving PIL images; the values used were: 2, 5, 10, 15, 25 and 50%. For each quality setting, we simulated the compression to get the bit rates of the validation dataset. We also calculated the bit rate of the uncompressed data used with the standalone classifier.
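The bitrate measurement for a given JPEG quality setting can be sketched with Pillow as follows. The helper name and the synthetic test image are hypothetical; in the experiments the measurement would be averaged over the validation images. Note that the byte count here includes the JPEG header, which is an assumption about how the bit rate is counted.

```python
import io
from PIL import Image

def jpeg_bits_per_pixel(image, quality):
    """Encode the image as JPEG in memory and return its size in bits per pixel."""
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    width, height = image.size
    return 8 * len(buffer.getvalue()) / (width * height)

# Synthetic 64x64 gradient image standing in for a validation sample.
pixels = bytes(c for row in range(64) for col in range(64)
               for c in (col * 4, row * 4, (col + row) * 2))
image = Image.frombytes("RGB", (64, 64), pixels)

for quality in (2, 5, 10, 15, 25, 50):
    print(quality, round(jpeg_bits_per_pixel(image, quality), 4))
```

On small images the fixed header overhead dominates the result, so the absolute numbers from this toy input are not comparable with those reported for the dataset.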

In the experiments, we fixed the task-loss weight $\alpha$ at 1 and tested different values of the rate-loss weight $\beta$ for the rate-distortion-accuracy trade-off. A selection of operating points is shown in Fig. 2. Left-to-right, these start at $\beta$ = 4, 2 and 1. The dashed green line represents the accuracy for uncompressed data. Thanks to the small size of our latent space, we were able to target low bitrates. These proved to be almost exclusively lower than those achievable by JPEG. Higher bitrates were not explored for TACTIC as it would have been necessary to alter the network architecture and expand the latent space, broadening the scope of the problem. To make meaningful comparisons, we set our highest-bitrate model (bpp = 0.245) against JPEG at quality 2 (bpp = 0.241) and observe an accuracy gain of 4.5 percentage points (58.7% vs 54.2%). The numerical results are shown in Table 2.

Compression          Accuracy [%]    Bits / pixel
no compression       63.7            5.2541
JPEG50               62.3            1.1302
JPEG25               61.5            0.7365
JPEG15               60.6            0.5390
JPEG10               60.6            0.4276
JPEG5                58.2            0.3010
JPEG2                54.2            0.2410
ours ($\beta$ = )    58.7            0.2450
ours ($\beta$ = )    58.0            0.2159
ours ($\beta$ = 1)   57.8            0.1278
ours ($\beta$ = 2)   56.6            0.0936
ours ($\beta$ = 4)   55.6            0.0694
Table 2: Best accuracy of the classification model trained with different degrees of JPEG compression vs TACTIC.

4.3 Task verification

While the previous experiments focused on classification, we believe that the joint learning approach could work just as well with other computer vision tasks. We verify this on semantic segmentation by running a standalone segmentation model as well as task-aware and task-agnostic compression networks.

Figure 2: Accuracy vs bitrate curves for TACTIC and models trained with JPEG-compressed data.

The experiments were performed using the FCN-Resnet101 [long_2015] model and used the same hyperparameters as described before; only the learning rate was set to 0.0001 and the batch size to 3. As for data, the Cityscapes [cityscapes] dataset was downsized and cropped to 512x512 pixels. The model was trained with three different losses: weighted cross-entropy loss ($\mathcal{L}_{CE}$), Dice loss ($\mathcal{L}_{Dice}$) and the weighted sum of both; $\gamma$ was used as a scale factor on the Dice loss:

$$\mathcal{L}_t = \mathcal{L}_{CE} + \gamma \mathcal{L}_{Dice} \tag{10}$$

As in the initial classification experiments, the rate-loss weight $\beta$ was set to zero; the task-loss weight $\alpha$ and $\gamma$ were set to 1. The weighted sum loss (Eq. 10) generated the best results as measured in terms of pixel-wise accuracy and mean intersection-over-union (mean IoU) score; these are shown in Table 3. Once again, these are more favourable for the joint learning scheme. In terms of accuracy, training in a task-aware setup results in a drop of 4.2 percentage points; meanwhile, for the standard task-agnostic scheme the drop is already 7.6 percentage points. For mean IoU, the decline is 12.5 percentage points for TACTIC and 17.4 for the task-agnostic model.
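The weighted sum of cross-entropy and Dice loss could be written in PyTorch roughly as below. The soft Dice formulation (softmax probabilities, averaged over classes, with a small smoothing constant `eps`) is an assumption, since the exact variant is not specified; the function names and the scale-factor argument `gamma` are likewise hypothetical.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss averaged over classes; logits (N, C, H, W), target (N, H, W)."""
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    intersection = (probs * one_hot).sum(dim=(0, 2, 3))
    cardinality = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice.mean()

def segmentation_loss(logits, target, class_weights=None, gamma=1.0):
    """Weighted cross-entropy plus gamma times the Dice loss."""
    ce = F.cross_entropy(logits, target, weight=class_weights)
    return ce + gamma * dice_loss(logits, target)

torch.manual_seed(0)
target = torch.arange(2 * 8 * 8).reshape(2, 8, 8) % 4  # toy labels, 4 classes
logits = torch.randn(2, 4, 8, 8)                       # toy network output
loss = segmentation_loss(logits, target, class_weights=torch.ones(4))
print(loss.item())
```

Cross-entropy scores every pixel independently, while the Dice term measures region overlap per class, so the sum balances per-pixel correctness against the quality of whole segments.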

Model version     Accuracy [%]    Mean IoU [%]
no compression    80.0            41.0
task-agnostic     72.4            23.6
TACTIC            75.8            28.5
Table 3: Best recorded pixel-wise accuracy and mean IoU for FCN-Resnet101 trained with different types of compression.
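For reference, the two metrics reported in Table 3 can be computed from flattened label maps as in the sketch below. The helper names and toy labels are hypothetical, and the choice to skip classes that appear in neither the prediction nor the ground truth is an assumption; how the averaging was done in the experiments is not stated.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth."""
    correct = sum(p == t for p, t in zip(pred, truth))
    return correct / len(truth)

def mean_iou(pred, truth, num_classes):
    """Intersection-over-union averaged over the classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, truth))
        union = sum(p == c or t == c for p, t in zip(pred, truth))
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)

truth = [0, 0, 1, 1, 2, 2]  # toy flattened ground-truth labels
pred = [0, 1, 1, 1, 2, 0]   # toy flattened predictions

print(round(pixel_accuracy(pred, truth), 4))  # 0.6667
print(mean_iou(pred, truth, num_classes=3))   # 0.5
```

Mean IoU penalises every mislabelled pixel in both the class it was assigned to and the class it belongs to, which is one reason it drops faster than pixel-wise accuracy under compression.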

5 Conclusions

The presented results aim to inspire a new approach to image compression and learning compact latent representations, with focus shifted from the HVS to machine-driven data processing. We show that optimising the compression for a specific task, instead of focusing on perceptual quality, results in better performance for the same model. TACTIC can be easily coupled with other models, adding little overhead in terms of run time or model weights.

The compression network used with TACTIC was designed to serve as a backbone to demonstrate a new idea and it is likely that even better accuracies could be achieved with further model optimisation. Nevertheless, even with such simple architectures we were able to outperform JPEG compression of similar bitrate.