I Introduction
In recent years, Deep Neural Networks (DNNs) have become the preferred method for solving many computer vision (CV) problems [1]. At the same time, the complexity of these models, in both memory and floating point operations, has grown tremendously [2]. That growth, alongside the increasing variety of CV models and their uses, poses a challenge for deployment to low-resource edge devices such as mobile phones, smart speakers, etc. For this reason, in many CV tasks, the most common approach today [3] is to simply avoid deploying resource-intensive models to edge devices. Instead, the bulk of the computation is performed on remote servers (the cloud) equipped with powerful graphics processing units (GPUs) or tensor processing units (TPUs).
Transmitting an image for inference on the cloud using traditional coding methods, which have been designed to accommodate human perception, has been shown to be suboptimal both theoretically [4] and empirically [4, 3]. Collaborative intelligence (CI) [3] suggests dividing a given CV model into a frontend, deployed on the edge, and a backend, which remains on the cloud. Inference can then be started on the edge, creating intermediate representations at some layer of the CV model, which we denote $Y$. These are then sent to the cloud backend, where inference is finished. Of course, this approach requires defining an efficient method for the communication of said representations (generally floating point tensors) to the cloud, which has been explored in several subsequent works [5, 6, 7].
Inspired by CI, the work of [4] proposes replacing the CV frontend with an encoder from a learnable image coding backbone [8]. This replaces the floating point tensors with highly compressible latent features, which are named the base representation. On the decoder side, these can be used to match the intermediate features for the CV backend, which can then perform inference. The matching of intermediate CV features from the compressible latent space is performed by a learned transform named the Latent Space Transform (LST). An important aspect of [4] is that it allows for human viewing of the analysed image in a scalable manner. The base representation is combined with a second representation, named enhancement, to reconstruct the original image on the cloud side. In terms of rate-distortion, this method outperforms traditional as well as learnable coding schemes for the CV task, without significant degradation with regard to reconstruction for humans.
In this work we further investigate the behaviour of [4] in the context of both object detection and input reconstruction. We provide theoretical proof that performing feature matching on deeper layers of a CV model is beneficial in the sense of rate-distortion. Based on our proofs, we propose changes to the training loss of [4] to improve rate-distortion behaviour for object detection. Using the proposed method, we achieve superior rate-distortion performance for the object detection task, which we consider to be empirical evidence of the theory presented in our proofs. Finally, we explore the trade-off in rate-distortion performance between computer and human vision in scalable compression, and discuss the consequences of our approach in that regard.
II Background
II-A Feature Matching in Collaborative Intelligence
When splitting a model for use in CI, it is important to distinguish between models (known as single-stream) where layer operations are performed purely sequentially, and ones (known as multi-stream) where layers are combined in complex architectures including skip-connections and multi-resolution computations [9, 10, 11]. Often, a CV model will consist of some single-stream layers followed by a multi-stream section. We consider the output of a certain layer, $Y_k$, to be a single-stream feature if we can perform inference by obtaining $Y_k$ from the input and then passing it on to the rest of the model. Formally, we require $T = h_k(Y_k)$, where $T$ is the CV task output and $h_k$ is the operation described by all layers subsequent to $Y_k$. Conversely, if the output of the task model cannot be inferred solely from $Y_k$, we consider it a multi-stream feature.
In collaborative intelligence, the intermediate features produced on the edge device must be transmitted (or recreated from a latent representation) to be used downstream by the CV backend. Formally, we need to match, on the server side, the features at some small subset $\mathcal{K}$ of the model's layers, where $|\mathcal{K}|$ is much smaller than the total number of layers. Following our definitions above, we see that in the case $|\mathcal{K}| = 1$, we require the matched feature to be a single-stream feature. In most multi-stream models, this is often only achievable by selecting a relatively early layer. Unfortunately, as we will prove below, matching earlier features comes at a cost in terms of rate-distortion performance.
II-B Compression and Rate-Distortion Analysis
Compression, in general, can be divided into lossless compression, where a source random variable (RV) $X$ is reconstructed perfectly after decompressing, and lossy compression, where after decoding we are left with an approximation $\hat{X}$ (approximations or quantizations are denoted with a hat operator). For a single pair of observations $(x, \hat{x})$, we measure the amount of inaccuracy introduced by the approximation using some distortion metric $d(x, \hat{x})$. This leads to an important concept in lossy compression known as the rate-distortion function [12], given by the following:

$R_X(D) = \min_{p(\hat{x}|x)\,:\,\mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X})$  (1)

Here, $p(\hat{x}|x)$ is the conditional distribution of the approximation given the source; $\mathbb{E}[d(X,\hat{X})]$ is the expected distortion with respect to the joint distribution $p(x,\hat{x}) = p(x)\,p(\hat{x}|x)$; $I(X;\hat{X})$ is the mutual information between $X$ and $\hat{X}$; and $D$ is some value of distortion. Because the marginal distribution of the source, $p(x)$, is fixed, the joint distribution only changes through the conditional distribution of the approximation, $p(\hat{x}|x)$. Using that, we can understand the minimization in the rate-distortion function as finding an approximation that gives the lowest mutual information with the source while allowing an expected distortion no greater than $D$. Then, using the source coding theorem [12] and its converse, it can be shown that the resulting bitrate^1 is the lowest achievable rate (per source symbol) giving an expected distortion bounded by $D$. In this sense, $R_X(D)$ is a fundamental bound on the performance of lossy compression, similarly to entropy being a bound on lossless compression.

^1 This assumes we calculate mutual information using base-2 logarithms.
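As a concrete illustration of the minimization in (1), the classical Blahut–Arimoto algorithm numerically traces points on the rate-distortion curve of a discrete source. The sketch below is purely illustrative (it is not part of [4] or of our method): it computes one $(R, D)$ point for a uniform binary source under Hamming distortion, a case where the closed form $R(D) = 1 - H_b(D)$ is known.

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, n_iter=200):
    """Numerically find one point on the rate-distortion curve R(D).

    p_x  : source distribution p(x) over a finite alphabet
    d    : distortion matrix, d[i, j] = d(x_i, xhat_j)
    beta : slope parameter; larger beta trades rate for lower distortion
    Returns (rate in bits per symbol, expected distortion).
    """
    q = np.full(d.shape[1], 1.0 / d.shape[1])      # reproduction marginal q(xhat)
    for _ in range(n_iter):
        w = q * 2.0 ** (-beta * d)                 # unnormalized p(xhat|x)
        p_cond = w / w.sum(axis=1, keepdims=True)  # optimal test channel
        q = p_x @ p_cond                           # re-estimate q(xhat)
    joint = p_x[:, None] * p_cond                  # p(x, xhat)
    distortion = float((joint * d).sum())
    rate = float((joint * np.log2(p_cond / q)).sum())
    return rate, distortion

# Uniform binary source with Hamming distortion: R(D) = 1 - H_b(D)
rate, dist = blahut_arimoto(np.array([0.5, 0.5]),
                            np.array([[0.0, 1.0], [1.0, 0.0]]), beta=3.0)
```

For this symmetric source the returned pair satisfies $R \approx 1 - H_b(D)$, matching the closed-form curve.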
III Theoretical Discussion
As explained in Section II-A, a useful approach for communicating an image to a remote server for some downstream CV task is to perform feature matching on some intermediate layer of a DNN model. In fact, [4] provided theoretical and experimental proof that feature matching is preferable, in terms of rate-distortion, to transmitting an entire image to the cloud. We build on this proof and show that the rate-distortion function when matching deeper layers is lower (better) compared to matching earlier ones. To provide this proof we first need to define some notation. Let our CV model, with a task output previously denoted $T$, be defined as $f$, so that for a given input $X$ we have $T = f(X)$. Next, let the mapping from the input to a set of intermediate features $Y_1$ be $g_1$, so that $Y_1 = g_1(X)$, and the mapping from $Y_1$ to the output be $h_1$, so that $T = h_1(Y_1)$. Next, define a second, deeper set of intermediate features $Y_2$, and mappings $g_2$ and $h_2$, such that $Y_2 = g_2(Y_1)$ and $T = h_2(Y_2)$. Note that under this notation, $f = h_1 \circ g_1$ and $h_1 = h_2 \circ g_2$. This notation is illustrated graphically below.
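To make the model decomposition concrete, the toy sketch below instantiates the full model $f$, the sub-mappings $g_1$ (input to shallow features $Y_1$), $g_2$ ($Y_1$ to deeper features $Y_2$), and $h_2$ ($Y_2$ to the task output $T$) as tiny numpy functions with arbitrary, made-up weights; these are stand-ins for groups of DNN layers, not any actual CV model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary toy weights standing in for groups of DNN layers.
A = rng.normal(size=(16, 32))   # layers up to Y1
B = rng.normal(size=(8, 16))    # layers from Y1 to the deeper Y2
C = rng.normal(size=(4, 8))     # layers from Y2 to the task output T

g1 = lambda x:  np.maximum(A @ x, 0.0)   # X  -> Y1
g2 = lambda y1: np.maximum(B @ y1, 0.0)  # Y1 -> Y2
h2 = lambda y2: C @ y2                   # Y2 -> T
h1 = lambda y1: h2(g2(y1))               # Y1 -> T, i.e. h1 = h2 o g2
f  = lambda x:  h1(g1(x))                # full model, f = h1 o g1

x = rng.normal(size=32)
T = f(x)
```

The compositions hold by construction: evaluating the full model agrees with running the three stages one after another.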
In the case of compression for machines, we are mainly interested in the output of our CV task, and thus we measure the distortion at $T$. By denoting this distortion as $d_T$, we have:

$d_T(x, \hat{x}) = d\big(f(x), f(\hat{x})\big)$  (2)

It is important to note that the choice of distortion metric depends on the CV task, and might generally differ from the metrics used for human vision. Next, we define the set of all approximations, and their respective conditional distributions, which achieve a task distortion of at most $D$ as:

$\mathcal{P}_X(D) = \big\{p(\hat{x}|x) : \mathbb{E}[d_T(X,\hat{X})] \le D\big\}$  (3)

Using (2), we can obtain equivalent formulations for $\mathcal{P}_{Y_1}(D)$ and $\mathcal{P}_{Y_2}(D)$. This notation allows us to rewrite (1) as:

$R_X(D) = \min_{p(\hat{x}|x)\in\mathcal{P}_X(D)} I(X;\hat{X})$  (4)

Similarly, we can write equivalent formulations for $R_{Y_1}(D)$ and $R_{Y_2}(D)$. We are now ready to state our first result.
Theorem 1.
For any distortion level $D$, the minimum achievable rate for compressing $Y_1$ is an upper bound on the minimum achievable rate for compressing $Y_2$, that is,

$R_{Y_2}(D) \le R_{Y_1}(D)$  (5)

Equality occurs when $Y_1$ can be recreated exactly from $Y_2$.
Proof.
In [4], the authors prove that for any one intermediate layer $Y$, we have $R_Y(D) \le R_X(D)$. To prove our theorem, we simply need to show that we can replace the source $X$ with $Y_1$, as well as replace $Y$ with $Y_2$, all the while maintaining the conditions of the proof.

First, we recall that the only condition on the source is that it must have a fixed distribution, in the sense that it does not depend on the approximation (or quantization) process. However, since $Y_1$ is directly computed from $X$, its own distribution, which we denote $p(y_1)$, induced by $p(x)$, is also fixed (because $p(x)$ is fixed).

To finish, we want to show that the structure relating $X$, $Y$, and $T$ is maintained between $Y_1$, $Y_2$, and $T$. However, because there are no special requirements on the functions in the original formulation (not even that they are deterministic), this is trivial. To do this, we can simply replace the function $g_1$ with $g_2$, $h_1$ with $h_2$, and $f$ with $h_1$, thus concluding our proof. ∎
In [4], the authors define a two-layer^2 network where the base representation is used to compress an input for object detection by a YOLOv3 vision backend [9], and the enhancement is used (together with the base) for input reconstruction. To compress the image for object detection, the authors perform feature matching on the 13th layer of YOLOv3, because it is the deepest single-stream feature in YOLOv3.

^2 The word layer here refers to the base and enhancement portions of the model, and not to DNN layers.
In their three-layer model, the authors of [4] utilize the base and first enhancement layers to perform feature matching on the more complex architectures of Faster R-CNN [10] and Mask R-CNN [11], which are used for object detection and segmentation, respectively. There, because of the multi-stream structure of a feature pyramid network (which is part of both R-CNN architectures), they use an alternative approach to feature matching. In this method, the target feature is reconstructed in two steps. First, using an LST, a single-stream feature $Y_1$ is reconstructed at some relatively early layer. Next, using pre-trained portions of the CV model (denoted $g_2$, and collectively named the CV mid-model), several multi-stream features $Y_2$ are recreated. For clarity, when using this approach we refer to $Y_1$ as the partition point [3], and to $Y_2$ as the downstream features.
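The two-step reconstruction described above can be sketched as follows; this is a minimal toy pipeline in which the LST and the frozen CV mid-model are plain linear stand-ins, and all shapes and weights are hypothetical rather than those of any real R-CNN.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical LST: maps the decoded base latent to the partition-point feature.
W_lst = rng.normal(size=(16, 10))
def lst(latent):
    return np.maximum(W_lst @ latent, 0.0)

# Frozen "CV mid-model": pretrained layers mapping the partition point to two
# downstream multi-stream features (toy linear stand-ins for real CV layers).
W_a, W_b = rng.normal(size=(8, 16)), rng.normal(size=(4, 16))
def mid_model(y1_hat):
    return [np.maximum(W_a @ y1_hat, 0.0), np.maximum(W_b @ y1_hat, 0.0)]

latent = rng.normal(size=10)
y1_hat = lst(latent)             # step 1: reconstruct the partition point
downstream = mid_model(y1_hat)   # step 2: recreate downstream features
```

Note that only the single-stream partition-point feature is produced by a learned transform; the multi-stream downstream features come for free by running the frozen mid-model on it.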
Evaluating distortion in the manner used in Theorem 1 is equivalent to using the two-step approach to reconstruct the task output $T$ from the intermediate features $\hat{Y}_1$ or $\hat{Y}_2$. Using this distortion during training is impractical because it would require a labeled dataset of uncompressed images for each of the desired tasks. Unfortunately, datasets containing uncompressed images with CV task labels are rare, making supervised training of a scalable model impractical. Instead, both models in [4] use the mean squared error (MSE) of the reconstructed features to evaluate distortion during training. We will prove next that the use of deeper layers in this practical setting is still beneficial.
To evaluate the effect of measuring distortion on intermediate features, we need to extend the previously defined notation. First, we define the distortion measured at some variable $V$ as $d_V$. Next, the set of all (including two-step) reconstructions of $Y_2$ from $Y_1$ that achieve a distortion of no more than $D$ (as measured at $Y_2$) is defined as:

$\mathcal{P}_{Y_1 \to Y_2}(D) = \big\{p(\hat{y}_2|y_1) : \mathbb{E}[d_{Y_2}(Y_2,\hat{Y}_2)] \le D\big\}$  (6)

Analogously, the rate-distortion function of $Y_2$ from $Y_1$ is:

$R_{Y_1 \to Y_2}(D) = \min_{p(\hat{y}_2|y_1)\in\mathcal{P}_{Y_1 \to Y_2}(D)} I(Y_1;\hat{Y}_2)$  (7)

Note that $R_{Y_1 \to Y_1}(D)$ is simply the conventional rate-distortion function of $Y_1$. Finally, we say that a function $g$ has a distortion magnitude of $m$ if, for any conditional distribution $p(\hat{y}|y)$, we have $\mathbb{E}\big[d\big(g(Y), g(\hat{Y})\big)\big] \le m \cdot \mathbb{E}\big[d(Y,\hat{Y})\big]$.^3 We are now ready to state and prove our next result.

^3 Note that commonly used operations, such as batch normalization, are designed to maintain relatively uniform feature magnitude in each layer, which in turn promotes a distortion magnitude close to 1. Alternatively, the distortion function can be scaled to achieve this behaviour in practice.

Theorem 2.
Given a function $g_2$ with a distortion magnitude of $m \le 1$, the minimum achievable rate for compressing $Y_2$ from $Y_1$ is upper bounded by the minimum achievable rate for compressing $Y_1$ from itself. That is:

$R_{Y_1 \to Y_2}(D) \le R_{Y_1 \to Y_1}(D)$  (8)
Proof.
We begin by assuming we have an approximation $\hat{Y}_1$, represented by $p^*(\hat{y}_1|y_1)$, which achieves the conventional rate-distortion function of $Y_1$. This means that $I(Y_1;\hat{Y}_1) = R_{Y_1}(D)$ and also $\mathbb{E}[d_{Y_1}(Y_1,\hat{Y}_1)] \le D$. We then define $\hat{Y}_2 = g_2(\hat{Y}_1)$, which induces a conditional distribution $p(\hat{y}_2|y_1)$. Because the distortion magnitude of $g_2$ is $m \le 1$, we know that $\mathbb{E}[d_{Y_2}(Y_2,\hat{Y}_2)] \le D$, as:

$\mathbb{E}[d_{Y_2}(Y_2,\hat{Y}_2)] = \mathbb{E}\big[d_{Y_2}\big(g_2(Y_1), g_2(\hat{Y}_1)\big)\big] \le m \cdot \mathbb{E}[d_{Y_1}(Y_1,\hat{Y}_1)] \le m \cdot D \le D$  (9)

Next, we note that the conditional distribution $p(\hat{y}_2|y_1)$ is induced by the following Markov chain: $Y_1 \to \hat{Y}_1 \to \hat{Y}_2$. We apply the data processing inequality [12] to this chain to get $I(Y_1;\hat{Y}_2) \le I(Y_1;\hat{Y}_1)$.

Because we have already shown that $\mathbb{E}[d_{Y_2}(Y_2,\hat{Y}_2)] \le D$, we know that $p(\hat{y}_2|y_1) \in \mathcal{P}_{Y_1 \to Y_2}(D)$, and can use the definition of the rate-distortion function of $Y_2$ from $Y_1$. As defined in (7), $R_{Y_1 \to Y_2}(D)$ is the minimum of the mutual information across all conditional distributions in $\mathcal{P}_{Y_1 \to Y_2}(D)$, and thus we see that $R_{Y_1 \to Y_2}(D) \le I(Y_1;\hat{Y}_2) \le I(Y_1;\hat{Y}_1) = R_{Y_1 \to Y_1}(D)$, thereby concluding our proof. ∎
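The data processing inequality step in the proof can be checked numerically. The toy example below builds a chain of a binary "feature" passed through two noisy stages (binary alphabets and symmetric channels are illustrative choices only, not part of the proof) and verifies that mutual information with the source cannot increase along the chain.

```python
import numpy as np

def mutual_info_bits(joint):
    """Mutual information (in bits) of a 2-D discrete joint distribution."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

def bsc(eps):
    """Transition matrix of a binary symmetric channel, flip probability eps."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

p_y1 = np.array([0.5, 0.5])       # toy binary stand-in for the feature Y1
A = bsc(0.1)                      # Y1 -> Y1_hat (the lossy approximation)
B = bsc(0.2)                      # Y1_hat -> Y2_hat (further processing)

joint_1 = np.diag(p_y1) @ A       # p(y1, y1_hat)
joint_2 = np.diag(p_y1) @ A @ B   # p(y1, y2_hat) through the Markov chain

I1 = mutual_info_bits(joint_1)    # I(Y1; Y1_hat) = 1 - H_b(0.1)
I2 = mutual_info_bits(joint_2)    # I(Y1; Y2_hat) = 1 - H_b(0.26)
```

Here the cascade of the two channels is itself a binary symmetric channel with crossover 0.1·0.8 + 0.9·0.2 = 0.26, so both mutual informations have closed forms, and I2 ≤ I1 as the inequality requires.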
IV Experiments
IV-A Experimental setup
To verify that our results hold in a practical setting, we modify the two-layer model presented in [4]. Utilizing the original training algorithm, we use the following loss formulation to ensure good performance in terms of both task performance and bitrate:

$L = R + \lambda_{\text{enh}} \cdot D_{\text{enh}} + \lambda_{\text{base}} \cdot D_{\text{base}}$  (10)

Here $L$ is the training loss; $R$ is the bitrate as estimated by a learned entropy model inspired by [8]; $D_{\text{enh}}$ and $D_{\text{base}}$ are the distortions of the image reconstruction and feature matching, respectively (both use mean squared error, MSE, as their metric); and $\lambda_{\text{enh}}$ and $\lambda_{\text{base}}$ are Lagrange multipliers. By training the model using various values of the Lagrange multipliers, we produce rate-distortion curves for object detection and input reconstruction.

Similarly to the approach taken in the three-layer case, we train the two-layer model using two-step feature matching, which can be seen in Fig. 1. In practice, using this two-step approach is equivalent to changing the distortion from being measured at the partition point to measuring it at some downstream layer features. Based on the multi-resolution structure of YOLOv3, we explore several combinations of layers, seen in Table I, to use for feature matching during training. When using more than one layer, we combine the distortions by first flattening the tensors in question and concatenating them, before calculating the MSE. Note that layer 108 is the final layer of YOLOv3, and its output consists of multiple tensors, corresponding to different resolutions.
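The flatten-and-concatenate distortion and the loss of (10) can be sketched as follows; the tensor shapes, the bitrate value, and the $\lambda$ values below are placeholders for illustration, not the trained model's actual quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in feature tensors from two YOLOv3 layers (shapes are illustrative).
target  = [rng.normal(size=(256, 13, 13)), rng.normal(size=(512, 26, 26))]
matched = [t + 0.1 * rng.normal(size=t.shape) for t in target]

def multi_layer_mse(feats_a, feats_b):
    """Flatten every tensor, concatenate, then take a single MSE."""
    a = np.concatenate([f.ravel() for f in feats_a])
    b = np.concatenate([f.ravel() for f in feats_b])
    return float(np.mean((a - b) ** 2))

# Loss of (10); R and the lambda values are placeholder numbers.
R = 1.25                                   # bitrate estimate (entropy model)
D_enh, D_base = 0.002, multi_layer_mse(target, matched)
lambda_enh, lambda_base = 0.01, 0.006
loss = R + lambda_enh * D_enh + lambda_base * D_base
```

One consequence of the concatenation is that each layer's contribution is weighted by its element count, so larger tensors dominate the combined distortion.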
Table I

| Method | Feature Layers | Base BD-Rate [%] | Enhancement BD-PSNR [dB] |
|---|---|---|---|
| HEVC | N/A | | |
| VVC | N/A | | |
| Cheng et al. [8] | N/A | | |
| Choi et al. [4] | 13 | | |
| Choi et al. [4], $\lambda_{\text{base}} = 0.012$ | 13 | | |
| Proposed, Mid-Model 1 | | | |
| Proposed, Mid-Model 2 | | | |
| Proposed, Mid-Model 3 | | | |
We train our model following a scheme similar to the one presented in [4], using a batch size of 16 and training in two stages. At first, we train for 400 epochs using a dataset comprised of randomly cropped patches from both the CLIC [13] and JPEG-AI [14] datasets. Then we replace the dataset with a subset of Vimeo-90K [15] and proceed to train for another 350 epochs (the use of a subset is due to the large number of models to be trained). The model is trained using the Adam optimizer with a fixed learning rate for the first stage, which is then reduced with a polynomial decay every 10 epochs during the second stage.

After training, we evaluate our models in terms of rate-distortion performance for both the object detection (base) and reconstruction (enhancement) tasks. For the base task we use the validation set of the COCO2014 dataset [16], which we resize following the procedure in [4]. In accordance with common practice for object detection, we use the mean average precision (mAP) as our metric. For the enhancement task we use the Kodak dataset [17], and PSNR as our evaluation metric. As we have proven in our theoretical discussion, we expect our modifications to lead to improved rate-distortion performance in the computer vision task, perhaps at some cost to the reconstruction task.

We compare our model against the original two-layer model, which we retrain using the same data for a fair comparison. Since our method favors the base task slightly, we also train a second version of the two-layer model, using the original layers for distortion but with a larger Lagrange multiplier for the base task, $\lambda_{\text{base}} = 0.012$. Lastly, we also compare our model against three baselines: HEVC [18], VVC [19], and a learned compression model [8], where an image is fully decoded on the server side before being passed to YOLOv3 for detection.
IV-B Results
We begin by reviewing the results for the base task, as seen in Fig. 2. We can see that, as expected, the use of deeper layers has led to improved rate-distortion performance, with the largest improvement obtained using the final layer of YOLOv3. To quantify the differences in rate-distortion performance, we compute the BD-Rate metric [20], which measures the average difference in bits required to achieve equal performance (in terms of mAP in this case) relative to a baseline (we use [4]). Once again, using the final layer achieves the best performance. Interestingly, increasing the Lagrange multiplier for the base task ($\lambda_{\text{base}}$ in (10)) to 0.012 had an effect comparable to that of our method using Mid-Model 1 or 2.
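The BD-Rate computation follows Bjøntegaard's method [20]: fit log-rate as a cubic polynomial of quality, integrate the gap between the two fitted curves over the shared quality range, and convert the mean log-rate difference into a percentage. A sketch with made-up curve data (the quality and rate values below are purely illustrative):

```python
import numpy as np

def bd_rate(q_anchor, rate_anchor, q_test, rate_test):
    """Average % bitrate difference of the test curve vs. the anchor at
    equal quality (e.g., mAP or PSNR), per Bjontegaard's method [20]."""
    lo = max(min(q_anchor), min(q_test))      # shared quality interval
    hi = min(max(q_anchor), max(q_test))
    p_anchor = np.polyfit(q_anchor, np.log10(rate_anchor), 3)
    p_test = np.polyfit(q_test, np.log10(rate_test), 3)
    # Integrate each fitted log-rate curve over the shared interval.
    int_anchor = np.diff(np.polyval(np.polyint(p_anchor), [lo, hi]))[0]
    int_test = np.diff(np.polyval(np.polyint(p_test), [lo, hi]))[0]
    avg_log_diff = (int_test - int_anchor) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0

# Sanity check: a curve needing exactly twice the rate at every quality
# point should come out as +100% BD-Rate.
q = [30.0, 32.0, 34.0, 36.0]             # quality points (made-up)
anchor = [100.0, 200.0, 400.0, 800.0]    # rates (made-up)
test = [2.0 * r for r in anchor]
bd = bd_rate(q, anchor, q, test)
```

BD-PSNR, used later for the enhancement task, follows the same recipe with the roles of rate and quality exchanged.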
Next, we observe our results on the enhancement task, shown in Fig. 3. We summarise the performance using the BD-PSNR metric, which measures the average difference in PSNR at equal bitrate compared to a baseline (we use VVC [19]), as seen in Table I. Comparing the performance here to the base task, we clearly see the trade-off between the two. This is most notable in the performance of our method using Mid-Model 3, which was the best method for object detection, and is the weakest here on image reconstruction. Importantly, our method using Mid-Model 1 or 2 performs slightly better than the modified version of [4] using $\lambda_{\text{base}} = 0.012$, which had comparable performance on the base task.
V Summary and Discussion
The growing prevalence of DNNs in the field of computer vision has led to an increasing volume of images seen predominantly by machines. This, in turn, requires efficient coding methods for machines and a good understanding of their rate-distortion behaviour. In this paper we provided important theoretical background for understanding the implications of layer selection when encoding an image for feature matching in a downstream DNN model. We have proved that using deeper layers achieves superior rate-distortion performance compared to earlier ones. This is true both when evaluating distortion at the task level, and when the distortion is measured at some intermediate point.
To demonstrate the usefulness of our theoretical results, we used them to modify the training loss of an image coding method for humans and machines [4]. The results clearly show that using our approach yields improvement in the rate-distortion performance of the machine side of the model, as expected. Importantly, the modifications do not require any change to the encoder-side architecture, or to the decoding process, making our approach easy to implement in practice. This is critical as it does not require more edge-device resources than simply using the learned image coding backbone [8], on which it is based.
As might have been expected, the improved performance on the machine side comes at some cost on the human side. Using a hyperparameter ($\lambda_{\text{base}}$), the framework in [4] allows a designer to balance between performance on each side during training. When considering either model for a specific application, one might estimate the frequency at which images will need to be fully reconstructed and adjust the training procedure accordingly. Notably, when both models are adjusted to give equal rate-distortion performance on object detection, our model resulted in superior performance on image reconstruction compared with [4].
References

[1] J. Chai, H. Zeng, A. Li, and E. W. Ngai, "Deep learning in computer vision: A critical review of emerging techniques and application scenarios," Machine Learning with Applications, vol. 6, p. 100134, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666827021000670
[2] J. Dean, D. Patterson, and C. Young, "A new golden age in computer architecture: Empowering the machine-learning revolution," IEEE Micro, vol. 38, no. 2, pp. 21–29, 2018.
[3] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
[4] H. Choi and I. V. Bajić, "Scalable image coding for humans and machines," IEEE Transactions on Image Processing, vol. 31, pp. 2739–2754, 2022.
[5] S. R. Alvar and I. V. Bajić, "Multi-task learning with compressible features for collaborative intelligence," in Proc. IEEE ICIP, Sep. 2019, pp. 1705–1709.
[6] H. Choi and I. V. Bajić, "Near-lossless deep feature compression for collaborative intelligence," in Proc. IEEE MMSP, Aug. 2018.
[7] H. Choi and I. V. Bajić, "Deep feature compression for collaborative object detection," in Proc. IEEE ICIP, 2018.
[8] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, "Learned image compression with discretized Gaussian mixture likelihoods and attention modules," in Proc. IEEE/CVF CVPR, 2020, pp. 7939–7948.
[9] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, Apr. 2018.
[10] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE/CVF ICCV, 2017, pp. 2961–2969.
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley, 2006.
[13] "Challenge on learned image compression (CLIC)," [Online]: http://www.compression.cc/, accessed: 2020-10-26.
[14] "JPEG AI dataset," [Online]: https://jpeg.org/jpegai/dataset.html, accessed: 2020-10-26.
[15] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, "Video enhancement with task-oriented flow," Int. J. Comput. Vision, vol. 127, no. 8, pp. 1106–1125, 2019.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. ECCV, Sept. 2014.
[17] E. Kodak, "Kodak lossless true color image suite (PhotoCD PCD0992)," http://r0k.us/graphics/kodak, accessed: 2019-03-19.
[18] "High efficiency video coding," Rec. ITU-T H.265 and ISO/IEC 23008-2, 2019, Int. Telecommun. Union (ITU-T) and Int. Org. for Standardization / Int. Electrotech. Commission (ISO/IEC JTC 1).
[19] "Versatile video coding," Rec. ITU-T H.266 and ISO/IEC 23090-3, 2020, Int. Telecommun. Union (ITU-T) and Int. Org. for Standardization / Int. Electrotech. Commission (ISO/IEC JTC 1).
[20] G. Bjøntegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, Apr. 2001. [Online]. Available: https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc