In recent years, Deep Neural Networks (DNNs) have become the preferred method for solving many computer vision (CV) problems [1]. At the same time, the complexity of these models, in terms of both memory and floating-point operations, has grown tremendously [2]. This growth, alongside the increasing variety of CV models and their uses, poses a challenge for deployment to low-resource edge devices such as mobile phones and smart speakers. For this reason, in many CV tasks, the most common approach today is to simply avoid deploying resource-intensive models to edge devices. Instead, the bulk of the computation is performed on remote servers (the cloud) equipped with powerful graphics processing units (GPUs) or tensor processing units (TPUs).
Transmitting an image for inference on the cloud using traditional coding methods, which are designed to accommodate human perception, has been shown to be sub-optimal both theoretically and empirically [4, 3]. Collaborative intelligence (CI) [3] suggests dividing a given CV model into a frontend, deployed on the edge, and a backend, which remains on the cloud. Inference can then be started on the edge, producing intermediate representations at some layer of the CV model. These are then sent to the cloud backend, where inference is completed. Of course, this approach requires an efficient method for communicating said representations (generally floating-point tensors) to the cloud, which has been explored in several subsequent works [5, 6, 7].
Inspired by CI, the work of [4] proposes replacing the CV frontend with an encoder from a learnable image coding backbone [8]. This replaces the floating-point tensors with highly compressible latent features, named the base representation. On the decoder side, these can be used to match the intermediate features expected by the CV backend, which can then perform inference. The matching of intermediate CV features from the compressible latent space is performed by a learned transform named the Latent Space Transform (LST). An important aspect of [4] is that it allows for human viewing of the analysed image in a scalable manner: the base representation is combined with a second representation, named enhancement, to reconstruct the original image on the cloud side. In terms of rate-distortion, this method outperforms traditional as well as learnable coding schemes on the CV task, without significant degradation in reconstruction quality for human viewing.
In this work, we further investigate the behaviour of [4] in the context of both object detection and input reconstruction. We provide theoretical proof that performing feature matching at deeper layers of a CV model is beneficial in the rate-distortion sense. Based on our proofs, we propose changes to the training loss of [4] to improve rate-distortion behaviour for object detection. Using the proposed method, we achieve superior rate-distortion performance on the object detection task, which we consider empirical evidence for the theory presented in our proofs. Finally, we explore the trade-off in rate-distortion performance between computer and human vision in scalable compression, and discuss the consequences of our approach in that regard.
II-A Feature Matching in Collaborative Intelligence
When splitting a model for use in CI, it is important to distinguish between models (known as single-stream) where layer operations are performed purely sequentially, and ones (known as multi-stream) where layers are combined in complex architectures including skip-connections and multi-resolution computations [9, 10, 11]. Often, a CV model comprises some single-stream layers followed by a multi-stream section. We consider the output of a certain layer $\ell$, denoted $F_\ell$, to be a single-stream feature if we can perform inference by obtaining $F_\ell$ from the input and then passing it on to the rest of the model. Formally, we require $T = W_\ell(F_\ell)$, where $T$ is the CV task output and $W_\ell$ is the operation described by all layers after $\ell$. Conversely, if the output of the task model cannot be inferred solely from $F_\ell$, we consider it a multi-stream feature.
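To make the single-stream condition concrete, the following NumPy sketch builds a toy three-layer "model" (the layer functions here are hypothetical stand-ins, not part of any real CV backbone) and checks that inference through an edge/cloud split reproduces the full model's output:

```python
import numpy as np

# Toy illustration of the single-stream condition T = W_l(F_l);
# all layer functions are hypothetical stand-ins.
def layer1(x): return np.tanh(x)
def layer2(f): return np.maximum(f, 0.0)   # ReLU
def layer3(f): return f.sum(axis=-1)       # task head

def full_model(x):
    return layer3(layer2(layer1(x)))

# F_l = output of layer1 is a single-stream feature: the task output
# can be computed from it alone by the remaining layers W_l.
def head(x): return layer1(x)              # edge-side frontend
def tail(f): return layer3(layer2(f))      # cloud-side backend W_l

x = np.random.default_rng(0).normal(size=(4, 8))
assert np.allclose(full_model(x), tail(head(x)))
```

In a multi-stream model, `tail` would additionally need features that `head` does not output, and this equality would not hold for a single transmitted tensor.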
In collaborative intelligence, the intermediate features produced on the edge device must be transmitted (or recreated from a latent representation) to be used downstream by the CV backend. Formally, we need to match, on the server side, the features at some small subset of the model's layers. Following the definitions above, we see that when this subset contains a single layer $\ell$, we require $F_\ell$ to be a single-stream feature. In most multi-stream models, this is often only achievable by selecting a relatively early layer. Unfortunately, as we will prove below, matching earlier features comes at a cost in terms of rate-distortion performance.
II-B Compression and Rate-Distortion Analysis
Compression, in general, can be divided into lossless compression, where a source random variable (RV) $X$ is reconstructed perfectly after decompressing, and lossy compression, where after decoding we are left with an approximation $\hat{X}$ (approximations or quantizations are denoted with a hat operator). For a single pair of observations $(x, \hat{x})$, we measure the amount of inaccuracy introduced by the approximation using some distortion metric $d(x, \hat{x})$. This leads to an important concept in lossy compression known as the rate-distortion function $R(D)$, given by the following:

$$R(D) = \min_{p(\hat{x}|x):\; \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X}).$$

Here, $p(\hat{x}|x)$ is the conditional distribution of the approximation given the source; $\mathbb{E}[d(X, \hat{X})]$ is the expected distortion with respect to the joint distribution $p(x, \hat{x})$; $I(X; \hat{X})$ is the mutual information between $X$ and $\hat{X}$; and $D$ is some value of distortion.

Because the marginal distribution of the source, $p(x)$, is fixed, the joint distribution $p(x, \hat{x})$ only changes through the conditional distribution of the approximation, $p(\hat{x}|x)$. Using that, we can understand the minimization in the rate-distortion function as finding an approximation that gives the lowest mutual information with the source while allowing an expected distortion no greater than $D$. Then, using the source coding theorem [12] and its converse, it can be shown that the resulting bit-rate^1 is the lowest achievable rate (per source symbol) giving an expected distortion bounded by $D$. In this sense, $R(D)$ is a fundamental bound on the performance of lossy compression, similarly to entropy being a bound on lossless compression.

^1 This assumes mutual information is computed using a base-2 logarithm, i.e., in bits.
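As a concrete instance of a rate-distortion function, the classical closed-form result for a Gaussian source under squared-error distortion is $R(D) = \tfrac{1}{2}\log_2(\sigma^2/D)$ bits for $0 < D \le \sigma^2$, and zero otherwise. A minimal sketch:

```python
import numpy as np

def gaussian_rd(var, d):
    """Rate-distortion function of a N(0, var) source under squared-error
    distortion: R(D) = 0.5*log2(var/D) bits for 0 < D <= var, else 0."""
    return 0.5 * np.log2(var / d) if 0 < d <= var else 0.0

# R(D) is non-increasing in D and reaches zero at D = var.
rates = [gaussian_rd(1.0, d) for d in (0.25, 0.5, 1.0)]
assert rates == sorted(rates, reverse=True)
assert gaussian_rd(1.0, 1.0) == 0.0
assert abs(gaussian_rd(1.0, 0.25) - 1.0) < 1e-12   # 0.5*log2(4) = 1 bit
```

This illustrates the general behaviour discussed above: allowing more distortion strictly lowers the minimum achievable rate, down to zero rate once the allowed distortion matches the source variance.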
III Theoretical Discussion
As explained in Section II-A, a useful approach for communicating an image to a remote server for some downstream CV task is to perform feature matching at some intermediate layer of a DNN model. In fact, [4] provided theoretical and experimental proof that feature matching is preferable, in terms of rate-distortion, to transmitting an entire image to the cloud. We build on this proof and show that the rate-distortion function when matching deeper layers is lower (better) compared to matching earlier ones. To provide this proof, we first need to define some notation. Let our CV model, with the task output previously denoted $T$, be defined as a mapping $\mathcal{N}$, so that for a given input $X$ we have $T = \mathcal{N}(X)$. Next, let the mapping from the input to a set of intermediate features $Y_1$ be $f_1$, so that $Y_1 = f_1(X)$, and the mapping from $Y_1$ to the output be $g_1$, so that $T = g_1(Y_1)$. Next, define a second, deeper set of intermediate features $Y_2$ and mappings $f_2$ and $g_2$ such that $Y_2 = f_2(Y_1)$ and $T = g_2(Y_2)$. Note that under this notation, $\mathcal{N} = g_1 \circ f_1$ and $g_1 = g_2 \circ f_2$. This notation is illustrated graphically below.
In the case of compression for machines, we are mainly interested in the output of our CV task, and thus we measure the distortion at $T$. Denoting this distortion as $d_T$, for an approximation $\hat{X}$ of the input we have:

$$d_T(X, \hat{X}) = d\big(\mathcal{N}(X), \mathcal{N}(\hat{X})\big).$$

It is important to note that the choice of the underlying distortion metric $d$ depends on the CV task, and might generally differ from the metrics used for human vision. Next, we define the set of all approximations, and their respective conditional distributions, which achieve a task distortion of at most $D$ as:

$$\mathcal{P}_X(D) = \big\{\, p(\hat{x}\,|\,x) \,:\, \mathbb{E}\big[d_T(X, \hat{X})\big] \le D \,\big\}.$$

Similarly, we can write equivalent formulations $\mathcal{P}_{Y_1}(D)$ and $\mathcal{P}_{Y_2}(D)$ for approximations of $Y_1$ and $Y_2$, with the task distortion measured through $g_1$ and $g_2$, respectively. We are now ready to state our first result.
Theorem 1. For any distortion level $D$, the minimum achievable rate for compressing $Y_1$ is an upper bound on the minimum achievable rate for compressing $Y_2$; that is,

$$R_{Y_2}(D) \le R_{Y_1}(D).$$

Equality occurs when $Y_1$ can be recreated exactly from $Y_2$.
Proof. In [4], the authors prove that for any one intermediate feature $Y$, we have $R_Y(D) \le R_X(D)$. To prove our theorem, we simply need to show that we can replace $X$ with $Y_1$, as well as replace $Y$ with $Y_2$, all the while maintaining the conditions of the proof.
First, we recall that the only condition on the source is that it must have a fixed distribution, in the sense that it does not depend on the approximation (or quantization) process. However, since $Y_1$ is directly computed from $X$, its own distribution, which we denote $p_{Y_1}$, induced by $p_X$, is also fixed (because $p_X$ is fixed).
To finish, we want to show that the structure relating $X$, $Y$, and $T$ is maintained between $Y_1$, $Y_2$, and $T$. However, because there are no special requirements on the functions in the original formulation (not even that they are deterministic), this is trivial: we simply replace the mapping from the source to the intermediate feature with $f_2$, the mapping from the intermediate feature to the task with $g_2$, and the mapping from the source to the task with $g_1$, thus concluding our proof. ∎
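Theorem 1 concerns lossy rates, but its lossless analogue offers a quick intuition check: a deterministic, many-to-one mapping (such as a deeper feature computed from an earlier one) can never increase the entropy, and hence the lossless rate, of what must be coded. A small empirical illustration (an intuition aid under toy assumptions, not a proof of the theorem):

```python
import numpy as np
from collections import Counter

def entropy(symbols):
    """Empirical entropy (bits) of a sequence of discrete symbols."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
x = rng.integers(0, 8, size=10000)   # earlier feature (toy discrete source)
y = x // 2                           # deeper feature: a many-to-one function
# The deterministic function cannot increase the coding cost:
assert entropy(y) <= entropy(x) + 1e-9
```

In the lossy setting the same mechanism appears through the data processing inequality, which is used again in the proof of Theorem 2 below.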
In [4], the authors define a two-layer^2 network, where the base representation is used to compress an input for object detection by a YOLOv3 vision backend [9], and the enhancement is used (together with the base) for input reconstruction. To compress the image for object detection, the authors perform feature matching at the 13th layer of YOLOv3, because it is the deepest single-stream feature in YOLOv3.

^2 The word layer here refers to the base and enhancement portions of the model, and not to DNN layers.
In their three-layer model, the authors of [4] utilize the base and first enhancement layers to perform feature matching on the more complex architectures of Faster R-CNN [10] and Mask R-CNN [11], which are used for object detection and segmentation, respectively. There, because of the multi-stream structure of the feature pyramid network (which is part of both R-CNN architectures), they use an alternative approach to feature matching, in which the target features are reconstructed in two steps. First, using an LST, a single-stream feature is reconstructed at some relatively early layer. Next, using pre-trained portions of the CV model (collectively named the CV mid-model), several multi-stream features are recreated. For clarity, when using this approach we refer to the single-stream feature as the partition point, and to the recreated multi-stream features as downstream features.
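The two-step reconstruction can be sketched as follows; the LST and mid-model below are trivial stand-ins (their names, shapes, and operations are hypothetical), intended only to show the data flow from decoded latent, to partition point, to downstream features:

```python
import numpy as np

# Hypothetical stand-ins sketching two-step feature matching.
def lst(latent):
    """Latent Space Transform: latent -> single-stream partition-point
    feature (stub for a learned transform)."""
    return latent * 2.0

def mid_model(y):
    """Pre-trained CV mid-model: partition point -> several multi-stream
    downstream features (stubs for e.g. feature-pyramid levels)."""
    return {"p3": y[..., ::2], "p4": y[..., ::4]}

latent = np.ones((1, 16))     # decoded base representation (stub)
y_part = lst(latent)          # step 1: reconstruct the partition point
features = mid_model(y_part)  # step 2: recreate downstream features
assert set(features) == {"p3", "p4"}
```

Only the LST is trained for matching; the mid-model reuses frozen, pre-trained portions of the CV backend, which is what makes the two-step approach compatible with multi-stream architectures.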
Evaluating distortion in the manner used in Theorem 1 is equivalent to using the two-step approach to reconstruct the task output $T$ from the intermediate features $Y_1$ or $Y_2$. Using this distortion during training is impractical, because it would require a labeled dataset of uncompressed images for each of the desired tasks. Unfortunately, datasets containing uncompressed images with CV task labels are rare, making supervised training of a scalable model impractical. Instead, both models in [4] use the mean squared error (MSE) of the reconstructed features to evaluate distortion during training. We will prove next that the use of deeper layers in this practical setting is still beneficial.
To evaluate the effect of measuring distortion on intermediate features, we need to extend the previously defined notation. First, we define the distortion measured at some variable $V$ as $d_V(V, \hat{V})$. Next, the set of all (including two-step) reconstructions of $Y_2$ from $Y_1$, which achieve a distortion of no more than $D$ (as measured at $Y_2$), is defined as:

$$\mathcal{P}_{Y_2|Y_1}(D) = \big\{\, p(\hat{y}_2\,|\,y_1) \,:\, \mathbb{E}\big[d_{Y_2}(Y_2, \hat{Y}_2)\big] \le D \,\big\}.$$

Analogously, the rate-distortion function of $Y_2$ from $Y_1$ is:

$$R_{Y_2|Y_1}(D) = \min_{p(\hat{y}_2|y_1)\, \in\, \mathcal{P}_{Y_2|Y_1}(D)} I(Y_1; \hat{Y}_2). \quad (7)$$
Note that $R_{Y_1|Y_1}(D)$ is simply the conventional rate-distortion function of $Y_1$, denoted $R_{Y_1}(D)$. Finally, we say that a function $h$ has a distortion magnitude of $m$ if, for any conditional distribution $p(\hat{v}|v)$, we have $\mathbb{E}\big[d_{h(V)}(h(V), h(\hat{V}))\big] \le m \cdot \mathbb{E}\big[d_V(V, \hat{V})\big]$.^3

^3 Note that commonly used operations, such as batch normalization, are designed to maintain relatively uniform feature magnitude in each layer, which in turn promotes a distortion magnitude close to 1. Alternatively, the distortion function can be scaled to achieve this behaviour in practice.

We are now ready to state and prove our next result.
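One easy source of functions with distortion magnitude at most 1 under MSE is the family of 1-Lipschitz operations; for example, ReLU never increases squared error between a value and its approximation. A quick numerical check (illustrative, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
y1 = rng.normal(size=100000)                          # a feature Y1
y1_hat = y1 + rng.normal(scale=0.1, size=y1.shape)    # a noisy approximation

relu = lambda v: np.maximum(v, 0.0)                   # 1-Lipschitz map
d_in = np.mean((y1 - y1_hat) ** 2)                    # MSE before the map
d_out = np.mean((relu(y1) - relu(y1_hat)) ** 2)       # MSE after the map
# ReLU is 1-Lipschitz, so under MSE its distortion magnitude is at most 1:
assert d_out <= d_in + 1e-12
```

The same argument applies to any layer whose pointwise Lipschitz constant is bounded by 1, which is what the batch-normalization remark above promotes in practice.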
Theorem 2. Given a function $f_2$ with a distortion magnitude of at most 1, the minimum achievable rate for compressing $Y_2$ from $Y_1$ is upper bounded by the minimum achievable rate for compressing $Y_1$ from itself. That is:

$$R_{Y_2|Y_1}(D) \le R_{Y_1}(D).$$
Proof. We begin by assuming we have an approximation $\hat{Y}_1$, represented by $p^*(\hat{y}_1|y_1)$, which achieves the conventional rate-distortion function of $Y_1$. This means that $\mathbb{E}[d_{Y_1}(Y_1, \hat{Y}_1)] \le D$ and also $I(Y_1; \hat{Y}_1) = R_{Y_1}(D)$. We then define $\hat{Y}_2 = f_2(\hat{Y}_1)$, which induces a conditional distribution $p(\hat{y}_2|y_1)$. Because the distortion magnitude of $f_2$ is at most 1, we know that $p(\hat{y}_2|y_1) \in \mathcal{P}_{Y_2|Y_1}(D)$, as:

$$\mathbb{E}\big[d_{Y_2}(Y_2, \hat{Y}_2)\big] \le \mathbb{E}\big[d_{Y_1}(Y_1, \hat{Y}_1)\big] \le D.$$
Next, we note that the conditional distribution $p(\hat{y}_2|y_1)$ is induced by the following Markov chain:

$$Y_1 \rightarrow \hat{Y}_1 \rightarrow \hat{Y}_2.$$

We apply the data processing inequality [12] to this chain to get $I(Y_1; \hat{Y}_2) \le I(Y_1; \hat{Y}_1)$.
Because we have already shown $p(\hat{y}_2|y_1) \in \mathcal{P}_{Y_2|Y_1}(D)$, we can use the definition of the rate-distortion function of $Y_2$ from $Y_1$. As defined in (7), $R_{Y_2|Y_1}(D)$ is the minimum of the mutual information across all conditional distributions in $\mathcal{P}_{Y_2|Y_1}(D)$, and thus we see that $R_{Y_2|Y_1}(D) \le I(Y_1; \hat{Y}_2) \le I(Y_1; \hat{Y}_1) = R_{Y_1}(D)$, thereby concluding our proof. ∎
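The data processing inequality step can be checked numerically on discrete empirical distributions. The sketch below (with an arbitrary toy source and noise model) verifies $I(Y_1; \hat{Y}_2) \le I(Y_1; \hat{Y}_1)$ for a chain in which $\hat{Y}_2$ is a deterministic function of $\hat{Y}_1$:

```python
import numpy as np

def mutual_info(joint):
    """I(A;B) in bits from a joint probability table."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

def joint(a, b):
    """Empirical joint distribution of two sequences over {0,...,3}."""
    t = np.zeros((4, 4))
    np.add.at(t, (a, b), 1.0)
    return t / len(a)

rng = np.random.default_rng(3)
n = 200000
y1 = rng.integers(0, 4, size=n)                # source Y1 (toy)
y1_hat = (y1 + (rng.random(n) < 0.1)) % 4      # noisy approximation of Y1
y2_hat = y1_hat // 2                           # Y2_hat = f2(Y1_hat), deterministic

# Data processing inequality along the chain Y1 -> Y1_hat -> Y2_hat:
assert mutual_info(joint(y1, y2_hat)) <= mutual_info(joint(y1, y1_hat)) + 1e-9
```

Because the last step is a deterministic function applied sample-wise, the empirical joint distribution satisfies the Markov chain exactly, so the inequality holds without estimation caveats.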
IV-A Experimental Setup
To verify that our results hold in a practical setting, we modify the two-layer model presented in [4]. Utilizing the original training algorithm, we use the following loss formulation to ensure good performance in terms of both task performance and bit-rate:

$$\mathcal{L} = R + \lambda_{enh} \cdot D_{enh} + \lambda_{base} \cdot D_{base}. \quad (10)$$

Here, $\mathcal{L}$ is the training loss; $R$ is the bit-rate as estimated by a learned entropy model inspired by [8]; $D_{enh}$ and $D_{base}$ are the distortions of the image reconstruction and feature matching, respectively (both use mean squared error, MSE, as their metric); and $\lambda_{enh}$ and $\lambda_{base}$ are Lagrange multipliers. By training the model using various values of the Lagrange multipliers, we produce rate-distortion curves for object detection and input reconstruction.
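A sketch of this loss computation, assuming placeholder tensors and illustrative Lagrange-multiplier values (the actual rate term comes from a learned entropy model, stubbed here as a scalar):

```python
import numpy as np

# Sketch of the rate-distortion training loss described above; the lambda
# defaults and the scalar rate are illustrative stand-ins, not the paper's
# trained settings.
def training_loss(rate_bits, x, x_hat, feat, feat_hat,
                  lam_enh=0.01, lam_base=0.006):
    d_enh = np.mean((x - x_hat) ** 2)         # input-reconstruction MSE
    d_base = np.mean((feat - feat_hat) ** 2)  # feature-matching MSE
    return rate_bits + lam_enh * d_enh + lam_base * d_base

x, x_hat = np.zeros(4), np.ones(4)            # toy image tensors
f, f_hat = np.zeros(8), np.zeros(8)           # toy feature tensors
loss = training_loss(1.0, x, x_hat, f, f_hat)
assert abs(loss - 1.01) < 1e-12               # 1.0 + 0.01*1.0 + 0.006*0.0
```

Sweeping `lam_base` (and `lam_enh`) trades rate against the two distortions, which is how the rate-distortion curves are traced out.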
Similarly to the approach taken in the three-layer case, we train the two-layer model using two-step feature matching, as shown in Fig. 1. In practice, using this two-step approach is equivalent to changing the distortion from being measured at the partition point to measuring it at some downstream layer features. Based on the multi-resolution structure of YOLOv3, we explore several combinations of layers, listed in Table I, to use for feature matching during training. When using more than one layer, we combine the distortions by first flattening the tensors in question and concatenating them, before calculating the MSE. Note that layer 108 is the final layer of YOLOv3, and its output consists of multiple tensors, corresponding to different resolutions.
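The multi-layer distortion described above (flatten each tensor, concatenate, then take a single MSE) can be sketched as:

```python
import numpy as np

def multi_layer_mse(feats, feats_hat):
    """Combine feature-matching distortion across several layers by
    flattening each tensor, concatenating, and taking one MSE."""
    a = np.concatenate([f.ravel() for f in feats])
    b = np.concatenate([f.ravel() for f in feats_hat])
    return float(np.mean((a - b) ** 2))

# Toy multi-resolution outputs (shapes are illustrative only).
f1, f2 = np.zeros((2, 4, 4)), np.zeros((2, 2, 2))
g1, g2 = np.ones((2, 4, 4)), np.zeros((2, 2, 2))
# 32 of the 40 concatenated elements differ by 1 -> MSE = 32/40 = 0.8
assert abs(multi_layer_mse([f1, f2], [g1, g2]) - 0.8) < 1e-12
```

Note that concatenating before averaging implicitly weights each layer by its element count, so larger (higher-resolution) tensors contribute proportionally more to the combined distortion.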
TABLE I

| Method | Matching layers | BD-PSNR |
| Cheng et al. | N/A | |
| Choi et al., | | |
| Choi et al., | | |
| Proposed - Mid-Model 1 | | |
| Proposed - Mid-Model 2 | | |
| Proposed - Mid-Model 3 | | |
We train our model following a scheme similar to the one presented in [4], using a batch size of 16 and training in two stages. First, we train for 400 epochs using a dataset comprised of randomly cropped patches from both the CLIC [13] and JPEG-AI [14] datasets. Then, we replace the dataset with a subset of VIMEO-90K [15] and proceed to train for another 350 epochs (a subset is used due to the large number of models to be trained). The model is trained using the Adam optimizer with a fixed learning rate for the first stage, which is then reduced with a polynomial decay every 10 epochs during the second stage.
After training, we evaluate our models in terms of rate-distortion performance for both the object detection (base) and reconstruction (enhancement) tasks. For the base task, we use the validation set of the COCO-2014 dataset [16], which we resize following the procedure in [4]. In accordance with common practice for object detection, we use mean average precision (mAP) as our metric. For the enhancement task, we use the Kodak dataset [17] and PSNR as our evaluation metric. As proven in our theoretical discussion, we expect our modifications to lead to improved rate-distortion performance on the computer vision task, perhaps at some cost to the reconstruction task.
We compare our model against the original two-layer model, which we retrain on the same data for a fair comparison. Since our method favors the base task slightly, we also train a second version of the two-layer model, using the original layers for distortion but with a larger Lagrange multiplier for the feature-matching distortion, $\lambda_{base} = 0.012$. Lastly, we also compare our model against three baselines: HEVC [18], VVC [19], and a learned compression model [8], where an image is fully decoded on the server side before being passed to YOLOv3 for detection.
We begin by reviewing the results for the base task, seen in Fig. 2. We can see that, as expected, the use of deeper layers has led to improved rate-distortion performance, with the largest improvement obtained using the final layer of YOLOv3. To quantify the differences in rate-distortion performance, we compute the BD-Rate metric [20], which measures the average difference in bits required to achieve performance (in terms of mAP, in this case) equal to that of a baseline. Once again, using the final layer achieves the best performance. Interestingly, increasing the Lagrange multiplier for the base task ($\lambda_{base}$ in (10)) to 0.012 had an effect comparable to that of our method using mid-model 1 or 2.
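BD-Rate is computed with the standard Bjøntegaard procedure: fit a low-order polynomial to each (quality, log-rate) curve and integrate the gap between the curves over their overlapping quality range. A sketch using mAP as the quality axis (the RD points below are illustrative, not our results):

```python
import numpy as np

def bd_rate(rates_ref, metric_ref, rates_test, metric_test):
    """Bjontegaard-delta rate (%): average extra rate of the test curve
    over the reference at equal quality, via cubic fits to
    (quality, log10-rate) integrated over the overlapping quality range."""
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    p_ref = np.polyfit(metric_ref, lr_ref, 3)
    p_test = np.polyfit(metric_test, lr_test, 3)
    lo = max(min(metric_ref), min(metric_test))
    hi = min(max(metric_ref), max(metric_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return float((10 ** avg_diff - 1) * 100)

q = np.array([30.0, 33.0, 36.0, 39.0])          # e.g. mAP points (illustrative)
r_ref = np.array([100.0, 200.0, 400.0, 800.0])  # reference bitrates (illustrative)
r_test = 2.0 * r_ref                            # test always costs twice the rate
assert abs(bd_rate(r_ref, q, r_test, q) - 100.0) < 0.01   # -> +100% BD-Rate
```

A negative BD-Rate means the test curve needs fewer bits than the baseline at equal quality; here the doubled-rate curve comes out at +100%, as expected.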
Next, we observe our results on the enhancement task, shown in Fig. 3. We summarise the performance using the BD-PSNR metric, which measures the average difference in PSNR at an equal bit-rate compared to a baseline (we use VVC [19]), as seen in Table I. Comparing the performance here to the base task, we clearly see the trade-off between the two. This is most notable in the performance of our method using mid-model 3, which was the best method for object detection and is the weakest here on image reconstruction. Importantly, our method using mid-model 1 or 2 performs slightly better than the modified version of [4] using $\lambda_{base} = 0.012$, which had comparable performance on the base task.
V Summary and Discussion
The growing prevalence of DNNs in the field of computer vision has led to an increasing volume of images seen predominantly by machines. This, in turn, requires efficient coding methods for machines and a good understanding of their rate-distortion behaviour. In this paper, we provided important theoretical background for understanding the implications of layer selection when encoding an image for feature matching in a downstream DNN model. We have proved that using deeper layers achieves superior rate-distortion performance compared to earlier ones. This is true both when evaluating distortion at the task level and when the distortion is measured at some intermediate point.
To demonstrate the usefulness of our theoretical results, we used them to modify the training loss of an image coding method for humans and machines [4]. The results clearly show that our approach yields improvements in the rate-distortion performance of the machine side of the model, as expected. Importantly, the modifications require no change to the encoder-side architecture or to the decoding process, making our approach easy to implement in practice. This is critical, as it does not require more edge-device resources than simply using the learned image coding backbone [8] on which it is based.
As might have been expected, the improved performance on the machine side comes at some cost on the human side. Using a hyper-parameter ($\lambda_{base}$), the framework in [4] allows a designer to balance the performance of the two sides during training. When considering either model for a specific application, one might estimate the frequency at which images will need to be fully reconstructed and adjust the training procedure accordingly. Notably, when both models are adjusted to give equal rate-distortion performance on object detection, our model achieves superior performance on image reconstruction compared with [4].
[1] J. Chai, H. Zeng, A. Li, and E. W. Ngai, “Deep learning in computer vision: A critical review of emerging techniques and application scenarios,” Machine Learning with Applications, vol. 6, p. 100134, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666827021000670
[2] J. Dean, D. Patterson, and C. Young, “A new golden age in computer architecture: Empowering the machine-learning revolution,” IEEE Micro, vol. 38, no. 2, pp. 21–29, 2018.
[3] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
[4] H. Choi and I. V. Bajić, “Scalable image coding for humans and machines,” IEEE Transactions on Image Processing, vol. 31, pp. 2739–2754, 2022.
[5] S. R. Alvar and I. V. Bajić, “Multi-task learning with compressible features for collaborative intelligence,” in Proc. IEEE ICIP, Sep. 2019, pp. 1705–1709.
[6] H. Choi and I. V. Bajić, “Near-lossless deep feature compression for collaborative intelligence,” in Proc. IEEE MMSP, Aug. 2018.
[7] H. Choi and I. V. Bajić, “Deep feature compression for collaborative object detection,” in Proc. IEEE ICIP, 2018.
[8] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized Gaussian mixture likelihoods and attention modules,” in Proc. IEEE/CVF CVPR, 2020, pp. 7939–7948.
[9] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, Apr. 2018.
[10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” arXiv preprint arXiv:1506.01497, 2015.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE/CVF ICCV, 2017, pp. 2961–2969.
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley, 2006.
[13] “Challenge on learned image compression (CLIC),” [Online]: http://www.compression.cc/, accessed: 2020-10-26.
[14] “JPEG AI dataset,” [Online]: https://jpeg.org/jpegai/dataset.html, accessed: 2020-10-26.
[15] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhancement with task-oriented flow,” Int. J. Comput. Vision, vol. 127, no. 8, pp. 1106–1125, 2019.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. ECCV, Sept. 2014.
[17] E. Kodak, “Kodak lossless true color image suite (PhotoCD PCD0992),” http://r0k.us/graphics/kodak, accessed: 2019-03-19.
[18] “High efficiency video coding,” Rec. ITU-T H.265 and ISO/IEC 23008-2, ITU-T and ISO/IEC JTC 1, 2019.
[19] “Versatile video coding,” Rec. ITU-T H.266 and ISO/IEC 23090-3, ITU-T and ISO/IEC JTC 1, 2020.
[20] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” VCEG-M33, Apr. 2001. [Online]. Available: https://www.itu.int/wftp3/av-arch/video-site/0104_Aus/VCEG-M33.doc