Image compression is in de facto use within environments that rely upon efficient image and video transmission and storage, such as security surveillance systems within our transportation infrastructure and our daily use of mobile devices. However, the use of commonplace lossy compression techniques, such as JPEG  and MPEG , to lower the storage/transmission overheads of such smart cameras leads to reduced image quality that may range from clearly noticeable to undetectable by the human observer. With the recent rise of deep convolutional neural networks (CNN [11, 13]) for video analytics across a broad range of image-based detection applications, a primary consideration for classification and prediction tasks is the empirical trade-off between the performance of these approaches and the level of lossy compression that can be afforded (for storage/transmission) within practical system deployments.
This is of particular interest as CNN are themselves known to contain lossy compression architectures - removing redundant image information to facilitate effective feature extraction while retaining an ability for full or partial image reconstruction from their internals [11, 13].
Prior work on this topic [5, 28, 9, 30] largely focuses on the use of compressed imagery within the train and test cycle of deep neural network development for specific tasks. However, relatively few studies investigate the impact upon CNN task performance with respect to differing levels of compression applied to the input imagery at inference (deployment) time.
In this paper we investigate whether (a) existing pre-trained CNN models exhibit linear degradation in performance as image quality is reduced by the use of lossy compression, and (b) training CNN models on such compressed imagery improves performance under such conditions. In contrast to prior work on this topic [5, 28, 9, 30], we investigate these aspects across multiple CNN architectures and domains spanning segmentation (SegNet, ), human pose estimation (OpenPose, ), object recognition (R-CNN, ), human action recognition (dual-stream, ), and depth estimation (GAN, ). Furthermore, we determine the domains within which compression is most impactful to performance, and thus where image quality is most pertinent to deployable CNN model performance.
II Prior Work
Overall, prior work in this area is limited in scope and diversity [5, 28, 9, 30]. Dodge et al.  analyze the performance of now-seminal CNN image classification architectures (AlexNet , VGG , and InceptionV1 ) under JPEG  compression and other distortion methods. They find that these architectures are resilient to compression artifacts (performance drops only at JPEG quality 10) and contrast changes, but under-perform when noise and blur are introduced.
Similarly, Zanjani et al.  consider the impact of JPEG 2000 compression  on CNN, and whether retraining the network on lossy compressed imagery would afford better resultant model performance. They identify similar performance from the retrained model on higher quality images, but achieve as much as a 59% performance increase on low quality images.
Rather than image compression, Yeo et al.  compare different block sizes and group-of-pictures (GOP) sizes within MPEG  compression against Human Action Recognition (HAR). They determine that both smaller blocks and smaller GOPs increase performance. Furthermore, B-frames introduce propagation errors in computing block texture, and should be avoided within the compression process. Tom et al.  add that there is a near-linear relationship between HAR performance and the number of motion vectors (MV) corrupted within H.264 video data, with performance levelling off once 75% of MV are corrupted. Klare and Burge , however, demonstrate that there is a non-linear relationship between face recognition performance and bit rate within H.264 video data, with sudden performance degradation around 128 kbps (CRF). These contrasting results demonstrate the need to investigate compression quality across multiple challenge domains, whose respective model architectures may have differing resilience to lossy compression artifacts.
Multiple authors have developed impressive architectures trained on compressed data, indicating both the potential and need for in-depth investigation within the compressed domain. Zhuang and Lai  demonstrate that acceptable face detection performance can be obtained from H.264 video data, while Wang and Chang  use the DCT coefficients from MPEG compression  to directly locate face regions. The same authors even achieve accurate face tracking results in , still within the compressed video domain. The question is evidently:- by how much can data be compressed?
These limited studies open the door only slightly on this very question - what is the generalized impact of compression on varying deep neural network architectures? Here we consider multiple CNN variants spanning region-based, encoder-decoder and GAN architectures, in addition to a wide range of target tasks spanning both discrete and regressive outputs. From our observations, we aim to form generalized conclusions on the hitherto unknown relationship between (lossy) image input and target function outputs within the domain of contemporary CNN approaches.
To determine how much lossy image compression is viable within CNN architectures before performance is significantly impacted, we must study a range of second-generation tasks, beyond simple and holistic image classification, requiring more complex CNN output granularity. We examine five CNN architectural variants across five different challenge domains, emulating in each case the dataset and evaluation metrics of the respective originating study as closely as possible. Inference models processing images were tested six times, with a JPEG quality parameter in the set , while video-based models were tested with H.264 CRF compression parameters in the set . Each model is then retrained with imagery compressed at each of the five higher levels of lossy compression to determine whether resilience to compression can be improved, and how much compression we can afford before a significant impact on performance is observed. Our methodology for each of our representative challenge domains is outlined in the following sections:- semantic segmentation (Section III-A), depth estimation (Section III-B), object detection (Section III-C), human pose estimation (Section III-D), and human action recognition (Section III-E).
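As a concrete illustration of the image-side test protocol, the sketch below re-encodes an image in memory at a given JPEG quality and reports the resulting byte cost; the function name and the quality values shown in the usage note are illustrative placeholders, not the exact parameter set used in the study.

```python
from io import BytesIO

from PIL import Image


def recompress_jpeg(img, quality):
    """Re-encode a PIL image as JPEG at the given quality (1-100).

    Returns the decoded (lossy) image and its compressed size in bytes,
    so per-image storage savings can be measured alongside task accuracy.
    """
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    size = buf.getbuffer().nbytes
    buf.seek(0)
    return Image.open(buf), size
```

A pre-trained model can then be evaluated once per quality setting (e.g. `recompress_jpeg(img, q) for q in (95, 50, 15, 10, 5)`), with lower quality values yielding smaller files and heavier compression artifacts.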
III-A Semantic Segmentation
Pixel-wise semantic segmentation involves assigning each pixel in an image (Fig. 1(a), above) its respective class label (Fig. 1(a), below). SegNet  uses an encoder-decoder neural network architecture followed by a pixel-wise classification layer to approach this challenge.
Implementing SegNet from , we evaluate global accuracy (the percentage of pixels correctly classified), mean class accuracy (the mean prediction accuracy over each class), and mean intersection over union (mIoU) against compressed imagery from the Cityscapes dataset . When retraining the network, we use 33000 epochs, a batch size of 12, a fixed learning rate of 0.1, and a momentum of 0.9.
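For reference, all three segmentation metrics can be derived from a per-class confusion matrix, as in the minimal pure-Python sketch below (a didactic illustration of the metric definitions; the actual evaluation follows the Cityscapes tooling, and `pred`/`gt` here are assumed to be flattened per-pixel label lists).

```python
def segmentation_metrics(pred, gt, num_classes):
    """Return (global accuracy, mean class accuracy, mIoU) for flat label lists."""
    # conf[c][p] counts pixels whose ground-truth class is c and prediction is p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        conf[g][p] += 1

    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(num_classes))
    global_acc = correct / total

    class_accs, ious = [], []
    for c in range(num_classes):
        tp = conf[c][c]
        gt_c = sum(conf[c])                                  # pixels of class c
        pred_c = sum(conf[r][c] for r in range(num_classes)) # pixels predicted c
        if gt_c:
            class_accs.append(tp / gt_c)
        union = gt_c + pred_c - tp
        if union:
            ious.append(tp / union)
    return global_acc, sum(class_accs) / len(class_accs), sum(ious) / len(ious)
```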
III-B Depth Estimation
In order to evaluate GAN architecture performance under compression, we need a task decoupled from reconstructing high-quality output, to which compression would be clearly detrimental. One such example is computing the depth map of a scene (Fig. 2(a), below) from monocular image sequences (Fig. 2(a), above).
III-C Object Detection
In object detection, we must locate and classify foreground objects within a scene (as opposed to semantic segmentation, which classifies each pixel), and compute the confidence of each classification (Fig. 3(a)). We evaluate the mAP of the Detectron Faster R-CNN  implementation  against the Pascal VOC 2007 dataset , over IoU thresholds of 0.5:0.95. When training the network, we use a weight decay of 0.0005 over 60000 epochs.
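The 0.5:0.95 protocol averages precision over ten IoU thresholds (0.5, 0.55, ..., 0.95); a minimal sketch of the underlying box IoU computation follows, assuming the common `(x1, y1, x2, y2)` corner convention (an illustrative helper, not the Detectron implementation).

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# The ten thresholds swept by the 0.5:0.95 mAP protocol.
IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```

A detection counts as a true positive at a given threshold when its IoU with an unmatched ground-truth box meets that threshold; mAP is then the mean of the per-threshold average precisions.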
III-D Human Pose Estimation
Human pose estimation involves computing (and overlaying) the skeletal position of each person detected within a scene (Fig. 4(a)). Recent work uses part affinity fields to map body parts to individuals, thus distinguishing between visually similar features.
Using OpenPose , we compute the skeletal overlay of detected people in images from the COCO dataset . We evaluate with mean average precision (mAP) over 10 object keypoint similarity (OKS) thresholds, where OKS represents IoU scaled by person size. When retraining the network, we use a batch size of 8 over 40 epochs.
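OKS scores each predicted keypoint by a Gaussian fall-off on its distance from the ground truth, scaled by object area and a per-keypoint constant, then averages over visible keypoints. A simplified sketch is given below; the visibility flags and per-keypoint constants `k` are placeholders for the COCO-defined values.

```python
import math


def oks(pred, gt, visible, area, k):
    """Object keypoint similarity between predicted and ground-truth keypoints.

    pred/gt: lists of (x, y); visible: per-keypoint flags; area: object scale;
    k: per-keypoint fall-off constants. Returns the mean per-keypoint similarity
    exp(-d^2 / (2 * area * k_i^2)) over visible keypoints.
    """
    sims = [
        math.exp(-((px - gx) ** 2 + (py - gy) ** 2) / (2 * area * ki ** 2))
        for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k)
        if v
    ]
    return sum(sims) / len(sims) if sims else 0.0
```

As with box IoU in detection, a predicted pose matches a ground-truth pose at a given threshold when its OKS meets that threshold, and mAP averages precision over the ten thresholds.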
III-E Human Action Recognition
To classify a single human action - from a handstand to knitting - with a reasonable level of accuracy, we must inspect spatial information from each frame, and temporal information across the entire video sequence.
We implement the dual-stream model from , recognising human activity by fusing spatial and temporal predictions on the UCF101 video dataset presented in  (see Fig. 5 for example frames, which deteriorate dramatically in quality as the H.264 CRF value is increased). To train the temporal stream, we pass 20 frames randomly sampled from the pre-computed stack of optical flow images. Across both streams, we use a batch size of 12 for 500 epochs.
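The temporal-stream input described above can be drawn as in the sketch below: 20 distinct indices sampled from the pre-computed optical-flow stack, returned in temporal order. The function name and seed handling are illustrative, not the paper's implementation.

```python
import random


def sample_flow_indices(num_flow_frames, stack_size=20, seed=None):
    """Pick `stack_size` distinct frame indices from an optical-flow stack.

    Indices are returned sorted so that the sampled sub-stack preserves the
    temporal ordering of the motion information.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_flow_frames), stack_size))
```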
In this section, we contrast the performance of the considered CNN architectures under their respective evaluation metrics before and after retraining. From this, we can determine how much we can safely compress the imagery while maintaining acceptable performance. We then propose possible explanations for the variations in resilience of the network architectures to image compression.
IV-A Semantic Segmentation
From the results presented in Table I, we can observe that the impact of lossy compression (Table I(a)) is minimal, indicating high resilience to compression within the network. At the heaviest compression level, we see global accuracy reduce by 14%, down to 78.2%, while affording a 95% lower storage cost on average per input image. However, at these heaviest compression rates, the compression artifacts introduced can lead to false labelling. This is particularly prominent where there are varying levels of lighting, affecting even plain roads (Fig. 1(c)). Subsequently, from Table I(b) we can see that retraining the network further minimizes performance loss, especially false labelling of regions. At a JPEG compression level of 5, performance loss is reduced to 3.5%, with global accuracy narrowly dropping below 0.9. Such resilience may stem from the up-sampling by the pooling layers within the decoder pipeline, which are innately capable of recovering information lost during compression, but further investigation is left to future work.
IV-B Depth Estimation
Analyzing the results in Table II, it is evident that lossy compression markedly diminishes the RMSE performance of depth estimation when heavy compression rates are employed (Table II(a)). At a JPEG compression level of 15, RMSE has not increased by more than 1.9%, but at a JPEG compression level of 10 and lower, performance begins to decline dramatically (in keeping with that of ). However, by retraining the network at the same compression level that is employed during testing (Table II(b)), performance loss can be thoroughly constrained. Even at a JPEG compression level of 5, RMSE can be constrained to under 0.0600, improving performance by as much as 20% over the pre-trained network. Other performance measures demonstrate the same trend.
This performance is surprising: we might expect RMSE to increase (thus lowering performance) after training on compressed imagery, since the GAN generates low-quality imagery as the textures and features used to calculate depth estimates are lost, and would therefore be unable to improve depth estimation performance. It is possible that it exceeds our expectations due to the encoder-decoder pipeline within the estimation process, which is also employed in the SegNet architecture, and thereby shares its compression resilience.
IV-C Object Detection
From Table III, we can again discern that performance degrades rapidly at high lossy compression levels (a JPEG compression level of 15 or less, see Table III(a)). Applying a JPEG compression level of 15 leads to a 22.5% drop, down to an mAP of 0.545, while a JPEG compression level of 5 causes mAP to drop by as much as 73.4%. Furthermore, at higher compression rates, fewer objects are detected, and their classification confidence also falls (Fig. 3(c)); their classification accuracy, however, remains unhindered. When the network is retrained on imagery lossily compressed at the same level, performance is noticeably improved (Table III(b)). The performance drop as the compression rate increases is delayed from a JPEG compression level of 15 to a JPEG compression level of 5. In fact, the retrained network is able to maintain an mAP above 0.6 even at a JPEG compression level of 10, reducing performance degradation to only 10.8% while affording a lossy compression rate almost 10-fold higher in terms of reduced image storage requirements.
IV-D Human Pose Estimation
The results in Table IV once again illustrate that lossy image compression (Table IV(a)) dramatically impacts performance at high rates. Similar to object detection, performance lowers considerably at a JPEG compression level of 15, in this case falling by 41.9% to 0.413 mAP. Qualitatively, the network computes precisely located skeletal positions at higher compression rates, but detects and locates fewer joints (Fig. 4(b)). With high levels of compression (Fig. 4(c)), the false positive rate increases, and limbs are falsely detected and located. It is likely that optimizing the detection confidence threshold required of joints before computing their location, thereby maximizing limb detection while minimizing false positives, would increase performance, especially under high compression. With a retrained network (Table IV(b)), a JPEG compression level of 15 can be safely employed before performance degradation exceeds 10%.
While impressive, these results are relatively insubstantial compared to those of other architectures, such as SegNet (Section IV-A, Table I). The difference can perhaps be attributed to the double prediction task within the pose estimation network: inaccuracies stemming from lower quality images are not just propagated but multiplied through the network, as the architecture must simultaneously predict both detection confidence maps and the part affinity fields for association encodings.
IV-E Human Action Recognition
From the results presented in Table V, it is evident that the impact of lossy compression (Table V(a)) dramatically increases when we apply a CRF of 50. In contrast to all other examined architectures, we can see from Table V(b) that retraining the network in fact decreases performance.
At first glance, we might expect the two-stream human action recognition network to behave similarly to pose estimation, as the errors introduced by compression artifacts propagate through both streams of the network. However, the spatial and motion streams are not trained in tandem. While the spatial stream remains resilient, once again due to the up-sampling within the architecture (Section IV-A), the motion stream is almost entirely unable to learn from compressed imagery. As such, retraining the network on compressed imagery in fact reduces overall performance (except at CRF 50, where the spatial stream improvement outweighs the motion stream degradation). Future work may reveal whether better performance can be achieved by retraining just the spatial stream on compressed imagery, and fusing its predictions with a motion stream trained only on uncompressed imagery.
This study has investigated the impact of lossy image compression on a multitude of existing deep CNN architectures. We have considered how much compression can be achieved while maintaining acceptable performance, and to what extent performance degradation can be ameliorated by retraining the networks with compressed imagery.
Across all challenges, retraining the network on compressed imagery recovers performance to some degree. In particular, this study has brought to attention that, in very prevalent and so far unexamined network architectures, we can afford to compress imagery at extremely high rates. Segmentation and depth estimation demonstrate resilience against even very significant compression, both employing an encoder-decoder pipeline. By using retrained models, compression can safely reach as high as 85% across all domains. In doing so, current storage costs can be markedly diminished before performance is noticeably impacted. Hyperparameter optimization of the retrained models can presumably capitalize on this even further, and in certain domains, such as segmentation, we can already afford to reduce storage to a twentieth of the original cost. It should be noted, however, that even a 1-2% performance loss may be unacceptable in safety-critical operations, such as depth estimation for vehicular visual odometry.
We can further suggest that lossy image compression is potentially viable as a data augmentation technique within R-CNN  and pose estimation  architectures, which suffer only mild performance degradation. Networks employing an encoder-decoder architecture (SegNet , GAN ) would only notably benefit from very significant levels of image compression for data augmentation. However, human action recognition networks - or sub-networks, in the case of the two-stream approach  - that consume motion input will not readily benefit from image compression as a data augmentation technique, since they appear unable to learn under such training conditions.
Future work will investigate whether performance is improved by retraining the network with more heavily or more lightly compressed imagery than that used at testing, or even with a variety of compression levels. Furthermore, evaluating the performance of compressed networks such as MobileNet  against compressed imagery would be pertinent, as such lightweight network architectures are prevalent amidst compressed imagery application domains.
This work was supported by Durham University and the European Regional Development Fund Intensive Industrial Innovation Grant No. 25R17P01847.
-  (2018-06) Real-time monocular depth estimation using synthetic data with domain adaptation. In , pp. 1–8. Cited by: §I, §III-B, (A)A, (B)B, §V.
-  (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. Computing Research Repository abs/1511.00561. External Links: Cited by: §I, Fig. 1, §III-A, (A)A, (B)B, §V.
-  (2017) Realtime multi-person 2d pose estimation using part affinity fields. In Conference on Computer Vision and Pattern Recognition, pp. 1302–1310. Cited by: §I, Fig. 4, §III-D, (A)A, (B)B, §V.
-  (2016) The cityscapes dataset for semantic urban scene understanding. Computing Research Repository abs/1604.01685. External Links: Cited by: §III-A.
-  (2016) Understanding how image quality affects deep neural networks. Computing Research Repository abs/1604.04004. External Links: Cited by: §I, §I, §II, §IV-B.
-  (2015-01) The pascal visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136. Cited by: §III-C.
-  (2018) Detectron. Note: https://github.com/facebookresearch/detectron Cited by: §III-C.
-  (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. Computing Research Repository abs/1704.04861. External Links: Cited by: §V.
-  (2010-04) Assessment of H.264 video compression on automated face recognition performance in surveillance and mobile video scenarios. Proceedings of SPIE - The International Society for Optical Engineering 7667. External Links: Cited by: §I, §I, §II, §II.
-  (2012) ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems - Volume 1, USA, pp. 1097–1105. Cited by: §I, §II.
-  (1991-04) MPEG: a video compression standard for multimedia applications. Commun. ACM 34 (4), pp. 46–58. External Links: Cited by: §I, §II, §II.
-  (2015-05) Deep learning. Nature 521, pp. 436–44. External Links: Cited by: §I, §I.
-  (2014) Microsoft COCO: common objects in context. Computing Research Repository abs/1405.0312. External Links: Cited by: §III-D.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. Computing Research Repository abs/1506.01497. External Links: Cited by: §I, Fig. 3, §III-C, (A)A, (B)B, §V.
-  (2016-06) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In Conference on Computer Vision and Pattern Recognition, pp. 3234–3243. Cited by: §III-B.
-  Semantic segmentation architectures implemented in pytorch. https://github.com/meetshah1995/pytorch-semseg. Cited by: §III-A.
-  (2014-09) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv e-prints, pp. arXiv:1409.1556. External Links: Cited by: §II.
-  (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems 27, pp. 568–576. Cited by: §I, Fig. 5, §III-E, (A)A, (B)B, §V.
-  (2001-Sep.) The JPEG 2000 still image compression standard. IEEE Signal Processing Magazine 18 (5), pp. 36–58. External Links: Cited by: §II.
-  (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. Computing Research Repository abs/1212.0402. External Links: Cited by: §III-E.
-  (2014) Going deeper with convolutions. Computing Research Repository abs/1409.4842. External Links: Cited by: §II.
-  (2015-11) Compressed domain human action recognition in H.264/AVC video streams. Multimedia Tools Appl. 74 (21), pp. 9323–9338. External Links: Cited by: §II.
-  (1991-04) The JPEG still picture compression standard. Commun. ACM 34 (4), pp. 30–44. External Links: Cited by: §I, §II.
-  (1997-08) A highly efficient system for automatic face region detection in mpeg video. IEEE Transactions on Circuits and Systems for Video Technology 7 (4), pp. 615–628. External Links: Cited by: §II.
-  (1999) FaceTrack: tracking and summarizing faces from compressed video. In Multimedia Storage and Archiving Systems IV, S. Panchanathan, S. Chang, and C.-C. J. Kuo (Eds.), Vol. 3846, pp. 222 – 234. External Links: Cited by: §II.
-  (2003-07) Overview of the H.264/AVC video coding standard. IEEE Trans. Cir. and Sys. for Video Technol. 13 (7), pp. 560–576. External Links: Cited by: §II.
-  (2008-08) High-speed action recognition and localization in compressed domain videos. IEEE Transactions on Circuits and Systems for Video Technology 18 (8), pp. 1006–1015. External Links: Cited by: §I, §I, §II, §II.
-  (2019) Impact of JPEG 2000 compression on deep convolutional neural networks for metastatic cancer detection in histopathological images. Journal of Medical Imaging 6 (2), pp. 1 – 9. External Links: Cited by: §II.
-  (2009-12) Face detection directly from H.264 compressed video with convolutional neural network. In International Conference on Image Processing, pp. 2485 – 2488. External Links: Cited by: §I, §I, §II, §II.