1 Introduction
Unlike in image recognition, where a network maps an image to a semantic label, a network used for image processing maps an input image to an output image with some desired properties. Examples include image super-resolution (dong2014learning), denoising (xie2012image), deblurring (eigen2013restoring), colorization (zhang2016colorful), and style transfer (gatys2015neural). The goal of such systems is to produce images of high perceptual quality to a human observer. For example, in image denoising, we aim to remove noise in the signal that is not useful to an observer and restore the image to its original “clean” form. Metrics like PSNR and SSIM (ssim) are often used (dong2014learning; srdensenet) to approximate human-perceived similarity between the processed images and the original images, and direct human assessment of the fidelity of the output is often considered the “gold-standard” assessment (srgan; zhang2018unreasonable). Therefore, many techniques (johnson2016perceptual; srgan; pix2pix) have been proposed for making the output images look natural to humans.

However, the processed images might not look “natural” to machines; in other words, they may not be accurately recognized by image recognition systems. As shown in Fig. 1, the output image of a denoising model can easily be recognized by a human as a bird, but a recognition model classifies it as a kite. One could train a recognition model specifically on the output images produced by the denoising model to achieve better performance on such images, but its performance on natural images may then be harmed. This retraining/adaptation scheme might also be impractical given the significant overhead of catering to various image processing tasks and models.
With the fast-growing size of image data, many images are “viewed” and analyzed more by machines than by humans. Nowadays, any image uploaded to the Internet is likely to be analyzed by certain vision systems. For example, Facebook uses a system called Rosetta to extract text from over 1 billion user-uploaded images every day (facebook). It is therefore of great importance that processed images be recognizable not only by humans but also by machines: this makes them easier to search, recommend to interested audiences, and so on, since these procedures are mostly executed by machines based on their understanding of the images. We thus argue that image processing systems should maintain or enhance machine semantics. We call this problem “Recognition-Aware Image Processing”.
It is also important that the enhanced machine semantics not be specific to any particular recognition model, i.e., that the improvement in recognition performance not be achieved only when the output images are evaluated on that specific model. Instead, the improvement should ideally transfer when evaluated on different models, supporting use cases where we have no access to possible future recognition systems; we may not decide what model will be used to recognize the processed image, for example if we upload it to the Internet or share it on social media. We may not know what network architecture (e.g., ResNet or VGG) will be used for inference, what object categories the downstream model recognizes (e.g., animals or scenes), or even what task will be performed on the processed image (e.g., classification or detection). Without these specifications, enhancing an image’s machine semantics could be difficult.
In this work, we propose simple and highly effective approaches to make image processing outputs more accurately recognized by downstream recognition systems, with improvements that transfer among different recognition architectures, categories, and tasks. The approaches we investigate add a recognition loss optimized jointly with the image processing loss. The recognition loss is computed using a fixed recognition model pretrained on natural images, and it can be computed in an unsupervised manner, e.g., without semantic labels of the image. It can be optimized either directly by the original image processing network or through an intermediate transforming network. We conduct extensive experiments on multiple image processing (super-resolution, denoising, and JPEG-deblocking) and recognition (classification and detection) tasks, and demonstrate that our approaches can substantially boost the recognition accuracy of the downstream systems, with minimal or no loss in image processing quality as measured by conventional metrics. Moreover, the accuracy improvement transfers favorably among different recognition model architectures, object categories, and recognition tasks, which renders our simple solution effective even when we do not have access to the downstream recognition models. Our contributions can be summarized as follows:
- We propose to study the problem of enhancing the machine interpretability of image processing outputs, a desirable property considering the amount of images analyzed by machines nowadays.
- We propose simple and effective methods towards this goal, suitable for different use cases, e.g., when ground-truth semantic labels are unavailable. Extensive experiments are conducted on multiple image processing and recognition tasks, demonstrating the wide applicability of the proposed methods.
- We show that with our simple approaches, the recognition accuracy improvement transfers among recognition architectures, categories, and tasks, a desirable behavior that makes the proposed methods applicable without access to the downstream recognition model.
2 Related Work
Since the initial success of deep neural networks on image enhancement and restoration problems (dong2014learning; xie2012image; wang2016d3), a large body of work has investigated better model architecture designs and training techniques (dong2016accelerating; kim2016accurate; shi2016real; Kim_2016_CVPR; mao2016image; lai2017deep; tai2017image; srdensenet; memnet; lim2017enhanced; zhang2018residual; ahn2018fast; lefkimmiatis2018universal; chen2018image; haris2018deep), mostly on the image super-resolution task. These works focus on generating images of high visual quality under conventional metrics or human evaluation, without considering recognition performance on the output.

There are also a number of works that relate image recognition with processing. Some works (zhang2016colorful; larsson2016learning; zhang2018image; sajjadi2017enhancenet)
use image classification accuracy as an evaluation metric for image colorization/super-resolution, but without optimizing for it during training.
bai2018finding train a super-resolution and refinement network simultaneously to better detect faces in the wild.
sicnn train networks for face hallucination and recognition jointly to better recover the face identity from low-resolution images. liu2018disentangling considers 3D face reconstruction and trains the recognition model jointly with the reconstructor. sharma2018classification trains a classification model together with an enhancement module. Our problem setting differs from these works in that we assume we have no control over the recognition model, as it might be on the cloud or decided in the future; thus we advocate adapting the image processing model only. This also ensures the recognition model is not harmed on natural images. haris2018task investigate how super-resolution can help object detection in low-resolution images. vidalmata2019bridging and banerjee2019report also aim to enhance machine accuracy on poorly-conditioned images, but mostly focus on better image processing techniques without using recognition models. wang2019segmentation propose a method to make denoised images more accurately segmented, also presenting some interesting findings on transferability. Most existing works consider only one image processing task or image domain and develop specific techniques, while our simpler approach is task-agnostic and potentially more widely applicable.

3 Method
In this section we first introduce the problem setting of “recognition-aware” image processing, and then we develop various approaches to address it, each suited for different use cases.
3.1 Problem Setting
In a typical image processing problem, given a set of training input images $\{x_i\}$ and corresponding target images $\{y_i\}$ ($i = 1, \dots, N$), we aim to train a neural network that maps an input image to its corresponding target. For example, in image denoising, $x_i$ is a noisy image and $y_i$ is the corresponding clean image. Denoting this mapping network as $P$ (for processing), parameterized by $W_P$, during training our optimization objective is:
$$\min_{W_P} \; L_{\text{proc}} = \frac{1}{N} \sum_{i=1}^{N} \ell_{\text{proc}}(\hat{y}_i, y_i) \quad (1)$$
where $\hat{y}_i = P(x_i)$ is simply the output of the processing model $P$, and $\ell_{\text{proc}}$ is the loss function for each sample. The pixel-wise mean-squared-error (MSE, or $\ell_2$) loss is one of the most popular choices. During evaluation, the performance is typically measured by the average similarity (e.g., PSNR, SSIM) between $\hat{y}_i$ and $y_i$, or through human assessment.

In our problem setting of recognition-aware processing, we are additionally interested in a recognition task, with a trained recognition model $R$ (for recognition), parameterized by $W_R$. We assume each input/target image pair is associated with a ground-truth semantic label $s_i$ for the recognition task. Our goal is to train an image processing model $P$ such that the recognition performance on the output images is high when they are evaluated using $R$ with the semantic labels $s_i$. In practice, the recognition model $R$ might not be available (e.g., it may be on the cloud), in which case we could resort to other models, provided the performance improvement transfers among models.
3.2 Optimizing Recognition Loss

Given that our goal is to make the output images of $P$ more recognizable by $R$, it is natural to add a recognition loss on top of the objective of the image processing task (Eqn. 1) during training:
$$L_{\text{recog}} = \frac{1}{N} \sum_{i=1}^{N} \ell_{\text{recog}}\big(R(P(x_i)), s_i\big) \quad (2)$$
where $\ell_{\text{recog}}$ is the per-example recognition loss defined by the downstream recognition task. For example, for image classification, $\ell_{\text{recog}}$ could be the cross-entropy (CE) loss. Adding the image processing loss (Eqn. 1) and the recognition loss (Eqn. 2) together, our total training objective becomes
$$\min_{W_P} \; L = L_{\text{proc}} + \lambda L_{\text{recog}} \quad (3)$$
where $\lambda$ is the coefficient controlling the weight of $L_{\text{recog}}$ relative to $L_{\text{proc}}$. We denote this simple solution as “RA (Recognition-Aware) processing”, visualized in Fig. 2 (left).
A potential shortcoming of directly optimizing $L$ is that it might deviate $P$ from optimizing the original loss $L_{\text{proc}}$, so that the trained $P$ generates images that are not as good as when only $L_{\text{proc}}$ is optimized. We will show in experiments, however, that with a proper choice of $\lambda$ we can substantially boost the recognition performance with minimal or no sacrifice of image quality.
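As a concrete illustration, the joint objective in Eqn. 3 can be optimized in a few lines of PyTorch. The sketch below is only illustrative: the names `processor` and `ra_training_step` are ours, it assumes a ResNet-18 classifier as the fixed recognition model, and it is not the paper’s released code.

```python
import torch
import torch.nn.functional as F
import torchvision

# Fixed recognition model R, pretrained on natural images (here assumed to be ResNet-18).
recognizer = torchvision.models.resnet18(pretrained=True).eval()
for p in recognizer.parameters():
    p.requires_grad = False  # R is never updated

def ra_training_step(processor, optimizer, x, y, label, lam=10.0):
    """One RA-processing step: x = degraded input, y = clean target, label = class id."""
    out = processor(x)                                    # processed image P(x)
    loss_proc = F.mse_loss(out, y)                        # image processing loss (Eqn. 1)
    # ImageNet normalization of `out` before the recognizer is omitted for brevity.
    loss_recog = F.cross_entropy(recognizer(out), label)  # recognition loss (Eqn. 2)
    loss = loss_proc + lam * loss_recog                   # total objective (Eqn. 3)
    optimizer.zero_grad()
    loss.backward()                                       # gradients only update P's weights
    optimizer.step()
    return loss.item()
```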
If using $R$ as a fixed loss function could only boost the recognition accuracy on $R$ itself, the use of the method would be restricted. Sometimes we do not know the downstream recognition model or even the task, but we still would like to improve future recognition performance. Interestingly, we find that an image processing model trained with the loss of one recognition model $R_1$ can also boost the performance when evaluated with another recognition model $R_2$, even if $R_2$ has a different architecture, recognizes a different set of categories, or is even trained for a different task. This makes our method effective even when we cannot access the target downstream model, in which case we can use another trained model that we do have access to as the loss function. This phenomenon also implies that the “recognizability” of a processed image can be a more general notion than the extent to which it fits one specific model. More details on how the improvement transfers among different recognition models are presented in the experiments.
3.3 Unsupervised Optimization of Recognition Loss
The solution above requires semantic labels for the training images, which may not always be available. In this case, we can instead regress the recognition model’s output on the target image, given that the target images are at hand and the recognition model is pretrained and fixed. The recognition objective in Eqn. 2 changes to
$$L_{\text{recog}} = \frac{1}{N} \sum_{i=1}^{N} d\big(R(P(x_i)), R(y_i)\big) \quad (4)$$
where $d$ is a distance metric between $R$’s output on the processed image $P(x_i)$ and $R$’s output on the ground-truth target image $y_i$. For example, when $R$ is a classification model that outputs a probability distribution over classes, $d$ could be the KL divergence or simply an $\ell_2$ distance. During evaluation, the output of $R$ is still compared with the ground-truth semantic label $s_i$. We call this approach “unsupervised RA”. Note that it is only “unsupervised” for training the processing model $P$; the target pretrained model $R$ can still be trained with full supervision. This approach is to some extent related to the “knowledge distillation” paradigm (kd) used for network model compression, where the output of a large model guides the output of a small model given the same input images. Here, instead, we use the same recognition model but guide the upstream processing model to generate inputs on which $R$ produces output similar to that on the target image.
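A minimal sketch of this unsupervised recognition loss is given below; it reuses the frozen `recognizer` from the earlier sketch, and the function name and the `use_kl` flag are hypothetical conventions of ours.

```python
import torch
import torch.nn.functional as F

def unsupervised_recog_loss(processor, recognizer, x, y, use_kl=True):
    """Eqn. 4: match R's output on P(x) to R's output on the clean target y (no labels needed)."""
    logits_proc = recognizer(processor(x))   # R's prediction on the processed image
    with torch.no_grad():
        logits_target = recognizer(y)        # R's prediction on the ground-truth target
    if use_kl:
        # KL divergence between the two predicted class distributions
        return F.kl_div(F.log_softmax(logits_proc, dim=1),
                        F.softmax(logits_target, dim=1),
                        reduction="batchmean")
    # alternatively, a simple squared L2 distance between the outputs
    return F.mse_loss(logits_proc, logits_target)
```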
3.4 Using an Intermediate Transformer
Sometimes we do want to guarantee that the added recognition loss will not deviate the processing model from optimizing its original loss. We can achieve this by introducing an intermediate transformation model $T$ (for transformer). After the input image goes through the image processing model $P$, the output image is first fed to $T$, and $T$’s output serves as the input to the recognition model $R$ (Fig. 2 right). In this case, $T$’s parameters $W_T$ are optimized to minimize the recognition loss:
$$\min_{W_T} \; L_{\text{recog}} = \frac{1}{N} \sum_{i=1}^{N} \ell_{\text{recog}}\big(R(T(P(x_i))), s_i\big) \quad (5)$$
In this way, with $T$ taking care of the recognition loss, the model $P$ can “focus on” its original image processing loss $L_{\text{proc}}$. The optimization objective becomes:
$$\min_{W_P, W_T} \; L_{\text{proc}}(W_P) + \lambda L_{\text{recog}}(W_T) \quad (6)$$
In Eqn. 6, $P$ is still solely optimizing $L_{\text{proc}}$, as in the original image processing problem (Eqn. 1). $P$ is learned as if there were no recognition loss, and therefore the image processing quality of its output is not affected. This can be achieved by “cutting” the gradient generated by $L_{\text{recog}}$ between the models $T$ and $P$ (Fig. 2 right). The responsibility for better recognition performance falls on the model $T$. We term this solution “RA with transformer”.
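A minimal sketch of this training scheme is shown below; the gradient “cut” is implemented with a simple detach, the two models use separate optimizers, and all module and function names are placeholders rather than the paper’s actual code.

```python
import torch.nn.functional as F

def ra_with_transformer_step(processor, transformer, opt_p, opt_t, recognizer, x, y, label):
    """P optimizes only the image processing loss; T optimizes only the recognition loss."""
    out_p = processor(x)                      # output "for humans"
    loss_proc = F.mse_loss(out_p, y)          # Eqn. 1, used to update P only
    opt_p.zero_grad()
    loss_proc.backward()
    opt_p.step()

    # "cut" the gradient: T sees P's output but sends no gradient back into P
    out_t = transformer(out_p.detach())       # output "for machines"
    loss_recog = F.cross_entropy(recognizer(out_t), label)  # Eqn. 5, updates T only
    opt_t.zero_grad()
    loss_recog.backward()
    opt_t.step()
    return loss_proc.item(), loss_recog.item()
```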
The downside of using a transformer, compared with directly optimizing the recognition loss through the processing model, is that there are two instances of each image (the output of $P$ and the output of $T$): one “for humans” and the other “for machines”. Also, as we show later, it can sometimes harm the transferability of the performance improvement, possibly because there is no image processing loss constraining $T$’s output. Therefore, the transformer is best suited to cases where we want to guarantee that the image processing quality is not affected at all, at the expense of maintaining another image and losing some transferability.
4 Experiments
We evaluate our proposed methods on three image processing tasks, namely image super-resolution, denoising, and JPEG-deblocking, paired with two common visual recognition tasks, image classification and object detection. We adopt SRResNet (srgan) as the architecture of the image processing model $P$, due to its popularity and simplicity. For the transformer model $T$, we use the 6-block ResNet architecture from CycleGAN (cyclegan), a general-purpose image-to-image transformation network. For classification we use ImageNet and for detection we use PASCAL VOC as our benchmarks. The recognition architectures are ResNet, VGG, and DenseNet. Training is performed on the training set and results on the validation set are reported. For more details on the training settings and hyperparameters of each task, please refer to Appendix A.

4.1 Evaluation on the Same Recognition Model
We first show our results when evaluating on the same recognition model, i.e., the $R$ used for evaluation is the same as the $R$ used as the recognition loss during training. Table 1(a) shows our results on ImageNet classification. The ImageNet-pretrained classification models ResNet-18/50/101, DenseNet-121, and VGG-16 are denoted as R18/50/101, D121, and V16 in Table 1(a). The “No Processing” row gives the recognition performance on the input of the image processing model: for denoising/JPEG-deblocking, this corresponds to the noisy/JPEG-compressed images; for super-resolution, the low-resolution images are bicubically interpolated to the original resolution. “Plain Processing” denotes conventional image processing models trained without the recognition loss, as described in Eqn. 1. We observe that a plainly trained processing model already boosts the accuracy over unprocessed images. These two settings serve as baselines in our experiments.

From Table 1(a), RA processing significantly boosts the accuracy of output images over plainly processed ones, for all image processing tasks and recognition models. The gain is more prominent when the accuracy of plain processing is lower, e.g., in super-resolution and JPEG-deblocking, where we mostly obtain around 10% accuracy improvement. Even without semantic labels, our unsupervised RA still outperforms the baselines in most cases, despite achieving lower accuracy than its supervised counterpart. Also, in super-resolution and JPEG-deblocking, using an intermediate transformer brings additional improvement over RA processing.
The results for object detection are shown in Table 1(b). We observe a similar trend as in classification: using the recognition loss consistently improves the mAP over plain image processing by a notable margin. On super-resolution, RA processing mostly performs on par with RA with transformer, while on the other two tasks using a transformer is slightly better.
4.2 Transfer between Recognition Architectures
In reality, the recognition model on which we eventually want to evaluate the output images might not be available to use as a training loss, e.g., it could be on the cloud, kept confidential, or decided later. In this case, we can train an image processing model using a recognition model $R_1$ that is accessible to us and, after obtaining the trained model $P$, evaluate its output images’ recognition accuracy using another, unseen recognition model $R_2$. We evaluate all model architecture pairs on ImageNet classification in Table 2 and Table 3, for RA processing and RA with transformer respectively, where each row corresponds to the model used as the recognition loss ($R_1$) and each column corresponds to the evaluation model ($R_2$). For RA with transformer, we use the processing model $P$ and the transformer $T$ trained with $R_1$ together when evaluating on $R_2$.
Task | Super-resolution | Denoising | JPEG-deblocking | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Evaluation on | R18 | R50 | R101 | D121 | V16 | R18 | R50 | R101 | D121 | V16 | R18 | R50 | R101 | D121 | V16 |
Plain Processing | 52.6 | 58.8 | 61.9 | 57.7 | 50.2 | 61.9 | 68.0 | 69.1 | 66.4 | 60.9 | 48.2 | 53.8 | 56.0 | 52.9 | 42.4 |
RA w/ R18 | 61.8 | 66.7 | 68.8 | 64.7 | 58.2 | 65.1 | 70.6 | 71.9 | 69.1 | 63.8 | 57.7 | 62.3 | 64.3 | 60.7 | 52.8 |
RA w/ R50 | 59.3 | 67.3 | 68.8 | 64.3 | 59.1 | 64.2 | 71.2 | 72.2 | 69.2 | 64.7 | 55.8 | 63.6 | 64.7 | 61.0 | 53.5 |
RA w/ R101 | 58.8 | 66.0 | 69.6 | 63.4 | 58.2 | 64.0 | 70.5 | 72.7 | 68.9 | 64.8 | 54.9 | 61.5 | 65.8 | 60.3 | 52.8 |
RA w/ D121 | 59.0 | 65.6 | 67.8 | 66.0 | 57.4 | 64.2 | 70.6 | 72.0 | 69.8 | 64.3 | 54.8 | 61.8 | 64.4 | 62.3 | 52.9 |
RA w/ V16 | 57.9 | 64.8 | 67.0 | 63.0 | 61.9 | 63.9 | 70.4 | 72.0 | 68.8 | 66.5 | 54.5 | 60.9 | 63.1 | 59.7 | 56.7 |
In each column of Table 2, training with any model $R_1$ produces substantially higher accuracy on $R_2$ than plainly processed images. Thus, we conclude that the improvement in recognition accuracy is transferable among different recognition architectures. A possible explanation is that these models are all trained on the same ImageNet dataset, so their mapping functions from input to output are similar, and optimizing the loss of one leads to a lower loss for another. This phenomenon enables us to use RA processing without knowledge of the downstream recognition architecture. However, among all rows, the $R_1$ that achieves the highest accuracy is still the same model as $R_2$, as indicated by the boldface numbers on the diagonal of Table 2; this is intuitive, since the processing model then optimizes the same recognition loss during training as the one used in evaluation.
Task | Super-resolution | Denoising | JPEG-deblocking | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Evaluation on | R18 | R50 | R101 | D121 | V16 | R18 | R50 | R101 | D121 | V16 | R18 | R50 | R101 | D121 | V16 |
Plain Processing | 52.6 | 58.8 | 61.9 | 57.7 | 50.2 | 61.9 | 68.0 | 69.1 | 66.4 | 60.9 | 48.2 | 53.8 | 56.0 | 52.9 | 42.4 |
RA w/ T w/ R18 | 63.0 | 59.2 | 67.0 | 63.9 | 27.0 | 65.2 | 69.4 | 71.6 | 68.4 | 40.3 | 59.8 | 58.7 | 62.6 | 60.3 | 19.9
RA w/ T w/ R50 | 60.5 | 68.2 | 68.9 | 65.8 | 40.4 | 63.1 | 70.9 | 71.5 | 68.6 | 48.7 | 55.0 | 65.1 | 63.9 | 61.9 | 31.5
RA w/ T w/ R101 | 59.6 | 66.2 | 70.1 | 65.1 | 35.6 | 62.4 | 68.8 | 72.3 | 67.6 | 52.3 | 54.8 | 61.3 | 66.7 | 24.8 | 60.5
RA w/ T w/ D121 | 58.5 | 64.2 | 66.9 | 66.5 | 27.3 | 58.0 | 66.8 | 67.3 | 69.6 | 46.7 | 46.6 | 57.2 | 59.0 | 63.9 | 9.0
RA w/ T w/ V16 | 59.2 | 64.7 | 67.8 | 65.0 | 63.0 | 57.6 | 64.0 | 67.1 | 55.7 | 63.1 | 56.1 | 61.2 | 63.4 | 58.7 | 60.1
Meanwhile, in Table 3, the improvement is still transferable in most cases when we use a transformer $T$, but there are a few exceptions. For example, when $R_1$ is a ResNet or DenseNet and $R_2$ is VGG-16, the accuracy in most cases falls behind plain processing by a large margin. This weaker transferability is possibly caused by the fact that no image processing loss constrains $T$’s output, so it “overfits” more to the specific $R_1$ it is trained with. For more results on object detection and unsupervised RA, please refer to Appendix B.1.
4.3 Transfer between Object Categories
What if $R_1$ and $R_2$ recognize different categories of objects? Can RA processing still bring transferable improvement? To answer this question, we divide the 1000 classes of ImageNet into two splits (denoted as categories A and B), each with 500 classes, and train two 500-way classification models (ResNet-18) on the two splits, obtaining $R_A$ and $R_B$. Next, we train two image processing models $P_A$ and $P_B$ with $R_A$ and $R_B$ as recognition losses, using images from categories A and B respectively. Note that neither the image processing model nor the recognition model sees any image from the other split of categories during training, and $R_A$ and $R_B$ learn completely different mappings from input to output. The plain-processing counterparts of $P_A$ and $P_B$ are also trained on categories A and B respectively, but without the recognition loss. We evaluate the obtained image processing models on both splits, and the results are shown in Table 4.
Task | Super-resolution | Denoising | JPEG-deblocking | |||
---|---|---|---|---|---|---|
Train/Eval Category | Cat A | Cat B | Cat A | Cat B | Cat A | Cat B
Cat A Plain | 59.6 | 60.1 | 67.6 | 68.0 | 54.2 | 55.5
Cat A RA | 67.2 | 66.5 | 69.7 | 69.4 | 63.0 | 62.3
Cat B Plain | 59.6 | 60.2 | 67.0 | 67.5 | 54.7 | 56.0
Cat B RA | 66.4 | 67.8 | 69.4 | 69.7 | 62.1 | 63.5
We observe that RA processing still benefits recognition accuracy even when transferring across categories (e.g., in super-resolution, from 60.1% to 66.5% when transferring from category A to category B). The improvement is only marginally lower than directly training with a recognition model of the same category (e.g., from 60.2% to 67.8% when trained and evaluated both on category B). Such transferability between categories suggests that the learned image processing models do not improve accuracy by adding category-specific signals to the output images; instead, they generate more general signals that enable a wider set of classes to be better recognized.
4.4 Transfer between Recognition Tasks
What if we take a further step to the case where $R_1$ and $R_2$ not only recognize different categories but also perform different tasks? We evaluate such task transferability when task 1 is classification and task 2 is object detection in Table 5. For results in the opposite direction and results for unsupervised RA, please refer to Appendix B.2.
Task | Super-resolution | Denoising | JPEG-deblocking | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Evaluation on | R18 | R50 | R101 | V16 | R18 | R50 | R101 | V16 | R18 | R50 | R101 | V16 |
Plain Processing | 68.5 | 69.7 | 73.1 | 63.2 | 68.1 | 71.6 | 74.1 | 65.7 | 62.4 | 65.6 | 69.5 | 58.3 |
RA w/ R18 | 71.3 | 73.5 | 75.6 | 67.8 | 70.6 | 73.1 | 75.5 | 64.1 | 67.7 | 70.3 | 73.2 | 62.4 |
RA w/ R50 | 70.8 | 73.2 | 74.8 | 67.8 | 70.4 | 73.1 | 75.8 | 66.2 | 67.8 | 70.2 | 73.1 | 62.8 |
RA w/ R101 | 70.7 | 73.2 | 75.3 | 67.0 | 70.5 | 73.5 | 75.7 | 66.9 | 68.1 | 70.2 | 72.8 | 63.2 |
RA w/ D121 | 71.2 | 73.6 | 75.3 | 67.2 | 70.5 | 73.2 | 75.7 | 65.7 | 68.1 | 70.5 | 73.1 | 62.6 |
RA w/ V16 | 70.4 | 72.4 | 74.6 | 67.5 | 70.6 | 73.0 | 75.7 | 67.7 | 67.8 | 70.3 | 73.2 | 63.7 |
In Table 5, note that rows indicate classification models used as the loss and columns indicate detection models, so even when they share the same name (e.g., “R18”), they are different models trained on different datasets for different tasks. We are thus transferring between architectures, categories, and tasks in this experiment. There is even a domain shift, since the loss model is trained on the ImageNet training set but fed PASCAL VOC images during evaluation. Here the “Plain Processing” models are trained on ImageNet instead of the PASCAL VOC dataset, so the results differ from those in Table 1(b). We observe that, except for two cases in the “V16” column of denoising, using a classification loss on model $R_1$ (row) notably boosts the detection accuracy on model $R_2$ over plain processing. This improvement is even comparable with directly training using the detection loss, as in Table 1(b). Such task transferability suggests that the “machine semantics” of an image could even be a task-agnostic property, and it makes our method even more broadly applicable.
4.5 Image Processing Quality Comparison
We have analyzed the recognition accuracy of the output images; now we compare the output image quality using the conventional metrics PSNR and SSIM. When using RA with transformer, the output quality of $P$ is guaranteed to be unaffected, so here we evaluate RA processing. We use ResNet-18 on ImageNet as $R$, and report results with different values of $\lambda$ (Eqn. 3) in Table 6.
λ (each cell: PSNR/SSIM/accuracy) | Super-resolution | Denoising | JPEG-deblocking
---|---|---|---
0 | 26.29/0.795/52.6 | 31.24/0.895/61.9 | 27.50/0.825/48.2
 | 26.33/0.803/59.2 | 31.18/0.894/64.4 | 27.50/0.823/56.0
10 | 26.31/0.792/61.8 | 30.78/0.884/65.1 | 27.17/0.810/57.7
 | 25.47/0.760/61.3 | 29.71/0.855/64.3 | 26.32/0.776/56.6
$\lambda = 0$ corresponds to plain processing. With a small $\lambda$ (second row), the PSNR/SSIM metrics in super-resolution are even slightly higher, and in denoising and JPEG-deblocking they are only marginally worse, while the obtained accuracy is significantly higher. This suggests that the added recognition loss is not harmful when $\lambda$ is chosen properly. When $\lambda$ is excessively large (last row), the image quality is hurt more, and interestingly even the recognition accuracy starts to decrease. A proper balance between the image processing loss and the recognition loss is needed for both image quality and performance on downstream recognition tasks.
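For reference, the PSNR values reported in Table 6 follow the standard definition based on the mean squared error; the small helper below is our own sketch (the paper’s exact evaluation script may differ in details such as color space or border cropping).

```python
import torch

def psnr(output: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio, assuming images scaled to [0, max_val]."""
    mse = torch.mean((output - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```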
Fig. 3 (images omitted): Target Image | Input Image | Plain Processing | RA (small λ) | RA (medium λ) | RA (large λ). Each cell below shows PSNR/SSIM/predicted label.

Label: bear | Low-resolution | 19.24/0.603/lion | 19.25/0.602/bear | 19.20/0.600/bear | 19.03/0.585/bear
Label: finch | Noisy | 30.45/0.909/kite | 30.45/0.908/finch | 30.18/0.899/finch | 29.41/0.871/finch
Label: crab | JPEG-compressed | 27.26/0.859/goldfish | 27.18/0.857/crab | 26.87/0.845/crab | 26.19/0.823/crab
In Fig. 3, we visualize examples where the output image is incorrectly classified under a plain image processing model but correctly recognized with RA processing. With smaller $\lambda$ (the first two RA columns), the image is nearly the same as the plainly processed one. When $\lambda$ is too large (the last column), some extra textures become visible when zooming in. For more results please refer to Appendix C.
5 Analysis
In this section we analyze some alternatives to our approaches. All experiments in this section are conducted using RA processing on super-resolution, with ResNet-18 trained on ImageNet as the recognition model, and $\lambda = 10$ where applicable.
Training without the Image Processing Loss. It is possible to train the processing model $P$ on the recognition loss $L_{\text{recog}}$ alone, without keeping the original image processing loss in Eqn. 3. This might presumably lead to better recognition performance, since the model can now “focus on” optimizing the recognition loss. However, we found that removing the original image processing loss hurts the recognition performance: the accuracy drops from 61.8% to 60.9%; even worse, the PSNR/SSIM metrics drop from 26.33/0.792 to 16.92/0.263, which is expected since the image processing loss is not optimized during training. This suggests the original image processing loss also helps recognition accuracy, since it encourages the corrupted image to be restored to its original form.
Fine-tuning the Recognition Model. Instead of fixing the recognition model $R$, we could fine-tune it together with the image processing model $P$ to optimize the recognition loss. Many prior works (sharma2018classification; bai2018finding; sicnn) do train/fine-tune the recognition model jointly with the image processing model. We use SGD with momentum as $R$’s optimizer, and the final accuracy reaches 63.0%. However, since $R$ is no longer fixed, it becomes a model that specifically recognizes super-resolved images, and we found its performance on the original target images drops from 69.8% to 60.5%. Moreover, when transferring the trained $P$ to ResNet-50, the accuracy is 62.4%, worse than the 66.7% obtained when training with a fixed ResNet-18. Thus we lose some transferability if we do not fix the recognition model $R$.
Training Recognition Models from Scratch. Rather than fine-tuning a pretrained recognition model $R$, we could first train a super-resolution model and then train $R$ from scratch on its output images. This achieves 66.1% accuracy on the output images of the validation set, higher than the 61.8% of RA processing. However, the accuracy on the original clean images drops from 69.8% to 66.1%. Alternatively, we could train $R$ from scratch on the interpolated low-resolution images, in which case we achieve 66.0% on interpolated validation data but only 50.2% on the original validation data. In summary, training or fine-tuning $R$ to cater to super-resolved or interpolated images can harm its performance on the original clean images, and it causes additional overhead in storing models. In contrast, our RA processing technique boosts the accuracy of output images while keeping the performance on original images intact.
6 Conclusion
We investigated the problem of enhancing the machine interpretability of image processing outputs. We find that our simple approach of optimizing an additional recognition loss during training can significantly boost recognition accuracy with minimal or no loss in image processing quality. Moreover, the improvement transfers to recognition architectures, object categories, and vision tasks unseen during training, indicating that the enhanced interpretability is not specific to one particular model but generalizes to others. This makes the proposed approach feasible even when the future downstream recognition models are unknown.
References
Appendix
Appendix A Experimental Details
General Setup. We evaluate our proposed methods on three image processing tasks: image super-resolution, denoising, and JPEG-deblocking. In these tasks, the target images are the original images from the datasets. To obtain the input images, for super-resolution we use a downsampling scale factor of 4; for denoising, we add Gaussian noise with a standard deviation of 0.1 to obtain the noisy images; for JPEG-deblocking, a quality factor of 10 is used to compress the image to JPEG format. The image processing loss used is the mean-squared-error (MSE, or $\ell_2$) loss. For the recognition tasks, we consider image classification and object detection, two common tasks in computer vision. In total, we have 6 (3 × 2) task pairs to evaluate.

We adopt SRResNet (srgan) as the architecture of the image processing model $P$, which is simple yet effective in optimizing the MSE loss. Even though SRResNet was originally designed for super-resolution, we find it also performs well on denoising and JPEG-deblocking when its upscale parameter is set to 1, keeping the input and output sizes equal. Throughout the experiments, for both the image processing network and the transformer, we use the Adam optimizer (adam) with the initial learning rate following the original SRResNet (srgan). Our implementation is in PyTorch (pytorch).
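For concreteness, the degraded inputs described above could be generated roughly as in the sketch below; this is only an illustration of the stated settings (×4 bicubic downsampling, Gaussian noise with standard deviation 0.1, JPEG quality factor 10), and the helper names are ours rather than the paper’s actual data pipeline.

```python
import io
import torch
from PIL import Image
from torchvision import transforms

to_tensor, to_pil = transforms.ToTensor(), transforms.ToPILImage()

def make_sr_input(img: Image.Image, scale: int = 4) -> Image.Image:
    """Bicubic x4 downsampling for the super-resolution task."""
    w, h = img.size
    return img.resize((w // scale, h // scale), Image.BICUBIC)

def make_noisy_input(img: Image.Image, sigma: float = 0.1) -> Image.Image:
    """Additive Gaussian noise with standard deviation 0.1 (pixel values in [0, 1])."""
    x = to_tensor(img)
    return to_pil((x + sigma * torch.randn_like(x)).clamp(0.0, 1.0))

def make_jpeg_input(img: Image.Image, quality: int = 10) -> Image.Image:
    """JPEG compression with quality factor 10 for the deblocking task."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```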
Image Classification. For image classification, we evaluate our method on the large-scale ImageNet benchmark (imagenet). We use five pretrained image classification models, ResNet-18/50/101 (resnet), DenseNet-121 (densenet), and VGG-16 (vgg) with BN (bn) (denoted as R18/50/101, D121, and V16 in Table 1(a)), whose top-1 accuracy (%) on the original validation images is 69.8, 76.2, 77.4, 74.7, and 73.4 respectively. We train the processing models for 6 epochs on the training set, with a 10× learning rate decay at epochs 5 and 6, and a batch size of 20. In evaluation, we feed unprocessed validation images to the image processing model and report the accuracy of the output images evaluated on the pretrained classification networks. For unsupervised RA, we use the $\ell_2$ distance as the function $d$ in Eqn. 4. The hyperparameter $\lambda$ is chosen using super-resolution with the ResNet-18 recognition model, on two small training/validation subsets drawn from the original large training set. The chosen $\lambda$ for RA processing, RA with transformer, and unsupervised RA is 10, 10, and 10, respectively.

Object Detection. For object detection, we evaluate on the PASCAL VOC 2007 and 2012 datasets, using Faster R-CNN (ren2015faster) as the recognition model. Our implementation is based on the code from (jjfaster2rcnn). Following common practice (yolo; ren2015faster; dai2016r), we use the VOC 07 and 12 trainval data as the training set and evaluate on the VOC 07 test data. The Faster R-CNN training uses the same hyperparameters as in (jjfaster2rcnn). For the recognition model’s backbone architecture, we evaluate ResNet-18/50/101 and VGG-16 (without BN (bn)), obtaining mAPs of 74.2, 76.8, 77.9, and 72.2 on the test set respectively. Given these trained detectors as recognition loss functions, we train the processing models on the training set for 7 epochs, with a 10× learning rate decay at epochs 6 and 7, and a batch size of 1. We report the mean Average Precision (mAP) of processed images in the test set. As in image classification, $\lambda$ is chosen separately for RA processing and for RA with transformer.
Appendix B More Results on Transferability
We present some additional results on transferability here.
B.1 Transferring between Architectures
Task | Super-resolution | Denoising | JPEG-deblocking | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Evaluation on | R18 | R50 | R101 | V16 | R18 | R50 | R101 | V16 | R18 | R50 | R101 | V16 |
Plain Processing | 69.2 | 70.7 | 73.3 | 64.2 | 68.9 | 72.0 | 74.7 | 65.8 | 63.7 | 66.5 | 70.4 | 60.3 |
RA w/ R18 | 71.2 | 73.8 | 75.2 | 66.9 | 70.9 | 74.0 | 75.5 | 67.2 | 67.4 | 70.0 | 72.3 | 63.5 |
RA w/ R50 | 70.6 | 74.4 | 75.4 | 66.4 | 70.6 | 73.7 | 75.5 | 67.2 | 67.0 | 70.4 | 72.4 | 63.2 |
RA w/ R101 | 71.1 | 73.8 | 75.6 | 65.8 | 70.3 | 73.6 | 75.6 | 66.2 | 65.9 | 69.3 | 72.9 | 61.3 |
RA w/ V16 | 70.4 | 72.8 | 74.9 | 68.1 | 69.9 | 73.4 | 75.6 | 67.6 | 66.1 | 69.3 | 72.1 | 63.9 |
We provide the architecture transferability results of RA processing on object detection in Table 7. Rows indicate the models used as the recognition loss and columns indicate the evaluation models. We see a similar trend as in classification (Table 2): using other architectures as the loss also improves recognition performance over plain processing, and the loss model that achieves the highest performance is mostly the evaluation model itself, as seen from the fact that most boldface numbers lie on the diagonal.
Task | Super-resolution | Denoising | JPEG-deblocking | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Evaluation on | R18 | R50 | R101 | D121 | V16 | R18 | R50 | R101 | D121 | V16 | R18 | R50 | R101 | D121 | V16 |
Plain Processing | 52.6 | 58.8 | 61.9 | 57.7 | 50.2 | 61.9 | 68.0 | 69.1 | 66.4 | 60.9 | 48.2 | 53.8 | 56.0 | 52.9 | 42.4 |
Unsup. RA w/ R18 | 61.3 | 66.3 | 68.6 | 64.5 | 57.3 | 61.7 | 67.9 | 69.7 | 66.4 | 60.5 | 53.8 | 59.1 | 62.0 | 57.5 | 50.0 |
Unsup. RA w/ R50 | 58.9 | 66.9 | 68.6 | 64.1 | 58.2 | 61.2 | 68.6 | 70.3 | 66.6 | 61.3 | 52.8 | 60.4 | 62.5 | 58.3 | 50.3 |
Unsup. RA w/ R101 | 57.8 | 64.9 | 69.0 | 62.9 | 56.9 | 60.6 | 68.0 | 70.7 | 66.3 | 60.7 | 52.3 | 58.7 | 63.4 | 57.9 | 49.0 |
Unsup. RA w/ D121 | 58.0 | 64.7 | 67.2 | 65.3 | 56.0 | 60.7 | 67.8 | 69.7 | 67.1 | 60.3 | 52.2 | 59.2 | 62.2 | 59.7 | 49.9 |
Unsup. RA w/ V16 | 57.7 | 64.6 | 67.3 | 63.2 | 61.0 | 60.4 | 67.1 | 69.6 | 65.9 | 63.6 | 52.0 | 58.4 | 61.5 | 57.4 | 53.1 |
As a complement to Section 4.2, we present the results of transferring between recognition architectures using unsupervised RA in Table 8. We note that for super-resolution and JPEG-deblocking, a similar trend holds as for (supervised) RA processing: using any architecture in training improves over plain processing. For denoising, however, this is not always the case; some models trained with unsupervised RA are slightly worse than their plain-processing counterparts. A possible reason is that the noise level in our experiments is not large enough, so plain processing already achieves very high accuracy.
B.2 Transferring between Recognition Tasks
In Section 4.4, we investigated the transferability of the improvement from classification to detection. Here we evaluate the opposite direction, from detection to classification. The results are shown in Table 9. Using RA processing still consistently improves over plain processing for any pair of models, but we note that the improvement is not as significant as directly training with classification models as the loss (Table 1(a) and Table 2).
Task | Super-resolution | Denoising | JPEG-deblocking | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Evaluation on | R18 | R50 | R101 | D121 | V16 | R18 | R50 | R101 | D121 | V16 | R18 | R50 | R101 | D121 | V16 |
Plain Processing | 53.0 | 58.9 | 62.0 | 57.3 | 50.9 | 59.7 | 65.1 | 67.3 | 63.9 | 59.2 | 48.8 | 54.6 | 56.8 | 53.1 | 44.7 |
RA w/ R18 | 54.6 | 60.2 | 63.4 | 58.8 | 52.7 | 60.8 | 66.7 | 68.8 | 65.2 | 61.1 | 50.8 | 57.2 | 59.6 | 55.4 | 48.5 |
RA w/ R50 | 54.0 | 59.7 | 63.0 | 58.7 | 52.0 | 60.5 | 66.6 | 68.5 | 64.9 | 60.8 | 50.7 | 56.9 | 59.2 | 55.3 | 48.3 |
RA w/ R101 | 54.1 | 59.8 | 63.3 | 58.7 | 52.5 | 60.2 | 66.1 | 68.3 | 64.6 | 60.6 | 51.3 | 57.2 | 59.5 | 55.5 | 48.3 |
RA w/ V16 | 54.5 | 60.4 | 63.6 | 59.1 | 52.7 | 60.4 | 66.6 | 68.4 | 64.7 | 60.6 | 50.6 | 56.5 | 58.7 | 54.9 | 47.9 |
Additionally, the results of transferring models trained with unsupervised RA on image classification to object detection are shown in Table 10. In most cases this improves over plain processing, but for image denoising it does not always hold. Similar to the results in Table 8, this could be because the noise level is relatively low in our experiments.
Super-resolution | Denoising | JPEG-deblocking | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Evaluation on | R18 | R50 | R101 | V16 | R18 | R50 | R101 | V16 | R18 | R50 | R101 | V16 |
Plain Processing | 68.5 | 69.7 | 73.1 | 63.2 | 68.1 | 71.6 | 74.1 | 65.7 | 62.4 | 65.6 | 69.5 | 58.3 |
Unsup. RA w/ R18 | 71.3 | 73.4 | 75.3 | 66.8 | 69.0 | 71.3 | 74.3 | 61.1 | 65.2 | 68.1 | 71.3 | 59.8 |
Unsup. RA w/ R50 | 70.7 | 73.3 | 75.0 | 66.6 | 68.9 | 71.7 | 74.4 | 63.1 | 65.4 | 68.5 | 71.2 | 60.0 |
Unsup. RA w/ R101 | 70.7 | 73.2 | 75.0 | 66.2 | 68.9 | 71.3 | 73.9 | 63.3 | 65.2 | 67.9 | 71.1 | 59.6 |
Unsup. RA w/ D121 | 71.0 | 73.2 | 75.1 | 66.6 | 68.7 | 70.3 | 73.0 | 63.8 | 65.9 | 68.6 | 71.4 | 61.1 |
Unsup. RA w/ V16 | 70.3 | 72.3 | 74.3 | 67.0 | 68.5 | 70.7 | 74.0 | 63.6 | 65.9 | 68.2 | 71.5 | 61.1 |
Appendix C More Visualizations
We provide more visualizations in Fig. 4, where the output image is incorrectly classified by ResNet-18 under a plain image processing model and correctly recognized with RA processing, as in Fig. 3 of Section 4.5.
Fig. 4 (images omitted): Target Image | Input Image | Plain Processing | RA (small λ) | RA (medium λ) | RA (large λ). Each cell below shows PSNR/SSIM/predicted label.

Label: beer bottle | Low-resolution | 21.06/0.725/shoe shop | 21.16/0.731/beer bottle | 21.05/0.727/beer bottle | 20.46/0.687/beer bottle
Label: dam | Low-resolution | 29.71/0.780/cliff | 29.76/0.783/dam | 29.60/0.778/dam | 28.92/0.755/dam
Label: tiger shark | Low-resolution | 36.58/0.915/hammerhead | 36.17/0.917/tiger shark | 36.00/0.911/tiger shark | 33.59/0.834/tiger shark
Label: pill bottle | Noisy | 33.69/0.935/lotion | 33.56/0.932/pill bottle | 33.09/0.920/pill bottle | 32.14/0.904/pill bottle
Label: tabby cat | Noisy | 30.77/0.830/plastic bag | 30.74/0.830/tabby cat | 30.51/0.825/tabby cat | 29.93/0.811/tabby cat
Label: tricycle | Noisy | 30.50/0.918/barber chair | 30.46/0.917/tricycle | 30.05/0.911/tricycle | 29.06/0.895/tricycle
Label: mushroom | JPEG-compressed | 25.78/0.746/folding chair | 25.78/0.747/mushroom | 21.55/0.730/mushroom | 24.96/0.696/mushroom
Label: pier | JPEG-compressed | 27.41/0.818/mobile home | 27.41/0.816/pier | 27.04/0.803/pier | 26.18/0.772/pier