Transferable Recognition-Aware Image Processing

10/21/2019, by Zhuang Liu, et al.

Recent progress in image recognition has stimulated the deployment of vision systems (e.g. image search engines) at an unprecedented scale. As a result, visual data are now often consumed not only by humans but also by machines. Meanwhile, existing image processing methods only optimize for better human perception, whereas the resulting images may not be accurately recognized by machines. This can be undesirable, e.g., the images can be improperly handled by search engines or recommendation systems. In this work, we propose simple approaches to improve machine interpretability of processed images: optimizing the recognition loss directly on the image processing network or through an intermediate transforming model, a process which we show can also be done in an unsupervised manner. Interestingly, the processing model's ability to enhance the recognition performance can transfer when evaluated on different recognition models, even if they are of different architectures, trained on different object categories or even different recognition tasks. This makes the solutions applicable even when we do not have the knowledge about future downstream recognition models, e.g., if we are to upload the processed images to the Internet. We conduct comprehensive experiments on three image processing tasks with two downstream recognition tasks, and confirm our method brings substantial accuracy improvement on both the same recognition model and when transferring to a different one, with minimal or no loss in the image processing quality.


1 Introduction

Figure 1: Image processing has been used to generate images that look good to humans, but not to machines. In this work we study the problem of making processed images more recognizable by machines.

Unlike in image recognition, where a network maps an image to a semantic label, a network used for image processing maps an input image to an output image with some desired properties. Examples include image super-resolution (dong2014learning), denoising (xie2012image), deblurring (eigen2013restoring), colorization (zhang2016colorful) and style transfer (gatys2015neural). The goal of such systems is to produce images of high perceptual quality to a human observer. For example, in image denoising, we aim to remove noise in the signal that is not useful to an observer and restore the image to its original "clean" form. Metrics like PSNR and SSIM (ssim) are often used (dong2014learning; srdensenet) to approximate the human-perceived similarity between the processed images and the original images, and direct human assessment of the fidelity of the output is often considered the "gold standard" (srgan; zhang2018unreasonable). Therefore, many techniques (johnson2016perceptual; srgan; pix2pix) have been proposed to make the output images look natural to humans.

However, the processed images might not look "natural" to machines; in other words, they may not be accurately recognized by image recognition systems. As shown in Fig. 1, the output image of a denoising model can easily be recognized by a human as a bird, but a recognition model classifies it as a kite. One could train a recognition model only on the output images produced by the denoising model to achieve better performance on such images, but the performance on natural images can then be harmed. This retraining/adaptation scheme might also be impractical considering the significant overhead induced by catering to various image processing tasks and models.

With the fast-growing size of image data, many images are "viewed" and analyzed more by machines than by humans. Nowadays, any image uploaded to the Internet is likely to be analyzed by certain vision systems. For example, Facebook uses a system called Rosetta to extract text from over 1 billion user-uploaded images every day (facebook). It is therefore important that processed images be recognizable not only by humans but also by machines: this makes them easier to search, more likely to be recommended to interested audiences, and so on, as these procedures are mostly executed by machines based on their understanding of the images. We thus argue that image processing systems should maintain or enhance machine semantics. We call this problem "Recognition-Aware Image Processing".

It is also important that the enhanced machine semantics not be specific to one concrete recognition model, i.e., that the improvement in recognition performance is not achieved only when the output images are evaluated on that particular model. Instead, the improvement should ideally transfer when evaluated on different models, so the method can be used without access to possible future recognition systems: we may not get to decide what model will be used to recognize the processed image, for example if we upload it to the Internet or share it on social media. We may not know what network architecture (e.g., ResNet or VGG) will be used for inference, what object categories the downstream model recognizes (e.g., animals or scenes), or even what task will be performed on the processed image (e.g., classification or detection). Without these specifications, it might be hard to enhance an image's machine semantics.

In this work, we propose simple and highly effective approaches to make image processing outputs more accurately recognized by downstream recognition systems, with improvements that transfer across recognition architectures, categories and tasks. The approaches we investigate add a recognition loss optimized jointly with the image processing loss. The recognition loss is computed using a fixed recognition model pretrained on natural images, and this can be done in an unsupervised manner, e.g., without semantic labels of the image. It can be optimized either directly by the original image processing network or through an intermediate transforming network. We conduct extensive experiments on multiple image processing tasks (super-resolution, denoising, and JPEG-deblocking) and recognition tasks (classification and detection), and demonstrate that our approaches can substantially boost the recognition accuracy of the downstream systems, with minimal or no loss in the image processing quality measured by conventional metrics. Moreover, the accuracy improvement transfers favorably among different recognition model architectures, object categories, and recognition tasks, which renders our simple solutions effective even when we do not have access to the downstream recognition models. Our contributions can be summarized as follows:


  • We propose to study the problem of enhancing the machine interpretability of image processing outputs, a desired property considering the amount of images analyzed by machines nowadays.

  • We propose simple and effective methods towards this goal, suitable for different use cases, e.g., without ground truth semantic labels. Extensive experiments are conducted on multiple image processing and recognition tasks, demonstrating the wide applicability of the proposed methods.

  • We show that using our simple approaches, the recognition accuracy improvement could transfer among recognition architectures, categories and tasks, a desirable behavior making the proposed methods applicable without access to the downstream recognition model.

2 Related Work

Since the initial success of deep neural networks on image enhancement and restoration problems (dong2014learning; xie2012image; wang2016d3), a large body of work has investigated better model architecture designs and training techniques (dong2016accelerating; kim2016accurate; shi2016real; Kim_2016_CVPR; mao2016image; lai2017deep; tai2017image; srdensenet; memnet; lim2017enhanced; zhang2018residual; ahn2018fast; lefkimmiatis2018universal; chen2018image; haris2018deep), mostly on the image super-resolution task. These works focus on generating images of high visual quality under conventional metrics or human evaluation, without considering recognition performance on the outputs.

There are also a number of works that relate image recognition with processing. Some works (zhang2016colorful; larsson2016learning; zhang2018image; sajjadi2017enhancenet) use image classification accuracy as an evaluation metric for image colorization/super-resolution, but without optimizing for it during training. bai2018finding train a super-resolution and refinement network simultaneously to better detect faces in the wild. sicnn train networks for face hallucination and recognition jointly to better recover face identity from low-resolution images. liu2018disentangling considers 3D face reconstruction and trains the recognition model jointly with the reconstructor. sharma2018classification trains a classification model together with an enhancement module. Our problem setting is different from these works in that we assume we do not have control over the recognition model, as it might be on the cloud or decided in the future; thus we advocate adapting the image processing model only. This also ensures the recognition model is not harmed on natural images. haris2018task investigate how super-resolution could help object detection in low-resolution images. vidalmata2019bridging and banerjee2019report also aim to enhance machine accuracy on poor-conditioned images but mostly focus on better image processing techniques without using recognition models. wang2019segmentation propose a method to make denoised images more accurately segmented, also presenting some interesting findings on transferability. Most existing works consider only one image processing task or image domain and develop specific techniques, while our simpler approach is task-agnostic and potentially more widely applicable.

3 Method

In this section we first introduce the problem setting of “recognition-aware” image processing, and then we develop various approaches to address it, each suited for different use cases.

3.1 Problem Setting

In a typical image processing problem, given a set of training input images $\{x_i\}_{i=1}^N$ and corresponding target images $\{y_i\}_{i=1}^N$, we aim to train a neural network that maps an input image to its corresponding target. For example, in image denoising, $x_i$ is a noisy image and $y_i$ is the corresponding clean image. Denoting this mapping network as $P$ (for processing), parameterized by $W_P$, during training our optimization objective is:

$$\min_{W_P} \; L_{proc} = \frac{1}{N}\sum_{i=1}^{N} \ell_{proc}\big(P(x_i),\, y_i\big), \qquad (1)$$

where $P(x_i)$ is simply the output of the processing model $P$, and $\ell_{proc}$ is the loss function for each sample. The pixel-wise mean-squared-error (MSE, or $\ell_2$) loss is one of the most popular choices. During evaluation, the performance is typically measured by the average similarity (e.g., PSNR, SSIM) between $P(x_i)$ and $y_i$, or through human assessment.

In our problem setting of recognition-aware processing, we are additionally interested in a recognition task, with a trained recognition model $R$ (for recognition), parameterized by $W_R$. We assume each input/target image pair $(x_i, y_i)$ is associated with a ground-truth semantic label $s_i$ for the recognition task. Our goal is to train an image processing model $P$ such that the recognition performance on its output images is high when they are evaluated by $R$ against the semantic labels $s_i$. In practice, the recognition model $R$ might not be available (e.g., it is on the cloud), in which case we could resort to other models, provided the performance improvement transfers among models.

3.2 Optimizing Recognition Loss

Figure 2: Left: RA (Recognition-Aware) processing. In addition to the image processing loss, we add a recognition loss computed by a fixed recognition model $R$, which the processing model $P$ optimizes. Right: RA with transformer. "Recognition Loss" stands for the dashed box in the left figure. A transformer $T$ is introduced between the output of $P$ and the input of $R$ to optimize the recognition loss. We cut the gradient from the recognition loss flowing to $P$, such that $P$ only optimizes the image processing loss and the image quality is guaranteed to be unaffected.

  

Given that our goal is to make the output images of $P$ more recognizable by $R$, it is natural to add a recognition loss on top of the objective of the image processing task (Eqn. 1) during training:

$$L_{recog} = \frac{1}{N}\sum_{i=1}^{N} \ell_{recog}\big(R(P(x_i)),\, s_i\big), \qquad (2)$$

where $\ell_{recog}$ is the per-example recognition loss defined by the downstream recognition task. For example, for image classification, $\ell_{recog}$ could be the cross-entropy (CE) loss. Adding the image processing loss (Eqn. 1) and the recognition loss (Eqn. 2) together, our total training objective becomes

$$\min_{W_P} \; L_{total} = L_{proc} + \lambda L_{recog}, \qquad (3)$$

where $\lambda$ is the coefficient controlling the weight of $L_{recog}$ relative to $L_{proc}$. We denote this simple solution as "RA (Recognition-Aware) processing", which is visualized in Fig. 2 left.

A potential shortcoming of directly optimizing $L_{total}$ is that it might deviate from optimizing the original loss $L_{proc}$, and the trained $P$ may generate images that are not as good as if we only optimized $L_{proc}$. We will show in experiments, however, that with a proper choice of $\lambda$, we can substantially boost the recognition performance with minimal or no sacrifice in image quality.
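To make the procedure concrete, below is a minimal PyTorch-style sketch of one RA processing training step. The tiny convolutional processor (standing in for SRResNet), the learning rate, and the loss bookkeeping are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# Stand-in for the SRResNet-style processing model P (a tiny conv net, for illustration only).
processor = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)

# Fixed recognition model R, pretrained on natural images (ImageNet).
recognizer = torchvision.models.resnet18(pretrained=True).eval()
for p in recognizer.parameters():
    p.requires_grad_(False)  # R is frozen; only P is trained

optimizer = torch.optim.Adam(processor.parameters(), lr=1e-4)  # lr is an assumed value
lam = 10.0  # lambda in Eqn. 3 (see Appendix A)

def ra_training_step(x, y, labels):
    """One RA processing step: x = degraded inputs, y = clean targets,
    labels = semantic labels s used by the recognition loss."""
    out = processor(x)
    proc_loss = F.mse_loss(out, y)                          # image processing loss (Eqn. 1)
    recog_loss = F.cross_entropy(recognizer(out), labels)   # recognition loss (Eqn. 2)
    loss = proc_loss + lam * recog_loss                     # total objective (Eqn. 3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```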

If training with a fixed recognition model $R_A$ as the loss function could only boost the recognition accuracy on $R_A$ itself, the use of the method would be restricted. Sometimes we do not have knowledge about the downstream recognition model or even the task, but we still would like to improve future recognition performance. Interestingly, we find that image processing models trained with the loss of one recognition model $R_A$ can also boost the performance when the outputs are evaluated on another recognition model $R_B$, even if $R_B$ has a different architecture, recognizes a different set of categories, or is trained for a different task. This makes our method effective even when we cannot access the target downstream model, in which case we can use another trained model that we do have access to as the loss function. This phenomenon also implies that the "recognizability" of a processed image can be a more general notion than the extent to which it fits a specific model. More details on how the improvement transfers among different recognition models are presented in the experiments.

3.3 Unsupervised Optimization of Recognition Loss

The solution above requires semantic labels for the training images, which may not always be available. In this case, we can instead regress the recognition model's output on the target image $y_i$, since the target images are at hand and the recognition model is pretrained and fixed. The recognition objective in Eqn. 2 changes to

$$L_{recog} = \frac{1}{N}\sum_{i=1}^{N} d\big(R(P(x_i)),\, R(y_i)\big), \qquad (4)$$

where $d$ is a distance metric between $R$'s outputs given the processed image $P(x_i)$ and the ground-truth target image $y_i$. For example, when $R$ is a classification model that outputs a probability distribution over classes, $d$ could be the KL divergence or simply a vector distance. During evaluation, the output of $R$ is still compared to the ground-truth semantic label $s_i$. We call this approach "unsupervised RA". Note that it is only "unsupervised" for training the model $P$; the pretrained model $R$ can still be trained with full supervision. This approach is to some extent related to the "knowledge distillation" paradigm (kd) used for network model compression, where the output of a large model guides the output of a small model given the same input images. Here we instead use the same recognition model, but guide the upstream processing model to generate inputs on which $R$ produces output similar to that on the target image.
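A corresponding sketch of the unsupervised recognition loss is shown below, reusing the frozen recognizer from the snippet above; the L2 distance on logits is one of the options mentioned here, chosen purely for illustration.

```python
import torch
import torch.nn.functional as F

def unsupervised_recognition_loss(processed, target, recognizer):
    """Unsupervised RA loss (Eqn. 4): match R's response on the processed
    image P(x) to its response on the clean target y; no labels needed."""
    with torch.no_grad():
        target_out = recognizer(target)                    # R(y), fixed reference signal
    return F.mse_loss(recognizer(processed), target_out)   # d chosen as an L2 distance on logits
```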

3.4 Using an Intermediate Transformer

Sometimes we do want to guarantee that the added recognition loss will not deviate the model from optimizing its original loss. We can achieve this by introducing an intermediate transformation model $T$. After the input image goes through the image processing model $P$, the output image is first fed to $T$, and $T$'s output serves as the input to the recognition model $R$ (Fig. 2 right). In this case, $T$'s parameters $W_T$ are optimized to minimize the recognition loss:

$$\min_{W_T} \; L_{recog} = \frac{1}{N}\sum_{i=1}^{N} \ell_{recog}\big(R(T(P(x_i))),\, s_i\big). \qquad (5)$$

In this way, with $T$ taking care of the recognition loss, the model $P$ can "focus on" its original image processing loss $L_{proc}$. The optimization objective becomes:

$$\min_{W_P,\, W_T} \; L_{proc}(W_P) + \lambda L_{recog}(W_T). \qquad (6)$$

In Eqn. 6, $P$ still solely optimizes $L_{proc}$ as in the original image processing problem (Eqn. 1). $P$ is learned as if there were no recognition loss, and therefore the image processing quality of its output is not affected. This is achieved by "cutting" the gradient generated by $L_{recog}$ between the models $T$ and $P$ (Fig. 2 right). The responsibility for better recognition performance falls on the model $T$. We term this solution "RA with transformer".
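The gradient cut between $T$ and $P$ amounts to a single detach on $P$'s output, as in the sketch below; the networks and optimizers are placeholders consistent with the earlier snippets, not the paper's exact implementation.

```python
import torch.nn.functional as F

def ra_with_transformer_step(x, y, labels, processor, transformer, recognizer,
                             opt_p, opt_t):
    """One training step of RA with transformer (Eqn. 6)."""
    out = processor(x)
    proc_loss = F.mse_loss(out, y)            # the only loss whose gradient reaches P

    # detach() stops the recognition gradient at P's output, so P's image
    # quality is untouched; T alone absorbs the recognition loss.
    recog_loss = F.cross_entropy(recognizer(transformer(out.detach())), labels)

    opt_p.zero_grad()
    opt_t.zero_grad()
    (proc_loss + recog_loss).backward()       # the two gradient paths are disjoint
    opt_p.step()
    opt_t.step()
    return proc_loss.item(), recog_loss.item()
```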

The downside of using a transformer, compared with directly optimizing the recognition loss with the processing model, is that there are two versions of each image (the output of $P$ and the output of $T$), one "for humans" and the other "for machines". Also, as we show later, it can sometimes harm the transferability of the performance improvement, possibly because no image processing loss constrains $T$'s output. Therefore, the transformer is best suited for the case where we want to guarantee that the image processing quality is not affected at all, at the expense of maintaining a second image and losing some transferability.

4 Experiments

We evaluate our proposed methods on three image processing tasks, namely image super-resolution, denoising, and JPEG-deblocking, paired with two common visual recognition tasks, image classification and object detection. We adopt SRResNet (srgan) as the architecture of the image processing model $P$, due to its popularity and simplicity. For the transformer model $T$, we use the 6-block ResNet architecture from CycleGAN (cyclegan), a general-purpose image-to-image transformation network. For classification we use ImageNet and for detection we use PASCAL VOC as our benchmark. The recognition architectures are ResNet, VGG and DenseNet. Training is performed on the training set and results on the validation set are reported. For more details on the training settings and hyperparameters of each task, please refer to Appendix A.

4.1 Evaluation on the Same Recognition Model

Task Super-resolution Denoising JPEG-deblocking
Classification Model R18 R50 R101 D121 V16 R18 R50 R101 D121 V16 R18 R50 R101 D121 V16
No Processing 46.3 50.4 55.5 51.6 42.1 46.8 55.8 61.3 59.7 46.7 43.1 47.7 55.2 49.2 43.9
Plain Processing 52.6 58.8 61.9 57.7 50.2 61.9 68.0 69.1 66.4 60.9 48.2 53.8 56.0 52.9 42.4
RA Processing 61.8 67.3 69.6 66.0 61.9 65.1 71.2 72.7 69.8 66.5 57.7 63.6 65.8 62.3 56.7
Unsupervised RA 61.3 66.9 69.4 65.3 61.0 61.7 68.6 70.8 67.1 63.6 53.8 60.4 63.4 59.7 53.1
RA w/ Transformer 63.0 68.2 70.1 66.5 63.0 65.2 70.9 72.3 69.6 65.9 59.8 65.1 66.7 63.9 58.7
(a) Accuracy (%) on ImageNet classification. The five models achieve 69.8, 76.2, 77.4, 74.7, 73.4 on original images.
Task Super-resolution Denoising JPEG-deblocking
Detection Model R18 R50 R101 V16 R18 R50 R101 V16 R18 R50 R101 V16
No Processing 67.9 70.3 72.1 63.6 51.8 56.5 61.8 38.9 49.3 54.5 64.1 38.4
Plain Processing 69.2 70.7 73.3 64.2 68.9 72.0 74.7 65.8 63.7 66.5 70.4 60.3
RA Processing 71.2 74.4 75.6 68.1 70.9 73.7 75.6 67.6 67.4 70.4 72.9 63.9
RA w/ Transformer 71.4 74.2 75.6 66.0 71.0 73.9 75.9 67.7 68.5 70.7 73.7 64.4
(b) mAP on PASCAL VOC object detection. The four models achieve 74.2, 76.8, 77.9, 72.2 on original images.
Table 1: Recognition-Aware (RA) processing techniques can substantially boost the recognition accuracy.

We first show results when evaluating on the same recognition model, i.e., the $R$ used for evaluation is the same as the $R$ used as the recognition loss in training. Table 1(a) shows our results on ImageNet classification. ImageNet-pretrained classification models ResNet-18/50/101, DenseNet-121 and VGG-16 are denoted as R18/50/101, D121 and V16 in Table 1(a). The "No Processing" row denotes the recognition performance on the input of the image processing model: for denoising/JPEG-deblocking, this corresponds to the noisy/JPEG-compressed images; for super-resolution, the low-resolution images are bicubic-interpolated to the original resolution. "Plain Processing" denotes conventional image processing models trained without the recognition loss, as described in Eqn. 1. We observe that a plainly trained processing model already boosts the accuracy over unprocessed images. These two are considered as baselines in our experiments.

From Table 1(a), RA processing significantly boosts the accuracy on output images over plainly processed ones, for all image processing tasks and recognition models. This is more prominent when the accuracy of plain processing is lower, e.g., in super-resolution and JPEG-deblocking, where we mostly obtain around 10% absolute accuracy improvement. Even without semantic labels, our unsupervised RA can still in most cases outperform the baseline methods, despite achieving lower accuracy than its supervised counterpart. Also, in super-resolution and JPEG-deblocking, using an intermediate transformer brings additional improvement over RA processing.

The results for object detection are shown in Table 1(b). We observe a similar trend as in classification: using the recognition loss consistently improves the mAP over plain image processing by a notable margin. On super-resolution, RA processing mostly performs on par with RA with transformer, while on the other two tasks using a transformer is slightly better.

4.2 Transfer between Recognition Architectures

In reality, the recognition model on which we eventually want to evaluate the output images might not be available to use as a training loss, e.g., it could be on the cloud, kept confidential, or decided later. In this case, we can train an image processing model using a recognition model $R_A$ that is accessible to us, and after obtaining the trained model $P$, evaluate its output images' recognition accuracy using another, unseen recognition model $R_B$. We evaluate all model architecture pairs on ImageNet classification in Table 2 and Table 3, for RA processing and RA with transformer respectively, where each row corresponds to the model used as the recognition loss ($R_A$) and each column corresponds to the evaluation model ($R_B$). For RA with transformer, we use the processing model $P$ and transformer $T$ trained with $R_A$ together when evaluating on $R_B$.

Task Super-resolution Denoising JPEG-deblocking
Evaluation on R18 R50 R101 D121 V16 R18 R50 R101 D121 V16 R18 R50 R101 D121 V16
Plain Processing 52.6 58.8 61.9 57.7 50.2 61.9 68.0 69.1 66.4 60.9 48.2 53.8 56.0 52.9 42.4
RA w/ R18 61.8 66.7 68.8 64.7 58.2 65.1 70.6 71.9 69.1 63.8 57.7 62.3 64.3 60.7 52.8
RA w/ R50 59.3 67.3 68.8 64.3 59.1 64.2 71.2 72.2 69.2 64.7 55.8 63.6 64.7 61.0 53.5
RA w/ R101 58.8 66.0 69.6 63.4 58.2 64.0 70.5 72.7 68.9 64.8 54.9 61.5 65.8 60.3 52.8
RA w/ D121 59.0 65.6 67.8 66.0 57.4 64.2 70.6 72.0 69.8 64.3 54.8 61.8 64.4 62.3 52.9
RA w/ V16 57.9 64.8 67.0 63.0 61.9 63.9 70.4 72.0 68.8 66.5 54.5 60.9 63.1 59.7 56.7
Table 2: Transfer between recognition architectures using RA processing, on ImageNet classification. An image processing model trained with model $R_A$ (row) as recognition loss improves the recognition performance on model $R_B$ (column) over plain processing.

In every column of Table 2, training with any loss model $R_A$ produces substantially higher accuracy on $R_B$ than plainly processed images. Thus, we conclude that the improvement in recognition accuracy is transferable among different recognition architectures. A possible explanation is that these models are all trained on the same ImageNet dataset, so their mappings from input to output are similar, and optimizing the loss of one also lowers the loss of another. This phenomenon enables us to use RA processing without knowledge of the downstream recognition architecture. Still, among all rows, the $R_A$ that achieves the highest accuracy is the same model as $R_B$, indicated by the boldface numbers on the diagonal of Table 2. This is intuitive, since in that case the processing model optimizes the same recognition loss during training as the one used in evaluation.

Task Super-resolution Denoising JPEG-deblocking
Evaluation on R18 R50 R101 D121 V16 R18 R50 R101 D121 V16 R18 R50 R101 D121 V16
Plain Processing 52.6 58.8 61.9 57.7 50.2 61.9 68.0 69.1 66.4 60.9 48.2 53.8 56.0 52.9 42.4
RA w/ T w/ R18 63.0 59.2 67.0 63.9 27.0 65.2 69.4 71.6 68.4 40.3 59.8 58.7 62.6 60.3 19.9
RA w/ T w/ R50 60.5 68.2 68.9 65.8 40.4 63.1 70.9 71.5 68.6 48.7 55.0 65.1 63.9 61.9 31.5
RA w/ T w/ R101 59.6 66.2 70.1 65.1 35.6 62.4 68.8 72.3 67.6 52.3 54.8 61.3 66.7 24.8 60.5
RA w/ T w/ D121 58.5 64.2 66.9 66.5 27.3 58.0 66.8 67.3 69.6 46.7 46.6 57.2 59.0 63.9 9.0
RA w/ T w/ V16 59.2 64.7 67.8 65.0 63.0 57.6 64.0 67.1 55.7 63.1 56.1 61.2 63.4 58.7 60.1
Table 3: Transfer between architectures using RA with transformer $T$, on ImageNet classification.

Meanwhile, in Table 3, the improvement is still transferable in most cases when we use a transformer $T$, but there are a few exceptions. For example, when $R_A$ is a ResNet or DenseNet and $R_B$ is VGG-16, the accuracy in most cases falls behind plain processing by a large margin. This weaker transferability is possibly caused by the fact that no image processing loss constrains $T$'s output, so $T$ "overfits" more to the specific $R_A$ it is trained with. For more results on object detection and unsupervised RA, please refer to Appendix B.1.

4.3 Transfer between Object Categories

What if $R_A$ and $R_B$ recognize different categories of objects? Can RA processing still bring transferable improvement? To answer this question, we divide the 1000 classes of ImageNet into two splits (denoted as categories A and B), each with 500 classes, and train two 500-way classification models (ResNet-18) on the two splits, obtaining $R_A$ and $R_B$. Next, we train two image processing models $P_A$ and $P_B$ with $R_A$ and $R_B$ as recognition losses, using images from categories A and B respectively. Note that neither the image processing model nor the recognition model has seen any images from the other split of categories during training, and $R_A$ and $R_B$ learn completely different mappings from input to output. The plain processing counterparts of $P_A$ and $P_B$ are also trained on categories A and B respectively, but without the recognition loss. We evaluate the obtained image processing models on both splits, and the results are shown in Table 4.

Task Super-resolution Denoising JPEG-deblocking
Train/Eval Category Cat A Cat B Cat A Cat B Cat A Cat B
Cat A Plain 59.6 60.1 67.6 68.0 54.2 55.5
Cat A RA 67.2 66.5 69.7 69.4 63.0 62.3
Cat B Plain 59.6 60.2 67.0 67.5 54.7 56.0
Cat B RA 66.4 67.8 69.4 69.7 62.1 63.5
Table 4: Transfer between different object categories (500-way accuracy %). RA processing on one set of categories can also improve the performance on another. “Cat” means category.

We observe that RA processing still benefits the recognition accuracy even when transferring across categories (e.g., in super-resolution, from 60.1% to 66.5% when transferring from category A to category B). The improvement is only marginally lower than directly training with the recognition model of the same category (e.g., from 60.2% to 67.8% when trained and evaluated both on category B). Such transferability between categories suggests that the learned image processing models do not improve accuracy by adding category-specific signals to the output images; instead they generate more general signals that enable a wider set of classes to be better recognized.

4.4 Transfer between Recognition Tasks

What if we take a further step to the case where $R_A$ and $R_B$ not only recognize different categories, but also perform different tasks? We evaluate such task transferability when task A is classification and task B is object detection in Table 5. For results in the opposite direction and for unsupervised RA, please refer to Appendix B.2.

Task Super-resolution Denoising JPEG-deblocking
Evaluation on R18 R50 R101 V16 R18 R50 R101 V16 R18 R50 R101 V16
Plain Processing 68.5 69.7 73.1 63.2 68.1 71.6 74.1 65.7 62.4 65.6 69.5 58.3
RA w/ R18 71.3 73.5 75.6 67.8 70.6 73.1 75.5 64.1 67.7 70.3 73.2 62.4
RA w/ R50 70.8 73.2 74.8 67.8 70.4 73.1 75.8 66.2 67.8 70.2 73.1 62.8
RA w/ R101 70.7 73.2 75.3 67.0 70.5 73.5 75.7 66.9 68.1 70.2 72.8 63.2
RA w/ D121 71.2 73.6 75.3 67.2 70.5 73.2 75.7 65.7 68.1 70.5 73.1 62.6
RA w/ V16 70.4 72.4 74.6 67.5 70.6 73.0 75.7 67.7 67.8 70.3 73.2 63.7
Table 5: Transfer from ImageNet classification to PASCAL VOC object detection (mAP). Processing model trained with classification model (row) can improve the performance on detection model (column).

In Table 5, note that rows indicate classification models used as the loss and columns indicate detection models, so even if they share a name (e.g., "R18"), they are different models trained on different datasets for different tasks. We are thus transferring between architectures, categories and tasks in this experiment. There is even a domain shift, since the processing model is trained on the ImageNet training set but fed PASCAL VOC images during evaluation. Here the "Plain Processing" models are trained on ImageNet instead of PASCAL VOC, so the results differ from those in Table 1(b). We observe that, except for two cases in the "V16" column for denoising, using the classification loss of model $R_A$ (row) notably boosts the detection accuracy of model $R_B$ (column) over plain processing. This improvement is even comparable with directly training using the detection loss, as in Table 1(b). Such task transferability suggests that the "machine semantics" of an image can be a task-agnostic property, and makes our method even more broadly applicable.

4.5 Image Processing Quality Comparison

We have analyzed the recognition accuracy of the output images; now we compare the output image quality using the conventional metrics PSNR and SSIM. When using RA with transformer, the output quality of $P$ is guaranteed to be unaffected, so here we evaluate RA processing. We use ResNet-18 on ImageNet as $R$, and report results with different values of $\lambda$ (Eqn. 3) in Table 6.

λ Super-resolution Denoising JPEG-deblocking
0 26.29/0.795/52.6 31.24/0.895/61.9 27.50/0.825/48.2
26.33/0.803/59.2 31.18/0.894/64.4 27.50/0.823/56.0
26.31/0.792/61.8 30.78/0.884/65.1 27.17/0.810/57.7
25.47/0.760/61.3 29.71/0.855/64.3 26.32/0.776/56.6
Table 6: PSNR/SSIM/Accuracy with different values of λ, on the ImageNet dataset.

λ = 0 corresponds to plain processing. With moderate values of λ, the PSNR/SSIM metrics in super-resolution are even slightly higher, and in denoising and JPEG-deblocking they are only marginally worse; the accuracy obtained, however, is significantly higher. This suggests that the added recognition loss is not harmful when λ is chosen properly. When λ is excessively large, the image quality is hurt more, and interestingly even the recognition accuracy starts to decrease. A proper balance between the image processing loss and the recognition loss is needed for both image quality and performance on downstream recognition tasks.

Target Image Input Image Plain Processing RA (three settings with increasing λ)
Label: bear Low-resolution 19.24/0.603/lion 19.25/0.602/bear 19.20/0.600/bear 19.03/0.585/bear
Label: finch Noisy 30.45/0.909/kite 30.45/0.908/finch 30.18/0.899/finch 29.41/0.871/finch
Label: crab JPEG-compressed 27.26/0.859/goldfish 27.18/0.857/crab 26.87/0.845/crab 26.19/0.823/crab
Figure 3: Examples where outputs of RA processing models can be correctly classified but those from plain processing models cannot. PSNR/SSIM/class prediction is shown below each output image. Slight differences between images from plain processing and RA processing models could be noticed when zoomed in.

In Fig. 3, we visualize examples where the output image is incorrectly classified with a plain image processing model but correctly recognized with RA processing. With smaller values of λ, the output is nearly identical to the plainly processed image. When λ is too large, some extra textures become visible when zooming in. For more results please refer to Appendix C.

5 Analysis

In this section we analyze some alternatives to our approaches. All experiments in this section are conducted with RA processing on super-resolution, with ResNet-18 trained on ImageNet as the recognition model, and λ = 10 where applicable.

Training without the Image Processing Loss. It is possible to train the processing model $P$ on the recognition loss $L_{recog}$ alone, without keeping the original image processing loss $L_{proc}$ (Eqn. 3). This might presumably lead to better recognition performance, since the model can now "focus on" optimizing the recognition loss. However, we found that removing the original image processing loss hurts the recognition performance: the accuracy drops from 61.8% to 60.9%; even worse, the PSNR/SSIM metrics drop from 26.33/0.792 to 16.92/0.263, which is reasonable since the image processing loss is no longer optimized during training. This suggests the original image processing loss also helps the recognition accuracy, since it drives the corrupted image back to its original form.

Fine-tuning the Recognition Model. Instead of fixing the recognition model $R$, we could fine-tune it together with the image processing model $P$ to optimize the recognition loss. Many prior works (sharma2018classification; bai2018finding; sicnn) do train/fine-tune the recognition model jointly with the image processing model. We use SGD with momentum as $R$'s optimizer, and the final accuracy reaches 63.0%. However, since $R$ is not fixed, it becomes a model that specifically recognizes super-resolved images, and we found its performance on the original target images drops from 69.8% to 60.5%. Moreover, when transferring the resulting $P$ to ResNet-50, the accuracy is 62.4%, worse than the 66.7% obtained when we train $P$ with a fixed ResNet-18. Hence we lose some transferability if we do not fix the recognition model $R$.

Training Recognition Models from Scratch. Rather than fine-tuning a pretrained recognition model $R$, we could first train a super-resolution model and then train $R$ from scratch on its output images. Doing so, we achieve 66.1% accuracy on the output images of the validation set, higher than the 61.8% of RA processing. However, the accuracy on the original clean images drops from 69.8% to 66.1%. Alternatively, we could train $R$ from scratch on the interpolated low-resolution images, in which case we achieve 66.0% on interpolated validation data but only 50.2% on the original validation data. In summary, training or fine-tuning $R$ to cater to super-resolved or interpolated images can harm its performance on the original clean images and incurs additional overhead for storing models. In contrast, our RA processing technique boosts the accuracy on output images while leaving the performance on original images intact.

6 Conclusion

We investigated the problem of enhancing the machine interpretability of image processing outputs. We find that a simple approach, optimizing an additional recognition loss during training, can significantly boost recognition accuracy with minimal or no loss in image processing quality. Moreover, such improvement transfers to recognition architectures, object categories, and vision tasks unseen during training, indicating that the enhanced interpretability is not specific to one particular model but generalizes to others. This makes the proposed approach feasible even when the future downstream recognition models are unknown.

References

Appendix

Appendix A Experimental Details

General Setup. We evaluate our proposed methods on three image processing tasks: image super-resolution, denoising, and JPEG-deblocking. In these tasks, the target images are the original images from the datasets. To obtain the input images, for super-resolution, we downsample with a scale factor of 4; for denoising, we add Gaussian noise with a standard deviation of 0.1 to obtain the noisy images; for JPEG-deblocking, a quality factor of 10 is used to compress the images to JPEG format. The image processing loss is the mean-squared-error (MSE, or $\ell_2$) loss. For the recognition tasks, we consider image classification and object detection, two common tasks in computer vision. In total, we have 6 (3 × 2) task pairs to evaluate.
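As a concrete illustration, below is a small sketch of how such degraded inputs can be generated; the bicubic resizing and the PIL-based JPEG round-trip are our own assumed choices, not necessarily the exact pipeline used in these experiments.

```python
import io
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from PIL import Image

def make_lowres(img, scale=4):
    """Super-resolution input: downsample a CHW tensor in [0, 1] by `scale`."""
    _, h, w = img.shape
    small = F.interpolate(img.unsqueeze(0), size=(h // scale, w // scale),
                          mode="bicubic", align_corners=False)
    return small.squeeze(0).clamp(0, 1)

def make_noisy(img, sigma=0.1):
    """Denoising input: additive Gaussian noise with standard deviation 0.1."""
    return (img + sigma * torch.randn_like(img)).clamp(0, 1)

def make_jpeg(img, quality=10):
    """JPEG-deblocking input: compress to JPEG at quality factor 10 and decode."""
    buf = io.BytesIO()
    TF.to_pil_image(img).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return TF.to_tensor(Image.open(buf))
```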

We adopt SRResNet (srgan) as the architecture of the image processing model $P$, which is simple yet effective in optimizing the MSE loss. Even though SRResNet was originally designed for super-resolution, we find it also performs well on denoising and JPEG-deblocking when its upscale parameter is set to 1, keeping the input and output sizes the same. Throughout the experiments, for both the image processing network and the transformer, we use the Adam optimizer (adam) with the initial learning rate of the original SRResNet (srgan). Our implementation is in PyTorch (pytorch).

Image Classification. For image classification, we evaluate our method on the large-scale ImageNet benchmark (imagenet). We use five pretrained image classification models, ResNet-18/50/101 (resnet), DenseNet-121 (densenet) and VGG-16 (vgg) with BN (bn) (denoted as R18/50/101, D121, V16 in Table 1(a)), whose top-1 accuracy (%) on the original validation images is 69.8, 76.2, 77.4, 74.7, and 73.4 respectively. We train the processing models for 6 epochs on the training set, with a learning rate decay of 10× at epochs 5 and 6, and a batch size of 20. In evaluation, we feed unprocessed validation images to the image processing model, and report the accuracy of the output images evaluated on the pretrained classification networks. For unsupervised RA, we use a simple distance between the model outputs as the function $d$ in Eqn. 4. The hyperparameter $\lambda$ is chosen using super-resolution with the ResNet-18 recognition model, on two small subsets split from the original large training set for training/validation. The chosen $\lambda$ for RA processing, RA with transformer, and unsupervised RA is 10, 10 and 10 respectively.
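A sketch of this evaluation loop is shown below, under the same assumptions as the earlier snippets: the loader is assumed to yield clean validation images with labels, and `degrade` is one of the corruption functions sketched above.

```python
import torch

@torch.no_grad()
def evaluate(processor, classifier, val_loader, degrade):
    """Top-1 accuracy of a fixed pretrained classifier on processed validation images."""
    processor.eval()
    classifier.eval()
    correct, total = 0, 0
    for images, labels in val_loader:                       # clean images + semantic labels
        degraded = torch.stack([degrade(im) for im in images])
        preds = classifier(processor(degraded)).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```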

Object Detection. For object detection, we evaluate on the PASCAL VOC 2007 and 2012 datasets, using Faster-RCNN (ren2015faster) as the recognition model. Our implementation is based on the code from (jjfaster2rcnn). Following common practice (yolo; ren2015faster; dai2016r), we use the VOC 07 and 12 trainval data as the training set and evaluate on the VOC 07 test data. The Faster-RCNN training uses the same hyperparameters as (jjfaster2rcnn). For the recognition model's backbone architecture, we evaluate ResNet-18/50/101 and VGG-16 (without BN (bn)), obtaining mAPs of 74.2, 76.8, 77.9 and 72.2 on the test set respectively. Given these trained detectors as recognition loss functions, we train the processing models on the training set for 7 epochs, with a learning rate decay of 10× at epochs 6 and 7, and a batch size of 1. We report the mean Average Precision (mAP) on processed images of the test set. As in image classification, we use fixed values of $\lambda$ for RA processing and for RA with transformer.

Appendix B More Results on Transferability

We present some additional results on transferability here.

B.1 Transferring between Architectures

Task Super-resolution Denoising JPEG-deblocking
Evaluation on R18 R50 R101 V16 R18 R50 R101 V16 R18 R50 R101 V16
Plain Processing 69.2 70.7 73.3 64.2 68.9 72.0 74.7 65.8 63.7 66.5 70.4 60.3
RA w/ R18 71.2 73.8 75.2 66.9 70.9 74.0 75.5 67.2 67.4 70.0 72.3 63.5
RA w/ R50 70.6 74.4 75.4 66.4 70.6 73.7 75.5 67.2 67.0 70.4 72.4 63.2
RA w/ R101 71.1 73.8 75.6 65.8 70.3 73.6 75.6 66.2 65.9 69.3 72.9 61.3
RA w/ V16 70.4 72.8 74.9 68.1 69.9 73.4 75.6 67.6 66.1 69.3 72.1 63.9
Table 7: Transfer between recognition architectures, evaluated on PASCAL VOC object detection (mAP).

We provide the model transferability results of RA processing on object detection in Table 7. Rows indicate the models used as the recognition loss and columns indicate the evaluation models. We see a similar trend as in classification (Table 2): using other architectures as the loss also improves recognition performance over plain processing, and the loss model that achieves the highest performance is mostly the evaluation model itself, as can be seen from the fact that most boldface numbers lie on the diagonal.

Task Super-resolution Denoising JPEG-deblocking
Evaluation on R18 R50 R101 D121 V16 R18 R50 R101 D121 V16 R18 R50 R101 D121 V16
Plain Processing 52.6 58.8 61.9 57.7 50.2 61.9 68.0 69.1 66.4 60.9 48.2 53.8 56.0 52.9 42.4
Unsup. RA w/ R18 61.3 66.3 68.6 64.5 57.3 61.7 67.9 69.7 66.4 60.5 53.8 59.1 62.0 57.5 50.0
Unsup. RA w/ R50 58.9 66.9 68.6 64.1 58.2 61.2 68.6 70.3 66.6 61.3 52.8 60.4 62.5 58.3 50.3
Unsup. RA w/ R101 57.8 64.9 69.0 62.9 56.9 60.6 68.0 70.7 66.3 60.7 52.3 58.7 63.4 57.9 49.0
Unsup. RA w/ D121 58.0 64.7 67.2 65.3 56.0 60.7 67.8 69.7 67.1 60.3 52.2 59.2 62.2 59.7 49.9
Unsup. RA w/ V16 57.7 64.6 67.3 63.2 61.0 60.4 67.1 69.6 65.9 63.6 52.0 58.4 61.5 57.4 53.1
Table 8: Transfer between recognition architectures using unsupervised RA, on ImageNet classification.

As a complement to Section 4.2, we present results for transfer between recognition architectures using unsupervised RA in Table 8. We note that for super-resolution and JPEG-deblocking, a similar trend holds as in (supervised) RA processing: using any architecture in training improves over plain processing. For denoising, however, this is not always the case: some models trained with unsupervised RA are slightly worse than the plain processing counterpart. A possible reason is that the noise level in our experiments is not very large, so plain processing already achieves high accuracy.

B.2 Transferring between Recognition Tasks

In Section 4.4, we investigated the transferability of the improvement from classification to detection. Here we evaluate the opposite direction, from detection to classification. The results are shown in Table 9. Using RA processing still consistently improves over plain processing for any pair of models, but the improvement is not as significant as directly training with classification models as the loss (Table 1(a) and Table 2).

Task Super-resolution Denoising JPEG-deblocking
Evaluation on R18 R50 R101 D121 V16 R18 R50 R101 D121 V16 R18 R50 R101 D121 V16
Plain Processing 53.0 58.9 62.0 57.3 50.9 59.7 65.1 67.3 63.9 59.2 48.8 54.6 56.8 53.1 44.7
RA w/ R18 54.6 60.2 63.4 58.8 52.7 60.8 66.7 68.8 65.2 61.1 50.8 57.2 59.6 55.4 48.5
RA w/ R50 54.0 59.7 63.0 58.7 52.0 60.5 66.6 68.5 64.9 60.8 50.7 56.9 59.2 55.3 48.3
RA w/ R101 54.1 59.8 63.3 58.7 52.5 60.2 66.1 68.3 64.6 60.6 51.3 57.2 59.5 55.5 48.3
RA w/ V16 54.5 60.4 63.6 59.1 52.7 60.4 66.6 68.4 64.7 60.6 50.6 56.5 58.7 54.9 47.9
Table 9: Transfer from PASCAL VOC object detection to ImageNet classification (accuracy %). An image processing model trained with a detection model (row) as recognition loss can improve the performance on a classification model (column) over plain processing.

Additionally, the results for transferring models trained with unsupervised RA on image classification to object detection are shown in Table 10. In most cases, unsupervised RA improves over plain processing, but for image denoising this is not always the case. Similar to the results in Table 8, this could be because the noise level is relatively low in our experiments.

Super-resolution Denoising JPEG-deblocking
Evaluation on R18 R50 R101 V16 R18 R50 R101 V16 R18 R50 R101 V16
Plain Processing 68.5 69.7 73.1 63.2 68.1 71.6 74.1 65.7 62.4 65.6 69.5 58.3
Unsup. RA w/ R18 71.3 73.4 75.3 66.8 69.0 71.3 74.3 61.1 65.2 68.1 71.3 59.8
Unsup. RA w/ R50 70.7 73.3 75.0 66.6 68.9 71.7 74.4 63.1 65.4 68.5 71.2 60.0
Unsup. RA w/ R101 70.7 73.2 75.0 66.2 68.9 71.3 73.9 63.3 65.2 67.9 71.1 59.6
Unsup. RA w/ D121 71.0 73.2 75.1 66.6 68.7 70.3 73.0 63.8 65.9 68.6 71.4 61.1
Unsup. RA w/ V16 70.3 72.3 74.3 67.0 68.5 70.7 74.0 63.6 65.9 68.2 71.5 61.1
Table 10: Transfer from ImageNet classification to PASCAL VOC object detection, using unsupervised RA.

Appendix C More Visualizations

We provide more visualizations in Fig. 4 of cases where the output image is incorrectly classified by ResNet-18 with a plain image processing model and correctly recognized with RA processing, as in Fig. 3 in Section 4.5.

Target Image Input Image Plain Processing RA (three settings with increasing λ)
Label: beer bottle Low-resolution 21.06/0.725/shoe shop 21.16/0.731/beer bottle 21.05/0.727/beer bottle 20.46/0.687/beer bottle
Label: dam Low-resolution 29.71/0.780/cliff 29.76/0.783/dam 29.60/0.778/dam 28.92/0.755/dam
Label: tiger shark Low-resolution 36.58/0.915/hammerhead 36.17/0.917/tiger shark 36.00/0.911/tiger shark 33.59/0.834/tiger shark
Label: pill bottle Noisy 33.69/0.935/lotion 33.56/0.932/pill bottle 33.09/0.920/pill bottle 32.14/0.904/pill bottle
Label: tabby cat Noisy 30.77/0.830/plastic bag 30.74/0.830/tabby cat 30.51/0.825/tabby cat 29.93/0.811/tabby cat
 
Label: tricycle Noisy 30.50/0.918/barber chair 30.46/0.917/tricycle 30.05/0.911/tricycle 29.06/0.895/tricycle
Label: mushroom JPEG-compressed 25.78/0.746/folding chair 25.78/0.747/mushroom 21.55/0.730/mushroom 24.96/0.696/mushroom
Label: pier JPEG-compressed 27.41/0.818/mobile home 27.41/0.816/pier 27.04/0.803/pier 26.18/0.772/pier
Figure 4: Examples where output images from RA processing models can be correctly classified but those from plain processing models cannot. PSNR/SSIM/class prediction is shown below each output image. Slight differences between images from plain processing and RA processing models (especially with large λ) can be noticed when zoomed in.