Joint Learning of Multiple Image Restoration Tasks

07/10/2019 · Xing Liu et al., Tohoku University

Convolutional neural networks have recently been successfully applied to the problem of restoring clean images from their degraded versions. Most studies have designed and trained a dedicated network for each of many image restoration tasks, such as motion blur removal, rain-streak removal, haze removal, etc. In this paper, we show that a single network having a single input and multiple output branches can solve multiple image restoration tasks. This is made possible by improving the attention mechanism and the internal structure of the basic blocks used in the dual residual networks, which were recently proposed and shown to work well for a number of image restoration tasks by Liu et al. Experimental results show that the proposed approach achieves a new state-of-the-art performance on haze removal (in both PSNR and SSIM) and JPEG artifact removal (in SSIM). To the authors' knowledge, this is the first report of successful multi-task learning on diverse image restoration tasks.


1 Introduction

The problem of image restoration, i.e., restoring an original, clean image from its degraded version, has been studied for a long time in computer vision and image processing. As with other problems of computer vision, deep learning has been applied to this problem, leading to significant improvements in performance. There are many factors causing image degradation/distortion, each of which has been addressed by a (large) number of past studies, such as motion/defocus blur [56, 50, 72, 71, 12], several types of noise (e.g., Gaussian, real-world noise, etc.) [13, 9, 70, 69], JPEG compression noise [76, 8, 5, 24], rain streaks [41, 30, 11], raindrops [35, 78, 31], haze [23, 46, 3], etc.

Previous studies have treated each of these degradation types individually and developed "dedicated" methods for each factor. This is also the case with recent studies [34, 51, 59, 40, 77, 80] utilizing deep learning; there are different networks for different degradation types. It should be noted that a few recent studies attempt to deal with combined degradation [58, 79], proposing single networks that can handle images having mixed degradation types with unknown mixing ratios. However, their restoration accuracy (for images with a single degradation type) tends to be much lower than that of the dedicated methods; moreover, they can only deal with small image patches for now. Thus, it is fair to say that they are still at an experimental stage.

Figure 1: Left: Approach employed in recent studies, i.e., designing/training a different network for each image restoration task dealing with a single degradation factor. Right: Our approach; a single network having a single input and multiple output branches is trained on multiple image restoration tasks.

In this paper, we consider another approach and choose to use a single network to deal with multiple degradation factors. Figure 1 contrasts the standard approach with ours; we use a network having a single input and multiple output branches, each of which is dedicated to a different image restoration task. Although this is also a standard formulation of multi-task learning (MTL), it has so far remained unclear whether MTL works for image restoration tasks. Moreover, our motivation is not merely to improve the current state-of-the-art by employing MTL. This study is motivated more by a desire to know what is (and should be) learned by CNNs on image restoration tasks and what their optimal design for the tasks is.

In earlier studies of image restoration, it was of primary interest to model and represent the statistics, or prior, of natural images, which is then utilized to restore the original image from an input, for instance, within the framework of Bayesian inference [15, 47, 54]. On the other hand, in the recent approaches that use CNNs in an end-to-end fashion, there is no explicit modeling of an image prior. We conjecture that in order to restore original images accurately, a network must learn a natural image prior in some form for any degradation factor. If this is true, optimal networks for individual degradation factors may share a representation of the natural image prior. Our method is built upon this conjecture.

To deal with multiple degradation factors with a single network, we propose a new design of networks built upon the dual residual networks, which were recently proposed by Liu et al. [44]. Although the authors have shown that the base network architecture is effective for various degradation factors, the internal components of their networks need to be changed for different degradation factors.

Our improvements to the method of Liu et al. [44] are twofold. One is an improved attention mechanism. They employed the squeeze-and-excitation (SE) mechanism in some of their proposed component blocks, named DuRB-S and DuRB-US. Originally proposed for the object classification task, the SE mechanism has since been successfully applied to various tasks such as super-resolution (SR) [85], single-view depth estimation [26], etc. Its concept is to use global average pooling of the activations in each individual channel of a layer to generate channel-wise attention weights on that layer's activations. We extend it to additionally use the amplitudes of the spatial derivatives of the activations (i.e., in the vertical and horizontal directions) to compute attention weights. The other extension is a new design of the basic block, in which the two operations employed in DuRB-U and -US [44] are fused; we will refer to it as DuRB-M. We show through experiments that these two extensions make it possible for a single network to learn multiple image restoration tasks, updating the current state-of-the-art for some of them.

2 Related Work

Image Restoration

Image restoration has been studied for a long time. Most of the early studies incorporate models of degraded images along with priors of clean natural images, based on which they formulate and solve an optimization problem. Examples are the studies on motion blur removal [15, 72, 71, 2] and those on haze removal [23, 4, 46]. Recently, CNN-based methods have achieved good performance for all sorts of image restoration tasks [60, 19, 48, 34, 66, 80, 75, 52, 81, 40, 38, 86, 65, 14]. For motion blur removal, Nah et al. [48] proposed a network architecture having modified residual blocks and trained it on a large-scale dataset (GoPro Data). Kupyn et al. [34] proposed a GAN [21]-based method, updating the former state-of-the-art. For haze removal, Zhang et al. [80] proposed a GAN-based CNN that jointly estimates multiple unknowns comprising a haze model. Ren et al. [53] proposed a method of weighting several enhanced versions of an input image with the weights predicted by a CNN. For rain-streak removal, Li et al. proposed an RNN-based method [40] and Li et al. proposed the non-locally enhanced dense block (NEDB) [38]. For JPEG-compression noise removal, Galteri et al. [17] proposed a GAN-based method and Zhang et al. [86] proposed a non-local attention mechanism. Finally, Liu et al. [44] recently proposed a network architecture having dual residual connections, updating the state-of-the-art performance on most of the above tasks.

Single Net for Multiple Degradation Factors

A more challenging formulation is to restore a clean image from a degraded input that has undergone an unknown combination of multiple degradation factors [79, 58]. However, existing methods for this formulation seem premature; on images with a single degradation factor, there is currently a large performance gap between them and the state-of-the-art methods designed for that factor. Other studies consider two degradation factors that are similar (e.g., deblurring and super-resolution) or closely related to each other (e.g., denoising vs. deblurring/decompression) [73, 84, 20, 48] to improve performance on the tasks. Our study differs from these studies in motivation, as described in Sec. 1, resulting in differences in the number (up to four) and diversity of the degradation factors that are simultaneously considered. There are also studies that train and use different networks with an identical architectural design on different restoration tasks, e.g., noise removal, rain-streak removal, and super-resolution [33]; or noise, mosaic, and JPEG-compression noise removal, and super-resolution [86]. Our study differs from them in that we train the same network on multiple different restoration tasks in a multi-task learning framework.

Multi-task Learning

It is well known that multi-task learning [7] of deep networks is effective for many computer vision tasks [7, 45, 32, 74, 22, 42], to name a few. For MTL to work, there should arguably be some relation among the tasks jointly learned; in other words, there should be overlaps among the representations to be learned for those tasks. Combinations of tasks in the successful MTL examples include scene recognition and object recognition [22]; depth estimation and scene parsing [68]; facial expression recognition and landmark detection [25]; and vision and language [49]. However, it remains unknown whether the same holds true for image restoration tasks. Although there should be some similarity among them, different types of degradation seem to be somewhat orthogonal to each other. In fact, the aforementioned studies on restoration from combined degradations attempt to deal with different degradation factors by adaptively selecting different networks [79] or different operations [58] depending on the degradation factors in input images.

Squeeze-and-Excitation Mechanism for Attention

Attention mechanisms have been developed and employed to solve various computer vision problems [64, 39, 1, 28, 58]. Hu et al. proposed the squeeze-and-excitation (SE) block, which produces and applies channel-wise weights on an input tensor [28]. This block has been successfully applied to various tasks such as classification [28], super-resolution [85], and single-view depth estimation [26]. A number of later studies [43, 29, 67, 27, 18] aim at improving the SE block. Woo et al. propose to use channel-wise and spatial attention weights [67]. Hu et al. study how to efficiently combine an SE block and a ResNet module [29]. Gao et al. propose to use the correlation between the activations of each pair of channels to generate attention [18]. Hu et al. improve the SE block by replacing global average pooling with a pooling operation with trainable parameters [27].

3 Problem Formulation

Figure 2: Examples of compensating images predicted by CNNs trained for motion blur removal, haze removal, and rain-streak removal. The images in the second column are normalized for better visibility.

Traditionally, the problem of image restoration is formulated as an inverse problem, where the forward process of image formation is modeled and utilized. For example, an image suffering from non-uniform motion blur is modeled [34] as

$I^B = k(M) \ast I^S + n, \qquad (1)$

where $I^B$ is the blurred image; $I^S$ is the sharp image; $k(M)$ is the blur kernel determined by a motion field $M$; and $n$ is noise. The image of a scene with haze degradation is modeled [80] as

$I(x) = J(x)\,t(x) + A\,(1 - t(x)), \qquad (2)$

where $I$ is the hazy image of a scene; $J$ is its clean image; $t$ is a transmission map; and $A$ is the global atmospheric light. An image having rain-streak degradation is modeled [40] as

$O = B + \sum_{i=1}^{s} R_i, \qquad (3)$

where $O$, $B$, and the $R_i$'s denote an image with rain streaks, the clean image, and different rain-streak effects, respectively. The case of $s > 1$ can account for an accumulation of multiple effects.

Using these models, the problem is formulated as the minimization of an objective function measuring the difference between the input degraded image and its model given as above. It is minimized with respect to the unknown clean image along with some other unknowns (e.g., blur kernels). It is also common practice to incorporate a natural image prior as a regularizer in the minimization. Such image priors are shared among image restoration tasks for different types of degradation.
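Written generically in our own notation (the exact data terms and priors differ from study to study), this classical formulation reads

$\hat{I} = \operatorname*{arg\,min}_{I,\,\theta} \; \left\| I_d - \mathcal{M}(I;\theta) \right\|^2 + \lambda\, R(I),$

where $\mathcal{M}$ is one of the forward models (1)-(3), $\theta$ collects its other unknowns (e.g., blur kernels), $I_d$ is the degraded input, and $R$ is a natural image prior such as the total variation, weighted by $\lambda$.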

On the other hand, the recent trend is to use CNNs to directly predict the clean image from its degraded version. This approach does not use any image prior explicitly. A successful method of applying a CNN to these tasks [34, 16, 44] is to make it predict a compensating image such that adding it to the input image yields a clean image, i.e., the difference from the input image to the true clean image. This method can be applied to any image degradation type. Figure 2 illustrates a few examples. This is formally written as

$\hat{I} = I_d + f_d(I_d), \qquad (4)$

where $\hat{I}$ is the clean image we want to predict; $f_d$ is a CNN designed and trained for degradation type $d$; and $I_d$ is an input image having degradation type $d$.

Our goal is to develop a universal network that can deal with various degradation types. Although the ultimate goal would be a monolithic CNN with a single input and a single output, it is very hard to design and train such a CNN to achieve high performance. Taking the above formulation of predicting the compensating component into account, we instead consider networks having multiple heads for different degradation types, as shown in Fig. 1:

$\hat{I} = I_d + g_d(h(I_d)), \qquad (5)$

where $g_d$ is a head, which we call a decoder, for degradation type $d$, and $h$ is an encoder shared by all the degradation types.
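In code, this multi-head formulation can be sketched as follows (a minimal PyTorch sketch in our own notation; the encoder and decoders are whatever modules implement $h$ and the $g_d$'s, and the task keys are illustrative):

```python
import torch.nn as nn

class MultiHeadRestorer(nn.Module):
    """Sketch of Eq. (5): a shared encoder h and one decoder g_d per
    degradation type; the network predicts a compensating image that is
    added back to the input."""

    def __init__(self, encoder: nn.Module, decoders: dict):
        super().__init__()
        self.encoder = encoder                    # h, shared by all tasks
        self.decoders = nn.ModuleDict(decoders)   # e.g. {"blur": g_blur, "haze": g_haze, ...}

    def forward(self, x, task: str):
        # \hat{I} = I_d + g_d(h(I_d))
        return x + self.decoders[task](self.encoder(x))
```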

4 Universal Networks for Image Restoration

In this section, we describe the design of networks that are applicable to different types of degradation. We make two improvements to the dual residual networks proposed by Liu et al. [44], intending to enhance their representation capacity. One is an improvement of the attention mechanism and the other is a new design of the structure of the base blocks. We explain these two improvements in turn, followed by a description of the overall network architecture. For the decoders $g_d$, we use a stack of PixelShuffle [57] modules with convolutional operations.

4.1 Improved Attention Mechanism

Figure 3: Absolute spatial derivatives of images in the (a) vertical and (b) horizontal directions. (The values in the three color channels are summed together.) The values at the bottom are computed by applying (6) to all three color channels.

An attention mechanism is employed in the dual residual networks. It is the channel-wise attention originally developed for object recognition in the study of squeeze-and-excitation (SE) networks [28], which has since been widely used for many other tasks. An SE block computes and applies attention weights on the channels of the input feature map. To determine the weight on each channel, it first computes the average of the activation values of each channel; these averages are then converted by two fully-connected layers with ReLU and sigmoid activation functions to generate the channel-wise weights. The aggregation of activation values is equivalent to global average pooling.

We enhance this attention mechanism by incorporating a different aggregation method of channel activations. Our idea is to use different statistics of the channel activation values in addition to their averages. For this, we choose to use the (absolute) spatial derivatives of the channel activation values. More specifically, denoting the activation value at spatial position $(i,j)$ of channel $c$ by $x_{i,j,c}$, we calculate

$t_c = \lambda \left( \frac{1}{N_v} \sum_{i,j} \lvert x_{i+1,j,c} - x_{i,j,c} \rvert + \frac{1}{N_h} \sum_{i,j} \lvert x_{i,j+1,c} - x_{i,j,c} \rvert \right), \qquad (6)$

where $N_v$ and $N_h$ denote the numbers of values in the vertical and horizontal derivative maps, respectively, and $\lambda$ ($=3$ for our experiments) is a scalar to enhance the derivative values. This is also known as a version of the total variation [55], which has been used as a regularization term for various image processing tasks; a notable example is classical image denoising, where the total variation helps to obtain a smoother solution while preserving edges.

Figure 3 shows how the absolute spatial derivatives behave for different inputs, using input images (instead of intermediate-layer features) as examples. It is observed that they give different responses for clean and degraded images of the same scenes.

Figure 4 illustrates the proposed attention mechanism. We compute the global average and the total variation of the activation values of each channel and input them into the same pipeline as the SE block to generate attention weights over the channels. This mechanism is built into a ResNet module, as shown in Fig. 5. We will show the effectiveness of this design through experiments including ablation tests.
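For concreteness, a minimal PyTorch sketch of this mechanism follows, building on the SEBlock sketch above. The TV statistic implements (6); how the two statistics are merged before the fully-connected pipeline (concatenation here) and the reduction ratio are our assumptions, so consult Fig. 4 for the actual design:

```python
import torch
import torch.nn as nn

class ImprovedSEAttention(nn.Module):
    """Channel attention from two per-channel statistics: the global average
    (GAP, as in the SE block) and the total variation (TV) of Eq. (6)."""

    def __init__(self, channels, reduction=16, lam=3.0):
        super().__init__()
        self.lam = lam                             # scalar enhancing derivatives (=3 in the text)
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        gap = x.mean(dim=(2, 3))                   # global average pooling -> (B, C)
        dv = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean(dim=(2, 3))  # vertical derivatives
        dh = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean(dim=(2, 3))  # horizontal derivatives
        tv = self.lam * (dv + dh)                  # Eq. (6), channel-wise
        w = self.fc(torch.cat([gap, tv], dim=1))   # same FC pipeline as the SE block
        return x * w.view(x.size(0), -1, 1, 1)
```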

Figure 4: The proposed attention mechanism improving the SE block. It generates channel-wise attention weights from the global average pooling (the same as the standard SE block) and the total variation (TV) of each channel's activation values.
Figure 5: The improved SE-ResNet Module, which incorporates the improved attention mechanism into a ResNet module.

4.2 Improved Design of a Dual Residual Block

Figure 6: The proposed basic block (DuRB-M) used for building our network.

The design of the dual residual networks [44] aims at making maximum use of paired operations that are believed to be suited to image restoration tasks. The choice of the paired operations is arbitrary, and four choices are suggested depending on the type of degradation. We focus on two of them, in both of which the first operation is an up-convolution. Specifically, one is the pair of an up-convolution (i.e., up-sampling followed by convolution) and a simple convolution. The block employing this pair is named DuRB-U and is applied to motion blur removal. (See the upper panel of Fig. 6.) The other is the pair of an up-convolution and an SE block. This block is named DuRB-US and is applied to haze removal.

In this paper, we aim to develop universal networks that can deal with motion blur, haze, and more. Toward this end, we propose a new design of the block structure, which we call DuRB-M. The idea is to integrate the above two designs (i.e., DuRB-U and -US). To be specific, while keeping the same up-convolution for the first operation, we employ parallel computation of the second operations of DuRB-U and -US, i.e., a convolution and an SE block, as the second operation of the new block design; see the lower panel of Fig. 6. The output maps of the two operations are merged by concatenation in the channel dimension, followed by a convolution to adjust the number of channels. We also replace the ResNet module in the original DuRB structure with the aforementioned improved SE-ResNet module with the enhanced attention mechanism, as shown in Fig. 5.
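The following shape-preserving PyTorch sketch, reusing the SEBlock and ImprovedSEAttention sketches above, illustrates this structure. The real block up-samples inside the first operation and restores the resolution later; that rescaling is elided here so the two residual additions line up, and the kernel sizes and the exact dual-residual wiring are our assumptions based on Fig. 6 and our reading of Liu et al. [44]:

```python
import torch
import torch.nn as nn

class DuRBM(nn.Module):
    """Sketch of DuRB-M: an improved SE-ResNet module, an up-convolution as the
    first paired operation, and a fused second operation (plain convolution in
    parallel with an SE block, concatenated and reduced by a 1x1 convolution)."""

    def __init__(self, channels):
        super().__init__()
        self.res_module = nn.Sequential(           # improved SE-ResNet module (Fig. 5)
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            ImprovedSEAttention(channels),
        )
        self.op1 = nn.Conv2d(channels, channels, 3, padding=1)       # stands in for up-convolution
        self.op2_conv = nn.Conv2d(channels, channels, 3, padding=1)  # fused second op, conv branch
        self.op2_se = SEBlock(channels)                              # fused second op, SE branch
        self.fuse = nn.Conv2d(2 * channels, channels, 1)             # 1x1 conv adjusting channels
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, res):
        u = self.op1(x + self.res_module(x)) + res       # first residual path
        y = torch.cat([self.op2_conv(u), self.op2_se(u)], dim=1)
        y = self.relu(self.fuse(y) + x)                  # second residual path
        return y, u                                      # u is passed on to the next block
```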

Figure 7: Architecture of the encoder and the task-specific decoder for each task. The symbols in the figure denote a convolutional layer, a ReLU layer, and a hyperbolic tangent function, respectively; up-sampling in the decoder is implemented by convolution followed by PixelShuffle [57]. The encoder has 68 weight layers (each DuRB-M has 13 weight layers) and each decoder has 5 weight layers.
| motion blur removal | haze removal | rain-streak removal | JPEG artifacts removal (q=10) |
|---|---|---|---|
| Zhang [82]: 29.19 / 0.93 | Li [36]: 19.06 / 0.85 | Fu [16]: 30.92 / 0.89 | Dong [14]: 28.98 / 0.82 |
| Nah [48]: 29.23 / 0.92 | Cai [6]: 21.14 / 0.85 | Li [40]: 32.48 / 0.91 | Chen [10]: 29.15 / 0.81 |
| Liu [44]: 29.90 / 0.91 | Ren [53]: 22.30 / 0.88 | Li [38]: 33.16 / 0.92 | Zhang [83]: 29.19 / 0.81 |
| Tao [61]: 30.26 / 0.93 | Liu [44]: 32.12 / 0.98 | Liu [44]: 33.21 / 0.93 | Zhang [86]: 29.63 / 0.82 |
| Ours(3): 30.18 / 0.92 | Ours(3): 33.90 / 0.98 | Ours(3): 32.76 / 0.92 | Ours(3): - / - |
| Ours(4): 30.17 / 0.92 | Ours(4): 34.16 / 0.98 | Ours(4): 32.87 / 0.92 | Ours(4): 28.20 / 0.83 |

Table 1: Comparison with state-of-the-art methods in terms of accuracy (PSNR/SSIM) on four different tasks. Ours(3) and Ours(4) are the proposed network trained on three and four tasks, respectively. The best result is shown in bold and the second best is underlined. The value with the superscript is considered to be an error.
Figure 8: Experimental designs of an overall network consisting of the encoder $h$ and the task-specific decoders $g_1$, $g_2$, and $g_3$; each inserted block is a DuRB-M block.

4.3 Overall Design of the Universal Network

As mentioned earlier, our network consists of a shared encoder and multiple decoders. Figure 7 shows the overall design. To train it on multiple tasks, we use one decoder per task. Each decoder $g_d$ starts with two sets of up-sampling plus convolution (implemented by PixelShuffle [57]) and ReLU, in this order, followed by a convolution with a hyperbolic tangent activation function. All the convolution layers of the decoder employ the same kernel size; the number of channels is 96 for the first two conv. layers and 48 for the last one. We use the same design for all the decoders for the different tasks. As they have learnable weights in their convolution layers, they behave differently after training. The encoder $h$ starts with three convolution layers with ReLU activation, followed by a stack of the proposed DuRB-M's. The 2nd and 3rd convolution layers use stride 2, and thus the input image is down-scaled to 1/4 of its original size when input to the first DuRB-M. Note that there is a skip connection from the output of the second ReLU to the first DuRB-M. Other details of the encoder are given in the supplementary material.

5 Experimental Results

We conduct experiments to evaluate the proposed method. We choose three tasks, i.e., motion blur removal, haze removal, and rain-streak removal, for the main experiments (i.e., detailed architectural design search, ablation study, etc.); we additionally use JPEG compression noise removal for performance evaluation.

5.1 Experimental Configuration

5.1.1 Datasets

In our experiments, we choose the dataset(s) for each task that are the most widely used in recent studies. We use the GoPro dataset [48] for motion blur removal. It consists of 2,103 and 1,111 non-overlapping training (GoPro-train) and test (GoPro-test) pairs of blurred and sharp images, respectively. We use the RESIDE dataset [37] for haze removal, which consists of 13,990 samples of indoor scenes and a few test subsets. Following [44] and [53], we use the subset SOTS (Synthetic Objective Testing Set), which contains 500 indoor scene samples, for evaluation. For both the training and test subsets of RESIDE, synthetic hazy effects are generated using (2). We use the DID-MDN dataset [81] for rain-streak removal. It consists of 12,000 training pairs and 1,200 test pairs of a clear image and a synthetic rainy image. Rain-streak effects for an image in this dataset are made using Photoshop. As for JPEG compression noise removal, we use the training subset (800 images) of the DIV2K dataset [62] and the LIVE1 dataset (29 images) for training and testing our networks, following the study of the state-of-the-art method [86]. An additional setting for our experiment is that we re-sized the original DIV2K images to half their original size for training our network, for computational efficiency.

5.1.2 Training on Multiple Tasks

We jointly train our network on multiple tasks in the following way. We split the training into a series of cycles, in each of which the network is trained on a combination of all the tasks. To be specific, each cycle contains one or more randomly chosen minibatches per task. Considering that the loss decreases at different speeds for different tasks, we choose the number of minibatches per cycle accordingly: one for haze removal, one for rain-streak removal, and three for motion blur removal. The minibatches are randomly drawn from the training split of each dataset and packed in random order into a row to form the cycle. We then iterate this cycle until convergence. Each input image in a batch is obtained by randomly cropping a 256x256 region from an original training image or its re-sized version (for the GoPro and DIV2K datasets). Details of training are given in the supplementary material.
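A sketch of this cycle-based procedure, using the MultiHeadRestorer sketch from Sec. 3, is given below. The 3:1:1 minibatch ratio follows the text; the data loaders, the device handling, and the choice of an L1 loss are our assumptions:

```python
import random
import torch

# Minibatches per task in one cycle: three for motion blur, one each for the others.
CYCLE = [("blur", 3), ("haze", 1), ("rain", 1)]

def train_cycles(model, loaders, optimizer, num_cycles, device="cuda"):
    """One cycle = a randomly ordered row of minibatches drawn per task."""
    criterion = torch.nn.L1Loss()                  # loss choice is an assumption
    iters = {task: iter(loaders[task]) for task, _ in CYCLE}
    for _ in range(num_cycles):
        plan = [task for task, n in CYCLE for _ in range(n)]
        random.shuffle(plan)                       # pack the minibatches in random order
        for task in plan:
            try:
                degraded, clean = next(iters[task])
            except StopIteration:                  # restart an exhausted loader
                iters[task] = iter(loaders[task])
                degraded, clean = next(iters[task])
            restored = model(degraded.to(device), task)
            loss = criterion(restored, clean.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```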

5.2 Extended Design of the Entire Network

We have found in our preliminary experiments that while the architecture of a shared encoder followed by multiple task-specific decoders (illustrated in Fig. 8 (a)) shows strong performance, more general architectures achieve even better performance. By general architectures, we mean those having additional DuRB-M blocks on top of the shared encoder, to which some of the task-specific decoders are connected, as shown in Fig. 8 (b)-(d). To explore which structure performs well, we consider the four architectural designs shown in Fig. 8.

When we have three target tasks, there are thirteen ways of assigning them to the four architectures, which are listed in Table 2. In the table, we use B, H, and R to denote motion blur removal, haze removal, and rain-streak removal, respectively; an alignment string may also contain an inserted DuRB-M block, whose position distinguishes otherwise identical alignments. We report the performance of each of the thirteen designs in terms of accuracy (PSNR/SSIM) averaged over the three tasks. In this experiment, we trained each of the thirteen networks for the same number of iterations under the same experimental setting. It is seen that RBH performs the best. This indicates that the differences among the three tasks cannot be fully absorbed by the single encoder, and implies that there is a hierarchy, probably associated with their difference in difficulty, in this order.

| Alignment | PSNR / SSIM |
|---|---|
| BHR | 30.61 / 0.9244 |
| BRH | 30.42 / 0.9227 |
| HBR | 30.57 / 0.9244 |
| HRB | 30.34 / 0.9219 |
| RBH | 30.79 / 0.9246 |
| RHB | 30.29 / 0.9201 |

| Alignment | PSNR / SSIM |
|---|---|
| BHR | 30.24 / 0.9217 |
| BHR | 30.36 / 0.9216 |
| HRB | 30.02 / 0.9189 |
| HRB | 30.41 / 0.9236 |
| RBH | 30.47 / 0.9219 |
| RBH | 30.38 / 0.9228 |
| RBH | 30.32 / 0.9206 |

Table 2: Comparison of performance of thirteen different designs of the network for three tasks (B: motion-blur removal, H: haze removal, and R: rain-streak removal). The values (PSNR/SSIM) are accuracies averaged over the three tasks.

5.3 Comparison with the State-of-the-art

We compare the proposed method with the state-of-the-art methods for the different tasks. Table 1 shows the results. We choose the best four published methods (ranked by PSNR) for each task. "Ours(3)" indicates our method trained on the three tasks. We report here the accuracy values obtained for the best architecture found in the experiment explained above. It is observed that the proposed method outperforms the others for haze removal and achieves performance comparable to the previous methods for the other tasks.

Table 1 also shows the results ("Ours(4)") obtained by simultaneously training our network on four tasks, i.e., the three tasks plus JPEG compression noise removal. It is seen that the addition of this task contributes to further improvements on haze removal and rain-streak removal. In this experiment, we search for a good design for the four tasks; to do this at a modest computational cost, we considered only the insertion of either a new decoder alone or a new decoder with a DuRB-M into the above three-task network. The best performer is the one with an additional DuRB-M inserted between B and H, i.e., RBJH, where J denotes JPEG compression noise removal. The results for "Ours(4)" in Table 1 are obtained with this design. A few examples of the output images for the four tasks are shown in Fig. 10.

Figure 9: Visualization of activations of selected intermediate layers of our network trained on the three tasks. Each feature space is mapped to two-dimensional space by t-SNE [63]. The results of lower to higher layers are shown from left to right. (a) Output of the first ReLU layer. (b) The second ReLU layer. (c)-(e) Output of the first, third, and fifth DuRB-M blocks.

5.4 Ablation Study

| TV | GAP | Fusion | motion blur removal | haze removal | rain-streak removal |
|---|---|---|---|---|---|
|  |  |  | 28.25 / 0.8724 | 26.58 / 0.9646 | 31.32 / 0.8976 |
|  |  |  | 28.49 / 0.8809 | 29.15 / 0.9699 | 31.99 / 0.9003 |
|  |  |  | 28.11 / 0.8703 | 29.66 / 0.9721 | 32.01 / 0.9015 |
|  |  |  | 28.51 / 0.8811 | 29.06 / 0.9719 | 32.03 / 0.9014 |
| Ours |  |  | 28.91 / 0.8911 | 31.32 / 0.9778 | 32.15 / 0.9048 |
| Single task (same net): motion blur |  |  | 27.87 / 0.8653 | - / - | - / - |
| Single task (same net): haze |  |  | - / - | 28.58 / 0.9685 | - / - |
| Single task (same net): rain-streak |  |  | - / - | - / - | 32.60 / 0.9139 |

Table 3: Results of an ablation test with different components and with/without multi-task learning.
Figure 10: Examples of qualitative results for the four image restoration tasks.

The proposed method consists of several components. To evaluate the contribution of each component, we conducted two ablation tests. It is noteworthy that we used a small batch size (3) for the ablation tests to fit the GPU memory limitation and attain computational efficiency, whereas we used a larger batch size (32) for the results in Table 1 to maximize performance.

5.4.1 Improved Attention Mechanism and Dual Residual Block

In the first test, we evaluate the contributions of three components: i) the channel-wise total variation and ii) the channel-wise average pooling, both of which are used for the attention computation, and iii) the improved design of the dual residual block that employs fused operations. Table 3 shows the results on the three tasks when performing multi-task learning on the same three tasks. It is first seen that using all three components yields the maximum accuracy for each task. It is also observed that each component has a certain positive impact on the resulting accuracy, although the impact differs across degradation types.

5.4.2 Impact of Multi-Task Learning

To evaluate the effectiveness of multi-task learning, we train the proposed network (the best design for the three tasks explained in Sec. 5.2) on each of the three tasks separately. In this experiment, we simply neglect the decoders $g_d$ other than the one for the target task. The results are shown at the bottom of Table 3. It is seen that multi-task learning improves performance on motion blur removal and haze removal by a good margin, while it decreases performance on rain-streak removal.

5.5 Visualization of Internal Features

To explore how different types of degradation are learned and represented inside our network, we visualize the internal activations of the best three-task model (i.e., RBH in Table 2). We input each sample of the test splits of the datasets for these tasks to the trained network. We then apply t-SNE [63] to the sets of activations of selected intermediate layers to map them to two-dimensional space. Figure 9 shows the results. It is observed that the images having different degradation factors are quickly disentangled as they propagate through the layers and are clearly separated at the final output of the encoder. This demonstrates that the proposed network is able to learn different image restoration tasks with a single network. It also implies that the proposed network clearly distinguishes different types of degradation and represents them differently inside its layers. Further analysis is left for future studies.
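A sketch of this visualization procedure is given below; the per-image global average pooling of the feature maps (so that images of different sizes map to fixed-length vectors) and the test_loaders structure are our assumptions:

```python
import numpy as np
import torch
from sklearn.manifold import TSNE

feats, labels = [], []
model.eval()
with torch.no_grad():
    for task, loader in test_loaders.items():      # one test loader per task (hypothetical)
        for degraded, _ in loader:
            z = model.encoder(degraded.cuda())     # activations of a selected layer, (B, C, H, W)
            pooled = z.mean(dim=(2, 3)).cpu().numpy()  # pool each feature map to a vector
            feats.extend(pooled)
            labels.extend([task] * len(pooled))

emb = TSNE(n_components=2, perplexity=30).fit_transform(np.stack(feats))  # (N, 2) points
# Scatter-plot emb colored by labels to reproduce a figure like Fig. 9.
```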

6 Summary and Conclusion

We have considered the design of a single network that is applicable to multiple image restoration tasks. Our experiments demonstrate that with the proposed network design, multi-task learning of diverse restoration tasks (motion-blur, haze, rain-streak, and JPEG compression removal) is feasible and also effective, in the sense that it brings about a synergetic performance improvement. This may be the first counterexample to the current research trend, in which networks are designed differently for individual degradation factors. We hope that our results will ignite further study of better designs of neural networks for image restoration tasks.

References

  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proc. Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
  • [2] S. D. Babacan, R. Molina, M. N. Do, and A. K. Katsaggelos. Bayesian blind deconvolution with general sparse image priors. In Proc. European Conference on Computer Vision, pages 341–355, 2012.
  • [3] D. Berman, T. Treibitz, and S. Avidan. Non-local image dehazing. In Proc. Conference on Computer Vision and Pattern Recognition, pages 1674–1682, 2016.
  • [4] D. Berman, T. Treibitz, and S. Avidan. Air-light estimation using haze-lines. In Proc. International Conference on Computational Photography, pages 115–123, 2017.
  • [5] K. Bredies and M. Holler. A total variation-based jpeg decompression model. SIAM Journal on Imaging Sciences, 5(1):366–393, 2012.
  • [6] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao. Dehazenet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25(11):5187–5198, 2016.
  • [7] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • [8] H. Chang, M. K. Ng, and T. Zeng. Reducing artifacts in jpeg decompression via a learned dictionary. IEEE Transactions on Signal Processing, 62(3):718–728, 2014.
  • [9] F. Chen, L. Zhang, and H. Yu. External patch prior guided internal clustering for image denoising. In Proc. International Conference on Computer Vision, pages 603–611, 2015.
  • [10] Y. Chen and T. Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 2017.
  • [11] Y.-L. Chen and C.-T. Hsu. A generalized low-rank appearance model for spatio-temporally correlated rain streaks. In Proc. International Conference on Computer Vision, pages 1968–1975, 2013.
  • [12] H. Cheong, E. Chae, E. Lee, G. Jo, and J. Paik. Fast image restoration for spatially varying defocus blur of imaging sensor. Sensors, 15(1):880–898, 2015.
  • [13] K. Dabov, A. Foi, V. Katkovnik, and K. O. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
  • [14] C. Dong, Y. Deng, C. Change Loy, and X. Tang. Compression artifacts reduction by a deep convolutional network. In Proc. International Conference on Computer Vision, pages 576–584, 2015.
  • [15] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. ACM Transactions on Graphics, 25(3):787–794, 2006.
  • [16] X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley. Removing rain from single images via a deep detail network. Proc. Conference on Computer Vision and Pattern Recognition, pages 1715–1723, 2017.
  • [17] L. Galteri, L. Seidenari, M. Bertini, and A. D. Bimbo. Deep generative adversarial compression artifact removal. In Proc. International Conference on Computer Vision, pages 4836–4845, 2017.
  • [18] Z. Gao, J. Xie, Q. Wang, and P. Li. Global second-order pooling convolutional networks. arXiv preprint arXiv:1811.12006, 2018.
  • [19] D. Gong, J. Yang, L. Liu, Y. Zhang, I. D. Reid, C. Shen, A. van den Hengel, and Q. Shi. From motion blur to motion flow: A deep learning solution for removing heterogeneous motion blur. In Proc. Conference on Computer Vision and Pattern Recognition, pages 3806–3815, 2017.
  • [20] M. González, J. Preciozzi, P. Musé, and A. Almansa. Joint denoising and decompression using cnn regularization. In Proc. Conference on Computer Vision and Pattern Recognition Workshops, pages 2598–2601, 2018.
  • [21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. Conference on Neural Information Processing Systems, pages 2672–2680, 2014.
  • [22] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask r-cnn. In Proc. International Conference on Computer Vision, pages 2980–2988, 2017.
  • [23] K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2011.
  • [24] H. Noda and M. Niimi. Local MAP estimation for quality improvement of compressed color images. Pattern Recognition, 44(4):788–793, 2011.
  • [25] G. Hu, L. Liu, Y. Yuan, Z. Yu, Y. Hua, Z. Zhang, F. Shen, L. Shao, T. Hospedales, N. Robertson, and Y. Yang. Deep multi-task learning to recognise subtle facial expressions of mental states. In Proc. European Conference on Computer Vision, pages 106–123, 2018.
  • [26] J. Hu, M. Ozay, Y. Zhang, and T. Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. Proc. Winter Conference on Applications of Computer Vision, pages 1043–1051, 2018.
  • [27] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In Proc. Conference on Neural Information Processing Systems, pages 9423–9433, 2018.
  • [28] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proc. Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
  • [29] Y. Hu, G. Wen, M. Luo, D. Dai, and J. Ma. Competitive inner-imaging squeeze and excitation for residual network. arXiv preprint arXiv:1807.08920, 2018.
  • [30] L.-W. Kang, C.-W. Lin, and Y.-H. Fu. Automatic single-image-based rain streaks removal via image decomposition. IEEE Transactions on Image Processing, 21(4):1742–1755, 2012.
  • [31] M. R. Kanthan and S. N. Sujatha. Rain drop detection and removal using k-means clustering. In Proc. International Conference on Computational Intelligence and Computing Research, pages 1–5, 2015.
  • [32] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proc. Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
  • [33] I. Kligvasser, T. R. Shaham, and T. Michaeli. xunit: Learning a spatial activation function for efficient image restoration. In Proc. Conference on Computer Vision and Pattern Recognition, pages 2433–2442, 2018.
  • [34] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks. In Proc. Conference on Computer Vision and Pattern Recognition, pages 8183–8192, 2018.
  • [35] H. Kurihata, T. S. Takahashi, I. Ide, Y. Mekada, H. Murase, Y. Tamatsu, and T. Miyahara. Rainy weather recognition from in-vehicle camera images for driver assistance. In Proc. Intelligent Vehicles Symposium, pages 205–210, 2005.
  • [36] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng. Aod-net: All-in-one dehazing network. In Proc. International Conference on Computer Vision, pages 4780–4788, 2017.
  • [37] B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng, and Z. Wang. Reside: A benchmark for single image dehazing. arXiv preprint arXiv:1712.04143, 2017.
  • [38] G. Li, X. He, W. Zhang, H. Chang, L. Dong, and L. Lin. Non-locally enhanced encoder-decoder network for single image de-raining. In Proc. ACM International Conference on Multimedia, pages 1056–1064, 2018.
  • [39] K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu. Tell me where to look: Guided attention inference network. In Proc. Conference on Computer Vision and Pattern Recognition, pages 9215–9223, 2018.
  • [40] X. Li, J. Wu, Z. Lin, H. Liu, and H. Zha. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proc. European Conference on Computer Vision, pages 262–277, 2018.
  • [41] Y. Li, R. T. Tan, X. Guo, j. Lu, and M. S. Brown. Rain streak removal using layer priors. In Proc. Conference on Computer Vision and Pattern Recognition, pages 2736–2744, 2016.
  • [42] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proc. conference on computer vision and pattern recognition workshops, pages 136–144, 2017.
  • [43] D. Linsley, D. Scheibler, S. Eberhardt, and T. Serre. Global-and-local attention networks for visual recognition. arXiv preprint arXiv:1805.08819, 2018.
  • [44] X. Liu, M. Suganuma, Z. Sun, and T. Okatani. Dual residual networks leveraging the potential of paired operations for image restoration. In Proc. Conference on Computer Vision and Pattern Recognition, 2019.
  • [45] Y. Liu, Z. Wang, H. Jin, and I. Wassell. Multi-task adversarial network for disentangled feature learning. In Proc. Conference on Computer Vision and Pattern Recognition, pages 3743–3751, 2018.
  • [46] G. Meng, Y. Wang, J. Duan, S. Xiang, and C. Pan. Efficient image dehazing with boundary constraint and contextual regularization. In Proc. International Conference on Computer Vision, pages 617–624, 2013.
  • [47] J. Miskin and D. J. MacKay. Ensemble learning for blind image separation and deconvolution. In Advances in Independent Component Analysis, pages 123–141, 2000.
  • [48] S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proc. Conference on Computer Vision and Pattern Recognition, pages 257–265, 2017.
  • [49] D. Nguyen and T. Okatani. Multi-task learning of hierarchical vision-language representation. In Proc. Conference on Computer Vision and Pattern Recognition, 2019.
  • [50] J. Pan, Z. Hu, Z. Su, and M.-H. Yang. Deblurring text images via l0-regularized intensity and gradient prior. In Proc. Conference on Computer Vision and Pattern Recognition, pages 2901–2908, 2014.
  • [51] R. Qian, R. T. Tan, W. Yang, J. Su, and J. Liu. Attentive generative adversarial network for raindrop removal from a single image. In Proc. Conference on Computer Vision and Pattern Recognition, pages 2482–2491, 2018.
  • [52] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang. Single image dehazing via multi-scale convolutional neural networks. In Proc. European Conference on Computer Vision, pages 154–169, 2016.
  • [53] W. Ren, L. Ma, J. Zhang, J. Pan, X. Cao, W. Liu, and M.-H. Yang. Gated fusion network for single image dehazing. In Proc. Conference on Computer Vision and Pattern Recognition, pages 3253–3261, 2018.
  • [54] S. Roth and M. J. Black. Fields of experts: a framework for learning image priors. In Proc. Conference on Computer Vision and Pattern Recognition, pages 860–867, 2005.
  • [55] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
  • [56] Q. Shan, J. Jia, and A. Agarwala. High-quality motion deblurring from a single image. ACM Transactions on Graphics, 27(3):73, 2008.
  • [57] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proc. Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
  • [58] M. Suganuma, X. Liu, and T. Okatani. Attention-based adaptive selection of operations for image restoration in the presence of unknown combined distortions. In Proc. Conference on Computer Vision and Pattern Recognition, 2019.
  • [59] M. Suganuma, M. Ozay, and T. Okatani. Exploiting the potential of standard convolutional autoencoders for image restoration by evolutionary search. In Proc. International Conference on Machine Learning, pages 4778–4787, 2018.
  • [60] J. Sun, W. Cao, Z. Xu, and J. Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Proc. Conference on Computer Vision and Pattern Recognition, pages 769–777, 2015.
  • [61] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia. Scale-recurrent network for deep image deblurring. In Proc. Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
  • [62] R. Timofte et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In Proc. Conference on Computer Vision and Pattern Recognition Workshops, pages 1110–1121, 2017.
  • [63] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • [64] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In Proc. Conference on Computer Vision and Pattern Recognition, pages 6450–6458, 2017.
  • [65] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang. D3: Deep dual-domain based fast restoration of jpeg-compressed images. In Proc. Conference on Computer Vision and Pattern Recognition, pages 2764–2772, 2016.
  • [66] P. Wieschollek, M. Hirsch, B. Schölkopf, and H. P. Lensch. Learning blind motion deblurring. In Proc. International Conference on Computer Vision, pages 231–240, 2017.
  • [67] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon. Cbam: Convolutional block attention module. In Proc. European Conference on Computer Vision, pages 3–19, 2018.
  • [68] D. Xu, W. Ouyang, X. Wang, and N. Sebe. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proc. Conference on Computer Vision and Pattern Recognition, pages 675–684, 2018.
  • [69] J. Xu, L. Zhang, and D. Zhang. A trilateral weighted sparse coding scheme for real-world image denoising. In Proc. European Conference on Computer Vision, pages 21–38, 2018.
  • [70] J. Xu, L. Zhang, W. Zuo, D. Zhang, and X. Feng. Patch group based nonlocal self-similarity prior learning for image denoising. In Proc. International Conference on Computer Vision, pages 244–252, 2015.
  • [71] L. Xu and J. Jia. Two-phase kernel estimation for robust motion deblurring. In Proc. European Conference on Computer Vision, pages 157–170, 2010.
  • [72] L. Xu, S. Zheng, and J. Jia. Unnatural L0 sparse representation for natural image deblurring. In Proc. Conference on Computer Vision and Pattern Recognition, pages 1107–1114, 2013.
  • [73] X. Xu, D. Sun, J. Pan, Y. Zhang, H. Pfister, and M. Yang. Learning to super-resolve blurry face and text images. In Proc. International Conference on Computer Vision, pages 251–260, 2017.
  • [74] Y. Yan, C. Xu, D. Cai, and J. J. Corso. Weakly supervised actor-action segmentation via robust multi-task ranking. In Proc. Conference on Computer Vision and Pattern Recognition, pages 1022–1031, 2017.
  • [75] D. Yang and J. Sun. Proximal dehaze-net: A prior learning-based deep network for single image dehazing. In Proc. European Conference on Computer Vision, pages 729–746, 2018.
  • [76] Y. Yang, N. P. Galatsanos, and A. K. Katsaggelos. Regularized reconstruction to reduce blocking artifacts of block discrete cosine transform compressed images. IEEE Transactions on Circuits and Systems for Video Technology, 3(6):421–432, 1993.
  • [77] J. Yoo, S.-h. Lee, and N. Kwak. Image restoration by estimating frequency distribution of local patches. In Proc. Conference on Computer Vision and Pattern Recognition, pages 6684–6692, 2018.
  • [78] S. You, R. T. Tan, R. Kawakami, Y. Mukaigawa, and K. Ikeuchi. Adherent raindrop modeling, detection and removal in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1721–1733, 2016.
  • [79] K. Yu, C. Dong, L. Lin, and C. C. Loy. Crafting a toolchain for image restoration by deep reinforcement learning. In Proc. Conference on Computer Vision and Pattern Recognition, pages 2443–2452, 2018.
  • [80] H. Zhang and V. M. Patel. Densely connected pyramid dehazing network. In Proc. Conference on Computer Vision and Pattern Recognition, pages 3194–3203, 2018.
  • [81] H. Zhang and V. M. Patel. Density-aware single image de-raining using a multi-stream dense network. In Proc. Conference on Computer Vision and Pattern Recognition, pages 695–704, 2018.
  • [82] J. Zhang, J. Pan, J. Ren, Y. Song, L. Bao, R. W. Lau, and M.-H. Yang. Dynamic scene deblurring using spatially variant recurrent neural networks. In Proc. Conference on Computer Vision and Pattern Recognition, pages 2521–2529, 2018.
  • [83] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
  • [84] X. Zhang, H. Dong, Z. Hu, W. Lai, F. Wang, and M. Yang. Gated fusion network for joint image deblurring and super-resolution. arXiv preprint arXiv:1807.10806, 2018.
  • [85] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. Image super-resolution using very deep residual channel attention networks. In Proc. European Conference on Computer Vision, pages 294–310, 2018.
  • [86] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu. Residual non-local attention networks for image restoration. In Proc. International Conference on Learning Representations, 2019.