IR-NAS: Neural Architecture Search for Image Restoration

09/18/2019 ∙ by Haokui Zhang, et al. ∙ The University of Adelaide

Recently, neural architecture search (NAS) methods have attracted much attention and outperformed manually designed architectures on a few high-level vision tasks. In this paper, we propose IR-NAS, an effort towards employing NAS to automatically design effective neural network architectures for low-level image restoration, and apply it to two such tasks: image denoising and image de-raining. IR-NAS adopts a flexible hierarchical search space, including inner cell structures and outer layer widths. The proposed IR-NAS is both memory and computation efficient: it takes only 6 hours to search on a single GPU and saves memory by sharing cell weights across different feature levels. We evaluate the effectiveness of our proposed IR-NAS on three different datasets, including an additive white Gaussian noise dataset BSD500, a realistic noise dataset SIM1800 and a challenging de-raining dataset Rain800. Results show that the architectures found by IR-NAS have fewer parameters and enjoy a faster inference speed, while achieving highly competitive performance compared with state-of-the-art methods. We also present an analysis of the architectures found by NAS.


1 Introduction

As an important category of tasks in computer vision, image restoration aims to estimate the underlying image from its degraded measurements, which is known to be an ill-posed inverse problem. Depending on the type of degradation, image restoration can be categorized into different sub-problems, such as denoising, de-raining, inpainting and super-resolution. In this work, we mainly focus on image denoising and image de-raining, although the NAS method developed here is general and can be applied to most image restoration problems. Most recent work on image restoration has shifted to deep learning, which significantly outperforms conventional methods. Nonetheless, discovering state-of-the-art neural network architectures requires substantial effort. Recently, there has been a growing interest in developing algorithmic solutions to automate the manual process of architecture design. Architectures found automatically have achieved highly competitive performance in high-level vision tasks such as image classification

[1], object detection [2] and semantic segmentation [3, 4]. In this paper, we propose IR-NAS, a neural architecture search (NAS) algorithm for low-level image restoration tasks, including both image denoising and image de-raining. Our main contributions can be summarized in the following three aspects.

  1. We propose an efficient neural architecture search method for low-level image restoration, termed IR-NAS, and apply it to image denoising and de-raining tasks.

    The proposed IR-NAS is able to search for both inner cell structures and outer layer widths. It is also memory and computation efficient, taking only 6 hours to search on a single GPU and using only one third of the memory of Auto-Deeplab [3] when searching for the same structure.

  2. We apply our proposed IR-NAS to two denoising datasets with different types of noise and one widely used de-raining dataset for evaluation. Experiments show that the IR-NAS designed networks outperform state-of-the-art algorithms on all three datasets with fewer parameters and faster inference speed.

  3. We conduct comparison experiments to analyse the networks found by our NAS algorithm in terms of their internal structure, offering insights into the architectures found by NAS.

2 Related Work

Low-level image processing. Due to the popularity of convolutional neural networks (CNNs), image restoration algorithms, including image denoising and image de-raining, have achieved a significant performance boost. For image denoising, DnCNN [5] predicts the residual present in the image instead of the denoised image, showing good performance. FFDNet [6] attempts to address spatially varying noise by appending noise level maps to the input of DnCNN. NLRN [7] incorporates non-local operations into a recurrent neural network (RNN) for image restoration. N3Net [8] formulates a differentiable version of nearest neighbor search to further improve DnCNN. Recently, several algorithms have focused on denoising real noisy images, since many existing denoisers tend to overfit additive white Gaussian noise (AWGN) and generalize poorly to real-world noisy images, which are contaminated with noise far more sophisticated than additive Gaussian noise. CBDNet [9] uses a simulated camera pipeline to supplement real training data. Similar work in [10] proposes a camera simulator that aims to accurately simulate the degradation and noise transformation performed by camera pipelines.

For image de-raining, Fu et al. [11] introduce deep learning methods that model rain streaks as the residual between the input and output of the network in an end-to-end fashion. Yang et al. [12] design a deep recurrent dilated network to jointly detect and remove rain streaks. Li et al. [13] design a scale-aware multi-stage recurrent network that consists of several parallel sub-networks to estimate rain streaks of different sizes and densities individually. Zhang et al. [14] propose to classify rain density to guide the rain removal step. Li et al. [15] propose a recurrent squeeze-and-excitation based context aggregation network that removes rain streaks through multiple stages.

Network architecture search. Neural architecture search (NAS) aims to replace conventional hand-crafted architecture design with automatic approaches. Early attempts employ evolutionary algorithms (EAs) to optimize the neural architectures and parameters: the best architecture may be obtained by iteratively mutating a population of candidate architectures [16]. An alternative to EA is to use reinforcement learning (RL) techniques, such as policy gradients [17] and Q-learning [18], to train a recurrent neural network that acts as a meta-controller, generating sequences that encode potential architectures by exploring a predefined search space. One drawback is that these EA and RL based methods tend to require a large amount of computation. Recently, speed-up techniques such as hyper-networks [19], network morphism [20] and shared weights [21] have led to a substantial reduction of the search cost.

Our work is most closely related to DARTS [22], ProxylessNAS [23] and Auto-Deeplab [3]. DARTS is based on a continuous relaxation of the architecture representation, allowing efficient search of the cell architecture via gradient descent, and has achieved competitive performance. We extend its search space to include cell widths by layering multiple candidate paths. Another optimization-based NAS method with widths in its search space is ProxylessNAS. However, it is limited to discovering sequential structures and choosing kernel widths within manually designed blocks (inverted bottlenecks [24]). By introducing multiple paths of different widths, the search space of our IR-NAS resembles that of Auto-Deeplab. The two major differences are: 1) to retain high-resolution feature maps, we do not downsample the features but rely on automatically selected dilated convolutions and deformable convolutions to adapt the receptive field; 2) we share the cell weights across different paths, which makes our supernet three times more memory efficient than its Auto-Deeplab counterpart.

Another relevant work is FALSR [25], which is proposed for the super-resolution task. FALSR combines RL and EA in its controller and takes about 3 days on 8 Tesla V100 GPUs to find the final architecture. Our proposed IR-NAS takes only 6 hours to search on a single GPU; compared with FALSR, IR-NAS is about 96 times faster in searching.

3 Our Proposed Approach

Following [22, 23], we employ a gradient-based architecture search strategy in IR-NAS: we search for a computation cell as the basic block and then build the final architecture by stacking the searched blocks with different widths. Differing from these methods, IR-NAS has a more flexible search space and is able to search for both the cell structures and the cell widths. In this section, we first introduce how to search the architecture of the cells; then we explain how to determine the widths of the cells. Finally, we present our search strategy and the loss we design.

3.1 Cell Architecture Search

Figure 1: Cell architecture search. Left: supercell that contains all possible layer types. Right: the cell architecture search result, a compact cell, where each node only keeps the two most important inputs and each input is connected to current node with a selected operation.

For cell architecture search, we employ the continuous relaxation strategy proposed in [22]. More specifically, we build a supercell that integrates all possible layer types, as shown on the left side of Figure 1. This supercell is a directed acyclic graph containing a sequence of nodes. In Figure 1, we show only three nodes for clarity of exposition.

We denote the supercell in layer $\ell$ as $C_\ell$, which takes the outputs of the previous cell and the cell before that as inputs and outputs one tensor $H_\ell$. Inside $C_\ell$, each node takes the two inputs of the current cell and the outputs of all previous nodes as input and outputs one tensor. Taking the $j$-th node in $C_\ell$ as an example, its output $s_j$ is calculated as:

$$s_j = \sum_{x \in \mathcal{I}_j} \bar{O}(x), \qquad \mathcal{I}_j = \{H_{\ell-1}, H_{\ell-2}, s_1, \dots, s_{j-1}\}, \tag{1}$$

where $\mathcal{I}_j$ is the input set of node $j$, and $H_{\ell-1}$ and $H_{\ell-2}$ are the outputs of the cells in layers $\ell-1$ and $\ell-2$, respectively. Let $\mathcal{O}$ denote the set of possible layer types. To make the search space continuous, each $\bar{O}$ is computed as a continuous relaxation:

$$\bar{O}(x) = \sum_{k=1}^{|\mathcal{O}|} \frac{\exp(\alpha_k)}{\sum_{k'=1}^{|\mathcal{O}|} \exp(\alpha_{k'})}\, O_k(x), \tag{2}$$

where the $O_k \in \mathcal{O}$ are the possible layer types and $\alpha_k$ denotes the weight of operator $O_k$. Following several recent image restoration networks [7, 26, 15], we do not reduce the spatial resolution of the input. To preserve pixel-level information for low-level image processing, we choose not to downsample the features but instead rely on operations with adaptive receptive fields, such as dilated convolutions and deformable convolutions. In this paper, we provide the following 6 types of basic operators:

  • conv: convolution;

  • sep: separable convolution;

  • dil: convolution with dilation rate 2;

  • def: deformable convolution V2 [27];

  • skip: skip connection;

  • none: no connection and return zero.

Each convolution operation starts with a ReLU activation layer and is followed by a batch normalization layer.
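For concreteness, the PyTorch sketch below shows one way the operator set and the relaxed mixed operator of Eq. (2) could be implemented. It is an illustrative reimplementation under our own assumptions, not the authors' released code; in particular, deformable convolution V2 is not part of core PyTorch, so the 'def' entry is replaced here by a plain convolution placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c, kernel_size=3, dilation=1):
    # ReLU -> Conv -> BatchNorm, as described above.
    pad = dilation * (kernel_size - 1) // 2
    return nn.Sequential(
        nn.ReLU(inplace=False),
        nn.Conv2d(c, c, kernel_size, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(c),
    )

def sep_conv_block(c, kernel_size=3):
    # Depthwise separable convolution with the same ReLU/BN wrapping.
    pad = (kernel_size - 1) // 2
    return nn.Sequential(
        nn.ReLU(inplace=False),
        nn.Conv2d(c, c, kernel_size, padding=pad, groups=c, bias=False),
        nn.Conv2d(c, c, 1, bias=False),
        nn.BatchNorm2d(c),
    )

# Candidate operators; 'def' (deformable conv V2) is approximated by a plain
# convolution purely to keep the sketch self-contained.
OPS = {
    'conv': lambda c: conv_block(c),
    'sep':  lambda c: sep_conv_block(c),
    'dil':  lambda c: conv_block(c, dilation=2),
    'def':  lambda c: conv_block(c),          # placeholder for deformable conv V2
    'skip': lambda c: nn.Identity(),
    'none': lambda c: nn.Identity(),          # handled explicitly in forward
}

class MixedOp(nn.Module):
    """Continuous relaxation of one edge: softmax-weighted sum of all operators."""
    def __init__(self, channels):
        super().__init__()
        self.names = list(OPS.keys())
        self.ops = nn.ModuleList(OPS[name](channels) for name in self.names)

    def forward(self, x, alpha):
        # alpha: tensor of shape (num_ops,), the architecture weights of this edge.
        weights = F.softmax(alpha, dim=0)
        out = 0
        for w, name, op in zip(weights, self.names, self.ops):
            out = out + w * (torch.zeros_like(x) if name == 'none' else op(x))
        return out

# Minimal usage example.
if __name__ == '__main__':
    edge = MixedOp(channels=10)
    alpha = torch.zeros(len(edge.names), requires_grad=True)  # learnable weights
    y = edge(torch.randn(2, 10, 64, 64), alpha)
    print(y.shape)  # torch.Size([2, 10, 64, 64])
```

In the full supercell, one such mixed operator sits on every edge between a candidate input and a node, each with its own row of architecture weights.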

The output $H_\ell$ of the supercell is the concatenation of the outputs of all its nodes:

$$H_\ell = \mathrm{concat}(s_1, s_2, \dots, s_N), \tag{3}$$

where $N$ is the number of nodes. In summary, the task of cell architecture search is to learn the continuous weights $\alpha$, which are updated via gradient descent. After the supercell is trained, for each node we rank the corresponding inputs according to their $\alpha$ values, keep the top two and remove the rest, obtaining the compact cell shown on the right of Figure 1.
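The discretization step could be implemented as in the sketch below: for every node, each incoming edge is scored by the largest softmax weight among its non-'none' operators, and the two strongest edges are kept together with their best operators. This scoring rule is our reading of the description above rather than the released code.

```python
import torch
import torch.nn.functional as F

OP_NAMES = ['conv', 'sep', 'dil', 'def', 'skip', 'none']

def derive_compact_cell(alpha, num_inputs=2):
    """Discretize the edges feeding one node of a trained supercell.

    alpha: tensor of shape (num_edges, num_ops), one row per candidate input.
    Returns, per edge, a tuple (kept?, best operator name).
    """
    probs = F.softmax(alpha, dim=-1)
    # Strength of an edge = best probability over real (non-'none') operators.
    real = probs[:, :OP_NAMES.index('none')]
    strength, best_op = real.max(dim=-1)
    keep = torch.zeros(alpha.shape[0], dtype=torch.bool)
    keep[strength.topk(num_inputs).indices] = True
    return [(bool(k), OP_NAMES[int(o)]) for k, o in zip(keep, best_op)]

# Example: a node with 4 candidate inputs; keep the two strongest edges.
alpha = torch.randn(4, len(OP_NAMES))
print(derive_compact_cell(alpha))
```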

3.2 Cell Width Search

Figure 2: Cell width search. Left: the network-level search space, a supernet containing several supercells of different widths in each layer. Right: the final architecture derived from the supernet, a compact network that consists of compact cells and keeps only one cell in each layer.

In the previous section, we presented the main idea of cell architecture search, which designs the specific architectures inside cells. However, the overall network is built by stacking several cells with different widths, so searching only the cell architectures is not sufficient.

In this section, we introduce the inter-cell search space, which determines the widths of the cells in different layers. Similarly, we build a supernet that contains several supercells of different widths in each layer. As illustrated on the left of Figure 2, the supernet mainly consists of three parts:

1) the start part, which consists of the input layer and two convolution layers;

2) the middle part, which contains several layers, each with three supercells of different widths;

3) the end part, which concatenates the outputs of the middle part and feeds them to a convolution layer to generate the output.

The task of cell width search is to select a proper width for each layer of the middle part.

Our supernet provides three paths of cells with different widths. For each layer, the supernet decides whether to double the width, keep the previous width, or halve it. After searching, only one cell is kept at each layer. The continuous relaxation strategy described in the cell architecture search section is reused for the inter-cell search.

At each layer $\ell$, there are three cells $C_\ell^1$, $C_\ell^2$ and $C_\ell^3$ with widths $W$, $2W$ and $4W$, where the basic width $W$ is set to 10 during the search phase. The output feature of layer $\ell$ is

$$F_\ell = \sum_{i=1}^{3} \beta_\ell^i H_\ell^i, \tag{4}$$

where $H_\ell^i$ is the output of $C_\ell^i$ and $\beta_\ell^i$ is its weight. The channel width of $H_\ell^i$ is $N$ times the width of $C_\ell^i$, where $N$ is the number of nodes in the cell.

Each cell $C_\ell^i$ is connected to the three cells in the previous layer and the three cells two layers before. We first process the outputs of those cells with a convolution layer so that their widths match the input of $C_\ell^i$. The output of the $i$-th cell in layer $\ell$ is then computed as

$$H_\ell^i = C_\ell^i\Big(\sum_{j=1}^{3}\beta_{\ell-1}^j \tilde{H}_{\ell-1}^j,\; \sum_{j=1}^{3}\beta_{\ell-2}^j \tilde{H}_{\ell-2}^j\Big), \tag{5}$$

where $\beta_\ell^j$ is the weight of $C_\ell^j$ and $\tilde{H}$ denotes the width-matched features: the three outputs of a preceding layer are combined according to their weights and fed to $C_\ell^i$ as input. After the supernet is trained, we select the width for each layer according to the $\beta$ values.

Note the similarity of this design with Auto-Deeplab, which uses it to select feature strides for image segmentation. In Auto-Deeplab, however, the outputs from the three different levels are first processed by separate cells with different sets of weights before being summed into the output:

$$H_\ell^i = \sum_{j=1}^{3}\beta_{\ell-1}^j\, C_\ell^{i,j}\big(\tilde{H}_{\ell-1}^j, F_{\ell-2}\big). \tag{6}$$

By reusing the cell weights across the three paths, we save three times the memory consumption in the supernet and can afford a deeper and wider supernet for more accurate approximations.

Note that, unlike in cell architecture search, we cannot simply rank the cells of different widths according to their β values and keep the top one. In cell width search, the channel widths of the outputs of different cells in the same layer differ considerably, so the strategy adopted in cell architecture search may cause the widths of adjacent layers in the final network to change drastically, which harms efficiency, as explained in [28]. Instead, we view the β values as probabilities and use the Viterbi algorithm to select the path with the maximum probability as the final result. In addition, an ASPP module is added after the last cell in the final architecture, as illustrated on the right of Figure 2.
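To make the decoding step concrete, the sketch below runs a standard Viterbi dynamic program over the per-layer β values. The candidate widths (10, 20, 40) and the restriction that adjacent layers may only change width by one level are illustrative assumptions on our part, and β is assumed to be already softmax-normalized per layer.

```python
import numpy as np

def select_width_path(beta, widths=(10, 20, 40)):
    """Viterbi decoding of the most probable width per layer.

    beta: array of shape (num_layers, num_widths) with per-layer probabilities
          of choosing each candidate width (assumed already normalized).
    Returns the chosen width for every layer.
    """
    num_layers, num_widths = beta.shape
    # Only allow transitions between adjacent width levels (x0.5, x1, x2).
    allowed = np.array([[abs(i - j) <= 1 for j in range(num_widths)]
                        for i in range(num_widths)], dtype=float)

    log_beta = np.log(beta + 1e-12)
    log_trans = np.log(allowed + 1e-12)

    score = np.full((num_layers, num_widths), -np.inf)
    back = np.zeros((num_layers, num_widths), dtype=int)
    score[0] = log_beta[0]
    for l in range(1, num_layers):
        for j in range(num_widths):
            cand = score[l - 1] + log_trans[:, j]
            back[l, j] = int(np.argmax(cand))
            score[l, j] = cand[back[l, j]] + log_beta[l, j]

    # Trace back the best path.
    path = [int(np.argmax(score[-1]))]
    for l in range(num_layers - 1, 0, -1):
        path.append(int(back[l, path[-1]]))
    path.reverse()
    return [widths[i] for i in path]

# Example: 4 layers, 3 candidate widths per layer.
beta = np.array([[0.2, 0.5, 0.3],
                 [0.1, 0.3, 0.6],
                 [0.2, 0.2, 0.6],
                 [0.1, 0.2, 0.7]])
print(select_width_path(beta))   # [20, 40, 40, 40]
```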

3.3 Searching with Gradient Descent

In terms of the optimization method, our proposed IR-NAS belongs to the family of differentiable architecture search: the search process is an optimization process. For image denoising and image de-raining, the two most widely used evaluation metrics are PSNR and SSIM [29]. Inspired by this, we design the following loss for optimizing the supernet:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda\, \mathcal{L}_{\mathrm{SSIM}}, \tag{7}$$

where

$$\mathcal{L}_{\mathrm{MSE}} = \big\| f(x) - y \big\|_2^2, \qquad \mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}\big(f(x), y\big), \tag{8}$$

where $x$ and $y$ denote the input image and the corresponding ground truth, $\mathcal{L}_{\mathrm{SSIM}}$ is the loss term we design to enforce visible structure in the result, $f(\cdot)$ is the supernet, $\mathrm{SSIM}(\cdot, \cdot)$ is the structural similarity [29], and $\lambda$ is a weighting coefficient, empirically set to 0.5 in our experiments. When optimizing the supernet with gradient descent, we split the training set into three disjoint parts: Train W, Train A and Train V. Train W and Train A are used to optimize the weights of the supernet (the kernels of the convolution layers) and the weights of the different layer types and cell widths ($\alpha$ and $\beta$), respectively. Train V is used to evaluate the performance of the trained supernet. More details are given in the implementation details section.
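A minimal PyTorch sketch of a loss of this form is shown below. It assumes inputs scaled to [0, 1] and uses a simplified single-scale SSIM with a uniform window, which only approximates the reference SSIM of [29]; λ is set to 0.5 as in the text.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window_size=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-scale SSIM with a uniform averaging window.

    Assumes inputs in [0, 1]; an approximation kept short for the sketch.
    """
    pad = window_size // 2
    channels = x.shape[1]
    kernel = torch.ones(channels, 1, window_size, window_size,
                        device=x.device) / window_size ** 2
    mu_x = F.conv2d(x, kernel, padding=pad, groups=channels)
    mu_y = F.conv2d(y, kernel, padding=pad, groups=channels)
    sigma_x = F.conv2d(x * x, kernel, padding=pad, groups=channels) - mu_x ** 2
    sigma_y = F.conv2d(y * y, kernel, padding=pad, groups=channels) - mu_y ** 2
    sigma_xy = F.conv2d(x * y, kernel, padding=pad, groups=channels) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return ssim_map.mean()

def restoration_loss(pred, target, lam=0.5):
    """MSE term plus an SSIM-based structure term (cf. Eqs. (7)-(8))."""
    mse = F.mse_loss(pred, target)
    return mse + lam * (1.0 - ssim(pred, target))

# Usage with dummy tensors standing in for the supernet output and ground truth.
pred = torch.rand(2, 3, 64, 64, requires_grad=True)
target = torch.rand(2, 3, 64, 64)
loss = restoration_loss(pred, target)
loss.backward()
```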

4 Experiments

4.1 Datasets and Implementation Details

Datasets For the denoising experiments, we use two datasets. The first one is BSD500 [30]. Following [31, 32, 7], we use the combination of 200 images from the training set and 100 images from the validation set as our training set, and test on the 200 images of the test set. On this dataset, we generate noisy images by adding white Gaussian noise to the clean images with σ = 30, 50 and 70.
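For clarity, the snippet below shows one way such noisy inputs could be synthesized; it assumes images stored as floating-point arrays in [0, 255], which is an assumption on our part.

```python
import numpy as np

def add_awgn(clean, sigma, seed=None):
    """Add additive white Gaussian noise with standard deviation `sigma`
    to a clean image given as a float array in [0, 255]."""
    rng = np.random.default_rng(seed)
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    return np.clip(noisy, 0.0, 255.0)

# One noisy copy per noise level used in the experiments.
clean = np.random.uniform(0, 255, size=(321, 481, 3))
noisy_images = {sigma: add_awgn(clean, sigma, seed=0) for sigma in (30, 50, 70)}
```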

The second dataset is SIM1800, built by ourselves. As the additive white Gaussian noise model cannot accurately reproduce real-world noise, we build this new denoising dataset using the camera pipeline simulation method proposed in [10]. SIM1800 contains 1600 training samples and 212 test samples. Specifically, we use the camera pipeline simulation method to add noise to 25k patches extracted from the MIT-Adobe5k dataset [33], then manually pick the 1812 patches with the most realistic visual effects, and finally randomly select 1600 patches as the training set and keep the rest as the test set.

For the de-raining experiments, we compare the IR-NAS designed models with previous works on Rain800, an outdoor synthetic dataset of 800 rainy images that consists of 700 training image pairs and 100 test image pairs.

Figure 3: Denoising experiments on BSD500.
Figure 4: Denoising experiments on SIM1800.
Methods       # parameters (M)   time cost (s)   σ=30            σ=50            σ=70
                                                 PSNR    SSIM    PSNR    SSIM    PSNR    SSIM
BM3D [34]     -                  -               27.31   0.7755  25.06   0.6831  23.82   0.6240
WNNM [35]     -                  -               27.48   0.7807  25.26   0.6928  23.95   0.3460
RED [31]      0.99               -               27.95   0.8056  25.75   0.7167  24.37   0.6551
MemNet [32]   4.32               -               28.04   0.8053  25.86   0.7202  24.53   0.6608
NLRN [7]      0.98               10411.49        28.15   0.8423  25.93   0.7214  24.58   0.6614
N3Net [26]    0.68               121.11          28.66   0.8220  26.50   0.7490  25.18   0.6960
IR-NAS        0.63               83.25           29.14   0.8403  26.77   0.7635  25.48   0.7129
Table 1: Denoising experiments. Comparison with state-of-the-art methods on BSD500 for σ = 30, 50 and 70. Our results are shown in the last row. Time cost is the GPU-seconds required for inference on the 200 test images of BSD500 with one GTX 980 graphics card.
Methods PSNR SSIM
NLRN [7] 27.53 0.8081
N3Net [26] 27.62 0.8191
IR-NAS 27.23 0.8326
Table 2: Denoising results on SIM1800.

Search settings The supernet that we build for image denoising consists of 4 cells, each with 5 nodes. The supernet that we build for image de-raining contains 3 cells, each with 4 nodes. The basic widths of both supernets are set to 10 during the search.

When designing networks for image denoising, we conduct the architecture search on BSD500 and apply the networks found by IR-NAS to both denoising datasets. Specifically, we combine the 200 images of the training set and the 100 images of the validation set as the training set, 2% of which are randomly selected to evaluate the performance of the supernet (Train V). The rest are equally divided into two parts: one part is used to update the kernels of the convolution layers (Train W) and the other is used to optimize the architecture parameters (Train A). Similarly, for image de-raining, we search the architecture on the training set of Rain800, which is also split into three parts.

We train the supernet for 100 epochs with a batch size of 12, optimizing the kernel and architecture parameters with two optimizers. For learning the kernels of the convolution layers, we employ an SGD optimizer with momentum 0.9 and weight decay 0.0003; the learning rate decays from 0.025 to 0.001 with a cosine annealing strategy [36]. For learning the architecture parameters, we use the Adam optimizer, where both the learning rate and weight decay are set to 0.001. In the first 20 epochs, we only update the kernel parameters; from epoch 21, we alternately optimize the kernels of the convolution layers and the architecture parameters.
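The alternating scheme could be organized as in the following sketch. It is a schematic outline under our own assumptions — a supernet exposing weight_parameters() and arch_parameters(), and separate loaders for the Train W and Train A splits — not the actual training script.

```python
import torch

def search(supernet, train_w_loader, train_a_loader, criterion,
           epochs=100, warmup_epochs=20, device='cuda'):
    # SGD with cosine annealing for the convolution kernels (Train W).
    w_opt = torch.optim.SGD(supernet.weight_parameters(), lr=0.025,
                            momentum=0.9, weight_decay=3e-4)
    w_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        w_opt, T_max=epochs, eta_min=0.001)
    # Adam for the architecture parameters alpha and beta (Train A).
    a_opt = torch.optim.Adam(supernet.arch_parameters(), lr=0.001,
                             weight_decay=0.001)

    for epoch in range(epochs):
        for (xw, yw), (xa, ya) in zip(train_w_loader, train_a_loader):
            # Step 1: update the kernels on Train W.
            w_opt.zero_grad()
            loss_w = criterion(supernet(xw.to(device)), yw.to(device))
            loss_w.backward()
            w_opt.step()

            # Step 2: after the warm-up phase, update alpha/beta on Train A.
            if epoch >= warmup_epochs:
                a_opt.zero_grad()
                loss_a = criterion(supernet(xa.to(device)), ya.to(device))
                loss_a.backward()
                a_opt.step()
        w_sched.step()
```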

During the search training, we randomly crop patches and feed them to the network. During evaluation, we split each image into adjacent patches, feed them to the network, and stitch the corresponding patch outputs back into a whole image. We evaluate the supernet every 1000 iterations and save the architecture with the highest PSNR and SSIM as the result of the architecture search.
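A sketch of such patch-wise inference is given below. The patch size of 64 and the use of non-overlapping tiles with reflection padding are our own assumptions, since the exact tiling scheme is not specified here.

```python
import torch
import torch.nn.functional as F

def infer_by_patches(model, image, patch=64):
    """Run `model` over non-overlapping patches and stitch the outputs.

    image: tensor of shape (1, C, H, W). The image is reflection-padded so
    that H and W become multiples of `patch`, then cropped back at the end.
    """
    _, _, h, w = image.shape
    pad_h = (patch - h % patch) % patch
    pad_w = (patch - w % patch) % patch
    padded = F.pad(image, (0, pad_w, 0, pad_h), mode='reflect')
    out = torch.zeros_like(padded)
    with torch.no_grad():
        for top in range(0, padded.shape[2], patch):
            for left in range(0, padded.shape[3], patch):
                tile = padded[:, :, top:top + patch, left:left + patch]
                out[:, :, top:top + patch, left:left + patch] = model(tile)
    return out[:, :, :h, :w]

# Usage with an identity "model" as a stand-in for the searched network.
restored = infer_by_patches(torch.nn.Identity(), torch.rand(1, 3, 321, 481))
print(restored.shape)  # torch.Size([1, 3, 321, 481])
```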

Training settings For image denoising and image de-raining, we train the final architectures found by IR-NAS with the same strategy. Specifically, we train each network for 600k iterations with the Adam optimizer, where the initial learning rate and batch size are set to 0.05 and 12, respectively. For data augmentation in image denoising, we use random crops, random rotations, and horizontal and vertical flipping; patches are randomly cropped from the input images. For fair comparison, following [7], we train a different model for each noise level on BSD500. For image de-raining, we use random crops and horizontal flipping for augmentation.

4.2 Comparisons with State-of-the-Art Results

In this section, we compare the IR-NAS designed networks with a number of recent image denoising and de-raining methods and use PSNR and SSIM to quantitatively measure the restoration performance of those methods.
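As a reminder of how the quantitative comparison is computed, a minimal PSNR helper is sketched below (assuming a peak value of 255); SSIM follows the standard definition of [29] and is omitted here.

```python
import numpy as np

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(peak ** 2 / mse)

# Example: PSNR of a noisy image against its clean version.
clean = np.random.uniform(0, 255, size=(64, 64))
noisy = np.clip(clean + np.random.normal(0, 30, size=clean.shape), 0, 255)
print(round(psnr(noisy, clean), 2))
```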

Image denoising results For the image denoising experiments, we compare our IR-NAS designed network with several published image denoising methods, including BM3D [34], WNNM [35], RED [31], MemNet [32], NLRN [7] and N3Net [26]. The comparison results on BSD500 and SIM1800 are listed in Table 1 and Table 2, respectively. Figures 3 and 4 show the visual results.

Table 1 shows that NLRN, N3Net and IR-NAS beat the other models by a clear margin. Among the top three methods, our proposed IR-NAS achieves the best performance when σ is set to 50 and 70. When the noise level is set to 30, the SSIM of NLRN is slightly higher (by 0.002) than that of IR-NAS, but its PSNR is much lower (by nearly 1 dB).

Broadly speaking, our IR-NAS achieves better performance than others. In addition, compared with NLRN and N3Net, the network designed by our IR-NAS has fewer parameters and faster inference speed. As listed in Table 1, the IR-NAS designed network contains 0.63M parameters, which is 92.65% that of N3Net and 64.29% that of NLRN. Compared with N3Net, the IR-NAS designed network reduces the inference time on the test set of BSD500 by 31.26%. Figure 3 shows that the network designed by our IR-NAS achieves the best visual effect.

As NLRN and N3Net beat the other denoising models by a large margin on BSD500, we compare the network designed by IR-NAS with NLRN and N3Net on SIM1800. Table 2 lists the results, which show that the SSIM of the IR-NAS designed network is much higher than that of NLRN and N3Net, while its PSNR is slightly lower. In summary, the performance of the IR-NAS designed network is competitive with that of NLRN and N3Net on SIM1800. Figure 4 shows a visual comparison.

Figure 5: De-raining experiments on Rain800.
Methods PSNR SSIM
DSC [37] 18.56 0.5996
LP [38] 20.46 0.7297
DetailsNet [11] 21.16 0.7320
JORDER [12] 22.24 0.7763
JORDER-R [12] 22.29 0.7922
SCAN [15] 23.45 0.8112
RESCAN [15] 24.09 0.8410
IR-NAS 26.31 0.8685
Table 3: De-raining results on Rain800. With a GTX 980 graphics card, RESCAN and IR-NAS cost 44.35 and 21.80 GPU-seconds, respectively, for inference on the Rain800 test set.
Figure 6: Architecture analysis. ‘Conv’, ‘def’ and ‘dil’ denote conventional, deformable and dilated convolutions. ‘Skip’ is skip connection. (a) our IR-NAS designed networks for denoising and de-raining; (b) the detailed structures inside designed cells; (c) modified cells, where conventional convolution layers are replaced by deformable convolution layers; (d) modified cells, where the connection relationships between different nodes inside cells are changed.

Image de-raining results On Rain800, we compare the de-raining network found by IR-NAS with seven previous methods. The results are listed in Table 3 and shown in Figure 5. As shown in Table 3, the de-raining network designed by IR-NAS performs much better than the others: compared with RESCAN, PSNR and SSIM are improved by 2.22 dB and 0.0275, respectively. In addition, the inference speed of the IR-NAS designed de-raining network is 2.03 times that of RESCAN.

4.3 Analysis

In this section, we analyse the architectures designed by IR-NAS. Figure 6 (a) and (b) show the searched networks: (a) shows the search results at the outer network level and (b) shows the details inside the cells. From Figure 6 (a) and (b), we can see that:

  1. In both the denoising and de-raining networks found by IR-NAS, the cell closest to the output layer has the largest width (number of channels), which is consistent with previous manually designed networks.

  2. Generally speaking, with the same widths, deformable convolution is more flexible and powerful than other convolution operations. Even so, inside cells, instead of connecting all the nodes with the powerful deformable convolution, IR-NAS connects different nodes with different types of operators, such as conventional convolution, dilated convolution and skip connection.

    We believe that these results prove that IR-NAS is able to select proper operators.

  3. Separable convolutions are not included in the searched results. We conjecture that this is because we do not limit FLOPs or the number of parameters during the search. Interestingly, the networks found by IR-NAS still have fewer parameters than other manually designed models.

From Figure 6 (b), we can see that the networks designed by IR-NAS consist of many fragmented branches, which might be the main reason why the designed networks perform better than previous denoising and de-raining models.

As explained in [28], a fragmented structure is beneficial for accuracy. Here we verify whether IR-NAS improves accuracy by designing a proper architecture or by simply integrating various branch structures and convolution operations. We modify the architectures found by IR-NAS in two different ways and compare the modified architectures with the unmodified ones. The first modification replaces the conventional convolutions in the searched architectures with deformable convolutions, as shown in Figure 6 (c); since deformable convolution is more flexible than conventional convolution, this replacement should, in principle, increase the capacity of the networks. The second modification changes the connections between nodes inside each cell, as shown in Figure 6 (d); it aims to verify whether the connections built by IR-NAS are indeed appropriate.

Methods                         Image denoising (σ=30)    Image de-raining
                                PSNR     SSIM             PSNR     SSIM
IR-NAS                          29.14    0.8403           26.31    0.8685
IR-NAS, modified convolutions   29.06    0.8398           21.48    0.6754
IR-NAS, modified connections    29.13    0.8400           25.69    0.8416
Table 4: Architecture analysis.
Methods      σ=30            σ=50            σ=70
             PSNR    SSIM    PSNR    SSIM    PSNR    SSIM
NLRN         28.15   0.8423  25.93   0.7214  24.58   0.6614
N3Net        28.66   0.8220  26.50   0.7490  25.18   0.6960
IR-NAS*      29.03   0.8254  26.77   0.7498  25.42   0.6962
IR-NAS**     29.14   0.8403  26.77   0.7635  25.48   0.7129

Table 5: Ablation study on BSD500. IR-NAS* is trained with the single MSE loss and IR-NAS** is trained with the combination of the MSE and SSIM-based losses.

Methods      PSNR     SSIM
SCAN         23.45    0.8112
RESCAN       24.09    0.8410
IR-NAS*      25.51    0.8494
IR-NAS**     26.31    0.8685
Table 6: Ablation study on Rain800.

The modified parts are marked in red in Figure 6 (c) and (d). Beyond the two proposed modifications, we also modified other parts for comparison experiments; limited by space, we only show two examples for each task in this paper. The comparison results are listed in Table 4, where the two modifications are denoted as 'modified convolutions' and 'modified connections'.

From Table 4, we can see that both modifications reduce the accuracy on image denoising and de-raining. Especially for image de-raining, replacing the convolution operations reduces PSNR and SSIM by 4.83 dB and 0.1931, respectively, and changing the connections decreases PSNR and SSIM to 25.69 dB and 0.8416, respectively.

In summary, we can draw one conclusion from these comparisons: IR-NAS does design a proper structure and select proper convolution operations, rather than simply assembling a complex network from various branch structures and convolution operations.

4.4 Ablation Study

In this section, we analyze how our designed loss term improves image restoration results. We implement two baselines: (1) IR-NAS* trained with the single MSE loss, and (2) IR-NAS** trained with the combination of the MSE and SSIM-based losses. Table 5 shows the denoising results of these two models and two recent state-of-the-art denoisers, NLRN and N3Net, on BSD500. We also conduct experiments on the de-raining dataset Rain800, summarized in Table 6. Both IR-NAS* and IR-NAS** outperform the other competitive models on both datasets, while IR-NAS** trained with the combination loss brings a gain over IR-NAS* trained with the single loss; in particular, for image de-raining the gain is remarkable, about 0.8 dB in PSNR and 0.02 in SSIM. In short, our designed loss is useful for improving both PSNR and SSIM.

5 Conclusion

In this work, we have proposed IR-NAS, a neural architecture search method for low-level image restoration tasks. The proposed IR-NAS is both memory and computation efficient: it takes only 6 hours to search on a single GPU and uses only one third of the memory of Auto-Deeplab to search for the same structure. We have applied IR-NAS to three datasets, two denoising datasets and one de-raining dataset, and achieved highly competitive or better performance compared with previous state-of-the-art methods, with fewer parameters and faster inference speed.

We have also introduced an SSIM-based loss term, which proves very useful for improving the two evaluation metrics, PSNR and SSIM. In future work, we plan to improve the efficiency of IR-NAS and apply it to more low-level tasks.

References