Single Image Deraining via Scale-space Invariant Attention Neural Network

06/09/2020
by   Bo Pang, et al.
Harbin Institute of Technology

Image enhancement to remove the degradation caused by rain artifacts plays a critical role in outdoor visual computing systems. In this paper, we tackle the notion of scale, which deals with visual changes in the appearance of rain streaks with respect to their distance from the camera. Specifically, we revisit multi-scale representation through scale-space theory, and propose to represent the multi-scale correlation in the convolutional feature domain, which is more compact and robust than its counterpart in the pixel domain. Moreover, to improve the modeling ability of the network, we do not treat the extracted multi-scale features equally, but design a novel scale-space invariant attention mechanism that helps the network focus on the most informative parts of the features. In this way, we summarize the most activated presence of feature maps as the salient features. Extensive experimental results on synthetic and real rainy scenes demonstrate the superior performance of our scheme over state-of-the-art methods.



I Introduction

In practical applications, outdoor scene analysis inevitably involves scenarios where images are captured under bad weather conditions, such as rainy days. Such conditions degrade image quality and adversely affect subsequent tasks like object detection [19], recognition [18] and scene analysis [7]. Image enhancement that removes rain streaks is thus useful and necessary for outdoor visual systems, serving as a pre-processing step that helps improve detection and recognition performance.

Deraining has become an active low-level image processing problem, and many works have emerged in recent years, either video-based [10, 14] or single-image based [17, 26]. In this paper, we focus on the line of single image deraining, which can be formulated as an ill-posed inverse problem. Early methods treat rain removal as a signal separation problem, relying on prior models of the background layer and the rain streak layer. However, rain streak artifacts are inherently a kind of signal-dependent noise, whose features intrinsically overlap with those of the background in feature space, making this inverse problem even harder to solve. The progress of deep learning based image restoration lights the path for single image deraining [17, 26, 20]. Such data-driven approaches are able to model more complicated mappings from rainy images to clean ones, and thus achieve much better deraining results than traditional model-based approaches [16, 12, 4].

In real-world scenarios, rain streaks appear at different scales, depending on their distance from the camera. This leads to rain artifacts with varying sizes, background clutter and heavy occlusions, keeping single image deraining a challenging task. Some works try to handle this multi-scale effect in rain modeling. For instance, Fu et al. [1] construct pyramid frameworks to exploit multi-scale knowledge for deraining. Jiang et al. [8] propose to first generate a Gaussian pyramid and then fuse the multi-scale information. These methods explicitly decompose the rain image into different pyramid levels by progressively downsampling in the pixel domain. Yet, we note that this multi-scale representation through an image pyramid is not optimal. The downsampling operation, by discarding a large number of pixels, results in blurring artifacts and resolution reduction, leaving the subsequent fusion procedure insufficiently informed about the key characteristics needed to identify salient features in images.

Fig. 1: The overall framework of our network, which consists of multiple stages. The network takes the rainy image as input, produces intermediate deraining results stage by stage, and outputs the final derained image together with the estimated rain layers. The core module in each stage is Cross-scale Feature Aggregation, in which multi-scale features are extracted and scale-space invariant attention masks are derived. Adjacent stages are connected by LSTM units to propagate information and achieve progressive refinement.

Considering the above limitations of existing methods, in this work we revisit multi-scale representation through scale-space theory, and propose to represent the multi-scale correlation in the convolutional feature domain, which is more compact and robust than representations in the pixel domain. Specifically, we propose a scale-aware deep convolutional neural network for deraining, which includes a multi-scale feature extraction branch coupled with a scale-space invariant attention branch. In the feature branch, we build a multi-scale pyramid in feature space through average pooling operations of various sizes, which efficiently suppress noise and preserve background information. Besides compact representation and robustness to noise, this design brings other benefits, such as invariance to local translation and an enlarged receptive field. In the attention branch, we tailor a scale-space invariant attention mechanism to quantify the importance of the input multi-scale features. We achieve this by building a difference-of-Gaussian (DoG) pyramid in scale-space, which is coupled with the feature branch and able to reveal latent salient features across scales. Finally, an LSTM-based multi-stage refinement strategy is employed to progressively improve the network performance. In a nutshell, our scheme works by learning intermediate attention maps in scale-space that are used to select the most relevant pieces of information from the multi-scale features for separating the background and the rain streaks.

The main contributions of this work are highlighted as follows:

  • We propose a multi-scale correlation representation in feature space for single image deraining, which is more compact and robust than the counterpart representation in the pixel domain.

  • We propose a scale-space invariant attention network built on a DoG pyramid, which can reveal latent salient features across scales.

  • Our scheme achieves the best single image deraining performance to date, consistently outperforming a wide range of state-of-the-art methods on various benchmark datasets.

The rest of the paper is organized as follows: related works are briefly reviewed in Section 2. In Section 3, we introduce the proposed method in detail. In Section 4, we provide extensive experimental results and an ablation study to demonstrate the superior performance of our method.

II Related Work

In this section, we briefly overview existing model-based and deep learning based single image deraining works.

II-A Model-based methods

A rainy image can be modeled as a linear combination of a background layer and a rain streak layer. Based on this, model-based methods conduct deraining by explicitly defining prior models on both layers, and the task of deraining is then cast as a signal separation problem. Luo et al. [16] propose a dictionary learning method that sparsely approximates patches of the two layers with highly discriminative codes over a learned dictionary with a strong mutual-exclusivity property. Li et al. [12] point out that both dictionary learning methods and low-rank structures tend to either leave too many rain streaks in the background image or over-smooth it; they instead propose a prior-based method using a Gaussian mixture model (GMM) that accommodates multiple orientations and scales of rain streaks. Gu et al. [4] combine analysis sparse representation (ASR) and synthesis sparse representation (SSR), both used for sparsity-based image modeling, into a joint convolutional analysis and synthesis (JCAS) sparse representation model. However, these model-based methods cannot well formulate the complex raining process, and are thus insufficient for recovering the background information.
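Written as an equation (the symbols are our shorthand, not notation from the paper), this layer model is:

```latex
% Additive rain model: the observed rainy image O is the sum of the
% clean background layer B and the rain streak layer R; deraining
% recovers B from O, which is ill-posed without priors on B and R.
O = B + R
```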

II-B Deep Learning based methods

For data-driven methods, the intuitive idea is to learn the mapping from the rainy image to the clean background image. However, such a direct solution may lose background information. To better recover the clean image, Fu et al. [2] decompose the image into a base layer and a detail layer: the base layer captures the low-frequency information via low-pass filtering, while a CNN operates on the detail layer to process the high-frequency information; both outputs are enhanced and then combined to obtain the clean image. Motivated by the success of deep residual networks, Fu et al. [3] concentrate on high-frequency details by learning the residual between the rainy image and the clean image. Afterwards, considering the different scales, directions and shapes of rain streaks, Li et al. [11] adopt dilated convolutions to acquire a large receptive field and employ a recurrent structure. Fu et al. [1] construct lightweight pyramid frameworks to exploit multi-scale knowledge for deraining. Yang et al. [26] construct contextualized dilated networks to aggregate context information at multiple scales for learning rain features. Jiang et al. [8] propose to first generate a Gaussian pyramid and then fuse the multi-scale information. In our work, we propose to represent the multi-scale correlation in the convolutional feature domain, which is more compact and robust than that in the pixel domain. Since many modules are stacked in such networks, it becomes difficult to analyze each module's function; Ren et al. [17] therefore provide a simple and strong baseline using an LSTM block and several residual blocks. To better handle real-world rainy images, several methods have also been proposed [21, 6, 28].

Fig. 2: Illustration of rain streak layer estimation. (a) and (c) are the input rainy images, (b) and (d) are the rain streak layers estimated by our network.

III Proposed Method

In this section, we introduce in detail the proposed deep neural network for single image deraining.

III-A Network Architecture

The overall architecture of our proposed deraining network is illustrated in Fig. 1. Our scheme tackles the deraining problem in a multi-stage manner: it processes the input rainy image together with the intermediate deraining result of the previous stage to generate the clean image progressively, with the intermediate result initialized to the rainy input at the first stage. The core module, named cross-scale feature aggregation (CFA), is recursively conducted in each stage with the same network parameters, and is tailored to capture rich cross-scale information of the input data. We connect CFA modules at adjacent stages with convolutional LSTM units that propagate information across the stages. In one CFA module, to estimate the rain streak layer, we learn multi-scale features at three scales and design a scale-space invariant attention network to derive the corresponding importance masks. In this way, we identify salient regions in the input data for the network to focus on, which helps improve the modeling ability of the proposed deraining network. An illustration of rain streak layer estimation is shown in Fig. 2; our scheme estimates the rain streak layer well.

III-B Cross-scale Feature Aggregation

According to scale-space theory [13], real-world objects are inherently multi-scale: they exist as meaningful entities only over certain ranges of scales. This implies that the perception of objects depends on the scale of observation. For images of unknown scenes, it is impossible to know a priori which scales are relevant; the only reasonable approach is to represent the image data at multiple scales [13].

In rain images, the notion of "scale" deals with visual changes in the rain streaks' appearance with respect to their distance from the camera. Inspired by scale-space theory, in the task of single image deraining we learn multi-scale features to capture a rich representation of the image data. A straightforward approach is to build a coarse-to-fine pyramid in the pixel domain, as done in [1] and [8]. However, the image representation in the pixel domain is not compact; the downsampling operation, by decreasing the number of pixels, loses information about the key characteristics needed to identify salient features in images.

Fig. 3: Cross-scale Feature Aggregation

Instead, in our work, we propose to construct the multi-scale representation in the feature domain, which is more compact and robust than that in the pixel domain. Specifically, in the $t$-th stage, we concatenate the input rainy image $y$ and the deraining result $x^{t-1}$ of the previous stage, and feed the result into a convolutional layer to extract the feature map

$$F^{t} = \mathrm{Conv}\left(\mathrm{Cat}\left(y, x^{t-1}\right)\right) \tag{1}$$

where $\mathrm{Cat}(\cdot)$ denotes the concatenation operator and $\mathrm{Conv}(\cdot)$ represents one convolutional layer. Note that we omit the parameters of $\mathrm{Conv}(\cdot)$, which belong to the parameters of the overall network. Then we build a multi-scale pyramid of CNN features through average pooling operations, each followed by one convolutional layer, to obtain the multi-scale features:

$$F_{s}^{t} = \mathrm{Conv}\left(\mathrm{Pool}_{s}\left(F^{t}\right)\right), \quad s \in \{1, 2, 4\} \tag{2}$$

where $\mathrm{Pool}_{s}(\cdot)$ represents the average pooling operator at scale $s$; the three scales use strides of 1, 2 and 4, respectively, so that the receptive field grows with the scale.
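To make the feature branch concrete, the following PyTorch sketch mirrors Eqs. (1)-(2). The channel width, kernel sizes and pooling windows are our assumptions (the text specifies only the strides 1, 2 and 4), and the class name `MultiScaleFeatures` is hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiScaleFeatures(nn.Module):
    """Sketch of Eqs. (1)-(2): one conv on the concatenated input,
    then average pooling at strides 1, 2, 4, each followed by a conv."""
    def __init__(self, channels=32):
        super().__init__()
        # Eq. (1): one conv layer on Cat(y, x^{t-1}); 6 = two RGB images.
        self.head = nn.Conv2d(6, channels, kernel_size=3, padding=1)
        # Eq. (2): assumed pooling windows; stride 1 keeps full resolution.
        self.pools = nn.ModuleList([
            nn.Identity(),               # s = 1
            nn.AvgPool2d(2, stride=2),   # s = 2
            nn.AvgPool2d(4, stride=4),   # s = 4
        ])
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3)])

    def forward(self, y, x_prev):
        f = self.head(torch.cat([y, x_prev], dim=1))               # Eq. (1)
        return [c(p(f)) for p, c in zip(self.pools, self.convs)]   # Eq. (2)
```

Calling `MultiScaleFeatures()(y, x_prev)` yields three feature maps at full, half and quarter resolution.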

To improve the discriminative ability of our network, we do not treat the extracted features equally, but design a novel attention mechanism, coupled with the multi-scale feature extraction, that helps the network focus on parts of the features. In this way, we summarize the most activated presence of feature maps, which are expected to correspond to rain streaks.

As illustrated in Fig. 3, for scale $s$, the proposed scale-space invariant attention network (SIAN) is exploited to derive the importance mask $M_{s}^{t}$, according to which we identify salient changes in the latent CNN features:

$$S_{s}^{t} = M_{s}^{t} \odot F_{s}^{t} \tag{3}$$

where $\odot$ denotes element-wise multiplication. Finally, all salient features are concatenated together to form the cross-scale feature:

$$F_{cs}^{t} = \mathrm{Cat}\left(S_{1}^{t}, \mathrm{Up}_{2}\left(S_{2}^{t}\right), \mathrm{Up}_{4}\left(S_{4}^{t}\right)\right) \tag{4}$$

where $\mathrm{Up}_{f}(\cdot)$ denotes the upsampling operator with factor $f$.

After one convolutional layer followed by ReLU activation, $F_{cs}^{t}$ is passed into a sub-network $\mathrm{Res}(\cdot)$, which contains two Resblocks and a convolutional layer, to obtain the estimate of the rain streak layer:

$$r^{t} = \mathrm{Res}\left(\mathrm{Conv}\left(F_{cs}^{t}\right)\right) \tag{5}$$

The deraining result of the $t$-th stage is then

$$x^{t} = y - r^{t} \tag{6}$$
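Continuing the sketch, the aggregation and rain estimation of Eqs. (3)-(6) could look as follows. `aggregate_and_derain` and `tail` are hypothetical names, and the subtraction in Eq. (6) assumes the additive rain model stated in Section II-A.

```python
import torch
import torch.nn.functional as F

def aggregate_and_derain(y, feats, masks, tail):
    """Sketch of Eqs. (3)-(6). `feats`/`masks` are the three per-scale
    feature maps and SIAN masks; `tail` stands in for the conv + ReLU +
    two Resblocks + conv sub-network that outputs a 3-channel rain layer."""
    # Eq. (3): gate each scale's features by its attention mask.
    salient = [f * m for f, m in zip(feats, masks)]
    # Eq. (4): upsample coarse scales to full resolution and concatenate.
    size = salient[0].shape[-2:]
    fused = torch.cat(
        [salient[0]] + [F.interpolate(s, size=size, mode='bilinear',
                                      align_corners=False)
                        for s in salient[1:]], dim=1)
    r = tail(fused)   # Eq. (5): estimated rain streak layer
    x = y - r         # Eq. (6): derained result (assumed additive model)
    return x, r
```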

III-C Scale-space Invariant Attention Network

Fig. 4: Scale-space Invariant Attention Network

In this subsection, we elaborate on how we design a powerful attention mechanism that can reveal latent information from features captured at different scales. The proposed scale-space invariant attention mechanism is inspired by the classical SIFT feature extractor [15]. SIFT achieved great success in feature extraction and description before the rise of deep learning; it is invariant to image scale and rotation, and partly invariant to affine distortion, addition of noise, and changes in illumination. These properties make the underlying wisdom of SIFT particularly enlightening for the task of deraining, in which the rain streaks also exhibit multiple scales, various rotations and affine transformations, and the rain images also suffer from noise and low-contrast artifacts.

SIFT has four major stages of computation: 1) scale-space extrema detection, 2) keypoint localization, 3) orientation assignment, and 4) keypoint description. For the design of our attention network, we only care about the first two stages. A keypoint, i.e., an extremum marking a salient change in scale-space, naturally defines the attention. To detect extrema, a scale-space is first constructed in CNN feature space. As illustrated in Fig. 4, the SIAN architecture includes three octaves, each of which corresponds to a scale-space with smoothness levels indexed by a smoothing parameter $\sigma$ and a scaling factor $k$ between adjacent levels.

Each octave includes six layers. We denote the first layer in octave $O_{o}$ as $L_{1}^{(o)}$. The $i$-th layer in $O_{o}$ is then defined as

$$L_{i}^{(o)} = G\left(k^{\,i-1}\sigma\right) * L_{1}^{(o)}, \quad i = 2, \ldots, 6 \tag{7}$$

where $*$ denotes convolution and $G(\sigma)$ is the Gaussian kernel

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2}+y^{2}}{2\sigma^{2}}\right) \tag{8}$$

Here $\sigma$ is set to 1.6 in our practical implementation.

For the first octave $O_{1}$, the first layer is the Gaussian-smoothed version of the feature $F^{t}$ defined in Eq. (1):

$$L_{1}^{(1)} = G\left(\sigma_{0}\right) * F^{t} \tag{9}$$

where $\sigma_{0}$ is set to 1.52 in our implementation. For the remaining two octaves, the first layer is the pooled version of the third-from-last layer of the previous octave. Formally, the first layer of octave $O_{2}$ is

$$L_{1}^{(2)} = \mathrm{MaxPool}\left(L_{4}^{(1)}\right) \tag{10}$$

and the first layer of octave $O_{3}$ is

$$L_{1}^{(3)} = \mathrm{MaxPool}\left(L_{4}^{(2)}\right) \tag{11}$$

Note that the pooling operation we use here is max pooling, which is beneficial to the subsequent local extrema detection process.

Fig. 5: Progressive Refinement by LSTM

The derived scale-space pyramid is coupled with the multi-scale pyramid in feature extraction. Given the octaves, the difference-of-Gaussian (DoG) is created by taking the difference between adjacent layers:

$$D_{i}^{(o)} = L_{i+1}^{(o)} - L_{i}^{(o)}, \quad i = 1, \ldots, 5 \tag{12}$$

According to the DoG pyramid, we then detect the local extrema, i.e., the salient features. We no longer do this by comparing a point against its 26 neighbors in the spatial and scale domains, as done in SIFT, but turn to a learning approach. Specifically, we concatenate all the DoG layers in octave $O_{o}$, pass them through a convolutional layer followed by a ReLU activation, and finally apply the sigmoid function to obtain the attention mask $M_{s}^{t}$.
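To make the construction concrete, below is a minimal PyTorch sketch of one SIAN octave, covering the Gaussian kernel of Eq. (8), the intra-octave blurring of Eq. (7) and the DoG of Eq. (12); the initial blur of Eq. (9) and the inter-octave max pooling of Eqs. (10)-(11) are omitted for brevity. The channel width, the 7x7 kernel window, the per-layer scale step $k = 2^{1/3}$ (a natural choice for six layers per octave, but our assumption) and the names `gaussian_kernel` / `OctaveAttention` are illustrative, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kernel(sigma, size=7):
    """Eq. (8): 2-D isotropic Gaussian, normalized to sum to one.
    The window is fixed at 7x7 for simplicity; wide blurs are truncated."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing='ij')
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

class OctaveAttention(nn.Module):
    """Sketch of one SIAN octave: six Gaussian-blurred layers (Eq. (7)),
    five DoG maps (Eq. (12)), then conv + ReLU + sigmoid for the mask."""
    def __init__(self, channels=32, layers=6, sigma=1.6):
        super().__init__()
        self.layers, self.sigma = layers, sigma
        self.conv = nn.Conv2d(channels * (layers - 1), channels, 3, padding=1)

    def forward(self, feat):
        c = feat.shape[1]
        pyramid = [feat]
        for i in range(1, self.layers):
            s = (2 ** (i / 3)) * self.sigma   # assumed scale step k = 2^(1/3)
            g = gaussian_kernel(s)[None, None].repeat(c, 1, 1, 1).to(feat.device)
            pyramid.append(F.conv2d(feat, g, padding=3, groups=c))  # Eq. (7)
        dog = torch.cat([pyramid[i + 1] - pyramid[i]                 # Eq. (12)
                         for i in range(self.layers - 1)], dim=1)
        # conv + ReLU + sigmoid, as described in the text.
        return torch.sigmoid(F.relu(self.conv(dog)))
```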

(a) Rainy (21.2dB/0.727)
(b) Groundtruth
(c) Stage=1 (19.5dB/0.704)
(d) Stage=2 (21.1dB/0.765)
(e) Stage=3 (22.0dB/0.787)
(f) Stage=4 (22.5dB/0.800)
(g) Stage=5 (22.8dB/0.804)
(h) Stage=6 (23.0dB/0.807)
Fig. 6: Illustration of progressive refinement. Subjective and objective (PSNR/SSIM) results of the six stages are provided; the quality of the derained image improves gradually stage by stage.
1: Input: rainy images {y_n}; corresponding ground truths {x_n^gt}; network with initial parameters Θ; initial learning rate η
2: Output: network parameters Θ*
3: for epoch = 1 : num_epochs do
4:     Pick up a training batch set {y_n, x_n^gt}.
5:     for n = 1 : N do
6:         x_n^0 ← y_n
7:         for t = 1 : 6 do
8:             F^t ← Conv(Cat(y_n, x_n^{t-1}));  (Eq. 1)
9:             F_s^t ← Conv(Pool_s(F^t)), s ∈ {1, 2, 4};  (Eq. 2)
10:            Generate attention masks M_s^t by SIAN;
11:            S_s^t ← M_s^t ⊙ F_s^t;  (Eq. 3)
12:            Update the features by passing them through ConvLSTM;
13:            F_cs^t ← Cat(S_1^t, Up_2(S_2^t), Up_4(S_4^t));  (Eq. 4)
14:            r_n^t ← Res(Conv(F_cs^t));  (Eq. 5)
15:            x_n^t ← y_n − r_n^t;  (Eq. 6)
16:            Accumulate the negative SSIM loss −SSIM(x_n^t, x_n^gt);  (Eq. 13)
17:        end for
18:        Update Θ by ADAM with learning rate η;  (Eq. 15)
19:    end for
20: end for
21: return Θ*.
Algorithm 1: Network Training Flow.

III-D Progressive Refinement by LSTM

To further improve the network performance, similar to PReNet [17], we connect adjacent stages with convolutional LSTM (ConvLSTM) units that propagate information from the previous stage [24]. LSTM [5] is good at handling time-sequence data. An LSTM has three gates, namely the input gate, the forget gate and the output gate, together with a cell state. The key component in ConvLSTM is the cell state, which encodes the state information propagated to the next LSTM. In our work, as illustrated in Fig. 5, after the salient features are obtained according to Eq. (3), they are updated by passing through two convolutional layers and serve as the input of the next ConvLSTM block. In Fig. 6, we illustrate the effect of progressive refinement by LSTM: the quality of the derained image improves as the stage number increases.
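For reference, here is a generic ConvLSTM cell in the spirit of [24]; the single-conv gate layout, the 3x3 kernel and the shared-cell usage pattern across stages are our assumptions rather than the authors' exact block.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell: the convolutional analogue of LSTM gating."""
    def __init__(self, channels):
        super().__init__()
        # One conv produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, 3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)  # cell state
        h = torch.sigmoid(o) * torch.tanh(c)                          # hidden state
        return h, (h, c)

# Usage across the six stages with shared parameters: (h, c) carry state
# forward, which is what enables the progressive refinement.
#   h = c = torch.zeros(n, channels, H, W)
#   for t in range(6):
#       feat, (h, c) = cell(feat, (h, c))
```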

Datasets        Samples (Train/Test)   Description
Rain12 [12]     -/12                   Only for testing
Rain100L [25]   200/100                Synthesized with one type of rain streaks (light rain case)
Rain100H [25]   1,800/100              Synthesized with five types of rain streaks (heavy rain case)
Rain1400 [3]    12,600/1,400           1,000 clean images used to synthesize 14,000 rainy images
TABLE I: Benchmark datasets used for network training and performance evaluation
Loss            Rain100L (PSNR/SSIM)   Rain100H (PSNR/SSIM)
MAE             38.54/0.981            30.12/0.904
MSE             38.52/0.981            30.03/0.902
Negative SSIM   38.80/0.984            30.33/0.909
TABLE II: Performance comparison of different loss functions

III-E Network Training

The network parameters consist of the kernel weights of the convolutional layers. Training is conducted on several public benchmark datasets, summarized in Table I; these datasets are synthesized, so many pairs of rainy images and corresponding ground truths are available.

For each image, we compute the accumulated negative SSIM loss [22] over the outputs of all stages. As shown in Table II, the negative SSIM loss works better than the popular MAE and MSE losses. Traversing all training samples, the final training loss is

$$\mathcal{L}(\Theta) = -\sum_{n=1}^{N} \sum_{t=1}^{T} \mathrm{SSIM}\left(x_{n}^{t}, x_{n}^{gt}\right) \tag{13}$$

where $N$ is the number of training samples and $T$ is the number of stages.
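A minimal sketch of this objective, assuming a differentiable SSIM is available (here from the third-party pytorch-msssim package) and uniform weighting across stages:

```python
from pytorch_msssim import ssim  # third-party differentiable SSIM

def deraining_loss(stage_outputs, gt):
    """Eq. (13) for one batch: negative SSIM between every stage output
    x^1 ... x^T and the ground truth, summed over the T stages."""
    return sum(-ssim(x, gt, data_range=1.0) for x in stage_outputs)
```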

The optimal parameters are obtained by

$$\Theta^{*} = \arg\min_{\Theta} \mathcal{L}(\Theta) \tag{14}$$

This minimization problem can be addressed by ADAM [9]:

$$\Theta \leftarrow \Theta - \eta \cdot \mathrm{Adam}\left(\nabla_{\Theta}\mathcal{L}(\Theta)\right) \tag{15}$$

where $\eta$ is the learning rate and $\mathrm{Adam}(\cdot)$ denotes the update direction computed by the ADAM rule. The whole network training flow is summarized in Algorithm 1.

Our network is trained on an NVIDIA GTX 1080Ti. The same image patch size is used for all datasets, and the batch size is set to 18. We extract image patches with strides of 40, 80 and 100 for Rain100L, Rain100H and Rain1400, respectively. For Rain100L, we perform data augmentation by horizontal flipping. The numbers of training epochs for Rain100L, Rain100H and Rain1400 are set to 100, 100 and 50, respectively. The learning rate $\eta$ is set to 2e-4.
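For illustration, a toy version of this training loop with a stand-in single-stage model; only the stated hyper-parameters (batch size 18, learning rate 2e-4, ADAM) are taken from the text, and everything else (the model, data, loop length) is a placeholder.

```python
import torch
import torch.nn as nn
from pytorch_msssim import ssim  # third-party differentiable SSIM

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the network of Fig. 1
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # lr from the text

y  = torch.rand(18, 3, 64, 64)   # batch size 18, as stated above
gt = torch.rand(18, 3, 64, 64)   # paired ground-truth patches

for step in range(3):            # toy loop; real training runs 50-100 epochs
    outputs = [model(y)]         # stage outputs (a single stage here)
    loss = sum(-ssim(x, gt, data_range=1.0) for x in outputs)  # Eq. (13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```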

IV Experiments

In this section, extensive quantitative and qualitative results are provided to demonstrate the superior performance of the proposed method. An ablation study is also provided to promote a deeper understanding of our network.

IV-A Evaluation on Synthetic Datasets

IV-A1 Comparison with the state-of-the-arts

Our method is comprehensively compared with state-of-the-art model-based and deep learning based works on the synthetic benchmark datasets shown in Table I. The comparison group includes:

(a) Input (21.17dB/0.727)
(b) JCAS (24.33dB/0.809)
(c) LPNet (24.60dB/0.876)
(d) JORDER_E (42.49dB/0.988)
(e) PReNet (40.23dB/0.987)
(f) RESCAN (40.98dB/0.987)
(g) Ours (42.93dB/0.991)
(h) Groundtruth
Fig. 7: Visual deraining results on a sample from Rain100L.
(a) Input (13.55dB/0.482)
(b) JCAS (15.62dB/0.570)
(c) LPNet (28.03dB/0.916)
(d) JORDER_E (24.95dB/0.878)
(e) PReNet (24.39dB/0.863)
(f) RESCAN (23.31dB/0.862)
(g) Ours (28.03dB/0.916)
(h) Groundtruth
Fig. 8: Visual deraining results on a sample from Rain100H.
  • Model-based: 1) Discriminative Sparse Coding, DSC [16]; 2) Gaussian Mixture Model, GMM [12]; 3) Joint Convolutional Analysis and Synthesis Sparse Representation, JCAS [4];

  • Deep Learning based: 1) Clear [2]; 2) Deep Detail Network, DDN [3]; 3) Recurrent Squeeze-and-Excitation Context Aggregation Net, RESCAN [11]; 4) Progressive Recurrent Network, PReNet [17]; 5) Spatial Attentive Network, SPANet [21]; 6) Enhanced JOint Rain DEtection and Removal, JORDER_E [26]; 7) Semi-supervised Image Rain Removal, SIRR [23]; 8) Lightweight Pyramid Networks, LPNet [1].

We follow the same experimental settings as introduced in [20, 27]. Peak signal-to-noise ratio (PSNR) and SSIM are used for quantitative performance evaluation. We only consider the luminance channel, since it has the most significant impact on the human visual system when evaluating image quality. We adopt the numerical results reported in [20].
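A small sketch of this evaluation protocol using scikit-image; the BT.601 YCbCr conversion and its data range are our assumptions about the exact luminance definition used.

```python
from skimage.color import rgb2ycbcr
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_luminance(derained, gt):
    """PSNR/SSIM on the Y channel of float RGB images in [0, 1]."""
    y1 = rgb2ycbcr(derained)[..., 0]   # Y in [16, 235] (BT.601)
    y2 = rgb2ycbcr(gt)[..., 0]
    psnr = peak_signal_noise_ratio(y2, y1, data_range=219.0)
    ssim = structural_similarity(y2, y1, data_range=219.0)
    return psnr, ssim
```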

The quantitative evaluation results on Rain100L and Rain100H, and on Rain1400 and Rain12, are presented in Table III and Table IV, respectively. Compared with deep learning based methods, the three model-based methods (DSC, GMM and JCAS) achieve relatively low PSNR and SSIM values due to their limited modeling ability. Deep learning based methods achieve great success in single image deraining; for instance, compared with the best-performing model-based method, GMM, JORDER_E improves the PSNR by over 9dB on Rain100L. Among all compared data-driven methods, our proposed scheme achieves the best PSNR and SSIM performance on all datasets. The PSNR gains over the second-best method are 0.19dB, 0.22dB, 0.12dB and 0.55dB on Rain100L, Rain100H, Rain1400 and Rain12, respectively. These results demonstrate the superiority of our work.

We also provide a qualitative evaluation through visual quality comparison. The example images cover various scenarios, including light rain streaks, large rain streaks and dense rain accumulation. As illustrated in Fig. 7-Fig. 9, the model-based method JCAS cannot remove the rain streaks well; in its deraining results, most rain streaks remain. LPNet, which builds a coarse-to-fine pyramid in the pixel domain to exploit multi-scale correlation, cannot preserve image structures well; as shown in Fig. 7, it fails to remove heavy rain streaks. PReNet employs a progressive refinement strategy similar to ours, yet as shown in Fig. 8 it leads to an over-smoothing effect in the building regions. JORDER_E performs second best in the quantitative evaluation; in the subjective evaluation, rain traces remain in the sky region of Fig. 7, and it suffers from over-smoothing, like PReNet, in Fig. 8.

Datasets                    Rain100L         Rain100H
Metrics                     PSNR    SSIM     PSNR    SSIM
Input                       26.90   0.838    13.56   0.371
DSC [16] (ICCV'15)          27.34   0.849    13.77   0.320
GMM [12] (CVPR'16)          29.05   0.872    15.23   0.450
JCAS [4] (ICCV'17)          28.54   0.852    14.62   0.451
Clear [2] (TIP'17)          30.24   0.934    15.33   0.742
DDN [3] (CVPR'17)           32.38   0.926    22.85   0.725
RESCAN [11] (ECCV'18)       38.52   0.981    29.62   0.872
PReNet [17] (CVPR'19)       37.45   0.979    -       0.905
SPANet [21] (CVPR'19)       34.46   0.962    25.11   0.833
JORDER_E [26] (TPAMI'19)    -       -        30.04   -
SIRR [23] (CVPR'19)         32.37   0.926    22.47   0.716
LPNet [1] (TNNLS'20)        33.40   0.960    23.40   0.820
Ours                        38.80   0.984    30.33   0.909
TABLE III: The quantitative evaluation results with respect to PSNR (dB)/SSIM on Rain100L and Rain100H. The best and second-best results are highlighted by bold and underline.
Datasets                    Rain1400         Rain12
Metrics                     PSNR    SSIM     PSNR    SSIM
Input                       25.24   0.810    30.14   0.856
DSC [16] (ICCV'15)          27.88   0.839    30.07   0.866
GMM [12] (CVPR'16)          27.78   0.859    32.14   0.916
JCAS [4] (ICCV'17)          26.20   0.847    33.10   0.931
Clear [2] (TIP'17)          26.21   0.895    31.24   0.935
DDN [3] (CVPR'17)           28.45   0.889    34.04   0.933
RESCAN [11] (ECCV'18)       32.03   0.931    36.43   0.952
PReNet [17] (CVPR'19)       32.55   -        36.66   0.961
SPANet [21] (CVPR'19)       29.76   0.908    34.63   0.943
JORDER_E [26] (TPAMI'19)    -       0.943    -       -
SIRR [23] (CVPR'19)         28.44   0.889    34.02   0.935
LPNet [1] (TNNLS'20)        -       -        34.7    0.95
Ours                        32.80   0.946    37.24   0.967
TABLE IV: The quantitative evaluation results with respect to PSNR (dB)/SSIM on Rain1400 and Rain12. The best and second-best results are highlighted by bold and underline.
(a) Input (21.59dB/0.771)
(b) PReNet (29.76dB/0.908)
(c) Ours (29.93dB/0.908)
(d) Groundtruth
(e) Input (28.48dB/0.798)
(f) PReNet (35.22dB/0.936)
(g) Ours (36.08dB/0.948)
(h) Groundtruth
Fig. 9: Visual deraining results on samples from Rain1400 and Rain12.
(a) Input
(b) PReNet
(c) JORDER_E
(d) Ours
(e) Input
(f) PReNet
(g) JORDER_E
(h) Ours
Fig. 10: Visual deraining results on real rainy images.

IV-B Evaluation on Real Rainy Scenes

We further investigate the performance of our method on real rainy scenes. In Fig. 10, we show subjective comparisons with PReNet and JORDER_E on two real rainy images. For the first image, ours and JORDER_E achieve better results than PReNet, which generates large over-smoothed regions on the roof. For the second image, the result of our scheme is clearer than those of the other two methods.

IV-C Ablation Study

The main modules of our scheme include multi-scale feature extraction, the scale-space invariant attention network (SIAN), and LSTM-based progressive refinement. In this subsection, we provide an ablation analysis to show the contribution of these modules to the final performance. We define the ablation groups as follows:

  • Baseline: only the first stage of Fig. 1 is used, and the attention masks are fixed to all ones, i.e., no attention mechanism is employed;

  • Baseline+LSTM: all stages of Fig. 1 are used, again with the attention masks fixed to all ones;

  • Baseline+SIAN+LSTM: the complete form of Fig. 1.

As shown in Table V, Baseline+LSTM works better than Baseline, which demonstrates that the progressive refinement strategy is helpful. Compared with Baseline+LSTM, Baseline+SIAN+LSTM further improves the PSNR and SSIM performance, which demonstrates that the proposed attention network indeed improves the modeling ability of the network.

Method               Rain100L (PSNR/SSIM)   Rain100H (PSNR/SSIM)
Baseline             37.92/0.980            28.55/0.893
Baseline+LSTM        38.68/0.982            30.20/0.906
Baseline+SIAN+LSTM   38.80/0.984            30.33/0.909
TABLE V: Ablation study on the contribution of each module to the final performance

V Conclusion

In this work, we presented a novel single image deraining scheme based on scale-aware deep neural networks. To aggregate features from multiple scales into our rain streak prediction, we developed a new scale-space invariant attention mechanism that learns a set of importance masks, one for each scale. Experimental results show that our proposed method achieves state-of-the-art performance in both quantitative and qualitative evaluations.

References

  • [1] X. Fu, B. Liang, Y. Huang, X. Ding, and J. Paisley. Lightweight pyramid networks for image deraining. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14, 2019.
  • [2] Xueyang Fu, Jiabin Huang, Xinghao Ding, Yinghao Liao, and John Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE Transactions on Image Processing, 26(6):2944–2956, 2017.
  • [3] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3855–3863, 2017.
  • [4] Shuhang Gu, Deyu Meng, Wangmeng Zuo, and Lei Zhang. Joint convolutional analysis and synthesis sparse representation for single image layer separation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1708–1716, 2017.
  • [5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [6] Xiaowei Hu, Chi-Wing Fu, Lei Zhu, and Pheng-Ann Heng. Depth-attentional features for single-image rain removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8022–8031, 2019.
  • [7] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
  • [8] Kui Jiang, Zhongyuan Wang, Peng Yi, Chen Chen, Baojin Huang, Yimin Luo, Jiayi Ma, and Junjun Jiang. Multi-scale progressive fusion network for single image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv, 2014.
  • [10] Minghan Li, Xiangyong Cao, Qian Zhao, Lei Zhang, Chenqiang Gao, and Deyu Meng. Video rain/snow removal by transformed online multiscale convolutional sparse coding. arXiv, 2019.
  • [11] Xia Li, Jianlong Wu, Zhouchen Lin, Hong Liu, and Hongbin Zha. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 254–269, 2018.
  • [12] Yu Li, Robby T Tan, Xiaojie Guo, Jiangbo Lu, and Michael S Brown. Rain streak removal using layer priors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2736–2744, 2016.
  • [13] Tony Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, USA, 1994.
  • [14] Jiaying Liu, Wenhan Yang, Shuai Yang, and Zongming Guo. Erase or fill? deep joint recurrent rain removal and reconstruction in videos. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • [15] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • [16] Yu Luo, Yong Xu, and Hui Ji. Removing rain from a single image via discriminative sparse coding. In Proceedings of the IEEE International Conference on Computer Vision, pages 3397–3405, 2015.
  • [17] Dongwei Ren, Wangmeng Zuo, Qinghua Hu, Pengfei Zhu, and Deyu Meng. Progressive image deraining networks: a better and simpler baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3937–3946, 2019.
  • [18] Julius T Tou and Rafael C Gonzalez. Pattern recognition principles. 1974.
  • [19] Paul Viola, Michael Jones, et al. Robust real-time object detection. International journal of computer vision, 4(34-47):4, 2001.
  • [20] Hong Wang, Yichen Wu, Minghan Li, Qian Zhao, and Deyu Meng. A survey on rain removal from video and single image. arXiv preprint arXiv:1909.08326, pages 1–8, 2019.
  • [21] Tianyu Wang, Xin Yang, Ke Xu, Shaozhe Chen, Qiang Zhang, and Rynson WH Lau. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12270–12279, 2019.
  • [22] Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Trans Image Process, 13(4), 2004.
  • [23] Wei Wei, Deyu Meng, Qian Zhao, Zongben Xu, and Ying Wu. Semi-supervised transfer learning for image rain removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3877–3886, 2019.
  • [24] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
  • [25] Wenhan Yang, Robby T Tan, Jiashi Feng, Jiaying Liu, Zongming Guo, and Shuicheng Yan. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1357–1366, 2017.
  • [26] Wenhan Yang, Robby T Tan, Jiashi Feng, Jiaying Liu, Shuicheng Yan, and Zongming Guo. Joint rain detection and removal from a single image with contextualized deep networks. IEEE transactions on pattern analysis and machine intelligence, 2019.
  • [27] Wenhan Yang, Robby T Tan, Shiqi Wang, Yuming Fang, and Jiaying Liu. Single image deraining: From model-based to data-driven and beyond. arXiv preprint arXiv:1912.07150, pages 1–8, 2019.
  • [28] He Zhang and Vishal M Patel. Density-aware single image de-raining using a multi-stream dense network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 695–704, 2018.