Single image deraining is a crucial problem because rain occludes the background scene, appears in varying locations, and degrades the performance of computer vision tasks such as outdoor surveillance systems and intelligent vehicles. Rain streaks can be seen as linear noise that varies in size, direction, and density. Moreover, in real cases, when rain accumulation is dense, individual streaks cannot be observed clearly. Accumulated rain streaks reduce visibility in a manner more similar to fog and create a haze-like phenomenon in the background. This foggy phenomenon can be described by the haze model , and the whole model is written as
\[ O = T \odot \Big( B + \sum_{i=1}^{s} S_i \Big) + (1 - T) \odot A, \tag{1} \]
where $O$ denotes the observed rain streak image, $B$ the clear image, $S_i$ a rain streak layer with the same direction, $s$ the maximum number of layers, $A$ the global atmospheric light, $T$ the atmospheric transmission, and $\odot$ denotes element-wise multiplication.
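A toy NumPy rendering of Eq. (1), with all shapes and values hypothetical, makes the composition of streak and haze terms explicit:

```python
import numpy as np

# Toy synthesis of Eq. (1); all shapes and values are hypothetical.
rng = np.random.default_rng(0)
H, W = 64, 64
B = rng.random((H, W))                            # clear background B
S = [0.2 * rng.random((H, W)) for _ in range(3)]  # rain streak layers S_i
T = np.full((H, W), 0.7)                          # atmospheric transmission T
A = 0.9                                           # global atmospheric light A

# Observed rainy image: streaks added to the background, then attenuated
# by the transmission and mixed with the atmospheric light (haze term).
O = T * (B + sum(S)) + (1.0 - T) * A
```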
Conventional methods rely on handcrafted priors to separate rain streaks from the background. However, these methods tend to introduce undesirable artifacts, since their handcrafted priors, derived from human observations, do not always hold in diverse real-world images. Instead of applying handcrafted visual priors, deep-learning-based methods [6, 14, 21, 3] have recently been proposed; these methods are usually more accurate than conventional prior-based methods, with significant performance gains.
We also adopt neural networks for single image deraining, since the deraining model is a crude approximation and CNN-based models can learn and capture more detailed features from rainy images. Similar to other image restoration tasks such as image deblurring and image dehazing, image deraining can be modeled as an image-to-image mapping problem. Previous studies [21, 3, 19] show that low-level features (e.g., edges and frequencies) matter more than high-level features such as attributes and texture. To extract these low-level features, we apply the wavelet transform and propose the wavelet channel attention module (WCAM) with a fusion network. First, our network replaces down-sampling and up-sampling with the discrete wavelet transform (DWT) and the inverse discrete wavelet transform (IDWT). This lets the network capture features at various frequencies and exploit the bi-orthogonal property of the DWT for signal recovery. Second, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We therefore combine the DWT and channel attention so that intermediate feature maps at different frequencies are integrated effectively. Third, as demonstrated in [21, 15], fusing features at various levels is beneficial for many computer vision tasks, and the IDWT can be seen as fusing high-frequency and low-frequency sub-band images back into the original image. Inspired by this, our input is the four sub-band images from the DWT, and the output is four confidence maps that determine the importance of the different sub-band images and fuse them to reconstruct the clear image via the IDWT. In summary, our network adopts an encoder-decoder architecture: the proposed WCAMs replace convolutions in the encoder, and the corresponding inverse wavelet channel attention modules (IWCAMs) replace convolutions in the decoder.
This paper makes the following contributions: (i) We propose a novel end-to-end WCAM with a fusion network that captures frequency features and fuses four sub-band images for single image deraining. (ii) Several experiments show that the proposed network obtains much better performance than previous state-of-the-art deraining methods, even in the presence of large rain streaks and rain streak accumulation.
2 Related Work
2.1 Single image deraining
Single image deraining is an ill-posed problem and can be seen as a denoising problem. Early methods propose handcrafted priors to estimate the distribution of rain streaks and remove them. Chen and Hsu decompose the background and rain streak layers based on low-rank priors to restore clear images. Luo et al.  use sparse coding with high discriminability so that the derained image layer and the rain layer can be accurately separated. Patch-based priors, in the form of Gaussian mixture models, have also been applied to both the clean background and the rain layers to remove rain streaks.
Recently, trainable CNNs have been proposed to estimate the clean image directly from a rainy input. One line of work proposes a dual convolutional neural network for deraining, consisting of two parallel branches that respectively recover structures and details in an end-to-end manner. Zhang et al.  propose a density-aware multi-stream densely connected convolutional neural network that jointly estimates the rain density and the derained image. Other work introduces Gaussian-Laplacian image pyramid decomposition into the neural network and proposes a lightweight pyramid network for single image deraining. A two-stage network has also been proposed: rain streaks, transmission, and the atmospheric light are estimated in the first stage, and derained images are refined in the second stage by a conditional generative adversarial network. Different from these methods, our network operates in the wavelet space to capture detailed frequency information.
2.2 Attention mechanisms
Attention plays an important role in human perception and computer vision tasks. Attention mechanisms assign weights to feature maps so that features of selected regions or locations are magnified. Generally, there are two attention mechanisms: spatial attention and channel attention [17, 18]. Mnih et al. propose an attention model that spatially selects a sequence of regions to refine feature maps; the network not only performs well but is also robust to noisy inputs. Hu et al. propose the squeeze-and-excitation module, which uses global average-pooled features to compute channel-wise attention. Furthermore, Woo et al. combine spatial and channel attention into a convolutional block attention module. Their module sequentially infers attention maps along two separate dimensions, channel and spatial; the attention maps are then multiplied with the input feature map for adaptive feature refinement, which increases image recognition accuracy. In our work, we integrate channel attention with the wavelet transform so that output feature maps retain frequency features while being re-weighted per channel.
3 Proposed Methods
3.1 Wavelet Channel Attention module
We first describe the 2-D discrete wavelet transform (DWT) used in our model. We apply Haar wavelets , which consist of four kernels, $f_{LL} = f_L^\top f_L$, $f_{LH} = f_H^\top f_L$, $f_{HL} = f_L^\top f_H$, and $f_{HH} = f_H^\top f_H$, where $f_L$ and $f_H$ are the low-pass filter and high-pass filter, respectively. Both filters are
\[ f_L = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \end{bmatrix}, \qquad f_H = \frac{1}{\sqrt{2}} \begin{bmatrix} -1 & 1 \end{bmatrix}. \]
The DWT for image processing means an image is convolved with these kernels and then down-sampled to obtain the four sub-band images $I_{LL}$, $I_{LH}$, $I_{HL}$, and $I_{HH}$. The low-pass filter captures the smooth surface and texture, while the three high-pass filters extract vertical, horizontal, and diagonal edge-like information. Even though a down-sampling operation is employed, due to the bi-orthogonal property of the DWT, the original image can be accurately reconstructed by the inverse discrete wavelet transform (IDWT). Therefore, the DWT can be seen as four convolutional kernels with fixed weights and a stride of two. Similarly, the IDWT can be seen as a transposed convolution whose kernels are identical to the DWT's.
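As a concrete illustration, a minimal NumPy sketch (assuming the orthonormal 1/2 scaling of the Haar kernels) realizes the DWT as four fixed 2×2 convolutions with stride two and the IDWT as the matching transposed convolution; with this scaling, reconstruction is exact:

```python
import numpy as np

# Orthonormal 2-D Haar kernels: f_LL, f_LH, f_HL, f_HH (1/2 scaling assumed).
f_LL = 0.5 * np.array([[ 1,  1], [ 1,  1]], dtype=float)
f_LH = 0.5 * np.array([[-1, -1], [ 1,  1]], dtype=float)
f_HL = 0.5 * np.array([[-1,  1], [-1,  1]], dtype=float)
f_HH = 0.5 * np.array([[ 1, -1], [-1,  1]], dtype=float)

def dwt2_haar(img):
    """2-D Haar DWT as four fixed 2x2 convolutions with stride 2."""
    H, W = img.shape
    subs = []
    for k in (f_LL, f_LH, f_HL, f_HH):
        out = np.zeros((H // 2, W // 2))
        for i in range(0, H, 2):
            for j in range(0, W, 2):
                out[i // 2, j // 2] = np.sum(img[i:i + 2, j:j + 2] * k)
        subs.append(out)
    return subs  # [LL, LH, HL, HH]

def idwt2_haar(subs):
    """IDWT: transposed convolution with the same fixed kernels."""
    H, W = subs[0].shape[0] * 2, subs[0].shape[1] * 2
    img = np.zeros((H, W))
    for sub, k in zip(subs, (f_LL, f_LH, f_HL, f_HH)):
        for i in range(sub.shape[0]):
            for j in range(sub.shape[1]):
                img[2 * i:2 * i + 2, 2 * j:2 * j + 2] += sub[i, j] * k
    return img
```

Because the four kernels form an orthonormal basis of each 2×2 block, `idwt2_haar(dwt2_haar(x))` recovers `x` exactly.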
Then, we extend convolutions by combining the DWT with the channel attention mechanism and propose the wavelet channel attention module (WCAM). A WCAM is a computational unit built upon a transformation mapping an intermediate feature map $F \in \mathbb{R}^{H \times W \times C}$ to a feature map $F'' \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 4C}$. Given an intermediate feature map $F$, the DWT decomposes $F$ into the sub-bands $F_{LL}$, $F_{LH}$, $F_{HL}$, and $F_{HH}$. Convolutions and leaky rectified linear units (LeakyReLUs) are applied to extract various frequency features from the DWT outputs, denoted as $F' = \delta(\mathrm{DWT}(F))$, where $\delta$ is an operator combining the convolution and the LeakyReLU. Furthermore, we let feature maps at different frequencies contribute different weights to restoring the clear image: we use channel attention  to control the weights of the various channel-wise features. The proposed module calculates the global average pooling of $F'$ and uses convolutions to infer a channel attention map $M$; the entire result of the WCAM is $F'' = M \odot F'$. The detailed structure of the WCAM is depicted in Fig. 2(b) and formulated as follows:
\[ F' = \delta(\mathrm{DWT}(F)), \qquad M = \sigma(\mathrm{Conv}(\mathrm{AP}(F'))), \qquad F'' = M \odot F', \]
where $\mathrm{AP}$ denotes global average pooling and $\sigma$ the sigmoid function.
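The WCAM computation can be sketched in PyTorch as below; the 3×3 convolution width, LeakyReLU slope of 0.2, and the reduction ratio in the attention branch are illustrative assumptions, not the paper's exact hyper-parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WCAM(nn.Module):
    """Sketch of a wavelet channel attention module: fixed Haar DWT,
    convolution + LeakyReLU, then squeeze-and-excitation style channel
    attention. Kernel size, slope, and reduction ratio are assumptions."""

    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        haar = 0.5 * torch.tensor(
            [[[ 1.,  1.], [ 1.,  1.]],   # f_LL
             [[-1., -1.], [ 1.,  1.]],   # f_LH
             [[-1.,  1.], [-1.,  1.]],   # f_HL
             [[ 1., -1.], [-1.,  1.]]])  # f_HH
        self.register_buffer("haar", haar.unsqueeze(1))  # (4, 1, 2, 2)
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_ch, out_ch, 3, padding=1),
            nn.LeakyReLU(0.2))
        self.att = nn.Sequential(
            nn.Conv2d(out_ch, out_ch // reduction, 1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch // reduction, out_ch, 1),
            nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        # DWT: every input channel yields four half-resolution sub-bands.
        sub = F.conv2d(x.reshape(b * c, 1, h, w), self.haar, stride=2)
        sub = sub.reshape(b, 4 * c, h // 2, w // 2)
        feat = self.conv(sub)                            # F' = delta(DWT(F))
        gate = self.att(F.adaptive_avg_pool2d(feat, 1))  # M = sigma(Conv(AP(F')))
        return feat * gate                               # F'' = M ⊙ F'
```

For example, `WCAM(3, 16)` maps a `(B, 3, H, W)` tensor to `(B, 16, H/2, W/2)`, halving the spatial size while widening the channels.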
The WCAM reduces the size of the feature maps but increases the receptive field to capture multi-frequency and multi-scale features. To magnify the size of the feature maps, the inverse wavelet channel attention module (IWCAM) is similarly proposed. An IWCAM is also a computational unit built upon a transformation mapping an intermediate feature map $F \in \mathbb{R}^{H \times W \times C}$ to a feature map $F'' \in \mathbb{R}^{2H \times 2W \times \frac{C}{4}}$. Given an intermediate feature map $F$, the IDWT merges its four groups of sub-band channels into a single up-sampled map, and convolutions and LeakyReLUs are then adopted, denoted as $F' = \delta(\mathrm{IDWT}(F))$. The global average pooling of $F'$ is calculated and convolutions are used to infer a channel attention map $M$. The entire result of the IWCAM is $F'' = M \odot F'$. The structure of the IWCAM is depicted in Fig. 2(c) and formulated as follows:
\[ F' = \delta(\mathrm{IDWT}(F)), \qquad M = \sigma(\mathrm{Conv}(\mathrm{AP}(F'))), \qquad F'' = M \odot F'. \]
3.2 Network Architecture
Our network has an encoder-decoder structure. The encoder consists of three WCAMs. Each time a WCAM is applied, the spatial size of the feature maps is reduced to a quarter and the number of channels quadruples, which captures not only multi-scale features but also various frequency information. At the bottom of the network, a residual block  combining wavelet channel attention is used, as shown in Fig. 1(d). This module aggregates features and makes the learning process effective, especially for deeper networks. Our decoder consists of three IWCAMs to generate clear images from the extracted features. The final outputs are four confidence maps instead of the restored image. Once the confidence maps for the derived wavelet inputs are predicted, they are multiplied by the four derived inputs to give the final derained image:
\[ \hat{B} = \mathrm{IDWT}(C_{LL} \odot I_{LL},\; C_{LH} \odot I_{LH},\; C_{HL} \odot I_{HL},\; C_{HH} \odot I_{HH}), \]
where $C_{LL}$, $C_{LH}$, $C_{HL}$, and $C_{HH}$ are the confidence maps for $I_{LL}$, $I_{LH}$, $I_{HL}$, and $I_{HH}$, respectively. The reason for using the fusion mechanism is that the low-frequency sub-band mainly determines the objective quality, while the high-frequency sub-bands significantly affect the perceptual quality . When the low-frequency and high-frequency parts are optimized separately, objective quality can be improved without degrading perceptual quality. Additionally, like U-Net , we apply skip connections to combine feature maps of identical size from WCAMs and IWCAMs so that the learning process converges quickly. The entire network architecture is shown in Fig. 1(a).
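The fusion step can be sketched as follows for a single-channel image; this is a hedged illustration in which the confidence maps would come from the decoder, and the orthonormal Haar scaling is assumed, so that all-ones confidence maps reproduce the input exactly:

```python
import torch
import torch.nn.functional as F

# Fixed Haar kernels, shape (4, 1, 2, 2); the 1/2 scaling makes the
# transform orthonormal, so the IDWT is the matching transposed conv.
HAAR = 0.5 * torch.tensor(
    [[[ 1.,  1.], [ 1.,  1.]],
     [[-1., -1.], [ 1.,  1.]],
     [[-1.,  1.], [-1.,  1.]],
     [[ 1., -1.], [-1.,  1.]]]).unsqueeze(1)

def derain_output(conf_maps, rainy):
    """Weight the four DWT sub-bands of a single-channel rainy input by
    the predicted confidence maps, then merge them with the IDWT."""
    sub = F.conv2d(rainy, HAAR, stride=2)                # (B, 4, H/2, W/2)
    weighted = conf_maps * sub                           # C ⊙ I, per sub-band
    return F.conv_transpose2d(weighted, HAAR, stride=2)  # (B, 1, H, W)
```

With all-ones confidence maps, the input is reproduced exactly, which is a quick sanity check of the bi-orthogonal reconstruction.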
4 Experimental Results
4.1 Datasets and training details
In this work, the Outdoor-Rain dataset  is adopted to train and test the network. This dataset contains clear images and corresponding rainy images generated by Eq. (1). There are 7,500 training and validation samples and 1,500 samples for evaluation. During training, images are cropped into patches, the wavelet SSIM loss  and the L1 loss are employed, and RAdam
is used as the optimization algorithm with a mini-batch size of 16. The learning rate starts from 0.0001 and is divided by ten after 100 epochs. The models are trained for 300 epochs. All experiments are performed with the PyTorch framework.
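A minimal sketch of this training setup follows; the model is a stand-in convolution rather than the WCAM/IWCAM network, `torch.optim.RAdam` is assumed available (PyTorch ≥ 1.10), and the wavelet SSIM loss term is omitted:

```python
import torch
import torch.nn as nn

# Stand-in for the deraining network (the real model is the WCAM/IWCAM
# encoder-decoder). Optimizer and schedule follow the text: RAdam,
# lr 1e-4, divided by ten every 100 epochs.
model = nn.Conv2d(3, 3, 3, padding=1)
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
l1_loss = nn.L1Loss()

rainy = torch.rand(2, 3, 16, 16)   # stand-in mini-batch (paper uses size 16)
clean = torch.rand(2, 3, 16, 16)

loss = l1_loss(model(rainy), clean)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()                   # called once per epoch
```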
4.2 Image Deraining results
PSNR and SSIM are chosen as objective metrics for quantitative evaluation. We select four state-of-the-art works [14, 21, 3, 6] as deep-learning-based benchmarks, and for a fair comparison with our proposed method, all of them are retrained on the same dataset. The comparison results are shown in Table 1: our method attains the highest PSNR and SSIM values among all deraining networks, which demonstrates its superior ability to restore clean images on this dataset and shows that frequency features are beneficial for restoring rainy images. Furthermore, we run the various methods on synthetic and real rainy photos; the results are depicted in Fig. 2 and Fig. 3. As these figures reveal, the comparative methods tend to miscalculate the rain concentration, so their restored images are dark or retain rain and mist. In contrast, our proposed method produces better derained results with balanced colors and detailed edges.
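For reference, PSNR can be computed as below (a small NumPy helper; SSIM is omitted here since it is typically taken from an image-processing library):

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio (dB) between two images in [0, peak]."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```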
We analyze how the WCAM and the fusion mechanism help refine the derained results with three experiments. The first uses convolutions and wavelets without channel attention or fusion. The second uses the proposed modules without fusion. The third estimates feature maps to fuse the sub-band images without channel attention. Table 2 compares our method against these three baselines and demonstrates that wavelet channel attention and fusion together contribute the best results.
| Method | PSNR | SSIM |
| --- | --- | --- |
| Ours, w/o attention, w/o fusion | 21.44 | 0.784 |
| Ours, w/o fusion | 23.34 | 0.764 |
| Ours, w/o attention | 23.24 | 0.810 |
5 Conclusion
In this paper, the wavelet channel attention module with a fusion network is proposed for single image deraining. The wavelet transform and the inverse wavelet transform are substituted for down-sampling and up-sampling to extract various frequency features, and the channel attention effectively controls the ratios of the feature maps. Furthermore, the proposed network estimates confidence maps for each derived wavelet input; the confidence maps and derived inputs are fused to render the final derained results. Experiments on synthetic and real images verify the superiority of our model compared to state-of-the-art methods.
-  (2013) A generalized low-rank appearance model for spatio-temporally correlated rain streaks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1968–1975. Cited by: §1, §2.1.
Wavelet domain style transfer for an effective perception-distortion tradeoff in single image super-resolution. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §3.2.
-  (2019) Lightweight pyramid networks for image deraining. IEEE transactions on neural networks and learning systems. Cited by: §1, §1, §2.1, §4.2, Table 1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.
-  (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1, §2.2, §3.1.
-  (2019) Heavy rain image restoration: integrating physics model and conditional adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1633–1642. Cited by: §1, §1, §2.1, §4.1, §4.2, Table 1.
-  (2019-06) Single image deraining: a comprehensive benchmark analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2016) Rain streak removal using layer priors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2736–2744. Cited by: §2.1.
On the variance of the adaptive learning rate and beyond. In Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020), Cited by: §4.1.
-  (2015) Removing rain from a single image via discriminative sparse coding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3397–3405. Cited by: §1, §2.1.
-  (1999) A wavelet tour of signal processing. Elsevier. Cited by: §3.1.
-  (1976) Optics of the atmosphere: scattering by molecules and particles. New York: John Wiley and Sons. Cited by: §1.
-  (2014) Recurrent models of visual attention. In Advances in neural information processing systems, pp. 2204–2212. Cited by: §2.2.
-  (2018) Learning dual convolutional neural networks for low-level vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3070–3079. Cited by: §1, §1, §2.1, §4.2, Table 1.
-  (2018-06) Gated fusion network for single image dehazing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1, §3.2.
-  (2018) Cbam: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §2.2.
-  (2020) Characterizing speech adversarial examples using self-attention u-net enhancement. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3107–3111. Cited by: §2.2.
-  (2019) Wavelet u-net and the chromatic adaptation transform for single image dehazing. In IEEE International Conference on Image Processing (ICIP), pp. 2736–2740. Cited by: §1.
Y-net: multi-scale feature aggregation network with wavelet structure similarity loss function for single image dehazing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2628–2632. Cited by: §4.1.
-  (2018-06) Density-aware single image de-raining using a multi-stream dense network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2.1, §4.2, Table 1.