Wavelet Channel Attention Module with a Fusion Network for Single Image Deraining

07/17/2020, by Hao-Hsiang Yang, et al.

Single image deraining is a crucial problem because rain severely degrades the visibility of images and affects the performance of computer vision tasks like outdoor surveillance systems and intelligent vehicles. In this paper, we propose a new convolutional neural network (CNN) called the wavelet channel attention module with a fusion network. The wavelet transform and the inverse wavelet transform are substituted for down-sampling and up-sampling, so feature maps from the wavelet transform and convolutions contain different frequencies and scales. Furthermore, feature maps are integrated by channel attention. Our proposed network learns confidence maps of the four sub-band images derived from the wavelet transform of the original images. Finally, the clear image can be well restored via wavelet reconstruction and fusion of the low-frequency part and the high-frequency parts. Experimental results on synthetic and real images show that the proposed algorithm outperforms state-of-the-art methods.


1 Introduction

Single image deraining is a crucial problem because rain occludes the background scene, appears in varying locations, and degrades the performance of computer vision tasks like outdoor surveillance systems and intelligent vehicles [7]. Rain streaks can be seen as linear noise that varies in size, direction, and density. Moreover, in real cases, when rain accumulation is dense, individual streaks cannot be observed clearly. Accumulated rain streaks reduce visibility in a manner more similar to fog and create a haze-like phenomenon in the background. This foggy phenomenon can be described by the haze model [12], and the whole model [6] is written as

$$\mathbf{I} = \mathbf{T} \odot \Big( \mathbf{J} + \sum_{i=1}^{n} \mathbf{S}_i \Big) + (1 - \mathbf{T}) \odot \mathbf{A}, \tag{1}$$

where $\mathbf{I}$ denotes the observed rain streak image, $\mathbf{J}$ the clear image, $\mathbf{S}_i$ a rain streak layer with the same direction, $n$ the maximum number of layers, $\mathbf{A}$ the global atmospheric light, $\mathbf{T}$ the atmospheric transmission, and $\odot$ element-wise multiplication.
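To make the model concrete, the sketch below composes a rainy image according to Eq. (1), assuming NumPy. The streak layers $\mathbf{S}_i$ and the transmission $\mathbf{T}$ here are random stand-ins purely for illustration; actual synthetic data uses rendered streak masks and scene-dependent transmission.

```python
import numpy as np

def synthesize_rain(J, S, T, A):
    """Eq. (1): I = T . (J + sum_i S_i) + (1 - T) . A, all element-wise."""
    return T * (J + sum(S)) + (1.0 - T) * A

rng = np.random.default_rng(0)
J = rng.random((480, 720, 3))                              # clear image in [0, 1]
S = [0.1 * rng.random((480, 720, 1)) for _ in range(3)]    # n = 3 streak layers (stand-ins)
T = 0.4 + 0.6 * rng.random((480, 720, 1))                  # transmission, broadcast over RGB
I = synthesize_rain(J, S, T, A=0.9)                        # observed rainy image
```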

Several methods analyze visual priors to capture deterministic and statistical properties of rainy images [1, 10, 16]. However, these methods tend to introduce undesirable artifacts, since their handcrafted priors from human observations do not always hold in diverse real-world images. Instead of applying handcrafted visual priors, deep-learning-based methods [6, 14, 21, 3] have recently been proposed, and they are usually more accurate than conventional prior-based methods, with significant performance gains.

We also consider neural networks for single image deraining, since the deraining model of Eq. (1) is only a crude approximation and CNN-based models can learn and capture more detailed features from rainy images. Similar to other image restoration tasks like image deblurring [14] and image dehazing [19], image deraining can be modeled as an image-to-image mapping problem. Previous studies [21, 3, 19] suggest that low-level features (e.g., edges and frequencies) matter more than high-level features such as attributes and textures. To extract these low-level features, we apply the wavelet transform and propose the wavelet channel attention module (WCAM) with a fusion network. First, our network replaces down-sampling and up-sampling with the discrete wavelet transform (DWT) and the inverse discrete wavelet transform (IDWT), so it captures features at various frequencies and exploits the biorthogonal property of the DWT for signal recovery. Second, the channel attention module [5] selectively emphasizes interdependent channel maps by integrating associated features among all channel maps; we therefore combine the DWT with channel attention so that intermediate feature maps with different frequencies are integrated effectively. Third, as demonstrated in [21, 15], fusing various levels of features is beneficial for many computer vision tasks, and the DWT can be seen as decomposing the original image into low-frequency and high-frequency sub-images. Inspired by this, our input is the four sub-band images from the DWT of the original image, and the output is four confidence maps that determine the importance of the different sub-band images and fuse them to reconstruct the clear image via the IDWT. In summary, our network is an encoder-decoder architecture: the proposed WCAMs replace convolutions in the encoder, and the corresponding inverse wavelet channel attention modules (IWCAMs) replace convolutions in the decoder.

This paper makes the following contributions: (i) We propose a novel end-to-end WCAM with a fusion network that captures frequency features and fuses four sub-band images for single image deraining. (ii) Several experiments show that the proposed network obtains much better performance than previous state-of-the-art deraining methods, even in the presence of large rain streaks and rain streak accumulation.

2 Related Work

2.1 Single image deraining

Single image deraining is an ill-posed problem that can be seen as a denoising problem. Early methods propose handcrafted priors to estimate the distribution of rain streaks and remove them. In [1], Chen and Hsu decompose the background and rain streak layers based on low-rank priors to restore clear images. Luo et al. [10] use sparse coding with high discriminability so that the derained image layer and the rain layer can be accurately separated. In [8], patch-based priors in the form of Gaussian mixture models are applied to both the clean background and rain layers to remove rain streaks.

Recently, trainable CNNs have been proposed to estimate the clean image from a rainy input directly. In [14], a dual convolutional neural network for deraining is proposed; it consists of two parallel branches that recover the structures and the details, respectively, in an end-to-end manner. Zhang et al. [21] propose a density-aware multi-stream densely connected convolutional neural network that jointly estimates the rain density and the derained image. In [3], the authors introduce Gaussian-Laplacian image pyramid decomposition into the neural network and propose a lightweight pyramid network for single image deraining. In [6], the authors propose a two-stage network: rain streaks, the transmission, and the atmospheric light are estimated in the first stage, and derained images are refined in the second stage by a conditional generative adversarial network. Different from these methods, our network operates in the wavelet space to capture detailed frequency information.

2.2 Attention mechanisms

Attention plays an important role in human perception and computer vision tasks. Attention mechanisms assign weights to feature maps so that features of a sequence of regions or locations are magnified. Generally, there are two attention mechanisms: spatial attention and channel attention [17, 18]. Mnih et al. [13] propose an attention model that spatially selects a sequence of regions to refine feature maps; the network not only performs well but is also robust to noisy inputs. Hu et al. [5] propose the squeeze-and-excitation module, which uses globally average-pooled features to compute channel-wise attention. Furthermore, Woo et al. [17] combine spatial and channel attention into a convolutional block attention module. Their module sequentially infers attention maps along two separate dimensions, channel and spatial, and the attention maps are multiplied with the input feature map for adaptive feature refinement, which increases image recognition accuracy. In our work, we integrate channel attention with the wavelet transform so that the output feature maps contain frequency features magnified by different weights.

3 Proposed Methods

3.1 Wavelet Channel Attention Module

We first describe the 2-D discrete wavelet transform (DWT) used in our model. We apply the Haar wavelet [11], which yields four kernels, $\{LL^{T}, LH^{T}, HL^{T}, HH^{T}\}$, where $L$ and $H$ are the low-pass filter and the high-pass filter, respectively:

$$L^{T} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \end{bmatrix}, \qquad H^{T} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & -1 \end{bmatrix}. \tag{2}$$

The DWT for image processing means that an image is convolved with these kernels and then down-sampled to obtain the four sub-band images $\mathbf{I}_{LL}$, $\mathbf{I}_{LH}$, $\mathbf{I}_{HL}$, and $\mathbf{I}_{HH}$. The low-pass filter captures smooth surfaces and texture, while the three high-pass filters extract vertical, horizontal, and diagonal edge-like information. Even though a down-sampling operation is employed, the biorthogonal property of the DWT allows the original image to be accurately reconstructed by the inverse discrete wavelet transform (IDWT). Therefore, the DWT can be seen as four $2 \times 2$ convolutional kernels whose weights are fixed and whose stride equals two. Similarly, the IDWT can be seen as the transposed convolution whose kernels are identical to the DWT's.
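This fixed-kernel view translates directly into code. The following is a minimal sketch, assuming PyTorch: the four Haar kernels are applied as a stride-2 grouped convolution, and the IDWT is the matching transposed convolution, which reconstructs the input exactly.

```python
import torch
import torch.nn.functional as F

def haar_kernels(channels: int) -> torch.Tensor:
    """The four fixed 2x2 Haar kernels, replicated once per input channel."""
    k = 0.5 * torch.tensor([[[1., 1.], [1., 1.]],     # LL: smooth surface
                            [[1., 1.], [-1., -1.]],   # LH: horizontal edges
                            [[1., -1.], [1., -1.]],   # HL: vertical edges
                            [[1., -1.], [-1., 1.]]])  # HH: diagonal edges
    return k.unsqueeze(1).repeat(channels, 1, 1, 1)   # (4*channels, 1, 2, 2)

def dwt(x: torch.Tensor) -> torch.Tensor:
    """C channels in, 4C sub-band channels out, spatial size halved."""
    c = x.shape[1]
    return F.conv2d(x, haar_kernels(c).to(x), stride=2, groups=c)

def idwt(y: torch.Tensor) -> torch.Tensor:
    """Inverse transform: the transposed convolution with the same kernels."""
    c = y.shape[1] // 4
    return F.conv_transpose2d(y, haar_kernels(c).to(y), stride=2, groups=c)

x = torch.randn(1, 3, 64, 64)
assert torch.allclose(idwt(dwt(x)), x, atol=1e-6)     # exact reconstruction
```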

Then, we extend convolutions by combining the DWT with the channel attention mechanism and propose the wavelet channel attention module (WCAM). A WCAM is a computational unit built upon a transformation mapping an intermediate feature map $\mathbf{F} \in \mathbb{R}^{H \times W \times C}$ to a feature map $\mathbf{F}' \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 4C}$. Given an intermediate feature map $\mathbf{F}$, the DWT decomposes $\mathbf{F}$ into the sub-bands $\{\mathbf{F}_{LL}, \mathbf{F}_{LH}, \mathbf{F}_{HL}, \mathbf{F}_{HH}\}$. Convolutions and leaky rectified linear units (LeakyReLUs) are applied to extract various frequency features from the DWT output, denoted as $\mathbf{U} = \delta(\mathrm{DWT}(\mathbf{F}))$, where $\delta$ is an operator combining the convolution and the LeakyReLU. Furthermore, we let feature maps of different frequencies contribute different weights to restoring the clear image: we use channel attention [5] to control the weights of the various channel-wise features. The proposed module calculates the global average pooling of $\mathbf{U}$ and uses convolutions to infer a channel attention map. The detailed structure of the WCAM is depicted in Fig. 1(b) and formulated as follows:

$$\mathbf{F}' = \sigma\big(\mathrm{Conv}(\mathrm{AP}(\mathbf{U}))\big) \otimes \mathbf{U}, \qquad \mathbf{U} = \delta(\mathrm{DWT}(\mathbf{F})), \tag{3}$$

where $\mathrm{AP}$ denotes global average pooling, $\sigma$ the sigmoid function, and $\otimes$ channel-wise multiplication.
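A minimal sketch of a WCAM under Eq. (3) follows, reusing the `haar_kernels` helper from the sketch above. The $3 \times 3$ kernel size of the feature convolution and the reduction ratio of the attention branch are our assumptions; the paper does not state them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WCAM(nn.Module):
    """Wavelet channel attention module: C channels in, 4C out, size halved."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        out = 4 * channels                           # DWT: C -> 4C channels
        self.channels = channels
        self.register_buffer("haar", haar_kernels(channels))  # fixed DWT kernels
        self.feat = nn.Sequential(                   # U = delta(DWT(F))
            nn.Conv2d(out, out, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True))
        self.att = nn.Sequential(                    # sigma(Conv(AP(U)))
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out, out // reduction, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out // reduction, out, 1),
            nn.Sigmoid())

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        u = self.feat(F.conv2d(f, self.haar, stride=2, groups=self.channels))
        return self.att(u) * u                       # Eq. (3): channel-wise reweighting
```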

The WCAM reduces the size of the feature maps but increases the receptive field to capture multi-frequency and multi-scale features. To magnify the size of the feature map, the inverse wavelet channel attention module (IWCAM) is proposed analogously. An IWCAM is a computational unit built upon a transformation mapping an intermediate feature map $\mathbf{F} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 4C}$ to a feature map $\mathbf{F}' \in \mathbb{R}^{H \times W \times C}$. Given an intermediate feature map $\mathbf{F}$, the IDWT merges its sub-bands into a single map; convolutions and LeakyReLUs are then adopted, denoted as $\mathbf{V} = \delta(\mathrm{IDWT}(\mathbf{F}))$. The global average pooling of $\mathbf{V}$ is calculated, and convolutions are used to infer a channel attention map. The structure of the IWCAM is depicted in Fig. 1(c) and formulated as follows:

$$\mathbf{F}' = \sigma\big(\mathrm{Conv}(\mathrm{AP}(\mathbf{V}))\big) \otimes \mathbf{V}, \qquad \mathbf{V} = \delta(\mathrm{IDWT}(\mathbf{F})). \tag{4}$$
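The IWCAM mirrors the WCAM sketch above, swapping the fixed-kernel DWT convolution for its transposed (IDWT) counterpart; the same kernel-size and reduction-ratio assumptions apply.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IWCAM(nn.Module):
    """Inverse module: 4C channels in, C out, spatial size doubled."""
    def __init__(self, out_channels: int, reduction: int = 4):
        super().__init__()
        self.out_channels = out_channels
        self.register_buffer("haar", haar_kernels(out_channels))  # fixed IDWT kernels
        self.feat = nn.Sequential(                   # V = delta(IDWT(F))
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True))
        self.att = nn.Sequential(                    # sigma(Conv(AP(V)))
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, max(out_channels // reduction, 1), 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(max(out_channels // reduction, 1), out_channels, 1),
            nn.Sigmoid())

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        v = self.feat(F.conv_transpose2d(f, self.haar, stride=2,
                                         groups=self.out_channels))
        return self.att(v) * v                       # Eq. (4)
```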
Figure 1: The proposed WCAM with a fusion network and the components of the network. (a) The entire network, where green arrows denote skip connections. (b) WCAM. (c) IWCAM. (d) The residual WCAM.

3.2 Network Architecture

Our network is an encoder-decoder structure. The encoder consists of three WCAMs. Each time a WCAM is applied, the spatial size of the feature maps becomes a quarter and the number of channels becomes four times larger, which captures not only multi-scale features but also various frequency information. At the bottom of the network, a residual block [4] combined with wavelet channel attention is used, as shown in Fig. 1(d). This module aggregates features and makes the learning process effective, especially for deeper networks. Our decoder consists of three IWCAMs that generate clear images from the extracted features. The terminal outputs are four confidence maps instead of the restored image. Once the confidence maps for the derived wavelet inputs are predicted, they are multiplied by the four derived inputs to give the final derained image:

$$\hat{\mathbf{J}} = \mathrm{IDWT}\big(\mathbf{C}_{LL} \otimes \mathbf{I}_{LL},\ \mathbf{C}_{LH} \otimes \mathbf{I}_{LH},\ \mathbf{C}_{HL} \otimes \mathbf{I}_{HL},\ \mathbf{C}_{HH} \otimes \mathbf{I}_{HH}\big), \tag{5}$$

where $\mathbf{C}_{LL}$, $\mathbf{C}_{LH}$, $\mathbf{C}_{HL}$, and $\mathbf{C}_{HH}$ are the confidence maps for $\mathbf{I}_{LL}$, $\mathbf{I}_{LH}$, $\mathbf{I}_{HL}$, and $\mathbf{I}_{HH}$, respectively. The reason for using the fusion mechanism is that the low-frequency sub-band governs the objective quality, while the high-frequency sub-bands significantly affect the perceptual quality [2]. When the low-frequency and high-frequency parts are optimized separately, the objective quality can be increased without decreasing the perceptual quality. Additionally, like U-Net [16], we apply skip connections to combine identically sized feature maps from the WCAMs and IWCAMs so that the learning process converges quickly. The entire network architecture is shown in Fig. 1(a), and a sketch of the whole pipeline follows.
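The sketch below assembles the pieces, reusing `dwt`/`idwt`, `WCAM`, and `IWCAM` from the sketches in Section 3.1. The additive skip connections, the sigmoid on the confidence maps, and the omission of the residual-WCAM bottleneck are simplifications of ours; the paper does not specify these details at code level.

```python
import torch
import torch.nn as nn

class WCAMFusionNet(nn.Module):
    """Encoder-decoder sketch: three WCAMs, three IWCAMs, skip connections."""
    def __init__(self, c: int = 12):              # 4 Haar sub-bands x 3 colors
        super().__init__()
        self.e1, self.e2, self.e3 = WCAM(c), WCAM(4 * c), WCAM(16 * c)
        self.d1, self.d2, self.d3 = IWCAM(16 * c), IWCAM(4 * c), IWCAM(c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s1 = self.e1(x)                           # (4c,  H/2, W/2)
        s2 = self.e2(s1)                          # (16c, H/4, W/4)
        s3 = self.e3(s2)                          # (64c, H/8, W/8)
        # (the residual WCAM bottleneck of Fig. 1(d) would sit here)
        d = self.d1(s3) + s2                      # skip connection (additive here)
        d = self.d2(d) + s1
        return torch.sigmoid(self.d3(d))          # four confidence maps per color

def derain(rainy: torch.Tensor, net: WCAMFusionNet) -> torch.Tensor:
    subbands = dwt(rainy)                         # I_LL, I_LH, I_HL, I_HH per color
    conf = net(subbands)                          # C_LL, C_LH, C_HL, C_HH
    return idwt(conf * subbands)                  # Eq. (5)
```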

4 Experimental Results

Figure 2: Derained results on sample images from the synthetic Outdoor-Rain dataset. (Please zoom in on screen to view details.)
Figure 3: Derained results on real rainy images. (Please zoom in on screen to view details.)

4.1 Datasets and training details

In this work, the Outdoor-Rain dataset [6] is adopted to train and test the network. This dataset contains clear images and the corresponding rainy images generated by Eq. (1). There are 7,500 training and validation samples and 1,500 samples for evaluation, and the clean and rainy images share the same size. During training, images are cropped into patches, the wavelet SSIM loss [20] and the L1 loss are employed, and RAdam [9] is used as the optimization algorithm with a mini-batch size of 16. The learning rate starts from 0.0001 and is divided by ten after 100 epochs. The models are trained for 300 epochs. All experiments are implemented in the PyTorch framework.
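A sketch of this training configuration follows, assuming PyTorch >= 1.10 (which provides `torch.optim.RAdam`); `model`, `loader`, and `wavelet_ssim_loss` (the wavelet SSIM loss of [20]) are placeholders, not code from the paper.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)
# Divide the learning rate by ten after 100 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

for epoch in range(300):
    for rainy, clean in loader:                   # cropped pairs, mini-batch of 16
        derained = model(rainy)
        loss = F.l1_loss(derained, clean) + wavelet_ssim_loss(derained, clean)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```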

4.2 Image deraining results

PSNR and SSIM are chosen as the objective metrics for quantitative evaluation. We select four state-of-the-art works [14, 21, 3, 6] as deep-learning-based benchmarks for comparison with our proposed method; for a fair comparison, all methods are retrained on the same dataset. The results are shown in Table 1: our method attains the highest PSNR and SSIM values among all the deraining networks, which demonstrates that it restores clean images best on this dataset and that frequency features are beneficial for restoring rainy images. Furthermore, we run the various methods on synthetic and real rainy photos; the results are depicted in Fig. 2 and Fig. 3. As these figures reveal, the comparative methods tend to misestimate the rain concentration, so their restored images are dark or retain rain and mist. In contrast, our proposed method produces better derained results with balanced colors and detailed edges.
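For reference, PSNR/SSIM scores like those in Table 1 are typically computed as in the sketch below, assuming scikit-image (>= 0.19 for `channel_axis`); the paper does not state its evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(derained: np.ndarray, clean: np.ndarray):
    """Score one image pair; both inputs are H x W x 3 uint8 arrays."""
    psnr = peak_signal_noise_ratio(clean, derained, data_range=255)
    ssim = structural_similarity(clean, derained, channel_axis=-1, data_range=255)
    return psnr, ssim
```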

We analyze how the WCAM and the fusion refine the derained results with three ablation experiments. The first uses convolutions and wavelets without channel attention and fusion. The second uses the proposed modules without fusion. The third estimates feature maps to fuse the sub-band images without channel attention. Table 2 compares our method against these three baselines and demonstrates that wavelet channel attention and fusion together contribute the best results.

Method  [14]    [21]    [3]     [6]     Ours
PSNR    17.92   21.64   18.16   21.17   24.89
SSIM    0.676   0.788   0.723   0.742   0.813
Table 1: Quantitative PSNR and SSIM comparisons on the synthetic Outdoor-Rain dataset.
Configuration                       PSNR    SSIM
Ours, w/o attention, w/o fusion     21.44   0.784
Ours, w/o fusion                    23.34   0.764
Ours, w/o attention                 23.24   0.810
Ours                                24.89   0.813
Table 2: Ablation study showing how the WCAM and the fusion refine the derained results.

5 Conclusion

In this paper, the wavelet channel attention module with a fusion network is proposed for single image deraining. The wavelet transform and the inverse wavelet transform are substituted for down-sampling and up-sampling to extract various frequency features, and channel attention effectively controls the weights of the feature maps. Furthermore, the proposed network estimates a confidence map for each derived wavelet input; the confidence maps and derived inputs are fused to render the final derained results. Experiments on synthetic and real images verify the superiority of our model over state-of-the-art methods.

References

  • [1] Y. Chen and C. Hsu (2013) A generalized low-rank appearance model for spatio-temporally correlated rain streaks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1968–1975. Cited by: §1, §2.1.
  • [2] X. Deng, R. Yang, M. Xu, and P. L. Dragotti (2019) Wavelet domain style transfer for an effective perception-distortion tradeoff in single image super-resolution. In The IEEE International Conference on Computer Vision (ICCV). Cited by: §3.2.
  • [3] X. Fu, B. Liang, Y. Huang, X. Ding, and J. Paisley (2019) Lightweight pyramid networks for image deraining. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1, §1, §2.1, §4.2, Table 1.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.
  • [5] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1, §2.2, §3.1.
  • [6] R. Li, L. Cheong, and R. T. Tan (2019) Heavy rain image restoration: integrating physics model and conditional adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1633–1642. Cited by: §1, §1, §2.1, §4.1, §4.2, Table 1.
  • [7] S. Li, I. B. Araujo, W. Ren, Z. Wang, E. K. Tokuda, R. H. Junior, R. Cesar-Junior, J. Zhang, X. Guo, and X. Cao (2019) Single image deraining: a comprehensive benchmark analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
  • [8] Y. Li, R. T. Tan, X. Guo, et al. (2016) Rain streak removal using layer priors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2736–2744. Cited by: §2.1.
  • [9] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han (2020) On the variance of the adaptive learning rate and beyond. In Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020). Cited by: §4.1.
  • [10] Y. Luo, Y. Xu, and H. Ji (2015) Removing rain from a single image via discriminative sparse coding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3397–3405. Cited by: §1, §2.1.
  • [11] S. Mallat (1999) A wavelet tour of signal processing. Elsevier. Cited by: §3.1.
  • [12] E. J. McCartney (1976) Optics of the atmosphere: scattering by molecules and particles. John Wiley and Sons, New York. Cited by: §1.
  • [13] V. Mnih, N. Heess, et al. (2014) Recurrent models of visual attention. In Advances in neural information processing systems, pp. 2204–2212. Cited by: §2.2.
  • [14] J. Pan, S. Liu, D. Sun, J. Zhang, Y. Liu, J. Ren, Z. Li, J. Tang, H. Lu, Y. Tai, et al. (2018) Learning dual convolutional neural networks for low-level vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3070–3079. Cited by: §1, §1, §2.1, §4.2, Table 1.
  • [15] W. Ren, L. Ma, J. Zhang, J. Pan, X. Cao, W. Liu, and M. Yang (2018) Gated fusion network for single image dehazing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
  • [16] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1, §3.2.
  • [17] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §2.2.
  • [18] C. Yang, J. Qi, P. Chen, et al. (2020) Characterizing speech adversarial examples using self-attention u-net enhancement. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3107–3111. Cited by: §2.2.
  • [19] H. Yang and Y. Fu (2019) Wavelet u-net and the chromatic adaptation transform for single image dehazing. In IEEE International Conference on Image Processing (ICIP), pp. 2736–2740. Cited by: §1.
  • [20] H. Yang, C. H. Yang, and Y. J. Tsai (2020) Y-net: multi-scale feature aggregation network with wavelet structure similarity loss function for single image dehazing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2628–2632. Cited by: §4.1.
  • [21] H. Zhang and V. M. Patel (2018) Density-aware single image de-raining using a multi-stream dense network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §1, §2.1, §4.2, Table 1.