Density-aware Single Image De-raining using a Multi-stream Dense Network
Single image rain streak removal is an extremely challenging problem due to the presence of non-uniform rain densities in images. We present a novel density-aware multi-stream densely connected convolutional neural network-based algorithm, called DID-MDN, for joint rain density estimation and de-raining. The proposed method enables the network itself to automatically determine the rain-density information and then efficiently remove the corresponding rain-streaks guided by the estimated rain-density label. To better characterize rain-streaks with different scales and shapes, a multi-stream densely connected de-raining network is proposed which efficiently leverages features from different scales. Furthermore, a new dataset containing images with rain-density labels is created and used to train the proposed density-aware network. Extensive experiments on synthetic and real datasets demonstrate that the proposed method achieves significant improvements over the recent state-of-the-art methods. In addition, an ablation study is performed to demonstrate the improvements obtained by different modules in the proposed method. Code can be found at: https://github.com/hezhangsprinterREAD FULL TEXT VIEW PDF
Density-aware Single Image De-raining using a Multi-stream Dense Network
In many applications such as drone-based video surveillance and self driving cars, one has to process images and videos containing undesirable artifacts such as rain, snow, and fog. Furthermore, the performance of many computer vision systems often degrades when they are presented with images containing some of these artifacts. Hence, it is important to develop algorithms that can automatically remove these artifacts. In this paper, we address the problem of rain streak removal from a single image. Various methods have been proposed in the literature to address this problem[17, 6, 35, 19, 2, 14, 9, 1, 36, 33, 5].
One of the main limitations of the existing single image de-raining methods is that they are designed to deal with certain types of rainy images and they do not effectively consider various shapes, scales and density of rain drops into their algorithms. State-of-the-art de-raining algorithms such as [33, 6] often tend to over de-rain or under de-rain the image if the rain condition present in the test image is not properly considered during training. For example, when a rainy image shown in Fig. 1(a) is de-rained using the method of Fu et al. , it tends to remove some important parts in the de-rained image such as the right arm of the person, as shown in Fig. 1(b). Similarly, when  is used to de-rain the image shown in Fig. 1(d), it tends to under de-rain the image and leaves some rain streaks in the output de-rained image. Hence, more adaptive and efficient methods, that can deal with different rain density levels present in the image, are needed.
One possible solution to this problem is to build a very large training dataset with sufficient rain conditions containing various rain-density levels with different orientations and scales. This has been achieved by Fu et al.  and Yang et al., where they synthesize a novel large-scale dataset consisting of rainy images with various conditions and they train a single network based on this dataset for image de-raining. However, one drawback of this approach is that a single network may not be capable enough to learn all types of variations present in the training samples. It can be observed from Fig. 1 that both methods tend to either over de-rain or under de-rain results. Alternative solution to this problem is to learn a density-specific model for de-raining. However, this solution lacks flexibility in practical de-raining as the density label information is needed for a given rainy image to determine which network to choose for de-raining.
In order to address these issues, we propose a novel Density-aware Image De-raining method using a Multi-stream Dense Network (DID-MDN) that can automatically determine the rain-density information (i.e. heavy, medium or light) present in the input image (see Fig. 2
). The proposed method consists of two main stages: rain-density classification and rain streak removal. To accurately estimate the rain-density level, a new residual-aware classifier that makes use of the residual component in the rainy image for density classification is proposed in this paper. The rain streak removal algorithm is based on a multi-stream densely-connected network that takes into account the distinct scale and shape information of rain streaks. Once the rain-density level is estimated, we fuse the estimated density information into our final multi-stream densely-connected network to get the final de-rained output. Furthermore, to efficiently train the proposed network, a large-scale dataset consisting of 12,000 images with different rain-density levels/labels (i.e. heavy, medium and light) is synthesized. Fig.1(c) & (d) present sample results from our network, where one can clearly see that DID-MDN does not over de-rain or under de-rain the image and is able to provide better results as compared to  and .
This paper makes the following contributions:
A novel DID-MDN method which automatically determines the rain-density information and then efficiently removes the corresponding rain-streaks guided by the estimated rain-density label is proposed.
Based on the observation that residual can be used as a better feature representation in characterizing the rain-density information, a novel residual-aware classifier to efficiently determine the density-level of a given rainy image is proposed in this paper.
A new synthetic dataset consisting of 12,000 training images with rain-density labels and 1,200 test images is synthesized. To the best of our knowledge, this is the first dataset that contains the rain-density label information. Although the network is trained on our synthetic dataset, it generalizes well to real-world rainy images.
Extensive experiments are conducted on three highly challenging datasets (two synthetic and one real-world) and comparisons are performed against several recent state-of-the-art approaches. Furthermore, an ablation study is conducted to demonstrate the effects of different modules in the proposed network.
In this section, we briefly review several recent related works on single image de-raining and multi-scale feature aggregation.
Mathematically, a rainy image can be modeled as a linear combination of a rain-streak component with a clean background image , as follows
In single image de-raining, given the goal is to recover . As can be observed from (1) that image de-raining is a highly ill-posed problem. Unlike video-based methods [23, 29, 25], which leverage temporal information in removing rain components, prior-based methods have been proposed in the literature to deal with this problem. These include sparse coding-based methods [14, 9, 41], low-rank representation-based methods [2, 35]
and GMM-based (gaussian mixture model) methods. One of the limitations of some of these prior-based methods is that they often tend to over-smooth the image details [14, 35].
Recently, due to the immense success of deep learning in both high-level and low-level vision tasks[8, 31, 38, 21, 32, 34], several CNN-based methods have also been proposed for image de-raining [3, 5, 33, 6]. In these methods, the idea is to learn a mapping between input rainy images and their corresponding ground truths using a CNN structure.
It has been observed that combining convolutional features at different levels (scales) can lead to a better representation of an object in the image and its surrounding context [7, 39, 8, 11]. For instance, to efficiently leverage features obtained from different scales, the FCN (fully convolutional network) method  uses skip-connections and adds high-level prediction layers to intermediate layers to generate pixel-wise prediction results at multiple resolutions. Similarly, the U-Net architecture  consists of a contracting path to capture the context and a symmetric expanding path that enables the precise localization. The HED model  employs deeply supervised structures, and automatically learns rich hierarchical representations that are fused to resolve the challenging ambiguity in edge and object boundary detection. Multi-scale features have also been leveraged in various applications such as semantic segmentation , face-alignment , visual tracking  crowd-counting , action recognition , depth estimation , single image dehazing [22, 37] and also in single image de-raining . Similar to , we also leverage a multi-stream network to capture the rain-streak components with different scales and shapes. However, rather than using two convolutional layers with different dilation factors to combine features from different scales, we leverage the densely-connected block  as the building module and then we connect features from each block together for the final rain-streak removal. The ablation study demonstrates the effectiveness of our proposed network compared with the structure proposed in .
The proposed DID-MDN architecture mainly consists of two modules: (a) residual-aware rain-density classifier, and (b) multi-stream densely connected de-raining network. The residual-aware rain-density classifier aims to determine the rain-density level given a rainy image. On the other hand, the multi-stream densely connected de-raining network is designed to efficiently remove the rain streaks from the rainy images guided by the estimated rain-density information. The entire network architecture of the proposed DID-MDN method is shown in Fig. 2.
As discussed above, even though some of the previous methods achieve significant improvements on the de-raining performance, they often tend to over de-rain or under de-rain the image. This is mainly due to the fact that a single network may not be sufficient enough to learn different rain-densities occurring in practice. We believe that incorporating density level information into the network can benefit the overall learning procedure and hence can guarantee better generalization to different rain conditions . Similar observations have also been made in , where they use two different priors to characterize light rain and heavy rain, respectively. Unlike using two priors to characterize different rain-density conditions , the rain-density label estimated from a CNN classifier is used for guiding the de-raining process. To accurately estimate the density information given a rainy input image, a residual-aware rain-density classifier is proposed, where the residual information is leveraged to better represent the rain features. In addition, to train the classier, a large-scale synthetic dataset consisting of 12,000 rainy images with density labels is synthesized. Note that there are only three types of classes (i.e. labels) present in the dataset and they correspond to low, medium and high density.
One common strategy in training a new classifier is to fine-tune a pre-defined model such as VGG-16 , Res-net  or Dense-net  on the newly introduced dataset. One of the fundamental reasons to leverage a fine-tune strategy for the new dataset is that discriminative features encoded in these pre-defined models can be beneficial in accelerating the training and it can also guarantee better generalization. However, we observed that directly fine-tuning such a ‘deep’ model on our task is not an efficient solution. This is mainly due to the fact that high-level features (deeper part) of a CNN tend to pay more attention to localize the discriminative objects in the input image . Hence, relatively small rain-streaks may not be localized well in these high-level features. In other words, the rain-streak information may be lost in the high-level features and hence may degrade the overall classification performance. As a result, it is important to come up with a better feature representation to effectively characterize rain-streaks (i.e. rain-density).
From (1), one can regard as the residual component which can be used to characterize the rain-density. To estimate the residual component () from the observation
, a multi-stream dense-net (without the label fusion part) using the new dataset with heavy-density is trained. Then, the estimated residual is regarded as the input to train the final classifier. In this way, the residual estimation part can be regarded as the feature extraction procedure111Classificaiton network can be regarded as two parts: 1.Feature extractor and 2. Classifer, which is discussed in Section 3.2. The classification part is mainly composed of three convolutional layers (Conv) with kernel size , one average pooling (AP) layer with kernel size 99 and two fully-connected layers (FC). Details of the classifier are as follows:
where (3,24) means that the input consists of 3 channels and the output consists of 24 channels. Note that the final layer consists of a set of 3 neurons indicating the rain-density class of the input image (i.e. low, medium, high). An ablation study, discussed in Section 4.3, is conducted to demonstrate the effectiveness of proposed residual-aware classifier as compared with the VGG-16 model.
Loss for the Residual-aware Classifier:
. To efficiently train the classifier, a two-stage training protocol is leveraged. A residual feature extraction network is firstly trained to estimate the residual part of the given rainy image, then a classification sub-network is trained using the estimated residual as the input and is optimized via the ground truth labels (rain-density). Finally, the two stages (feature extraction and classification) are jointly optimized. The overall loss function used to train the residual-aware classier is as follows:
where indicates the per-pixel Euclidean-loss to estimate the residual component and indicates the cross-entropy loss for rain-density classification.
It is well-known that different rainy images contain rain-streaks with different scales and shapes. Considering the images shown in Fig. 3, the rainy image in Fig. 3 (a) contains smaller rain-streaks, which can be captured by small-scale features (with smaller receptive fields), while the image in Fig. 3 (b) contains longer rain-streaks, which can be captured by large-scale features (with larger receptive fields). Hence, we believe that combining features from different scales can be a more efficient way to capture various rain streak components [10, 33].
Based on this observation and motivated by the success of using multi-scale features for single image de-raining , a more efficient multi-stream densely-connected network to estimate the rain-streak components is proposed, where each stream is built on the dense-block introduced in  with different kernel sizes (different receptive fields). These multi-stream blocks are denoted by Dense1 , Dense2 , and Dense3 , in yellow, green and blue blocks, respectively in Fig. 2. In addition, to further improve the information flow among different blocks and to leverage features from each dense-block in estimating the rain streak components, a modified connectivity is introduced, where all the features from each block are concatenated together for rain-streak estimation. Rather than leveraging only two convolutional layers in each stream , we create short paths among features from different scales to strengthen feature aggregation and to obtain better convergence. To demonstrate the effectiveness of our proposed multi-stream network compared with the multi-scale structure proposed in , an ablation study is conducted, which is described in Section 4.
To leverage the rain-density information to guide the de-raining process, the up-sampled label map 222For example, if the label is 1, then the corresponding up-sampled label-map is of the same dimension as the output features from each stream and all the pixel values of the label map are 1. is concatenated with the rain streak features from all three streams. Then, the concatenated features are used to estimate the residual (
) rain-streak information. In addition, the residual is subtracted from the input rainy image to estimate the coarse de-rained image. Finally, to further refine the estimated coarse de-rained image and make sure better details well preserved, another two convolutional layers with ReLU are adopted as the final refinement.
There are six dense-blocks in each stream. Mathematically, each stream can be represented as
where indicates concatenation, denotes the output from the th dense block, and denotes the th stream. Furthermore, we adopt different transition layer combinations333The transition layer can function as up-sample transition, down-sample transition or no-sampling transition . and kernel sizes in each stream. Details of each stream are as follows:
Dense1: three transition-down layers, three transition-up layers and kernel size .
Dense2: two transition-down layers, two no-sampling transition layers, two transition-up layers and kernel size .
Dense3: one transition-down layer, four no-sampling transition layers, one transition-up layer and kernel size .
Note that each dense-block is followed by a transition layer. Fig 4 presents an overview of the first stream, Dense1.
Loss for the De-raining Network:. Motivated by the observation that CNN feature-based loss can better improve the semantic edge information [13, 15] and to further enhance the visual quality of the estimated de-rained image , we also leverage a weighted combination of pixel-wise Euclidean loss and the feature-based loss. The loss for training the multi-stream densely connected network is as follows
where represents the per-pixel Euclidean loss function to reconstruct the de-rained image and is the feature-based loss for the de-rained image, defined as
where represents a non-linear CNN transformation and is the recovered de-rained image. Here, we have assumed that the features are of size with channels. In our method, we compute the feature loss from the layer relu12 of the VGG-16 model .
|Input||DSC  (ICCV’15)||GMM  (CVPR’16)||CNN  (TIP’17)||JORDER  (CVPR’17)||DDN  (CVPR’17)||JBO  (ICCV’17)||DID-MDN|
|Test1||0.7781/21.15||0.7896/21.44||0.8352/22.75||0.8422/22.07||0.8622/24.32||0.8978/ 27.33||0.8522/23.05||0.9087/ 27.95|
During testing, the rain-density label information using the proposed residual-aware classifier is estimated. Then, the up-sampled label-map with the corresponding input image are fed into the multi-stream network to get the final de-rained image.
In this section, we present the experimental details and evaluation results on both synthetic and real-world datasets. De-raining performance on the synthetic data is evaluated in terms of PSNR and SSIM . Performance of different methods on real-world images is evaluated visually since the ground truth images are not available. The proposed DID-MDN method is compared with the following recent state-of-the-art methods: (a) Discriminative sparse coding-based method (DSC)  (ICCV’15), (b) Gaussian mixture model (GMM) based method  (CVPR’16), (c) CNN method (CNN)  (TIP’17), (d) Joint Rain Detection and Removal (JORDER) method  (CVPR’17), (e) Deep detailed Network method (DDN)  (CVPR’17), and (f) Joint Bi-layer Optimization (JBO) method  (ICCV’17).
Even though there exist several large-scale synthetic datasets [6, 36, 33], they lack the availability of the corresponding rain-density label information for each synthetic rainy image. Hence, we develop a new dataset, denoted as Train1, consisting of 12,000 images, where each image is assigned a label based on its corresponding rain-density level. There are three rain-density labels present in the dataset (e.g. light, medium and heavy). There are roughly 4,000 images per rain-density level in the dataset. Similarly, we also synthesize a new test set, denoted as Test1, which consists of a total of 1,200 images. It is ensured that each dataset contains rain streaks with different orientations and scales. Images are synthesized using Photoshop. We modify the noise level introduced in step 3 of 444http://www.photoshopessentials.com/photo-effects/photoshopweather-effects-rain/ to generate different rain-density images, where light, medium and heavy rain conditions correspond to the noise levels , , and , respectively 555The reason why we use three labels is that during our experiments, we found that having more than three rain-density levels does not significantly improve the performance. Hence, we only use three labels (heavy, medium and light) in the experiments.. Sample synthesized images under these three conditions are shown in Fig 5. To better test the generalization capability of the proposed method, we also randomly sample 1,000 images from the synthetic dataset provided by Fu  as another testing set, denoted as Test2.
During training, a 512 512 image is randomly cropped from the input image (or its horizontal flip) of size 586
586. Adam is used as optimization algorithm with a mini-batch size of 1. The learning rate starts from 0.001 and is divided by 10 after 20 epoch. The models are trained for up to 80
12000 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. The entire network is trained using the Pytorch framework. During training, we set. All the parameters are defined via cross-validation using the validation set.
The first ablation study is conducted to demonstrate the effectiveness of the proposed residual-aware classifier compared to the VGG-16  model. The two classifiers are trained using our synthesized training samples Train1 and tested on the Test1 set. The classification accuracy corresponding to both classifiers on Test1 is tabulated in Table 3. It can be observed that the proposed residual-aware classifier is more accurate than the VGG-16 model for predicting the rain-density levels.
|Accuracy||73.32 %||85.15 %|
In the second ablation study, we demonstrate the effectiveness of different modules in our method by conducting the following experiments:
Single: A single-stream densely connected network (Dense2) without the procedure of label fusion.
Yang-Multi 666To better demonstrate the effectiveness of our proposed muli-stream network compared with the state-of-the-art multi-scale structure proposed in , we replace our multi-stream dense-net part with the multi-scale structured in  and keep all the other parts the same.: Multi-stream network trained without the procedure of label fusion.
Multi-no-label: Multi-stream densely connected network trained without the procedure of label fusion.
DID-MDN (our): Multi-stream Densely-connected network trained with the procedure of estimated label fusion.
The average PSNR and SSIM results evaluated on Test1 are tabulated in Table 2. As shown in Fig. 6, even though the single stream network and Yang’s multi-stream network  are able to successfully remove the rain streak components, they both tend to over de-rain the image with the blurry output. The multi-stream network without label fusion is unable to accurately estimate the rain-density level and hence it tends to leave some rain streaks in the de-rained image (especially observed from the derained-part around the light). In contrast, the proposed multi-stream network with label fusion approach is capable of removing rain streaks while preserving the background details. Similar observations can be made using the quantitative results as shown in Table 2.
We compare quantitative and qualitative performance of different methods on the test images from the two synthetic datasets - Test1 and Test2. Quantitative results corresponding to different methods are tabulated in Table 1. It can be clearly observed that the proposed DID-MDN is able to achieve superior quantitative performance.
To visually demonstrate the improvements obtained by the proposed method on the synthetic dataset, results on two sample images selected from Test2 and one sample chosen from our newly synthesized Test1 are presented in Figure 7. Note that we selectively sample images from all three conditions to show that our method performs well under different variations 777Due to space limitations and for better comparisons, we only show the results corresponding to the most recent state-of-the-art methods [33, 6, 41] in the main paper. More results corresponding to the other methods [19, 17, 5] can be found in Supplementary Material.. While the JORDER method  is able to remove some parts of the rain-streaks, it still tends to leave some rain-streaks in the de-rained images. Similar results are also observed from . Even though the method of Fu et al.  is able to remove the rain-streak, especially in the medium and light rain conditions, it tends to remove some important details as well, such as flower details, as shown in the second row and window structures as shown in the third row (Details can be better observed via zooming-in the figure). Overall, the proposed method is able to preserve better details while effectively removing the rain-streak components.
The performance of the proposed method is also evaluated on many real-world images downloaded from the Internet and also real-world images published by the authors of [36, 6]. The de-raining results are shown in Fig 8.
As before, previous methods either tend to under de-rain or over de-rain the images. In contrast, the proposed method achieves better results in terms of effectively removing rain streaks while preserving the image details. In addition, it can be observed that the proposed method is able to deal with different types of rain conditions, such as heavy rain shown in the second row of Fig 8 and medium rain shown in the fifth row of Fig 8. Furthermore, the proposed method can effectively deal with rain-streaks containing different shapes and scales such as small round rain streaks shown in the third row in Fig 8 and long-thin rain-streak in the second row in Fig 8. Overall, the results evaluated on real-world images captured from different rain conditions demonstrate the effectiveness and the robustness of the proposed DID-MDN method. More results can be found in Supplementary Material.
Running time comparisons are shown in the table below. It can be observed that the testing time of the proposed DID-MDN is comparable to the DDN  method. On average, it takes about 0.3s to de-rain an image of size .
|DSC||GMM||CNN (GPU)||JORDER (GPU)||DDN (GPU)||JBO (CPU)||DID-MDN (GPU)|
In this paper, we propose a novel density-aware image deraining method with multi-stream densely connected network (DID-MDN) for jointly rain-density estimation and deraining. In comparison to existing approaches which attempt to solve the de-raining problem using a single network to learn to remove rain streaks with different densities (heavy, medium and light), we investigated the use of estimated rain-density label for guiding the synthesis of the de-rained image. To efficiently predict the rain-density label, a residual-aware rain-density classier is proposed in this paper. Detailed experiments and comparisons are performed on two synthetic and one real-world datasets to demonstrate that the proposed DID-MDN method significantly outperforms many recent state-of-the-art methods. Additionally, the proposed DID-MDN method is compared against baseline configurations to illustrate the performance gains obtained by each module.
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1715–1723, July 2017.
Perceptual losses for real-time style transfer and super-resolution.In European Conference on Computer Vision, pages 694–711. Springer, 2016.
Reconstruction for feature disentanglement in pose-invariant face recognition.In ICCV, 2017.
Learning deep features for discriminative localization.In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.