Single Image Deraining: A Comprehensive Benchmark Analysis
We present a comprehensive study and evaluation of existing single image deraining algorithms, using a new large-scale benchmark consisting of both synthetic and real-world rainy images. This dataset highlights diverse data sources and image contents, and is divided into three subsets (rain streak, rain drop, rain and mist), each serving different training or evaluation purposes. We further provide a rich variety of criteria for deraining algorithm evaluation, ranging from full-reference metrics, to no-reference metrics, to subjective evaluation and the novel task-driven evaluation. Experiments on the dataset shed light on the comparisons and limitations of state-of-the-art deraining algorithms, and suggest promising future directions.
Images captured on rainy days suffer from noticeable degradation of scene visibility. The goal of single image deraining algorithms is to generate a sharp image from a rainy input image. Image deraining can potentially benefit both the human visual perception quality of images, and many computer vision applications, such as outdoor surveillance systems and intelligent vehicles.
and deep convolutional neural network (CNN)-based models [6, 7, 8]. However, a fair and comprehensive study of the problem, the existing algorithms, and the performance metrics has been absent so far; providing one is the goal of this paper.
As a complicated atmospheric process, rain can cause several different types of visibility degradation, due to a multitude of environmental factors including raindrop size, rain density, and wind velocity. When a rainy image is taken, the visual effects of rain on that digital image further hinge on many camera parameters, such as exposure time, depth of field, and resolution. Most existing deraining works assume one rain model (usually rain streak), which may oversimplify the problem. We group existing rain models in the literature into three major categories: rain streak, raindrop, and rain and mist.
A rain streak image $R_s$ can be modeled as a linear superimposition of the clean background scene $B$ and the sparse, line-shaped rain streak component $S$:
$$R_s = B + S. \qquad (1)$$
Rain streaks accumulated throughout the scene reduce the visibility of the background $B$. This is the most common model assumed by the majority of deraining algorithms.
Adherent raindrops that fall and flow on camera lenses or window glass can obstruct and/or blur the background scene. The raindrop-degraded image $R_d$ can be modeled as the combination of the clean background $B$ and the blurring or obstructing effect $D$ of the raindrops in scattered, small-sized, locally coherent regions:
$$R_d = (1 - M) \odot B + D, \qquad (2)$$
where $M$ is a binary mask and $\odot$ means element-wise multiplication. In the mask, a pixel $x$ is part of a raindrop region if $M(x) = 1$, and otherwise belongs to the background.
Further, rainy images often contain both rain and mist in real cases. In addition, distant rain streaks accumulated throughout the scene reduce the visibility in a manner more similar to fog, creating a mist-like phenomenon in the image background. Concerning this, we can define the rain and mist model for the captured image $R_m$, based on a composition of the rain streak model and the atmospheric scattering haze model:
$$R_m = B \odot t + A(1 - t) + S, \qquad (3)$$
where $S$ is the rain streak component; $t$ and $A$ are the transmission map and atmospheric light that determine the fog/mist component, respectively.
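As a rough numerical illustration of the three rain models described above, the following sketch composes each degraded image pixel by pixel. The function names, the nested-list grayscale image representation, the scalar atmospheric light, and the clipping to [0, 1] are assumptions made for this example; they are not taken from the benchmark's own code.

```python
from typing import List

Image = List[List[float]]  # grayscale image as nested lists, values in [0, 1]


def clip01(value: float) -> float:
    """Keep a pixel value inside the valid [0, 1] range."""
    return min(1.0, max(0.0, value))


def rain_streak(background: Image, streaks: Image) -> Image:
    """Additive model: clean background plus a sparse streak layer."""
    return [[clip01(b + s) for b, s in zip(row_b, row_s)]
            for row_b, row_s in zip(background, streaks)]


def raindrop(background: Image, mask: Image, drops: Image) -> Image:
    """Masked model: the background is kept where mask == 0,
    and the raindrop effect is added on top."""
    return [[clip01((1.0 - m) * b + d) for b, m, d in zip(row_b, row_m, row_d)]
            for row_b, row_m, row_d in zip(background, mask, drops)]


def rain_and_mist(background: Image, transmission: Image,
                  airlight: float, streaks: Image) -> Image:
    """Haze composition (atmospheric scattering) followed by additive streaks."""
    return [[clip01(b * t + airlight * (1.0 - t) + s)
             for b, t, s in zip(row_b, row_t, row_s)]
            for row_b, row_t, row_s in zip(background, transmission, streaks)]


if __name__ == "__main__":
    clean = [[0.2, 0.4]]
    print(rain_streak(clean, [[0.0, 0.5]]))
    print(raindrop(clean, [[0.0, 1.0]], [[0.0, 0.7]]))
    print(rain_and_mist(clean, [[0.5, 0.5]], 1.0, [[0.1, 0.0]]))
```

Keeping the three compositions as separate functions mirrors the grouping into rain streak, raindrop, and rain and mist; a real pipeline would operate on color arrays rather than nested lists.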
Regardless of which rain model is followed, image deraining is a heavily ill-posed problem. Despite many impressive methods published in the past few years, the lack of a large dataset and of algorithm benchmarking makes it difficult to evaluate the progress made, and how practically useful those algorithms are. There are several unclear and unsatisfactory aspects of current deraining algorithm development, including but not limited to: i) the modeling of rain is oversimplified, i.e., each method considers and is evaluated with one type of rain only, e.g., rain streak; ii) most quantitative results are reported on synthetic images, which often fail to capture the complexity and characteristics of real rain; iii) as a result of the last point, the evaluation metrics have been mostly limited to (the full-reference) PSNR and SSIM for image restoration purposes. These metrics may correlate poorly with other purposes, such as human perceptual quality or computer vision utility.
In this paper, we aim to systematically evaluate state-of-the-art single image deraining methods, in a comprehensive and fair setting. To this end, we construct a large-scale benchmark, called Multi-Purpose Image Deraining (MPID). An overview of MPID can be found in Table 1, and image examples are displayed in Figure 1. Compared with existing synthetic sets, the MPID dataset covers a much larger diversity of rain models (rain streak, raindrop, and rain and mist), includes both synthetic and real-world images for evaluation, and features diverse contents and sources (for real rainy images). In addition, as a first-of-its-kind effort in image deraining, we have annotated two sets of real-world rainy images with object bounding boxes from autonomous driving and video surveillance scenarios, respectively, for task-specific evaluation.
Using the MPID benchmark, we evaluate six state-of-the-art single image deraining algorithms. We adopt a wide range of full-reference metrics (PSNR and SSIM) and no-reference metrics (NIQE, BLIINDS-II, and SSEQ), and also conduct a human subjective study, to thoroughly examine the performance of image deraining methods. Furthermore, as image deraining may be expected to serve as a preprocessing step for mid- and high-level computer vision tasks, we also evaluate current algorithms in terms of their impact on subsequent object detection, as a “task-specific” evaluation criterion. We reveal performance gaps in various aspects when these algorithms are applied to synthetic and real images. By extensively comparing the state-of-the-art single image deraining algorithms on the MPID dataset, we gain insights into new research directions for image deraining.
Multi-frame based approaches:
Early methods often require multiple frames to deal with the deraining problem [4, 16, 17, 18, 19, 5, 20, 11]. Garg and Nayar proposed a method for detecting and removing rain streaks from a video by taking the average intensity of the detected rain streaks from the previous and subsequent frames. Follow-up work further improved the performance by selecting camera parameters without appreciably altering the scene appearance. However, those methods are not applicable to single image deraining.
Prior based algorithms:
Many deraining methods capitalize on clean image or rain type priors to remove rain [22, 1, 23, 24, 25]. Kang et al. decomposed an input image into its low- and high-frequency components, and then separated the rain streak frequencies from the high-frequency layer via sparse coding. Zhu et al. introduced a rain removal method based on the prior that rain streaks typically span a narrow range of directions. Chen and Hsu decomposed the background and rain streak layers based on low-rank priors. Li et al. used patch-based priors for both the clean background and rain layers in the form of Gaussian mixture models. All of the above approaches rely on good (and relatively simple) hand-crafted priors. As a result, they tend to perform unsatisfactorily on real images with complicated scenes and rain forms.
Data-driven CNN models:
Recently, CNNs have achieved dominant success in image restoration [28, 29], including single image deraining [30, 31]. Fu et al. proposed a deep detail network (DDN) for removing rain from single images with details preserved. Yang et al. presented a CNN-based method to jointly detect and remove rain streaks, using a multi-stream network to capture the rain streak component. A density-aware multi-stream densely connected convolutional neural network (DID-MDN) was introduced for joint rain density estimation and image deraining. Qian et al. addressed a different problem of removing raindrops from single images, using visual attention with a generative adversarial network (GAN). Despite the progress of deep-learning-based approaches compared with prior-based rain removal methods, their performance hinges on the synthetic training data, which may become problematic if real rainy images show a domain mismatch.
Several datasets have been used to measure and compare the performance of image deraining algorithms. Li et al. introduced a set of 12 images using photo-realistic rendering techniques. Zhang et al. synthesized a set of training and testing images with rain streaks, following an existing synthesis procedure; the training set consists of 700 images and the testing set of 100 images. In addition, a dataset of 50 real-world rainy images downloaded from the web was collected for qualitative visual comparison. A set of paired clean and raindrop-corrupted images, captured using special lens equipment, has also been released. However, existing datasets are either too small in scale and limited to one rain type (streak or drop), or lack sufficient real-world images for diverse evaluations. Besides, none of them has any semantic annotation or considers subsequent task performance.
Table 1: Overview of the MPID dataset.
|Training Set|
|Subset||Number of Images||Real/synthetic||Annotations||Metrics|
|Rain streak (T)||2400 (pairs)||synthetic||No||–|
|Raindrop (T)||861 (pairs)||synthetic||No||–|
|Rain and mist (T)||700 (pairs)||synthetic||No||–|
|Testing Set|
|Subset||Number of Images||Real/synthetic||Annotations||Metrics|
|Rain streak (S)||200 (pairs)||synthetic||No||PSNR, SSIM, NIQE, BLIINDS-II, SSEQ|
|Rain streak (R)||50||real||No||NIQE, BLIINDS-II, SSEQ|
|Raindrop (S)||149 (pairs)||synthetic||No||PSNR, SSIM, NIQE, BLIINDS-II, SSEQ|
|Raindrop (R)||58||real||No||NIQE, BLIINDS-II, SSEQ|
|Rain and mist (S)||70 (pairs)||synthetic||No||PSNR, SSIM, NIQE, BLIINDS-II, SSEQ|
|Rain and mist (R)||30||real||No||NIQE, BLIINDS-II, SSEQ|
|Task-Driven Evaluation Set|
|Subset||Number of Images||Real/synthetic||Annotations||Metrics|
|RID||2496||real||Yes (bounding boxes)||mAP|
|RIS||2048||real||Yes (bounding boxes)||mAP|
We present a new benchmark as a comprehensive platform for evaluating single image deraining algorithms from a variety of perspectives. Our evaluation angles range from traditional PSNR/SSIM, to no-reference perception-driven metrics and human subjective quality, to “task-driven metrics” [15, 34] indicating how well a target computer vision task can be performed on the derained images. To fit those purposes, we generate or collect images at large scale, from both synthetic and real-world sources, covering diverse real-life scenes, and annotate them when needed. The new benchmark, dubbed Multi-Purpose Image Deraining (MPID), is introduced below in detail. An overview of MPID can be found in Table 1.
Following the three rain models in Section 1.1, we create three training sets, named Rain streak (T), Rain drop (T) and Rain and mist (T) (T short for “training”), respectively. All three sets are synthesized in controlled settings from clean images. (Note that for Rain drop (T), the data generation used physical simulation, i.e., with/without lens, rather than algorithmic simulation.) All clean images used are collected from the web, and we specifically pick outdoor rain-free, haze-free photos taken in cloudy daylight, so that the synthesized rainy images look more realistic in terms of lighting conditions (for example, there will be no rainy photo with a sunny daylight background).
The Rain streak (T) set contains 2,400 pairs of clean and rainy images, where the rainy images are generated from the clean ones using Eq. (1), with the identical protocol and hyperparameters to [27, 33]. The Rain drop (T) set was borrowed from a publicly released training set consisting of 861 pairs of clean and raindrop-corrupted images, with its authors’ consent. The Rain and mist (T) set is synthesized by first adding haze using the atmospheric scattering model: for each clean image, we estimate depth using the algorithm in [35, 36] as recommended by , set different atmospheric lights by choosing each channel uniformly at random between , and select uniformly at random between . Then, from the synthesized hazy version, we further add rain streaks in the same way as for Rain streak (T). We end up with 700 pairs for the Rain and mist (T) set.
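The rain and mist synthesis steps above can be sketched as follows. This is a minimal illustration, not the benchmark's generation code: it assumes the common parameterization in which the transmission decays exponentially with depth, t = exp(-beta * depth), with a scattering coefficient here called `beta`. The function name, the nested-list image layout, and both sampling ranges are hypothetical; the ranges are caller-supplied parameters, and the values in the usage example are placeholders rather than the benchmark's settings.

```python
import math
import random


def synthesize_rain_and_mist(clean, depth, streaks,
                             airlight_range, beta_range, rng):
    """Add haze, then rain streaks, to a clean RGB image.

    clean:   H x W x 3 nested lists with values in [0, 1]
    depth:   H x W nested lists (estimated scene depth)
    streaks: H x W nested lists (rain streak layer)
    airlight_range, beta_range: (low, high) tuples chosen by the caller
    """
    # One atmospheric light value per color channel, drawn uniformly.
    airlight = [rng.uniform(*airlight_range) for _ in range(3)]
    # Assumed scattering coefficient controlling haze thickness.
    beta = rng.uniform(*beta_range)

    rainy = []
    for row_c, row_d, row_s in zip(clean, depth, streaks):
        out_row = []
        for pixel, d, s in zip(row_c, row_d, row_s):
            t = math.exp(-beta * d)  # transmission shrinks as depth grows
            out_row.append([min(1.0, c * t + a * (1.0 - t) + s)
                            for c, a in zip(pixel, airlight)])
        rainy.append(out_row)
    return rainy


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [[[0.2, 0.3, 0.4], [0.5, 0.5, 0.5]]]  # 1 x 2 RGB image
    depth = [[0.0, 2.0]]
    streaks = [[0.0, 0.1]]
    # Placeholder ranges for illustration only.
    print(synthesize_rain_and_mist(clean, depth, streaks,
                                   airlight_range=(0.7, 1.0),
                                   beta_range=(0.5, 1.5), rng=rng))
```

Passing the random generator explicitly keeps the synthesis reproducible, which matters when the same clean image must yield the same degraded pair across runs.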
Corresponding to the three training sets, we generate three synthetic testing sets in the same way, denoted as Rain streak (S), Rain drop (S), and Rain and mist (S) (S short for “synthetic testing”), consisting of 200, 149, and 70 pairs, respectively. On each testing set, we evaluate the restoration performance of deraining algorithms using the classical PSNR and SSIM metrics. Further, to predict the derained image’s perceptual quality to human viewers, we introduce the usage of three no-reference IQA models: Naturalness Image Quality Evaluator (NIQE), spatial-spectral entropy-based quality (SSEQ), and blind image integrity notator using DCT statistics (BLIINDS-II), to complement the shortcomings of PSNR/SSIM. NIQE is a well-known no-reference image quality score indicating the perceived “naturalness” of an image: a smaller score indicates better perceptual quality. The SSEQ and BLIINDS-II scores that we use range from 0 (worst) to 100 (best). (Note that in their original formulations, a smaller SSEQ/BLIINDS-II score indicates better perceptual quality. We reverse the two scores (100 minus the original score) to make their trends consistent with the full-reference metrics: in our tables, the larger the two values, the better the perceptual quality. We did not do the same to NIQE, because NIQE has no bounded maximum value.)
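To make two of the quantities above concrete, the sketch below computes PSNR from the mean squared error and applies the "100 minus" reversal used for SSEQ and BLIINDS-II. The function names, the nested-list grayscale images, and the peak value of 1.0 are assumptions for this example; SSIM and the no-reference models themselves are more involved and are not reproduced here.

```python
import math


def psnr(reference, restored, peak=1.0):
    """Peak signal-to-noise ratio (dB) between two same-sized grayscale
    images given as nested lists with values in [0, peak]."""
    squared_errors = [(r - x) ** 2
                      for row_r, row_x in zip(reference, restored)
                      for r, x in zip(row_r, row_x)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def reverse_score(raw_score, upper=100.0):
    """Flip a 'smaller is better' score bounded by `upper`
    so that larger values mean better quality."""
    return upper - raw_score


if __name__ == "__main__":
    clean = [[0.0, 1.0]]
    derained = [[0.1, 0.9]]
    print(psnr(clean, derained))
    print(reverse_score(30.0))
```

The reversal only makes sense for bounded scores, which is why it is applied to SSEQ and BLIINDS-II but not to NIQE.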
Besides the three synthetic test sets above, we collect three sets of real-world images, one for each of the three defined rain categories, to evaluate the deraining algorithms’ real-world generalization. The three sets, denoted as Rain streak (R), Raindrop (R), and Rain and mist (R) (R short for “real-world testing”), are collected from the Internet and are carefully inspected to ensure that the images in each set fit the pre-defined rain type well. Since ground-truth clean images are unavailable in the real world, we evaluate NIQE, SSEQ, and BLIINDS-II on the three real-world sets. In addition, we also pick a small set of real-world images for human subjective rating of derained results.
As pointed out by several recent works [41, 15, 42, 43], the performance of high-level computer vision tasks, such as object detection and recognition, will deteriorate in the presence of various sensory and environmental degradations. While deraining could be used as pre-processing for many computer vision tasks executed in rainy conditions, there has been no systematic study of deraining algorithms’ impact on those target tasks. We consider the resulting task performance after deraining as an indirect indicator of the deraining quality. Such a “task-driven” evaluation approach has received little attention and can have great implications for outdoor applications.
To conduct such task-driven evaluations, realistic annotated datasets are necessary. To the best of our knowledge, there has been no dataset available for evaluating deraining algorithms in task-driven ways. We therefore collect two sets on our own: a Rain in Driving (RID) set collected from car-mounted cameras when driving in rainy weather, and a Rain in Surveillance (RIS) set collected from networked traffic surveillance cameras on rainy days.
For either set, we annotate object bounding boxes, and evaluate object detection performance after applying deraining. A summary with object statistics on both RID and RIS sets can be found in Table 2. The two sets differ in many ways: rain type, image quality, object size and angle, and so on. They are representative of real application scenarios where deraining may be desired.
The RID set contains 2,495 real rainy images from high-resolution driving videos. As we observe, its rain effect is closest to “raindrops” on the camera lens. The images were captured in diverse real traffic locations and scenes during multiple drives. We label bounding boxes for selected traffic objects that commonly appear on the roads across the images: car, person, bus, bicycle, and motorcycle. Most images are of 1920 × 990 resolution, with a few exceptions of 4023 × 3024 resolution.
The RIS set contains 2,048 real rainy images from relatively lower-resolution surveillance video cameras. They were extracted from a total of 154 surveillance cameras in daytime, ensuring diversity in content (for example, we do not consider frames too close in time). As we observe, its rain effect is closest to “rain and mist” (many cameras have mist condensation during rain, and the low resolution also causes more foggy effects). We selected and annotated the most common objects in the traffic surveillance scenes: car, person, bus, truck, and motorcycle. The vast majority of cameras have a resolution of 640 × 368, with a few exceptions of 640 × 480.
Table 3: Results on the synthetic test sets. Columns: Degraded, GMM, JORDER, DDN, CGAN, DID-MDN, DeRaindrop.
Table 4: Results on the real-world test sets. Columns: Degraded, GMM, JORDER, DDN, CGAN, DID-MDN, DeRaindrop.
Table 5: Subjective scores.
|Rain type||Rainy||GMM||JORDER||DDN||CGAN||DID-MDN||DeRaindrop|
|rain and mist||0.44||1.00||0.70||0.90||1.22||1.40||–|
Table 6: Detection mAP on the RID and RIS sets. Columns: Rainy, JORDER, DDN, CGAN, DID-MDN, DeRaindrop.
We evaluate six representative state-of-the-art algorithms on MPID: Gaussian mixture model prior (GMM), JOint Rain DEtection and Removal (JORDER), Deep Detail Network (DDN), Conditional Generative Adversarial Network (CGAN), Density-aware Image De-raining method using a Multi-stream Dense Network (DID-MDN), and DeRaindrop. All except GMM are state-of-the-art CNN-based deraining algorithms.
Evaluation Protocol. The first five models are specifically developed for removing rain streaks, while the last one targets raindrop removal. Therefore, we compare the first five on the rain streak sets. Since DeRaindrop is the only recently published method for raindrop removal, to provide more baselines for its performance, we also re-train and evaluate the other five models on the raindrop sets. Finally, since no published method targets removing rain and mist together, we create a cascaded pipeline, by first running each of the five rain streak removal algorithms and then feeding the result into a pre-trained MSCNN dehazing network. MSCNN was chosen because recent dehazing studies [15, 48] endorsed it both for producing the most human-favorable, artifact-free dehazing results, and for benefiting subsequent high-level tasks in haze the most. Such a cascaded pipeline can be tuned end to end, and we freeze the MSCNN part during tuning in order to focus on comparing the deraining components. All models are re-trained on the corresponding MPID training set when evaluated on a certain rain type.
We first compare the derained results on the synthetic images using two full-reference metrics (PSNR and SSIM) and three no-reference metrics (NIQE, SSEQ, and BLIINDS-II). As seen from Table 3, the metrics largely agree on synthetic data. First, DDN is the obvious winner on the rain streak (S) set, followed by JORDER; the same two methods also consistently perform the best on the rain and mist (S) set. Second, DeRaindrop performs the best on the rain drop (S) set, significantly surpassing the others in terms of PSNR and SSIM in particular, showing that its specific structure indeed suits this problem. The other rain streak removal models even seem to hurt PSNR, SSIM, and BLIINDS-II, compared to the degraded images.
The effectiveness of the winners can be ascribed to a two-step strategy of rain detection and removal. We note that DDN focuses on high-frequency details during training, while JORDER first detects the locations of rain streaks and then removes rain based on the estimated rain streak regions. Similarly, DeRaindrop uses an attentive generative network to first generate a raindrop mask and then derains the image guided by that mask. Therefore, removing background interference and attentively focusing on rain regions seem to be the main reasons for the winners’ success.
We then show the derained results on the real-world images in Table 4, using three no-reference metrics (NIQE, SSEQ, and BLIINDS-II). The rain streak (R) and raindrop (R) sets show results consistent with their synthetic counterparts: JORDER and DDN rank top-two on the former, while DeRaindrop still dominates on the raindrop set. However, a different tendency is observed on the rain and mist (R) set: CGAN becomes the dominant winner on those real images, outperforming both DDN and JORDER by large margins. As we observed, since CGAN is the most free of physical priors or rain type assumptions, it has the largest flexibility to be re-trained to fit different data. Its results are also the most photo-realistic due to the adversarial loss. Additionally, the result might suggest a larger domain gap between synthetic and real rain and mist data.
We next conduct a human subjective survey to evaluate the performance of image deraining algorithms. We follow a standard setting that fits a Bradley-Terry model to estimate the subjective score for each method so that they can be ranked, with exactly the same routine as described in previous similar works. We select 10 images from Rain streak (R), 6 images from Rain drop (R), and 11 images from Rain and mist (R), taking all possible care to ensure that they have very diverse contents and quality. Each rain streak or rain and mist image is processed with each of the five deraining algorithms (all except DeRaindrop), and the five deraining results, together with the original rainy image, are sent for pairwise comparison to construct the winning matrix. For a rain drop image, the procedure is the same except that it is processed by all six methods. We collect the pairwise comparison results from 11 human raters. Despite the relatively small number of raters, we observed good consensus and small inter-person variance among raters on the same pairs’ comparison results, which makes the scores trustworthy.
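As an illustration of how a winning matrix can be turned into per-method scores, the sketch below fits Bradley-Terry strengths with the standard minorization-maximization (MM) update. It is an assumed, generic fitting routine, not necessarily the one used in the study: the function name, the normalization of strengths to sum to one, and the toy win counts are all made up for this example, and it assumes at least one recorded win overall.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise comparisons.

    wins[i][j] is the number of times method i was preferred over method j.
    Returns strengths normalized to sum to 1; larger means more preferred.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # MM update: wins divided by the comparison-weighted inverse
            # of the summed strengths of each compared pair.
            denom = sum((wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                        for j in range(n)
                        if j != i and wins[i][j] + wins[j][i] > 0)
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


if __name__ == "__main__":
    # Toy winning matrix for three hypothetical methods (10 votes per pair).
    wins = [[0, 8, 9],
            [2, 0, 7],
            [1, 3, 0]]
    scores = fit_bradley_terry(wins)
    ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
    print(scores, ranking)
```

Because only the ratios of the strengths are identified by the model, any normalization is a convention; this is consistent with reading the rank of the scores rather than their absolute values.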
The subjective scores are reported in Table 5. Note that we did not normalize the scores, so it is the score rank rather than the absolute score value that is meaningful here. On the rain streak images, most human viewers seem to prefer CGAN first, and then DDN. As shown in the first row of Figure 2, the derained result generated by CGAN is smoother than the others. On the raindrop images, somewhat to our surprise, DeRaindrop is not favored by users; instead, the non-CNN-based GMM method, which showed no advantage under the previous objective metrics, was highly preferred by users. We conjecture that the patch-based Gaussian mixture prior can treat and remove both rain streaks and raindrops as “outliers”, and is less sensitive to the domain difference between training and testing data. Finally, on the rain and mist images, DID-MDN receives the highest scores, with CGAN next to it. This is mainly thanks to the incorporated rain-density subnetwork or GAN, which can provide more information about the scene context and hence improve generalization to complex rain conditions.
While we are in the process of recruiting more human raters to further solidify our subjective score results, our results seem consistent so far, and might in turn imply that off-the-shelf no-reference perceptual metrics (SSEQ, NIQE, BLIINDS-II) do not align well with real human perception of deraining quality. In fact, recent works have already discovered similar misalignments when applying standard no-reference metrics to estimate defogging perceptual quality, and proposed fog-specific metrics. Similar efforts have not yet been made for deraining, and we expect this worthy effort to take place in the near future.
Figure 2 panels: (a) Rainy input, (b) GMM, (c) JORDER, (d) DDN, (e) CGAN, (f) DID-MDN, (g) DeRaindrop.
Figure 3 panels: (a) Rainy input, (b) JORDER, (c) DDN, (d) CGAN, (e) DID-MDN, (f) DeRaindrop, (g) Ground truth.
We first apply all deraining algorithms except GMM to pre-process the two task-driven testing sets. (We did not include GMM for the two sets because (1) it did not yield promising results when we tried to apply it to part of the two sets, and (2) it runs very slowly, given that the two sets are large.) Due to their different rain characteristics, for the RID set, we use deraining algorithms trained on the rain and mist case; for the RIS set, we use deraining algorithms trained on the raindrop case. We visually inspected the derained results and found the rain to be visually attenuated after applying the selected deraining algorithms. We show some derained results on the RID and RIS sets in the supplementary material.
We then study object detection performance on the derained sets, using several state-of-the-art object detection models: Faster R-CNN (FRCNN), YOLO-V3, SSD-512, and RetinaNet. Finally, we compare all deraining algorithms via the mean Average Precision (mAP) results achieved. It is important to note that our primary goal is not to optimize detection performance on rainy days, but to use a strong detection model as a fixed, fair metric for comparing deraining performance from a complementary perspective. For this reason, the object detectors should not be adapted to rainy or derained images, and we use the authors’ models pre-trained on MS COCO. The underlying hypotheses are: i) an object detector trained on clean natural images will perform the best when the input is also from, or close to, the clean image domain; ii) for detection in rain, the better the rain is removed, the better an object detection model (trained on clean images) will then perform. Such a task-specific evaluation philosophy follows [34, 15].
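For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision for a single object class and then average it over classes. The matching rule (greedy matching by descending confidence at an IoU threshold of 0.5) and the all-point interpolation are assumptions in the style of PASCAL VOC; the exact mAP protocol used for the reported numbers is not specified in the text, and all function names and the data layout are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, iou_thresh=0.5):
    """AP for one class.

    detections:    list of (image_id, confidence, box)
    ground_truths: dict mapping image_id -> list of boxes
    """
    n_gt = sum(len(boxes) for boxes in ground_truths.values())
    if n_gt == 0:
        return 0.0
    used = {img: [False] * len(boxes) for img, boxes in ground_truths.items()}
    hits = []
    for img, _conf, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_idx = 0.0, -1
        for idx, gt_box in enumerate(ground_truths.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        # A detection is a true positive only if it matches an unused box.
        hit = best_iou >= iou_thresh and not used[img][best_idx]
        if hit:
            used[img][best_idx] = True
        hits.append(hit)

    precisions, recalls, true_pos = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_pos += int(hit)
        precisions.append(true_pos / rank)
        recalls.append(true_pos / n_gt)
    # Make precision non-increasing in recall before integrating.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class: dict mapping class name -> (detections, ground_truths)."""
    aps = [average_precision(d, g) for d, g in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    gts = {"a": [(0, 0, 10, 10)], "b": [(0, 0, 10, 10)]}
    dets = [("a", 0.9, (0, 0, 10, 10)),
            ("b", 0.8, (20, 20, 30, 30)),
            ("b", 0.7, (1, 1, 10, 10))]
    print(average_precision(dets, gts))
    print(mean_average_precision({"car": (dets, gts)}))
```

Holding the detector and this scoring routine fixed while swapping only the deraining step is what lets the resulting mAP serve as a comparison between deraining methods.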
Table 6 reports the mAP results of the different deraining algorithms, achieved using four different detection models, on both the RID and RIS sets. We find that well-aligned conclusions can be drawn from the two sets.
Perhaps surprisingly at first glance, we find that almost all existing deraining algorithms deteriorate the detection performance compared to directly using the rainy images, for YOLO-V3, SSD-512, and RetinaNet. (The only exception is FRCNN on the RID set. However, its overall mAP result is the worst compared to the other three. That implies a strong domain mismatch, suggesting that FRCNN results might not be as reliable an indicator of RID deraining performance as the other three.) Our observation concurs with the conclusion of another recent study (on dehazing): since those deraining algorithms were not trained or optimized towards the end goal of object detection, they do not necessarily help this goal, and the deraining process itself might have lost discriminative, semantically meaningful true information.
The results on both the RID and RIS sets in Table 6 show that YOLO-V3 achieves the best detection performance, independently of the deraining algorithm applied. Figure 3 shows detections using YOLO-V3 on the respective rainy images and their derained results for all deraining algorithms considered in this comparison. Since both RID and RIS have many small objects due to their relatively long distance from the camera, we believe that YOLO-V3 benefits here from its new multi-scale prediction structure, which is known to improve small object detection dramatically. We further notice a fairly weak correlation between the mAP results and the no-reference evaluation results of the derained images; see the supplementary material for more details.
This paper proposes a new large-scale benchmark and presents a thorough survey of state-of-the-art single image deraining methods. Based on our evaluation and analysis, we present overall remarks and hypotheses below, which we hope can shed some light on future deraining research:
Rain types are diverse and call for specialized models. Certain models or components are revealed to be promising for specific rain types, e.g., rain detection/attention, GANs, and priors like patch-level GMM. We also advocate a combination of appropriate priors and data-driven methods.
There is no single best deraining algorithm for all rain types. To deal with real, complicated, and varying rain, one might need to consider a mixture-of-experts model. Another practically useful direction is to develop scene-specific deraining, e.g., for traffic views.
There is also no single best deraining algorithm under all metrics. When designing a deraining algorithm, one needs to be clear about its end purpose. Moreover, classical perceptual metrics themselves might be problematic for evaluating deraining. Developing new metrics could be as important as developing new algorithms.
Algorithms trained on synthetic paired data may generalize poorly to real data, especially on complicated rain types such as rain and mist. Unpaired training on all real data could be interesting to explore.
No existing deraining method seems to directly help detection. That may encourage the community to develop new robust algorithms to account for high-level vision problems on real-world rainy images. On the other hand, achieving robust detection in rain does not necessarily require deraining as pre-processing; there are other domain-adaptation-type options, which we will discuss in future work.
This work is supported in part by the National Natural Science Foundation of China (No. 61802403) and CCF-DiDi GAIA (YF20180101). The work of Z. Wang is supported in part by the US National Science Foundation under Grant 1755701.
IEEE Conference on Computer Vision and Pattern Recognition, 2017.
A novel tensor-based video rain streaks removal approach via utilizing discriminatively intrinsic priors. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.