Bad weather is among the biggest challenges in robotics. Fog, snow, haze, rain and dirt not only affect human visual perception but also severely diminish the performance of computer vision systems.
The effects of impurities on visual perception are intricate, but can be classified into two main categories: impurities suspended in the atmosphere that occlude the scene (snow, rain streaks, fog, smog) but have no distorting effect, and impurities that adhere to transparent surfaces and do distort the scene. For example, an adherent raindrop acts as a fish-eye lens due to its convex shape: the image formed inside the droplet results from light rays gathered from an area larger than the one the droplet occludes.
Modelling adherent droplets is difficult due to the various sizes and shapes they can take, as well as distortion, zoom, blur, glare, the effect of gravity, and the cohesion and adhesion of water (27). Moreover, even when the scene is in focus, the image formed inside a droplet in front of the camera is usually blurry and opaque. Finally, if the light refracted inside the droplet comes from a large part of the scene, the image formed in the droplet becomes gray (21).
To the best of our knowledge, there are three main methods through which rainy images with ground truth are generated in the existing literature. The first is manual, and involves using a glass pane to capture a clean and a rainy image sequentially (16). While this offers perfect alignment, it is tedious and does not scale to dynamic scenes due to motion and changes in illumination. The second method involves a stereo camera, where one lens is kept clean and the other rainy, with both images captured simultaneously (15). Although the scenes are more diverse and the method scales up, it assumes that the two images can be aligned using a homography, which, unfortunately, only works for planar scenes. The third method involves computer-generated rain, which scales up easily, but suffers from a large domain gap between real and computer-generated rain (15).
Building datasets with both real rain and clear ground truth under hard-to-control outdoor environments is very difficult for static images, let alone scenes containing moving objects. One has to account for changes in illumination, motion, occlusion/dis-occlusion, changes in appearance, and so on. For this reason, many of the datasets focus on static scenes, while those with video sequences do not offer non-rainy ground truth, apart from (15).
We propose a simple method for capturing diverse raindrops that can make use of existing datasets with clear images and auxiliary-task ground truth, as well as of arbitrary footage. Our method is inspired by (15), who use a water pump to spray water on a glass pane, and (6), who use a windshield pane that can be tilted and translated in front of the camera. The difference is that we capture monocular images by recording a high-resolution screen. The advantages of this approach are threefold:
perfectly aligned image pairs;
high scalability, with no requirement for a mobile platform such as a vehicle since all data collection happens indoors, unsupervised, and in a controlled environment; and
diverse, in and out of focus droplets and streaks.
In this way, we are able to easily generate image pairs with high diversity, and further demonstrate that data gathered indoors using this method can efficiently be used to both train and pre-train deraining models that generalize well.
2 Related work
Rain modelling: A photometric raindrop model is used by (6) and (17) to construct the appearance of raindrops by tracing the light rays that pass through them, with the droplet boundary modelled by a sphere with equal contact angles. In a follow-up paper (18), 2D Bézier curves were used to incorporate raindrop deformation due to gravity. Later, in (28), the dark band around droplets was reproduced using the total-reflection phenomenon (light from the environment being reflected back into the waterdrop). Based on (6), (17) and (28), the authors of (15) propose an approach similar to meta-balls (1), modelling synthetic waterdrops by warping normal maps of droplets.
Adding droplets: In the experimental setup of (4), rain is simulated by spraying water on a glass pane in front of the camera, but due to changes in illumination and subject motion, no clean ground truth is provided; a video sequence with real rain is also provided, likewise without ground truth. The authors of (27) run experiments to study the effects of disturbances from light sources in the environment, various raindrop shapes and sizes, blurred raindrops, and glare on a droplet detection and removal pipeline for videos. They use a glass pane onto which they spray water and capture videos in the real world; however, ground truth is only available for the droplets' positions.
Complete reconstruction ground truth (clean and rainy image pairs) was first attempted in (16). To ensure alignment and keep the same refractive index, they use both a sprayed glass pane and a clean glass pane, and ensure that the atmospheric conditions and background remain constant between the two shots. They provide 1,119 image pairs; however, the process is difficult to scale up, and since the background needs to be kept constant between the two shots, the images contain only static scenes.
On the other hand, the authors of (15) use a bi-partite chamber placed in front of a small-baseline stereo camera, keeping one part dry while the other part is sprayed with water. The system was mounted on a vehicle and image sequences were taken while driving. The authors provide 50,000 pairs of undistorted, cropped and aligned images with both rain streaks and raindrops of various shapes and sizes. Although the system is portable and image acquisition is simpler, the method requires a mobile platform, and the alignment is based on a homography that assumes a planar world, resulting in a small degree of misalignment between image pairs.
Raindrop detection and removal: The classical approaches for detecting and removing raindrops rely on a combination of image segmentation, pattern recognition, ellipse fitting and template matching (9), (19), (5), (17), (18), multiple-image analysis (27), or multiple and/or rotating cameras (11), (26), (24), (25). More recent approaches employ Convolutional Neural Networks and/or adversarial frameworks (4), (16), (13), while (21) combines classical methods with CNNs.
Soil and dirt removal: Since cameras mounted outside the vehicle are exposed to adverse conditions, fully autonomous cars need a system to detect mud and trigger a cleaning mechanism. Recent work in this area is that of (22), who create a dataset with opaque and transparent soiling used to train a soiling-detection network. However, since acquiring realistic muddy images is difficult, they also propose a GAN-based architecture to generate artificial soiling on clean images. Similarly, the authors of (4) generate synthetic dirt for training purposes, and validate their results on both synthetic dirt and real dirt captured using a glass pane. The authors of (14) study how object detection is affected by degraded image quality, introducing synthetic fog, frost, snow and other corruptions.
3.1 Data acquisition
We build a rig consisting of a camera on a tripod, a tiltable glass pane simulating a car windshield, multiple water-spray nozzles with a recirculating pump and collector tray, and a waterproofed high-resolution (2560×1440) computer monitor. The setup is presented in Fig. 2.
We start by calibrating our camera for the desired zoom level, and roughly align it such that its optical axis is normal to the screen. For our particular experiment, the surface of the screen was 320mm from the outer lens, the glass pane was offset 170mm from the screen surface, and tilted to 20 degrees, simulating a windshield. After setting the focus, we capture a test pattern that is used to estimate a homography between the monitor and the camera image. Since the image displayed on the screen already represents a 2D projection of the world, we have the advantage of being able to accurately align the camera image with the ground truth image using the estimated homography.
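This alignment step can be sketched as a minimal direct-linear-transform (DLT) homography estimate in plain NumPy; the function names are illustrative, and a production setup would typically use a library routine such as OpenCV's `findHomography` instead:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 homography H mapping src -> dst with the DLT
    algorithm. src, dst: (N, 2) arrays of corresponding points, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of A: the right-singular vector
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]          # fix the arbitrary scale

def apply_homography(H, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

In our setting, `src` would be the detected test-pattern corners in the camera image and `dst` their known positions on the monitor.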
The first pass of data acquisition involves capturing clear images through the glass pane, without any water, in order to account for the refraction of the glass. The second pass involves turning on the water nozzles (with a randomized pressure regimen) and capturing the same images again, yielding corresponding rainy data. To minimize stray reflections and changes in illumination, acquisition takes place in a dark room, with all natural and artificial lighting off apart from the monitor. Once the rig has been set up, all data acquisition runs automatically and unattended at an approximately constant rate, and any misalignment is corrected using the computed homography.
3.2 The deraining model
The principal purpose of our study is not to present a new architecture but to show that a high-resolution monitor can be adequately used for data acquisition. We therefore base our convolutional deraining model on the widely-used Pix2PixHD architecture (23), with down-convolutional layers, 6 ResNet (7) blocks and 4 up-convolutional layers. We add additive skip connections between each pair of down- and up-convolutional layers, observing that a large part of the input's structure, illumination and detail can be preserved and copied to the output. Similar to (23) and (15), we do not use any unstructured pixel-wise losses, instead combining a PatchGAN (12) adversarial loss with discriminator feature-matching losses and a VGG-based perceptual loss.
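The effect of additive skip connections can be illustrated with a toy NumPy encoder-decoder (average-pooling and nearest-neighbour upsampling stand in for the strided and up-convolutions; this is a sketch of the idea, not the actual network):

```python
import numpy as np

def downsample(x):
    """2x average-pool, standing in for a stride-2 convolution."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """2x nearest-neighbour upsampling, standing in for an up-convolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def encoder_decoder_with_skips(img, depth=2):
    """Toy encoder-decoder: at each decoder level the corresponding
    encoder activation is added back (additive skip connection), so the
    input's structure and illumination are carried through to the output."""
    skips, x = [], img
    for _ in range(depth):
        skips.append(x)
        x = downsample(x)
    for _ in range(depth):
        x = upsample(x)
        x = x + skips.pop()   # additive skip connection
    return x
```

Because the encoder activations are added rather than concatenated, low-frequency content of the input reaches the output unchanged, and the learned layers only need to model the residual (the rain).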
On the generator output, we apply an adversarial loss:

$$\mathcal{L}_{ADV}(G) = \mathbb{E}_{x}\big[\log\big(1 - D(G(x))\big)\big],$$

with the discriminator being trained to minimize:

$$\mathcal{L}(D) = -\,\mathbb{E}_{y}\big[\log D(y)\big] - \mathbb{E}_{x}\big[\log\big(1 - D(G(x))\big)\big],$$

where $x$ denotes a rainy input image and $y$ the corresponding clear ground-truth image. A VGG-based perceptual loss (10) is additionally used:

$$\mathcal{L}_{VGG}(G) = \sum_{i=1}^{N} \frac{1}{M_i}\,\big\|\Phi_i(y) - \Phi_i(G(x))\big\|_1,$$

with $N$ denoting the number of VGG layers used in the loss, $\Phi_i$ the activations of the $i$-th layer, and $1/M_i$ weighing the contribution of each layer. We also use a multi-scale discriminator feature loss (23):

$$\mathcal{L}_{FEAT}(G) = \sum_{j=1}^{T} \frac{1}{V_j}\,\big\|D_j(y) - D_j(G(x))\big\|_1,$$

with $T$ denoting the number of discriminator layers used in the loss, $D_j$ the activations of the $j$-th discriminator layer, and $1/V_j$ weighing the contribution of each layer. Finally, the full generator objective is:

$$\mathcal{L}(G) = \mathcal{L}_{ADV}(G) + \lambda_{VGG}\,\mathcal{L}_{VGG}(G) + \lambda_{FEAT}\,\mathcal{L}_{FEAT}(G).$$

The hyperparameters $\lambda_{VGG}$ and $\lambda_{FEAT}$ modulate the importance of the individual terms of the loss. We estimate a discriminator $D^{*}$ and generator $G^{*}$ such that:

$$G^{*}, D^{*} = \arg\min_{G}\,\max_{D}\;\mathcal{L}(G, D).$$
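As a minimal numerical sketch of this composite generator objective (the arrays stand in for network outputs and feature activations; the function names and default weights are illustrative, not taken from the paper):

```python
import numpy as np

def weighted_l1(feats_a, feats_b, layer_weights):
    """Layer-wise weighted L1 distance: the form shared by the VGG
    perceptual term and the discriminator feature-matching term."""
    return sum(w * np.abs(a - b).mean()
               for w, a, b in zip(layer_weights, feats_a, feats_b))

def generator_objective(d_fake, vgg_real, vgg_fake, dfeat_real, dfeat_fake,
                        lam_vgg=10.0, lam_feat=10.0):
    """Adversarial term plus weighted perceptual and feature-matching
    terms. d_fake holds discriminator scores on generated images,
    assumed to lie in (0, 1)."""
    l_adv = np.mean(np.log(1.0 - d_fake))   # generator pushes D(G(x)) -> 1
    l_vgg = weighted_l1(vgg_real, vgg_fake, [1.0] * len(vgg_real))
    l_feat = weighted_l1(dfeat_real, dfeat_fake, [1.0] * len(dfeat_real))
    return l_adv + lam_vgg * l_vgg + lam_feat * l_feat
```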
| Dataset | Model (ours) trained on | RAINY (PSNR / SSIM) | DERAINED (PSNR / SSIM) |
|---|---|---|---|
| BDD-Rainy | Cityscapes-Original labels + BDD-Original labels | 17.84 / 0.6500 | 24.02 / 0.8587 |
4.1 Reconstruction results
Image reconstruction results are presented in Table 1. We train three different models: one using the original Cityscapes clear images as ground truth (Cityscapes-Original), one using the photographed Cityscapes clear images as ground truth (Cityscapes-Photograph), and one trained on both the original Cityscapes and original BDD clear images. We present the deraining performance of these three models on both Cityscapes-Rainy and BDD-Rainy; all models significantly improve the quality of the rain-affected images. Secondly, Table 2 shows the performance of our model on a third-party dataset by Qian et al. (16), containing 861 training images and 59 testing images, acquired using a glass pane manually sprayed with water and a DSLR camera in an outdoor environment. On one hand, we show that our model architecture achieves state-of-the-art results when trained on the target dataset. On the other hand, training solely on Cityscapes-Rainy does not produce top results: the model performs adequately on droplets that are in focus or nearly in focus, but inadequately on diffuse droplets, since these were not present in our training data. However, finetuning the model (Ours - trained on Cityscapes-Rainy, finetuned on Qian-sample (112 img.)) with a very small random sample of the target training set (approximately 10%) yields excellent performance, showing that pre-training a model on our automatically-collected data significantly reduces the need for labelled target-domain data.
| Model (evaluated on the dataset of (16)) | PSNR | SSIM |
|---|---|---|
| Qian et al. (no att.) (16) | 30.88 | 0.8670 |
| Qian et al. (full att.) (16) | 31.51 | 0.9213 |
| Ours - trained on Qian-Full | | |
| Ours - trained on Cityscapes-Rainy | 27.52 | 0.8716 |
| Ours - trained on Cityscapes-Rainy, finetuned on Qian-sample (112 img.) | 30.21 | 0.8953 |
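The PSNR figures reported above follow the standard definition and can be sketched as follows (assuming 8-bit images, so the peak intensity `max_val` is 255):

```python
import numpy as np

def psnr(clean, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((clean.astype(float) - restored.astype(float)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```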
4.2 Segmentation results
We show that rainy images severely degrade semantic segmentation performance, and that deraining the images restores performance. We use an off-the-shelf segmentation model (DeepLab v3+ (2), trained on the original, vanilla Cityscapes dataset), and derain Cityscapes-Rainy images using the model with top-performing reconstruction results (results in Table 3 and Fig. 4). The segmentation model achieves an mIoU of 0.79 on the original Cityscapes dataset, 0.71 on photographed (dry) Cityscapes, 0.17 on rainy Cityscapes, and a significantly improved value of 0.63 on rainy images that were derained using our model. Additionally, we show in the first two rows of Table 3 that taking photographs of a high-resolution monitor does not lead to a large loss in segmentation quality compared to the original images. We believe that the difference in performance (mIoU of 0.71 vs. 0.79 for the original images) can be reduced further by using a higher-resolution setup, a better lens and a better alignment methodology.
| Cityscapes Img. vs. Segm. Model | mIoU |
|---|---|
| Original images | 0.79 |
| Photographed (dry) | 0.71 |
| Rainy | 0.17 |
| Derained (ours) | 0.63 |
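The mIoU metric used here can be sketched as the intersection-over-union averaged over the classes present in either the prediction or the ground truth:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between predicted and ground-truth
    label maps, averaged over the classes that actually occur."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```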
4.3 Qualitative results
We show that our model generalises to other camera setups by recording a real-rain dataset from inside a moving vehicle, using the camera of a mid-range smartphone. We observe satisfactory reconstruction performance for varying droplet and splatter shapes; an example of deraining images from this unseen real domain is presented in Figure 5. Beyond the quantitative results on the third-party rain dataset in Table 2 and on our own data in Table 1, we are severely limited in offering further numerical analyses by the lack of third-party rainy datasets with either clear ground truth or semantic segmentation labels.
5 Conclusions
We have shown that a very simple automated setup - a high resolution computer screen, a glass pane sprayed with water nozzles, and a camera - is suitable for generating rainy data from arbitrary datasets, and more importantly from those with ground truth for auxiliary tasks such as semantic segmentation. Both our quantitative and qualitative results show that data collected in this fashion can be effectively used to train image de-raining and de-noising models that generalise well to other environments, and that this data is valuable for closing domain gaps via fine-tuning, especially when the target data is hard to acquire.
Currently, the system tackles adherent rain, while other effects such as atmospheric misting or fog are out of the scope of this work; attempts at reducing the impact of such effects can be found throughout the literature, a good example being (20). Additionally, while the current implementation involves a windscreen-like setup where windscreen wipers could be envisioned to alleviate some of the issues stemming from adherent rain, there are many other camera/sensor locations where wipers would be impractical and which would benefit from our approach. Finally, even though effects such as taillight reflections on the wet ground cannot be captured or introduced, this does not reduce the usefulness or validity of datasets obtained using this method for image reconstruction and denoising purposes. While such a dataset does not capture the full effects of rain, it can, at a minimum, massively reduce the amount (and cost) of labelled, ground-truth real rainy data that needs to be collected.
6 Limitations and future work
The proposed setup is only able to simulate adherent contaminants: wind, heavy snow, hailstorms and other atmospheric phenomena cannot be simulated insofar as they change the structure and appearance of objects already captured in the image. Additionally, the ease of generating data with this device comes at the cost of some image quality (see Table 3, second row). Another adherent contaminant that creates an obscuring effect is dirt (dust, mud, etc.). Soil is synthetically generated in (22), whereas we propose, as future work, using a rolling clear film to gather real mud (see Fig. 6).
- (1) (1982) A generalization of algebraic surface drawing. ACM Trans. Graph. 1 (3), pp. 235–256.
- (2) (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. CoRR abs/1802.02611.
- (3) (2016) The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (4) (2013) Restoring an image taken through a window covered with dirt or rain. In 2013 IEEE International Conference on Computer Vision, pp. 633–640.
- (5) (2019) Adherent raindrop detection based on morphological operations. Advanced Intelligent Systems for Computing Sciences, Vol. 5, pp. 324–331.
- (6) (2009) Raindrop detection on car windshields using geometric-photometric environment construction and intensity-based correlation. In 2009 IEEE Intelligent Vehicles Symposium, pp. 610–615.
- (7) (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (8) (2016) Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004.
- (9) (2015) An adherent raindrop detection method using MSER. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 105–109.
- (10) (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision.
- (11) (2002) Removal of adherent waterdrops in images by using multiple cameras. pp. 80–83.
- (12) (2016) Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pp. 702–716.
- (13) (2018) A^2Net: adjacent aggregation networks for image raindrop removal. CoRR abs/1811.09780.
- (14) (2019) Benchmarking robustness in object detection: autonomous driving when winter is coming. CoRR abs/1907.07484.
- (15) (2019) I can see clearly now: image restoration via de-raining. CoRR abs/1901.00893.
- (16) (2017) Attentive generative adversarial network for raindrop removal from a single image. CoRR abs/1711.10098.
- (17) (2009) Video-based raindrop detection for improved image registration. In 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 570–577.
- (18) (2010) Realistic modeling of water droplets for monocular adherent raindrop recognition using Bézier curves. In ACCV Workshops.
- (19) (2018) Raindrop detection considering extremal regions and salient features. Electronic Imaging 2018, pp. 1–6.
- (20) (2017) Semantic foggy scene understanding with synthetic data. CoRR abs/1708.07819.
- (21) (2018) Adherent raindrop detection. Master's thesis in Mathematical Sciences.
- (22) (2019) SoilingNet: soiling detection on automotive surround-view cameras. CoRR abs/1905.01492.
- (23) (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, pp. 1–13.
- (24) (2004) A virtual wiper: restoration of deteriorated images by using a pan-tilt camera. In IEEE International Conference on Robotics and Automation (ICRA 2004), Vol. 5, pp. 4724–4729.
- (25) (2005) Removal of adherent waterdrops from images acquired with stereo camera. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 400–405.
- (26) (2003) A virtual wiper: restoration of deteriorated images by using multiple cameras. pp. 3126–3131, Vol. 3.
- (27) (2016) Adherent raindrop modeling, detection and removal in video. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (9), pp. 1721–1733.
- (28) (2016) Waterdrop stereo. CoRR abs/1604.00730.
- (29) (2018) BDD100K: a diverse driving video database with scalable annotation tooling. CoRR abs/1805.04687.