Removing Rain in Videos: A Large-scale Database and A Two-stream ConvLSTM Approach

06/06/2019 ∙ by Tie Liu, et al. ∙ Beihang University

Rain removal has recently attracted increasing research attention, as it is able to enhance the visibility of rain videos. However, the existing learning-based rain removal approaches for videos suffer from insufficient training data, especially when applying deep learning to remove rain. In this paper, we establish a large-scale video database for rain removal (LasVR), which consists of 316 rain videos. We then observe from our database that there exist temporal correlation of clean content and similar patterns of rain across video frames. According to these two observations, we propose a two-stream convolutional long short-term memory (ConvLSTM) approach for rain removal in videos. The first stream is the subnet for rain detection, while the second stream is the subnet for rain removal, which leverages the features from the rain detection subnet. Finally, experimental results on both synthetic and real rain videos show that the proposed approach performs better than other state-of-the-art approaches.






1 Introduction

Rain removal aims to separate rain streaks from images or videos and produce the clean content. Since rain streaks hamper visibility in images or videos captured in rain, rain removal has received increasing attention in recent years. More importantly, the performance of many computer vision algorithms may be severely degraded when rain obscures scene content. Therefore, rain removal can benefit computer vision tasks such as object detection

[1], tracking [2] and segmentation [3]. The past few years have witnessed some attempts for rain removal [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19].

The existing rain removal works can be generally categorized into two classes, i.e., image-based and video-based approaches. For images, most approaches [7, 9, 10, 11, 15] formulate rain removal as a signal separation problem. For videos, some early works utilized chromatic properties [6], the Gaussian mixture model (GMM) [20] and low-rank matrix completion [8] to remove rain. Most recently, benefiting from the great success of deep learning, deep neural networks (DNNs) have been applied for rain removal in videos [18, 19]. For example, Liu et al. [19] proposed a hybrid rain model to remove rain, in which DNNs extract spatial features with temporal coherence of the background. In addition, Chen et al. [18] proposed aligning scene content at the super pixel (SP) level and averaging the aligned SPs to obtain an intermediate derain output. In [18], a convolutional neural network (CNN) is then used to restore rain-free details and obtain the final clean content. However, the generalization ability of these DNN-based rain removal approaches suffers from insufficient training data, as the existing video databases for rain removal are small-scale.

Figure 1: Comparison of the visual results of our approach with and without rain detection.

To the best of our knowledge, there are only 3 video databases for rain removal, i.e., those of [8, 18, 19]. Among them, [19] is the largest, with 60 videos. In [19], the rain types lack diversity, since they mainly vary in direction, scale and density, without considering scene depth, opacity, falling speed and wind variation. Even worse, there exist clear horizontal boundaries in their rain videos, which is unrealistic under most circumstances. In contrast to videos, large-scale databases have been established for rain removal in images. For example, Fu et al. [11] created a database of 14,000 rain images. By modifying the rain types, i.e., scale, direction, density, scene depth, opacity and falling speed, each clean image generates 14 rain images. Yang et al. [12] established a challenging database in which each rain image contains five streak directions. In the database of [15], Zhang et al. synthesized 13,200 images with three rain-density levels (i.e., light, medium and heavy).

Figure 2: Examples for each category.

In this paper, we thus establish a large-scale database for boosting the performance of DNN-based rain removal on videos. Specifically, our database consists of 316 videos (80,835 frames) with diverse content. Moreover, various rain types are taken into account in our database, including scale, direction, density, scene depth, opacity, falling speed and wind variation. We then find from our database that the direction, density and falling speed of rain streaks are normally invariant across video frames. Therefore, there exist similar patterns of rain, which can be learned by a convolutional LSTM (ConvLSTM). In this paper, a two-stream ConvLSTM based approach is proposed to achieve rain detection and removal on videos. To the best of our knowledge, our approach is the first attempt to apply DNNs to estimate rain streaks by leveraging the similar patterns of rain in videos.

Specifically, our approach consists of two subnets, i.e., the rain detection and rain removal subnets. The rain detection subnet leverages ConvLSTM to learn the similar patterns of rain, which can be used to estimate rain streaks in the current frame. Then, the spatial-temporal features extracted by the rain detection subnet are fused into the rain removal subnet. Consequently, the performance of rain removal can be enhanced by combining the rain detection and rain removal subnets in our approach. Figure 1 illustrates the advantage of taking rain detection into account in our approach. Extensive experiments demonstrate that our two-stream ConvLSTM approach achieves at least 4.23 dB improvement over the state-of-the-art rain removal approaches, of which 2.18 dB is owing to the introduction of the rain detection subnet.

The main contributions of our approach are two-fold. (1) We construct a large-scale database of 316 synthetic rain videos, from which two observations about the clean content and rain streaks are obtained. (2) We propose a novel deep learning architecture with two-stream ConvLSTM based subnets for the joint tasks of rain streak detection and removal on videos.

Figure 3: CC results of clean content between two frames at different distances (ranging from 0 to 10 frames).

2 LasVR Database

The details about establishing our LasVR database are presented as follows.

Figure 4: Architecture of the DNN model in our approach.

Clean videos. We download 142 lossless videos from the Video Quality Experts Group, Harmonic and The Consumer Digital Video Library. To make our database more realistic, we exclude videos with indoor scenes. In addition, the videos of our database cover a wide range of content categories, e.g., animals, nature landscapes, human scenes, action sports, man-made objects and so forth. Moreover, the resolution of our videos ranges from 640×360 to 2560×1600. In our database, videos are cut so that their duration varies from 5 to 25 seconds at a frame rate of 24-30 frames per second. Note that the bitrates are maintained above 10 Mbps when transcoding videos to the uniform MP4 format, such that the videos are of high quality.

Rain videos. Having obtained 142 clean videos, we randomly divide them into training (87 videos), validation (27 videos) and test (28 videos) sets. Similar to [8, 18], we use the commercial editing software Adobe After Effects [21] to render various types of rain over the videos. To ensure the diversity of rain streaks, we set different values for the rendering parameters, e.g., scale, direction, density, scene depth, opacity, falling speed and wind variation. Consequently, the rendered streaks vary from light drizzle to heavy rain storm and from vertical rain to slanted lines. Then, three rain videos with different parameters are generated over each clean video in the training set, while one rain video is rendered over each clean video in the validation and test sets. Consequently, the training, validation and test sets consist of 261, 27 and 28 rain videos, respectively. Figure 2 shows some examples of each category of rain videos in our database. The peak signal-to-noise ratio (PSNR) between the rain and raw videos varies from 22.63 dB to 39.11 dB. This also implies the diversity of rain types in our database.
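The PSNR between a rain video and its clean source can be computed per frame and averaged. A minimal sketch of the standard definition is given below; the 8-bit peak value of 255 and the toy frames are assumptions for illustration, since the paper does not reproduce its exact PSNR settings.

```python
import numpy as np

def psnr(reference: np.ndarray, distorted: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a clean frame and a rain frame."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy example: a rain streak perturbs one pixel of an otherwise clean frame.
clean = np.full((4, 4), 100.0)
rain = clean.copy()
rain[0, 0] += 20.0  # one bright streak pixel
print(round(psnr(clean, rain), 2))  # → 34.15
```

Averaging this value over all frames of a sequence gives the per-video PSNR reported in the database statistics above.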

3 Proposed method

3.1 Observations

We mine our database and obtain two observations on the rain videos. The observations refer to the correlation of clean video content and rain patterns, respectively.

Temporal correlation of clean content. It is intuitive that the clean content in consecutive frames is highly correlated. To quantify such correlation, we measure the linear correlation coefficient (CC) of the clean content between two frames. Here, the CC values between the current and the previous or subsequent frames are averaged over all clean videos in our database. Figure 3 shows the CC results when the distance between two frames varies from 0 to 10. As seen in Figure 3, the CC values are higher than 0.75 within 10 consecutive frames. This indicates the high correlation of clean content across video frames. We further find that the temporal correlation of clean content decreases when increasing the distance between the current and previous frames. Therefore, there exists a long- and short-term dependency of the clean content across frames. In our approach, one stream of ConvLSTM is thus used to learn such dependency for restoring the clean content of rain videos.
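The CC-versus-distance measurement can be sketched as follows. The synthetic drifting sequence here stands in for real clean videos and is purely illustrative; on real content the curve decays with frame distance as described above.

```python
import numpy as np

def frame_cc(a: np.ndarray, b: np.ndarray) -> float:
    """Linear correlation coefficient (CC) between two frames."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    a -= a.mean()
    b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cc_by_distance(frames: list, max_dist: int = 10) -> list:
    """Average CC over all frame pairs that are d frames apart, for each d."""
    return [float(np.mean([frame_cc(frames[i], frames[i + d])
                           for i in range(len(frames) - d)]))
            for d in range(max_dist + 1)]

# Synthetic "video": a gradient image that drifts with accumulated noise,
# so correlation with the first frame decreases as the distance grows.
rng = np.random.default_rng(0)
frames = [np.linspace(0.0, 1.0, 32 * 32).reshape(32, 32)]
for _ in range(11):
    frames.append(frames[-1] + 0.1 * rng.standard_normal((32, 32)))

curve = cc_by_distance(frames)
print(f"CC at distance 0: {curve[0]:.2f}, at distance 10: {curve[10]:.2f}")
```

Averaging such curves over all clean videos yields the plot in Figure 3.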

Similar patterns of rain streaks. We further find that the rain streaks have similar characteristics over a period of time, e.g., direction, density and falling speed. For density, we measure the proportion of the rain region in a rain map using a pre-set threshold. The proportion varies from 0.1 to 0.8, while the averaged standard deviation (STD) at the frame level is 0.04. This indicates that density is almost the same across frames. For direction and falling speed, we select multiple Shi-Tomasi corner points from the rain maps and then apply the Lucas-Kanade algorithm to estimate optical flow, in order to match the positions of corner points at different times for estimating streak motion. The direction and falling speed range from -73° to +65° and from 200 to 900 pixels per second, respectively. The STD results of direction and falling speed are 7° and 40 pixels per second. We can thus conclude that the direction and falling speed in consecutive frames remain almost invariant.
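The density measurement can be sketched as below. The threshold value and the synthetic rain maps are illustrative assumptions, not the paper's data; the point is that when the streak density of a video is stable, the per-frame density proportion has a small STD.

```python
import numpy as np

def rain_density(rain_map: np.ndarray, threshold: float) -> float:
    """Proportion of the rain region: fraction of pixels above a pre-set threshold."""
    return float((rain_map > threshold).mean())

# Synthetic rain maps with a fixed streak probability per frame: the
# per-frame density then stays nearly constant, so its frame-level STD is small.
rng = np.random.default_rng(1)
densities = [rain_density(rng.random((64, 64)), threshold=0.7) for _ in range(24)]
print(f"mean density = {np.mean(densities):.2f}, STD = {np.std(densities):.3f}")
```

The direction and falling-speed statistics are obtained analogously by tracking corner points across rain maps with optical flow, as described above.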

3.2 Framework

We assume that the frames of an input rain video are $\{X_t\}_{t=1}^{T}$. Note that in this paper $T$ is a fixed number equal to the length of the ConvLSTM, and each video is split into $T$-length segments as input. According to the observations in Section 3.1, we restore the rain and clean content through two subnets (i.e., the rain detection and rain removal subnets), respectively. Assume that $R_t$ is the rain streak and $B_t$ is the clean content at frame $t$, corresponding to $X_t$. Since there exists temporal correlation across frames, we use all frames $\{X_t\}_{t=1}^{T}$ to restore $R_t$ and $B_t$.

Specifically, the rain detection subnet integrates the similar patterns of rain in consecutive frames to estimate the rain streaks $\hat{R}_t$ and learn the spatial-temporal features of rain detection $\mathbf{F}_t^{\mathrm{d}}$. The rain removal subnet employs the temporal correlation of the clean content and the spatial-temporal features from the other subnet to restore the clean content $\hat{B}_t$. Therefore, our approach jointly achieves rain detection and removal, and the combination of the two-stream networks significantly enhances the performance of rain removal. In this paper, the rain detection and rain removal subnets are formulated by the functions $F_{\mathrm{d}}(\cdot)$ and $F_{\mathrm{r}}(\cdot)$, respectively. Additionally, assume that $F_{\mathrm{s}}(\cdot)$, $F_{\mathrm{t}}(\cdot)$ and $F_{\mathrm{c}}(\cdot)$ are the functions of the spatial, temporal and reconstruction layers in the rain detection subnet, respectively. Mathematically, our approach aims at learning the following mappings:

$\mathbf{F}_t^{\mathrm{d}} = F_{\mathrm{t}}(F_{\mathrm{s}}(\{X_t\}_{t=1}^{T})), \quad \hat{R}_t = F_{\mathrm{c}}(\mathbf{F}_t^{\mathrm{d}}), \quad \hat{B}_t = F_{\mathrm{r}}(\{X_t\}_{t=1}^{T}, \mathbf{F}_t^{\mathrm{d}}).$
3.3 Architecture

Figure 4 illustrates the overall architecture of our rain removal approach. Both the rain detection and rain removal subnets take the frame sequence $\{X_t\}_{t=1}^{T}$ as input. First, $\{X_t\}_{t=1}^{T}$ is fed into a shared mini-DenseNet for encoding the spatial features of each frame. Subsequently, the features learned by the shared mini-DenseNet flow into two streams, corresponding to rain detection and rain removal, respectively. In each stream, a dense unit is built to further extract spatial features for rain detection or removal. Then, the output features from the dense unit of each stream are input to a bi-directional ConvLSTM. Consequently, the spatial-temporal features of rain detection and removal, denoted as $\mathbf{F}_t^{\mathrm{d}}$ and $\mathbf{F}_t^{\mathrm{r}}$, are obtained in the two subnets. Furthermore, we fuse $\mathbf{F}_t^{\mathrm{d}}$ and $\mathbf{F}_t^{\mathrm{r}}$ to obtain the aggregated features $\mathbf{F}_t^{\mathrm{f}}$. Finally, $\mathbf{F}_t^{\mathrm{d}}$ and $\mathbf{F}_t^{\mathrm{f}}$ are used to restore the rain streaks and the clean content via the reconstruction layers.

Spatial layers. The DenseNet [22] introduces inter-layer connections of various lengths, encouraging feature reuse and alleviating vanishing gradients. Considering this advantage, we apply the dense units illustrated in Figure 5 to extract spatial features for rain detection and removal, respectively. The spatial layers contain the shared mini-DenseNet and two dense units. Here, all dense units share the same structure. Each dense unit contains 4 convolutional layers, and the features from all preceding layers are concatenated together before each layer. Hence, each dense unit consists of 10 inter-layer connections, many more than the 4 connections of a 4-layer plain CNN. At each layer in the dense unit, the number of output channels is 12. Note that the convolutional layers at each time step share the same weights and biases.
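Such a dense unit can be sketched in PyTorch as follows. This is a sketch under stated assumptions, not the authors' code: the ReLU activation and the 16 input channels are illustrative choices, while the 4 layers, 12 output channels per layer, and concatenation of all preceding features follow the description above.

```python
import torch
import torch.nn as nn

class DenseUnit(nn.Module):
    """4-layer dense unit: each conv sees the concatenation of the unit's
    input and all preceding layers' outputs; each layer emits 12 channels."""
    def __init__(self, in_channels: int, growth: int = 12, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),  # activation is an assumption
            ))
            channels += growth  # dense connectivity widens the next input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats[1:], dim=1)  # the 4 layers' features

unit = DenseUnit(in_channels=16)
y = unit(torch.randn(1, 16, 32, 32))
print(y.shape)  # torch.Size([1, 48, 32, 32])
```

Note that layers 1-4 receive 1, 2, 3 and 4 input tensors respectively, giving the 10 inter-layer connections counted above.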

Temporal layers. Section 3.1 illustrates the temporal consistency of clean content and the similar patterns of rain in consecutive frames. Hence, we adopt the ConvLSTM to learn the temporal features of clean content and rain. The ConvLSTM uses the convolutional operation (denoted as $*$) in its gate computation, instead of matrix multiplication. This preserves the spatial representation of the input frame. For the ConvLSTM layer $i$ at frame $t$, $\mathbf{X}_t^i$ denotes the input features; $\mathbf{I}_t^i$, $\mathbf{F}_t^i$ and $\mathbf{O}_t^i$ are the gates of input (I), forget (F) and output (O); $\mathbf{G}_t^i$, $\mathbf{M}_t^i$ and $\mathbf{H}_t^i$ are the corresponding input modulation (G), memory cell (M) and hidden state (H), respectively. The unidirectional ConvLSTM of layer $i$ at frame $t$ can be formulated as follows:

$\mathbf{I}_t^i = \sigma(\mathbf{W}_{xi}^i * \mathbf{X}_t^i + \mathbf{W}_{hi}^i * \mathbf{H}_{t-1}^i + \mathbf{b}_{i}^i),$
$\mathbf{F}_t^i = \sigma(\mathbf{W}_{xf}^i * \mathbf{X}_t^i + \mathbf{W}_{hf}^i * \mathbf{H}_{t-1}^i + \mathbf{b}_{f}^i),$
$\mathbf{O}_t^i = \sigma(\mathbf{W}_{xo}^i * \mathbf{X}_t^i + \mathbf{W}_{ho}^i * \mathbf{H}_{t-1}^i + \mathbf{b}_{o}^i),$
$\mathbf{G}_t^i = \tanh(\mathbf{W}_{xg}^i * \mathbf{X}_t^i + \mathbf{W}_{hg}^i * \mathbf{H}_{t-1}^i + \mathbf{b}_{g}^i),$
$\mathbf{M}_t^i = \mathbf{F}_t^i \odot \mathbf{M}_{t-1}^i + \mathbf{I}_t^i \odot \mathbf{G}_t^i,$
$\mathbf{H}_t^i = \mathbf{O}_t^i \odot \tanh(\mathbf{M}_t^i),$

where $\odot$ denotes element-wise multiplication. In addition, $\sigma(\cdot)$ and $\tanh(\cdot)$ are the sigmoid and hyperbolic tangent activation functions, respectively. Furthermore, the weights and biases of the $i$-th ConvLSTM layer are collectively denoted as $\mathbf{W}^i$ and $\mathbf{b}^i$.

In our approach, the ConvLSTM is built on top of the spatial layers, and each stream consists of a 2-layer ConvLSTM. The number of feature maps is 48, and the kernel size is 3×3. We further extend the above unidirectional ConvLSTM to work in a bidirectional fashion.
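A minimal PyTorch sketch of one such ConvLSTM cell is given below, assuming the 48-channel, 3×3 setting. Fusing the four gate convolutions into a single convolution is a common implementation trick equivalent to the per-gate weights in the equations; this is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell: gate pre-activations use convolutions rather than
    matrix products, preserving the spatial layout of the features."""
    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # One conv yields the pre-activations of all four gates (I, F, O, G).
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h, m = state  # hidden state H and memory cell M
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)                # input modulation
        m = f * m + i * g                # memory cell update
        h = o * torch.tanh(m)            # hidden state
        return h, m

cell = ConvLSTMCell(in_channels=48, hidden_channels=48)
x = torch.randn(1, 48, 16, 16)
h = m = torch.zeros(1, 48, 16, 16)
for _ in range(9):  # unroll over a 9-frame segment
    h, m = cell(x, (h, m))
print(h.shape)  # torch.Size([1, 48, 16, 16])
```

A bidirectional layer can be obtained by running a second cell over the reversed frame sequence and combining the two hidden-state sequences.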

Figure 5: Structure for the dense unit.

Feature fusion. As discussed above, the two streams aim at rain detection and removal, respectively. To combine these two streams, we fuse the spatial-temporal features from the rain detection subnet into the rain removal subnet. The fusion process is presented as follows:

$\mathbf{F}_t^{\mathrm{f}} = \lambda \odot \mathbf{F}_t^{\mathrm{d}} + (\mathbf{1} - \lambda) \odot \mathbf{F}_t^{\mathrm{r}},$

where $\lambda$ ($0 \leq \lambda \leq 1$) is an adjustable hyper-parameter controlling the fusion degree, and $\mathbf{1}$ is the matrix with all elements being 1. The aggregated features $\mathbf{F}_t^{\mathrm{f}}$ can then be used for restoring the clean content.
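The fusion is a simple convex combination of the two feature maps. A sketch follows, assuming a scalar λ broadcast over the feature maps (the toy feature values are illustrative):

```python
import numpy as np

def fuse_features(f_d: np.ndarray, f_r: np.ndarray, lam: float) -> np.ndarray:
    """Weighted fusion of detection features f_d into removal features f_r;
    lam in [0, 1] controls the fusion degree."""
    ones = np.ones_like(f_d)
    return lam * f_d + (ones - lam) * f_r

f_d = np.full((2, 2), 4.0)  # toy detection features
f_r = np.full((2, 2), 8.0)  # toy removal features
print(fuse_features(f_d, f_r, 0.25))  # 0.25*4 + 0.75*8 = 7.0 everywhere
```

With λ = 0 the removal stream is unchanged; with λ = 1 it is fully replaced by the detection features.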

Figure 6: Comparison of different approaches on 2 example frames in our LasVR database.
            Rain videos   DDN       SE        MD        MS        J4R       Ours
PSNR (dB)   29.9404       36.2546   28.5295   30.9407   27.6112   38.4150   42.6486
SSIM        0.8703        0.9528    0.8566    0.8985    0.7969    0.9688    0.9848

Table 1: Quantitative results of rain removal approaches on our database.

Reconstruction layers. Finally, a $K$-layer CNN following the temporal layers is exploited to estimate the rain streaks and restore the clean content. To be more specific, the reconstruction layers in the rain removal subnet can be formulated as:

$\mathbf{Y}_t^{k} = \mathrm{PReLU}(\mathbf{W}^{k} * \mathbf{Y}_t^{k-1} + \mathbf{b}^{k}), \quad k = 1, \dots, K,$

where $\mathbf{Y}_t^{k}$ denotes the output of the $k$-th convolutional layer, with $\mathbf{Y}_t^{0} = \mathbf{F}_t^{\mathrm{f}}$ being the fused features. Moreover, PReLU denotes the Parametric Rectified Linear Unit activation function, and $\mathbf{W}^{k}$ and $\mathbf{b}^{k}$ denote the weights and biases of the $k$-th convolutional layer. $K$ is set to 3 and the kernel size is 3×3. Consequently, the clean content at frame $t$ can be restored as $\hat{B}_t = \mathbf{Y}_t^{K}$. Moreover, the reconstruction layers in the rain detection subnet share the same structure as those in the rain removal subnet.
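The reconstruction head can be sketched in PyTorch as below. The intermediate and output channel widths are assumptions for illustration (the paper specifies only K = 3 and the 3×3 kernel); a linear final layer is used here so the output can take arbitrary image values.

```python
import torch
import torch.nn as nn

def reconstruction_layers(in_channels: int, out_channels: int = 3,
                          K: int = 3, mid_channels: int = 48) -> nn.Sequential:
    """K-layer CNN with PReLU activations mapping fused features to an image.
    mid_channels and out_channels are illustrative assumptions."""
    layers = []
    channels = in_channels
    for _ in range(K - 1):
        layers += [nn.Conv2d(channels, mid_channels, 3, padding=1), nn.PReLU()]
        channels = mid_channels
    layers.append(nn.Conv2d(channels, out_channels, 3, padding=1))  # linear output
    return nn.Sequential(*layers)

recon = reconstruction_layers(in_channels=96)
out = recon(torch.randn(1, 96, 32, 32))
print(out.shape)  # torch.Size([1, 3, 32, 32])
```

The rain detection subnet would use an identically structured head to produce the estimated rain streaks.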

3.4 Training strategy

In our approach, the two subnets are trained jointly in an end-to-end manner. Recall that the restored rain streaks and clean content of the current frame are denoted as $\hat{R}_t$ and $\hat{B}_t$, respectively. In addition, $R_t$ and $B_t$ are the ground-truth rain streaks and clean content, respectively. The loss function for training our DNN is:

$L = \lambda_{\mathrm{d}} \, \|\hat{R}_t - R_t\|_2^2 + \lambda_{\mathrm{r}} \, \|\hat{B}_t - B_t\|_2^2, \quad (10)$

where $\lambda_{\mathrm{d}}$ and $\lambda_{\mathrm{r}}$ are the weights of the losses in restoring $R_t$ and $B_t$, respectively. In (10), the loss function is calculated as the weighted sum of the $\ell_2$-norm errors of restoring $R_t$ and $B_t$ by $\hat{R}_t$ and $\hat{B}_t$. Since the rain detection subnet can enhance the performance of the rain removal subnet, we set $\lambda_{\mathrm{d}} > \lambda_{\mathrm{r}}$ to speed up the convergence of the rain detection subnet at the beginning of training. After convergence of rain detection, we set $\lambda_{\mathrm{r}} > \lambda_{\mathrm{d}}$ to minimize the $\ell_2$-norm error between $\hat{B}_t$ and $B_t$. Finally, the clean content can be obtained through the two subnets.
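The two-stage weighting can be sketched as follows; the toy tensors and the exact weight values (taken from the training settings in Section 4) are for illustration only.

```python
import numpy as np

def derain_loss(r_hat, r_gt, b_hat, b_gt, lam_d: float, lam_r: float) -> float:
    """Weighted sum of the squared L2 errors of the restored rain streaks
    and the restored clean content."""
    loss_d = float(np.sum((r_hat - r_gt) ** 2))  # rain detection error
    loss_r = float(np.sum((b_hat - b_gt) ** 2))  # rain removal error
    return lam_d * loss_d + lam_r * loss_r

r_hat, r_gt = np.ones(4), np.zeros(4)       # streak error contributes 4
b_hat, b_gt = 2 * np.ones(4), np.zeros(4)   # content error contributes 16
# Stage 1 emphasizes rain detection; stage 2 emphasizes rain removal.
print(derain_loss(r_hat, r_gt, b_hat, b_gt, lam_d=1.0, lam_r=0.01))   # ≈ 4.16
print(derain_loss(r_hat, r_gt, b_hat, b_gt, lam_d=0.01, lam_r=1.0))   # ≈ 16.04
```

Switching the weights after the detection subnet converges shifts the gradient signal toward the clean-content restoration error.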

4 Experiments

In this section, we present experimental results to validate the effectiveness of our two-stream ConvLSTM approach. The experiments are conducted on our LasVR database, which is composed of 261 training videos, 27 validation videos and 28 test videos. The 261 training videos are cropped into 64×64×9 cubes, yielding 34,800 training cubes in total. The batch size is set to 16. We adopt the Adam optimizer to minimize the loss function of (10). In the training stage, we initially set $\lambda_{\mathrm{d}} = 1$ and $\lambda_{\mathrm{r}} = 0.01$ in (10) to train the rain detection subnet. After the rain detection subnet converges, these hyper-parameters are set to $\lambda_{\mathrm{d}} = 0.01$ and $\lambda_{\mathrm{r}} = 1$ to train the rain removal subnet.

Quantitative evaluation. We evaluate the performance of our approach by comparing it with 5 state-of-the-art approaches: deep detail network (DDN) [11], stochastic encoding (SE) [14], matrix decomposition (MD) [23], multi-scale convolutional sparse coding (MS) [17], and joint recurrent rain removal and reconstruction (J4R) [19]. Among them, [11] is a recent image-based approach, while the others are video-based approaches. Note that we retrain the models of DDN and J4R on our database. The performance is evaluated over all test videos in our database. The evaluation metrics are PSNR and the structural similarity index (SSIM). The results are reported in Table 1. As seen in Table 1, the DNN-based approaches (i.e., J4R, DDN and our approach) significantly outperform the traditional approaches (SE, MD and MS). In addition, our approach further improves PSNR by 6.39 dB and 4.23 dB over DDN and J4R, respectively.

            Rain videos   DDN       J4R       Ours
PSNR (dB)   28.7223       32.0166   32.4301   34.0161
SSIM        0.8806        0.9248    0.9381    0.9517

Table 2: Quantitative results of rain removal approaches on the database of [18].
Figure 7: Comparison of different approaches on real videos.

Qualitative evaluation. In Figure 6, we present the subjective results of rain removal on 2 sample frames to visually demonstrate the results. As observed, the DNN-based approaches J4R and DDN remove most rain streaks. However, they still fail to remove some rain streaks when their opacity is high or the structure of the clean content is similar to that of the rain streaks. In contrast, our approach attains promising visual quality compared with the ground-truth clean videos. We further qualitatively evaluate the performance of different approaches on real rain videos downloaded from the Internet. As presented in Figure 7, MD tends to incur a blurring effect due to the evident moving objects. The SE approach still leaves some large rain streaks in the frames, and the J4R approach removes most rain streaks but incurs detail loss. In contrast, our approach is able to remove the rain streaks while restoring the clean video frames well.

Evaluation of generalization ability. To evaluate the generalization ability of our approach, we further evaluate the performance of our approach and others on the database of [18]. As mentioned above, the DNN-based approaches (ours, DDN and J4R) show great advantages over the traditional approaches (MD, SE and MS), and thus we only compare our approach with DDN, J4R and SPAC-CNN [18]. As observed in Table 2, the PSNR values of our approach are 2.00 dB and 1.59 dB higher than those of DDN and J4R, respectively. In addition, the averaged PSNR and SSIM gains of SPAC-CNN over the rain videos are 3.01 dB and 0.04, as reported in [18], while the gains of our approach are 5.29 dB and 0.07. This demonstrates the high generalization ability of our approach for rain removal on videos.

            Ours w/o rain detection   Ours-CNN
PSNR (dB)   40.4702                   41.3051
SSIM        0.9799                    0.9822

Table 3: Quantitative results of two baseline configurations on our database.

Ablation study. We test the rain removal subnet without the rain detection subnet, in order to verify the effectiveness of integrating rain detection into rain removal. As observed in Table 3, our approach achieves a PSNR 2.18 dB higher than rain removal alone. Similarly, the SSIM result of our approach is 0.005 higher than that of rain removal alone. These results indicate that the performance of rain removal benefits from rain detection in our approach. We further evaluate the performance of our approach with the dense units replaced by plain convolutional layers (named Ours-CNN). It can be seen that our approach increases PSNR by 1.34 dB over Ours-CNN. This verifies the effectiveness of the dense units applied in our approach for rain removal in videos.

5 Conclusion

In this paper, we have proposed a DNN-based approach for removing rain in videos. We first established a large-scale database, comprising 316 videos with diverse rain patterns, for training DNN models. By investigating our database, we found two intrinsic characteristics of rain videos, i.e., the temporal correlation of clean content and the similar patterns of rain. Then, we proposed a two-stream ConvLSTM structure that includes the rain detection and rain removal subnets. As such, rain detection and removal can be jointly achieved. More importantly, the performance of rain removal can be enhanced by detecting rain streaks and restoring the clean video in a unified DNN structure. This performance enhancement was verified through the ablation study in our experiments. In addition, the experiments showed that our approach significantly outperforms 5 state-of-the-art approaches on both synthetic and real-world rain videos.


  • [1] David G Lowe, “Object recognition from local scale-invariant features,” in ICCV, 1999, vol. 2, pp. 1150–1157.
  • [2] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer, “Kernel-based object tracking,” TPAMI, vol. 25, no. 5, pp. 564–577, 2003.
  • [3] David W Murray and Bernard F Buxton, “Scene segmentation from visual motion using global optimization,” TPAMI, no. 2, pp. 220–228, 1987.
  • [4] Kshitiz Garg and Shree K Nayar, “Detection and removal of rain from videos,” in CVPR, 2004, vol. 1, pp. I–I.
  • [5] Xiaopeng Zhang, Hao Li, Yingyi Qi, Wee Kheng Leow, and Teck Khim Ng, “Rain removal in video by combining temporal and chromatic properties,” in ICME, 2006, pp. 461–464.
  • [6] Peng Liu, Jing Xu, Jiafeng Liu, and Xianglong Tang, “Pixel based temporal analysis using chromatic property for removing rain from videos,” Computer and information science, vol. 2, no. 1, pp. 53, 2009.
  • [7] Li-Wei Kang, Chia-Wen Lin, and Yu-Hsiang Fu, “Automatic single-image-based rain streaks removal via image decomposition,” TIP, vol. 21, no. 4, pp. 1742, 2012.
  • [8] Jin-Hwan Kim, Jae-Young Sim, and Chang-Su Kim, “Video deraining and desnowing using temporal correlation and low-rank matrix completion,” TIP, vol. 24, no. 9, pp. 2658–2670, 2015.
  • [9] Yu Luo, Yong Xu, and Hui Ji, “Removing rain from a single image via discriminative sparse coding,” in ICCV, 2015, pp. 3397–3405.
  • [10] Yu Li, Robby T Tan, Xiaojie Guo, Jiangbo Lu, and Michael S Brown, “Rain streak removal using layer priors,” in CVPR, 2016, pp. 2736–2744.
  • [11] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley, “Removing rain from single images via a deep detail network,” in CVPR, 2017, pp. 1715–1723.
  • [12] Wenhan Yang, Robby T Tan, Jiashi Feng, Jiaying Liu, Zongming Guo, and Shuicheng Yan, “Deep joint rain detection and removal from a single image,” in CVPR, 2017, pp. 1357–1366.
  • [13] Tai-Xiang Jiang, Ting-Zhu Huang, Xi-Le Zhao, Liang-Jian Deng, and Yao Wang, “A novel tensor-based video rain streaks removal approach via utilizing discriminatively intrinsic priors,” in CVPR, 2017.
  • [14] Wei Wei, Lixuan Yi, Qi Xie, Qian Zhao, Deyu Meng, and Zongben Xu, “Should we encode rain streaks in video as deterministic or stochastic,” in CVPR, 2017, pp. 2516–2525.
  • [15] He Zhang and Vishal M Patel, “Density-aware single image de-raining using a multi-stream dense network,” arXiv preprint arXiv:1802.07412, 2018.
  • [16] Zhiwen Fan, Huafeng Wu, Xueyang Fu, Yue Huang, and Xinghao Ding, “Residual-guide feature fusion network for single image deraining,” arXiv preprint arXiv:1804.07493, 2018.
  • [17] Minghan Li, Qi Xie, Qian Zhao, Wei Wei, Shuhang Gu, Jing Tao, and Deyu Meng, “Video rain streak removal by multiscale convolutional sparse coding,” in CVPR, 2018, pp. 6644–6653.
  • [18] Jie Chen, Cheen-Hau Tan, Junhui Hou, Lap-Pui Chau, and He Li, “Robust video content alignment and compensation for rain removal in a cnn framework,” arXiv preprint arXiv:1803.10433, 2018.
  • [19] Jiaying Liu, Wenhan Yang, Shuai Yang, and Zongming Guo, “Erase or fill? deep joint recurrent rain removal and reconstruction in videos,” in CVPR, 2018, pp. 3233–3242.
  • [20] Jérémie Bossu, Nicolas Hautière, and Jean-Philippe Tarel, “Rain or snow detection in image sequences through use of a histogram of orientation of streaks,” IJCV, vol. 93, no. 3, pp. 348–367, 2011.
  • [21] “Adobe After Effects software.”
  • [22] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, “Densely connected convolutional networks,” in CVPR, 2017, vol. 1, p. 3.
  • [23] Weihong Ren, Jiandong Tian, Zhi Han, Antoni Chan, and Yandong Tang, “Video desnowing and deraining based on matrix decomposition,” in CVPR, 2017, pp. 4210–4219.