Video Rain/Snow Removal by Transformed Online Multiscale Convolutional Sparse Coding

09/13/2019 ∙ by Minghan Li, et al.

Removing rain and snow from surveillance videos is an important task in the computer vision community, since rain and snow in videos can severely degrade the performance of many surveillance systems. Various methods have been investigated extensively, but most only consider consistent rain/snow under stable background scenes. Rain/snow captured by practical surveillance cameras, however, is always highly dynamic in time, and the background scene is occasionally transformed as well. To address this issue, this paper proposes a novel rain/snow removal approach that fully considers the dynamic statistics of both rain/snow and background scenes in a video sequence. Specifically, the rain/snow is encoded by an online multi-scale convolutional sparse coding (OMS-CSC) model, which not only finely delivers the sparse scattering and multi-scale shapes of real rain/snow, but also encodes their temporally dynamic configurations through model parameters that are updated in real time. Furthermore, a transformation operator imposed on the background scenes is embedded into the proposed model, which finely conveys the dynamic background transformations, such as rotations, scalings and distortions, that inevitably exist in a real video sequence. The approach so constructed naturally adapts better to dynamic rain/snow and background changes, and is also suitable for streaming video owing to its online learning mode. The proposed model is formulated in a concise maximum a posteriori (MAP) framework and is readily solved by the ADMM algorithm. Compared with state-of-the-art online and offline video rain/snow removal methods, the proposed method achieves better performance on synthetic and real video datasets, both visually and quantitatively. In particular, our method can be implemented with relatively high efficiency, showing its potential for real-time video rain/snow removal.




1 Introduction

Videos captured by outdoor surveillance systems are often contaminated by rain or snow, which harms perceptual quality and tends to degrade the performance of subsequent video processing tasks, such as human detection [8], person re-identification [10], object tracking [32] and scene analysis [19]. Thus, removing rain and snow from surveillance videos is an important video pre-processing step and has attracted much attention in the computer vision community.

In recent decades, various methods have been proposed for removing rain from a video. The earliest video rain removal approach was based on the photometric properties of rain [13]. After that, more methods taking advantage of the essential physical characteristics of rain, such as photometric appearance [14], chromatic consistency [50], shape and brightness [3] and spatial-temporal configurations [36], were introduced to better separate rain streaks from the background of videos. However, these methods do not utilize prior knowledge of video structure, such as the spatial smoothness of foreground objects and the temporal similarity of background scenes, and thus cannot always obtain satisfactory performance, especially in complex scenes. In recent years, low-rank models [7] have shown great potential for this task and often achieve state-of-the-art performance owing to their better use of prior knowledge of video structure in both foreground and background. Specifically, these methods not only use the low-rank structure of the background, but also fully exploit prior knowledge of the rain, such as sparsity and spatial smoothness [33, 40]. Very recently, deep learning based methods have also been proposed for this task. These methods address video rain removal by constructing deep recurrent convolutional networks [26] or deep convolutional networks [6], and implement the task in the popular end-to-end learning manner.

Despite this progress, most current methods are implemented on videos of pre-fixed length and assume consistent rain/snow shapes under static background scenes. This, however, evidently deviates from real scenarios. On the one hand, the configuration of rain/snow in a video sequence generally changes constantly over time, as typically shown in Fig. 1. On the other hand, the background scene in a video is also dynamic, and inevitably contains transformations over time, such as translation, rotation, scaling and distortion, due to camera jitter. Failing to consider such dynamic characteristics tends to degrade the performance of current methods in such real cases. Besides, with the dramatically increasing number of surveillance cameras installed all over the world, real video usually arrives online in a streaming format. Most current methods, however, are implemented/trained on a prefixed video sequence, and thus cannot finely and efficiently adapt to streaming videos that keep arriving continually and endlessly. These issues have hampered the applicability of existing methods in real applications and are thus worth investigating specifically.

Fig. 1: The first column: Three frames in a video with snow gradually varied from heavy to light. The second and third columns: background scenes and snow of the frames obtained by the OTMS-CSC method. The fourth to sixth columns: three snow layers separated by OTMS-CSC.

To address the aforementioned issues, this paper proposes a new video rain/snow removal method that fully encodes the dynamic statistics of both rain/snow and background scenes in a video over time into the model, and realizes it in an online mode so that it can potentially handle constantly arriving streaming video sequences. Specifically, this work is inspired by the multi-scale convolutional sparse coding (MS-CSC) model for video rain removal (still for static rain) previously proposed in [24], which finely delivers the sparse scattering and multi-scale shapes of real rain. It encodes the temporal changing tendency of rain/snow as a dynamic MS-CSC framework, in which the model parameters are updated over time in an online manner. Besides, a transformation operator that can be adaptively updated over time is imposed on the background scenes to finely fit the dynamic background transformations present in a video sequence. All this knowledge is formulated in a concise maximum a posteriori (MAP) framework, which can be easily solved by an alternating optimization technique.

In all, the contributions of this work can be summarized as follows: 1) An online MS-CSC model is specifically designed for encoding dynamic rain/snow with temporal variations. The model is formulated as a concise probabilistic framework, in which the feature maps representing the rain/snow knowledge of each video frame are gradually updated under a regularization penalty that keeps them close to those calculated from the previous frames. In this manner, the dynamic rain properties, i.e., the correlation and distinctiveness of rain/snow across different video frames, can be finely delivered. 2) An affine transformation operator is further embedded into the proposed model, and can be automatically adjusted to fit a wide range of video background transformations. This makes the method more robust to general camera movements, such as rotation, translation, scaling or distortion. 3) The adopted online learning manner gives the method a fixed space complexity throughout, rather than a gradually increasing (potentially unbounded) one as in most conventional methods, and a fixed time complexity for any fixed number of newly arriving frames. This guarantees the feasibility of our method on video sequences of any length, and gives the method the potential to handle real streaming data. 4) The superiority of the proposed method in robustness and efficiency is comprehensively substantiated, both visually and quantitatively, by experiments on synthetic and real videos, including those with evident rain/snow variations and/or dynamic camera jitter, in comparison with other state-of-the-art methods. In particular, the performance of our method, directly executed on the streaming video sequence, can exceed that of deep learning methods, which require additional pre-collected training data (pairs of rainy/snowy and corresponding clean video frames).
This, to a certain extent, shows that the popular data-driven deep learning methodology, which requires large amounts of supervised training samples and computational power, might not be the only way to solve computer vision tasks. It might still be necessary and important to elaborately design probabilistic models through a thorough understanding of the investigated problem and application scenarios.

The rest of the paper is organized as follows. Section 2 introduces the related works. Section 3 reviews the offline MS-CSC model suitable for removing static rain and proposes the online transformed MS-CSC model as well as its solving algorithm. Section 4 demonstrates experimental results on synthetic and real rainy/snowy videos to substantiate the superiority of the proposed method. Finally, conclusions are drawn in Section 5.

2 Related Works

In this section, we give a brief review on the methods of video rain and snow removal. The related developments on single image rain and snow removal, multi-scale modeling and video alignment are also introduced for literature comprehensiveness.

2.1 Video Rain and Snow Removal Methods

Garg and Nayar [13] made the earliest study on the photometric appearance of rain drops and developed a rain detection method by utilizing a linear space-time correlation model. To better reduce the effects of rain before camera shots in images/videos, Garg and Nayar [14, 15] further proposed a method by adjusting the camera parameters such as field depth and exposure time.

In the past years, more intrinsic physical properties of rain streaks have been explored and formulated in algorithm design. For example, Zhang et al. [50] incorporated both chromatic and temporal properties and utilized K-means clustering to distinguish background and rain streaks in videos. Later, Barnum et al. [3] first considered the impact of snow on videos. They derived a physical model for representing raindrops and snowflakes and used it to determine the general shape and brightness of a single streak. The streak model, combined with the statistical properties of rain and snow, can then be used to determine how they affect the spatial-temporal frequencies of an image sequence. To enhance the robustness of rain removal, Barnum et al. [2] employed the regular visual effects of rain and snow in global frequency information to approximate rain streaks as a motion-blurred Gaussian. Afterwards, to integrate more prior knowledge of the task, Jiang et al. [20] proposed a tensor-based video rain streak removal approach by considering the sparsity of rain streaks, smoothness along the raindrops and the rain-perpendicular direction, and global and local correlation along the time direction.

In recent years, low-rank based models have drawn more research attention for the task of video rain/snow removal. Chen et al. [7] first investigated the spatial-temporal correlation among local patches with rain streaks and used a low-rank term to help extract rain streaks from a video. Later, Kim et al. [22] proposed a rain and snow removal method that is also based on temporal correlation and low-rank matrix completion. This method uses extra supervised knowledge (images/videos with/without rain streaks) to help train a rain classifier. To further exclude false candidates, Santhaseelan et al. [34] used local phase congruency to detect rain and applied chromatic constraints. To deal with heavy rain and snow in dynamic scenes, Ren et al. [33] divided rain into sparse and dense types based on the low-rank hypothesis of the background. Also based on the low-rank background assumption, Wei et al. [40] further encoded rain streaks as a patch-based mixture of Gaussians. Such a stochastic manner of encoding rain streaks enables the method to deliver a wider range of rain information.

Very recently, motivated by the boom in deep learning (DL) techniques, several DL methods have also appeared for this task. Liu et al. [26] addressed the problem by constructing a joint recurrent rain removal and reconstruction network, a deep recurrent convolutional network that seamlessly integrates rain degradation classification, spatial texture appearance based rain removal, and temporal coherence based background detail reconstruction. Meanwhile, Chen et al. [6] proposed a deep derain framework that applies superpixel segmentation to decompose the scene into depth-consistent units. Alignment of scene contents is done at the superpixel level to handle videos with highly complex and dynamic scenes.

2.2 Single Image Rain and Snow Removal Methods

For literature comprehensiveness, we also briefly review rain/snow removal methods for a single image. Kang et al. [21] first formulated the problem as an image decomposition problem based on morphological component analysis, which extracts the rain component from the high-frequency part of an image by using dictionary learning and sparse coding. Later, Luo et al. [28] built a nonlinear screen blend model based on discriminative sparse codes. Besides, Ding et al. [9] designed a guided smoothing filter to obtain a coarse rain-free or snow-free image, and Li et al. [25] utilized patch-based GMM priors to distinguish and remove rain from the background in a single image. Wang et al. [38] designed a 3-layer hierarchical scheme to classify the high-frequency part into rain/snow and non-rain/snow components. Gu et al. [18] jointly used analysis sparse representation and synthesis sparse representation to encode background scene and rain streaks. Meanwhile, Zhang et al. [48] learned a set of generic sparsity-based and low-rank representation-based convolutional filters for efficiently representing background and rain streaks in an image.

Recently, DL-based methods have become the new trend for this task. Fu et al. [11] first developed a deep CNN model to extract discriminative features of rain in the high-frequency layer of an image. The training pairs are constructed based on the whole image. Later, Fu et al. [12] constructed the training pairs by using image patches and utilized the ResNet as the classifier. Zhang et al. [47] first proposed a deraining network based on a generative adversarial network for single image deraining. Yang et al. [42] designed a multi-task DL architecture that learns the binary rain streak map, the appearance of rain streaks and the clean background. Liu et al. [27] proposed a multistage and multi-scale network to deal with the removal of translucent and opaque snow particles. Very recently, Yang et al. [43] constructed a contextualized deep network, which incorporates a binary rain map indicating rain-streak regions, and accommodates various shapes, directions, and sizes of overlapping rain streaks as well as rain accumulation to model heavy rain.

Although these image-based methods can also handle rain/snow removal in a video in a rough frame-by-frame manner, they do not use the important temporal information available for this task, so video-based methods tend to perform significantly better than image-based ones.

2.3 Multi-scale Approaches

Since multi-scale structure is a general characteristic of various visual concepts, multi-scale approaches have been applied to a wide range of computer vision tasks. For example, for image segmentation, Baatz et al. [1] used a scale parameter to control the average image object size, making the method adaptable to different scales of interest. For image quality assessment, Wang et al. [39] proposed a multi-scale structural similarity method and developed an image synthesis method to calibrate the parameters that define the relative importance of different scales. To improve the invariance of CNN activations, Gong et al. [16] presented a simple but effective scheme to design a multi-scale orderless pooling regime. For dense prediction, Yu et al. [45] developed a convolutional network module using dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution.

2.4 Alignment Approaches for Videos

Since camera jitter tends to damage the low-rank background structure of a video, the transformed video frames need to be aligned to accurately extract the low-rank background. Many alignment methods have been proposed for this issue. For example, Zhang et al. [52] proposed an approach to directly extract certain 3D invariant structures through their 2D images by undoing the (affine or projective) domain transformations. Zhang et al. [51] further proposed a general method for recovering low-rank 3-order tensors, which introduced auxiliary variables and relaxed the hard equality constraints by the ADMM method. Yong et al. [44] proposed an alignment method for aligning the video background based on optimizing a supplemental affine transformation operator, and applied it to the task of dynamic background subtraction.

3 Online Transformed MS-CSC model for Dynamic Video Rain/Snow Removal

This work is inspired by our previous conference work [24], which proposed an offline multi-scale convolutional sparse coding (MS-CSC) model specifically designed for rain removal (with temporally consistent rain) in a video sequence of fixed length. We thus first introduce the formulation of this offline model.

3.1 Offline MS-CSC Model

Let $\mathcal{X} \in \mathbb{R}^{h \times w \times n}$ denote the input video, where $h$, $w$ and $n$ represent the height, width and the number of frames, respectively. We assume that the video is decomposed as:

$$\mathcal{X} = \mathcal{B} + \mathcal{R} + \mathcal{F} + \mathcal{E}, \qquad (1)$$

where $\mathcal{B}$, $\mathcal{R}$, $\mathcal{F}$ and $\mathcal{E}$ represent the background scene, rain layer, moving objects, and background noise of the video, respectively. These parts can then be modeled separately as follows [24].

Background Modeling: For a video sequence of fixed length captured by a surveillance camera, the background tends to stay steady over the frames, and thus can reasonably be assumed to reside in a low-dimensional subspace [30, 53, 54, 5], leading to its low-rank matrix factorization representation:

$$\mathcal{B} = \mathrm{Fold}(U V^{T}), \qquad (2)$$

where $U \in \mathbb{R}^{d \times r}$, $V \in \mathbb{R}^{n \times r}$, $d = hw$ and $r < \min(d, n)$. The operation ‘Fold’ folds each matrix column into the corresponding frame matrix, and thus $\mathcal{B}$ is a tensor with the same size as $\mathcal{X}$.
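The ‘Fold’ operation and the low-rank factorization described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the names `unfold`, `fold`, `U` and `V`, the toy sizes and the random factors are all hypothetical and are not taken from the paper's implementation.

```python
import numpy as np


def unfold(video: np.ndarray) -> np.ndarray:
    """Flatten an (h, w, n) video so that column t holds frame t as a vector."""
    h, w, n = video.shape
    return video.reshape(h * w, n)


def fold(matrix: np.ndarray, h: int, w: int) -> np.ndarray:
    """Inverse of `unfold`: reshape each column back into an h-by-w frame."""
    return matrix.reshape(h, w, matrix.shape[1])


# Toy sizes and random factors (illustrative values only).
rng = np.random.default_rng(0)
h, w, n, r = 4, 5, 6, 2
U = rng.standard_normal((h * w, r))   # spatial basis, one column per component
V = rng.standard_normal((n, r))       # per-frame coefficients, one row per frame

# Background tensor of size (h, w, n); its unfolding has rank at most r.
background = fold(U @ V.T, h, w)
```

Because every frame is a linear combination of the same `r` columns of `U`, a nearly static background is captured with a very small `r`.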

Rain Layer Modeling: Since rain in a video contains repetitive local patterns sparsely scattered over different areas, and also exhibits multi-scale properties because rain occurs at different distances from the camera, multi-scale convolutional sparse coding (MS-CSC) [46] is utilized to model rain as follows:


where is a set of feature maps that approximate the rain streak positions, and denotes the filters representing the repetitive local patterns of rain streaks. and denote the numbers of entire filters and filters at the -th scale, respectively. Considering the sparsity of the feature maps, the $\ell_1$-penalty [31] is utilized to regularize them.
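To make the multi-scale convolutional sparse coding idea concrete, the sketch below synthesizes a rain layer as a sum of filters convolved with sparse feature maps, with filters of different sizes standing for different scales. It is a minimal illustration, not the authors' code: the nested-list layout `filters[k][s]`, the function names and the circular (FFT-based) boundary handling are assumptions made here for brevity.

```python
import numpy as np


def conv2d_same(feature_map: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular 2-D convolution via FFT; the output has the shape of feature_map."""
    h, w = feature_map.shape
    kh, kw = kernel.shape
    padded = np.zeros((h, w))
    padded[:kh, :kw] = kernel
    # Centre the kernel so a unit impulse reproduces the kernel around its position.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    spectrum = np.fft.fft2(feature_map) * np.fft.fft2(padded)
    return np.real(np.fft.ifft2(spectrum))


def synthesize_rain(filters, feature_maps) -> np.ndarray:
    """Sum over scales k and filters s of filters[k][s] convolved with feature_maps[k][s]."""
    rain = np.zeros_like(feature_maps[0][0], dtype=float)
    for scale_filters, scale_maps in zip(filters, feature_maps):
        for kernel, fmap in zip(scale_filters, scale_maps):
            rain += conv2d_same(fmap, kernel)
    return rain


# Two scales with one vertical streak filter each (illustrative values only).
small = np.ones((3, 1)) / 3.0
large = np.ones((7, 1)) / 7.0
map_small = np.zeros((16, 16)); map_small[4, 3] = 1.0    # one near-field streak
map_large = np.zeros((16, 16)); map_large[9, 10] = 1.0   # one far-field streak
demo_rain = synthesize_rain([[small], [large]], [[map_small], [map_large]])
```

Each nonzero entry of a feature map places one copy of the corresponding filter in the rain layer, which is why sparse maps yield sparsely scattered streaks.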

Moving objects Modeling: Motivated by the work [40], Markov random field (MRF) is used to explicitly detect the moving objects. Let be a binary tensor denoting the moving object support:


and be the complementary of (i.e., , is a tensor with all elements as 1). Eq.(1) is then reformulated as:


where denotes the element-wise multiplication. Since moving objects are usually smooth, a total variation (TV) penalty is adopted to regularize them. Additionally, considering the sparsity and the continuous shapes of moving objects along both space and time, an $\ell_1$-penalty and a weighted 3-dimensional total variation (3DTV) penalty are both employed to regularize the moving object support.

By assuming that the background noise follows an i.i.d. Gaussian, we can then integrate the aforementioned three models imposed on background, rain streak and moving objects to get the MS-CSC model for offline video rain removal as follows [24]:

where are the variables involved in the problem to be optimized.

3.2 Online Transformed MS-CSC Model

The previous MS-CSC model is specifically designed for rain removal in a video of fixed length under the assumption that the rain has a consistent configuration over time. Specifically, the rain feature maps (as defined in Eq. (3)) of all video frames, obtained under fixed filters, are assumed to follow a single independent and identically distributed (i.i.d.) Laplacian distribution. Real rain shapes, however, are both correlated and distinctive over time, and vary from frame to frame across the entire video. The simple encoding manner of MS-CSC is thus inappropriate for real scenarios. We thus present the online MS-CSC model, which not only provides a more proper way to describe temporally dynamic rain/snow, but also makes the method more efficient and potentially applicable to streaming videos with continuously increasing frames in real time.

Denote the newly coming frame as , where and represent the height and width of this frame, respectively, and denotes the total number of pixels in this frame. Similar to (1), we then decompose into the following four parts:


where represent the background scene, rain layer, moving objects and background noise of the current frame, respectively. We then put forward the schemes to model these parts based on the dynamic characteristics of rain/snow.

3.2.1 Modeling dynamic rain/snow layer

Similar to the aforementioned offline MS-CSC model, we also adopt the MS-CSC model [46] to represent the repetitive local patterns and multi-scale shapes of rain streaks, namely:


where is a set of feature maps that approximate the rain streak positions, and denotes the filters representing the repetitive local patterns of rain streaks. and denote the filter number and the filter at the -th scale, respectively.

Similar to the MS-CSC model, we also assume that the feature map of the current frame follows a Laplacian distribution (i.e., it is imposed with an $\ell_1$ penalty as in Eq. (3.1)), which, however, has its own scale parameter that differs from those of other frames, namely:


where the scale parameter is specified for the current frame reflecting the specific rain shape in this frame. Furthermore, the correlation of rain between current and previous frames is represented by the following prior term imposed on :


where and represent the scale parameter learned from the previous frames. Here denotes the Inverse-Gamma distribution, a conjugate prior to , whose mode is exactly the previously learned one (i.e., ). This naturally encodes the correlation of rain shapes between the current frame and the knowledge learned from previous ones.

In this way, the dynamic characteristics of rain/snow across a video can be reasonably represented. Specifically, the scale parameter of each frame is learned separately and differs from one frame to another, finely representing the distinctiveness (i.e., ‘non-identical’) of rain/snow among different frames. Furthermore, the scale parameter of the feature map distribution for the current frame is regularized by the previously learned ones, encoding the correlation (i.e., ‘non-independent’) across frames, especially adjacent ones. The model is thus expected to better adapt to the variations of dynamic rain/snow.
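The interplay between a per-frame Laplacian scale and an Inverse-Gamma prior centred on the previous scale can be illustrated with a small numerical sketch. It rests on assumptions that are not taken from the paper: the prior is written as Inverse-Gamma(`alpha`, `beta`) with `beta = (alpha + 1) * prev_scale`, so that its mode equals the previously learned scale, and `alpha` is a hypothetical strength hyperparameter. Under these assumptions, conjugacy gives an Inverse-Gamma posterior whose mode has the closed form used in the function.

```python
import numpy as np


def map_laplace_scale(feature_map: np.ndarray, prev_scale: float, alpha: float) -> float:
    """Posterior mode of a Laplacian scale b under an Inverse-Gamma(alpha, beta) prior.

    Assumed setup (illustrative, not the paper's exact parameterisation):
      entries of feature_map ~ Laplace(0, b), b ~ InvGamma(alpha, beta),
      beta = (alpha + 1) * prev_scale, so the prior mode is prev_scale.
    The posterior is InvGamma(alpha + N, beta + ||M||_1), with mode
      (beta + ||M||_1) / (alpha + N + 1).
    """
    n = feature_map.size
    beta = (alpha + 1.0) * prev_scale
    return float((beta + np.abs(feature_map).sum()) / (alpha + n + 1.0))
```

A larger `alpha` keeps the new scale closer to the previous one (stronger correlation between frames), while a smaller `alpha` lets the current frame dominate (more distinctiveness).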

3.2.2 Modeling moving object and background noise layers

Following the MS-CSC model, we also adopt MRF to detect the moving objects. Let be a binary matrix denoting the moving object support, which is defined as


Let be complementary of satisfying . Eq.(6) can then be equivalently expressed as:


Like the optimization problem (3.1), by assuming all elements of the background noise follow a Gaussian distribution with zero mean and variance , we can then get the probabilistic model for the component of as follows:


Similar to the dynamic shapes of rain in practical videos, the background noise embedded in the video is also dynamic, and is both distinctive and correlated among video frames. This dynamic knowledge can be represented in the same way. Specifically, for the video noise in the current frame with variance , we model it in a manner similar to that above, i.e., by imposing a conjugate prior on as:


where and denote the variance of Gaussian noise learned from the previous frames. The mode of this prior is also the knowledge previously learned (i.e., ). This encoding manner is thus also able to deliver the dynamic property of noises/snow along the video.

3.2.3 Modeling dynamic video background

To tackle the dynamic shapes of background scenes in a video caused by camera jitter, i.e., video transformations like translation, rotation and scaling, a flexible affine transformation operation is imposed on the background. In the decomposition form (6) for the current frame , the background component is expressed as being transformed from the previous one as , where denotes the transformation operator applied to the initial background , and can be formulated as an affine or projective transformation [44]. Then, Eq.(11) and (12) are reformulated as:


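As an illustration of what applying an affine transformation operator to a background image means in practice, the sketch below resamples an image under an affine map. It is a simplified stand-in rather than the operator used in the paper: the function name `warp_affine`, the parameterisation by a 2-by-2 matrix `A` and an offset `t` (in row/column coordinates), nearest-neighbour sampling and zero filling outside the image are all assumptions made here.

```python
import numpy as np


def warp_affine(image: np.ndarray, A, t) -> np.ndarray:
    """Resample `image` so that output pixel x takes the value at A @ x + t.

    Coordinates are (row, col). Nearest-neighbour lookup is used, and pixels
    whose source location falls outside the image are set to 0.
    """
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    dst = np.stack([rows.ravel(), cols.ravel()]).astype(float)      # (2, h*w)
    src = np.asarray(A, dtype=float) @ dst + np.asarray(t, dtype=float).reshape(2, 1)
    r = np.rint(src[0]).astype(int)
    c = np.rint(src[1]).astype(int)
    valid = (r >= 0) & (r < h) & (c >= 0) & (c < w)
    out = np.zeros(h * w, dtype=image.dtype)
    out[valid] = image[r[valid], c[valid]]
    return out.reshape(h, w)


# A one-pixel vertical shift, as might be caused by slight camera jitter.
frame = np.arange(12.0).reshape(3, 4)
shifted = warp_affine(frame, np.eye(2), [1, 0])
```

Rotation and scaling are obtained by changing `A`; a practical implementation would normally use bilinear interpolation so that the warp is differentiable with respect to its parameters.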
3.2.4 Online Transformed MS-CSC Model

For convenience, we denote all involved parameters as and the parameters in the current and last frames as and , respectively. Based on the models provided in the last sections, given the previous parameters and newly coming frame , we can then obtain the posterior distribution of as follows:


Through maximizing this posterior, the updated parameters for the current frame can then be attained. This MAP problem can then be equivalently expressed as the following minimization problem:




Specifically, and correspond to the regularization terms for the distributions of feature map and noises embedded in , respectively, which can be more intuitively understood by the following equivalent forms:


where denotes the KL divergence between two distributions. Particularly, it can be easily observed that functions to rectify the rain streaks on the current frame with parameter to approximate the previously learned rain streaks with parameter , so as to make the rain shapes in the adjacent frames correlated. Similarly, the regularization term inclines to enforce the background noise in the current frame close to that embedded in the previous ones. This easily explains why our method can fit dynamic rain, as well as varying background noises, in a video with evidently non-i.i.d. configurations.
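The link between an Inverse-Gamma prior and a KL divergence can be checked by hand under an assumed parameterisation; the symbols $b$, $\hat{b}$, $\sigma^2$, $\hat{\sigma}^2$, $\alpha$ and $\alpha'$ below are introduced here for illustration and are not necessarily those of the paper. Suppose the current Laplacian scale $b$ has an Inverse-Gamma$(\alpha, \beta)$ prior with $\beta = (\alpha+1)\hat{b}$, so that the prior mode is the previous scale $\hat{b}$, and likewise the current noise variance $\sigma^2$ has an Inverse-Gamma$(\alpha', \beta')$ prior with $\beta' = (\alpha'+1)\hat{\sigma}^2$. Then:

```latex
\begin{aligned}
-\ln p(b)
  &= (\alpha+1)\ln b + \frac{(\alpha+1)\,\hat{b}}{b} + \text{const} \\
  &= (\alpha+1)\Big[\ln\frac{b}{\hat{b}} + \frac{\hat{b}}{b} - 1\Big] + \text{const}
   = (\alpha+1)\,\mathrm{KL}\big(\mathrm{Laplace}(0,\hat{b})\,\big\|\,\mathrm{Laplace}(0,b)\big) + \text{const}, \\[4pt]
-\ln p(\sigma^2)
  &= (\alpha'+1)\ln \sigma^2 + \frac{(\alpha'+1)\,\hat{\sigma}^2}{\sigma^2} + \text{const} \\
  &= 2(\alpha'+1)\Big[\ln\frac{\sigma}{\hat{\sigma}} + \frac{\hat{\sigma}^2}{2\sigma^2} - \frac{1}{2}\Big] + \text{const}
   = 2(\alpha'+1)\,\mathrm{KL}\big(\mathcal{N}(0,\hat{\sigma}^2)\,\big\|\,\mathcal{N}(0,\sigma^2)\big) + \text{const}.
\end{aligned}
```

Both expressions are minimised when the current parameter equals the previous one, which is the sense in which the prior pulls the current frame's distribution towards the one learned from earlier frames.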

The corresponding augmented Lagrangian function of Eq. (3.2.4) can be written as follows:


where and are the Lagrange variable and the penalty parameter, respectively.

3.3 ADMM Algorithm

We can then readily adopt ADMM algorithm to iteratively optimize each variable involved in Eq. (3.2.4). To simplify the relevant subproblems, we will utilize the following equation:

Next, we discuss how to solve each subproblem separately.

Update : The subproblem with respect to is


This subproblem is a standard energy minimization problem, which can be efficiently solved by graph cut algorithm [4, 23].

Update : The subproblem with respect to is


which is easily solved by the TV regularization algorithm [37].

Update and : Since is a nonlinear geometric transform, it is hard to optimize it directly, and we resort to the following linear approximation:


where is the Jacobian of with respect to . We can iteratively approximate the original nonlinear transformation with a locally linear approximation, as . Therefore, the subproblem with respect to can be reformulated as:


It can be solved in closed-form. The solution is:


Fixing , we can use Eq. (26) to update the background.
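The linearise-then-solve idea used for the transformation update can be sketched as a small Gauss-Newton loop: around the current parameters the warp is replaced by its first-order approximation, and the resulting linear least-squares problem has a closed-form solution for the increment. Everything specific in this sketch is an assumption made for illustration: the function names, the finite-difference Jacobian (a stand-in for an analytic one), and the toy 1-D "warp" with a shift and a gain. The actual subproblem in the paper involves additional terms that are omitted here.

```python
import numpy as np


def numerical_jacobian(warp, tau: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Forward-difference Jacobian of warp(tau) with respect to tau."""
    base = warp(tau)
    jac = np.empty((base.size, tau.size))
    for i in range(tau.size):
        step = np.zeros_like(tau)
        step[i] = eps
        jac[:, i] = (warp(tau + step) - base) / eps
    return jac


def estimate_transform(warp, target: np.ndarray, tau0, n_iter: int = 20) -> np.ndarray:
    """Repeatedly solve min_delta ||target - warp(tau) - J delta||^2 and update tau."""
    tau = np.asarray(tau0, dtype=float).copy()
    for _ in range(n_iter):
        jac = numerical_jacobian(warp, tau)
        residual = target - warp(tau)
        delta, *_ = np.linalg.lstsq(jac, residual, rcond=None)
        tau = tau + delta
    return tau


# Toy example: recover the shift and gain of a 1-D signal (hypothetical warp).
x = np.linspace(0.0, 2.0 * np.pi, 200)


def toy_warp(tau: np.ndarray) -> np.ndarray:
    return tau[1] * np.sin(x + tau[0])


observed = toy_warp(np.array([0.3, 1.2]))
estimated = estimate_transform(toy_warp, observed, tau0=[0.0, 1.0])
```

Each pass is cheap because the linearised problem is an ordinary least-squares fit; the approximation is only valid locally, which is why the update is iterated.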

Update : The subproblem with respect to is


This subproblem is a standard CSC problem and can be readily solved by [41], which adopts the ADMM scheme and FFT to improve computation efficiency.
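ADMM-based CSC solvers typically alternate between a linear solve carried out in the Fourier domain and an elementwise shrinkage step that handles the sparsity penalty. The shrinkage step is the proximal operator of the $\ell_1$ norm and is sketched below; the function name and the threshold argument are illustrative, and this is only one ingredient of a full solver, not the routine of [41] itself.

```python
import numpy as np


def soft_threshold(x: np.ndarray, thresh: float) -> np.ndarray:
    """Proximal operator of thresh * ||.||_1: shrink every entry towards zero."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)
```

Entries with magnitude below the threshold are set exactly to zero, which is what produces sparse feature maps.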

Update : The subproblem with respect to is


We use the online learning algorithm for sparse coding [29] to update the filters . The algorithm utilizes block-coordinate descent with warm restarts.

Update : The subproblem with respect to is


The closed-form solution is


where .

Update : Following the general ADMM setting, can be updated as:


Update : The subproblem with respect to is


Its closed-form solution is:


where .

Update : The subproblem with respect to is


Its closed-form solution is:


where .

The algorithm for solving this online transformed MS-CSC (OTMS-CSC) model can then be summarized as Algorithm 1.

Algorithm 1 Algorithm for OTMS-CSC Model
Input:  The newly coming frame: ; model variables of last frame: ; the parameters of last frame: .
Output:  , .
1:  if   then
2:     Update by using the strategy suggested in Sec. 3.4.2.
3:  end if
4:  while not converged do
5:     Update by Eq. (28) and update .
6:     Update aligned background by Eq. (26).
7:     Update by Eq. (24), (25), respectively.
8:     Update by Eq. (29), (30), respectively.
9:     Update by Eq. (32), (33), respectively.
10:     Update by Eq. (35), (37), respectively.
11:  end while
12:  ;  Recovered frame = .

3.4 Some Remarks

3.4.1 Explanation for function of regularizations

It should be noted that the regularization terms in Eq. (21) and Eq. (22) are what intrinsically give the proposed OTMS-CSC model its advantage in removing dynamic rain/snow. Specifically, the offline MS-CSC model [24] specifies a single value for the background noise variance and a single value for the scale parameter of the feature maps representing rain/snow, shared by all frames of the video. The offline model is thus only suitable for videos with static background and consistent rain/snow shapes. The OTMS-CSC model, however, can finely handle videos with dynamic rain and varying background noise. This advantage follows from the fact that the model assumes each frame has its own noise variance and scale parameter, which are fitted to the current frame while being regularized by the values obtained from the previous frames. As a result, the model, applied to each new frame in an online mode, adapts better to the specific structures of rain/snow or background in the current frame, which generally vary from those in previous ones.

To clarify this point more intuitively, Fig. 2 illustrates the changing tendencies of the noise variance and the scale parameters for a sequence of video frames containing snow varying from heavy to light, as shown in Fig. 10. It can be seen that both gradually decrease over time, finely reflecting the dynamic changes of snow.

3.4.2 Background Amelioration

Our method gradually updates the background of the current frame by applying an affine transformation to that of the last frame, using Eq. (26). Owing to the constant temporal scene shifting of videos (especially when the camera moves along a certain direction for a short time) and the incremental accumulation of computational errors, the recovered video background tends to gradually deviate from the real one, which makes the rain-removed videos look more or less blurry after the algorithm has run for a while. To alleviate this issue, the background knowledge needs to be specifically ameliorated after the algorithm has processed a certain number of frames.

Fig. 2: The changing tendency of the noise variance and the scale parameter along a video (as shown in Fig. 1) containing dynamic snow varying from heavy to light. Since there are three different scales of filters (used for , , patch sizes , respectively) are utilized, there are three scale parameter changing curves.

Our strategy is as follows. After our algorithm has run for a certain number of iterations, we take the frame reached at that point as the current frame and pick two frames before and after it to form a subgroup:


We then align all other frames to the current frame, in a manner similar to that introduced in Eq. (26), to obtain the aligned subgroup (a tensor):


where each aligned frame is readily calculated by Eqs. (26)-(28). We can then efficiently compute the optimal rank-one approximation of the unfolded matrix of the aligned subgroup by SVD, and use it to replace the current background estimate, which gives the new ameliorated background initialization.
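The rank-one step can be sketched in a few lines of Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the frames are assumed to be already aligned (the alignment of Eqs. (26)-(28) is not reproduced), the function name `rank_one_background`, the array layout and the five-frame toy data are hypothetical, and the background is taken to be the current frame's column of the rank-one approximation.

```python
import numpy as np

def rank_one_background(aligned_frames, current_index):
    """Return the current frame's slice of the best rank-one approximation.

    aligned_frames: array of shape (n_frames, height, width), assumed to be
    already aligned to the current frame (alignment is not shown here).
    """
    n, h, w = aligned_frames.shape
    # Unfold the tensor so that each column is one vectorized frame.
    unfolded = aligned_frames.reshape(n, h * w).T  # shape (h*w, n)
    u, s, vt = np.linalg.svd(unfolded, full_matrices=False)
    # By the Eckart-Young theorem, the leading singular triplet gives the
    # optimal rank-one approximation in the Frobenius norm.
    rank_one = s[0] * np.outer(u[:, 0], vt[0])
    return rank_one[:, current_index].reshape(h, w)

# Toy data: a shared background plus small per-frame perturbations.
rng = np.random.default_rng(0)
background = rng.random((8, 8))
frames = np.stack(
    [background + 0.01 * rng.standard_normal((8, 8)) for _ in range(5)]
)
refreshed = rank_one_background(frames, current_index=2)
```

Because the aligned frames share nearly the same background, the leading singular component captures what is common to them, while frame-specific deviations are pushed into the discarded components.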

(a) Input/GT (b) Garg et al. [15] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Wei et al. [40] (f) Liu et al. [26] (g) MS-CSC (h) OTMS-CSC
Fig. 3: (a) An input rainy frame (upper) and its groundtruth clean one (lower). (b)-(h) Recovered frames (upper) and extracted rain layers (lower) by different competing methods.

3.4.3 Potential to be used for streaming videos

The proposed OTMS-CSC algorithm is implemented in an online mode, i.e., each run processes a single newly arriving frame. This learning manner makes our method potentially applicable to practical streaming videos. Specifically, in the implementation stage for each frame, the algorithm only requires a fixed amount of memory to store the related parameters. Besides, since the same procedure is applied to each new frame, the time complexity of each learning stage is also fixed. This makes our method potentially feasible for practical videos that arrive continuously in streaming format. Current offline methods, by contrast, not only need more space as the video grows longer, but also require more time for longer sequences (and may even need to rerun the algorithm on the entire video), which makes them hardly usable for this typical real video format in practice. Our method thus makes real-time rain removal possible for practical streaming video. What remains is to improve the efficiency of our algorithm on a single frame so that it gradually meets real-time requirements. Possible routes include more powerful hardware, faster implementations (such as a distributed/parallel version or a port to a faster implementation platform), and replacing some stages with faster algorithms. This is a meaningful issue worthy of further research.
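The fixed-memory, one-frame-at-a-time processing pattern described above can be sketched as a Python generator. Everything here is hypothetical scaffolding: `process_stream`, `running_mean_update` and `frame_source` are illustrative names, and the per-frame step is a simple exponential running-mean background used as a stand-in, not the OTMS-CSC update.

```python
import numpy as np

def process_stream(frames, update_frame, state):
    """Consume frames one at a time; only `state` persists between frames."""
    for frame in frames:
        output, state = update_frame(frame, state)
        yield output  # the input frame can be discarded after this point

def running_mean_update(frame, state, rate=0.1):
    """Stand-in per-frame step (NOT the OTMS-CSC update).

    Keeps an exponential running-mean background as the only state and
    returns the residual of the frame with respect to that background.
    """
    background = frame if state is None else (1 - rate) * state + rate * frame
    return frame - background, background

def frame_source(n, shape=(4, 4)):
    """Hypothetical stream: a generator, so the video is never held in memory."""
    rng = np.random.default_rng(0)
    for _ in range(n):
        yield rng.random(shape)

outputs = [
    out.shape
    for out in process_stream(frame_source(100), running_mean_update, None)
]
```

The state carried between iterations has the size of a single frame regardless of how many frames have been seen, which is the property that makes this pattern suitable for streams of unknown length.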

(a) Input (b) GT (c) Garg et al. [15] (d) Jiang et al. [20] (e) Ren et al. [33] (f) Wei et al. [40] (g) Liu et al. [26] (h) MS-CSC (i) OTMS-CSC
Fig. 4: (a)(b) An input frame with heavy rain and its groundtruth clean one. (c)-(i) Recovered frames obtained by different competing methods.
(a) Input/GT (b) Garg et al. [15] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Wei et al. [40] (f) Liu et al. [26] (g) MS-CSC (h) OTMS-CSC
Fig. 5: (a) An input snowy frame (upper) and its groundtruth clean one (lower). (b)-(h) Recovered frames (upper) and extracted snow layers (lower) by different competing methods.
(a) Input (b) Garg et al. [13] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Groundtruth (f) Liu et al. [26] (g) TMS-CSC (h) OTMS-CSC
Fig. 6: (a) Two typical input frames in a video with heavy rain. (b)-(h) Recovered frames obtained by different competing methods.

Types       Static videos                                                                          Dynamic video
Dataset     Fig. 3                       Fig. 4                       Fig. 5                       Fig. 6
Metrics     PSNR   VIF    FSIM   SSIM    PSNR   VIF    FSIM   SSIM    PSNR   VIF    FSIM   SSIM    PSNR   VIF    FSIM   SSIM
Input       28.22  0.637  0.935  0.927   23.82  0.766  0.970  0.929   27.93  0.595  0.859  0.831   29.32  0.752  0.995  0.909
Garg [15]   29.83  0.661  0.955  0.946   24.64  0.750  0.972  0.920   35.87  0.819  0.957  0.950   36.11  0.849  0.977  0.969
Jiang [20]  31.01  0.767  0.967  0.959   24.32  0.713  0.966  0.929   35.80  0.779  0.982  0.977   32.51  0.693  0.998  0.960
Ren [33]    28.26  0.685  0.970  0.962   23.52  0.681  0.966  0.927   30.34  0.921  0.753  0.995   31.33  0.626  0.994  0.956
Wei [40]    29.76  0.822  0.991  0.986   24.43  0.761  0.973  0.943   34.58  0.945  0.996  0.993   -      -      -      -
Liu [26]    27.56  0.626  0.995  0.941   22.19  0.555  0.946  0.895   31.56  0.616  0.996  0.946   34.69  0.716  0.998  0.965
(T)MS-CSC   33.89  0.865  0.992  0.992   25.37  0.790  0.980  0.957   42.95  0.980  0.999  0.997   36.90  0.862  0.999  0.982
OTMS-CSC    32.58  0.853  0.991  0.989   25.91  0.796  0.979  0.957   46.29  0.988  0.999  0.999   37.65  0.869  0.983  0.966

TABLE I: Quantitative performance comparison of all competing methods on synthetic rainy and snowy videos.

4 Experimental Results

In this section, we evaluate the performance of our method on videos with synthetic and real rain/snow, both quantitatively and qualitatively. Several state-of-the-art video rain/snow removal methods have also been implemented for comparison, including Garg et al. [13], Jiang et al. [20] (code provided by the authors), Ren et al. [33], Wei et al. [40], Li et al. [24] and Liu et al. [26]. Note that these methods include representative state-of-the-art approaches of both the model-driven MAP-based and the data-driven DL types, for a comprehensive comparison. Furthermore, by introducing a traditional offline alignment strategy into the MS-CSC model, yielding what we call transformed MS-CSC or TMS-CSC, this offline method can also be adapted to videos with background transformations. All experiments were implemented on a PC with an i7 CPU and 32G RAM. For a sufficiently comprehensive comparison, more video demonstrations of the results obtained by the competing methods are reported on our specifically constructed website for easy observation.

4.1 Experiments on Videos with Synthetic Rain/Snow

We first introduce experiments on four videos with synthetic rain/snow: three with static backgrounds, as shown in Figs. 3-5, and one with an evidently dynamic background exhibiting evident translations between adjacent frames, as depicted in Fig. 6. The clean videos shown in Fig. 4 and Fig. 6 were downloaded from the CDNET database [17], and those of Fig. 3 and Fig. 5 were obtained from YouTube and a Xi’an Jiaotong University surveillance camera, respectively. In particular, the videos shown in Fig. 4 and Fig. 5 contain heavy rain and snow, respectively, which seriously occlude the background scene and foreground objects throughout the video sequences. The various types of rain/snow were synthetically generated with Photoshop on a black background.

(a) Input (b) Garg et al. [15] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Wei et al. [40] (f) Liu et al. [26] (g) MS-CSC (h) OTMS-CSC
Fig. 7: (a) An input frame of a real rainy video with complex moving objects. (b)-(h) Recovered frames obtained by different competing methods.
(a) Input (b) Garg et al. [15] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Wei et al. [40] (f) Liu et al. [26] (g) MS-CSC (h) OTMS-CSC
Fig. 8: (a) An input frame of a real rainy video captured at night. (b)-(h) Recovered frames obtained by different competing methods.

From Fig. 3 and Fig. 4, we can easily observe that the methods proposed by Garg et al., Jiang et al. and Liu et al. have not completely removed the rain streaks, and that the method proposed by Ren et al. has not preserved the shape of the moving object well when removing the rain streaks. Besides, as shown in the second row of Fig. 3, the rain layers extracted by all other competing methods contain more or less additional background information. Comparatively, the proposed OTMS-CSC method, as well as its offline version MS-CSC, can finely remove the rain in the video and maintain the shape and texture details well.

From Fig. 5, it can be seen that most competing methods have not finely removed the snow from the video, and that the snow layer separated by the MS-CSC method improperly contains certain moving objects. Comparatively, the OTMS-CSC method performs better in both snow removal and background/foreground detail preservation.

For the dynamic video shown in Fig. 6, we can observe that the methods proposed by Garg et al. and Ren et al. have not fully removed the rain details in the images. The method proposed by Ren et al., as well as the offline TMS-CSC method, has not preserved the structure of the moving foreground objects well, and that proposed by Jiang et al. has also not done well in preserving background details (such as the texture of the wall). Comparatively, our proposed OTMS-CSC method attains relatively better performance in both aspects.

Quantitative comparisons are also presented in Table I and are consistent with the aforementioned visual observations. Specifically, we adopt four image quality assessment (IQA) metrics to evaluate the performance of all competing methods, namely PSNR, VIF [35], FSIM [49] and SSIM [39]. The table shows that the proposed OTMS-CSC model performs best or second best in almost all cases in terms of all IQAs, as compared with the other competing methods. Note that all other methods are either implemented on the entire video (iteratively using the video multiple times) or need additional pre-collected training data, whereas our method processes the video sequentially (i.e., each frame is used only once and then dropped). It is therefore reasonable to say that our method is efficient.
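Of the four metrics, PSNR is the simplest to state explicitly. The following Python sketch computes it from the mean squared error between a groundtruth frame and a recovered frame. The function name `psnr`, the peak value of 255 (8-bit images) and the toy frames are assumptions; the exact evaluation settings used for Table I are not given in the text.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    `peak` is the maximum possible pixel value; 255 assumes 8-bit images.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy example: a random "clean" frame and a noisy version of it.
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
noisy = np.clip(clean + rng.normal(0.0, 5.0, size=clean.shape), 0, 255)
score = psnr(clean, noisy)
```

Higher values indicate a recovered frame closer to the groundtruth; for a video, the per-frame scores would typically be averaged, although the averaging scheme used in the paper is not specified here.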

(a) Input (b) Garg et al. [15] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Wei et al. [40] (f) Liu et al. [26] (g) MS-CSC (h) OTMS-CSC
Fig. 9: (a) An input frame of a real snowy video with poor visibility. (b)-(h) Recovered frames obtained by different competing methods.
(a) Input (b) Garg et al. [15] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Wei et al. [40] (f) Liu et al. [26] (g) MS-CSC (h) OTMS-CSC
Fig. 10: (a) An input frame of a real video with dynamic snow shapes. (b)-(h) Recovered frames obtained by different competing methods.
(a) Input (b) Garg et al. [13] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Liu et al. [26] (f) TMS-CSC (g) OTMS-CSC
Fig. 11: (a) Two input frames of a real snowy video with fast horizontal movement. (b)-(g) Recovered frames obtained by different competing methods.
(a) Input (b) Garg et al. [13] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Liu et al. [26] (f) TMS-CSC (g) OTMS-CSC
Fig. 12: (a) Two input frames of a real snowy video with obvious illumination variation. (b)-(g) Recovered frames obtained by different competing methods.
(a) Input (b) Garg et al. [13] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Liu et al. [26] (f) TMS-CSC (g) OTMS-CSC
Fig. 13: (a) Two input frames of a real snowy video with scale transformation. (b)-(g) Recovered frames obtained by different competing methods.
(a) Input (b) Garg et al. [13] (c) Jiang et al. [20] (d) Ren et al. [33] (e) Liu et al. [26] (f) TMS-CSC (g) OTMS-CSC
Fig. 14: (a) Two input frames of a real aerial video with complex moving objects. (b)-(g) Recovered frames obtained by different competing methods.

4.2 Experiments on Videos with Real Rain/Snow

We further evaluate the performance of the proposed method on videos with real rainy or snowy scenarios. Eight real videos are included in our experiments: three captured under static backgrounds (as shown in Figs. 7-9) and five under dynamic backgrounds (as shown in Figs. 10-14) with typical transformations such as random jitter, translation, scale transformation and aerial view. Fig. 7 and Fig. 11 are two public rain videos both used in [14], and the videos of Fig. 8, Fig. 10, Fig. 12 and Fig. 13 were downloaded from YouTube (video IDs v=KzEv1h-JgaY, v=kNTYEKjXqzs, v=wb3gWRcKyCI and v=HbgoKKj7TNA, respectively).

The videos shown in Fig. 7 and Fig. 8 were captured by street surveillance equipment and contain rain structures that vary dynamically over time. From the figures, we can easily observe that the derained frames of all other compared methods still contain certain rain streaks and that their extracted rain layers are mixed with edges from the background. By contrast, the OTMS-CSC method, as well as MS-CSC, is capable of better removing all the rain streaks without mixing extra information into the rain layer.

Fig. 9 and Fig. 10 show two real snowy video sequences captured in real scenes with poor visibility and dynamic backgrounds. It is easy to see from the figures that most other competing methods perform poorly in snow removal, especially in the area around the light. Comparatively, our method can finely remove the snow and preserve the texture details of the frame.

Fig. 11 and Fig. 12 show the snow removal results on real videos with fast horizontal movement and obvious illumination variations, respectively. From Fig. 11, it can be seen that the methods proposed by Garg et al. and Jiang et al. cannot fully remove the snow and recover the texture information underlying the frames. The methods proposed by Ren et al. and Liu et al. fail to detect and remove the snowflakes since they are not capable of dealing with video transformations. The OTMS-CSC method, as well as TMS-CSC, obtains better visual performance since both consider the background transformation in their modeling. This verifies that aligning the video background helps to improve the final snow removal performance, especially for dynamic videos.

Fig. 13 and Fig. 14 show two challenging real snowy videos. The video in Fig. 13 was captured in light snow with most of the background covered in white snow, so it is not easy even for humans to observe the falling snow in a frame. Fig. 14 is also challenging since it is an aerial video with evident scale variations across frames. The figures show that our method still performs relatively satisfactorily on these videos, which verifies its robustness in real cases. Please refer to the website for a more comprehensive illustration of the video results.

4.3 Run time comparison

To show the efficiency of the proposed online method, Table II lists the average running time per frame of each compared method on four representative static and dynamic videos with synthetic and real rain/snow. The table shows an evident speed advantage for the OTMS-CSC method, attributable to its online learning manner. Besides, as shown in Fig. 15, this online method scales well: its total time cost increases linearly with the number of input frames, naturally due to its fixed training time per frame. Together with its fixed space complexity over time, as discussed in Sec. 3.4.3, the method is expected to be useful for real streaming videos.

(a) Fig. 5 (b) Fig. 6 (c) Fig. 10 (d) Fig. 13
Fig. 15: Run time comparison of comparable methods on several videos. A black point indicates that the method reports an out-of-memory error beyond that number of frames.


Type     Dataset  Size  Ren [33]  Wei [40]  Liu [26]  MS-CSC  OTMS-CSC
Static   Fig. 5         3.67      8.62      4.82      3.37    0.96
Static   Fig. 10        8.05      13.30     4.82      2.69    0.88
Dynamic  Fig. 6         50.3      -         4.03      9.53    0.87
Dynamic  Fig. 13        80.4      -         8.55      23.47   1.36


TABLE II: Run time comparison of all competing methods on four typical rainy/snowy videos.

5 Conclusion

In this paper, we have proposed a new rain/snow removal method for surveillance videos containing dynamic rain/snow captured with camera jitter. The dynamic characteristics of both the rain/snow and the background scene over time, which are inevitably encountered in real cases, have been fully considered in our method. In particular, the method is naturally implemented online, with fixed space and time complexity for handling each frame of a continuously arriving video, making it potentially useful for practical streaming video sequences. In the future, we will further improve the capability of the proposed method on more challenging videos, such as those captured by moving cameras, those with strong color contrast in the background, and those containing rain/snow with large streak shapes. We will also try to design rational techniques or use more advanced computing equipment to further speed up the per-frame processing so that it meets the real-time requirements of practical streaming videos.


  • [1] M. Baatz and A. Schape (2000) An optimization approach for high quality multi-scale image segmentation. Beitrage Zum Agit-symposium, pp. 12–23.
  • [2] P. C. Barnum, S. Narasimhan, and T. Kanade (2010) Analysis of rain and snow in frequency space. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 86 (2), pp. 256.
  • [3] P. Barnum, T. Kanade, and S. Narasimhan Spatio-temporal frequency analysis for removing rain and snow from videos. Proceedings of the First International Workshop on Photometric Analysis For Computer Vision.
  • [4] Y. Boykov, O. Veksler, and R. Zabih (2001) Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (11), pp. 1222–1239.
  • [5] X. Cao, Q. Zhao, D. Meng, Y. Chen, and Z. Xu (2016) Robust low-rank matrix factorization under general mixture noise distributions. IEEE Transactions on Image Processing 25 (10), pp. 4677–4690.
  • [6] J. Chen, C. Tan, J. Hou, and L. Chau (2018) Robust video content alignment and compensation for clear vision through the rain. Computer Vision and Pattern Recognition.
  • [7] Y. Chen and C. Hsu (2013) A generalized low-rank appearance model for spatio-temporally correlated rain streaks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1968–1975.
  • [8] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. IEEE Computer Society Conference on Computer Vision 1 (12), pp. 886–893.
  • [9] X. Ding, L. Chen, X. Zheng, Y. Huang, and D. Zeng (2016) Single image rain and snow removal via guided l0 smoothing filter. Multimedia Tools and Applications 75 (5), pp. 2697–2712.
  • [10] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani (2010) Person re-identification by symmetry-driven accumulation of local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [11] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley (2017) Clearing the skies: a deep network architecture for single-image rain removal. IEEE Transactions on Image Processing 26 (6), pp. 2944–2956.
  • [12] X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley (2017) Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3855–3863.
  • [13] K. Garg and S. K. Nayar (2004) Detection and removal of rain from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I–I.
  • [14] K. Garg and S. K. Nayar (2005) When does a camera see rain?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1067–1074.
  • [15] K. Garg and S. K. Nayar (2007) Vision and rain. International Journal of Computer Vision 75 (1), pp. 3–27.
  • [16] Y. Gong, L. Wang, R. Guo, and S. Lazebnik Multi-scale orderless pooling of deep convolutional activation features. Springer International Publishing.
  • [17] N. Goyette, P. Jodoin, F. Porikli, J. Konrad, and P. Ishwar (2012) Changedetection.net: a new change detection benchmark dataset. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 1–8.
  • [18] S. Gu, D. Meng, W. Zuo, and L. Zhang (2017) Joint convolutional analysis and synthesis sparse representation for single image layer separation. In Proceedings of the IEEE International Conference on Computer Vision.
  • [19] L. Itti, C. Koch, and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11), pp. 1254–1259.
  • [20] T. Jiang, X. Zhao, L. Deng, and Y. Wang (2017) A novel tensor-based video rain streaks removal approach via utilizing discriminatively intrinsic priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [21] L. Kang, C. Lin, and Y. Fu (2012) Automatic single-image-based rain streaks removal via image decomposition. IEEE Transactions on Image Processing 21 (4), pp. 1742–1755.
  • [22] J. H. Kim, J. Y. Sim, and C. S. Kim (2015) Video deraining and desnowing using temporal correlation and low-rank matrix completion. IEEE Transactions on Image Processing 24 (9), pp. 2658–70.
  • [23] V. Kolmogorov and R. Zabih (2004) What energy functions can be minimized via graph cuts?. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2), pp. 147–159.
  • [24] M. Li, Q. Xie, Q. Zhao, W. Wei, S. Gu, J. Tao, and D. Meng (2018) Video rain removal by multiscale convolutional sparse coding. Computer Vision and Pattern Recognition.
  • [25] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown (2016) Rain streak removal using layer priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2736–2744.
  • [26] J. Liu, W. Yang, S. Yang, and Z. Guo (2018) Erase or fill? deep joint recurrent rain removal and reconstruction in videos. Computer Vision and Pattern Recognition.
  • [27] Y. Liu, D. Jaw, S. Huang, and J. Hwang (2018) DesnowNet: context-aware deep network for snow removal. IEEE Transactions on Image Processing 27 (6), pp. 3064–3073.
  • [28] Y. Luo, Y. Xu, and H. Ji (2015) Removing rain from a single image via discriminative sparse coding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3397–3405.
  • [29] J. Mairal, F. Bach, J. Ponce, and G. Sapiro (2009) Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research 11 (1), pp. 19–60.
  • [30] D. Meng and F. De La Torre (2013) Robust matrix factorization with unknown noise. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1337–1344.
  • [31] D. Meng, Q. Zhao, and Z. Xu (2012) Improve robustness of sparse pca by l 1-norm maximization. In Pattern Recognition, pp. 487–497.
  • [32] S. Mukhopadhyay and A. K. Tripathi (2014) Combating bad weather part i: rain removal from video. Synthesis Lectures on Image, Video, and Multimedia Processing 7 (2), pp. 1–93.