1 Introduction
Videos captured by outdoor surveillance systems are often contaminated by rain or snow, which degrades perceptual quality and tends to harm the performance of subsequent video processing tasks, such as human detection [8], person re-identification [10], object tracking [32] and scene analysis [19]. Removing rain and snow from surveillance videos is thus an important video preprocessing step and has attracted much attention in the computer vision community.
In recent decades, various methods have been proposed for removing rain from a video. The earliest video rain removal approach was based on the photometric properties of rain [13]. After that, more methods exploiting the essential physical characteristics of rain, such as photometric appearance [14], chromatic consistency [50], shape and brightness [3] and spatial-temporal configurations [36], were introduced to better separate rain streaks from the video background. However, these methods do not utilize prior knowledge of video structure, such as the spatial smoothness of foreground objects and the temporal similarity of background scenes, and thus cannot always obtain satisfactory performance, especially in complex scenes. In recent years, low-rank models [7] have shown great potential for this task and often achieve state-of-the-art performance, owing to their better use of prior knowledge on video structure in both foreground and background. Specifically, these methods not only use the low-rank structure of the background, but also exploit prior knowledge of the rain, such as sparsity and spatial smoothness [33, 40]. Very recently, deep learning based methods have also been proposed for this task. These methods address video rain removal by constructing deep recurrent convolutional networks [26] or deep convolutional networks [6], and implement the task in an end-to-end learning manner.
Despite this progress, most current methods are implemented on videos of a prefixed length and assume consistent rain/snow shapes under static background scenes. This evidently deviates from real scenarios. On one hand, the rain/snow in a video sequence generally changes its configuration constantly over time, as typically shown in Fig. 1. On the other hand, the background scene in a video is also often dynamic, and inevitably undergoes transformations over time, such as translation, rotation, scaling and distortion, due to camera jitter. Neglecting such dynamic characteristics tends to degrade the performance of current methods in real cases. Besides, with the dramatically increasing number of surveillance cameras installed all over the world, real videos usually arrive online in a streaming format. Most current methods, however, are implemented/trained on a prefixed video sequence, and thus cannot efficiently adapt to streaming videos that arrive continually and endlessly. These issues have hampered the applicability of existing methods in real applications and are thus worth investigating specifically.
To address the aforementioned issues, this paper proposes a new video rain/snow removal method that fully encodes the dynamic statistics of both rain/snow and background scenes over time into the model, and realizes it in an online mode so that it can potentially handle continually arriving streaming video sequences. Specifically, inspired by the multiscale convolutional sparse coding (MSCSC) model previously proposed in [24] for video rain removal (still for static rain), which finely delivers the sparse scattering and multiscale shapes of real rain, this work encodes the temporal changing tendency of rain/snow as a dynamic MSCSC framework through timely parameter amelioration of the model in an online implementation manner. Besides, a transformation operator that can be adaptively updated over time is imposed on the background scenes to fit the dynamic background transformations existing in a video sequence. All this knowledge is formulated into a concise maximum a posteriori (MAP) framework, which can be readily solved by an alternating optimization technique.
In all, the contributions of this work can be summarized as follows:
1) An online MSCSC model is specifically designed for encoding dynamic rain/snow with temporal variations. The model is formulated as a concise probabilistic framework, in which the feature maps representing the rain/snow knowledge of each video frame are gradually ameliorated under a penalty that enforces them to be close to those calculated from the previous frames. In this manner, the dynamic rain properties, i.e., the correlation and distinctiveness of rain/snow across different video frames, can be finely delivered.
2) An affine transformation operator is further embedded into the proposed model, and can be automatically adjusted to fit a wide range of video background transformations. This makes the method more robust to general camera movements, such as rotation, translation, scaling or distortion.
3) The adopted online learning manner gives the method a fixed space complexity throughout, rather than a gradually increasing one (ultimately unbounded) as in most conventional methods, and a fixed time complexity for any fixed number of newly arriving frames. This guarantees the feasibility of our method on video sequences of any length, and gives it the potential to handle real streaming data.
4) The superiority of the proposed method in robustness and efficiency is comprehensively substantiated, both visually and quantitatively, by experiments on synthetic and real videos, including those with evident rain/snow variations and/or dynamic camera jitter, in comparison with other state-of-the-art methods. Specifically, the performance of our method, directly executed on the streaming video sequence, can exceed that of deep learning methods, which require additional pre-collected training data (pairs of rainy/snowy and corresponding clean video frames).
This, to a certain extent, shows that the popular data-driven deep learning methodology, which requires large amounts of supervised training samples and computational power, might not be the only way to solve computer vision tasks. It may still be necessary and important to elaborately design probabilistic models based on a thorough understanding of the investigated problem and its application scenarios.
The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 reviews the offline MSCSC model suitable for removing static rain and proposes the online transformed MSCSC model as well as its solving algorithm. Section 4 presents experimental results on synthetic and real rainy/snowy videos to substantiate the superiority of the proposed method. Finally, conclusions are drawn in Section 5.
2 Related Works
In this section, we briefly review methods for video rain and snow removal. Related developments in single image rain and snow removal, multiscale modeling and video alignment are also introduced for completeness.
2.1 Video Rain and Snow Removal Methods
Garg and Nayar [13] made the earliest study on the photometric appearance of rain drops and developed a rain detection method based on a linear space-time correlation model. To reduce the effects of rain in images/videos before they are captured, Garg and Nayar [14, 15] further proposed adjusting camera parameters such as field depth and exposure time.
In the past years, more intrinsic physical properties of rain streaks have been explored and formulated in algorithm design. For example, Zhang et al. [50] incorporated both chromatic and temporal properties and utilized K-means clustering to distinguish background and rain streaks in videos. Later, Barnum et al. [3] first considered the impact of snow on videos. They derived a physical model for representing raindrops and snowflakes and used it to determine the general shape and brightness of a single streak. The streak model, combined with the statistical properties of rain and snow, can then determine how they affect the spatial-temporal frequencies of an image sequence. To enhance the robustness of rain removal, Barnum et al. [2] employed the regular visual effects of rain and snow in global frequency information to approximate rain streaks as a motion-blurred Gaussian. Afterwards, to integrate more prior knowledge of the task, Jiang et al. [20] proposed a tensor-based video rain streak removal approach by considering the sparsity of rain streaks, smoothness along the raindrop direction and the rain-perpendicular direction, and global and local correlation along the time direction.
In recent years, low-rank based models have drawn more research attention for the task of video rain/snow removal. Chen et al. [7] first investigated the spatial-temporal correlation among local patches with rain streaks and used a low-rank term to help extract rain streaks from a video. Later, Kim et al. [22] proposed a rain and snow removal method which is also based on temporal correlation and low-rank matrix completion. This method uses extra supervised knowledge (images/videos with/without rain streaks) to help train a rain classifier. To further exclude false candidates, Santhaseelan et al. [34] used local phase congruency to detect rain and applied chromatic constraints. To deal with heavy rain and snow in dynamic scenes, Ren et al. [33] divided rain into sparse and dense components based on the low-rank hypothesis of the background. Based on the low-rank background assumption, Wei et al. [40] further encoded rain streaks as a patch-based mixture of Gaussians. Such a stochastic manner of encoding rain streaks allows the method to deliver a wider range of rain information.
Very recently, motivated by the boom in deep learning (DL) techniques, several DL methods have also appeared for the task. Liu et al. [26] addressed the problem by constructing deep recurrent convolutional networks, building a joint recurrent rain removal and reconstruction network that seamlessly integrates rain degradation classification, spatial texture appearance based rain removal, and temporal coherence based background detail reconstruction. Meanwhile, Chen et al. [6] proposed a deep derain framework which applies superpixel segmentation to decompose the scene into depth-consistent units. Alignment of scene contents is done at the superpixel level to handle videos with highly complex and dynamic scenes.
2.2 Single Image Rain and Snow Removal Methods
For completeness, we also briefly review rain/snow removal methods for a single image. Kang et al. [21] first formulated the problem as an image decomposition problem based on morphological component analysis, which extracts the rain component from the high-frequency part of an image by using dictionary learning and sparse coding. Later, Luo et al. [28] built a nonlinear screen blend model based on discriminative sparse codes. Besides, Ding et al. [9] designed a guided smoothing filter to obtain a coarse rain-free or snow-free image, and Li et al. [25] utilized patch-based GMM priors to distinguish and remove rain from the background in a single image. Wang et al. [38] designed a 3-layer hierarchical scheme to classify the high-frequency part into rain/snow and non-rain/snow components. Gu et al. [18] jointly used analysis sparse representation and synthesis sparse representation to encode the background scene and rain streaks. Meanwhile, Zhang et al. [48] learned a set of generic sparsity-based and low-rank representation-based convolutional filters for efficiently representing background and rain streaks in an image.
Recently, DL-based methods have become the new trend for this task. Fu et al. [11] first developed a deep CNN model to extract discriminative features of rain in the high-frequency layer of an image. The training pairs are constructed based on the whole image. Later, Fu et al. [12] constructed the training pairs by using image patches and utilized ResNet as the classifier. Zhang et al. [47] first proposed a derain network based on a generative adversarial network for single image deraining. Yang et al. [42] designed a multi-task DL architecture that learns the binary rain streak map, the appearance of rain streaks and the clean background. Liu et al. [27] proposed a multi-stage and multiscale network to deal with the removal of translucent and opaque snow particles. Very recently, Yang et al. [43] constructed a contextualized deep network, which incorporates a binary rain map indicating rain-streak regions, and accommodates various shapes, directions, and sizes of overlapping rain streaks as well as rain accumulation to model heavy rain.
Although these image-based methods can also deal with rain/snow removal in a video in a rough frame-by-frame manner, they ignore the important temporal information available for this specific task, and video-based methods therefore tend to perform significantly better than image-based ones.
2.3 Multiscale Approaches
Since multiscale structure is a general characteristic of various visual concepts, multiscale approaches have been applied to a wide range of computer vision tasks. For example, for image segmentation, Baatz et al. [1] used a scale parameter to control the average image object size, making the method adaptable to different scales of interest. For image quality assessment, Wang et al. [39] proposed a multiscale structural similarity method and developed an image synthesis method to calibrate the parameters that define the relative importance of different scales. To improve the invariance of CNN activations, Gong et al. [16] presented a simple but effective scheme to design a multiscale orderless pooling regime. For dense prediction, Yu et al. [45] developed a convolutional network module using dilated convolutions to systematically aggregate multiscale contextual information without losing resolution.
2.4 Alignment Approaches for Videos
Since camera jitter tends to damage the low-rank background structure of a video, the transformed videos need to be aligned to accurately extract the low-rank background. Many alignment methods have been proposed for this issue. For example, Zhang et al. [52] proposed an approach to directly extract certain 3D invariant structures through their 2D images by undoing the (affine or projective) domain transformations. Zhang et al. [51] further proposed a general method for recovering low-rank 3-order tensors, which introduced auxiliary variables and relaxed the hard equality constraints by the ADMM method. Yong et al. [44] proposed a method for aligning the video background based on optimizing a supplemental affine transformation operator, and applied it to the task of dynamic background subtraction.
3 Online Transformed MSCSC Model for Dynamic Video Rain/Snow Removal
This work is inspired by our previous conference work [24], which proposed an offline multiscale convolutional sparse coding (MSCSC) model specifically designed for rain removal (with temporally consistent rain) in a video sequence of fixed length. We thus first introduce the formulation of this offline model.
3.1 Offline MSCSC Model
Let denote the input video, where and represent the height, width and the number of frames, respectively. We assume that the video is decomposed as:
(1) 
where represent background scene, rain layer, moving objects, and background noise of the video, respectively. These parts can then be modeled separately as follows [24].
Background Modeling: For a video sequence of fixed length captured by a surveillance camera, the background tends to stay steady over the frames, and can thus be reasonably assumed to reside in a low-dimensional subspace [30, 53, 54, 5], leading to its low-rank matrix factorization representation as:
(2) 
where . The operation ‘Fold’ refers to fold up each matrix column into the corresponding frame matrix, and thus is a tensor with the same size as .
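For concreteness, the decomposition of Eq. (1) and the low-rank background of Eq. (2) described above can be written in the following form. This is only a sketch: the symbol names (\mathcal{X}, \mathcal{B}, \mathcal{R}, \mathcal{F}, \mathcal{E}, U, V, h, w, n, r) are assumptions chosen for illustration and need not match the original typesetting.

```latex
\begin{align}
\mathcal{X} &= \mathcal{B} + \mathcal{R} + \mathcal{F} + \mathcal{E},
  \qquad \mathcal{X} \in \mathbb{R}^{h \times w \times n}, \\
\mathcal{B} &= \mathrm{Fold}\!\left(U V^{\top}\right),
  \qquad U \in \mathbb{R}^{hw \times r},\; V \in \mathbb{R}^{n \times r},\; r \ll \min(hw, n).
\end{align}
```

Here each column of U V^{\top} is one vectorized background frame, so 'Fold' simply reshapes the hw-by-n matrix back into an h-by-w-by-n tensor.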
Rain Layer Modeling: Since rain in a video contains repetitive local patterns sparsely scattered over different areas, and also exhibits multiscale property because it occurs at different distances from the camera, multiscale convolutional sparse coding (MSCSC) [46] is utilized to model rain as follows:
(3) 
where is a set of feature maps that approximate the rain streak positions, and denotes the filters representing the repetitive local patterns of rain streaks. and denote the numbers of entire filters and filters at the th scale, respectively. Considering the sparsity of feature maps, the penalty [31] is utilized to regularize them.
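The multiscale convolutional synthesis described above can be sketched for a single frame as follows. This is a minimal illustration, assuming that the rain layer is the sum, over scales and filters, of each filter convolved with its sparse feature map; the function names `conv2d_same` and `synthesize_rain` and the nested-list layout are hypothetical and not taken from [24] or [46].

```python
import numpy as np

def conv2d_same(feature_map, kernel):
    """Zero-padded 'same' 2-D convolution by direct summation (odd kernel sizes)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(feature_map, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]  # a true convolution flips the kernel
    h, w = feature_map.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + h, j:j + w]
    return out

def synthesize_rain(filters_by_scale, maps_by_scale):
    """Sum over scales k and filters s of (filter convolved with feature map).

    filters_by_scale[k][s] is a small 2-D filter at scale k (assumed layout);
    maps_by_scale[k][s] is the matching sparse feature map of frame size.
    """
    rain = 0.0
    for filters, maps in zip(filters_by_scale, maps_by_scale):
        for d, m in zip(filters, maps):
            rain = rain + conv2d_same(m, d)
    return rain

# Toy usage: one 3x3 filter and one 5x5 filter, each fired by a single impulse.
m1 = np.zeros((9, 9)); m1[5, 5] = 1.0
k1 = np.arange(9.0).reshape(3, 3)
m2 = np.zeros((9, 9)); m2[2, 2] = 2.0
k2 = np.ones((5, 5))
rain = synthesize_rain([[k1], [k2]], [[m1], [m2]])
```

A single nonzero entry in a feature map stamps one copy of the corresponding filter at that location, which is why sparse feature maps with a few filters of different sizes can represent repetitive streaks at several scales.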
Moving objects Modeling: Motivated by the work [40], Markov random field (MRF) is used to explicitly detect the moving objects. Let be a binary tensor denoting the moving object support:
(4) 
and be the complementary of (i.e., , is a tensor with all elements as 1). Eq.(1) is then reformulated as:
(5) 
where denotes element-wise multiplication. Since moving objects always exhibit smooth property, a total variation (TV) penalty is adopted to regularize them. Additionally, considering the sparsity and the continuous shapes of moving objects along both space and time, penalty and weighted 3-dimensional total variation (3DTV) penalty are both employed to regularize the moving object support simultaneously.
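As an illustration of the weighted 3DTV penalty mentioned above, the sketch below computes an anisotropic variant: a weighted sum of absolute forward differences along height, width and time. The anisotropic discretization, the function name `weighted_tv3d` and the default weights are assumptions; the exact definition used in [24] is not reproduced here.

```python
import numpy as np

def weighted_tv3d(volume, weights=(1.0, 1.0, 1.0)):
    """Anisotropic weighted 3-D total variation of an (h, w, n) array.

    weights[0], weights[1], weights[2] scale the differences along
    height, width and time, respectively.
    """
    v = np.asarray(volume, dtype=float)
    return sum(wt * np.abs(np.diff(v, axis=ax)).sum()
               for ax, wt in enumerate(weights))

# Toy usage: a 2x2 support that switches on in the last of three frames.
support = np.zeros((2, 2, 3))
support[:, :, 2] = 1.0
penalty = weighted_tv3d(support, weights=(1.0, 1.0, 0.5))
```

A support that is spatially and temporally contiguous yields a small value, while isolated flickering pixels are penalized heavily, which is the behaviour the text asks of the moving object support.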
By assuming that the background noise follows an i.i.d. Gaussian, we can then integrate the aforementioned three models imposed on background, rain streak and moving objects to get the MSCSC model for offline video rain removal as follows [24]:
where are the variables involved in the problem to be optimized.
3.2 Online Transformed MSCSC Model
The previous MSCSC model is specifically designed for rain removal in a video of fixed length, under the assumption that the rain has a consistent configuration over time. Specifically, the rain feature maps (as defined in Eq. (3)) of all video frames obtained under fixed filters are assumed to be independent and identically distributed, following a single Laplacian distribution. Real rain shapes, however, are both correlated and distinctive over time, and vary from frame to frame across the entire video. The simple encoding manner of MSCSC is thus inappropriate for real scenarios. We thus present the online MSCSC model, which not only provides a more proper way to describe temporally dynamic rain/snow, but also makes the method more efficient and potentially applicable to streaming videos with continuously increasing frames in real time.
Denote the newly coming frame as , where and represent the height and width of this frame, respectively, and denotes the total number of pixels in this frame. Similar to (1), we then decompose as the following three parts:
(6) 
where represent the background scene, rain layer, moving objects and background noise of the current frame, respectively. We then put forward the schemes to model these parts based on the dynamic characteristics of rain/snow.
3.2.1 Modeling dynamic rain/snow layer
Similar to the aforementioned offline MSCSC model, we also adopt the MSCSC model [46] to represent the repetitive local patterns and multiscale shapes of rain streaks, namely:
(7) 
where is a set of feature maps that approximate the rain streak positions, and denotes the filters representing the repetitive local patterns of rain streaks. and denote the filter number and the filter at the th scale, respectively.
Similar to the MSCSC model, we also assume that the feature map of the current frame follows a Laplacian distribution (i.e., imposed with penalty as in Eq. (3.1)), which, however, has its own scale parameter different from the others, namely:
(8) 
where the scale parameter is specified for the current frame reflecting the specific rain shape in this frame. Furthermore, the correlation of rain between current and previous frames is represented by the following prior term imposed on :
(9) 
where and represent the scale parameter learned from the previous frames. Here denotes the inverse-Gamma distribution, a conjugate prior to , whose mode is exactly the one previously learned (i.e., ). This naturally conveys the correlation of rain shapes between the current frame and the knowledge learned from previous ones.
In this way, the dynamic characteristics of rain/snow across a video can be rationally represented. Specifically, the scale parameter in each frame is individually learned and differs from one frame to another, finely representing the distinctiveness (i.e., 'non-identical') of rain/snow among different frames. Furthermore, the scale parameter of the feature map distribution for the current frame is regularized by that previously learned, encoding the correlation (i.e., 'non-independent') across frames, especially adjacent ones. The model is thus expected to better adapt to the variations of dynamic rain/snow.
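The effect of this prior can be made concrete with a small sketch. It assumes that the feature map entries are i.i.d. Laplace with density (1/(2b)) exp(-|m|/b), and that the scale b has an inverse-Gamma(alpha, beta) prior whose mode beta/(alpha+1) equals the previously learned scale. The function name `map_laplace_scale` and the hyperparameter `alpha` are hypothetical, and the result is the textbook conjugate update rather than necessarily the exact closed form of the paper.

```python
def map_laplace_scale(feature_values, prev_scale, alpha=1.0):
    """MAP estimate of a Laplace scale b under an inverse-Gamma prior.

    Prior: b ~ InvGamma(alpha, beta) with mode beta / (alpha + 1) = prev_scale.
    Likelihood: prod_i (1 / (2b)) * exp(-|m_i| / b).
    Posterior: InvGamma(alpha + n, beta + sum_i |m_i|); its mode is returned.
    """
    beta = (alpha + 1.0) * prev_scale
    n = len(feature_values)
    l1_norm = sum(abs(v) for v in feature_values)
    return (beta + l1_norm) / (alpha + n + 1.0)

# Toy usage: previous scale 1.0, current frame suggests a larger scale (mean |m| = 2).
new_scale = map_laplace_scale([2.0, -2.0, 2.0, -2.0], prev_scale=1.0, alpha=1.0)
```

The estimate lands between the previously learned scale and the value suggested by the current frame alone, and a larger `alpha` pulls it more strongly towards the previous one, which mirrors the 'correlated yet distinctive' behaviour described in the text.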
3.2.2 Modeling moving object and background noise layers
Following the MSCSC model, we also adopt MRF to detect the moving objects. Let be a binary matrix denoting the moving object support, which is defined as
(10) 
Let be complementary of satisfying . Eq.(6) can then be equivalently expressed as:
(11) 
Like the optimization problem (3.1), by assuming that all elements of the background noise follow a Gaussian distribution with zero mean and variance , we can then get the probabilistic model for the component of as follows:
(12) 
Similar to the dynamic shapes of rain in practical videos, the background noise embedded in the video also takes dynamic forms, and is both distinctive and correlated among video frames. We can represent this dynamic knowledge as well. Specifically, for the video noise in the current frame with variance , we model it in the same manner as above, i.e., by imposing a conjugate prior on as:
(13) 
where and denote the variance of Gaussian noise learned from the previous frames. The mode of this prior is also the knowledge previously learned (i.e., ). This encoding manner is thus also able to deliver the dynamic property of noises/snow along the video.
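The corresponding update for the noise variance can be derived in a few lines. The derivation below is a sketch under stated assumptions: N residual entries e_i that are i.i.d. zero-mean Gaussian, and an inverse-Gamma(\alpha, \beta) prior on \sigma^2 whose mode equals the previously learned variance \sigma_{\mathrm{prev}}^2. The symbols \alpha, \beta, N and \mathbf{e} are introduced here for illustration only.

```latex
\begin{align}
p(\mathbf{e} \mid \sigma^2) &\propto (\sigma^2)^{-N/2}
  \exp\!\Big(-\frac{\|\mathbf{e}\|_2^2}{2\sigma^2}\Big), \\
p(\sigma^2) &\propto (\sigma^2)^{-\alpha-1}
  \exp\!\Big(-\frac{\beta}{\sigma^2}\Big),
  \qquad \frac{\beta}{\alpha+1} = \sigma_{\mathrm{prev}}^2, \\
p(\sigma^2 \mid \mathbf{e}) &\propto (\sigma^2)^{-(\alpha + N/2) - 1}
  \exp\!\Big(-\frac{\beta + \tfrac{1}{2}\|\mathbf{e}\|_2^2}{\sigma^2}\Big), \\
\hat{\sigma}^2 &= \frac{\beta + \tfrac{1}{2}\|\mathbf{e}\|_2^2}{\alpha + N/2 + 1}.
\end{align}
```

With no residual evidence (N = 0) the estimate reduces to \sigma_{\mathrm{prev}}^2, and as N grows it approaches the empirical variance \|\mathbf{e}\|_2^2 / N of the current frame.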
3.2.3 Modeling dynamic video background
To tackle the dynamic shapes of background scenes in a video caused by camera jitter, i.e., video transformations like translation, rotation and scaling, a flexible affine transformation operation is imposed on the background. In the decomposition form (6) for the current frame , the background component is expressed as a transformation of the previous one as , where denotes the transformation operator applied to the initial background , and can be formulated as an affine or projective transformation [44]. Then, Eqs. (11) and (12) are reformulated as:
(14) 
(15) 
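To make the role of the transformation operator concrete, the following sketch applies an affine map to a background image by inverse mapping with nearest-neighbour sampling. The parameterization (a 2x2 matrix `A` and an offset `t` in row/column coordinates), the zero fill for out-of-range pixels and the function name `warp_affine` are assumptions for illustration; the paper's operator may be parameterized and interpolated differently.

```python
import numpy as np

def warp_affine(image, A, t):
    """Warp a 2-D image so that a pixel at x_in moves to x_out = A @ x_in + t.

    Coordinates are (row, col). Each output pixel looks up its source
    location A^{-1} (x_out - t), rounded to the nearest pixel; locations
    falling outside the image are filled with 0.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    out_coords = np.stack([ys.ravel(), xs.ravel()]).astype(float)  # (2, h*w)
    offset = np.asarray(t, dtype=float).reshape(2, 1)
    src = np.linalg.inv(A) @ (out_coords - offset)
    sy, sx = np.rint(src).astype(int)
    valid = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out = np.zeros(h * w, dtype=image.dtype)
    out[valid] = image[sy[valid], sx[valid]]
    return out.reshape(h, w)

# Toy usage: shift a single bright pixel by one row and two columns.
background = np.zeros((8, 8))
background[2, 3] = 1.0
shifted = warp_affine(background, np.eye(2), (1, 2))
```

Rotation and scaling are obtained by changing `A`; a practical implementation would normally use bilinear interpolation so that the warped background varies smoothly with the transformation parameters.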
3.2.4 Online Transformed MSCSC Model
For convenience, we denote all involved parameters as and the parameters in the current and last frames as and , respectively. Based on the models provided in the last sections, given the previous parameters and newly coming frame , we can then obtain the posterior distribution of as follows:
(16) 
Through maximizing this posterior, the updated parameters for the current frame can then be attained. This MAP problem can then be equivalently expressed as the following minimization problem:
(17) 
where
(18)  
(19)  
(20) 
Specifically, and correspond to the regularization terms for the distributions of feature map and noises embedded in , respectively, which can be more intuitively understood by the following equivalent forms:
(21) 
(22) 
where denotes the KL divergence between two distributions. In particular, it can be easily observed that functions to rectify the rain streaks on the current frame with parameter to approximate the previously learned rain streaks with parameter , so as to make the rain shapes in adjacent frames correlated. Similarly, the regularization term tends to enforce the background noise in the current frame to be close to that embedded in the previous ones. This explains why our method can fit dynamic rain, as well as varying background noise, in a video with evidently non-i.i.d. configurations.
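The KL divergences involved have simple closed forms for the two families used here, which the sketch below evaluates. It assumes zero-location Laplace distributions with scales b and zero-mean Gaussians with variances sigma^2; the function names are hypothetical, and the order of the two arguments in Eqs. (21) and (22) is not asserted here.

```python
import math

def kl_laplace(b_p, b_q):
    """KL( Laplace(0, b_p) || Laplace(0, b_q) ) = log(b_q / b_p) + b_p / b_q - 1."""
    return math.log(b_q / b_p) + b_p / b_q - 1.0

def kl_gauss_zero_mean(var_p, var_q):
    """KL( N(0, var_p) || N(0, var_q) ) = 0.5 * (r - 1 - log r), with r = var_p / var_q."""
    ratio = var_p / var_q
    return 0.5 * (ratio - 1.0 - math.log(ratio))

# Toy usage: the penalty vanishes when parameters agree and grows as they drift apart.
same = kl_laplace(1.0, 1.0)
near = kl_laplace(1.0, 1.2)
far = kl_laplace(1.0, 3.0)
```

Both expressions are zero only when the current and previous parameters coincide, so adding them to the objective softly ties each frame's scale and noise variance to the previously learned values.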
The corresponding augmented Lagrangian function of Eq. (3.2.4) can be written as follows:
(23) 
where and are the Lagrange variable and the penalty parameter, respectively.
3.3 ADMM Algorithm
We can then readily adopt the ADMM algorithm to iteratively optimize each variable involved in Eq. (3.2.4). To simplify the relevant subproblems, we will utilize the following equation:
Next, we discuss how to solve each subproblem separately.
Update : The subproblem with respect to is
(24) 
This subproblem is a standard energy minimization problem, which can be efficiently solved by graph cut algorithm [4, 23].
Update : The subproblem with respect to is
(25) 
which is easily solved by the TV regularization algorithm [37].
Update and : Since is a nonlinear geometric transform, it is hard to optimize directly, and we resort to the following linear approximation:
(26) 
where is the Jacobian of with respect to . We can iteratively approximate the original nonlinear transformation with a locally linear approximation, as . Therefore, the subproblem with respect to can be reformulated as:
(27) 
It can be solved in closed form. The solution is:
(28) 
Fixing , we can use Eq. (26) to update the background.
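The linearized update above has the structure of a Gauss-Newton step, which can be sketched as an ordinary least-squares problem. The sketch assumes the subproblem reduces to minimizing ||residual - J @ delta||^2 over the parameter increment `delta`, ignoring any masking or weighting terms present in Eq. (27); the function name `transform_step` is hypothetical, and the stated solution is the standard normal-equations result rather than a transcription of Eq. (28).

```python
import numpy as np

def transform_step(residual, jacobian):
    """Least-squares increment: argmin_delta || residual - jacobian @ delta ||_2^2.

    residual: (num_pixels,) vectorized mismatch at the current parameters.
    jacobian: (num_pixels, num_params) derivative of the warped background
              with respect to the transformation parameters.
    """
    delta, *_ = np.linalg.lstsq(jacobian, residual, rcond=None)
    return delta

# Toy usage: for an exactly linear model, one step recovers the true increment.
rng = np.random.default_rng(0)
J = rng.standard_normal((50, 6))  # 6 parameters, as in a 2-D affine map
delta_true = np.array([0.1, -0.2, 0.3, 0.0, 0.5, -0.4])
delta_hat = transform_step(J @ delta_true, J)
```

Because the true warp is nonlinear, such a step is typically repeated: apply the increment, re-warp, recompute the Jacobian, and solve again until the increment becomes negligible.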
Update : The subproblem with respect to is
(29) 
This subproblem is a standard CSC problem and can be readily solved by [41], which adopts the ADMM scheme and FFT to improve computation efficiency.
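Within ADMM-based CSC solvers, the sparsity penalty on the feature maps is usually handled by an element-wise shrinkage (soft-thresholding) step. The sketch below shows only that step, not the full FFT-based solver of [41]; the function name `soft_threshold` and the threshold symbol `tau` are assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1: shrink every entry towards zero by tau."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# Toy usage: small responses are zeroed, large ones are shrunk by tau.
shrunk = soft_threshold([-3.0, -0.5, 0.0, 0.5, 3.0], tau=1.0)
```

Entries with magnitude below `tau` become exactly zero, which is how the feature maps stay sparse across iterations.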
Update : The subproblem with respect to is
(30) 
We use the online learning algorithm for sparse coding [29] to update the filters . The algorithm utilizes block-coordinate descent with warm restarts.
Update : The subproblem with respect to is
(31) 
The closed-form solution is
(32) 
where .
Update : Following the general ADMM setting, can be updated as:
(33) 
Update : The subproblem with respect to is
(34) 
Its closed-form solution is:
(35) 
where .
Update : The subproblem with respect to is
(36) 
Its closed-form solution is:
(37) 
where .
The algorithm for solving this online transformed MSCSC (OTMSCSC) model can then be summarized as Algorithm 1.
3.4 Some Remarks
3.4.1 Explanation for function of regularizations
It should be noted that the regularization terms in Eq. (21) and Eq. (22) are intrinsically what give the proposed OTMSCSC model its superiority in removing dynamic rain/snow. Specifically, the offline MSCSC model [24] specifies one unique value for the parameter as well as to represent the background noise variance and the scale parameter of the feature maps representing rain/snow, respectively, for all frames of the video. The offline model is thus only suitable for videos with static backgrounds and consistent rain/snow shapes. The OTMSCSC model, however, can finely handle videos with dynamic rain and varying background noise. This advantage follows naturally from the fact that the model assumes each frame has its own noise parameter and scale parameter , which simultaneously fit the knowledge of the current frame and are regularized by those ( and ) obtained from the previous frames. This makes the model, implemented for each new frame in an online mode, better adapt to the specific structures of rain/snow or background in the current frame, which generally vary from those in previous ones.
To more intuitively clarify this point, we illustrate in Fig. 2 the changing tendencies of parameters and for a sequence of video frames, containing snow varying from heavy to light, as shown in Fig. 10. It can be seen that both and are gradually decreasing along time, finely reflecting the dynamic changes of snow along time.
3.4.2 Background Amelioration
Our method gradually updates the background of the current frame by applying the affine transformation to that of the last frame via Eq. (26). Due to the constant temporal scene shifting of the videos (especially when the camera moves along a certain direction over a short time) and the incremental accumulation of computing errors, the recovered video background tends to gradually deviate from the real one, which makes the rain-removed videos look more or less blurry after the algorithm has run for a while. To alleviate this issue, our algorithm needs to ameliorate the background knowledge after a certain number of frames have been processed.
Our strategy is as follows: when our algorithm has run for iterations (the current frame is denoted as the one), we pick two frames before and after the current frame to form a subgroup as:
(38) 
We then align all other frames to the current frame, in a manner similar to that introduced in Eq. (26), to obtain the aligned subgroup as (a tensor):
(39) 
where (), and is calculated readily by Eqs. (26) to (28). We can then efficiently calculate the optimal rank-one approximation of the unfolded matrix of by SVD, and replace as to get the new ameliorated background initialization.
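The rank-one refresh described above can be sketched as follows. The sketch assumes the aligned subgroup is stored as an (h, w, m) array with the current frame in the middle, and that the refreshed background is the column of the rank-one approximation belonging to that frame; both choices, and the function name `rank_one_background`, are assumptions made for illustration.

```python
import numpy as np

def rank_one_background(aligned_frames):
    """Best rank-one approximation of the unfolded (h*w, m) matrix via SVD.

    Returns the (h, w) image corresponding to the middle (reference) frame.
    """
    h, w, m = aligned_frames.shape
    unfolded = aligned_frames.reshape(h * w, m)  # one column per frame
    u, s, vt = np.linalg.svd(unfolded, full_matrices=False)
    approx = s[0] * np.outer(u[:, 0], vt[0])     # Eckart-Young optimal rank-one fit
    return approx[:, m // 2].reshape(h, w)

# Toy usage: five perfectly aligned copies of one background are recovered exactly.
rng = np.random.default_rng(0)
true_background = rng.random((4, 6)) + 0.5
subgroup = np.repeat(true_background[:, :, None], 5, axis=2)
refreshed = rank_one_background(subgroup)
```

Since the matrix has only a handful of columns, the thin SVD is cheap, so the refresh adds little cost to the per-frame processing.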
3.4.3 Potential to be used for streaming videos
The proposed OTMSCSC algorithm is implemented in an online mode, i.e., each run processes a single newly arriving frame. This learning manner makes our method potentially applicable to practical streaming videos. Specifically, in each implementation stage for a frame , the algorithm only requires a fixed amount of memory to store the related parameters . Besides, since the implementation is the same for each new frame, its time complexity is also fixed in each learning stage. This makes our method potentially feasible for practical videos arriving continuously in a streaming format, unlike current offline methods, which not only need more space for longer videos, but also require more time for longer video sequences (and may even need to rerun the algorithm on the entire video). This makes them hardly usable for this typical real video format in practice. Comparatively, our method makes real-time rain removal possible for practical streaming video. What remains is to improve the efficiency of our algorithm on one frame so that it gradually meets real-time requirements. Possible approaches include further improving hardware power, further speeding up the algorithm implementation (e.g., making it distributed/parallel or porting it to a faster implementation platform), or replacing some of its stages with faster algorithms. This is a meaningful issue worthy of further endeavors in future research.
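The fixed-memory property can be illustrated with a generic online loop. This is only a structural sketch: `run_online` carries a single state object between frames, and `toy_step` is a hypothetical stand-in for one per-frame solve (it merely subtracts a running mean), not the OTMSCSC update itself.

```python
import itertools

def run_online(frames, step, init_state):
    """Process an iterable of frames one at a time, keeping only the latest state."""
    state = init_state
    for frame in frames:
        output, state = step(frame, state)
        yield output

def toy_step(frame, state):
    """Hypothetical per-frame solve: output the frame minus a running mean."""
    return frame - state, 0.9 * state + 0.1 * frame

# An endless stream is fine because nothing but `state` is kept between frames.
stream = itertools.count()  # stands in for frames 0, 1, 2, ...
first_three = list(itertools.islice(run_online(stream, toy_step, 0.0), 3))
```

Because the loop never stores past frames, memory use is independent of the video length, and the cost per frame is whatever a single call to `step` costs.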
4 Experimental Results
In this section, we evaluate the performance of our method on videos with synthetic and real rain/snow from both quantitative and qualitative perspectives. Several state-of-the-art video rain/snow removal methods have also been implemented for comparison, including Garg et al. [13] (http://www.cs.columbia.edu/CAVE/projects/camera rain/), Jiang et al. [20] (code provided by the authors), Ren et al. [33] (http://vision.sia.cn/our%20team/RenWeihonghomepage/visionrenweihong%28English%29.html), Wei et al. [40] (http://vision.sia.cn/our%20team/RenWeihonghomepage/visionrenweihong%28English%29.html), Li et al. [24] and Liu et al. [26] (https://github.com/flyywh/J4RNetDeepVideoDerainingCVPR2018). Note that these methods include both model-driven MAP-based and data-driven DL representative state-of-the-art methods, for a comprehensive comparison. Furthermore, by introducing a traditional offline alignment strategy into the MSCSC model, called transformed MSCSC or TMSCSC, this offline method can also be ameliorated to adapt to videos with background transformations. All experiments were implemented on a PC with an i7 CPU and 32 GB RAM. For a more comprehensive comparison, more video demonstrations of the results obtained by the competing methods are available on our dedicated website (https://sites.google.com/view/onlinetmscsc/) for easy observation.
4.1 Experiments on Videos with Synthetic Rain/Snow
We first introduce experiments on four videos with synthetic rain/snow: three with static backgrounds, as shown in Fig. 3–5, and one with an evidently dynamic background exhibiting clear translations between adjacent frames, as depicted in Fig. 6. The clean videos shown in Fig. 4 and Fig. 6 were downloaded from the CDNET database [17] (footnote 7: http://www.changedetection.net), and those of Fig. 3 and Fig. 5 were obtained from Youtube (footnote 8: https://www.youtube.com/watch?v=aOhdnllS0_k) and a Xi’an Jiaotong University surveillance camera, respectively. In particular, the videos shown in Fig. 4 and Fig. 5 contain heavy rain and heavy snow, respectively, which seriously occlude the background scene and foreground objects throughout the sequences. The various types of rain/snow were synthetically generated with Photoshop on a black background.
From Fig. 3 and Fig. 4, we can easily observe that the compared methods proposed by Garg et al., Jiang et al. and Liu et al. have not completely removed the rain streaks, and that the method proposed by Ren et al. has not preserved the shape of the moving object well when removing the rain streaks. Besides, as shown in the second row of Fig. 3, the rain layers extracted by all other competing methods contain some amount of additional background information. Comparatively, the proposed OTMSCSC method, as well as its offline version MSCSC, removes the rain in the video well while maintaining the shape and texture details.
From Fig. 5, it can be seen that most competing methods have not removed the snow from the video well, and the snow layer separated by the MSCSC method improperly contains certain moving objects. Comparatively, the OTMSCSC method performs better in both snow removal and background/foreground detail preservation.
For the dynamic video shown in Fig. 6, we can observe that the methods proposed by Garg et al. and Ren et al. have not fully removed the rain details in the images. The method proposed by Ren et al., as well as the offline TMSCSC method, has not preserved the structure of the moving objects in the foreground well, and that proposed by Jiang et al. has also not done well in background detail preservation (such as the texture of the wall). Comparatively, our proposed OTMSCSC method attains a relatively better performance in both aspects.
Quantitative comparisons are also presented in Table I, and they are consistent with the aforementioned visual observations. Specifically, we adopt four image quality assessment (IQA) metrics to evaluate the performance of all competing methods, namely, PSNR, VIF [35], FSIM [49] and SSIM [39]. From the table, it can be seen that our proposed OTMSCSC model performs best or second best in almost all cases in terms of all IQA metrics. Note that all other methods either run on the entire video (iterating over it multiple times) or need additional precollected training data, whereas our method processes the video sequentially (i.e., each frame is used only once and then discarded); it is therefore reasonable to say that our method is efficient.
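As an illustration of the simplest of these metrics, PSNR can be sketched as follows. This is a minimal sketch, not the evaluation code used for Table I: the function names `psnr` and `video_psnr`, the peak value of 255 and the averaging of per-frame PSNR over the video are all assumptions made for illustration.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio (in dB) between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

def video_psnr(reference_frames, estimate_frames, peak=255.0):
    """Average per-frame PSNR (assumption: the score is averaged over frames)."""
    scores = [psnr(r, e, peak) for r, e in zip(reference_frames, estimate_frames)]
    return float(np.mean(scores))
```

A higher PSNR means the derained frame is closer to the clean ground truth, which is why this metric is only available for the synthetic videos.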
4.2 Experiments on Videos with Real Rain/Snow
We further evaluate the performance of the proposed method on videos with real rainy or snowy scenarios. Eight real videos are included in our experiments: three captured under static backgrounds (as shown in Fig. 7–9) and five under dynamic backgrounds (as shown in Fig. 10–14) with typical transformations such as random jitter, translation, scale transformation and aerial view. Fig. 7 and Fig. 11 are two public rain videos both used in [14], and the videos of Fig. 8, Fig. 10, Fig. 12 and Fig. 13 were downloaded from Youtube (footnote 9: https://www.youtube.com/watch?{v=KzEv1hJgaY, v=kNTYEKjXqzs, v=wb3gWRcKyCI, v=HbgoKKj7TNA}), respectively.
The videos shown in Fig. 7 and Fig. 8 were captured by street surveillance equipment and contain rain structures that vary dynamically over time. From the figures, we can easily observe that the derained frames of all other compared methods still contain certain rain streaks and that the extracted rain layer is mixed with edges from the background. By contrast, the OTMSCSC method, as well as MSCSC, is capable of better removing the rain streaks without mixing extra information into the rain layer.
Fig. 9 and Fig. 10 show two real snowy video sequences captured on a real scene with poor visibility containing dynamic backgrounds. It is easy to see from the figures that most other competing methods perform worse in snow removal, especially in the area around the light. Comparatively, our method removes the snow well and preserves the texture details of the frame.
Fig. 11 and Fig. 12 show the snow removal results on real videos with fast horizontal movement and obvious illumination variations, respectively. From Fig. 11, it can be seen that the methods proposed by Garg et al. and Jiang et al. cannot fully remove the snow or recover the texture information underlying the frames. The methods proposed by Ren et al. and Liu et al. fail to detect and remove the snowflakes since they are not capable of dealing with video transformations. The OTMSCSC method, as well as TMSCSC, obtains better visual performance since both consider the background transformation in the modeling. This verifies that aligning the video background helps to improve the final performance of snow removal, especially for dynamic videos.
Fig. 13 and Fig. 14 show two challenging real snowy videos. Fig. 13 was captured in light snow with most of the background covered in white snow, so it is not easy even for humans to observe the falling snow in a frame. Fig. 14 is also challenging since it is an aerial video with evident scale variations across frames. It can be seen from the figures that our method still performs relatively satisfactorily on these videos, which verifies its robustness in real cases. Please refer to the website (footnote 6) for a more comprehensive illustration of the video results.
4.3 Run Time Comparison
To show the efficiency of the proposed online method, Table II lists the average running time per frame of each compared method on four representative videos, covering static and dynamic backgrounds with synthetic and real rain/snow. From the table, the speed advantage of the OTMSCSC method is evident and can be attributed to its online learning manner. Besides, as shown in Fig. 15, this online method scales well: its total time cost increases linearly with the number of input frames, naturally due to its fixed training time on each frame. Together with its fixed space complexity over time, as discussed in Sec. 3.4.3, the method is expected to be potentially useful for real streaming videos.
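The per-frame timing and cumulative cost described above could be measured along the following lines. This is a hedged sketch rather than the benchmarking code behind Table II or Fig. 15: `time_per_frame` and its `process_frame` argument are hypothetical names, and any callable that handles one frame can be passed in.

```python
import time

def time_per_frame(process_frame, frames):
    """Return the average per-frame time and the cumulative time after each frame."""
    cumulative = []
    total = 0.0
    for frame in frames:
        start = time.perf_counter()
        process_frame(frame)  # hypothetical per-frame routine
        total += time.perf_counter() - start
        cumulative.append(total)
    average = total / len(cumulative) if cumulative else 0.0
    return average, cumulative

# Usage with a trivial stand-in for the per-frame routine.
average, cumulative = time_per_frame(lambda f: sum(f), [[1, 2, 3]] * 5)
```

If the per-frame cost is roughly constant, plotting `cumulative` against the frame index gives an approximately straight line, which is the linear scaling behavior reported for the online method.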
5 Conclusion
In this paper, we have proposed a new rain/snow removal method for surveillance videos containing dynamic rain/snow and captured with camera jitter. Both kinds of dynamics inevitably encountered in real cases, namely the variation of rain/snow and of the background scene over time, are fully considered in our method. In particular, the method has a natural online implementation, with fixed space and time complexity for handling each frame of a continuously arriving video, making it potentially useful for practical streaming video sequences. In the future, we will further improve the capability of the proposed method on more challenging videos, such as those captured by moving cameras or those with strongly contrasting background colors and large rain/snow streaks. We will also try to design suitable techniques or use more advanced computing equipment to further speed up the processing of each frame so that it meets the real-time requirements of practical streaming videos.
References
 [1] (2000) An optimization approach for high quality multiscale image segmentation. Beitrage Zum Agitsymposium (), pp. 12–23. Cited by: §2.3.
 [2] (2010) Analysis of rain and snow in frequency space. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 86 (2), pp. 256. Cited by: §2.1.
 [3] Spatiotemporal frequency analysis for removing rain and snow from videos. Proceedings of the First International Workshop on Photometric Analysis For Computer Vision. Cited by: §1, §2.1.
 [4] (2001) Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (11), pp. 1222–1239. Cited by: §3.3.
 [5] (2016) Robust lowrank matrix factorization under general mixture noise distributions. IEEE Transactions on Image Processing 25 (10), pp. 4677–4690. Cited by: §3.1.
 [6] (2018) Robust video content alignment and compensation for clear vision through the rain. Computer Vision and Pattern Recognition (), pp. . Cited by: §1, §2.1.
 [7] (2013) A generalized lowrank appearance model for spatiotemporally correlated rain streaks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1968–1975. Cited by: §1, §2.1.
 [8] (2005) Histograms of oriented gradients for human detection. IEEE Computer Society Conference on Computer Vision 1 (12), pp. 886–893. Cited by: §1.
 [9] (2016) Single image rain and snow removal via guided l0 smoothing filter. Multimedia Tools and Applications 75 (5), pp. 2697–2712. Cited by: §2.2.
 [10] (2010) Person reidentification by symmetrydriven accumulation of local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. . Cited by: §1.
 [11] (2017) Clearing the skies: a deep network architecture for singleimage rain removal. IEEE Transactions on Image Processing 26 (6), pp. 2944–2956. Cited by: §2.2.
 [12] (2017) Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3855–3863. Cited by: §2.2.
 [13] (2004) Detection and removal of rain from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I–I. Cited by: §1, §2.1, Fig. 6, Fig. 11, Fig. 12, Fig. 13, Fig. 14, §4.
 [14] (2005) When does a camera see rain?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1067–1074. Cited by: §1, §2.1, §4.2.
 [15] (2007) Vision and rain. International Journal of Computer Vision 75 (1), pp. 3–27. Cited by: §2.1, Fig. 3, Fig. 4, Fig. 5, TABLE I, Fig. 10, Fig. 7, Fig. 8, Fig. 9.
 [16] Multiscale orderless pooling of deep convolutional activation features. Springer International Publishing. Cited by: §2.3.
 [17] (2012) Changedetection. net: a new change detection benchmark dataset. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 1–8. Cited by: §4.1.
 [18] (2017) Joint convolutional analysis and synthesis sparse representation for single image layer separation.. In Proceedings of the IEEE International Conference on Computer Vision, pp. . Cited by: §2.2.
 [19] (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11), pp. 1254–1259. Cited by: §1.
 [20] (2017) A novel tensorbased video rain streaks removal approach via utilizing discriminatively intrinsic priors.. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. . Cited by: §2.1, Fig. 3, Fig. 4, Fig. 5, Fig. 6, TABLE I, Fig. 10, Fig. 11, Fig. 12, Fig. 13, Fig. 14, Fig. 7, Fig. 8, Fig. 9, §4.
 [21] (2012) Automatic singleimagebased rain streaks removal via image decomposition. IEEE Transactions on Image Processing 21 (4), pp. 1742–1755. Cited by: §2.2.
 [22] (2015) Video deraining and desnowing using temporal correlation and lowrank matrix completion.. IEEE Transactions on Image Processing 24 (9), pp. 2658–70. Cited by: §2.1.
 [23] (2004) What energy functions can be minimized via graph cuts?. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2), pp. 147–159. Cited by: §3.3.
 [24] (2018) Video rain removal by multiscale convolutional sparse coding. Computer Vision and Pattern Recognition (), pp. . Cited by: §1, §3.1, §3.1, §3.4.1, §3, §4.
 [25] (2016) Rain streak removal using layer priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2736–2744. Cited by: §2.2.
 [26] (2018) Erase or fill? deep joint recurrent rain removal and reconstruction in videos. Computer Vision and Pattern Recognition (), pp. . Cited by: §1, §2.1, Fig. 3, Fig. 4, Fig. 5, Fig. 6, TABLE I, Fig. 10, Fig. 11, Fig. 12, Fig. 13, Fig. 14, Fig. 7, Fig. 8, Fig. 9, TABLE II, §4.
 [27] (2018) DesnowNet: contextaware deep network for snow removal. IEEE Transactions on Image Processing 27 (6), pp. 3064–3073. Cited by: §2.2.
 [28] (2015) Removing rain from a single image via discriminative sparse coding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3397–3405. Cited by: §2.2.
 [29] (2009) Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research 11 (1), pp. 19–60. Cited by: §3.3.
 [30] (2013) Robust matrix factorization with unknown noise. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1337–1344. Cited by: §3.1.
 [31] (2012) Improve robustness of sparse pca by l 1norm maximization. In Pattern Recognition, pp. 487–497. Cited by: §3.1.
 [32] (2014) Combating bad weather part i: rain removal from video. Synthesis Lectures on Image, Video, and Multimedia Processing 7 (2), pp. 1–93. Cited by: §1.
 [33] (2017) Video desnowing and deraining based on matrix decomposition.. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. . Cited by: §1, §2.1, Fig. 3, Fig. 4, Fig. 5, Fig. 6, TABLE I, Fig. 10, Fig. 11, Fig. 12, Fig. 13, Fig. 14, Fig. 7,