Triple-level Model Inferred Collaborative Network Architecture for Video Deraining

by   Pan Mu, et al.
Dalian University of Technology
NetEase, Inc

Video deraining is an important issue for outdoor vision systems and has been investigated extensively. However, designing optimal architectures that aggregate model formulation and data distribution remains a challenging task for video deraining. In this paper, we develop a model-guided triple-level optimization framework that deduces a network architecture through a cooperating optimization and auto-searching mechanism, named Triple-level Model Inferred Cooperating Searching (TMICS), to deal with various video rain circumstances. In particular, to mitigate the problem that existing methods cannot cover various rain streak distributions, we first design a hyper-parameter optimization model over the task variable and hyper-parameter. Based on the proposed optimization model, we design a collaborative structure for video deraining. This structure includes a Dominant Network Architecture (DNA) and a Companionate Network Architecture (CNA) that cooperate through an Attention-based Averaging Scheme (AAS). To better explore inter-frame information from videos, we introduce a macroscopic structure searching scheme that selects between an Optical Flow Module (OFM) and a Temporal Grouping Module (TGM) to help restore the latent frame. In addition, we apply differentiable neural architecture search over a compact candidate set of task-specific operations to discover desirable rain streak removal architectures automatically. Extensive experiments on various datasets demonstrate that our model achieves significant improvements in fidelity and temporal consistency over state-of-the-art works. Source code is available at





I Introduction

With the flourishing development of computer technology, outdoor vision systems play a critical role in many real-world applications. For instance, the advent of low-cost video capture systems has made it easier for various organizations to adopt surveillance technology. However, bad weather harms perceptual performance and degrades video quality. This degradation can impair outdoor multimedia systems and reduce the visibility of real-world images captured by camera drones. Thus, developing efficient rain streak removal methods is imperative for a wide range of computer vision tasks, such as video surveillance, intelligent vehicles, object detection, tracking, and remote sensing monitoring.

Fig. 1: Visual comparison of different video deraining methods on real-world rainy frames. Compared with JORDER [51], J4RNet [25], FastDeRain [18] and SPANet [47], our developed method (i.e., TMICS) clearly achieves the best visual quality under these various types of complex rain streaks.

An early study on video deraining introduced temporal median filters [44] to process each pixel. In [12], a comprehensive model for the visual appearance of rain was proposed by developing two separate models that capture the dynamics and the photometry of rain through its physical properties. Subsequently, a large number of research efforts have been dedicated to removing rain streaks from different rain scenes. The existing methods mainly fall into two categories: frequency domain-based methods and time domain-based schemes.

Frequency domain-based methods rely on characteristics in the frequency space. For example, in [1], by applying physical and statistical characteristics in the frequency domain, researchers derived a simple rain model to determine the general properties of a single streak. However, frequency domain-based methods tend to estimate rain streaks only roughly, and often produce errors for complicated rain streaks.

To further improve video deraining performance, some schemes focus on utilizing the temporal dynamics of consecutive frames to remove rain streaks from rainy videos [4, 17, 40, 25, 18, 48]. Specifically, by dividing rain streaks into two types (i.e., sparse and dense), a frequency domain-based matrix decomposition methodology [40] was applied to remove rain streaks. Based on the discriminative characteristics of rain streaks, tensor-based video deraining methods were developed [17, 22, 18, 48]. These schemes are not effective for heavy rain because they lack data-driven information. Recently, data-driven deep learning methodologies have become popular in rain streak removal applications, such as CNN-based reconstruction networks [25], recurrent dual-level flow networks [50] and self-learned networks [54]. However, the performance of CNN-based methods depends on well-designed architectures and subtle adjustments that are hard to construct.

To mitigate the above issues, we develop a model-guided auto-searching method for removing various video rain streaks. Specifically, based on the formulated triple-level video deraining optimization framework, a collaborative learning scheme is deduced by introducing a cooperating optimization and network architecture searching mechanism, named Triple-level Model Inferred Cooperating Searching (TMICS). The pipeline is shown in Figure 2. Firstly, different from existing video deraining works that only consider the motion case, we introduce a macroscopic structure searching scheme that selects between an Optical Flow-estimation Module (OFM) and a Temporal Grouping Module (TGM) for inter-frame information extraction. Secondly, existing learning-based video deraining methods rely on training data and cannot cover rain streak distributions that differ significantly from the training data. To mitigate this issue, we design a hyper-parameter optimization model over the task variable and hyper-parameter. For the task variable propagation, we develop a collaborative structure, i.e., a Dominant Network Architecture (DNA) and a Companionate Network Architecture (CNA), to deduce the cooperating optimization procedure. In addition, we introduce an Attention-based Averaging Scheme (AAS) to effectively fuse features from the collaborative structures under the guidance of the hyper-parameter. Thirdly, since the performance of CNN-based methods depends on well-designed architectures and subtle adjustments that are hard to construct, we apply microscopic network architecture search [24] over a compact task-specific search space, based on the above hyper-parameter optimization model, to discover desirable video deraining architectures. Finally, an ablation study demonstrates the effectiveness of the developed modules, and extensive evaluations illustrate that the proposed framework performs favorably against state-of-the-art video deraining methods.
Figure 1 demonstrates the superiority of our proposed approach over existing methods on a challenging video sequence. The contributions are summarized as follows:

  • We develop a triple-level model-driven framework with macroscopic structure searching and microscopic collaborative architecture searching schemes for video deraining. The designed model not only optimizes task variables (or network parameters), but also optimizes hyper-parameters and architecture weights.

  • To mitigate the problem that existing methods cannot cover various rain streak distributions, we design a hyper-parameter optimization model which has the ability to estimate various rain streaks. We design a collaborative structure, i.e., DNA and CNA, via AAS to preserve details and structure. This structure helps improve the generalization capability of the network.

  • We apply an automatic microscopic search strategy over a task-specific search space to discover a desirable network. To find a suitable video frame estimation module, we design a macroscopic structure searching scheme that combines OFM and TGM. These search strategies provide a trade-off between automated search and experience-driven design.

  • Extensive experiments demonstrate the effectiveness of our proposed method and show comparable results against state-of-the-art video deraining methods on different video rainy benchmarks.

Fig. 2: Schematic of proposed framework TMICS for video deraining. (I): A rough framework based on the designed model in Eq. (2). (II): The detailed procedure of TMICS with deep network architecture.

II Related Work

Single Image Deraining: Traditional single image deraining methods usually apply inherent physical features to characterize rain streaks. For example, in [19], sparse coding was applied to separate rain streaks from the high-frequency layer. Prior-based strategies explore prior knowledge to recover a clear image from the rainy one, such as morphological component analysis [11], low-rank assumptions [2], guided filtering [60, 8], dictionary learning [33], Gaussian mixture models [23] and joint convolutional analysis and synthesis sparse representation [15]. In recent years, deep learning-based methods have dominated the rain streak removal literature, such as shallow CNN-based schemes [10, 39], dilated convolutions [51], dense blocks [57] and self-supervised methods [55]. Besides, model-driven methodologies have also witnessed the rapid progress of deep learning in the image deraining field [46, 35, 3, 53, 31]. Additionally, by introducing a directional gradient operator of arbitrary direction, an efficient and robust constraint-based model was proposed in [36].

Video Deraining: Different from single image rain streak removal, video deraining can additionally exploit temporal correlation and dynamics to explore intrinsic properties. Early attempts at video deraining were developed in [44, 12, 13], which utilized a space-time correlation model to analyze the visual effects of rain streaks. Thereafter, a variety of methods were proposed for video deraining. One branch of these methods explores inherent rain streak priors and general background signals, such as morphological component analysis [20], spatio-temporal correlation of patch groups [6], directional priors of rain streaks [17], Gaussian mixture models [49], low-rank regularization [21, 48], matrix decomposition [40] and tensor models [17, 22, 48]. Deep learning-based schemes have also been investigated for rain streak removal [25, 50, 54, 42, 52]. For example, in [25], a CNN-based reconstruction network was developed for video rain streak removal by integrating rain degradation classification and spatial-texture-based rain removal. In [50], a recurrent network was designed to synthesize visually authentic rain videos, predict the rain-related variables and perform an inverse scheme to estimate the rain-free frame.

Rain Model: In general, the video rain model can be formulated as O_t = B_t + R_t, where t indexes the time-steps of the video frames, and O_t, B_t and R_t denote the captured frame with rain streaks, the rain-free background and the rain streaks, respectively. In [40], a reconstructed rain model was developed by separating rain streaks into two types, i.e., sparse and dense rain streaks. This model can be written as

O_t = B_t + F_t + S_t + D_t,

where F_t, S_t and D_t represent the intensity fluctuations caused by foregrounds, sparse rain streaks and dense rain streaks, respectively. In [50], rain accumulation and accumulation flow were considered in the following form

O_t = alpha_t ⊙ (B_t + R_t + H_t) + (1 − alpha_t) ⊙ A,

where A is the global atmospheric light, alpha_t means the atmospheric transmission, which is correlated with the scene depth, and H_t denotes the rain accumulation flow layer.
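As a concrete illustration of the additive model above, the sketch below composes a rainy observation from a clean background and a synthetic streak layer. The streak pattern, density and sizes here are illustrative assumptions, not the paper's generator.

```python
import numpy as np

def compose_rainy_frame(background, rain):
    """Additive video rain model: O_t = B_t + R_t (clipped to the valid range)."""
    return np.clip(background + rain, 0.0, 1.0)

def synthetic_streaks(shape, density=0.01, length=5, seed=0):
    """Toy diagonal rain-streak layer (illustrative only)."""
    rng = np.random.default_rng(seed)
    h, w = shape
    r = np.zeros(shape)
    for _ in range(int(density * h * w)):
        y, x = rng.integers(0, h), rng.integers(0, w)
        for k in range(length):  # draw a short diagonal streak
            r[min(y + k, h - 1), min(x + k, w - 1)] = 0.8
    return r

b = np.full((32, 32), 0.3)  # flat clean background
o = compose_rainy_frame(b, synthetic_streaks((32, 32)))
```

Recovering B_t then amounts to estimating and subtracting the streak layer, which is what the networks below are trained to do.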

Neural Architecture Search: Neural Architecture Search (NAS) [9] aims to discover high-performance task-specific architectures automatically, replacing heavy manual design. Early search strategies for NAS apply evolutionary algorithms [37, 38] and reinforcement learning [61], which achieve remarkable performance but spend a large amount of computation on architecture evaluations. In light of these computational issues, various gradient-based differentiable approaches have been proposed. By performing gradient descent in a bi-level search scheme, a continuous relaxation method was developed [24] to optimize both the model weights and the architecture parameters. Several differentiable architecture search algorithms also show comparable performance on low-level vision tasks. For image restoration (e.g., image deraining), HiNAS [56] searches for both inner cell architectures and outer layer widths to be memory- and computation-efficient, while CLEARER [14] combines a multi-scale search space with task-flexible modules for image denoising and deraining.

III Our Approach

In this section, we first construct a triple-level optimization model to jointly optimize network architectures, task variables and hyper-parameters in Section III-A. Subsequently, we provide a detailed procedure to deduce the optimization model in Section III-B. Finally, the training details are illustrated in Section III-C. The overall procedure is presented in Algorithm 1 and the detailed pipeline is shown in Figure 2.

III-A Model Formulation for Video Deraining

In this work, we take into account a more complicated rain synthesis model and explore the temporal information of rainy videos. With this goal, we first re-construct the rainy video frame with the following model:

O_t = B_t + R_t,  with  R_t = lambda ⊙ R̄_t + (1 − lambda) ⊙ G(R̄_t),   (1)

where ⊙ and * represent the element-wise multiplication and convolution operations, respectively. G denotes the constructed model aiming to generate auxiliary rain streaks, and lambda means the weight between different rain streak types. Indeed, R̄_t is usually considered as an independent and identically distributed sample.

Note that the main aim of video deraining is to find a clear background frame, i.e., B_t. Under the above model (1), the introduced hyper-parameter lambda aims to help us find a better task variable. Thus, a hyper-parameter optimization model (see [30, 26, 27]) is introduced to jointly optimize the task variable and hyper-parameter, which can be formulated as the following scheme


min_lambda ℓ_up(B*(lambda)),  s.t.  B*(lambda) ∈ argmin_B ℓ_low(B, lambda; T(O)),   (2)

where ℓ_up and ℓ_low denote the loss functions in the upper level and lower level, respectively, T represents the temporal alignment module, which takes n consecutive frames O as input, and n is the input frame number. Indeed, the module T is designed to introduce inter-frame information, which is used in different forms by video deraining methods [50, 54].

As for optimizing B in Eq. (2), it can be written as an averaging scheme based on [29, 41, 28]. Thus, for any fixed lambda, we have the following averaging-based scheme:

B_t = lambda ⊙ N_D(O_t; theta_D*) + (1 − lambda) ⊙ N_C(O_t; theta_C*),   (3)

where N_D and N_C respectively denote the constructed Dominant Network Architecture (DNA) and Companionate Network Architecture (CNA), and theta_D* and theta_C* represent the learned optimal network parameters of N_D and N_C, respectively. Here O_t and B_t^gt represent the input rainy image of the current frame and the corresponding ground truth, respectively. As for the lambda-subproblem, a single numerical parameter cannot well coordinate the constructed dominant and companionate networks. Thus, with the above obtained B_t, we introduce an Attention-based Averaging Scheme (AAS) to learn the lambda-subproblem.
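The averaging-based scheme can be sketched as a convex blend of the two branch outputs. The stand-in "branch outputs" below are simple placeholders for the DNA and CNA restorations; the per-element weight is an assumption for illustration.

```python
import numpy as np

def blend(dna_out, cna_out, lam):
    """Averaging scheme: B = lam * N_D(O) + (1 - lam) * N_C(O)."""
    lam = np.asarray(lam)
    return lam * dna_out + (1.0 - lam) * cna_out

# placeholder branch outputs standing in for DNA / CNA restorations
o = np.random.default_rng(0).random((8, 8))
dna_out = o * 0.9  # hypothetical DNA estimate
cna_out = o * 0.5  # hypothetical CNA estimate

b_half = blend(dna_out, cna_out, 0.5)
```

With lam = 1 the blend reduces to the dominant branch and with lam = 0 to the companionate branch, which is why a single fixed scalar is too rigid and motivates the attention-based weighting below.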

Moreover, designing networks by experience for video deraining requires substantial effort. How to trade off between automated search and experience-driven design is a crucial task for discovering the structure and task-specific architectures (i.e., N_D and N_C). Motivated by the success of NAS in the high-level vision field [24], under the bi-level optimization scheme in Eq. (2), we construct a triple-level optimization model to search network architectures for video rain streak removal. The above model can be reformulated as the following form:

min_lambda ℓ_up(theta*, alpha*, beta*, lambda; D_val),
s.t.  (alpha*, beta*) ∈ argmin_{alpha, beta} ℓ_up(theta*(alpha, beta), alpha, beta, lambda; D_val),
      theta*(alpha, beta) ∈ argmin_theta ℓ_low(theta, alpha, beta, lambda; D_tr),   (4)

where alpha and beta are the architecture relaxation weights, and D_tr and D_val denote the training and validation datasets. Indeed, Eq. (2) and Eq. (4) imply a triple-level optimization model with network parameters theta_D, theta_C (the first level), architecture parameters alpha, beta (the second level) and the hyper-parameter lambda (the third level). We then provide a detailed procedure for finding the optimal solution of the above optimization formulation in Section III-B. We summarize some necessary definitions in Table I.

III-B Optimization Procedure

With the developed optimization models in Eqs. (2)-(3), this part introduces microscopic architecture searching for N_D and N_C and macroscopic structure searching for the frame alignment module T. Specifically, we construct a collaborative rain streak removal structure, i.e., N_D and N_C, to jointly characterize different rain streak distributions. Then, with the above structures, we design an attention-based module to simulate lambda.

III-B1 Optimizing Hyper-parameter lambda

As for lambda, we design an attention mechanism to obtain an adaptive map that directly learns inter-channel information from the global context and contributes to our performance. Indeed, the attention-based module automatically coordinates different video rain streak distributions and helps improve the generalization capability of the network. In detail, given two restored frames, we perform two shared convolutions to extract features and then apply global average pooling to encode the weight of each spatial channel. A soft-max operator is adopted to generate lambda for each estimated frame.
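A minimal sketch of the attention scheme described above, assuming a shared linear projection in place of the shared convolutions: each restored feature map is projected, globally average-pooled per channel, and a soft-max across the two branches yields channel-wise fusion weights. All layer shapes here are illustrative assumptions.

```python
import numpy as np

def attention_average(f1, f2, w_shared):
    """Fuse two restored feature maps (C, H, W) with channel-wise soft-max weights."""
    def channel_logits(f):
        proj = np.einsum('dc,chw->dhw', w_shared, f)  # shared projection (stand-in for conv)
        return proj.mean(axis=(1, 2))                 # global average pooling -> (C,)
    z1, z2 = channel_logits(f1), channel_logits(f2)
    e1, e2 = np.exp(z1), np.exp(z2)
    lam = (e1 / (e1 + e2))[:, None, None]             # soft-max over the two branches
    return lam * f1 + (1.0 - lam) * f2, lam

rng = np.random.default_rng(0)
c, h, w = 4, 6, 6
f1, f2 = rng.random((c, h, w)), rng.random((c, h, w))
w_shared = rng.standard_normal((c, c)) * 0.1
fused, lam = attention_average(f1, f2, w_shared)
```

Because the soft-max is taken over the two branches per channel, the fusion weights always sum to one and the fused result stays a convex combination of the two restorations.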

III-B2 Macroscopic Structure Search for the Frame Alignment Scheme

Choosing suitable alignment modules to estimate spatial-temporal information from the task-specific data distribution is important for video deraining. In this work, we introduce a macroscopic structure search mechanism to select a suitable alignment module. In other words, we choose either TGM or OFM as the frame alignment module via NAS. Indeed, we can formulate this search strategy using relaxation weights,

T(O) = beta_OFM ⊙ OFM(O) + beta_TGM ⊙ TGM(O),

where beta_OFM + beta_TGM = 1 (obtained by a soft-max over beta). We feed these hybrid aligned features into the search phase. The detailed structure is shown in Figure 3.
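The relaxation over the two alignment candidates can be sketched as a soft-max mixture of module outputs. The toy "modules" below (a mean and a median over the window) are placeholders for OFM and TGM, not their actual computations.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def relaxed_alignment(frames, beta):
    """Soft-max-weighted mixture of two candidate alignment modules."""
    ofm_out = frames.mean(axis=0)        # placeholder for optical-flow alignment
    tgm_out = np.median(frames, axis=0)  # placeholder for temporal grouping
    w = softmax(np.asarray(beta, dtype=float))
    return w[0] * ofm_out + w[1] * tgm_out, w

frames = np.random.default_rng(1).random((5, 8, 8))
mixed, w = relaxed_alignment(frames, beta=[0.2, -0.1])
```

After the search, the entry of beta with the larger weight determines which module is kept, collapsing the soft mixture back to a discrete choice.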

For the motion estimation (i.e., OFM), by applying recurrent all-pairs field transforms as stated in [45], we introduce an optical flow prediction to obtain spatial-temporal information from a sequence of video frames. With this pre-trained network, we produce a series of aligned frames for subsequent spatial-temporal information extraction by warping each neighboring frame with the updated flow.

For TGM, the input sequence is divided into two groups [16]. Specifically, given a consecutive rainy video frame sequence, we select the neighboring frames around the reference frame O_t. Note that each group represents a certain frame rate. Then, residual blocks and fusion blocks with shared weights are applied to extract and fuse spatial-temporal information within the above two groups, where the residual block is as stated in Figure 3.
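Under the assumption that the two groups sample the five-frame neighborhood at different frame rates (a common grouping in temporal-group methods; the exact index pattern did not survive extraction and is our assumption), the grouping can be sketched as:

```python
def temporal_groups(t):
    """Split a 5-frame neighborhood around reference frame t into two
    frame-rate groups (indices only; illustrative pattern)."""
    consecutive = [t - 1, t, t + 1]  # fine frame rate
    dilated = [t - 2, t, t + 2]      # coarse frame rate
    return consecutive, dilated

g1, g2 = temporal_groups(10)
```

Both groups share the reference frame, so their fused features remain aligned to the frame being restored.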

III-B3 Constructing Auxiliary Rain Streaks

After the above calculation, we roughly estimate the rain streaks R̄_t using a simple residual network structure. We apply the estimated rain streaks to encode the physical structural factors underlying rain streaks, which can be written as

G(R̄_t) = Σ_i w_i × (K_i * R̄_t),

where K_i represents a constructed filter, w_i denotes the corresponding weight, and × means the multiplication operation. In this model, we apply two groups of filters to simulate rain types. Consequently, the re-constructed frames can be obtained via Eq. (1). As a real-world rainy scene usually contains many rain streak cases, it is troublesome to cover intricate rain streak distributions (such as heavy or combined rain streaks) with a simple rain model. The constructed model enables the capability of characterizing a wide range of rain streaks, such as small/large and blurred rain streaks.
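The auxiliary streak generator described above, a weighted sum of filtered versions of the estimated streaks, can be sketched as follows. The filter kernels and weights are illustrative assumptions standing in for the learned groups of filters.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 2-D convolution with zero padding (same output size)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel[::-1, ::-1])
    return out

def auxiliary_streaks(r_est, kernels, weights):
    """G(R) = sum_i w_i * (K_i convolved with R)."""
    return sum(w * conv2d_same(r_est, k) for w, k in zip(weights, kernels))

r = np.zeros((9, 9)); r[4, 4] = 1.0        # impulse "streak" estimate
k_blur = np.ones((3, 3)) / 9.0             # hypothetical blur filter
k_id = np.zeros((3, 3)); k_id[1, 1] = 1.0  # identity filter
g = auxiliary_streaks(r, [k_blur, k_id], [0.5, 0.5])
```

A blur kernel spreads a thin streak into a wider, softer one, so mixing a few such filters lets one estimated streak layer stand in for several streak appearances.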

Fig. 3: A diagram of the architecture to illustrate OFM and TGM. Five consecutive frames are used to derain the middle frame. “Concat” means the concatenation operation.

III-B4 Microscopic Architecture Search for N_D and N_C

Generally, designing an efficient task-specific neural network structure (i.e., N_D and N_C) requires substantial architecture engineering. In other words, it is significant but difficult to decide which modules and how many convolutions, dilations and residual blocks to apply in a network. Focusing on these key points, we apply the differentiable architecture search strategy [24] over a discrete set of candidate operation cells to discover the task-specific DNA and CNA networks simultaneously. These cells consist of an ordered sequence of nodes to automatically discover desirable rain streak removal architectures, and the pipeline is shown in Figure 4. For each intermediate node, the discrete architecture parameters are relaxed into a continuous distribution to perform a differentiable search.

Require: Input necessary parameters.
Ensure: B_t, theta_D, theta_C, alpha and beta.
1:  while not converged do
2:     Estimate the frame content with T.
3:     Obtain complex rain streaks by G.
4:     Learn the microscopic architecture through:
5:     while not converged do
6:        Update theta_D, theta_C with shared weights by gradient descent.
7:     end while
8:     Update architectures alpha and beta by gradient descent.
9:     With the obtained theta and alpha, beta, update lambda via AAS.
10:    Output B_t, theta_D, theta_C, alpha and beta.
11: end while
Algorithm 1 Procedure for solving the model in Eq. (4)
Fig. 4: A diagram of the auto-searching architecture to illustrate the optimization procedure of one cell. (a) The unknown units in initial cell. (b) The searching procedure on candidate units with continuous relaxation. (c) The final obtained cell through the learned relaxation weights.
Notation | Description
⊙ | Element-wise multiplication
* | Convolution operation
× | Multiply
K | Constructed filter
lambda | Hyper-parameter
G | Generating complex input frame
O_t | Input frame
B_t | Background frame
R_t | Rain streaks
alpha, beta | Architecture weights
theta | Network parameters
T | Frame alignment scheme
OFM | Optical flow-estimation module
TGM | Temporal grouping module
N_D, N_C | Task-specific architectures
RB | Residual block
TABLE I: Summary of some important notations.


Dataset | Image Num. (training / test) | Rain Types
RainSynLight25 | 1900 / 775 | synthetic rain streaks [25]
RainSynComplex25 | 1900 / 775 | synthetic rain streaks [25]
NTURain | 2400 / 1682 | synthetic and real rain streaks [5]
LasVR | – / 200 | synthetic rain streaks [32]

TABLE II: Summary of the training / test image numbers and rain streak types of the four datasets used in this work.

Task-oriented Search Space. As a proper search space is crucial for finding the architecture backbone, we introduce a series of basic operators as the compact candidate search space, including residual blocks, dense blocks, dilated convolutions and attention mechanisms, whose effectiveness for rain removal tasks has been corroborated [51, 47]. The pre-defined operators are listed in the following:

  • Residual: residual blocks;

  • Dense: dense blocks;

  • Attention: spatial attention and channel attention;

  • Dilated: dilated convolutions with DF=2,

where DF denotes the dilation rate. Note that each convolution is followed by a batch normalization layer and a ReLU activation layer, and all operators include three layers of convolutions. We take the j-th layer as an example and calculate the choice of a particular operation by relaxing the networks with a soft-max operation. Let x_j be the input of the j-th layer; then we have

x_{j+1} = Σ_k [ exp(alpha_k^j) / Σ_{k'} exp(alpha_{k'}^j) ] · o_k(x_j),

where o_k denotes the k-th candidate operation at the j-th layer and alpha_k^j means the corresponding weight.
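The continuous relaxation over candidate operations can be sketched with a toy candidate set. The operations below (identity, scaling, shift) merely stand in for the residual, dense, attention and dilated cells.

```python
import numpy as np

def mixed_op(x, alphas, ops):
    """DARTS-style relaxation: soft-max-weighted sum over candidate ops."""
    a = np.asarray(alphas, dtype=float)
    w = np.exp(a - a.max())
    w /= w.sum()
    return sum(wk * op(x) for wk, op in zip(w, ops)), w

ops = [lambda x: x,        # stand-in for a residual block
       lambda x: 2.0 * x,  # stand-in for a dense block
       lambda x: x + 1.0]  # stand-in for an attention block

x = np.ones((2, 2))
y, w = mixed_op(x, alphas=[0.0, 0.0, 0.0], ops=ops)        # uniform mixture
y2, w2 = mixed_op(x, alphas=[10.0, 0.0, 0.0], ops=ops)     # nearly discrete choice
```

With equal logits the output is the plain average of the candidates; as one logit dominates, the mixture collapses toward a single operation, which is what the final discretization step exploits.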

The Final Derived Architectures. Subsequently, we leverage the differentiable search to perform Eq. (4). More specifically, we set four layers of intermediate nodes to construct the basic blocks (i.e., cells), and then cascade four cells to establish the concrete architectures (i.e., DNA and CNA) in the searching and training phases. More searching and training configurations are given in Section III-C. It is worth emphasizing that the relaxation weights are shared by both networks to simplify the searching procedure. From the searching phase, we obtain the final architectures for light and heavy rain scenarios respectively. Specifically, for the micro search, the cell for light rain consists of a residual block, spatial attention, dilated convolutions and channel attention, in order. Moreover, the cell for heavy rain includes a residual block, spatial attention, channel attention and a residual block, which indicates that cells for heavy rain have wider receptive fields and more residual information to capture long-range rain streaks. The effectiveness of attention mechanisms and dilated convolutions has also been verified in previous works [47, 51]. As for the temporal alignment module, TGM is selected for heavy rain and OFM for light rain, which may be caused by the serious warping errors of OFM under heavy rain.

In summary, designing an appropriate deep network architecture for each task and dataset is tedious and time-consuming. The neural architecture search strategy finds suitable architectures automatically and provides a trade-off between automated search and experience-driven design. Indeed, the designed microscopic search automatically discovers desirable video deraining architectures from a task-specific search space. Further, the macroscopic structure searching scheme combines the optical flow-estimation and temporal grouping modules to jointly handle motion and rain streak occlusion circumstances. Besides, the designed collaborative scheme can preserve the detail and structure of the video when estimating various rain streak distributions. Thus, the derived model is efficient compared with existing handcrafted networks.

III-C Implementation Details

In this part, we report more details about the loss functions, searching configurations and training configurations.

Loss Functions. As described in Section III, we apply the negative SSIM loss (see [53]) and the L1 loss between the restored video frames B_t and their ground truths as

ℓ = rho · ℓ_SSIM + (1 − rho) · ℓ_1,   (6)

where rho is a trade-off parameter. The same loss function is applied for both levels. As for the search loss, it is constructed as

ℓ_search = ℓ + eta · R(beta),   (7)

where eta is a positive parameter and R(beta) is a regularization on the architecture parameters. Actually, the above R(beta) forces the distribution of beta to gradually ignore the weakest candidates, discovering the most suitable operation of each layer effectively.
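The exact form of the regularizer did not survive extraction; one common choice with the stated effect (gradually suppressing the weakest candidates) is an entropy penalty on the soft-max distribution, sketched below as an assumption.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def entropy_reg(alphas):
    """Entropy of the relaxed operation distribution; minimizing it
    sharpens the soft-max and suppresses weak candidates."""
    p = softmax(np.asarray(alphas, dtype=float))
    return float(-np.sum(p * np.log(p + 1e-12)))

flat = entropy_reg([0.0, 0.0, 0.0, 0.0])   # uniform distribution, maximal entropy
sharp = entropy_reg([5.0, 0.0, 0.0, 0.0])  # peaked distribution, smaller penalty
```

Because the uniform distribution incurs the largest penalty, gradient descent on this term pushes the relaxation toward a near-discrete choice of operation.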

Search Configurations. Considering the different distributions of rain streaks, we choose RainSynLight25 and RainSynComplex25 as the datasets on which to search the respective optimal architectures. During the searching phase, we sample only twenty groups of rainy frames and search for 80 epochs with a batch size of 4. Furthermore, we define the training and validation losses using Eq. (7) with the trade-off parameters set to 0.75 and 0.01. The SGD optimizer is employed for the search phase. The architecture search uses a fixed learning rate, while the initial learning rate for updating the network parameters is decayed with a cosine annealing strategy. To provide a warm start for the network parameters, we only update the network weights during the first 30 epochs.
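The cosine annealing decay used here follows the standard schedule. The endpoint learning rates below are illustrative, since the paper's exact values did not survive extraction.

```python
import math

def cosine_annealing(lr_max, lr_min, epoch, total_epochs):
    """Standard cosine annealing: lr_max at epoch 0, lr_min at the last epoch."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

# hypothetical endpoints over the 160-epoch training run
schedule = [cosine_annealing(1e-3, 1e-5, e, 160) for e in range(161)]
```

The schedule starts flat, decays fastest mid-training, and flattens again near the end, giving the network a gentle finish at the minimum rate.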

Training Configurations. Derived from the searching phase, we perform a two-stage training strategy. First, we train the cooperative architecture (i.e., DNA and CNA) end-to-end with the L1 loss and the negative SSIM loss (i.e., Eq. (6) with the trade-off parameter set to 0.75). We exploit the cosine annealing strategy to decay the initial learning rate. Employing the Adam optimizer and randomly cropping patches, we train the whole network for 160 epochs. Data augmentations, including horizontal and vertical flipping, are applied. After obtaining the pre-trained part of the above networks, we train the fused attention mechanism with a small learning rate for 50 epochs.


Frames | 3 | 5 | 7
RainSynComplex25 | 25.69 / 0.8276 | 28.90 / 0.8743 | 23.91 / 0.7731
LasVR | 32.36 / 0.9063 | 33.41 / 0.9102 | 32.16 / 0.9070

TABLE III: Performance (PSNR / SSIM) of an ablation study on different frame numbers on the RainSynComplex25 and LasVR datasets.
Fig. 5: Ablation study about the triple level models. PSNR and SSIM scores of a video sequence are plotted in the top row with three different settings, i.e., w / o CNA, w / o DNA, w / DNA + CNA. The bottom row shows their visual performances.
Fig. 6: Ablation study on the hyper-parameter. In the top row, the left subfigure shows the 7-th frame of the input video and the right one shows the PSNR scores of a video under different hyper-parameter settings. The bottom row shows crops from the visual results of the different settings. The "Adaptive" setting means the attention mechanism.


Settings | (a) | (b) | (c) | (d) | (e)
SynHybrid PSNR | 30.35 | 32.7369 | 32.8993 | 33.9986 | 34.1054
SynHybrid SSIM | 0.9058 | 0.9347 | 0.9376 | 0.9411 | 0.9424
LasVR PSNR | 31.3549 | 32.6170 | 32.6902 | 33.1070 | 33.4157
LasVR SSIM | 0.9042 | 0.9074 | 0.9088 | 0.9120 | 0.9144

TABLE IV: Performance (PSNR and SSIM) of different strategies, including MDA (Manually Designed Architecture), ASA (Auto-Searching Architecture) and GARS (Generating Auxiliary Rain Streaks), on the SynHybrid and LasVR datasets.
RainSynLight25 24.43 / 0.7312 31.57 / 0.9508 23.78 / 0.8140 30.37 / 0.9235 27.46 / 0.8844 32.96 / 0.9434
RainSynComplex25 16.57 / 0.5833 26.92 / 0.8011 17.51 / 0.5888 20.20 / 0.6335 18.25 / 0.5824 24.13 / 0.7163
NTURain 25.63 / 0.7600 30.54 / 0.9255 25.67 / 0.8819 32.69 / 0.9451 30.58 / 0.9413 30.73 / 0.9407
RainSynLight25 32.78 / 0.9239 27.08 / 0.8858 35.80 / 0.9622 - / - 36.10 / 0.9674 36.65 / 0.9689
RainSynComplex25 21.21 / 0.5854 17.51 / 0.7108 27.72 / 0.8239 - / - 28.90 / 0.8743 29.49 / 0.8933
NTURain 33.11 / 0.9475 29.02 / 0.9158 36.05 / 0.9676 34.89 / 0.9540 36.64 / 0.9702 37.38 / 0.9704
TABLE V: Averaged PSNR and SSIM results of different rain streak removal methods on three synthesized video datasets. TMICS_S denotes the result of the single DNA. Red and blue indicate the best and second-best results, respectively.
Fig. 7: Video deraining performance comparison of FastDerain, J4R-Net, JORDER, SpacCNN and TMICS (Ours) on two videos from RainSynLight25 (the top row) and RainSynComplex25 (the bottom row), respectively.
Fig. 8: Video deraining results of FastDerain, J4R-Net, JORDER, SpacCNN and TMICS (Ours) on two types of videos (i.e., synthetic and real frames in the top and bottom rows, respectively) from the NTURain dataset.

IV Experimental Results

To evaluate the proposed method, we first conduct several detailed ablation studies to analyze the effectiveness of the frame alignment modules, auto-searching architectures, and rain streak generating module. Subsequently, a series of qualitative and quantitative assessments are performed against existing state-of-the-art approaches. Experimental results demonstrate the effectiveness and superiority of our approach.

IV-A Ablation Experiments

Datasets. We compare the proposed method with state-of-the-art approaches on the RainSynLight25, RainSynComplex25 and NTURain datasets. The ablation experiments are conducted on the LasVR and hybrid SynComplex&Light datasets. The detailed information (e.g., training / test numbers and rain streak types) is summarized in Table II.


We use the two most widely used numerical metrics, i.e., Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity index (SSIM), to evaluate the performance of different methods. Following previous works, results are evaluated on the luminance channel. To further measure the performance of different methods, we also analyze perceptual quality based on Visual Information Fidelity (VIF) [43], Feature SIMilarity (FSIM) [58], the Natural Image Quality Evaluator (NIQE) [34], Learned Perceptual Image Patch Similarity (LPIPS) [59], and tLPIPS [7]. Higher PSNR and SSIM values imply better pixel-wise accuracy, and lower NIQE, LPIPS, and tLPIPS values represent better perceptual quality. Furthermore, higher VIF and FSIM also indicate more visually pleasant results.
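The pixel-wise part of this evaluation protocol can be sketched as follows. This is an illustrative NumPy implementation, not the evaluation code of the paper: the luminance conversion uses the common ITU-R BT.601 coefficients, and the SSIM variant is a simplified single-window version (standard SSIM uses local sliding windows, e.g., as in scikit-image).

```python
import numpy as np

def rgb_to_luminance(img):
    # ITU-R BT.601 luma, matching the common practice of
    # evaluating deraining metrics on the Y channel only.
    return 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]

def psnr(ref, est, max_val=255.0):
    # Peak Signal-to-Noise Ratio in dB; higher is better.
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(ref, est, max_val=255.0):
    # Simplified SSIM computed over the whole image (no Gaussian
    # sliding window) -- an approximation for illustration only.
    x = ref.astype(np.float64)
    y = est.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

For a derained frame and its ground truth, both metrics would be applied after the luminance conversion, then averaged over all frames of a test video.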

Fig. 9: Video deraining performance comparison on three real-world rainy videos. The remaining rain streaks are marked with red boxes. The background details are marked with a yellow arrow.
Method VIF↑ FSIM↑ NIQE↓ LPIPS↓ tLPIPS↓
JORDER 0.2753 0.8243 5.270 0.407 0.199
FastDeRain 0.3347 0.8607 8.708 0.454 0.155
J4R-Net 0.2753 0.8243 3.804 0.274 0.137
SpacCNN 0.1975 0.7592 4.933 0.386 0.153
TMICS_S 0.3927 0.9112 3.555 0.152 0.063
TMICS 0.4190 0.9243 3.284 0.147 0.059
TABLE VI: Averaged VIF, FSIM, NIQE, LPIPS, and tLPIPS results of different deraining methods on the RainSynComplex25 dataset (↑: higher is better; ↓: lower is better).

Evaluating the Suitable Frame Number. To determine the best frame number, we conduct an experiment on the RainSynComplex25 and LasVR datasets with three neighborhood-frame settings (i.e., 3, 5, and 7). The corresponding experimental results are listed in Table III. We observe that unsuitable frame numbers damage the performance on both the RainSynComplex25 and LasVR datasets. Thus, in the following experiments, we utilize five continuous observations to estimate one rain-free frame.
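Assembling five consecutive observations per target frame can be sketched as a simple sliding-window routine. This is a hypothetical helper, not code from the paper; in particular, the edge-replication padding at sequence boundaries is an assumption, since the paper does not specify how border frames are handled.

```python
import numpy as np

def frame_windows(video, num_frames=5):
    # Build one window of `num_frames` consecutive observations per
    # target frame. Frames at the sequence boundaries are replicated
    # (assumed padding scheme; not specified in the paper).
    half = num_frames // 2
    t = len(video)
    windows = []
    for i in range(t):
        idx = [min(max(i + d, 0), t - 1) for d in range(-half, half + 1)]
        windows.append(np.stack([video[j] for j in idx]))
    return windows
```

Each window would then be fed to the network to estimate the single rain-free frame at its center.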

Impact of the Developed Triple-level Model. To demonstrate the effectiveness of the proposed triple-level model, we plot one group of representative results from RainSynComplex25 in Figure 5. Obviously, the result without DNA still has some residual rain streaks. The result without CNA, coupled with the rain streaks generation module, obtains promising visual results and removes the dominant rain streaks. Our proposed AAS (i.e., with CNA + DNA) generates the most vivid background and achieves rain-free performance. This is because the developed AAS scheme helps preserve both structure and details while removing more rain streaks.

We then explore the hyper-parameter for AAS by comparing different settings; the experimental results are shown in Figure 6. Specifically, with a fixed hyper-parameter, the performance still has room for improvement. Fortunately, the proposed attention mechanism provides adaptive weights that incorporate both networks, leading to a better result.
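The idea of replacing a fixed mixing coefficient with adaptive weights can be sketched as follows. This is a minimal illustration, not the actual AAS: the real scheme learns its attention scores from features, whereas here the two score maps are simply given as inputs and combined by a per-pixel softmax.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(out_dna, out_cna, score_dna, score_cna):
    # Hypothetical sketch of attention-based averaging: per-pixel
    # softmax weights over the two branch scores replace a fixed
    # mixing coefficient between the DNA and CNA outputs.
    w = softmax(np.stack([score_dna, score_cna]), axis=0)
    return w[0] * out_dna + w[1] * out_cna
```

With equal scores this degenerates to plain averaging; unequal scores let one branch dominate locally, which is what makes the fusion adaptive.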

Influence of Different Settings. To validate the effectiveness of each module in our developed method, we conduct a comprehensive ablation study over five different settings, namely the Optical Flow Module (OFM), Temporal Grouping Module (TGM), Manually Designed Architecture (MDA), Auto-Searching Architecture (ASA), and Generating Auxiliary Rain Streaks (GARS). The corresponding numerical results are summarized in Table IV.

Impact of the Network Architecture Search Setting: To investigate the proposed ASA for collaborative networks, we compare this strategy with a manually designed network (i.e., MDA). Specifically, for the manually designed structure, we stack residual blocks, dense blocks, dilated convolutions, and spatial attention in order from the search space to build the basic blocks and the manually designed network. The experimental results are listed in Table IV-(a) and (b). We observe that model (b), with the auto-searching strategy, outperforms the manually designed model (i.e., model (a)) in terms of PSNR and SSIM scores. The superiority of the proposed ASA module demonstrates that the architecture search can effectively exploit the basic operators in the search space.
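The differentiable search underlying ASA follows the DARTS idea [24]: the discrete choice among candidate operations on an edge is relaxed into a softmax-weighted sum, so the architecture parameters can be optimized by gradient descent. The toy sketch below illustrates only this continuous relaxation; the `ops` list stands in for the task-specific candidates (residual/dense blocks, dilated convolution, spatial attention), represented here by simple callables.

```python
import numpy as np

def mixed_op(x, ops, alphas):
    # DARTS-style continuous relaxation: the edge output is a
    # softmax(alphas)-weighted sum over all candidate operations.
    # After training, the op with the largest alpha is kept.
    e = np.exp(alphas - alphas.max())
    w = e / e.sum()
    return sum(wi * op(x) for wi, op in zip(w, ops))
```

As one architecture weight grows much larger than the others, the mixed output converges to that single operation, which is what makes the final discretization step reasonable.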

Impact of the GARS Module: We also compare the effectiveness of GARS in Table IV-(d) and (e). The collaborative networks with GARS obtain significant improvements based on the auxiliary information of structural factors. This is mainly because the generated auxiliary rain streaks enhance the capability of the network to characterize a wide range of video circumstances.

Frame Estimation Modules: To explore the proposed alternative strategy, we introduce three baseline models, i.e., (b) only with OFM, (c) only with TGM, and (d) with alternating OFM or TGM. As can be seen in Table IV, model (d) outperforms (b) and (c) on both the SynHybrid and LasVR datasets. This experiment illustrates the effectiveness of the developed alternating module.

IV-B. Comparison with State-of-the-Art

Comparing on Synthesized Datasets. We compare the developed method with state-of-the-art approaches to video deraining, including both single-image deraining methods (i.e., the JOint Rain DEtection and Removal network (JORDER [51]), Density-aware Single-Image Deraining using a Multi-stream Dense Network (DID-MDN [57]), and the SPatial Attentive single-image deraining network (SPANet [47])) and video deraining methods (i.e., Multi-Scale Convolutional Sparse Coding (MS-CSC [22]), FastDeRain [18], the Joint Recurrent Rain Removal and Reconstruction Network (J4R-Net [25]), the SuperPixel Alignment and Compensation CNN (SpacCNN [5]), DualFlow [50], CLEARER [14], and the Self-Learned Deraining Network (SLDNet [54])). The performance of video rain streak removal is evaluated on three rainy video benchmarks: RainSynLight25, RainSynComplex25, and NTURain, which involve diverse kinds of rain streaks varying in direction, scale, density, and intensity.

For the quantitative comparison, we calculate two widely used metrics (i.e., PSNR and SSIM) and list the experimental results in Table V. It can be seen that our approach shows significant superiority over previous methods on all three datasets. Compared with the recently proposed DualFlow, our method attains gains of more than 0.85dB, 1.77dB, and 1.33dB in PSNR on the RainSynLight25, RainSynComplex25, and NTURain datasets, respectively. The result of the single DNA (i.e., TMICS_S) also obtains promising performance. This corroborates the flexibility and universality of our proposed method when dealing with various video situations with different rain streak types. Besides the PSNR and SSIM metrics, which measure performance by pixel-wise accuracy, we further calculate NIQE, LPIPS, and tLPIPS to evaluate perceptual quality. We compare our method with four methods that have relatively high PSNR and SSIM scores (i.e., JORDER, FastDerain, J4R-Net, and SpacCNN); the comparison results on the challenging RainSynComplex25 are listed in Table VI. As lower NIQE, LPIPS, and tLPIPS values denote better perceptual quality, our method achieves the best perceptual performance across the evaluation metrics.

For the qualitative performance, Figure 7 and Figure 8 show visual comparisons between our scheme and the four best methods with relatively high PSNR and SSIM scores (i.e., FastDerain, J4R-Net, JORDER, and SpacCNN). As observed in Figure 7, the proposed method performs the best, while the other methods (i.e., FastDeRain, JORDER, J4R-Net, and SpacCNN) still leave rain streaks. Besides, Figure 8 depicts the visual performance on the NTURain dataset. We observe that FastDerain, J4R-Net, and JORDER still leave rain streaks and mistake background details for rain streaks, while SpacCNN generates a much more blurred background. Overall, our method provides more effective performance, with fewer remaining rain streaks, more abundant details, and less blur.

Comparing on Real-world Video Datasets. To further illustrate the performance of our method, we choose three difficult real-world rainy videos with various rain circumstances and compare against a series of competitive methods, including MS-CSC, FastDerain, JORDER, SPANet, J4R-Net, and SpacCNN. These rainy video sequences are collected from different sources, i.e., YouTube, movie clips, and Mixkit. As shown in Figure 9, previous methods tend to leave distinct rain streaks and mistake background details for rain streaks. Fortunately, our developed method shows a good capability to preserve background details while removing more rain streaks.

V. Conclusion

This work developed a model-guided auto-searching method based on the formulated triple-level video deraining optimization framework for removing different kinds of video rain streaks. We first introduced a macroscopic structure searching scheme for inter-frame information extraction. Then, we designed a cooperating optimization model over task variables and hyper-parameters based on the re-constructed comprehensive model. For the task variable propagation, we designed two collaborative structures, i.e., DNA and CNA. Subsequently, we introduced an attention-based averaging scheme to effectively fuse features from the collaborative structures. To obtain state-of-the-art neural network structures (i.e., DNA and CNA), we applied microscopic network architecture search over a compact task-specific search space to discover desirable video deraining architectures.


  • [1] P. Barnum, T. Kanade, and S. Narasimhan (2007) Spatio-temporal frequency analysis for removing rain and snow from videos. In International Workshop on Photometric Analysis for Computer Vision, pp. 8. Cited by: §I.
  • [2] Y. Chang, L. Yan, and S. Zhong (2017) Transformed low-rank model for line pattern noise removal. In IEEE ICCV, pp. 1726–1734. Cited by: §II.
  • [3] J. Chen, P. Mu, R. Liu, X. Fan, and Z. Luo (2020) Flexible bilevel image layer modeling for robust deraining. In IEEE ICME, pp. 1–6. Cited by: §II.
  • [4] J. Chen and L. Chau (2013) A rain pixel recovery algorithm for videos with highly dynamic scenes. IEEE Transactions on Image Processing 23 (3), pp. 1097–1104. Cited by: §I.
  • [5] J. Chen, C. Tan, J. Hou, L. Chau, and H. Li (2018) Robust video content alignment and compensation for rain removal in a cnn framework. In IEEE CVPR, pp. 6286–6295. Cited by: TABLE II, §IV-B.
  • [6] Y. Chen and C. Hsu (2013) A generalized low-rank appearance model for spatio-temporally correlated rain streaks. In IEEE ICCV, pp. 1968–1975. Cited by: §II.
  • [7] M. Chu, Y. Xie, L. Leal-Taixé, and N. Thuerey (2018) Temporally coherent GANs for video super-resolution (TecoGAN). arXiv preprint arXiv:1811.09393. Cited by: §IV-A.
  • [8] X. Ding, L. Chen, X. Zheng, Y. Huang, and D. Zeng (2016) Single image rain and snow removal via guided l0 smoothing filter. Multimedia Tools and Applications 75 (5), pp. 2697–2712. Cited by: §II.
  • [9] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey. Journal of Machine Learning Research 20, pp. 1–21. Cited by: §II.
  • [10] X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley (2017) Removing rain from single images via a deep detail network. In IEEE CVPR, pp. 3855–3863. Cited by: §II.
  • [11] Y. Fu, L. Kang, C. Lin, and C. Hsu (2011) Single-frame-based rain removal via image decomposition. In IEEE ICASSP, pp. 1453–1456. Cited by: §II.
  • [12] K. Garg and S. K. Nayar (2004) Detection and removal of rain from videos. In IEEE CVPR, Vol. 1, pp. I–I. Cited by: §I, §II.
  • [13] K. Garg and S. K. Nayar (2007) Vision and rain. International Journal of Computer Vision 75 (1), pp. 3–27. Cited by: §II.
  • [14] Y. Gou, B. Li, Z. Liu, S. Yang, and X. Peng (2020) CLEARER: multi-scale neural architecture search for image restoration. In NeurIPS, Vol. 33. Cited by: §II, §IV-B.
  • [15] S. Gu, D. Meng, W. Zuo, and L. Zhang (2017) Joint convolutional analysis and synthesis sparse representation for single image layer separation. In IEEE ICCV, pp. 1717–1725. Cited by: §II.
  • [16] T. Isobe, S. Li, X. Jia, S. Yuan, G. Slabaugh, C. Xu, Y. Li, S. Wang, and Q. Tian (2020) Video super-resolution with temporal group attention. In CVPR, pp. 8008–8017. Cited by: §III-B2.
  • [17] T. Jiang, T. Huang, X. Zhao, L. Deng, and Y. Wang (2017) A novel tensor-based video rain streaks removal approach via utilizing discriminatively intrinsic priors. In IEEE CVPR, pp. 4057–4066. Cited by: §I, §II.
  • [18] T. Jiang, T. Huang, X. Zhao, L. Deng, and Y. Wang (2019) FastDeRain: a novel video rain streak removal method using directional gradient priors. IEEE Transactions on Image Processing 28 (4), pp. 2089–2102. Cited by: Fig. 1, §I, §IV-B.
  • [19] L. Kang, C. Lin, and Y. Fu (2011) Automatic single-image-based rain streaks removal via image decomposition. IEEE Transactions on Image Processing 21 (4), pp. 1742–1755. Cited by: §II.
  • [20] L. Kang, C. Lin, C. Lin, and Y. Lin (2012) Self-learning-based rain streak removal for image/video. In ISCAS, pp. 1871–1874. Cited by: §II.
  • [21] J. Kim, J. Sim, and C. Kim (2015) Video deraining and desnowing using temporal correlation and low-rank matrix completion. IEEE Transactions on Image Processing 24 (9), pp. 2658–2670. Cited by: §II.
  • [22] M. Li, Q. Xie, Q. Zhao, W. Wei, S. Gu, J. Tao, and D. Meng (2018) Video rain streak removal by multiscale convolutional sparse coding. In IEEE CVPR, pp. 6644–6653. Cited by: §I, §II, §IV-B.
  • [23] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown (2016) Rain streak removal using layer priors. In IEEE CVPR, pp. 2736–2744. Cited by: §II.
  • [24] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In ICLR, Cited by: §I, §II, §III-A, §III-B4.
  • [25] J. Liu, W. Yang, S. Yang, and Z. Guo (2018) Erase or fill? deep joint recurrent rain removal and reconstruction in videos. IEEE CVPR, pp. 3233–3242. Cited by: Fig. 1, §I, §II, TABLE II, §IV-B.
  • [26] R. Liu, J. Gao, J. Zhang, D. Meng, and Z. Lin (2021) Investigating bi-level optimization for learning and vision from a unified perspective: a survey and beyond. arXiv preprint arXiv:2101.11517. Cited by: §III-A.
  • [27] R. Liu, Y. Liu, S. Zeng, and J. Zhang (2021) Towards gradient-based bilevel optimization with non-convex followers and beyond. In NeruIPS, Cited by: §III-A.
  • [28] R. Liu, P. Mu, J. Chen, X. Fan, and Z. Luo (2020) Investigating task-driven latent feasibility for nonconvex image modeling. IEEE Transactions on Image Processing 29, pp. 7629–7640. Cited by: §III-A.
  • [29] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang (2020) A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In ICML, pp. 6305–6315. Cited by: §III-A.
  • [30] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang (2021) A generic descent aggregation framework for gradient-based bi-level optimization. arXiv preprint arXiv:2102.07976. Cited by: §III-A.
  • [31] R. Liu, P. Mu, and J. Zhang (2021) Investigating customization strategies and convergence behaviors of task-specific admm. IEEE Transactions on Image Processing 30, pp. 8278–8292. Cited by: §II.
  • [32] T. Liu, M. Xu, and Z. Wang (2019) Removing rain in videos: a large-scale database and a two-stream convlstm approach. In IEEE ICME, pp. 664–669. Cited by: TABLE II.
  • [33] Y. Luo, Y. Xu, and H. Ji (2015) Removing rain from a single image via discriminative sparse coding. IEEE ICCV, pp. 3397–3405. Cited by: §II.
  • [34] A. Mittal, R. Soundararajan, and A. C. Bovik (2012) Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters, pp. 209–212. Cited by: §IV-A.
  • [35] P. Mu, J. Chen, R. Liu, X. Fan, and Z. Luo (2019) Learning bilevel layer priors for single image rain streaks removal. IEEE Signal Processing Letters 26 (2), pp. 307–311. Cited by: §II.
  • [36] W. Ran, Y. Yang, and H. Lu (2020) Single image rain removal boosting via directional gradient. In IEEE ICME, pp. 1–6. Cited by: §II.
  • [37] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In AAAI, Vol. 33, pp. 4780–4789. Cited by: §II.
  • [38] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In ICML, pp. 2902–2911. Cited by: §II.
  • [39] D. Ren, W. Zuo, Q. Hu, P. Zhu, and D. Meng (2019) Progressive image deraining networks: a better and simpler baseline. In IEEE CVPR, pp. 3937–3946. Cited by: §II.
  • [40] W. Ren, J. Tian, Z. Han, A. Chan, and Y. Tang (2017) Video desnowing and deraining based on matrix decomposition. In IEEE CVPR, pp. 4210–4219. Cited by: §I, §II, §II.
  • [41] S. Sabach and S. Shtern (2017) A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization 27 (2), pp. 640–660. Cited by: §III-A.
  • [42] P. K. Sharma, S. Ghosh, and A. Sur (2021) High-quality frame recurrent video de-raining with multi-contextual adversarial network. ACM Transactions on Multimedia Computing, Communications, and Applications 17 (2), pp. 1–24. Cited by: §II.
  • [43] H. R. Sheikh and A. C. Bovik (2006) Image information and visual quality. IEEE Transactions on Image Processing, pp. 430–444. Cited by: §IV-A.
  • [44] S. Starik and M. Werman (2003) Simulation of rain in videos. In Texture Workshop, ICCV, Vol. 2, pp. 406–409. Cited by: §I, §II.
  • [45] Z. Teed and J. Deng (2020) Raft: recurrent all-pairs field transforms for optical flow. In ECCV, pp. 402–419. Cited by: §III-B2.
  • [46] H. Wang, Y. Wu, Q. Xie, Q. Zhao, Y. Liang, S. Zhang, and D. Meng (2020) Structural residual learning for single image rain removal. Knowledge-Based Systems, pp. 106595. Cited by: §II.
  • [47] T. Wang, X. Yang, K. Xu, S. Chen, Q. Zhang, and R. W. H. Lau (2019) Spatial attentive single-image deraining with a high quality real rain dataset. IEEE CVPR, pp. 12270–12279. Cited by: Fig. 1, §III-B4, §III-B4, §IV-B.
  • [48] Y. Wang, T. Huang, X. Zhao, and T. Jiang (2020) Video deraining via nonlocal low-rank regularization. Applied Mathematical Modelling 79, pp. 896–913. Cited by: §I, §II.
  • [49] W. Wei, L. Yi, Q. Xie, Q. Zhao, D. Meng, and Z. Xu (2017) Should we encode rain streaks in video as deterministic or stochastic?. In IEEE ICCV, pp. 2516–2525. Cited by: §II.
  • [50] W. Yang, J. Liu, and J. Feng (2019) Frame-consistent recurrent video deraining with dual-level flow. In IEEE CVPR, pp. 1661–1670. Cited by: §I, §II, §II, §III-A, §IV-B.
  • [51] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan (2017) Deep joint rain detection and removal from a single image. IEEE CVPR, pp. 1685–1694. Cited by: Fig. 1, §II, §III-B4, §III-B4, §IV-B.
  • [52] W. Yang, R. T. Tan, J. Feng, S. Wang, B. Cheng, and J. Liu (2021) Recurrent multi-frame deraining: combining physics guidance and adversarial learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II.
  • [53] W. Yang, R. T. Tan, S. Wang, Y. Fang, and J. Liu (2020) Single image deraining: from model-based to data-driven and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §II, §III-C.
  • [54] W. Yang, R. T. Tan, S. Wang, and J. Liu (2020) Self-learning video rain streak removal: when cyclic consistency meets temporal correspondence. In IEEE CVPR, pp. 1720–1729. Cited by: §I, §II, §III-A, §IV-B.
  • [55] W. Yang, S. Wang, D. Xu, X. Wang, and J. Liu (2020) Towards scale-free rain streak removal via self-supervised fractal band learning.. In AAAI, pp. 12629–12636. Cited by: §II.
  • [56] H. Zhang, Y. Li, H. Chen, and C. Shen (2020) Memory-efficient hierarchical neural architecture search for image denoising. In IEEE CVPR, pp. 3657–3666. Cited by: §II.
  • [57] H. Zhang and V. M. Patel (2018) Density-aware single image de-raining using a multi-stream dense network. IEEE CVPR, pp. 695–704. Cited by: §II, §IV-B.
  • [58] L. Zhang, L. Zhang, X. Mou, and D. Zhang (2011) FSIM: a feature similarity index for image quality assessment. IEEE Transactions on Image Processing, pp. 2378–2386. Cited by: §IV-A.
  • [59] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In IEEE CVPR, pp. 586–595. Cited by: §IV-A.
  • [60] X. Zheng, Y. Liao, W. Guo, X. Fu, and X. Ding (2013) Single-image-based rain and snow removal using multi-guided filter. In NeurIPS, pp. 258–265. Cited by: §II.
  • [61] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §II.