1 Introduction
Inspired by compressive sensing (CS) [13, 4, 12], various computational imaging systems [1, 56] have been developed to enhance imaging performance. In particular, snapshot compressive imaging (SCI) is an important branch of computational imaging with wide applications [18, 47, 39]. It uses a 2-dimensional (2D) camera to capture the desired 3-dimensional (3D) data (videos or hyperspectral images) by imposing modulations and then compressing the data into a single frame (measurement). Unlike conventional imaging systems that employ direct sampling strategies, SCI systems compress the high-dimensional signals along time (e.g., CACTI [28, 59]) or spectrum (e.g., CASSI [48]) to obtain the compressed measurements, leading to promising advantages in acquisition efficiency, storage consumption, and bandwidth footprint.
Taking the compressed measurements as input, SCI reconstruction algorithms focus on efficiently and effectively recovering the desired video frames. The first mainstream family of methods is based on optimization models, which differ in the prior knowledge they use, e.g., total variation (TV) in GAP-TV [55] and TwIST [2], sparsity in GMM-based methods [54, 57], and non-local low-rank in DeSCI [27]. The second mainstream family is based on deep learning networks, which learn a direct mapping from measurements to video frames, such as the off-the-shelf U-net [38], RevSCI [5] based on a 3D convolutional neural network (CNN), and BIRNAT [6] based on a recurrent neural network (RNN). Recently, some researchers have combined deep networks with optimization methods to reconstruct the desired videos, such as deep unfolding [53, 24, 29, 31] and plug-and-play algorithms [58, 60]. Typically, for video reconstruction, capturing non-local interactions is of central importance, since similar pixel patterns frequently recur between distant pixels across space-time [40, 50]. However, although existing approaches can reconstruct the videos decently, they are limited in efficiently capturing long-range dependencies across video frames, which restricts further performance improvement.
As the most commonly used deep learning building blocks, both convolutional and recurrent operations process one local neighborhood at a time, either in space or time. While deep stacks of CNN or RNN layers naturally extend the receptive field, this strategy is inherently limited by computational inefficiency, optimization difficulties, and multi-hop dependency modeling, i.e., the difficulty of delivering messages back and forth between distant positions. On the other hand, although self-attention mechanisms have been applied to capture the global (non-local) spatial dependencies for SCI [6, 33, 32], they are inherently limited in the following respects: 1) severe computation overhead, 2) redundant information from capturing relations across all pairs of pixels, and 3) operating only along the spatial dimension while neglecting the temporal dimension. Consequently, there remains a gap in constructing an elaborately designed model that efficiently captures both the local and global spatial-temporal dependencies for video SCI reconstruction.
Bearing these concerns in mind, we propose a graph neural network (GNN) based approach to directly compute interactions between any two positions regardless of their spatial-temporal distance for better video representation learning. Furthermore, a dynamic sampling scheme is designed to select the most relevant neighbours for each node, alleviating the computational issues while enjoying the advantages brought by graph-structured features. In addition to modeling on video inputs, we observe that optical flow as an off-the-shelf module is helpful for finding non-local dependencies [50], which motivates us to incorporate optical flow into our graph construction. As depicted in Fig. 1 (a), the motion information helps the network find the most relevant neighbours. Typically, there are two ways to further improve the current state-of-the-art (SOTA) results: a) developing a new end-to-end network accounting for non-local spatial and temporal dependencies, or b) building an efficient, flexible, and generic add-on module appended to existing networks to improve the results. Option a) is computationally expensive, consuming large GPU memory and long training time. Thus, in this paper, instead of designing more complicated network architectures, we adopt b) to address the aforementioned shortcomings and efficiently model non-local spatial-temporal dependencies.
Consequently, in this paper, we propose a flexible and robust approach based on a motion-aware dynamic graph (dubbed MadyGraph) to improve existing SOTA models for video SCI. As shown in Fig. 1 (b), the reconstruction quality can be significantly improved with our proposed MadyGraph. To elaborate, the design is driven by the following concerns: i) capture dependencies regardless of the distance along both the spatial and temporal dimensions (achieved by graph construction); ii) flexibly aggregate the most relevant contextual information without much redundancy (achieved by dynamic sampling); iii) incorporate appropriate priors into our model, such as motion information (achieved by motion-aware dynamic walks) and sparsity and non-local self-similarity (achieved by cross-scale node sampling), to better exhibit correlations between nodes. In terms of the overall framework, MadyGraph takes the reconstructed videos as input, first constructs relationships among dynamically selected spatial-temporal related nodes, and then aggregates them to enable effective feature extraction for enhancing video reconstruction. Specific contributions of this work are summarized as follows:

A motion-aware dynamic graph network (MadyGraph) is developed to efficiently model non-local spatial-temporal dependencies with the frame-by-frame motion information for better video representation. To the best of our knowledge, this is the first time that a graph neural network is utilized for SCI reconstruction.

The proposed flexible and lightweight module can be “plugged” into any existing reconstruction approach to improve the results, while alleviating the computational issues to a certain extent ( for pixels) compared with other non-local networks such as the self-attention module (), in accordance with the practical computational requirements of video SCI.

We showcase the effectiveness of the proposed dynamic graph in the application of video SCI, which achieves considerable improvements with respect to SOTA baselines on both simulation and real datasets.
2 Video Snapshot Compressive Imaging
2.1 Mathematical Model of Video SCI
In video SCI, the 3D video is modulated by dynamic masks and then compressed along time into a ‘measurement’ (a coded 2D frame), which is decoded later using effective reconstruction methods to recover the original video. As depicted in Fig. 2, $B$ high-speed video frames $\{\mathbf{X}_k\}_{k=1}^{B}$ are modulated by coding patterns (masks) $\{\mathbf{C}_k\}_{k=1}^{B}$, and then integrated over time on a camera. Mathematically, the compressed coded measurement frame $\mathbf{Y}$ can be expressed as
$$\mathbf{Y} = \sum_{k=1}^{B} \mathbf{C}_k \odot \mathbf{X}_k + \mathbf{Z}, \qquad (1)$$
where $\odot$ denotes the Hadamard (elementwise) product and $\mathbf{Z}$ refers to the noise. It has been proved that high quality reconstruction is achievable when $B > 1$ [20]. In this paper, we focus on the inverse problem, i.e., aiming to improve the reconstruction quality of the video frames $\{\mathbf{X}_k\}_{k=1}^{B}$, given $\mathbf{Y}$, $\{\mathbf{C}_k\}_{k=1}^{B}$, and existing reconstruction algorithms.
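To make the forward model concrete, the following minimal sketch simulates a single snapshot measurement as the sum of mask-modulated frames plus optional noise. The function name `sci_measure`, the Gaussian noise model, and the toy 2×2 data are illustrative assumptions, not part of the original system.

```python
import random

def sci_measure(frames, masks, noise_std=0.0, seed=0):
    """Simulate one snapshot: sum over frames of (mask * frame), plus optional noise."""
    rng = random.Random(seed)
    height, width = len(frames[0]), len(frames[0][0])
    measurement = [[0.0] * width for _ in range(height)]
    for frame, mask in zip(frames, masks):
        for r in range(height):
            for c in range(width):
                # Hadamard (elementwise) product, accumulated over time.
                measurement[r][c] += mask[r][c] * frame[r][c]
    if noise_std > 0:
        # Assumed noise model: i.i.d. Gaussian per pixel.
        for r in range(height):
            for c in range(width):
                measurement[r][c] += rng.gauss(0.0, noise_std)
    return measurement

# Toy example: two 2x2 frames with binary masks, no noise.
frames = [[[1.0, 2.0], [3.0, 4.0]],
          [[5.0, 6.0], [7.0, 8.0]]]
masks = [[[1, 0], [0, 1]],
         [[0, 1], [1, 1]]]
print(sci_measure(frames, masks))  # -> [[1.0, 6.0], [7.0, 12.0]]
```

Each pixel of the measurement mixes contributions from several frames, which is why recovering the individual frames is an ill-posed inverse problem.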
2.2 Related Work
Video Snapshot Compressive Imaging: Video SCI is a hardware-encoder-plus-software-decoder system [56]. For the encoder, the key part is the spatial light modulator that modulates the original scene, e.g., usually a physical mask [28, 59] or different patterns on a digital micromirror device (DMD) [18, 38, 30, 36, 37, 41]. Regarding the decoder, some optimization-model-based methods introduce different priors to recover the desired videos by iterative optimization, e.g., GAP-TV [55], GMM [54, 57], and DeSCI [27]; however, their reconstruction time is too long to be practical for real applications. In recent years, SCI reconstruction efficiency has been boosted by deep learning based methods [5, 6, 52, 31, 24], which can recover the video within tens or hundreds of milliseconds and with high quality. Most recently, a dense deep unfolding network (DDUN) [53] was designed with a 3D CNN prior, which combines the merits of both optimization-model-based and deep learning based methods, achieving SOTA performance in video SCI. Nevertheless, these approaches usually neglect the long-range dependencies over the space-time volume during reconstruction, which calls for efficient non-local modeling.
Non-local image/video processing: In natural images and videos, similar pixel patterns frequently recur, so many non-local methods have shown good performance in different tasks [3, 8]. The non-local operation computes the response at a position as a weighted sum of the features at all positions, across space, time, or space-time [50, 26, 63], but suffers from heavy computation. Graph neural network (GNN) based methods, which propagate information along graph-structured input data, can alleviate the computational issues to a certain extent. The works most related to our proposed model are GraphSAGE [16] and DGMN [62], which utilize sampled graph nodes to capture position-based context. However, GraphSAGE [16] simply samples nodes uniformly at fixed positions along the spatial dimension, independent of the actual input. Absorbing the ideas from deformable convolution [9], DGMN [62] adaptively samples graph nodes for message passing according to the input. Nevertheless, it only explores the spatial dimension without extending to the temporal dimension. Furthermore, in DGMN, the offset of each node is estimated only from the features of a set of sampled nodes without checking the whole feature map, resulting in insufficient information exploration. In contrast, we extend the non-local modeling to both space and time while keeping a lightweight computational cost, and dynamically sample related nodes not only according to the whole feature map but also taking the motion information between frames into consideration.
Video Enhancement: Due to the success of deep learning, researchers model image or video enhancement tasks as regression problems, i.e., given a degraded image or video, a well-designed network outputs an enhanced one. Various deep neural networks have been designed for super-resolution [11, 49, 45, 25], denoising [61, 43, 44], deblurring [49, 34, 7], and compression artifact reduction [10, 15, 17, 14]. To a certain degree, the purpose of our proposed method is similar to that of video enhancement methods, as both aim to improve the quality of an existing degraded video. However, the enhancement methods are incompatible with the video SCI task, as discussed in our experimental section. Different from the traditional enhancement tasks, we incorporate the hardware information of the SCI system into MadyGraph, where both the inputs and network architectures are elaborately designed for video SCI reconstruction.
3 The Proposed Model
As depicted in Fig. 3, given the compressed measurement captured by the SCI system and the corresponding masks, our goal is to output the fine-grained reconstructed video. Overall, our proposed model consists of three steps as shown in Fig. 3: 1) The input coarse video frames are obtained from backbone (any existing) models, and then remasked and mapped to high-dimensional features through a 4-layer 3D CNN. 2) The dynamic graph is constructed by automatically selecting the most related neighbours across space-time for each node with two mechanisms: motion-aware dynamic sampling and cross-scale node sampling. 3) The dynamically constructed graph is then aggregated in the feature domain and mapped through a 4-layer 3D CNN into the desired reconstructed video domain.
3.1 Input Remasking and Feature Extraction
At the beginning, the coarsely reconstructed videos are obtained through the backbone models. Here, the backbone models refer to any existing video SCI models, including both optimization-based and deep learning based models. In order to provide a baseline, we also design a simple 3D CNN model for coarse video estimation, which is omitted here and detailed in the supplementary material (SM). In terms of the network input, one straightforward way is to directly use the reconstructed videos from the backbone models as inputs. However, after extensive experiments, we found it difficult to obtain good results in this manner, since critical information might be smoothed out in the coarsely reconstructed videos. Considering that knowledge of the hardware system is critical for SCI decoding, we develop a remasking operation that is better suited to video SCI.
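The remasking operation developed in this subsection can be illustrated with a small sketch. It rests on a strong assumption about the exact form of the operation: the remodulated frame for index k is taken to be the measurement minus the sum of the other coarse frames after modulation by their masks. The function name `remask` and the toy data are hypothetical.

```python
def remask(measurement, masks, coarse_frames):
    """For each frame k, return measurement - sum_{i != k} mask_i * coarse_i.

    Assumed form of the remodulated frame; the exact operation may differ.
    """
    height, width = len(measurement), len(measurement[0])
    remasked = []
    for k in range(len(coarse_frames)):
        residual = [row[:] for row in measurement]  # copy the measurement
        for i, (mask, frame) in enumerate(zip(masks, coarse_frames)):
            if i == k:
                continue  # keep the contribution of frame k itself
            for r in range(height):
                for c in range(width):
                    residual[r][c] -= mask[r][c] * frame[r][c]
        remasked.append(residual)
    return remasked

# Toy check: with perfect coarse frames, each residual equals mask_k * frame_k.
coarse_frames = [[[1.0, 2.0], [3.0, 4.0]],
                 [[5.0, 6.0], [7.0, 8.0]]]
masks = [[[1, 0], [0, 1]],
         [[0, 1], [1, 1]]]
measurement = [[1.0, 6.0], [7.0, 12.0]]  # noise-free sum of masked frames
print(remask(measurement, masks, coarse_frames))
# -> [[[1.0, 0.0], [0.0, 4.0]], [[0.0, 6.0], [7.0, 8.0]]]
```

Under this reading, any error in the coarse estimate of the other frames shows up in the residual of frame k, so the network input carries information from the measurement and masks that a smoothed coarse video alone would not.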
Based on the coarsely reconstructed video from the backbone model, we remask each frame to enrich it with information about the video SCI system, i.e., the measurement and the other remodulated frames. This step aims to reuse the information that might be neglected or smoothed out in the backbone models. Recalling the mathematical model of video SCI in Eq. (1), we remask each frame by:
(2) 
where denotes the remodulated frame, acquired by dividing the summation of the other masked frames from the measurement. Thus, in the remasking operation of each frame, all the inputs of the original system and the coarse video are incorporated to enrich its information, facilitating further feature extraction. After acquiring the remasked frames, we construct a 4-layer 3D CNN to encode the data cube into high-dimensional features, as:
(3) 
where is a 4D tensor, is the channel dimension, and refers to the temporal dimension (the number of video frames). After the 3D CNN is applied to the remasked video frames, the encoded feature of each frame is a 3D tensor, and the representation of each pixel in a single frame corresponds to a vector with the dimension of .
3.2 Motion-aware Dynamic Node Sampling
Given the feature maps , hereafter we present the detailed steps to dynamically construct a motionaware graph.
3.2.1 Graph Definition and Notation
Firstly, we present the feature maps in the graph domain by constructing a feature graph , where denotes a set of node features, and represents the connections between nodes. Specifically, we define each initial node feature of the graph as the latent feature vector of a single pixel at one time step in , i.e., . Formally, the graph representation learning can be described as a weighted summation:
(4) 
where represents the node feature at the iteration, computed by weighted summation of at the previous iteration. The initial node features come from the initial input feature maps . The denotes the connection relationship between nodes and , and refers to the self-included neighbours of the node . Here, is a transmit operation imposed on .
Given , an important question is how to construct which correlates each node feature with others. For video processing, both the spatial and temporal relations should be taken into consideration regardless of their distance. A direct implementation is to construct a fully-connected graph by computing non-local interactions between every two nodes across space and time, i.e., . However, a fully-connected graph often suffers from prohibitively expensive computation and redundant information, which makes the graph network difficult to optimize, especially with limited training data. To address this challenge, we design a dynamic sampling scheme that selects a small subset of the most relevant feature nodes as across space-time, through motion-aware dynamic sampling and cross-scale node sampling. In this way, the node relations in are nonzero only at the positions of the selected relevant nodes, which imposes a sparsity prior and gathers non-local dependencies efficiently without computation overhead or information redundancy.
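The gap between a fully-connected graph and a sparsely sampled one can be made tangible by counting pairwise interactions. The sketch below is only an illustration: the function name `pairwise_interactions` and the choice of 9 sampled neighbours per frame are hypothetical, and the video size 256×256×8 matches the simulated test data used later in the paper.

```python
def pairwise_interactions(height, width, frames, neighbours_per_frame=None):
    """Count node-to-node interactions for one graph update.

    With neighbours_per_frame=None every node attends to every node (fully
    connected). Otherwise each node attends to a fixed number of sampled
    neighbours in each frame (sparse graph).
    """
    nodes = height * width * frames
    if neighbours_per_frame is None:
        return nodes * nodes
    return nodes * neighbours_per_frame * frames

full = pairwise_interactions(256, 256, 8)
sparse = pairwise_interactions(256, 256, 8, neighbours_per_frame=9)  # 9 is hypothetical
print(full)    # -> 274877906944
print(sparse)  # -> 37748736
print(round(full / sparse))  # reduction factor, roughly 7282
```

The sparse count grows linearly with the number of nodes, whereas the fully-connected count grows quadratically, which is the motivation for sampling only a few relevant neighbours per node.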
3.2.2 Motion-aware Dynamic Sampling
In this part, we introduce a dynamic node sampling scheme to adaptively select neighbours of each node under the guidance of motion information. The dynamic sampling scheme is composed of three steps: 1) initial sampling, uniformly in space and continuously in time, 2) frame-by-frame motion extraction, and 3) motion-aware dynamic walks.
Initial sampling: Along the spatial dimension, we uniformly sample neighbouring nodes from across space for each graph node , which is a commonly used strategy for graph node sampling [62, 23] based on Monte Carlo estimation. Along the temporal dimension, there are two options for sampling. Discrete sampling constructs graph snapshots taken at intervals in time, whereas continuous sampling is more general and may include more motions, at the price of more computation. For video SCI reconstruction, we employ the continuous sampling scheme, considering the importance of video continuity and the affordable extra computation for video SCI. Thus, the initial neighbours of each node contain elements. The initial sampling scheme can be seen as a position-fixed, locally-connected mechanism which neglects the original feature distribution and may miss important context.
Motion extraction: In order to explicitly learn the dynamic motions across video frames, we use optical flow to represent the frame-by-frame motions and help each node find its most related neighbouring nodes. Considering efficiency and accuracy, a lightweight flownet [19] is employed as our optical flow extractor. Taking the frame and the frame of the coarsely reconstructed video as inputs, we calculate the motion by:
(5) 
where denote the optical flow of adjacent frames output from a pretrained extractor [19], each of which consists of a vertical and a horizontal component. Furthermore, we define the motion of the last frame in a video as , to provide the motion information in the reverse direction. In the following, we will introduce how to utilize the motion information to guide dynamic sampling, which particularly learns the dynamic walks around initial sampling nodes for each graph node.
Motion-Aware Dynamic Walks: To take all the node features and motions into account when dynamically sampling neighbours for nodes, we absorb the ideas from deformable CNN [9] and propose a motion-aware dynamic walk upon the initial sampling nodes. As illustrated in Fig. 3, the black arrows refer to the dynamic walks around the initial sampling nodes, which are predicted in a data-driven fashion, aware of the overall feature maps and motion distributions. Under the continuous sampling scheme across time, the dynamic walks are performed independently along the spatial dimension in each frame, since the positions of related nodes change over time in a video. In other words, the dynamic walks are different and adaptive in different frames.
Considering that dynamic walks are conducted across 2D spatial domain in each frame, let denote the predicted walk according to one of the initially sampled neighbouring nodes in the video frame. The node walks are predicted by applying a convolutional layer over the input feature maps and motion representations, i.e.,
(6) 
where is the horizontal and vertical walks (offsets) of the neighbouring nodes of all the pixels in the frame, and refers to the concatenation operation along channels. Hereafter, we omit the superscript of iterations for brevity. Intuitively, it is much easier for the neural network to predict the position of related nodes if the explicit motion of the current node is known. Thus we utilize the optical flow of adjacent frames to provide the motion information of both the former and latter frames as guidance to better learn the dynamic walks. Finally, the predicted dynamic walk is applied to the initially sampled position by (), and the features are calculated as:
(7) 
where denotes the bilinear sampler [9], which is imposed on , the original feature map of the frame, to obtain the bilinearly interpolated features for the dynamically sampled nodes. This is because the dynamic walks are typically fractional, resulting in irregular sampling positions; please refer to [9] for more details about the bilinear sampler. In short, by performing motion-aware dynamic walks upon the initially sampled neighbouring nodes, we believe that the model can better find the most related neighbours and learn better feature distributions for the graph nodes.
3.2.3 Cross-Scale Node Sampling
In order to capture longer-range dependencies for each node, we utilize a cross-scale node sampling mechanism to increase the receptive field size, while sparse sampling keeps the computation efficient. Different “dilation rates” (denoted as ) are used to sample neighbouring nodes at various distances whilst maintaining a small number of connected nodes. As shown in Fig. 3, different “dilation rate” settings find related nodes at different distances across various scales, and are utilized in conjunction with the dynamic walks. A limitation of this parallel architecture is that the memory cost grows in proportion to the number of scales, which is nevertheless acceptable given the performance gains.
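The sampling pipeline of Sections 3.2.2 and 3.2.3 can be sketched for a single frame and a single-channel feature map as follows. This is a simplified illustration under stated assumptions: the initial neighbours are taken from a 3×3 grid (the actual number of samples is not assumed to match the paper), the walks are passed in by hand rather than predicted by a convolutional layer from features and optical flow, and positions are clamped at the border. All function names are hypothetical.

```python
import math

def initial_offsets(dilation):
    """Uniform 3x3 grid of (dy, dx) offsets scaled by the dilation rate (assumed grid)."""
    return [(dy * dilation, dx * dilation) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate a 2D map at fractional (y, x), clamping to the border."""
    height, width = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom

def sample_neighbours(feature_map, row, col, dilations, walks):
    """Sample neighbours of node (row, col) at initial position + dynamic walk.

    walks maps each dilation rate to a list of (wy, wx) fractional offsets,
    one per initial neighbour; in a real model these would be predicted.
    """
    samples = []
    for dilation in dilations:
        for (dy, dx), (wy, wx) in zip(initial_offsets(dilation), walks[dilation]):
            samples.append(bilinear_sample(feature_map, row + dy + wy, col + dx + wx))
    return samples

# Toy 4x4 ramp f(r, c) = 4r + c; bilinear interpolation is exact on it.
feature_map = [[4.0 * r + c for c in range(4)] for r in range(4)]
walks = {1: [(0.5, 0.25)] * 9}  # hand-picked walks for dilation rate 1
neighbours = sample_neighbours(feature_map, 1, 1, (1,), walks)
print(len(neighbours), neighbours[0])  # -> 9 2.25
```

With the dilation rates reported in the implementation details (1, 7 and 13), the same routine would be called with `dilations=(1, 7, 13)` and one list of walks per rate, so that the number of sampled nodes stays small while the reach grows.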
3.3 Graph Aggregation and Video Reconstruction
So far, the connections between the cross-scale, motion-aware, dynamically sampled nodes have been constructed as a graph. Next, an aggregation module is developed to aggregate the relevant features along both the spatial and temporal domains. This operation updates the feature maps, keeping the same dimension as the input, by utilizing the learned dynamic graph-structured features to capture non-local dependencies for better video reconstruction. The aggregation can be regarded as information interaction in each iteration, while the graph usually takes iterations for feature updating. To this end, we define a generic aggregation operation for updating a node at the th iteration step as:
(8) 
where is the corresponding updated node feature with the same size as ; represents the dynamically selected neighbouring node set of , containing the dynamic sampled neighbours at the frame. The correlation between two nodes is and normalized by . Specifically, there could be different choices for the pairwise function for calculating the relative relationship between two nodes. Here, the embedded Gaussian version is utilized to compute node relationships as:
(9) 
where the function is composed of CNNs for transforming the input node features to another representation space to calculate node similarities. Moreover, the weight is a learnable parameter to ‘reweight’ the node relationships of the frame, as we assume the node features of some frames (possibly the neighbouring frames) are more important for updating node than those of other frames. The node-conditioned relation calculation is similar to the self-attention mechanism in [6, 46], but is only performed on the dynamically selected relevant nodes. From this perspective, our aggregation can be regarded as an efficient variant of the self-attention mechanism adapted to video processing.
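The aggregation step can be illustrated with a minimal sketch for a single node. Several simplifications are assumed here and are not taken from the paper: the learned CNN embeddings are replaced by the identity, the similarity is an exponentiated dot product (in the spirit of the embedded Gaussian), and the per-frame weight multiplies the similarity before normalisation. The function name `aggregate_node` is hypothetical.

```python
import math

def aggregate_node(node, neighbours_per_frame, frame_weights):
    """Update one node as a normalised, similarity-weighted sum of its neighbours.

    node: feature vector of the node being updated.
    neighbours_per_frame: one list of neighbour feature vectors per frame.
    frame_weights: one positive weight per frame (learnable in a real model).
    """
    scores, vectors = [], []
    for weight, neighbours in zip(frame_weights, neighbours_per_frame):
        for neighbour in neighbours:
            similarity = sum(a * b for a, b in zip(node, neighbour))
            # Exponentiated dot product, reweighted per frame (assumed placement).
            scores.append(weight * math.exp(similarity))
            vectors.append(neighbour)
    total = sum(scores)  # normalisation factor over all selected neighbours
    return [sum(s * v[d] for s, v in zip(scores, vectors)) / total
            for d in range(len(node))]

# Toy example: a zero node makes all similarities equal, so only the
# frame weights decide the mixture.
node = [0.0, 0.0]
neighbours_per_frame = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(aggregate_node(node, neighbours_per_frame, [1.0, 3.0]))  # -> [0.25, 0.75]
```

Because the sum runs only over the sampled neighbours, the cost per node is proportional to the number of selected nodes rather than to the total number of pixels in the video.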
Up to now, the dynamic graph node features can be adaptively updated by aggregating the relevant information both spatially and temporally, regardless of the distance, for each node. The non-locally constructed graph facilitates extracting efficient and meaningful features, which are then fed into a 4-layer 3D CNN to reduce the channels and decode the final fine-grained reconstructed video as:
(10) 
Optimization: At the training stage, we optimize our proposed model with mean square error (MSE) loss, i.e.,
(11) 
where is the frame of our final reconstructed video, and is from the ground truth. In order to efficiently obtain a well-trained MadyGraph, we jointly train MadyGraph with the provided backbone model (a simple 3D CNN) with loss:
(12) 
After sufficient training, MadyGraph is ready to be “plugged” into any other backbone models to enhance their results.
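The training objective can be sketched in a few lines. The mean square error below is standard; the joint loss is written as an unweighted sum of the backbone loss and the refinement loss, which is an assumption, since the exact weighting is not recoverable from the text. The function names are hypothetical.

```python
def mse(prediction, target):
    """Mean squared error over all frames and pixels of a video (list of 2D frames)."""
    errors = [(p - t) ** 2
              for pred_frame, true_frame in zip(prediction, target)
              for pred_row, true_row in zip(pred_frame, true_frame)
              for p, t in zip(pred_row, true_row)]
    return sum(errors) / len(errors)

def joint_loss(coarse, refined, target):
    """Assumed joint objective: backbone MSE plus refinement MSE, equally weighted."""
    return mse(coarse, target) + mse(refined, target)

# Toy example with one 1x2 frame.
target = [[[0.0, 0.0]]]
coarse = [[[1.0, 1.0]]]
refined = [[[0.5, 0.5]]]
print(mse(refined, target))                 # -> 0.25
print(joint_loss(coarse, refined, target))  # -> 1.25
```

Training both terms together lets the refinement module see coarse inputs that evolve during training, which may help it generalise when later attached to other backbones.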
4 Experiments
4.1 Datasets and Implementation Details
Training and Test Datasets: Following [6], we choose DAVIS2017 [35] as the training set for all experiments, which has 90 different scenes with a total of 6208 frames at two resolutions: 480×894 and 1080×1920. As in [6], 26000 randomly cropped patch cubes (256×256×8) from the original scenes in DAVIS2017 are synthesized as training data. The widely used benchmark simulated test data, including Kobe, Runner, Drop, Traffic [27], Aerial and Vehicle [58], with the size of 256×256×8, are used for evaluation. For real data, we choose three scenes from two real SCI systems for testing: Wheel [28] with the size of 256×256×14, and Domino and Water Balloon [38] with the size of 512×512×10.
Table 1: PSNR (dB), SSIM, and running time (s) on the six benchmark simulated datasets.
Algorithm  Kobe  Traffic  Runner  Drop  Aerial  Vehicle  Average  Time (s)
PnP-FFDNet [58]  30.50, 0.926  24.18, 0.828  32.15, 0.933  40.70, 0.989  25.27, 0.829  25.42, 0.849  29.70, 0.892  3.0
E2E-CNN [38]  29.02, 0.861  23.45, 0.838  34.43, 0.958  36.77, 0.974  27.52, 0.882  26.40, 0.886  29.26, 0.900  0.023
BaseNet + MadyGraph  32.98, 0.953  29.04, 0.940  38.87, 0.978  42.60, 0.996  28.98, 0.911  28.02, 0.932  33.42, 0.952  0.31
GAP-TV [55]  26.45, 0.845  20.89, 0.715  28.81, 0.909  34.74, 0.970  25.05, 0.828  24.82, 0.838  26.79, 0.858  4.20
GAP-TV + MadyGraph  28.97, 0.911  23.12, 0.823  32.86, 0.955  38.96, 0.987  26.98, 0.882  26.13, 0.887  29.50, 0.908  4.29
DeSCI [27]  33.25, 0.952  28.72, 0.925  38.76, 0.969  43.22, 0.993  25.33, 0.860  27.04, 0.909  32.72, 0.935  6180
DeSCI + MadyGraph  35.42, 0.967  30.41, 0.947  40.71, 0.977  45.06, 0.994  26.76, 0.899  28.00, 0.929  34.39, 0.952  6180
BIRNAT [6]  32.71, 0.950  29.33, 0.942  38.70, 0.976  42.28, 0.992  28.99, 0.927  27.84, 0.927  33.31, 0.951  0.16
BIRNAT + MadyGraph  33.48, 0.944  29.95, 0.946  39.48, 0.978  42.88, 0.992  29.20, 0.921  27.99, 0.931  33.83, 0.952  0.26
RevSCI [5]  33.72, 0.957  30.02, 0.949  39.40, 0.977  42.93, 0.992  29.35, 0.924  28.12, 0.937  33.92, 0.956  0.19
RevSCI + MadyGraph  34.34, 0.962  30.59, 0.956  40.34, 0.981  43.56, 0.993  29.56, 0.928  28.20, 0.940  34.43, 0.960  0.30
DDUN [53]  35.02, 0.968  31.78, 0.964  40.91, 0.982  44.49, 0.994  30.58, 0.940  29.36, 0.955  35.36, 0.967  1.35
DDUN + MadyGraph  36.63, 0.975  32.60, 0.969  42.17, 0.985  45.59, 0.995  30.91, 0.946  29.76, 0.957  36.28, 0.971  1.45
Table 2: Comparison with enhancement methods, reported as PSNR (dB), SSIM, and running time (s).
Algorithm  Kobe  Traffic  Runner  Drop  Aerial  Vehicle  Average  Time (s)
RevSCI [5]  33.72, 0.957  30.02, 0.949  39.40, 0.977  42.93, 0.992  29.35, 0.924  28.12, 0.937  33.92, 0.956  0.19
RevSCI + EDVR [49]  31.71, 0.936  29.08, 0.944  36.64, 0.971  38.50, 0.986  29.06, 0.921  27.93, 0.934  32.15, 0.949  0.55
RevSCI + MIMO-UNet [7]  32.83, 0.951  29.82, 0.947  38.24, 0.974  37.15, 0.986  29.23, 0.922  28.06, 0.936  32.55, 0.953  0.35
RevSCI + fine-tuned MIMO-UNet [7]  33.75, 0.958  30.03, 0.949  39.40, 0.977  42.93, 0.992  29.41, 0.924  28.18, 0.937  33.95, 0.956  0.35
RevSCI + MadyGraph  34.34, 0.962  30.59, 0.956  40.34, 0.981  43.56, 0.993  29.56, 0.928  28.20, 0.940  34.43, 0.960  0.30
DDUN [53]  35.02, 0.968  31.78, 0.964  40.91, 0.982  44.49, 0.994  30.58, 0.940  29.36, 0.955  35.36, 0.967  1.35
DDUN + EDVR [49]  32.52, 0.955  29.98, 0.956  36.50, 0.973  38.32, 0.987  30.11, 0.936  29.07, 0.952  32.75, 0.960  1.71
DDUN + MIMO-UNet [7]  34.02, 0.964  31.34, 0.961  39.26, 0.979  36.43, 0.986  30.39, 0.938  29.28, 0.953  33.45, 0.964  1.51
DDUN + fine-tuned MIMO-UNet [7]  36.41, 0.975  32.41, 0.969  41.78, 0.984  43.95, 0.994  30.79, 0.945  29.73, 0.957  35.84, 0.971  1.51
DDUN + MadyGraph  36.63, 0.975  32.60, 0.969  42.17, 0.985  45.59, 0.995  30.91, 0.946  29.76, 0.957  36.28, 0.971  1.45
Counterparts and Evaluation Metrics: The proposed MadyGraph is compared with several SOTA methods, including the model-based methods GAP-TV [55] and DeSCI [27], the plug-and-play method PnP-FFDNet [58], the deep learning based methods E2E-CNN [38], BIRNAT [6], and RevSCI [5], and the SOTA method DDUN [53]. For the simulation data, both peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [51] are used as metrics to quantitatively evaluate the reconstruction quality.
Implementation Details: We choose 3 dilation rates (1, 7 and 13) for the cross-scale node sampling in the dynamic graph. For each latent node, we sample neighbouring nodes of each scale. The detailed architectures of MadyGraph and the baseline model are given in the supplementary material (SM). The Adam optimizer [21] is employed for optimization with the initial learning rate of . All experiments are implemented in PyTorch running on an RTX 8000 GPU.
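For reference, PSNR can be computed as in the following minimal sketch, which uses the standard definition 10·log10(peak² / MSE). The function name `psnr` and the assumption that pixel values lie in [0, 1] are illustrative; SSIM is more involved and is not sketched here.

```python
import math

def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two equally sized 2D frames with values in [0, peak]."""
    squared = [(r - e) ** 2
               for ref_row, est_row in zip(reference, estimate)
               for r, e in zip(ref_row, est_row)]
    mean_squared_error = sum(squared) / len(squared)
    if mean_squared_error == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mean_squared_error)

# Toy example: a uniform error of 0.1 on a [0, 1] scale gives about 20 dB.
reference = [[0.0, 1.0]]
estimate = [[0.1, 0.9]]
print(round(psnr(reference, estimate), 2))  # -> 20.0
```

A per-video score is commonly obtained by averaging the per-frame values, although the exact averaging protocol used for the reported numbers is not restated here.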
4.2 Results on Simulated Data
The results on the six widely used benchmark simulated datasets are given in Table 1 and Fig. 4. We apply the proposed MadyGraph to six backbones: a simple baseline network (called BaseNet, composed of 3D CNNs) to verify the efficiency of the proposed model, two iterative optimization methods, GAP-TV [55] and DeSCI [27], and three deep networks, BIRNAT [6], RevSCI [5] and DDUN [53]. Note that we only jointly train MadyGraph with the simple baseline model, and then directly append the well-trained MadyGraph to the other backbones. It can be observed that the proposed MadyGraph improves SOTA methods, providing {2.71 dB, 1.67 dB, 0.52 dB, 0.51 dB, 0.92 dB} higher average PSNR than GAP-TV, DeSCI, BIRNAT, RevSCI, and DDUN, respectively, with a short extra running time (about 0.1 s). This demonstrates both the effectiveness and efficiency of the proposed method for improving SCI reconstruction. Fig. 4 plots selected reconstructed frames of different algorithms. Compared with the counterparts, our proposed MadyGraph enhancement provides cleaner and sharper corners and finer details with less noise.
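The reported gains follow directly from the average PSNR columns of Table 1, as the short check below shows. The dictionary simply copies the average PSNR of each backbone without and with MadyGraph; the variable names are illustrative.

```python
# Average PSNR (dB) from Table 1: (backbone alone, backbone + MadyGraph).
averages = {
    "GAP-TV": (26.79, 29.50),
    "DeSCI": (32.72, 34.39),
    "BIRNAT": (33.31, 33.83),
    "RevSCI": (33.92, 34.43),
    "DDUN": (35.36, 36.28),
}

# Gain in dB for each backbone, rounded to two decimals.
gains = {name: round(after - before, 2) for name, (before, after) in averages.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} dB")
```

The gain is largest for the weakest backbone (GAP-TV) and smallest for the learned backbones BIRNAT and RevSCI.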
Furthermore, we compare our add-on MadyGraph with image or video enhancement methods to show the superiority of our model for video SCI. Since the purpose of MadyGraph is to enhance the reconstruction results of different SCI methods, it is to some extent similar to traditional enhancement tasks such as deblurring and denoising, i.e., given a degraded result, to restore a more satisfactory one. In particular, we conduct the comparisons in two steps: 1) directly apply the enhancement methods to the coarse reconstructions of different existing SCI methods; 2) fine-tune the enhancement networks with the coarse results as inputs and the ground-truth videos as outputs. The detailed results are given in Table 2, where we first utilize two SOTA deblurring models, EDVR [49] (for video deblurring) and MIMO-UNet [7] (for single image deblurring), to enhance the results from RevSCI [5] and DDUN [53]. Surprisingly but reasonably, the results degrade to a certain degree after deblurring the coarse reconstructions, mainly because the deblurring network is heavily overfitted to its degradation model [42, 22]. Once the degradation pattern changes, the reconstruction quality decreases. Furthermore, we fine-tune the pre-trained MIMO-UNet based on the two SCI reconstruction methods, which performs better than directly utilizing the pre-trained model but yields lower quality and longer inference time than our proposed MadyGraph.
Table 3: Ablation study on the six benchmark simulated datasets.
Model  PSNR (dB)  SSIM  Parameters  MACs  Time (s)
BaseNet  32.20  0.932  2.53  6.56  0.21
+DW  33.12  0.945  3.21  7.17  0.25
+DW+CS  33.28  0.949  3.38  7.29  0.26
+DW+CS+MA  33.42  0.952  8.75  -  0.31
To showcase how the proposed model finds related clues to update the representation for reconstructing videos, we visualize the intrinsic dynamic relationships in Fig. 5. Intuitively, the proposed model computes the response at the central position (e.g., the yellow point) according to the dynamically selected nodes (e.g., the red points). As we can see, the central point is at the lamp of a moving car, which becomes more visible in the later frames. We observe that: 1) most of the dynamically selected relevant nodes are located on the car lamp or at the surrounding related positions; 2) the selected neighbours tend to diffuse around the lamp as time goes by and, meanwhile, concentrate at the right of the picture, in accordance with the moving object; and 3) the aggregation weights of neighbouring nodes in frame #1 are lower than 0.2, since the lamp does not appear in the image. This demonstrates that the proposed model is able to dynamically capture meaningful and reasonable correlations regardless of the distance across space and time in the video.
4.3 Ablation Study
To quantitatively verify the contribution of each module in MadyGraph, we add the components step by step; the results on the six simulated benchmark datasets are shown in Table 3. Compared with the baseline, the dynamic walk (DW) upon initial sampling significantly increases the reconstruction quality (0.92 dB) with a minor parameter overhead by dynamically modeling non-local dependencies. Moreover, the cross-scale (CS) node sampling scheme also contributes to the performance (0.16 dB) by enlarging the receptive field. Furthermore, the motion-aware (MA) mechanism gives a further improvement (0.14 dB) by leveraging motion knowledge for better video representation. The final version achieves the best result, which indicates that MadyGraph is an effective combination of its components.
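The per-component gains quoted above follow directly from the PSNR column of Table 3. The short check below recomputes them; the helper name `incremental_gains` is hypothetical and introduced only for this illustration.

```python
# PSNR values (dB) copied from the ablation table, in the order components are added.
psnr_rows = [
    ("BaseNet", 32.20),
    ("+DW", 33.12),
    ("+DW+CS", 33.28),
    ("+DW+CS+MA", 33.42),
]


def incremental_gains(rows):
    """Return the PSNR gain of each configuration over the previous one."""
    return [(cur_name, round(cur_psnr - prev_psnr, 2))
            for (_, prev_psnr), (cur_name, cur_psnr) in zip(rows, rows[1:])]


for name, gain in incremental_gains(psnr_rows):
    print(f"{name}: +{gain:.2f} dB")

# Overall gain of the full model over the baseline.
total_gain = round(psnr_rows[-1][1] - psnr_rows[0][1], 2)
print(f"total: +{total_gain:.2f} dB")
```

The three increments (0.92, 0.16, and 0.14 dB) add up to an overall gain of 1.22 dB over BaseNet.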
4.4 Results on Real Data
To verify the robustness of our proposed algorithm, we conduct experiments on real data captured by SCI cameras [28, 59], which is more challenging because of real-world noise. From the Wheel snapshot measurement of size 256×256, we recover 14 high-speed frames. As shown in Fig. 6, the letter ‘D’ reconstructed by our proposed model has clearer and smoother details with fewer artifacts. For testing at a larger scale, the snapshot measurements Domino and Water Balloon are each recovered as a video of size 512×512×10. It can be observed that the proposed model generates sharper edges with less noise and provides more accurate contours, owing to the efficient modeling of non-local spatial-temporal dependencies. Meanwhile, the proposed method significantly reduces testing time compared with DeSCI (several hours). Overall, the experiments demonstrate both the applicability and the efficiency of our algorithm on real data.
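For readers unfamiliar with how a single snapshot encodes several frames, the sketch below simulates the standard video SCI forward model, in which each frame is modulated by its own mask and the modulated frames are summed into one 2D measurement. It is only an illustration under stated assumptions: the masks are random binary patterns, noise is omitted, the toy spatial size is 8×8 rather than 256×256, and the function names `random_binary_masks` and `sci_measurement` are hypothetical.

```python
import random


def random_binary_masks(num_frames, height, width, seed=0):
    """Draw one random 0/1 modulation mask per frame (an assumption of this sketch)."""
    rng = random.Random(seed)
    return [[[rng.randint(0, 1) for _ in range(width)]
             for _ in range(height)]
            for _ in range(num_frames)]


def sci_measurement(frames, masks):
    """Compress frames into one snapshot: Y[i][j] = sum_t masks[t][i][j] * frames[t][i][j]."""
    if len(frames) != len(masks):
        raise ValueError("need exactly one mask per frame")
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(mask[i][j] * frame[i][j] for frame, mask in zip(frames, masks))
             for j in range(width)]
            for i in range(height)]


# Toy example: 14 constant frames compressed into a single 2D measurement.
num_frames, height, width = 14, 8, 8
frames = [[[1.0] * width for _ in range(height)] for _ in range(num_frames)]
masks = random_binary_masks(num_frames, height, width)
measurement = sci_measurement(frames, masks)
print(len(measurement), len(measurement[0]))
```

Reconstruction is the inverse task: given the measurement and the masks, recover all frames, which is why a single 256×256 snapshot can yield 14 frames.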
5 Conclusions
We present a motion-aware dynamic graph that captures meaningful non-local dependencies regardless of their distance in space and time, under the guidance of the motion information across video frames. The proposed MadyGraph is an efficient, generic, and “plug-and-play” module, which can be attached to any backbone network to boost the reconstruction performance of video compressive sensing. The experimental results on both simulated and real SCI data demonstrate the effectiveness and efficiency of the proposed MadyGraph. Beyond the video SCI system studied in this paper, MadyGraph can potentially be extended to a wider range of application scenarios in the future, wherever non-local modeling is critical.
References
 [1] Yoann Altmann, Stephen McLaughlin, Miles J Padgett, Vivek K Goyal, Alfred O Hero, and Daniele Faccio. Quantum-inspired computational imaging. Science, 361(6403), 2018.
 [2] J. M. Bioucas-Dias and M. A. T. Figueiredo. A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 16(12):2992–3004, 2008.
 [3] Antoni Buades, Bartomeu Coll, and JM Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition (CVPR), 2005.
 [4] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Transactions on Information Theory, 2006.
 [5] Z. Cheng, B. Chen, G. Liu, H. Zhang, R. Lu, Z. Wang, and X. Yuan. Memory-efficient network for large-scale video compressive sensing. In Computer Vision and Pattern Recognition (CVPR), 2021.
 [6] Ziheng Cheng, Ruiying Lu, Zhengjue Wang, Hao Zhang, Bo Chen, Ziyi Meng, and Xin Yuan. BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging. In European Conference on Computer Vision (ECCV), 2020.
 [7] Sung-Jin Cho, Seo-Won Ji, Jun-Pyo Hong, Seung-Won Jung, and Sung-Jea Ko. Rethinking coarse-to-fine approach in single image deblurring. In International Conference on Computer Vision (ICCV), pages 4641–4650, 2021.
 [8] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, August 2007.
 [9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
 [10] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, pages 576–584, 2015.
 [11] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015.
 [12] David L Donoho et al. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
 [13] Emmanuel Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 2006.
 [14] Xueyang Fu, Xi Wang, Aiping Liu, Junwei Han, and Zheng-Jun Zha. Learning dual priors for jpeg compression artifacts removal. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4086–4095, 2021.
 [15] Jun Guo and Hongyang Chao. One-to-many network for visually pleasing compression artifacts reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3038–3047, 2017.
 [16] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, pages 1024–1034, 2017.
 [17] Xin He, Qiong Liu, and You Yang. Mv-gnn: Multi-view graph neural network for compression artifacts reduction. IEEE Transactions on Image Processing, 29:6829–6840, 2020.
 [18] Yasunobu Hitomi, Jinwei Gu, Mohit Gupta, Tomoo Mitsunaga, and Shree K Nayar. Video from a single coded exposure photograph using a learned overcomplete dictionary. In 2011 International Conference on Computer Vision, pages 287–294. IEEE, 2011.
 [19] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8981–8989, 2018.
 [20] S. Jalali and X. Yuan. Snapshot compressed sensing: Performance bounds and algorithms. IEEE Transactions on Information Theory, 65(12):8005–8024, Dec 2019.
 [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [22] Orest Kupyn, Tetiana Martyniuk, Junru Wu, and Zhangyang Wang. Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
 [23] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 631–636, 2006.
 [24] Yuqi Li, Miao Qi, Rahul Gulve, Mian Wei, Roman Genov, Kiriakos N Kutulakos, and Wolfgang Heidrich. End-to-end video compressive sensing using anderson-accelerated unrolled networks. In 2020 IEEE International Conference on Computational Photography (ICCP), pages 1–12. IEEE, 2020.
 [25] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.
 [26] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
 [27] Yang Liu, Xin Yuan, Jinli Suo, David Brady, and Qionghai Dai. Rank minimization for snapshot compressive imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):2990–3006, Dec 2019.
 [28] Patrick Llull, Xuejun Liao, Xin Yuan, Jianbo Yang, David Kittle, Lawrence Carin, Guillermo Sapiro, and David J Brady. Coded aperture compressive temporal imaging. Optics Express, 21(9):10526–10545, 2013.
 [29] Jiawei Ma, Xiaoyang Liu, Zheng Shou, and Xin Yuan. Deep tensor ADMM-Net for snapshot compressive imaging. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
 [30] Xiao Ma, Xin Yuan, Chen Fu, and Gonzalo R. Arce. Led-based compressive spectral temporal imaging system. Optics Express, 2021.
 [31] Ziyi Meng, Shirin Jalali, and Xin Yuan. Gap-net for snapshot compressive imaging. arXiv: 2012.08364, December 2020.
 [32] Ziyi Meng, Jiawei Ma, and Xin Yuan. End-to-end low cost compressive spectral imaging with spatial-spectral self-attention. In European Conference on Computer Vision (ECCV), August 2020.
 [33] Xin Miao, Xin Yuan, Yunchen Pu, and Vassilis Athitsos. λ-net: Reconstruct hyperspectral images from a snapshot measurement. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
 [34] Seungjun Nah, Sanghyun Son, and Kyoung Mu Lee. Recurrent neural networks with intra-frame iterations for video deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8102–8111, 2019.
 [35] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbelaez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. CoRR, abs/1704.00675, 2017.
 [36] Mu Qiao, Xuan Liu, and Xin Yuan. Snapshot spatial–temporal compressive imaging. Opt. Lett., 45(7):1659–1662, Apr 2020.
 [37] Mu Qiao, Xuan Liu, and Xin Yuan. Snapshot temporal compressive microscopy using an iterative algorithm with untrained neural networks. Opt. Lett., 2021.
 [38] Mu Qiao, Ziyi Meng, Jiawei Ma, and Xin Yuan. Deep learning for video compressive sensing. APL Photonics, 5(3):030801, 2020.
 [39] Dikpal Reddy, Ashok Veeraraghavan, and Rama Chellappa. P2c2: Programmable pixel compressive camera for high speed imaging. In CVPR 2011, pages 329–336. IEEE, 2011.
 [40] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Neural Information Processing Systems (NIPS), 2014.
 [41] Yangyang Sun, Xin Yuan, and Shuo Pang. Compressive high-speed stereo imaging. Opt Express, 25(15):18182–18190, 2017.
 [42] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [43] Matias Tassano, Julie Delon, and Thomas Veit. Dvdnet: A fast network for deep video denoising. In 2019 IEEE International Conference on Image Processing (ICIP), pages 1805–1809, 2019.
 [44] Matias Tassano, Julie Delon, and Thomas Veit. Fastdvdnet: Towards real-time deep video denoising without flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1354–1363, 2020.
 [45] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3360–3369, 2020.
 [46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017.
 [47] Ashwin Wagadarikar, Renu John, Rebecca Willett, and David Brady. Single disperser design for coded aperture snapshot spectral imaging. Applied Optics, 47(10):B44–B51, 2008.
 [48] Ashwin A Wagadarikar, Nikos P Pitsianis, Xiaobai Sun, and David J Brady. Video rate spectral imaging using a coded aperture snapshot spectral imager. Optics Express, 17(8):6368–6388, 2009.
 [49] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
 [50] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
 [51] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
 [52] Zhengjue Wang, Hao Zhang, Ziheng Cheng, Bo Chen, and Xin Yuan. Metasci: Scalable and adaptive reconstruction for video compressive sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2083–2092, 2021.
 [53] Zhuoyuan Wu, Jian Zhang, and Chong Mou. Dense deep unfolding network with 3D-CNN prior for snapshot compressive imaging. In IEEE International Conference on Computer Vision (ICCV), 2021.
 [54] Jianbo Yang, Xin Yuan, Xuejun Liao, Patrick Llull, David J Brady, Guillermo Sapiro, and Lawrence Carin. Video compressive sensing using Gaussian mixture models. IEEE Transaction on Image Processing, 23(11):4863–4878, November 2014.
 [55] Xin Yuan. Generalized alternating projection based total variation minimization for compressive sensing. In 2016 IEEE International Conference on Image Processing (ICIP), pages 2539–2543, Sept 2016.
 [56] X. Yuan, D. J. Brady, and A. K. Katsaggelos. Snapshot compressive imaging: Theory, algorithms, and applications. IEEE Signal Processing Magazine, 38(2):65–88, 2021.
 [57] X. Yuan, H. Jiang, G. Huang, and P. Wilford. Compressive sensing via low-rank Gaussian mixture models. arXiv:1508.06901, 2015.
 [58] Xin Yuan, Yang Liu, Jinli Suo, and Qionghai Dai. Plug-and-play algorithms for large-scale snapshot compressive imaging. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
 [59] Xin Yuan, Patrick Llull, Xuejun Liao, Jianbo Yang, David J. Brady, Guillermo Sapiro, and Lawrence Carin. Low-cost compressive sensing for color video and depth. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3318–3325, 2014.
 [60] Xin Yuan, Yang Liu, Jinli Suo, Frédo Durand, and Qionghai Dai. Plug-and-play algorithms for video snapshot compressive imaging. arXiv: 2101.04822, Jan 2021.
 [61] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for CNN-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018.
 [62] Li Zhang, Dan Xu, Anurag Arnab, and Philip H. S. Torr. Dynamic graph message passing networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3723–3732, 2020.
 [63] Y Zhang, K Li, K Li, B Zhong, and Y Fu. Residual nonlocal attention networks for image restoration. In International Conference on Learning Representations, 2019.