
Motion-aware Dynamic Graph Neural Network for Video Compressive Sensing

03/01/2022
by   Ruiying Lu, et al.

Video snapshot compressive imaging (SCI) utilizes a 2D detector to capture sequential video frames and compresses them into a single measurement. Various reconstruction methods have been developed to recover the high-speed video frames from the snapshot measurement. However, most existing reconstruction methods are incapable of capturing long-range spatial and temporal dependencies, which are critical for video processing. In this paper, we propose a flexible and robust approach based on graph neural network (GNN) to efficiently model non-local interactions between pixels in space as well as time regardless of the distance. Specifically, we develop a motion-aware dynamic GNN for better video representation, i.e., represent each pixel as the aggregation of relative nodes under the guidance of frame-by-frame motions, which consists of motion-aware dynamic sampling, cross-scale node sampling and graph aggregation. Extensive results on both simulation and real data demonstrate both the effectiveness and efficiency of the proposed approach, and the visualization clearly illustrates the intrinsic dynamic sampling operations of our proposed model for boosting the video SCI reconstruction results. The code and models will be released to the public.


1 Introduction

Inspired by compressive sensing (CS) [13, 4, 12], various computational imaging systems [1, 56] have been developed to enhance imaging performance. In particular, snapshot compressive imaging (SCI) is an important branch of computational imaging with wide applications [18, 47, 39]: it utilizes a 2-dimensional (2D) camera to capture the desired 3-dimensional (3D) data (videos or hyperspectral images) by imposing modulations and then compressing them into a single frame (measurement). Different from conventional imaging systems employing direct sampling strategies, such SCI systems compress the high-dimensional signals along time (i.e., CACTI [28, 59]) or spectrum (i.e., CASSI [48]) to obtain the compressed measurements, leading to promising advantages in acquisition efficiency, storage consumption, and low bandwidth footprints.

Figure 1: (a) An illustration of the spatial-temporal non-local correlations between the central point (yellow point) and the most related points (red points, learned by our model) in adjacent frames, and the motion directions around the central point. (b) Comparisons of reconstruction quality and inference time of SOTA methods after appending MadyGraph. The same color represents the same backbone; the two markers of each color denote the results with and without MadyGraph, respectively.

Taking the compressed measurements as input, SCI reconstruction algorithms focus on efficiently and effectively recovering the desired video frames. The first mainstream of methods is based on optimization models, which differ in the prior knowledge they exploit, e.g., total variation (TV) in GAP-TV [55] and TwIST [2], sparsity in GMM-based methods [54, 57], and non-local low rank in DeSCI [27]. The second mainstream is based on deep learning networks, which learn a direct mapping from measurements to video frames, such as an off-the-shelf U-net [38], RevSCI [5] based on 3D convolutional neural networks (CNN), and BIRNAT [6] based on recurrent neural networks (RNN). Recently, some researchers have combined deep networks with optimization methods to reconstruct the desired videos, such as deep unfolding [53, 24, 29, 31] and plug-and-play algorithms [58, 60]. Typically, for video reconstruction, capturing non-local interactions is of central importance, since similar pixel patterns frequently recur between distant pixels across space-time [40, 50]. However, although existing approaches can reconstruct the videos decently, they are limited in efficiently capturing long-range dependencies across video frames, restricting further performance improvement.

As the most commonly used deep learning building blocks, both convolutional and recurrent operations process one local neighborhood at a time, either in space or in time. While the straightforward way of stacking deep CNNs or RNNs can naturally extend the receptive field, this strategy is inherently limited by computational inefficiency, optimization difficulties, and multi-hop dependency modeling, i.e., the difficulty of delivering messages back and forth between distant positions. On the other hand, although self-attention mechanisms have been applied to capture global (non-local) spatial dependencies for SCI [6, 33, 32], they are inherently limited in the following respects: 1) severe computation overhead, 2) redundant information from capturing relations across all pairs of pixels, and 3) operating only along the spatial dimension while neglecting the temporal dimension. Consequently, there remains a gap in constructing an elaborately designed model that efficiently captures both the local and global spatial-temporal dependencies for video SCI reconstruction.

Bearing these concerns in mind, we propose a graph neural network (GNN) based approach to directly compute interactions between any two positions, regardless of their spatial-temporal distance, for better video representation learning. Furthermore, a dynamic sampling scheme is designed to select the most relevant neighbours for each node, alleviating the computational issues while enjoying the advantages of graph-structured features. In addition to modeling the video inputs, we observe that optical flow, as an off-the-shelf module, is helpful for finding non-local dependencies [50], motivating us to incorporate optical flow into our graph construction. As depicted in Fig. 1 (a), the motion information can help the network find the most relevant neighbours. Typically, there are two ways to further improve the current state-of-the-art (SOTA) results: a) developing a new end-to-end network accounting for non-local spatial and temporal dependencies, or b) building an efficient, flexible, and generic add-on module appended to existing networks to improve their results. Clearly, a) is computationally expensive, consuming large GPU memory and long training time. Thus, in this paper, instead of designing more complicated network architectures, we adopt b) to address the aforementioned shortcomings and efficiently model non-local spatial-temporal dependencies.

Consequently, in this paper, we propose a flexible and robust approach based on a motion-aware dynamic graph (dubbed MadyGraph) to improve existing SOTA models for video SCI. As shown in Fig. 1 (b), the reconstruction quality can be significantly improved by our proposed MadyGraph. To elaborate, we design this form around the following concerns: i) capture dependencies regardless of distance along both the spatial and temporal dimensions (achieved by graph construction); ii) flexibly aggregate the most relevant contextual information without much redundancy (achieved by dynamic sampling); iii) leverage appropriate priors in our model, such as motion information (achieved by motion-aware dynamic walks) and sparsity plus non-local self-similarity (achieved by cross-scale node sampling), to better exhibit correlations between nodes. In the overall framework, MadyGraph takes the reconstructed videos as input, first constructs relationships among dynamically selected spatial-temporal related nodes, and then aggregates them together to enable effective feature extraction for enhancing video reconstruction. Specific contributions of this work are summarized as follows:


  • A motion-aware dynamic graph network (MadyGraph) is developed to efficiently model non-local spatial-temporal dependencies with frame-by-frame motion information for better video representation. To the best of our knowledge, this is the first time a graph neural network is utilized for SCI reconstruction.

  • The proposed flexible and lightweight module is capable of “plugging” into any existing reconstruction approach to improve its results, while alleviating the computational issues to a certain extent (roughly linear in the number of pixels, versus the quadratic cost of other non-local networks such as the self-attention module), in accordance with the realistic computational requirements of video SCI.

  • We showcase the effectiveness of the proposed dynamic graph in the application of video SCI, which achieves considerable improvements with respect to SOTA baselines on both simulation and real datasets.

2 Video Snapshot Compressive Imaging

2.1 Mathematical Model of Video SCI

Figure 2: Principle of video SCI system. The original video frames are modulated by dynamic masks and then compressed by the camera into a snapshot measurement, then decoded by reconstruction algorithms to recover the video.

In video SCI, the 3D video is modulated by dynamic masks and then compressed into a ‘measurement’ (a coded 2D frame) along time, which is later decoded by effective reconstruction methods to recover the original video. As depicted in Fig. 2, $B$ high-speed video frames $\{\mathbf{X}_k\}_{k=1}^{B}$ are modulated by coding patterns (masks) $\{\mathbf{C}_k\}_{k=1}^{B}$, and then integrated over time on a camera. Mathematically, the compressed coded measurement frame $\mathbf{Y}$ can be expressed as

$\mathbf{Y} = \sum_{k=1}^{B} \mathbf{C}_k \odot \mathbf{X}_k + \mathbf{N},$   (1)

where $\odot$ denotes the Hadamard (element-wise) product and $\mathbf{N}$ refers to the noise. It has been proved that high quality reconstruction is achievable even when $B>1$ [20]. In this paper, we focus on the inverse problem, i.e., aiming to improve the reconstruction quality of the video frames $\{\mathbf{X}_k\}_{k=1}^{B}$, given $\mathbf{Y}$, $\{\mathbf{C}_k\}_{k=1}^{B}$, and existing reconstruction algorithms.
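To make Eq. (1) concrete, the short sketch below simulates a snapshot measurement from $B$ frames and binary masks in PyTorch. The tensor sizes, binary mask statistics, and noise level are illustrative assumptions rather than the exact settings of the paper.

```python
import torch

def sci_forward(frames: torch.Tensor, masks: torch.Tensor, noise_std: float = 0.0) -> torch.Tensor:
    """Simulate Eq. (1): Y = sum_k C_k * X_k + N.

    frames: (B, H, W) high-speed video frames X_k
    masks:  (B, H, W) modulation masks C_k
    Returns the single 2D snapshot measurement Y of shape (H, W).
    """
    measurement = (masks * frames).sum(dim=0)          # Hadamard product, then integration over time
    if noise_std > 0:
        measurement = measurement + noise_std * torch.randn_like(measurement)
    return measurement

# Toy example: 8 frames of 256x256 with random binary masks (illustrative values only).
B, H, W = 8, 256, 256
frames = torch.rand(B, H, W)
masks = (torch.rand(B, H, W) > 0.5).float()
y = sci_forward(frames, masks, noise_std=0.01)
print(y.shape)  # torch.Size([256, 256])
```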

2.2 Related Work

Video Snapshot Compressive Imaging: Video SCI is a hardware-encoder-plus-software-decoder system [56]. For the encoder, the key part is the spatial light modulator that modulates the original scene, e.g., a physical mask [28, 59] or different patterns on a digital micromirror device (DMD) [18, 38, 30, 36, 37, 41]. Regarding the decoder, some optimization-model-based methods introduce different priors to recover the desired videos by iterative optimization, e.g., GAP-TV [55], GMM [54, 57], and DeSCI [27], but they suffer from extremely long and impractical reconstruction times for real applications. In recent years, a boost in SCI reconstruction efficiency was achieved by introducing deep learning based methods [5, 6, 52, 31, 24], which can recover the video not only within tens or hundreds of milliseconds but also with high quality. Most recently, a dense deep unfolding network (DDUN) [53] was designed with a 3D-CNN prior, combining the merits of optimization-model-based and deep learning based methods and achieving SOTA performance in video SCI. Nevertheless, these approaches usually neglect the long-range dependencies over the space-time volume during reconstruction, leaving room for efficient non-local modeling.

Non-local image/video processing: In a natural image or video, similar pixel patterns frequently recur, so many non-local methods have shown pleasing performance on different tasks [3, 8]. The non-local operation computes the response at a position as a weighted sum of the features at all positions, across space, time, or space-time [50, 26, 63], but suffers from exhaustive computation. Graph neural network (GNN) based methods, which propagate information along graph-structured input data, can alleviate the computational issues to a certain extent. The works most related to our proposed model are GraphSAGE [16] and DGMN [62], which utilize sampled graph nodes to capture position-based context. However, GraphSAGE [16] simply samples nodes uniformly at fixed positions along the spatial dimension, independent of the actual input. Absorbing ideas from deformable convolution [9], DGMN [62] adaptively samples graph nodes for message passing according to the input. Nevertheless, it only explores the spatial dimension without extending to the temporal dimension. Furthermore, in DGMN, the offset of each node is estimated only from the features of a set of sampled nodes without checking the whole feature map, resulting in insufficient information exploration. Different from it, we extend non-local modeling to both space and time while keeping a lightweight computational cost, and we dynamically sample related nodes not only according to the whole feature map but also taking the motion information between frames into consideration.

Video Enhancement: Owing to the success of deep learning, researchers model image or video enhancement tasks as regression problems, i.e., given a degraded image or video, a well-designed network outputs an enhanced one. Various deep neural networks have been designed for super-resolution [11, 49, 45, 25], denoising [61, 43, 44], deblurring [49, 34, 7], and compression artifacts reduction [10, 15, 17, 14]. To a certain degree, the purpose of our proposed method is similar to that of video enhancement methods, both aiming to improve the quality of an existing degraded video. However, the enhancement methods are incompatible with the video SCI task, as discussed in our experimental section. Different from traditional enhancement tasks, we leverage the hardware information of the SCI system in MadyGraph, where both the inputs and the network architecture are elaborately designed for video SCI reconstruction.

3 The Proposed Model

Figure 3: Illustration of the whole framework. The coarse video from the backbone model is first re-masked and mapped through a 3D CNN, constructed as a graph with cross-scale motion-aware dynamic sampling, then aggregated and finally reconstructed as a fine-grained video.

As depicted in Fig. 3, given the compressed measurement $\mathbf{Y}$ captured by the SCI system and the corresponding masks $\{\mathbf{C}_k\}_{k=1}^{B}$, our goal is to output the fine-grained reconstructed video. Overall, our proposed model consists of three steps, as shown in Fig. 3: 1) The input coarse video frames are obtained from a backbone (any existing) model, and are then re-masked and transformed into high-dimensional features through a 4-layer 3D CNN. 2) The dynamic graph is constructed by automatically selecting the most related neighbours across space-time for each node with two mechanisms: motion-aware dynamic sampling and cross-scale node sampling. 3) The dynamically constructed graph is aggregated in the feature domain and then mapped through a 4-layer 3D CNN into the desired reconstructed video.

3.1 Input Re-masking and Feature Extraction

At the beginning, the coarsely reconstructed video is obtained through the backbone model. Here, the backbone model refers to any existing video SCI model, including both optimization based and deep learning based models. To provide a baseline, we also design a simple 3D CNN model for coarse video estimation, which is omitted here and detailed in the supplementary material (SM). In terms of the network input, a straightforward way is to directly use the reconstructed videos from the backbone models as inputs. However, after extensive experiments, we found it difficult to obtain good results in this manner, since critical information might be smoothed out in the coarsely reconstructed videos. Considering that knowledge of the hardware system is critical for SCI decoding, we develop a re-masking operation to better fit video SCI.

Based on the coarsely reconstructed video $\{\hat{\mathbf{X}}_k\}_{k=1}^{B}$ from the backbone model, we re-mask each frame to enrich it with information about the video SCI system, i.e., the measurement $\mathbf{Y}$ and the other re-modulated frames. This step aims to reuse information that might be neglected or smoothed out by the backbone models. Recalling the mathematical model of video SCI in Eq. (1), we re-mask each frame by:

$\tilde{\mathbf{X}}_k = \mathbf{Y} - \sum_{j \neq k} \mathbf{C}_j \odot \hat{\mathbf{X}}_j,$   (2)

where $\tilde{\mathbf{X}}_k$ denotes the re-modulated frame, acquired by subtracting the summation of the other masked frames from the measurement. Thus, in the re-masking operation of each frame, all the inputs of the original system and the coarse video are incorporated to enrich its information, facilitating further feature extraction. After acquiring the re-masked frames $\tilde{\mathcal{X}} = \{\tilde{\mathbf{X}}_k\}_{k=1}^{B}$, we construct a 4-layer 3D CNN to encode the data cube into high-dimensional features, as:

$\mathcal{F} = \mathrm{CNN}_{3D}(\tilde{\mathcal{X}}),$   (3)

where $\mathcal{F} \in \mathbb{R}^{C \times B \times H \times W}$ is a 4D tensor, $C$ is the channel dimension, and $B$ refers to the temporal dimension (the number of video frames). After the 3D CNN operates on the re-masked video frames, the encoded feature of each frame is a 3D tensor, and the representation of each pixel in a single frame corresponds to a vector of dimension $C$.
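A minimal sketch of the re-masking of Eq. (2) followed by a 4-layer 3D CNN encoder in the spirit of Eq. (3). The channel width, kernel sizes, and activation choices are assumptions for illustration, not the released architecture (which is detailed in the SM).

```python
import torch
import torch.nn as nn

def remask(coarse: torch.Tensor, masks: torch.Tensor, measurement: torch.Tensor) -> torch.Tensor:
    """Eq. (2): for each frame k, subtract the other masked coarse frames from Y.

    coarse:      (B, H, W) coarse reconstruction from the backbone
    masks:       (B, H, W) modulation masks C_k
    measurement: (H, W)    snapshot measurement Y
    """
    masked = masks * coarse                              # C_k * x_hat_k for every frame
    total = masked.sum(dim=0, keepdim=True)              # sum_j C_j * x_hat_j
    return measurement.unsqueeze(0) - (total - masked)   # Y - sum_{j != k} C_j * x_hat_j

class Encoder3D(nn.Module):
    """4-layer 3D CNN mapping re-masked frames to a (C, B, H, W) feature volume (illustrative widths)."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, channels, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv3d(channels, channels, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv3d(channels, channels, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv3d(channels, channels, 3, padding=1),
        )
    def forward(self, x):                                # x: (B, H, W) re-masked frames
        x = x.unsqueeze(0).unsqueeze(0)                  # (1, 1, B, H, W) for Conv3d
        return self.net(x).squeeze(0)                    # (C, B, H, W)

B, H, W = 8, 64, 64
coarse, masks = torch.rand(B, H, W), (torch.rand(B, H, W) > 0.5).float()
y = (masks * coarse).sum(0)
features = Encoder3D()(remask(coarse, masks, y))
print(features.shape)                                    # torch.Size([32, 8, 64, 64])
```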

3.2 Motion-aware Dynamic Node Sampling

Given the feature maps $\mathcal{F}$, we hereafter present the detailed steps to dynamically construct a motion-aware graph.

3.2.1 Graph Definition and Notation

Firstly, we present the feature maps in the graph domain by constructing a feature graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of node features and $\mathcal{E}$ represents the connections between nodes. Specifically, we define each initial node feature of the graph as the latent feature vector of a single pixel at one time step in $\mathcal{F}$, i.e., $\mathbf{h}_i^{(0)} \in \mathbb{R}^{C}$. Formally, the graph representation learning can be described as a weighted summation:

$\mathbf{h}_i^{(l)} = \sum_{j \in \mathcal{N}(i)} A_{ij}\, f\big(\mathbf{h}_j^{(l-1)}\big),$   (4)

where $\mathbf{h}_i^{(l)}$ represents the node feature at the $l$-th iteration, computed as a weighted summation of $\mathbf{h}_j^{(l-1)}$ at the previous iteration. The initial node features $\mathbf{h}_i^{(0)}$ come from the initial input feature maps $\mathcal{F}$. $A_{ij}$ denotes the connection relationship between nodes $i$ and $j$, and $\mathcal{N}(i)$ refers to the self-included neighbour set of node $i$. Here, $f(\cdot)$ is a transmit operation imposed on $\mathbf{h}_j^{(l-1)}$.

Given $\mathcal{F}$, an important question is how to construct $\mathcal{N}(i)$, which correlates each node feature with the others. For video processing, both spatial and temporal relations should be taken into consideration, regardless of their distance. A direct implementation is to construct a fully-connected graph by computing non-local interactions between every two nodes across space and time, i.e., $\mathcal{N}(i) = \mathcal{V}$. However, a fully-connected graph often suffers from prohibitively expensive computation and redundant information, which makes the graph network difficult to optimize, especially with limited training data. To address this challenge, we design a dynamic sampling scheme to sample a small subset of the most relevant feature nodes as $\mathcal{N}(i)$ across space-time, through motion-aware dynamic sampling and cross-scale node sampling. In this way, the node relations in $A$ are non-zero only at the positions of the selected relevant nodes, imposing a sparsity prior to efficiently gather non-local dependencies without computation overhead or information redundancy.
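As a toy illustration of the weighted-summation update of Eq. (4) with a sparse, sampled neighbour set (rather than a fully-connected graph), the following sketch uses arbitrary node counts and random weights; in the actual model the neighbour indices and weights come from the dynamic sampling and aggregation described next.

```python
import torch
import torch.nn as nn

N, K, C = 6, 3, 4                                     # nodes, sampled neighbours per node, channels (toy sizes)
h = torch.randn(N, C)                                 # node features h_j at the previous iteration

# Each node keeps only K dynamically selected neighbours instead of all N nodes,
# so the relation matrix is sparse: O(N*K) interactions rather than O(N^2).
neighbour_idx = torch.randint(0, N, (N, K))           # indices of the sampled neighbours of each node
A = torch.softmax(torch.randn(N, K), dim=1)           # connection weights A_ij (random here, learned in practice)

f = nn.Linear(C, C)                                   # transmit operation f(.)
h_new = (A.unsqueeze(-1) * f(h)[neighbour_idx]).sum(dim=1)   # Eq. (4): h_i = sum_j A_ij f(h_j)
print(h_new.shape)                                    # torch.Size([6, 4])
```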

3.2.2 Motion-aware Dynamic Sampling

In this part, we introduce a dynamic node sampling scheme to adaptively select neighbours of each node under the guidance of motion information. The dynamic sampling scheme is composed of three steps: 1) initial sampling, uniformly in space and continuously in time, 2) frame-by-frame motion extraction, and 3) motion-aware dynamic walks.

Initial sampling: Along the spatial dimension, we uniformly sample $K$ neighbouring nodes across space for each graph node, which is a commonly used strategy for graph node sampling [62, 23] based on Monte-Carlo estimation. Along the temporal dimension, there are two choices for sampling. Discrete sampling constructs graph snapshots taken at intervals in time, while continuous sampling is more general and may capture more motions, at the cost of more computation. For video SCI reconstruction, we employ the continuous sampling scheme, considering the importance of video continuity and the affordable extra computation for video SCI. Thus, the initial neighbour set of each node contains $K \times B$ elements. The initial sampling scheme can be seen as a position-fixed, locally-connected mechanism, which neglects the original feature distribution and may miss important context.

Motion extraction: In order to explicitly learn the dynamic motions across video frames, we use optical flow to represent the frame-by-frame motions, helping each node find its most related neighbouring nodes. Considering efficiency and accuracy, a lightweight flow network [19] is employed as our optical flow extractor. Taking the $t$-th frame $\hat{\mathbf{X}}_t$ and the $(t{+}1)$-th frame $\hat{\mathbf{X}}_{t+1}$ of the coarsely reconstructed video as inputs, we calculate the motion by:

$\mathbf{m}_t = \mathcal{P}\big(\hat{\mathbf{X}}_t, \hat{\mathbf{X}}_{t+1}\big),$   (5)

where $\mathbf{m}_t$ denotes the optical flow between adjacent frames output by the pre-trained extractor $\mathcal{P}$ [19], each of which consists of a vertical and a horizontal component. Furthermore, we define the motion of the last frame in a video as $\mathbf{m}_B = \mathcal{P}\big(\hat{\mathbf{X}}_B, \hat{\mathbf{X}}_{B-1}\big)$, to provide the motion information in the reverse direction. In the following, we introduce how the motion information is utilized to guide dynamic sampling, which learns the dynamic walks around the initially sampled nodes for each graph node.
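A sketch of the frame-by-frame motion extraction of Eq. (5). Here `flow_net` is a placeholder for a pre-trained two-frame optical flow extractor such as the lightweight network of [19]; it is stubbed out so the example runs, and the last frame is handled in the reverse direction as described above.

```python
import torch

def flow_net(frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
    """Placeholder for a pre-trained two-frame optical flow extractor returning a (2, H, W) flow field."""
    return torch.zeros(2, *frame_a.shape)   # stand-in; a real flow model would predict the motion here

def extract_motion(video: torch.Tensor) -> torch.Tensor:
    """video: (B, H, W) coarse frames -> (B, 2, H, W) frame-by-frame motions m_t (Eq. (5))."""
    flows = [flow_net(video[t], video[t + 1]) for t in range(video.shape[0] - 1)]
    flows.append(flow_net(video[-1], video[-2]))       # last frame: flow computed in the reverse direction
    return torch.stack(flows, dim=0)

motion = extract_motion(torch.rand(8, 64, 64))
print(motion.shape)   # torch.Size([8, 2, 64, 64])
```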

Motion-Aware Dynamic Walks: To take all the node features and motions into account when dynamically sampling neighbours for each node, we absorb ideas from deformable CNNs [9] and propose a motion-aware dynamic walk upon the initially sampled nodes. As illustrated in Fig. 3, the black arrows refer to the dynamic walks around the initially sampled nodes, which are predicted in a data-driven fashion, aware of the overall feature maps and motion distributions. Under the continuous sampling scheme across time, the dynamic walks are performed independently along the spatial dimension in each frame, since the positions of related nodes change over time in a video. In other words, the dynamic walks are different and adaptive in different frames.

Considering that dynamic walks are conducted across the 2D spatial domain in each frame, let $\Delta\mathbf{d}_{j_t}$ denote the predicted walk of the $j$-th initially sampled neighbouring node in the $t$-th video frame. The node walks are predicted by applying a convolutional layer over the input feature maps and motion representations, i.e.,

$\Delta\mathbf{D}_t = \mathrm{Conv}\big([\mathcal{F}_t;\, \mathbf{m}_t]\big),$   (6)

where $\Delta\mathbf{D}_t$ collects the horizontal and vertical walks (offsets) of the neighbouring nodes of all the pixels in the $t$-th frame, and $[\cdot\,;\cdot]$ refers to the concatenation operation along channels. Hereafter, we omit the superscript of iterations for brevity. Intuitively, it is much easier for the neural network to predict the positions of related nodes if the explicit motion around the current node is known. Thus we utilize the optical flow of adjacent frames to provide the motion information of both the former and latter frames as guidance to better learn the dynamic walks. Finally, the predicted dynamic walk is applied to the initially sampled position, i.e., $(\mathbf{p}_{j_t} + \Delta\mathbf{d}_{j_t})$, and the features are calculated as:

$\mathbf{h}_{j_t} = \mathcal{B}\big(\mathcal{F}_t,\, \mathbf{p}_{j_t} + \Delta\mathbf{d}_{j_t}\big),$   (7)

where $\mathcal{B}(\cdot)$ denotes the bi-linear sampler [9], which is imposed on $\mathcal{F}_t$, the original feature map of the $t$-th frame, to obtain bi-linearly interpolated features at the dynamically sampled positions. This is necessary because the dynamic walks are typically fractional, resulting in irregular sampling positions; please refer to [9] for more details about the bi-linear sampler. In short, by performing motion-aware dynamic walks upon the initially sampled neighbouring nodes, we believe the model can more easily find the most related neighbours and learn better feature distributions for the graph nodes.
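A sketch of Eqs. (6)–(7) for a single frame: a convolution over the concatenated feature map and motion field predicts fractional walks for $K$ neighbours per pixel, and the walked positions are read out with a bilinear sampler (`torch.nn.functional.grid_sample`). Channel sizes, the 3×3 prediction kernel, and the omission of the initial sampling offsets are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, K, H, W = 32, 9, 64, 64
feat = torch.randn(1, C, H, W)          # feature map of one frame, F_t
motion = torch.randn(1, 2, H, W)        # optical flow m_t of the same frame

# Eq. (6): predict 2K offsets (dynamic walks) per pixel from features + motion.
walk_pred = nn.Conv2d(C + 2, 2 * K, kernel_size=3, padding=1)
walks = walk_pred(torch.cat([feat, motion], dim=1))           # (1, 2K, H, W)
walks = walks.view(1, K, 2, H, W)

# Base grid of pixel coordinates; the fixed initial offsets would be added here as well,
# but for brevity only the predicted walks are applied to the identity grid.
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
base = torch.stack([xs, ys], dim=-1).float()                  # (H, W, 2) in (x, y) order

# Eq. (7): bilinear sampling at the (fractional) walked positions for each of the K neighbours.
sampled = []
for k in range(K):
    pos = base + walks[0, k].permute(1, 2, 0)                 # walked positions (H, W, 2)
    grid = torch.stack([2 * pos[..., 0] / (W - 1) - 1,        # normalize to [-1, 1] for grid_sample
                        2 * pos[..., 1] / (H - 1) - 1], dim=-1)
    sampled.append(F.grid_sample(feat, grid.unsqueeze(0), align_corners=True))
neighbour_feats = torch.stack(sampled, dim=1)                 # (1, K, C, H, W)
print(neighbour_feats.shape)
```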

3.2.3 Cross-Scale Node Sampling

In order to capture longer dependencies for each node, we utilize a cross-scale node sampling mechanism to increase the receptive field size, while keeping the computation efficient via sparse sampling. Different “dilation rates” (denoted as $r$) are used to sample neighbouring nodes at various distances whilst maintaining a small number of connected nodes. As shown in Fig. 3, different settings of the “dilation rate” find related nodes at different distances across various scales, and this is used in conjunction with the dynamic walks. A limitation of this parallel architecture is that the computation memory grows in proportion to the number of scales, which is however acceptable given the performance gains.
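The cross-scale sampling can be realized by replicating the same uniform offset pattern at several dilation rates, as in the short sketch below; the dilation rates (1, 7, 13) follow the implementation details in Section 4.1, while the 3×3 base pattern is an illustrative assumption.

```python
import torch

def dilated_offsets(kernel: int = 3, rates=(1, 7, 13)) -> torch.Tensor:
    """Uniform spatial offset pattern replicated at several dilation rates (cross-scale sampling)."""
    r = (kernel - 1) // 2
    ys, xs = torch.meshgrid(torch.arange(-r, r + 1), torch.arange(-r, r + 1), indexing="ij")
    base = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()     # (K, 2) offsets (dy, dx)
    return torch.cat([rate * base for rate in rates], dim=0)             # (K * len(rates), 2)

print(dilated_offsets().shape)   # torch.Size([27, 2]) -> 9 neighbours at each of 3 scales
```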

3.3 Graph Aggregation and Video Reconstruction

So far, the connections between the cross-scale, motion-aware, dynamically sampled nodes have been constructed as a graph. Next, an aggregation module is developed to aggregate the related features along both the spatial and temporal domains. This operation updates the feature maps, keeping the same dimension as the input, by utilizing the learned dynamic graph-structured features to capture non-local dependencies for better video reconstruction. The aggregation can be recognized as information interaction in each iteration, while the graph usually takes several iterations for feature updating. Towards this end, we define a generic aggregation operation for updating a node at the $l$-th iteration step as:

$\mathbf{h}_i^{(l)} = \frac{1}{Z_i} \sum_{t=1}^{B} \sum_{j \in \mathcal{N}_t(i)} w_{ij}\, f\big(\mathbf{h}_j^{(l-1)}\big),$   (8)

where $\mathbf{h}_i^{(l)}$ is the corresponding updated node feature with the same size as $\mathbf{h}_i^{(l-1)}$; $\mathcal{N}_t(i)$ represents the dynamically selected neighbouring node set of node $i$, containing the dynamically sampled neighbours in the $t$-th frame. The correlation between two nodes is $w_{ij}$, normalized by $Z_i = \sum_{t}\sum_{j \in \mathcal{N}_t(i)} w_{ij}$. Specifically, there could be different choices for the pairwise function used to calculate the relationship between two nodes. Here, the embedded Gaussian version is utilized to compute node relationships as:

$w_{ij} = \lambda_t \exp\big(\theta(\mathbf{h}_i)^{\top} \phi(\mathbf{h}_j)\big), \quad j \in \mathcal{N}_t(i),$   (9)

where the embedding functions $\theta(\cdot)$ and $\phi(\cdot)$ are composed of CNNs, transforming the input node features to another representation space in which node similarities are calculated. Moreover, the weight $\lambda_t$ is a learnable parameter that ‘re-weights’ the node relationships of the $t$-th frame, as we assume the node features of some frames (e.g., the neighbouring frames) are more important for updating node $i$ than those of other frames. The node-conditioned relation calculation is similar to the self-attention mechanism in [6, 46], but is performed only on the dynamically selected relevant nodes. From this perspective, our aggregation can be recognized as an efficient variant of the self-attention mechanism adapted to video processing.
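A sketch of the aggregation in Eqs. (8)–(9) for a single node: embedded-Gaussian similarities between the centre node and its dynamically sampled neighbours are re-weighted by a learnable per-frame scalar, normalized, and used to sum the transformed neighbour features. The sizes, the single-node setting, and the added similarity scaling are illustrative assumptions.

```python
import torch
import torch.nn as nn

C, B, K = 32, 8, 27                       # channels, frames, sampled neighbours per frame (toy sizes)
h_i = torch.randn(C)                      # centre node feature
neigh = torch.randn(B, K, C)              # features of the dynamically sampled neighbours, per frame

theta, phi, f = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)   # embedding / transmit transforms
lam = nn.Parameter(torch.ones(B))         # learnable per-frame re-weighting lambda_t

# Eq. (9): embedded-Gaussian relations, re-weighted frame by frame by lambda_t.
logits = (phi(neigh) @ theta(h_i)) / C ** 0.5          # (B, K); scaling added here for numerical stability
w = lam.view(B, 1) * torch.exp(logits)                 # w_ij = lambda_t * exp(theta(h_i)^T phi(h_j))

# Eq. (8): weighted sum of the transformed neighbour features, normalized by Z_i = sum of all w_ij.
h_i_new = (w.unsqueeze(-1) * f(neigh)).sum(dim=(0, 1)) / w.sum()
print(h_i_new.shape)                      # torch.Size([32])
```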

Up to now, the dynamic graph node features can be adaptively updated by aggregating the related information both spatially and temporally, regardless of the distance, for each node. The non-locally constructed graph facilitates extracting efficient and meaningful features $\mathcal{H}$, which are then fed into a 4-layer 3D CNN to reduce the channels and decode the final fine-grained reconstructed video as:

$\hat{\mathcal{X}}^{\mathrm{fine}} = \mathrm{CNN}_{3D}(\mathcal{H}).$   (10)

Optimization: At the training stage, we optimize our proposed model with the mean squared error (MSE) loss, i.e.,

$\mathcal{L}_{\mathrm{MadyGraph}} = \frac{1}{B} \sum_{k=1}^{B} \big\| \hat{\mathbf{X}}^{\mathrm{fine}}_k - \mathbf{X}_k \big\|_2^2,$   (11)

where $\hat{\mathbf{X}}^{\mathrm{fine}}_k$ is the $k$-th frame of our final reconstructed video, and $\mathbf{X}_k$ is the corresponding ground-truth frame. In order to efficiently obtain a well-trained MadyGraph, we jointly train MadyGraph with the provided backbone model (a simple 3D CNN), using the same MSE loss $\mathcal{L}_{\mathrm{BaseNet}}$ on the backbone's coarse output, with the total loss:

$\mathcal{L} = \mathcal{L}_{\mathrm{MadyGraph}} + \mathcal{L}_{\mathrm{BaseNet}}.$   (12)

After sufficient training, MadyGraph is ready to be “plugged” into any other backbone models to enhance their results.
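A sketch of the training objective of Eqs. (11)–(12): an MSE loss on MadyGraph's fine-grained output, plus the same loss on the backbone's coarse output when training jointly with the simple 3D CNN BaseNet. The equal weighting of the two terms is an assumption; the exact balance is not stated here.

```python
import torch
import torch.nn.functional as F

def mse_loss(recon: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Eq. (11): mean squared error between reconstructed and ground-truth frames."""
    return F.mse_loss(recon, gt)

def joint_loss(fine: torch.Tensor, coarse: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Eq. (12): jointly supervise MadyGraph's fine output and BaseNet's coarse output
    (equal weighting assumed for illustration)."""
    return mse_loss(fine, gt) + mse_loss(coarse, gt)

gt = torch.rand(8, 256, 256)
fine, coarse = torch.rand_like(gt), torch.rand_like(gt)
print(joint_loss(fine, coarse, gt).item())
```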

4 Experiments

4.1 Datasets and Implementation Details

Training and Test Datasets: Following [6], we choose DAVIS2017 [35] as the training set for all experiments, which contains 90 different scenes with a total of 6208 frames at two resolutions: 480×894 and 1080×1920. As in [6], 26000 patch cubes (256×256×8) randomly cropped from the original scenes in DAVIS2017 are synthesized as training data. The widely used benchmark simulation test data, including Kobe, Runner, Drop, Traffic [27], Aerial and Vehicle [58], each of size 256×256×8, are used for evaluation. For real data, we choose three scenes from two real SCI systems for testing: Wheel [28] of size 256×256×14, and Domino and Water Balloon [38] of size 512×512×10.

Algorithm Kobe Traffic Runner Drop Aerial Vehicle Average Time
PnP-FFDNet [58] 30.50, 0.926 24.18, 0.828 32.15, 0.933 40.70, 0.989 25.27, 0.829 25.42, 0.849 29.70, 0.892 3.0
E2E-CNN [38] 29.02, 0.861 23.45, 0.838 34.43, 0.958 36.77, 0.974 27.52, 0.882 26.40, 0.886 29.26, 0.900 0.023
BaseNet + MadyGraph 32.98, 0.953 29.04, 0.940 38.87, 0.978 42.60, 0.996 28.98, 0.911 28.02, 0.932 33.42, 0.952 0.31
GAP-TV [55] 26.45, 0.845 20.89, 0.715 28.81, 0.909 34.74, 0.970 25.05, 0.828 24.82, 0.838 26.79, 0.858 4.20
GAP-TV + MadyGraph 28.97, 0.911 23.12, 0.823 32.86, 0.955 38.96, 0.987 26.98, 0.882 26.13, 0.887 29.50, 0.908 4.29
DeSCI [27] 33.25, 0.952 28.72, 0.925 38.76, 0.969 43.22, 0.993 25.33, 0.860 27.04, 0.909 32.72, 0.935 6180
DeSCI + MadyGraph 35.42, 0.967 30.41, 0.947 40.71, 0.977 45.06, 0.994 26.76, 0.899 28.00, 0.929 34.39, 0.952 6180
BIRNAT [6] 32.71, 0.950 29.33, 0.942 38.70, 0.976 42.28, 0.992 28.99, 0.927 27.84, 0.927 33.31, 0.951 0.16
BIRNAT + MadyGraph 33.48, 0.944 29.95, 0.946 39.48, 0.978 42.88, 0.992 29.20, 0.921 27.99, 0.931 33.83, 0.952 0.26
RevSCI [5] 33.72, 0.957 30.02, 0.949 39.40, 0.977 42.93, 0.992 29.35, 0.924 28.12, 0.937 33.92, 0.956 0.19
RevSCI + MadyGraph 34.34, 0.962 30.59, 0.956 40.34, 0.981 43.56, 0.993 29.56, 0.928 28.20, 0.940 34.43, 0.960 0.30
DDUN [53] 35.02, 0.968 31.78, 0.964 40.91, 0.982 44.49, 0.994 30.58, 0.940 29.36, 0.955 35.36, 0.967 1.35
DDUN + MadyGraph 36.63, 0.975 32.60, 0.969 42.17, 0.985 45.59, 0.995 30.91, 0.946 29.76, 0.957 36.28, 0.971 1.45
Table 1: The average results of PSNR in dB (left entry in each cell), SSIM (right entry in each cell) and running time per measurement/shot in seconds by different algorithms on the six grayscale benchmark data. Best results are in bold.
Figure 4: The reconstructed frames of six benchmark simulation data. For simplicity, the ‘+’ on the right side of a method name indicates that it is enhanced by MadyGraph.
Algorithm Kobe Traffic Runner Drop Aerial Vehicle Average Time
RevSCI [5] 33.72, 0.957 30.02, 0.949 39.40, 0.977 42.93, 0.992 29.35, 0.924 28.12, 0.937 33.92, 0.956 0.19
RevSCI + EDVR [49] 31.71, 0.936 29.08, 0.944 36.64, 0.971 38.5, 0.986 29.06, 0.921 27.93, 0.934 32.15, 0.949 0.55
RevSCI + MIMO-Unet [7] 32.83, 0.951 29.82, 0.947 38.24, 0.974 37.15, 0.986 29.23, 0.922 28.06, 0.936 32.55, 0.953 0.35
RevSCI + finetune MIMO-Unet [7] 33.75, 0.958 30.03, 0.949 39.40, 0.977 42.93, 0.992 29.41, 0.924 28.18, 0.937 33.95, 0.956 0.35
RevSCI + MadyGraph 34.34, 0.962 30.59, 0.956 40.34, 0.981 43.56, 0.993 29.56, 0.928 28.20, 0.940 34.43, 0.960 0.30
DDUN [53] 35.02, 0.968 31.78, 0.964 40.91, 0.982 44.49, 0.994 30.58, 0.94 29.36, 0.955 35.36, 0.967 1.35
DDUN + EDVR [49] 32.52, 0.955 29.98, 0.956 36.5, 0.973 38.32, 0.987 30.11, 0.936 29.07, 0.952 32.75, 0.96 1.71
DDUN + MIMO-Unet [7] 34.02, 0.964 31.34, 0.961 39.26, 0.979 36.43, 0.986 30.39, 0.938 29.28, 0.953 33.45, 0.964 1.51
DDUN + finetune MIMO-Unet [7] 36.41, 0.975 32.41, 0.969 41.78, 0.984 43.95, 0.994 30.79, 0.945 29.73, 0.957 35.84, 0.971 1.51
DDUN + MadyGraph 36.63, 0.975 32.60, 0.969 42.17, 0.985 45.59, 0.995 30.91, 0.946 29.76, 0.957 36.28, 0.971 1.45
Table 2: The average results of PSNR in dB (left entry in each cell), SSIM (right entry in each cell) and running time per measurement/shot in seconds compared to different video enhancement algorithms on the six grayscale benchmark data.

Counterparts and Evaluation Metrics:

The proposed MadyGraph is compared with several SOTA methods, including the model based methods GAP-TV [55] and DeSCI [27], the plug-and-play method PnP-FFDNet [58], and the deep learning based methods E2E-CNN [38], BIRNAT [6], RevSCI [5], and the SOTA method DDUN [53]. For the simulation data, both peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [51] are used as metrics to quantitatively evaluate the reconstruction quality.
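For reference, PSNR on frames normalized to [0, 1] can be computed as below; SSIM requires a windowed implementation and is omitted. This is the generic definition, not the authors' evaluation script.

```python
import torch

def psnr(recon: torch.Tensor, gt: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB for frames scaled to [0, max_val]."""
    mse = torch.mean((recon - gt) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

print(psnr(torch.rand(8, 256, 256), torch.rand(8, 256, 256)))
```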

Implementation Details: We choose 3 dilation rates (1, 7 and 13) for the cross-scale node sampling in the dynamic graph. For each latent node, we sample a fixed number of neighbouring nodes at each scale. The detailed architectures of MadyGraph and the baseline model are given in the supplementary materials (SM). The Adam optimizer [21] is employed for optimization. All experiments are implemented in PyTorch, running on an NVIDIA RTX 8000 GPU.

4.2 Results on Simulated Data

The results on the six widely used benchmark simulation datasets are given in Table 1 and Fig. 4. We impose the proposed MadyGraph on six backbones: a simple baseline network (called BaseNet, composed of 3D CNNs) to verify the efficiency of the proposed model, two iterative optimization methods GAP-TV [55] and DeSCI [27], and three deep networks BIRNAT [6], RevSCI [5] and DDUN [53]. Note that we only jointly train MadyGraph with the simple baseline model, and then directly append the well-trained MadyGraph to the other backbones. It can be observed that the proposed MadyGraph improves the SOTA methods, i.e., providing {2.71dB, 1.67dB, 0.52dB, 0.51dB, 0.92dB} higher PSNR than GAP-TV, DeSCI, BIRNAT, RevSCI, and DDUN, respectively, with quite short extra running time (about 0.1s). This demonstrates both the effectiveness and efficiency of the proposed method for improving SCI reconstruction. Fig. 4 plots selected reconstructed frames of different algorithms. Compared with the counterparts, our proposed MadyGraph enhancement provides cleaner and sharper edges and finer details with less noise.

Furthermore, we compare our add-on MadyGraph with image and video enhancement methods to show the superiority of our model for video SCI. Recalling that the purpose of MadyGraph is to enhance the previous reconstruction results of different SCI methods, it is to some extent similar to traditional enhancement tasks such as deblurring and denoising, i.e., given a degraded result, restore a more satisfactory one. In particular, we conduct the comparisons in two steps: 1) directly plug the enhancement methods onto the coarse reconstructions of different existing SCI methods; 2) fine-tune the enhancement networks with the coarse results as inputs and the ground-truth videos as outputs. The detailed results are given in Table 2, where we first utilize two SOTA deblurring models, EDVR [49] (for video deblurring) and MIMO-Unet [7] (for single image deblurring), to enhance the results from RevSCI [5] and DDUN [53]. Surprisingly but reasonably, a certain degree of degradation appears after deblurring the coarse results, mainly because deblurring networks heavily overfit the degradation model they were trained on [42, 22]. Once the degradation pattern changes, the reconstruction quality decreases. Furthermore, we fine-tune the pre-trained MIMO-Unet on the outputs of the two SCI reconstruction methods, which yields better performance than directly using the pre-trained model, but still lower quality and more inference time than our proposed MadyGraph.

Figure 5: Example of the dynamic nodes learned by MadyGraph on Traffic. The yellow point in the middle frame (#4) is the central point, and the red points denote the dynamically selected neighbours with aggregation weights larger than 0.2.
Model PSNR SSIM Parameters MACs Time
BaseNet 32.20 0.932 2.53 6.56 0.21
+DW 33.12 0.945 3.21 7.17 0.25
+DW+CS 33.28 0.949 3.38 7.29 0.26
+DW+CS+MA 33.42 0.952 8.75 - 0.31
Table 3: Computational complexity and average reconstruction quality on six benchmark simulation data for ablation study of MadyGraph. MAC means Multiply Accumulate.

To showcase how the proposed model finds related clues to update the representation for reconstructing videos, we visualize the intrinsic dynamic relationships in Fig. 5. Intuitively, the proposed model computes the response at the central position (e.g., the yellow point) according to the dynamically selected nodes (e.g., the red points). As we can see, the central point is on the lamp of a moving car, which becomes more visible in the later frames. We observe that: 1) most of the dynamically selected related nodes are located on the car lamp or its surrounding positions; 2) the selected neighbours tend to diffuse around the lamp as time goes by, while concentrating toward the right of the picture, in accordance with the moving object; and 3) the aggregation weights of the neighbouring nodes in frame #1 are all lower than 0.2, since the lamp has not yet appeared in that frame. This demonstrates that the proposed model is able to dynamically capture meaningful and reasonable correlations, regardless of the distance across space and time in the video.

4.3 Ablation Study

To quantitatively verify the contribution of each module in MadyGraph, we add the components step by step, with results on the six benchmark simulation datasets shown in Table 3. Compared with the baseline, the dynamic walk (DW) upon initial sampling significantly increases the reconstruction quality (0.92dB) with minor parameter overhead by dynamically modeling non-local dependencies. Moreover, the cross-scale (CS) node sampling scheme also contributes to performance improvement (0.16dB) by enlarging the receptive field. Furthermore, the motion-aware (MA) mechanism gives an additional improvement (0.14dB) by leveraging motion knowledge for better video representation. The final version achieves the best result, which indicates that MadyGraph is an effective combination of all components.

4.4 Results on Real Data

In order to verify the robustness of our proposed algorithm, we conduct experiments on real data captured by SCI cameras [28, 59], which is more challenging due to real measurement noise. From the Wheel snapshot measurement of size 256×256, we recover 14 high-speed frames. As shown in Fig. 6, the generated letter ‘D’ of our proposed model shows clearer and smoother details with fewer artifacts. For testing at a larger scale, the snapshot measurements Domino and Water Balloon are recovered as two videos of size 512×512×10. It can be observed that the proposed model generates sharper edges with less noise and provides more accurate contours, owing to the efficient modeling of non-local spatial-temporal dependencies. Meanwhile, the proposed method significantly saves testing time compared with DeSCI (several hours). In general, the experiments demonstrate both the applicability and efficiency of our algorithm on real data.

Figure 6: The reconstructed frames of real data.

5 Conclusions

We present a motion-aware dynamic graph to capture meaningful non-local dependencies, regardless of the distance in space and time, under the guidance of the motion information across video frames. The proposed MadyGraph is an efficient, generic, and “plug-and-play” module, which can be generalized to any backbone network to boost the reconstruction performance of video compressive sensing. The experimental results on both simulation and real SCI data demonstrate the effectiveness and efficiency of the proposed MadyGraph. Beyond the video SCI system studied in this paper, MadyGraph can potentially be extended to a wider range of application scenarios in the future, wherever non-local modeling is critical.

References

  • [1] Yoann Altmann, Stephen McLaughlin, Miles J Padgett, Vivek K Goyal, Alfred O Hero, and Daniele Faccio. Quantum-inspired computational imaging. Science, 361(6403), 2018.
  • [2] J. M. Bioucas-Dias and M. A. T. Figueiredo. A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 16(12):2992–3004, 2008.
  • [3] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition (CVPR), 2005.
  • [4] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Transactions on Information Theory, 2006.
  • [5] Z. Cheng, B. Chen, G. Liu, H. Zhang, R. Lu, Z. Wang, and X. Yuan. Memory-efficient network for large-scale video compressive sensing. In Computer Vision and Pattern Recognition (CVPR), 2021.
  • [6] Ziheng Cheng, Ruiying Lu, Zhengjue Wang, Hao Zhang, Bo Chen, Ziyi Meng, and Xin Yuan. BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging. In European Conference on Computer Vision (ECCV), 2020.
  • [7] Sung-Jin Cho, Seo-Won Ji, Jun-Pyo Hong, Seung-Won Jung, and Sung-Jea Ko. Rethinking coarse-to-fine approach in single image deblurring. In International Conference on Computer Vision (ICCV), pages 4641–4650, 2021.
  • [8] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, August 2007.
  • [9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
  • [10] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, pages 576–584, 2015.
  • [11] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015.
  • [12] David L Donoho et al. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
  • [13] Emmanuel Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 2006.
  • [14] Xueyang Fu, Xi Wang, Aiping Liu, Junwei Han, and Zheng-Jun Zha. Learning dual priors for jpeg compression artifacts removal. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4086–4095, 2021.
  • [15] Jun Guo and Hongyang Chao. One-to-many network for visually pleasing compression artifacts reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3038–3047, 2017.
  • [16] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 1024–1034, 2017.
  • [17] Xin He, Qiong Liu, and You Yang. Mv-gnn: Multi-view graph neural network for compression artifacts reduction. IEEE Transactions on Image Processing, 29:6829–6840, 2020.
  • [18] Yasunobu Hitomi, Jinwei Gu, Mohit Gupta, Tomoo Mitsunaga, and Shree K Nayar. Video from a single coded exposure photograph using a learned over-complete dictionary. In 2011 International Conference on Computer Vision, pages 287–294. IEEE, 2011.
  • [19] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8981–8989, 2018.
  • [20] S. Jalali and X. Yuan. Snapshot compressed sensing: Performance bounds and algorithms. IEEE Transactions on Information Theory, 65(12):8005–8024, Dec 2019.
  • [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [22] Orest Kupyn, Tetiana Martyniuk, Junru Wu, and Zhangyang Wang. Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [23] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 631–636, 2006.
  • [24] Yuqi Li, Miao Qi, Rahul Gulve, Mian Wei, Roman Genov, Kiriakos N Kutulakos, and Wolfgang Heidrich. End-to-end video compressive sensing using anderson-accelerated unrolled networks. In 2020 IEEE International Conference on Computational Photography (ICCP), pages 1–12. IEEE, 2020.
  • [25] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.
  • [26] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [27] Yang Liu, Xin Yuan, Jinli Suo, David Brady, and Qionghai Dai. Rank minimization for snapshot compressive imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):2990–3006, Dec 2019.
  • [28] Patrick Llull, Xuejun Liao, Xin Yuan, Jianbo Yang, David Kittle, Lawrence Carin, Guillermo Sapiro, and David J Brady. Coded aperture compressive temporal imaging. Optics Express, 21(9):10526–10545, 2013.
  • [29] Jiawei Ma, Xiaoyang Liu, Zheng Shou, and Xin Yuan. Deep tensor ADMM-Net for snapshot compressive imaging. In IEEE/CVF Conference on Computer Vision (ICCV), 2019.
  • [30] Xiao Ma, Xin Yuan, Chen Fu, and Gonzalo R. Arce. Led-based compressive spectral temporal imaging system. Optics Express, 2021.
  • [31] Ziyi Meng, Shirin Jalali, and Xin Yuan. Gap-net for snapshot compressive imaging. arXiv: 2012.08364, December 2020.
  • [32] Ziyi Meng, Jiawei Ma, and Xin Yuan. End-to-end low cost compressive spectral imaging with spatial-spectral self-attention. In European Conference on Computer Vision (ECCV), August 2020.
  • [33] Xin Miao, Xin Yuan, Yunchen Pu, and Vassilis Athitsos. λ-net: Reconstruct hyperspectral images from a snapshot measurement. In IEEE/CVF Conference on Computer Vision (ICCV), 2019.
  • [34] Seungjun Nah, Sanghyun Son, and Kyoung Mu Lee. Recurrent neural networks with intra-frame iterations for video deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8102–8111, 2019.
  • [35] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbelaez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. CoRR, abs/1704.00675, 2017.
  • [36] Mu Qiao, Xuan Liu, and Xin Yuan. Snapshot spatial–temporal compressive imaging. Opt. Lett., 45(7):1659–1662, Apr 2020.
  • [37] Mu Qiao, Xuan Liu, and Xin Yuan. Snapshot temporal compressive microscopy using an iterative algorithm with untrained neural networks. Opt. Lett., 2021.
  • [38] Mu Qiao, Ziyi Meng, Jiawei Ma, and Xin Yuan. Deep learning for video compressive sensing. APL Photonics, 5(3):030801, 2020.
  • [39] Dikpal Reddy, Ashok Veeraraghavan, and Rama Chellappa. P2c2: Programmable pixel compressive camera for high speed imaging. In CVPR 2011, pages 329–336. IEEE, 2011.
  • [40] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Neural Information Processing Systems (NIPS), 2014.
  • [41] Yangyang Sun, Xin Yuan, and Shuo Pang. Compressive high-speed stereo imaging. Opt Express, 25(15):18182–18190, 2017.
  • [42] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [43] Matias Tassano, Julie Delon, and Thomas Veit. Dvdnet: A fast network for deep video denoising. In 2019 IEEE International Conference on Image Processing (ICIP), pages 1805–1809, 2019.
  • [44] Matias Tassano, Julie Delon, and Thomas Veit. Fastdvdnet: Towards real-time deep video denoising without flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1354–1363, 2020.
  • [45] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally-deformable alignment network for video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3360–3369, 2020.
  • [46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017.
  • [47] Ashwin Wagadarikar, Renu John, Rebecca Willett, and David Brady. Single disperser design for coded aperture snapshot spectral imaging. Applied Optics, 47(10):B44–B51, 2008.
  • [48] Ashwin A Wagadarikar, Nikos P Pitsianis, Xiaobai Sun, and David J Brady. Video rate spectral imaging using a coded aperture snapshot spectral imager. Optics Express, 17(8):6368–6388, 2009.
  • [49] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [50] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [51] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • [52] Zhengjue Wang, Hao Zhang, Ziheng Cheng, Bo Chen, and Xin Yuan. Metasci: Scalable and adaptive reconstruction for video compressive sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2083–2092, 2021.
  • [53] Zhuoyuan Wu, Jian Zhang, and Chong Mou. Dense deep unfolding network with 3d-cnn prior for snapshot compressive imaging. In IEEE International Conference on Computer Vision (ICCV), 2021.
  • [54] Jianbo Yang, Xin Yuan, Xuejun Liao, Patrick Llull, David J Brady, Guillermo Sapiro, and Lawrence Carin. Video compressive sensing using Gaussian mixture models. IEEE Transactions on Image Processing, 23(11):4863–4878, November 2014.
  • [55] Xin Yuan. Generalized alternating projection based total variation minimization for compressive sensing. In 2016 IEEE International Conference on Image Processing (ICIP), pages 2539–2543, Sept 2016.
  • [56] X. Yuan, D. J. Brady, and A. K. Katsaggelos. Snapshot compressive imaging: Theory, algorithms, and applications. IEEE Signal Processing Magazine, 38(2):65–88, 2021.
  • [57] X. Yuan, H. Jiang, G. Huang, and P. Wilford. Compressive sensing via low-rank Gaussian mixture models. arXiv:1508.06901, 2015.
  • [58] Xin Yuan, Yang Liu, Jinli Suo, and Qionghai Dai. Plug-and-play algorithms for large-scale snapshot compressive imaging. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [59] Xin Yuan, Patrick Llull, Xuejun Liao, Jianbo Yang, David J. Brady, Guillermo Sapiro, and Lawrence Carin. Low-cost compressive sensing for color video and depth. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3318–3325, 2014.
  • [60] Xin Yuan, Yang Liu, Jinli Suo, Frédo Durand, and Qionghai Dai. Plug-and-play algorithms for video snapshot compressive imaging. arXiv: 2101.04822, Jan 2021.
  • [61] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018.
  • [62] Li Zhang, Dan Xu, Anurag Arnab, and Philip H. S. Torr. Dynamic graph message passing networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, (CVPR), pages 3723–3732, 2020.
  • [63] Y Zhang, K Li, K Li, B Zhong, and Y Fu. Residual non-local attention networks for image restoration. In International Conference on Learning Representations, 2019.