STFormer
Video snapshot compressive imaging (SCI) captures multiple sequential video frames in a single measurement using the idea of computational imaging. The underlying principle is to modulate high-speed frames through different masks and sum these modulated frames into a single measurement captured by a low-speed 2D sensor (dubbed optical encoder); following this, algorithms are employed to reconstruct the desired high-speed frames (dubbed software decoder) if needed. In this paper, we consider the reconstruction algorithm in video SCI, i.e., recovering a series of video frames from a compressed measurement. Specifically, we propose a Spatial-Temporal transFormer (STFormer) to exploit the correlation in both spatial and temporal domains. The STFormer network is composed of a token generation block and a video reconstruction block, connected by a series of STFormer blocks. Each STFormer block consists of a spatial self-attention branch and a temporal self-attention branch, whose outputs are integrated by a fusion network. Extensive results on both simulated and real data demonstrate the state-of-the-art performance of STFormer. The code and models are publicly available at https://github.com/ucaswangls/STFormer.git
With recent advances in artificial intelligence, high-quality, high-dimensional data have become one of the bottlenecks for large-scale deep learning models. In other words, capturing more data with multiple dimensions will lead to a dramatic increase in storage and transmission costs. Unlike ordinary cameras which capture RGB images, computational imaging [1, 2] provides a new way to capture high-dimensional data in a memory-efficient manner. It is promising to support the explosion of artificial intelligence using data captured by computational imaging systems. In this paper, we focus on snapshot compressive imaging (SCI) [3], especially video SCI systems, which are capable of capturing high-speed videos using a low-speed camera [4], enjoying the benefits of low memory requirement, low bandwidth for transmission, low cost and low power.
Traditional high-speed cameras for capturing high-speed scenes often suffer from high hardware cost, high storage requirements and high transmission bandwidth. Facing these challenges, the video SCI system provides an elegant solution. As shown in Fig. 1, video SCI is an integrated hardware plus software system. For the hardware part (encoder), each frame of the original video is encoded by a different mask, then the series of encoded frames is integrated (summed) by the (grayscale or color) camera to generate a compressed measurement. In this manner, video SCI can achieve efficient compression during optical domain imaging and improve the efficiency of video storage and transmission. At present, a variety of video SCI systems [4, 6, 7, 8, 9] have been built. The mask is usually generated by a digital micromirror device (DMD) or a spatial light modulator (SLM), and the encoded measurement is usually captured by a charge-coupled device (CCD) or a complementary metal-oxide semiconductor (CMOS) camera. In addition, for color video SCI [5], there is usually a Bayer filter in front of the sensor array to capture different color components such as red, green or blue at different pixels, and ultimately the camera outputs the modulated Bayer measurement. For the software part (decoder), the measurement and masks are usually fed into the reconstruction algorithm to recover the desired high-speed video.
In the decoding stage, the video SCI reconstruction algorithm aims to solve an ill-posed inverse problem. Traditional model-based reconstruction algorithms combine iterative optimization, such as generalized alternating projection (GAP) [10] and the alternating direction method of multipliers (ADMM) [11], with a variety of priors, such as total variation (TV) [12] and non-local low rank [13, 14]. Although these methods do not require any data to train the model, they usually suffer from poor reconstruction quality and long reconstruction time. In recent years, due to the development of deep learning and its strong generalization ability, researchers have constructed many learning-based models. For example, BIRNAT [15] uses convolutional neural networks (CNNs) and bidirectional recurrent neural networks (RNNs) for frame-by-frame reconstruction, while U-net [8] uses a simple U-shaped structure for fast reconstruction. RevSCI [16] saves training memory by using a reversible mechanism [17]. Plug-and-play (PnP) methods, such as PnP-FFDNet [18] and PnP-FastDVDnet [19], make the model more flexible by integrating a deep denoising model into an iterative optimization algorithm. GAP-net [20], DUN-3DUnet [21] and Two-stage [22] further improve reconstruction quality and speed by using the ideas of deep unfolding [23]. Although previous methods have made great strides, video SCI reconstruction still faces several key challenges:
Since the video SCI sampling process compresses the entire temporal domain, it is important to establish long-term temporal correlation during reconstruction. However, existing networks [24] use convolution to explore temporal correlation, and its receptive field is too small for long-term temporal correlation extraction, resulting in poor reconstruction quality in complex high-speed scenes.
Most existing end-to-end deep learning methods cannot adapt to changes in masks and input sizes. More specifically, the model usually needs to be retrained or fine-tuned when the masks or input sizes change [25].
PnP methods usually suffer from excessive smoothing and loss of local details.
Bearing these concerns in mind, we leverage the powerful model capacity of Transformer [26] and its ability to explore long-term dependencies to build an efficient video SCI reconstruction network.
In this paper, we propose a simple yet effective network for video SCI reconstruction. Our main contributions are as follows:
We build an end-to-end deep video SCI reconstruction network based on spatial-temporal Transformer, which can be well applied to multiple video SCI reconstruction tasks.
Through space-time factorization and local self-attention mechanism, we propose an efficient and flexible Transformer module, dubbed STFormer, with linear complexity, which can explore spatial-temporal correlations efficiently.
We propose a Grouping Resnet Feed Forward module to further improve reconstruction quality by fusing multi-layer information and strengthening the correlation between different layers.
We use multiple 3D convolutions to construct a token generation block to prevent loss of local details.
Experimental results on a large number of simulated and real datasets demonstrate that our proposed algorithm achieves state-of-the-art (SOTA) results with better real-time performance compared to previous methods.
The rest of this paper is organized as follows: Sec. 2 introduces the mathematical model of grayscale and color video SCI and the related work of vision Transformers. In Sec. 3, we propose a reconstruction model STFormer for different video SCI reconstruction tasks, including grayscale video, color video and large-scale video. Sec. 4 shows experimental results on multiple simulated and real datasets. Sec. 5 concludes the entire paper.
Since video SCI is an integrated system of hardware and software, in the following, we first introduce the mathematical model of video SCI, and then briefly review the related work on vision Transformers, which form the main part of the reconstruction network.
Video SCI system encodes high-dimensional video data into a 2D measurement, and coded aperture compressive temporal imaging (CACTI) [4] is one of the earliest video SCI systems. As shown in Fig. 1, the three-dimensional video data is first modulated by multiple masks. Then, the encoded high-speed scene is captured by a two-dimensional camera sensor through integration across the time dimension.
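For a grayscale video SCI system, let $\mathbf{X} \in \mathbb{R}^{n_x \times n_y \times T}$ denote the $T$-frame (grayscale) video data to be captured and $\mathbf{M} \in \mathbb{R}^{n_x \times n_y \times T}$ denote the $T$ pre-defined masks. Each frame within the video $\mathbf{X}$ is modulated by a mask in the image plane, conducted by a 4-f system, and we can express this modulation as
$$\mathbf{X}'_m = \mathbf{X}_m \odot \mathbf{M}_m, \tag{1}$$
where $\mathbf{X}' \in \mathbb{R}^{n_x \times n_y \times T}$ denotes the modulated video data, $\mathbf{X}_m = \mathbf{X}(:,:,m)$ denotes the $m$-th frame of the 3D video data $\mathbf{X}$, $\mathbf{M}_m = \mathbf{M}(:,:,m)$ denotes the $m$-th mask of the 3D mask $\mathbf{M}$, and $\odot$ denotes element-wise multiplication.
After this, by compressing across the time domain, the camera sensor plane captures a 2D compressed measurement $\mathbf{Y} \in \mathbb{R}^{n_x \times n_y}$, which can be expressed as
$$\mathbf{Y} = \sum_{m=1}^{T} \mathbf{X}_m \odot \mathbf{M}_m + \mathbf{N}, \tag{2}$$
where $\mathbf{N} \in \mathbb{R}^{n_x \times n_y}$ denotes the measurement noise.
To keep the notation simple, we further give the vectorized formulation of Eq. (2). Firstly, we define $\mathrm{Vec}(\cdot)$ as the vectorization operation of the ensuing matrix. Then we vectorize
$$\mathbf{y} = \mathrm{Vec}(\mathbf{Y}) \in \mathbb{R}^{n_x n_y}, \tag{3}$$
$$\mathbf{n} = \mathrm{Vec}(\mathbf{N}) \in \mathbb{R}^{n_x n_y}, \tag{4}$$
$$\mathbf{x} = \left[\mathbf{x}_1^{\top}, \dots, \mathbf{x}_T^{\top}\right]^{\top} \in \mathbb{R}^{n_x n_y T}, \tag{5}$$
where $\mathbf{x}_m = \mathrm{Vec}(\mathbf{X}_m)$. In addition, we define the sensing matrix generated by the masks in the SCI system as
$$\mathbf{H} = \left[\mathbf{D}_1, \dots, \mathbf{D}_T\right] \in \mathbb{R}^{n_x n_y \times n_x n_y T}, \tag{6}$$
where $\mathbf{D}_m = \mathrm{Diag}(\mathrm{Vec}(\mathbf{M}_m))$ is a diagonal matrix whose diagonal elements are filled by $\mathrm{Vec}(\mathbf{M}_m)$. Finally, the vectorized expression of Eq. (2) is
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}. \tag{7}$$
After obtaining the measurement $\mathbf{y}$ (captured by the camera), the next task is to develop a decoding algorithm, i.e., given $\mathbf{y}$ and $\mathbf{H}$, solve for $\mathbf{x}$.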
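The forward model described above (mask modulation, summation over time, and the equivalent vectorized form with a sensing matrix built from diagonal blocks) can be checked numerically with a minimal NumPy sketch. The function names, the tiny array sizes, the `(T, n_x, n_y)` axis order and the random binary masks are illustrative assumptions, not the authors' code or a calibrated optical system.

```python
import numpy as np

def sci_measure(frames, masks, noise=None):
    """Simulate one snapshot: modulate each frame by its mask, then sum over time.

    frames, masks: arrays of shape (T, n_x, n_y) (assumed axis order).
    """
    measurement = (frames * masks).sum(axis=0)
    if noise is not None:
        measurement = measurement + noise
    return measurement

def sensing_matrix(masks):
    """Concatenate T diagonal matrices, one per vectorized mask, side by side."""
    return np.hstack([np.diag(mask.ravel()) for mask in masks])

rng = np.random.default_rng(0)
T, n_x, n_y = 8, 4, 4                       # toy sizes (assumption)
frames = rng.random((T, n_x, n_y))
masks = rng.integers(0, 2, size=(T, n_x, n_y)).astype(float)  # binary masks (assumption)

Y = sci_measure(frames, masks)              # 2D measurement, noise-free
H = sensing_matrix(masks)                   # shape (n_x*n_y, n_x*n_y*T)
x = frames.reshape(-1)                      # stacked vectorized frames
y_vec = H @ x                               # vectorized forward model
```

Here `Y.ravel()` and `y_vec` agree up to floating-point error, which is simply the statement that the element-wise and matrix forms describe the same measurement. The dense `H` is only practical at toy sizes; real reconstructions work with the masks directly.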
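For a color video SCI system, we use a Bayer pattern filter sensor, where each pixel captures only the red (R), green (G) or blue (B) channel of the raw data in a spatial layout such as ‘RGGB’. Since the colors of adjacent pixels are discontinuous, we divide the original measurement $\mathbf{Y}$ into four sub-measurements $\{\mathbf{Y}^{r}, \mathbf{Y}^{g_1}, \mathbf{Y}^{g_2}, \mathbf{Y}^{b}\} \in \mathbb{R}^{\frac{n_x}{2} \times \frac{n_y}{2}}$. Correspondingly, the original masks and original video frames can also be divided into $\{\mathbf{M}^{r}, \mathbf{M}^{g_1}, \mathbf{M}^{g_2}, \mathbf{M}^{b}\}$ and $\{\mathbf{X}^{r}, \mathbf{X}^{g_1}, \mathbf{X}^{g_2}, \mathbf{X}^{b}\}$, respectively. The forward model of each channel can be expressed as
$$\mathbf{Y}^{r} = \sum_{m=1}^{T} \mathbf{X}^{r}_m \odot \mathbf{M}^{r}_m + \mathbf{N}^{r}, \tag{8}$$
$$\mathbf{Y}^{g_1} = \sum_{m=1}^{T} \mathbf{X}^{g_1}_m \odot \mathbf{M}^{g_1}_m + \mathbf{N}^{g_1}, \tag{9}$$
$$\mathbf{Y}^{g_2} = \sum_{m=1}^{T} \mathbf{X}^{g_2}_m \odot \mathbf{M}^{g_2}_m + \mathbf{N}^{g_2}, \tag{10}$$
$$\mathbf{Y}^{b} = \sum_{m=1}^{T} \mathbf{X}^{b}_m \odot \mathbf{M}^{b}_m + \mathbf{N}^{b}. \tag{11}$$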
For reconstruction, most previous algorithms [5, 18] reconstruct each sub-measurement separately, and then use off-the-shelf demosaic algorithms to obtain the desired RGB color videos. This channel-separated reconstruction algorithm cannot make good use of the correlation between channels and is inefficient. Therefore, we input all sub-measurements into our proposed reconstruction network simultaneously, and the reconstruction network directly outputs the desired RGB color video without employing previous demosaic algorithms.
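As an illustration of the sub-measurement split described above, the following sketch separates an 'RGGB' Bayer mosaic into its four half-resolution planes by strided slicing and then reassembles it. The helper names and the assumption that R sits at the top-left pixel are hypothetical choices for this example; a real sensor may use a different layout.

```python
import numpy as np

def split_bayer(mosaic):
    """Split an (H, W) RGGB mosaic into four (H/2, W/2) sub-measurements."""
    return {
        "r": mosaic[0::2, 0::2],    # even rows, even columns
        "g1": mosaic[0::2, 1::2],   # even rows, odd columns
        "g2": mosaic[1::2, 0::2],   # odd rows, even columns
        "b": mosaic[1::2, 1::2],    # odd rows, odd columns
    }

def merge_bayer(subs):
    """Inverse of split_bayer: interleave the four planes back into one mosaic."""
    h, w = subs["r"].shape
    mosaic = np.empty((2 * h, 2 * w), dtype=subs["r"].dtype)
    mosaic[0::2, 0::2] = subs["r"]
    mosaic[0::2, 1::2] = subs["g1"]
    mosaic[1::2, 0::2] = subs["g2"]
    mosaic[1::2, 1::2] = subs["b"]
    return mosaic

mosaic = np.arange(16, dtype=float).reshape(4, 4)   # toy Bayer measurement
subs = split_bayer(mosaic)
restored = merge_bayer(subs)
```

The same slicing applies to each mask and each video frame, so every colour plane obeys its own copy of the grayscale forward model.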
Compared with convolutional networks [27, 28, 29], Vision Transformer (ViT) [30] and its variants [31, 32, 33, 34, 35] have achieved competitive results in the field of image classification [36]. However, training the original ViT model requires a large dataset (e.g., JFT-300M [37]) to achieve good results. DeiT [38] introduced training strategies that enable the ViT model to achieve better performance on the ImageNet-1K dataset [36], but it still uses the global self-attention mechanism, whose computational complexity increases quadratically with the image size. These issues greatly limit the application of ViT to dense prediction tasks such as object detection [39] and semantic segmentation [40, 41, 42], and to low-level tasks such as image denoising [43] and image restoration [44].
Moreover, PVT V2 [45] proposed a linear spatial reduction attention (SRA) layer, which uses average pooling to reduce the spatial dimension, and achieves improvements on fundamental vision tasks such as classification, detection, and segmentation. MST [46] proposed a spectral-wise multi-head self-attention (S-MSA) and customized a mask-guided mechanism (MM), which effectively explores the application of Transformers to the spectral SCI task [47]. Swin Transformer [48] limits the self-attention calculation to local windows through the division of non-overlapping local windows and a shifted window mechanism, which reduces the computational complexity of the Transformer to a linear relationship with the image size. However, Swin Transformer can only explore spatial correlation. Video Swin Transformer [49] can obtain local-scale temporal correlation by extending the local window to the temporal domain, but still cannot achieve good results in long-term correlation scenarios. TimeSformer [50] investigates the spatial-temporal correlation between tokens through a space-time factorization mechanism, and has achieved extremely competitive results in the field of video recognition. However, because it uses a global self-attention mechanism in the spatial domain, its computational complexity grows quadratically with the image size, and its positional embedding size is fixed, lacking the flexibility to change with the input size.
In addition, most existing Transformers [51, 52, 53] divide images or videos into non-overlapping patches to generate tokens, which results in loss of local details, so they cannot be well applied to the video reconstruction tasks in this work.
Inspired by recent advances in Transformers, especially variants for video processing tasks, in this section, we introduce our proposed spatial-temporal Transformer in detail, which can efficiently explore long-range spatial-temporal correlations and lead to SOTA results on video SCI tasks.
The overall video reconstruction network architecture is depicted in Fig. 2. In the preprocessing stage shown in Fig. 2(a), we design the input initialization block with reference to GAP-net [20] and Two-stage [22]. By pre-processing the masks and the measurement with the input initialization block, we obtain a coarse estimate of the modulated frames, whose channel dimension is the input channel number of the network. This estimate is then fed into the token generation (TG) block [16] to get a series of consecutive tokens. The implementation details of the pre-processing for grayscale and color video SCI in the input initialization block are given in the supplementary material.
As demonstrated in Fig. 3(a), the TG block consists of five 3D convolutional layers, each followed by a LeakyReLU activation function [54]. The strides and feature mappings of the convolutional layers determine the number of generated tokens and the feature dimension of each token. Different from most token generation methods, our proposed approach does not divide the input into non-overlapping patches, but uses 3D convolution for feature mapping, and then treats each point of the feature map as a token. This helps reduce the loss of local details.
To better explore the spatial-temporal correlation between tokens, we design the STFormer block (described in Section 3.2) and stack multiple STFormer blocks. It is worth noting that we do not use any downsampling in the STFormer blocks, and the input and output dimensions of each STFormer block are kept consistent, which also helps prevent the loss of local details.
After the original tokens have been mapped by the STFormer blocks, the spatial-temporal correlation between tokens has been well established. We input the tokens into the video reconstruction (VR) block [16], as shown in Fig. 3(b), to obtain the final desired video frames. In the VR block, the number of output channels of the last convolutional layer varies according to the reconstruction task: 1 for grayscale video reconstruction and 3 for color video reconstruction.
As shown in Fig. 2 (d), the STFormer block is composed of three parts: a spatial self-attention (SSA) branch, a temporal self-attention (TSA) branch, and a grouping resnet feed forward (GRFF) network. The SSA branch explores the spatial correlation between tokens, and the TSA branch establishes the temporal correlation between tokens. The GRFF network further investigates the correlation between adjacent tokens.
Most of previous Transformer architectures used a global self-attention mechanism to calculate the correlation between tokens, which will cause the computational complexity of the Transformer model to increase quadratically with the number of tokens. This limits the application of Transformer to video SCI reconstruction tasks.
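Bearing this in mind, in our proposed spatial self-attention branch, we use a local self-attention method to calculate the spatial correlation. As shown in Fig. 2 (d), in the SSA branch, we first divide the feature map $\mathbf{X} \in \mathbb{R}^{T \times H \times W \times C}$ into a series of non-overlapping local windows $\mathbf{X}_w \in \mathbb{R}^{\frac{HWT}{hwt} \times hwt \times C}$, where $hwt$ represents the number of tokens in each local window, $\frac{HWT}{hwt}$ represents the number of local windows, and $h$, $w$ and $t$ represent the height, width and timing length of the local window, respectively. Since spatial self-attention only calculates the spatial correlation between tokens, we set $t$ to 1, and the default values of $h$ and $w$ are 7. Then, the self-attention computation is restricted to each non-overlapping local window.
For the self-attention calculation of the local window, we first linearly map $\mathbf{X}_w$ to get $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$:
$$\mathbf{Q} = \mathbf{X}_w \mathbf{W}^{Q}, \tag{12}$$
$$\mathbf{K} = \mathbf{X}_w \mathbf{W}^{K}, \tag{13}$$
$$\mathbf{V} = \mathbf{X}_w \mathbf{W}^{V}, \tag{14}$$
where $\mathbf{W}^{Q}$, $\mathbf{W}^{K}$ and $\mathbf{W}^{V} \in \mathbb{R}^{C \times C}$ represent projection matrices and share parameters between different windows. Then, we respectively divide $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ into $N$ heads along the feature channel $C$:
$$\mathbf{Q} = [\mathbf{Q}_1, \dots, \mathbf{Q}_N], \tag{15}$$
$$\mathbf{K} = [\mathbf{K}_1, \dots, \mathbf{K}_N], \tag{16}$$
$$\mathbf{V} = [\mathbf{V}_1, \dots, \mathbf{V}_N], \tag{17}$$
and the feature dimension of each head becomes $d = \frac{C}{N}$. For each head $j$, the attention can be calculated by the local self-attention mechanism as:
$$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_j \mathbf{K}_j^{\top}}{\sqrt{d}} + \mathbf{P}_j\right)\mathbf{V}_j, \tag{18}$$
where $\mathbf{K}_j^{\top}$ represents the transpose of matrix $\mathbf{K}_j$ and $\mathbf{P}_j$ represents the learnable relative position encoding. After this, we concatenate the outputs of the $N$ heads along the channel dimension and conduct a linear mapping to obtain the final output of the spatial local window multi-head self-attention mechanism ($\mathrm{MSA}_S$):
$$\mathrm{MSA}_S(\mathbf{X}) = \mathrm{WR}\!\left(\mathrm{Concat}[\mathrm{head}_1, \dots, \mathrm{head}_N]\,\mathbf{W}^{P}\right), \tag{19}$$
where $\mathbf{W}^{P} \in \mathbb{R}^{C \times C}$ represents the projection matrix and $\mathrm{WR}$ represents the window reverse operation returning to the original dimension. The whole process of the spatial self-attention branch can be expressed as:
$$\mathrm{SSA}(\mathbf{X}) = \mathrm{MSA}_S(\mathrm{LN}(\mathbf{X})) + \mathbf{X}. \tag{20}$$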
To compensate for the lack of connection between local windows, we adopt the shifted window partitioning approach of Swin Transformer [48] to establish information interaction between local windows.
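The spatial branch described above (partition each frame into non-overlapping windows, run multi-head attention inside each window, then reverse the partition) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, weights are random rather than learned, and layer normalization, the relative position encoding, the residual connection and the shifted-window step are all omitted.

```python
import numpy as np

def window_partition(x, h, w):
    """(T, H, W, C) -> (num_windows, h*w, C), one frame per window (t = 1)."""
    T, H, W, C = x.shape
    x = x.reshape(T, H // h, h, W // w, w, C).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, h * w, C)

def window_reverse(windows, h, w, T, H, W):
    """Inverse of window_partition."""
    C = windows.shape[-1]
    x = windows.reshape(T, H // h, W // w, h, w, C).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(T, H, W, C)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention applied independently to each of the B windows."""
    B, L, _ = x.shape

    def split_heads(z):                       # (B, L, C) -> (B, N, L, d)
        return z.reshape(B, L, num_heads, -1).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v                 # (B, N, L, d)
    out = out.transpose(0, 2, 1, 3).reshape(B, L, -1)
    return out @ Wo                           # concatenate heads, then project

def spatial_self_attention(x, Wq, Wk, Wv, Wo, num_heads, h=7, w=7):
    T, H, W, _ = x.shape
    windows = window_partition(x, h, w)
    out = window_attention(windows, Wq, Wk, Wv, Wo, num_heads)
    return window_reverse(out, h, w, T, H, W)

rng = np.random.default_rng(0)
T, H, W, C = 2, 14, 14, 8                     # toy sizes (assumption)
x = rng.standard_normal((T, H, W, C))
Wq, Wk, Wv, Wo = (0.1 * rng.standard_normal((C, C)) for _ in range(4))
y = spatial_self_attention(x, Wq, Wk, Wv, Wo, num_heads=2)
```

Because attention never crosses a window boundary, perturbing the tokens of one window leaves the outputs of every other window unchanged, which is exactly why a shifted partition is needed to pass information between windows.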
Previous reconstruction algorithms typically use 2D or 3D convolutions to explore temporal correlations. Because convolutions are locally connected, their receptive field is limited, which makes them incapable of capturing long-term correlation. Therefore, we exploit the long-term perception ability of the Transformer to build a temporal self-attention branch. Different from the SSA branch, the TSA branch only performs self-attention on tokens at the same spatial position. In other words, the TSA branch does not calculate the spatial correlation between tokens.
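As shown in Fig. 2(d), in the TSA branch, for the temporal self-attention calculation, we reshape the input to $\mathbf{X}_t \in \mathbb{R}^{HW \times T \times C}$. After this, similar to the local window self-attention calculation of the SSA branch, we first linearly map $\mathbf{X}_t$ to get $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$:
$$\mathbf{Q} = \mathbf{X}_t \mathbf{W}^{Q}, \quad \mathbf{K} = \mathbf{X}_t \mathbf{W}^{K}, \quad \mathbf{V} = \mathbf{X}_t \mathbf{W}^{V}, \tag{21}$$
where $\mathbf{W}^{Q}$, $\mathbf{W}^{K}$ and $\mathbf{W}^{V}$ represent projection matrices and share parameters between different temporal windows.
Different from the SSA branch, we reduce the channel dimension of the projected features to further reduce the computational complexity. Then, we respectively divide $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ into $N$ heads along the feature channel: $\mathbf{Q} = [\mathbf{Q}_1, \dots, \mathbf{Q}_N]$, $\mathbf{K} = [\mathbf{K}_1, \dots, \mathbf{K}_N]$, $\mathbf{V} = [\mathbf{V}_1, \dots, \mathbf{V}_N]$, and the feature dimension of each head becomes $d_t$, i.e., the reduced channel dimension divided by $N$. For each head $j$, the attention can be calculated by the temporal window self-attention mechanism as:
$$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_j \mathbf{K}_j^{\top}}{\sqrt{d_t}} + \mathbf{P}^{t}_j\right)\mathbf{V}_j, \tag{22}$$
where $\mathbf{P}^{t}_j$ represents the learnable relative position encoding. Then, we concatenate the outputs of the $N$ heads along the channel dimension and perform a linear mapping to obtain the final output of the temporal window multi-head self-attention mechanism ($\mathrm{MSA}_T$):
$$\mathrm{MSA}_T(\mathbf{X}) = \mathrm{R}\!\left(\mathrm{Concat}[\mathrm{head}_1, \dots, \mathrm{head}_N]\,\mathbf{W}^{P}\right), \tag{23}$$
where $\mathbf{W}^{P}$ represents the projection matrix, and $\mathrm{R}$ represents the reshape operation. The whole process of the temporal self-attention branch can be expressed as:
$$\mathrm{TSA}(\mathbf{X}) = \mathrm{MSA}_T(\mathrm{LN}(\mathbf{X})) + \mathbf{X}. \tag{24}$$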
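To make the temporal branch concrete, the sketch below reshapes a feature map so that each spatial position becomes its own length-T sequence and applies multi-head attention along time only. It is an illustration under assumptions, not the released implementation: the function name is hypothetical, the reduced projection width (`C // 2` here) is a guess at the channel reduction mentioned in the text, and layer normalization, position encoding and the residual connection are left out.

```python
import numpy as np

def temporal_self_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Attention over the time axis only; x has shape (T, H, W, C)."""
    T, H, W, C = x.shape
    seq = x.transpose(1, 2, 0, 3).reshape(H * W, T, C)    # one sequence per pixel

    def split_heads(z):                                   # -> (HW, N, T, d)
        return z.reshape(H * W, T, num_heads, -1).transpose(0, 2, 1, 3)

    q, k, v = split_heads(seq @ Wq), split_heads(seq @ Wk), split_heads(seq @ Wv)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(H * W, T, -1)
    out = out @ Wo                                        # project back to C channels
    return out.reshape(H, W, T, C).transpose(2, 0, 1, 3)  # back to (T, H, W, C)

rng = np.random.default_rng(0)
T, H, W, C = 8, 4, 4, 8          # toy sizes (assumption)
C_red = C // 2                   # reduced projection width (assumption)
x = rng.standard_normal((T, H, W, C))
Wq = 0.1 * rng.standard_normal((C, C_red))
Wk = 0.1 * rng.standard_normal((C, C_red))
Wv = 0.1 * rng.standard_normal((C, C_red))
Wo = 0.1 * rng.standard_normal((C_red, C))
y = temporal_self_attention(x, Wq, Wk, Wv, Wo, num_heads=2)
```

Every token attends to all T frames at its own pixel, so the temporal receptive field covers the whole measurement window in a single layer, while no information moves between pixels.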
The FF network of a regular Transformer [26] uses two linear mapping layers to transform features; the first linear layer expands the channel dimension (usually by a factor of 4), and the second linear layer restores the channel dimension to the original one [55]. Throughout the FF network, the operations on each feature point are independent of each other, so there is no interaction between feature points. In this work, we modify the original FF network with details shown in the GRFF network in Fig. 2(d). We divide the input feature map into two parts along the channel dimension; the first part of the feature is sent to a Resnet module, whose output is added to the second part of the feature, and the summed feature is used as input to another Resnet module. Following this, the outputs of the two Resnet modules are concatenated along the channel dimension to obtain the final output of the GRFF network. Given an input $\mathbf{X}$, the whole GRFF network process can be expressed as:
$$\mathbf{X}_1, \mathbf{X}_2 = \mathrm{Split}(\mathbf{X}), \tag{25}$$
$$\mathbf{Y}_1 = \mathrm{Res}_1(\mathbf{X}_1), \tag{26}$$
$$\mathbf{Y}_2 = \mathrm{Res}_2(\mathbf{Y}_1 + \mathbf{X}_2), \tag{27}$$
$$\mathbf{Y} = \mathrm{Concat}(\mathbf{Y}_1, \mathbf{Y}_2), \tag{28}$$
where $\mathrm{Split}$ represents feature division along the channel, $\mathrm{Res}_1$ and $\mathrm{Res}_2$ denote the two Resnet modules, $\mathrm{Concat}$ represents feature merging across the channel dimension and $\mathbf{Y}$ represents the output of the GRFF network. Overall, the GRFF network conducts more feature mapping by grouping and uses the Resnet mechanism to achieve multi-layer information fusion, and the use of convolution enhances the information interaction between adjacent feature points. All of these help improve the quality of video SCI reconstruction.
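The data flow of the GRFF network (split the channels, pass the first half through one residual module, add the result to the second half, pass the sum through a second residual module, concatenate both outputs) can be written in a few lines. In this sketch the residual modules are hypothetical stand-ins built from a single point-wise linear map and a ReLU; the paper's modules use convolutions, which are not reproduced here.

```python
import numpy as np

def make_residual(weight):
    """Toy residual module: z + relu(z @ weight). A stand-in, not the paper's block."""
    def block(z):
        return z + np.maximum(z @ weight, 0.0)
    return block

def grff(x, res1, res2):
    """Grouping-resnet feed-forward data flow over the last (channel) axis."""
    x1, x2 = np.split(x, 2, axis=-1)           # split channels into two groups
    y1 = res1(x1)                              # first group through first module
    y2 = res2(y1 + x2)                         # fuse with second group, second module
    return np.concatenate([y1, y2], axis=-1)   # merge back to the input width

rng = np.random.default_rng(0)
C = 8                                          # toy channel count (assumption)
x = rng.standard_normal((2, 4, 4, C))
res1 = make_residual(0.1 * rng.standard_normal((C // 2, C // 2)))
res2 = make_residual(0.1 * rng.standard_normal((C // 2, C // 2)))
y = grff(x, res1, res2)
```

The second group sees features that have already passed through the first module, which is the multi-layer fusion the text refers to, and the output keeps the input's channel count so blocks can be stacked.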
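In addition, we analyze the computational complexity of the spatial-temporal multi-head self-attention mechanism (STSA), which consists of SSA and TSA:
$$\Omega(\mathrm{SSA}) = 4HWTC^{2} + 2hw\,HWTC, \tag{35}$$
$$\Omega(\mathrm{TSA}) = 2HWTC^{2} + T\,HWTC, \tag{36}$$
$$\Omega(\mathrm{STSA}) = \Omega(\mathrm{SSA}) + \Omega(\mathrm{TSA}) \tag{37}$$
$$= 6HWTC^{2} + (2hw + T)\,HWTC, \tag{38}$$
where $\Omega$ represents the computational complexity, $h$ and $w$ are generally set to 7, and $T$ is the number of frames. We compare it with the global multi-head self-attention mechanism (GMSA) [46],
$$\Omega(\mathrm{GMSA}) = 4HWTC^{2} + 2(HWT)^{2}C. \tag{39}$$
We can observe that the computational complexity of our proposed STSA grows linearly with the spatial size $HW$, which is more computationally efficient than GMSA (quadratic in $HW$).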
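The linear-versus-quadratic scaling claimed above can be illustrated by counting multiply-accumulate operations for attention restricted to groups of tokens. The counting model below is a standard first-principles estimate (projection cost plus attention cost, ignoring softmax and biases); the function names and the reduced width passed to `tsa_cost` are assumptions made for this example.

```python
def attention_cost(num_tokens, group_len, c, c_qk=None, c_v=None):
    """Multiply-accumulate count for attention where each token attends to group_len tokens.

    c is the input/output channel width; c_qk and c_v are the query/key and value
    widths (both default to c).
    """
    c_qk = c if c_qk is None else c_qk
    c_v = c if c_v is None else c_v
    projections = num_tokens * c * (2 * c_qk + c_v) + num_tokens * c_v * c
    attention = num_tokens * group_len * (c_qk + c_v)
    return projections + attention

def ssa_cost(H, W, T, C, h=7, w=7):
    """Window attention: every token attends to the h*w tokens of its window."""
    return attention_cost(H * W * T, h * w, C)

def tsa_cost(H, W, T, C, c_red):
    """Temporal attention: every token attends to the T tokens at its pixel."""
    return attention_cost(H * W * T, T, C, c_qk=c_red, c_v=c_red)

def global_cost(H, W, T, C):
    """Global attention: every token attends to all H*W*T tokens."""
    n = H * W * T
    return attention_cost(n, n, C)

small = ssa_cost(56, 56, 8, 64) + tsa_cost(56, 56, 8, 64, c_red=32)
large = ssa_cost(112, 112, 8, 64) + tsa_cost(112, 112, 8, 64, c_red=32)
global_ratio = global_cost(112, 112, 8, 64) / global_cost(56, 56, 8, 64)
```

Quadrupling the number of pixels multiplies the factorized cost by exactly four, whereas the global cost grows by a considerably larger factor because of its squared token-count term.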
Dataset | Kobe | Traffic | Runner | Drop | Crash | Aerial | Average | Running time(s) |
GAP-TV[12] | 26.46, 0.845 | 20.89, 0.715 | 28.52, 0.909 | 34.63, 0.970 | 24.82, 0.838 | 25.05, 0.828 | 26.73, 0.858 | 4.2 (CPU) |
U-net[8] | 27.79, 0.807 | 24.62, 0.840 | 34.12, 0.947 | 36.56, 0.949 | 26.43, 0.882 | 27.18, 0.869 | 29.45, 0.882 | 0.03 (GPU) |
PnP-FFDNet[18] | 30.50, 0.926 | 24.18, 0.828 | 32.15, 0.933 | 40.70, 0.989 | 25.42, 0.849 | 25.27, 0.829 | 29.70, 0.892 | 3.0 (GPU) |
PnP-FastDVDnet[19] | 32.73, 0.947 | 27.95, 0.932 | 36.29, 0.962 | 41.82, 0.989 | 27.32, 0.925 | 27.98, 0.897 | 32.35, 0.942 | 6.0 (GPU) |
DeSCI[13] | 33.25, 0.952 | 28.71, 0.925 | 38.48, 0.969 | 43.10, 0.993 | 27.04, 0.909 | 25.33, 0.860 | 32.65, 0.935 | 6180 (CPU) |
BIRNAT[15] | 32.71, 0.950 | 29.33, 0.942 | 38.70, 0.976 | 42.28, 0.992 | 27.84, 0.927 | 28.99, 0.917 | 33.31, 0.951 | 0.16 (GPU) |
GAP-net-Unet-S12 [20] | 32.09, 0.944 | 28.19, 0.929 | 38.12, 0.975 | 42.02, 0.992 | 27.83, 0.931 | 28.88, 0.914 | 32.86, 0.947 | 0.03 (GPU) |
MetaSCI [25] | 30.12, 0.907 | 26.95, 0.888 | 37.02, 0.967 | 40.61, 0.985 | 27.33, 0.906 | 28.31, 0.904 | 31.72, 0.926 | 0.03 (GPU) |
RevSCI [16] | 33.72, 0.957 | 30.02, 0.949 | 39.40, 0.977 | 42.93, 0.992 | 28.12, 0.937 | 29.35, 0.924 | 33.92, 0.956 | 0.19 (GPU) |
DUN-3DUnet [21] | 35.00, 0.969 | 31.76, 0.966 | 40.03, 0.980 | 44.96, 0.995 | 29.33, 0.956 | 30.46, 0.943 | 35.26, 0.968 | 1.35 (GPU) |
STFormer-S | 33.19, 0.955 | 29.19, 0.941 | 39.00, 0.979 | 42.84, 0.992 | 29.26, 0.950 | 30.13, 0.934 | 33.94, 0.958 | 0.14 (GPU) |
STFormer-B | 35.53, 0.973 | 32.15, 0.967 | 42.64, 0.988 | 45.08, 0.995 | 31.06, 0.970 | 31.56, 0.953 | 36.34, 0.974 | 0.49 (GPU) |
STFormer-L | 36.02, 0.975 | 32.74, 0.971 | 43.40, 0.989 | 45.48, 0.995 | 31.04, 0.971 | 31.85, 0.956 | 36.75, 0.976 | 0.92 (GPU) |
In this section, we compare the performance of the proposed STFormer network with several SOTA video reconstruction methods on multiple simulation and real datasets. The peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [57] are used to evaluate the performance of different video SCI reconstruction methods on simulation datasets.
We use DAVIS2017 [58] as the training dataset for the model, which contains 90 different scenes at two resolutions. For the grayscale simulation video testing datasets, we use six benchmark datasets, including Kobe, Runner, Drop, Traffic, Aerial and Vehicle, following the setup in [18]. For the color simulation video testing datasets, we follow PnP-FastDVDnet [19], using six benchmark color simulation datasets, including Beauty, Bosphorus, Jockey, Runner, ShakeNDry and Traffic. For the large-scale simulation video testing datasets, we use four benchmark large-scale simulation datasets, including Messi, Hummingbird, Swinger and Football, as used in [19].
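For reference, PSNR follows directly from the mean squared error between a reconstructed frame and its ground truth. The sketch below is a generic NumPy implementation, not the evaluation script used for the tables; the function name and the assumption that pixel values lie in [0, 1] (so `peak=1.0`) are illustrative choices.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.zeros((8, 8))
estimate = np.full((8, 8), 0.1)      # constant error of 0.1 -> MSE = 0.01
value = psnr(reference, estimate)    # 10 * log10(1 / 0.01) = 20 dB
```

For a video, the per-frame values are typically averaged; because the scale is logarithmic, a 3 dB gain corresponds to roughly halving the mean squared error.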
During training, we perform data augmentation on DAVIS2017 using random horizontal flipping, random scaling, and random cropping. Following the CACTI imaging process, a series of measurements are generated. We use the measurements and masks as inputs to train the STFormer network and use the Adam optimizer [59] to optimize the model. Since our proposed STFormer network is flexible in input size, to speed up model training, we first train the model on data with a smaller spatial resolution (100 epochs with an initial learning rate of 0.0001) and then fine-tune it on data with a larger spatial resolution (20 epochs with an initial learning rate of 0.00001). All experiments are run on the PyTorch framework with 8 NVIDIA RTX 3090 GPUs.
In this section we present the results from the simulation to real dataset, first in grayscale and then in color.
To trade off speed and performance, we have trained three models of different sizes, dubbed STFormer-L, STFormer-B and STFormer-S, standing for Large, Base and Small networks, respectively. The hyper-parameters of these models are as follows:
STFormer-S: $C$ = 64, block numbers = {2,2,2,2},
STFormer-B: $C$ = 256, block numbers = {2,2,2,2},
STFormer-L: $C$ = 256, block numbers = {4,4,4,4},
where $C$ represents the number of input channels of the STFormer block. The model parameters (Params) and theoretical computational complexity (FLOPs) are shown in Tab. II, where we can observe that the parameters and computation of our proposed STFormer-S network are lower than those of BIRNAT and RevSCI, and those of the STFormer-B network are lower than those of DUN-3DUnet.
Method | Params (M) | FLOPs (G) | PSNR | SSIM |
BIRNAT [15] | 4.13 | 390.56 | 33.31 | 0.951 |
RevSCI [16] | 5.66 | 766.95 | 33.92 | 0.956 |
DUN-3DUnet [21] | 61.91 | 3975.83 | 35.26 | 0.968 |
STFormer-S | 1.22 | 193.47 | 33.94 | 0.958 |
STFormer-B | 19.48 | 3060.75 | 36.34 | 0.974 |
STFormer-L | 36.81 | 5363.98 | 36.75 | 0.976 |
Currently, there are various methods for video SCI reconstruction. Here we compare our method with some SOTA methods, including model-based iterative optimization methods (GAP-TV [12] and DeSCI [13]), end-to-end deep learning methods (U-net [8], MetaSCI [25], BIRNAT [15] and RevSCI [16]), plug-and-play methods (PnP-FFDNet [18], PnP-FastDVDnet [19]) and deep unfolding methods (GAP-net [20], DUN-3DUnet [21]). Tab. I presents the average PSNR and SSIM values for different reconstruction methods on 6 benchmark grayscale datasets and the average reconstruction time for a single measurement. Fig. 4 shows the visualization results of several SOTA reconstruction methods. We can observe that our proposed method achieves higher reconstruction quality and better real-time performance than previous SOTA methods by a large margin. From the visualization of reconstructed videos, our proposed method can recover more details and edge information. We summarize the observations in Tab. I and Fig. 4 as follows:
Our proposed method (STFormer-L) achieves an average PSNR value of 36.75 dB and an SSIM value of 0.976. Compared with the previous SOTA method DUN-3DUnet (best published results) and the end-to-end deep learning method RevSCI, our proposed method achieves 1.49 dB and 2.83 dB higher average PSNR, respectively.
For reconstruction running time, our proposed method achieves a good balance between reconstruction quality and running performance. The reconstruction quality of our proposed STFormer-B model is higher than 36 dB, and the running time is within 500ms. The running speed and reconstruction quality of the STFormer-S model are higher than most current reconstruction algorithms. Although U-net, MetaSCI, and GAP-net run faster, these algorithms have poor reconstruction quality, with an average PSNR value of less than 33 dB, which is more than 3 dB lower than STFormer-L. In some complex high-speed scenarios, such as Traffic, Crash, Aerial datasets, the reconstruction quality of these methods cannot even reach 29 dB.
Benefiting from the powerful model capacity of the Transformer and its ability to effectively explore long-term dependencies, STFormer has excellent performance in complex scenarios (such as Aerial data) and high-speed scenarios (such as Crash data). The reconstruction quality on these two datasets reaches 31 dB for the first time. From the visualization frames in Fig. 4, we can recover clear edges of tree trunks in the Aerial data and the marks on the vehicles in the Crash data. Previous SOTA methods are unable to reconstruct these details, which often leads to excessive smoothing.
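The margins quoted in the observations above can be recomputed from the values in Tab. I. The short script below only copies numbers from that table (the average PSNR of four methods and the per-dataset PSNR of STFormer-L) and performs the subtraction and averaging; the variable names are illustrative.

```python
# Average PSNR (dB) over the six grayscale benchmarks, copied from Tab. I.
avg_psnr = {
    "GAP-net-Unet-S12": 32.86,
    "RevSCI": 33.92,
    "DUN-3DUnet": 35.26,
    "STFormer-L": 36.75,
}

# Gain of STFormer-L over each baseline, rounded to two decimals.
gain = {
    name: round(avg_psnr["STFormer-L"] - value, 2)
    for name, value in avg_psnr.items()
    if name != "STFormer-L"
}

# Per-dataset PSNR of STFormer-L (Kobe, Traffic, Runner, Drop, Crash, Aerial).
stformer_l = [36.02, 32.74, 43.40, 45.48, 31.04, 31.85]
mean_psnr = sum(stformer_l) / len(stformer_l)

for name, value in gain.items():
    print(f"STFormer-L vs {name}: +{value:.2f} dB")
```

The gains over DUN-3DUnet and RevSCI come out as 1.49 dB and 2.83 dB, matching the text, and the per-dataset mean agrees with the tabulated average to within rounding.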
Dataset | TimeSformer [50] | Video Swin Transformer [49] | STFormer-B |
Kobe | 31.12, 0.932 | 30.72, 0.924 | 35.53, 0.973 |
Traffic | 27.50, 0.917 | 27.26, 0.911 | 32.15, 0.967 |
Runner | 37.13, 0.974 | 37.25, 0.971 | 42.64, 0.988 |
Drop | 40.14, 0.988 | 39.82, 0.987 | 45.08, 0.995 |
Crash | 28.11, 0.931 | 28.46, 0.939 | 31.06, 0.970 |
Aerial | 28.96, 0.915 | 29.07, 0.915 | 31.56, 0.953 |
Average | 32.16, 0.943 | 32.09, 0.941 | 36.34, 0.974 |
To further explore the effectiveness of our proposed STFormer network for video SCI, we directly apply several SOTA video Transformer models, specifically TimeSformer [50] and Video Swin Transformer [49], to video SCI reconstruction tasks. For the original Video Swin Transformer, since the hierarchical structure makes the spatial resolution of the model output too low, we use U-net [60] to upsample the deep features and fuse them with the shallow features. In this way, the model can predict the final reconstruction result. As shown in Tab. III, the reconstruction quality of our proposed STFormer is significantly better than that of TimeSformer and Video Swin Transformer, which further verifies the effectiveness of the STFormer network for video SCI.
Dataset | Beauty | Bosphorus | Jockey | Runner | ShakeNDry | Traffic | Average | Running time(s) |
GAP-TV[12] | 33.08, 0.964 | 29.70, 0.914 | 29.48, 0.887 | 29.10, 0.878 | 29.59, 0.893 | 19.84, 0.645 | 28.47, 0.864 | 10.80 (CPU) |
DeSCI[13] | 34.66, 0.971 | 32.88, 0.952 | 34.14, 0.938 | 36.16, 0.949 | 30.94, 0.905 | 24.62, 0.839 | 32.23, 0.926 | 92640 (CPU) |
PnP-FFDNet-gray[18] | 33.21, 0.963 | 28.43, 0.905 | 32.30, 0.918 | 30.83, 0.888 | 27.87, 0.861 | 21.03, 0.711 | 28.93, 0.874 | 13.20 (GPU) |
PnP-FFDNet-color[18] | 34.15, 0.967 | 33.06, 0.957 | 34.80, 0.943 | 35.32, 0.940 | 32.37, 0.940 | 24.55, 0.837 | 32.38, 0.931 | 97.80 (GPU) |
PnP-FastDVDnet-gray[19] | 33.01, 0.963 | 30.95, 0.934 | 33.51, 0.928 | 32.82, 0.900 | 29.92, 0.892 | 22.81, 0.776 | 30.50, 0.899 | 19.80 (GPU) |
PnP-FastDVDnet-color [19] | 35.27, 0.972 | 37.24, 0.978 | 35.63, 0.950 | 38.22, 0.965 | 33.71, 0.969 | 27.49, 0.915 | 34.60, 0.955 | 99.05 (GPU) |
BIRNAT-color [61] | 36.08, 0.975 | 38.30, 0.982 | 36.51, 0.956 | 39.65, 0.973 | 34.26, 0.951 | 28.03, 0.915 | 35.47, 0.959 | 0.98 (GPU) |
STFormer-S | 36.83, 0.980 | 38.36, 0.981 | 37.09, 0.963 | 40.56, 0.980 | 34.67, 0.952 | 29.00, 0.923 | 36.09, 0.963 | 0.54 (GPU) |
STFormer-B | 37.37, 0.981 | 40.39, 0.988 | 38.32, 0.968 | 42.45, 0.985 | 35.15, 0.956 | 30.24, 0.939 | 37.32, 0.970 | 1.95 (GPU) |
To verify the effectiveness of our method on various video SCI reconstruction tasks, we extend the STFormer network to the color SCI reconstruction task. We conduct related experiments on six benchmark color RGB datasets [19], each with three RGB channels. Similar to grayscale video, we compress the video along the temporal dimension. As shown in Fig. 1, we capture compressed Bayer measurements using a camera with a Bayer filter. For each dataset with 32 color video frames, we can get 4 Bayer measurements, i.e., 8 frames are compressed into each measurement.
Dataset | Messi | Hummingbird | Swinger | Football | Average | Running time(s) |
GAP-TV[12] | 25.20, 0.874 | 29.64, 0.897 | 24.64, 0.847 | 28.88, 0.919 | 27.09, 0.884 | 39.96 (CPU) |
PnP-FFDNet[18] | 30.83, 0.962 | 31.48, 0.945 | 25.27, 0.881 | 29.19, 0.930 | 29.19, 0.930 | 31.96 (GPU) |
PnP-FastDVDnet[19] | 31.57, 0.960 | 33.99, 0.878 | 26.30, 0.893 | 34.12, 0.965 | 31.50, 0.924 | 209.59 (GPU) |
STFormer-S | 33.55, 0.964 | 38.20, 0.965 | 31.98, 0.964 | 39.13, 0.988 | 35.72, 0.970 | 4.25 (GPU) |
Since STFormer is flexible with respect to input size, we use the approach described in Sec. 4.2 to train the model on small-scale data, which speeds up training and saves memory. Because DUN-3DUnet [21] and RevSCI [16] are inflexible with respect to input size and masks, training such a model with a spatial size of requires a large amount of memory or training time. We therefore compare only with iterative optimization algorithms (GAP-TV [12] and DeSCI [13]), an end-to-end deep learning algorithm (BIRNAT-color [61]) and PnP algorithms (PnP-FFDNet [18] and PnP-FastDVDnet [19]). Note that the PnP methods come in a gray version and a color version, depending on whether a grayscale or a color denoiser is used. The reconstruction results of the different algorithms are shown in Fig. 5 and Tab. IV, and we summarize the observations as follows:
The average PSNR of STFormer-B reaches 37.32 dB, which is 1.85 dB higher than the previous SOTA algorithm BIRNAT-color. The gain is 2.09 dB on the high-speed motion scene Bosphorus, and STFormer-B is the first method to exceed 30 dB on the Traffic dataset. This shows that STFormer is also effective in color high-speed scenes.
Regarding running time, DeSCI needs more than 24 hours to reconstruct each measurement. GAP-TV and the PnP algorithms are much faster but still take more than 10 seconds, whereas STFormer-B needs only 1.95 seconds, which is more than 5 times faster than the PnP methods. BIRNAT-color uses an end-to-end model to reduce the reconstruction time further, to 0.98 seconds. Nevertheless, our proposed STFormer-S model achieves both higher reconstruction quality and a shorter running time (0.54 seconds).
From the visualization results, our method recovers sharper edges on the Beauty, Jockey, Runner and Traffic datasets, and more local details on the Bosphorus and ShakeNDry datasets. The reconstructed results of GAP-TV, PnP-FFDNet-gray and PnP-FastDVDnet-gray contain some artifacts, while those of PnP-FFDNet-color and PnP-FastDVDnet-color have blurred edges and a serious loss of details.
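For readers unfamiliar with the metric quoted in the tables, the standard PSNR definition can be sketched as follows. The peak value of 1.0 (images scaled to [0, 1]) and the toy 2 × 2 images are assumptions for illustration; the source does not state the exact evaluation script, and the tabulated values are presumably averaged over all frames of a dataset.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two equally sized 2D images given as lists of rows."""
    squared_errors = [(r - e) ** 2
                      for ref_row, est_row in zip(reference, estimate)
                      for r, e in zip(ref_row, est_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


reference = [[0.0, 0.5], [1.0, 0.25]]
estimate = [[0.1, 0.5], [0.9, 0.25]]
print(round(psnr(reference, estimate), 2))  # MSE = 0.005, i.e. about 23.01 dB
```

Since PSNR is logarithmic in the mean squared error, a gain of 1.85 dB corresponds to roughly a 35% reduction in MSE.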
TG Block | Video Swin Block | SSA Branch | TSA Branch | GRFF network | PSNR | SSIM | |||
(a) | ✓ | ✓ | 33.27 | 0.952 | |||||
(b) | ✓ | ✓ | ✓ | 34.41 | 0.961 | ||||
(c) | ✓ | ✓ | ✓ | 34.28 | 0.960 | ||||
(d) | ✓ | ✓ | ✓ | 33.41 | 0.954 | ||||
(e) | ✓ | ✓ | ✓ | ✓ | 35.26 | 0.969 | |||
(f) | ✓ | ✓ | ✓ | ✓ | 35.15 | 0.967 | |||
(g) | ✓ | ✓ | ✓ | 35.04 | 0.964 | ||||
(h) | ✓ | ✓ | ✓ | ✓ | 36.34 | 0.974 |
Model | Channel | Block | PSNR | SSIM | Running time(s) |
STFormer-S | 64 | 8 | 33.94 | 0.958 | 0.14 |
STFormer-B | 256 | 8 | 36.34 | 0.974 | 0.49 |
STFormer-L | 256 | 16 | 36.75 | 0.976 | 0.92 |
Similar to the benchmark color data, we further extend STFormer to large-scale datasets. Following [19], we use 4 benchmark large-scale datasets: Messi, Hummingbird and Swinger with a spatial size of and Football with a spatial size of , where 3 represents the RGB channels. As with the color simulation video, we generate measurements with a compression rate of 8. Note that this differs from the various compression rates used in [19]. We fix the compression rate so that the model trained for the mid-scale data can be reused to reconstruct these large-scale datasets, which verifies the scalability and flexibility of our model. We can also train the model for larger compression rates, at the cost of additional training time and memory.
Because training BIRNAT [15] and DUN-3DUnet [21] on large-scale datasets requires a large amount of memory, and RevSCI [16] requires a long training time, we compare STFormer only with GAP-TV [12], PnP-FFDNet [18] and PnP-FastDVDnet [19]. Tab. V and Fig. 6 show the reconstruction results of these algorithms.
As shown in Fig. 6, the GAP-TV reconstructions are blurry and PnP-FastDVDnet produces artifacts on some datasets, while our STFormer achieves more realistic results. More importantly, our reconstruction exceeds 31 dB on every one of these datasets, which indicates that video SCI can be applied to real scenes.
As for the running time, the small version of STFormer used here keeps inference efficient. As shown in Tab. V, reconstructing a single measurement takes only 4.25 seconds with our algorithm, which is 49 times faster than the previous SOTA algorithm PnP-FastDVDnet (209.59 seconds).
To verify the effect of each module in the proposed STFormer network on the overall reconstruction quality, we conducted ablation experiments on each module. Tab. VI shows the effect of each module on the reconstruction quality using the 6 grayscale benchmark datasets, where ✓ indicates that the reconstruction network includes this module, and no TG block indicates that the token generation method of the original Swin Transformer is used. In addition, and represent FF network expansion factors and , respectively. We can get the following observations:
Improvements in STFormer Block: Here, we mainly compare the STFormer block with Video Swin Transformer block. Tab. VI(b,e) and Tab. VI(c,f) show that our proposed STFormer block can bring an improvement of at least 0.85 dB. In addition, we also verified the TSA branch of the STFormer block. As shown in Tab. VI(d,e), the reconstruction quality of the STFormer block with TSA branch can be improved by about 1.85 dB.
Improvements in Grouping Resnet Feed Forward network: We compare the GRFF network with the traditional FF network consisting of MLP layers. As shown in Tab. VI(e,f,h), the GRFF network in (h) provides a gain of 1.08 dB over (e) and 1.19 dB over (f).
Impact of width and depth of STFormer network: To verify the influence of the width and depth of the STFormer network on the reconstruction quality, we designed three models that differ in the number of channels (model width) and the number of blocks (model depth). As shown in Tab. VII, increasing the width and depth of the model improves the reconstruction quality, but it also leads to a longer running time.
Mask | BIRNAT | RevSCI | DUN-3DUnet | STFormer-B |
23.15, 0.731 | 18.90, 0.531 | 31.59, 0.934 | 36.28, 0.974 | |
23.09, 0.730 | 18.99, 0.537 | 31.62, 0.934 | 36.33, 0.974 | |
23.08, 0.728 | 18.96, 0.528 | 31.80, 0.935 | 36.32, 0.974 |
Previous end-to-end learning methods usually require retraining models for different masks and different spatial sizes. When dealing with large-scale data, they often require a large amount of memory and training time, which is inefficient. Transformer has a strong model capacity, and it can dynamically adjust the attention map according to different model inputs [44]. Along with our initialization method, the proposed STFormer network is robust to different masks.
To verify this flexibility, we randomly generate three masks that are not used during training. As shown in Tab. VIII, for these masks the average PSNR of the STFormer-B reconstructions stays within 0.06 dB of the 36.34 dB reported in Tab. VII, and it remains higher than that of the current SOTA algorithms. By contrast, the previous end-to-end deep learning methods BIRNAT [15] and RevSCI [16] drop by more than 10 dB, and the deep unfolding method DUN-3DUnet [21] drops by more than 3 dB.
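The stability claim can be checked directly from the tabulated numbers. The short script below copies the average PSNR values from Tab. VIII and assumes that the 36.34 dB STFormer-B result in Tab. VII is the matching training-mask baseline; reference values for the other methods are not given in this section, so only their spread across the three masks is computed.

```python
# Average PSNR (dB) on three unseen masks, copied from Tab. VIII.
unseen_mask_psnr = {
    "BIRNAT": [23.15, 23.09, 23.08],
    "RevSCI": [18.90, 18.99, 18.96],
    "DUN-3DUnet": [31.59, 31.62, 31.80],
    "STFormer-B": [36.28, 36.33, 36.32],
}
# Assumed baseline: STFormer-B on the training mask (Tab. VII).
stformer_reference_psnr = 36.34


def psnr_spread(values):
    """Difference between the best and worst PSNR over the masks."""
    return max(values) - min(values)


def max_drop(reference, values):
    """Largest PSNR loss relative to the reference value."""
    return max(reference - v for v in values)


for method, values in unseen_mask_psnr.items():
    print(f"{method}: spread over masks = {psnr_spread(values):.2f} dB")

drop = max_drop(stformer_reference_psnr, unseen_mask_psnr["STFormer-B"])
print(f"STFormer-B: largest drop vs. training mask = {drop:.2f} dB")
```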
Thanks to this flexibility, together with the local windows and relative position bias used in the STFormer block, an STFormer network trained on small-scale datasets can be applied to large-scale datasets. As shown in Tab. IV and Tab. V, these STFormer models are trained on small-scale data (spatial size less than or equal to ) and then used directly to reconstruct data with a larger spatial size, achieving SOTA reconstruction results in all cases.
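One way to see why local windows and a relative position bias transfer across resolutions is that the learned bias table depends only on the window size, not on the image size. The sketch below uses the Swin-Transformer-style indexing of relative offsets; the window size of 8 and the function names are assumptions for illustration and are not taken from the STFormer code.

```python
def num_windows(height, width, window):
    """Number of non-overlapping window x window tiles covering the image."""
    assert height % window == 0 and width % window == 0
    return (height // window) * (width // window)


def relative_position_index(window):
    """For every token pair inside one window, the index of its bias entry.

    Relative offsets range over [-(window-1), window-1] in each direction,
    so the bias table holds (2*window - 1)**2 entries for any image size.
    """
    coords = [(r, c) for r in range(window) for c in range(window)]
    size = 2 * window - 1
    return [[(r1 - r2 + window - 1) * size + (c1 - c2 + window - 1)
             for (r2, c2) in coords]
            for (r1, c1) in coords]


WINDOW = 8  # assumed window size
index = relative_position_index(WINDOW)
table_entries = (2 * WINDOW - 1) ** 2

for side in (128, 256, 512):
    print(side, num_windows(side, side, WINDOW), table_entries)
```

Only the number of windows grows with the input; the per-window attention pattern and its bias parameters stay the same, so weights trained on small crops can be reused on larger frames.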
We validate our proposed method on grayscale and color real data. Since a real video SCI imaging system introduces unknown noise, reconstructing real data is more challenging.
For the grayscale real data, we use the Domino, Water Balloon and Hand video data captured by [8]. It is worth noting that similar scenes are captured with different compression ratios, and the spatial size of all snapshot measurements is . As shown in Fig. 7, we first compare our reconstruction results with those of several SOTA reconstruction algorithms, namely GAP-TV [12], DeSCI [13], PnP-FFDNet [18] and BIRNAT [15], in the scenes with compression rate . By zooming in on the local areas, we can observe that our proposed algorithm recovers clear letters and sharp edges in the Domino and Water Balloon data, while the reconstruction results of GAP-TV, PnP-FFDNet, DeSCI and BIRNAT over-smooth these areas and contain some artifacts.
Real Data | Pixel resolution | GAP-TV | DeSCI | PnP-FFDNet | PnP-FastDVDnet | STFormer-S |
Hand10 | | 37.8 | 2880.0 | 19.3 | 29.5 | 1.5 |
Hand20 | | 88.7 | 4320.0 | 42.4 | 63.9 | 1.8 |
Hand30 | | 163.0 | 6120.0 | 74.7 | 107.7 | 2.2 |
Hand50 | | 303.4 | 12600.0 | 144.5 | 203.9 | 2.7 |
In addition, our proposed STFormer network also achieves good reconstruction results at high compression rates, e.g., at a compression rate of 50, which further verifies the capability of our method to explore long-term temporal dependencies. Although previous reconstruction algorithms can reconstruct high-compression data, their results are over-smoothed and require a long running time (see Tab. IX): on Hand50, PnP-FastDVDnet takes 3.4 minutes and DeSCI more than 3 hours, whereas our method needs only 2.7 seconds. Fig. 10 and Fig. 11 show the reconstruction results of Hand and Domino with , respectively. We can observe that our proposed method reconstructs the desired high-speed video frames well at compression rates from 10 to 50. Compared with the previous SOTA method PnP-FastDVDnet, our results provide clear details of Domino and Water Balloon even at . Please refer to Fig. 8.
We have further verified our proposed algorithms on the new video SCI system built at Westlake University similar to [8] but with different masks and different compression rates. Please refer to the reconstructed videos in the supplemental material.
For the color real data, we use the Hammer video data captured by [5]. The spatial resolution of a single Bayer mosaic measurement is and the compression rate is 22. Since most reconstruction algorithms cannot be applied to color data, we compare our method only with GAP-TV [12], DeSCI [13] and PnP-FFDNet [18]. Fig. 9 shows the reconstruction results of these algorithms. By zooming in on the local areas, we can see that the reconstruction results of GAP-TV, DeSCI and PnP-FFDNet have some artifacts and blurred edges, whereas our proposed STFormer method restores sharp edges.
Furthermore, to address the lack of real color data for video SCI, we have built a new video SCI system at Westlake University using RGB sensors, with different masks and different compression rates. The reconstructed videos are shown in Fig. 12 and Fig. 13. Compared with previous SOTA algorithms, the reconstruction results of our proposed STFormer network are closer to the real colors and recover more details. Please refer to the enlarged areas in Figs. 12-13.
In this paper, we present STFormer, a spatial-temporal Transformer, to conduct the reconstruction task of video snapshot compressive imaging. Our proposed STFormer network consists of token generation block, video reconstruction block and a series of STFormer blocks. In particular, each STFormer block restricts the self-attention calculation to the spatial local window and time domain through the space-time factorization and local self-attention mechanism, which improves the computational efficiency and increases the flexibility of the model for multi-scale input. Since STFormer can effectively explore spatial-temporal correlations, it achieves SOTA results on multiple video SCI reconstruction tasks. Especially for complex and high-speed motion scenes, STFormer can achieve a reconstruction quality of more than 30 dB, far exceeding the previous SOTA reconstruction algorithms. Furthermore, STFormer is the first end-to-end deep learning network with flexibility of masks and input scale, while enjoying fast inference, greatly facilitating applications of video SCI system in our daily life.
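The efficiency argument behind space-time factorization can be made concrete with a rough count of attended token pairs. The sketch below is a back-of-the-envelope estimate, not a measurement of STFormer: it assumes one token per pixel per frame, a hypothetical 8 × 8 spatial window, and ignores channels, heads and any downsampling inside the network.

```python
def global_attention_pairs(frames, height, width):
    """Every token attends to every other token in the whole video clip."""
    tokens = frames * height * width
    return tokens * tokens


def factorized_attention_pairs(frames, height, width, window):
    """Spatial attention inside a local window plus temporal attention
    across frames at the same spatial position."""
    tokens = frames * height * width
    spatial = tokens * window * window  # window*window keys per query
    temporal = tokens * frames          # one key per frame per query
    return spatial + temporal


FRAMES, SIDE, WINDOW = 8, 256, 8  # illustrative sizes only
full = global_attention_pairs(FRAMES, SIDE, SIDE)
factorized = factorized_attention_pairs(FRAMES, SIDE, SIDE, WINDOW)
print(full, factorized, round(full / factorized))
```

The factorized count grows linearly with the number of pixels, whereas the global count grows quadratically, which is consistent with the model being applied to larger inputs without retraining.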
Although STFormer has achieved satisfactory results on video SCI, video SCI reconstruction still faces many difficulties, such as designing deep learning models that adapt to different compression rates. In addition, for real color data, complicated noise still leaves a large gap between the reconstruction results and the real scene.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3318–3325.
Z. Cheng, R. Lu, Z. Wang, H. Zhang, B. Chen, Z. Meng, and X. Yuan, “BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging,” in European Conference on Computer Vision. Springer, 2020, pp. 258–275.
International Conference on Machine Learning. PMLR, 2019, pp. 573–582.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ADE20K dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 633–641.
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.