DeSCI
Rank Minimization for Snapshot Compressive Imaging (SCI)
Snapshot compressive imaging (SCI) refers to compressive imaging systems where multiple frames are mapped into a single measurement, with video compressive imaging and hyperspectral compressive imaging as two representative applications. Though exciting results on high-speed videos and hyperspectral images have been demonstrated, the poor reconstruction quality precludes SCI from wide applications. This paper aims to boost the reconstruction quality of SCI by exploiting the high-dimensional structure in the desired signal. We build a joint model to integrate the nonlocal self-similarity of video/hyperspectral frames and the rank minimization approach with the SCI sensing process. Following this, an alternating minimization algorithm is developed to solve this non-convex problem. We further investigate the special structure of the sampling process in SCI to tackle the computational workload and memory issues in SCI reconstruction. Both simulation and real data (captured by four different SCI cameras) results demonstrate that our proposed algorithm leads to significant improvements compared with current state-of-the-art algorithms. We hope our results will encourage researchers and engineers to further pursue compressive imaging for real applications.
Compressive sensing (CS) [1, 2]
has inspired practical compressive imaging systems to capture high-dimensional data such as videos
[3, 4, 5, 6, 7, 8] and hyper-spectral images [9, 10, 11, 12, 13]. In video CS, high-speed frames are modulated at a frequency higher than the capture rate of the camera, which operates at a low frame rate. To achieve an ultra-high frame rate [14], only a single measurement is available per pixel for multiple frames in these high-dimensional compressive imaging systems. In this manner, a number of high-speed frames can be recovered from each captured measurement frame, the number depending on the coding strategy, e.g., 148 frames reconstructed from a snapshot in [5]; if the CS imaging system samples measurements at 30 frames per second (fps), we can achieve a video frame rate higher than 4,000 fps. In hyper-spectral image CS, wavelength-dependent coding is implemented by a coded aperture (physical mask) and a disperser [10, 11]. More than 30 hyperspectral images have been reconstructed from a snapshot measurement. These systems are referred to as snapshot compressive imaging (SCI) systems. Though these SCI systems have led to exciting results, the poor quality of the reconstructed images precludes them from wide applications. Therefore, algorithms with high reconstruction quality are desired. Our paper fills this gap by improving the performance of optimization-based reconstruction algorithms. It is worth noting that, different from traditional CS [15], which uses random sensing matrices, the sensing matrix in SCI has a special structure and is not random. Though this raises challenges in theoretical analysis [16], we can gain speed in algorithmic development (details in Sec. 4.3) by using this structure. Most importantly, this structure is inherent in the hardware of SCI systems, such as video CS [5] and spectral CS [11].
Without loss of generality, we use video CS as an example below to describe the motivation and contribution of this work. Hardware reviews and results for hyperspectral images are presented alongside videos in the corresponding sections. Different from traditional CS [15, 18], in SCI the desired signal usually lies in high dimensions; for instance, a 148-frame video clip is recovered from a single measurement frame. This paper aims to address the following key challenges in SCI reconstruction.
How to exploit the structure information in high-dimensional videos and hyperspectral images?
Are there some approaches for other (image/video processing) tasks that can be used in SCI to improve the reconstruction quality?
One of the state-of-the-art approaches for SCI, the Gaussian mixture model (GMM) based algorithms [19, 20], only exploits the sparsity of video patches. On widely used videos (e.g., the Kobe dataset in Fig. 5), these GMM based algorithms cannot provide reconstructed video frames with a PSNR (peak-signal-to-noise-ratio) above 30dB. As in simulation, for real data captured by SCI cameras, the GMM results suffer from blur and other unpleasant artifacts (Fig. 13). While the blur is mainly due to the limitation of the sparse priors used in GMM, the unpleasant artifacts might be due to system noise. Motivated by these limitations of GMM, a better reconstruction algorithm needs to investigate high-dimensional structural information, for example, the nonlocal similarity across the temporal and spectral domains in addition to the spatial domain. Moreover, an algorithm robust to noise is in high demand, as system noise is unavoidable in real SCI systems. A reconstruction framework that takes advantage of more comprehensive structural information can potentially outperform existing algorithms. One recent proposal in CS is to use algorithms designed for other data processing tasks such as denoising [21], which employed advanced denoising algorithms in the approximate message passing framework to achieve state-of-the-art results in image CS. This motivates us to develop advanced reconstruction algorithms for SCI by leveraging advanced techniques in video denoising [22]. Meanwhile, recent research on rank minimization has led to significant advances in other image and video processing tasks [22, 23, 24, 25, 26]. These advances have great potential to boost the performance of reconstruction algorithms for SCI, and this is investigated in our paper.
While rank minimization approaches have been investigated extensively for image processing, extending them to videos and hyperspectral images, and especially to SCI, is nontrivial. In particular, to achieve high-speed video in SCI, the measurement is not a video frame but a linear combination of a number of frames. Consequently, rank minimization methods cannot be applied directly as in image processing. To overcome this challenge, a new reconstruction framework is proposed, which incorporates rank minimization as an intermediate step during reconstruction. Specifically, by integrating the compressive sampling model in SCI with the weighted nuclear norm minimization (WNNM) [24] of video patch groups (see details in Sec. 4.2), a joint model for SCI reconstruction is formulated. To solve this problem, the alternating direction method of multipliers (ADMM) [27] is employed to develop an iterative optimization algorithm for SCI reconstruction.
Fig. 1 depicts the flowchart of our proposed algorithm for SCI reconstruction. After the measurement is captured by the SCI systems (left part in Fig. 1), our proposed algorithm, dubbed DeSCI (decompress SCI, middle part in Fig. 1), performs the projection (which projects the measurement/residual to the signal space to fit the sampling process in the SCI system) and WNNM denoising (imposing signal structural priors) for video patch groups (with details in Sec. 4.2) iteratively. In the right part, we show the reconstruction results of our proposed DeSCI algorithm on the Kobe data used in [19] and the results of the GAP-TV (generalized alternating projection based total variation) method [17] are shown for comparison in the upper part. It can be seen clearly that our DeSCI algorithm provides better image/video quality than GAP-TV.
Moreover, our proposed DeSCI algorithm has boosted the reconstruction quality of real data captured by four different SCI cameras, i.e., CACTI (coded aperture compressive temporal imaging) [5], color-CACTI [6], CASSI (coded aperture snapshot spectral imaging) [11] and the high-speed stereo camera [8]; please refer to Fig. 13 to Fig. 20 for visual comparisons. These real-data results clearly demonstrate the advantages of our proposed algorithm, e.g., its robustness to noise. We thus believe that our findings have significant practical value for both research and applications. We hope these encouraging results will inspire researchers and engineers to further pursue compressive imaging.
As SCI reconstruction is an ill-posed problem, different priors have been employed to solve it, which can be categorized into total variation (TV) [17], sparsity in transform domains [4, 5, 6], sparsity in over-complete dictionaries [3, 12], and the GMM based algorithms [19, 20]. The group sparsity based algorithms [28] and the nonlocal self-similarity [29] model, which have led to significant advances in image processing, have not been used in these SCI reconstruction algorithms; this is investigated in our paper. Most recently, deep learning techniques have been utilized for video CS [30, 31]. We are not aiming to compete with these algorithms, as they are usually complicated and require data to train the neural networks. Furthermore, some of these algorithms require the sensing matrix to be spatially repetitive [30], which is very challenging (or unrealistic) in real applications. In addition, we have noticed that only a limited improvement (around 2dB) over GMM has been obtained using the deep learning techniques in [30]. By contrast, our proposed algorithm improves the results significantly over GMM. Specifically, we apply the rank minimization approach to nonlocal similar patches in videos and hyperspectral images. In this manner, the reconstruction results are improved dramatically, which would pave the way for wide applications of SCI systems, such as high-speed video [6], hyperspectral imaging [12] and three-dimensional high-accuracy high-speed indoor localization [8]. The rest of this paper is organized as follows. Sec. 2 reviews the two standard SCI systems, namely the video SCI and hyperspectral image SCI systems. The mathematical model of SCI is introduced in Sec. 3. Sec. 4 develops the rank minimization based algorithm for SCI reconstruction. Extensive simulation results are presented in Sec. 5 to demonstrate the efficacy of the proposed algorithm, and real data results are shown in Sec. 6. Sec. 7 concludes the paper and discusses future research directions.
The last decade has seen a number of SCI systems [9, 10, 3, 4, 5, 32, 14, 33, 6, 34, 35, 8, 36] with the development of compressive sensing [1, 37, 38]. The underlying principle is encoding the high-dimensional data on a 2D sensor with dispersion for spectral imaging [9, 10, 33], temporal-variant mask for high-speed imaging [3, 4, 5, 14], and angular variation for light-field imaging [32]. Recently, several variants explore more than three dimensions of the scene [6, 35, 8, 39, 40], which paves the way for plenoptic imaging [41].
We validate the proposed DeSCI method on two typical snapshot compressive imaging systems, the CACTI system [5] and the CASSI system [9, 10], shown in Fig. 2 and Fig. 3, respectively. Similar approaches could be adapted for other compressive imaging systems with minor modifications, since we only need to change the sensing matrix for different coding strategies, and the nonlocal self-similarity always holds for natural scenes.
In video snapshot compressive imagers, i.e., the CACTI system [5], the high-speed scene is collected by the objective lens and spatially coded by the temporal-variant mask, such as the shifting mask [5] or different patterns on the digital micromirror device (DMD) or the spatial light modulator (SLM) [3, 4, 7], as shown in Fig. 2. Then the coded scene is detected by the monochrome or color CCD (Charge-Coupled Device) for grayscale [5] and color [6] video capturing, respectively. A snapshot on the CCD encodes tens of temporal frames of the high-speed scene. The number of coded frames for a snapshot is determined by the number of variant codes (of the mask) within the integration time.
In spectral snapshot compressive imagers, i.e., the CASSI system [9, 10], the spectral scene is collected by the objective lens and spatially coded by a fixed mask, as shown in Fig. 3. Then the coded scene is spectrally dispersed by the dispersive element, such as the prism or the grating. The spatial-spectral coded scene is detected by the CCD. A snapshot on the CCD encodes tens of spectral bands of the scene. The number of coded frames for a snapshot is determined by the dispersion property of the dispersive element and the pixel size of the mask and the CCD [10].
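The CASSI coding described above can be sketched numerically. The snippet below is a toy forward model of our own making (not the paper's code): every spectral band is coded by the same fixed mask, then shifted by one pixel per band to mimic the disperser, before summing on the sensor. The one-pixel-per-band shift and all sizes are illustrative assumptions.

```python
import numpy as np

def cassi_measure(cube, mask, step=1):
    """Toy CASSI forward model: each spectral band is coded by the same
    fixed mask, shifted by `step` pixels per band (mimicking dispersion),
    and the shifted coded bands are summed on the detector."""
    B, H, W = cube.shape
    y = np.zeros((H, W + step * (B - 1)))
    for b in range(B):
        # Band b lands on the sensor shifted b*step pixels to the right.
        y[:, b * step : b * step + W] += mask * cube[b]
    return y

cube = np.random.default_rng(7).random((4, 8, 8))      # 4 bands, 8x8 pixels
mask = np.random.default_rng(8).integers(0, 2, (8, 8)).astype(float)
y = cassi_measure(cube, mask)
# The measurement is wider than a band by (B-1)*step pixels.
assert y.shape == (8, 8 + 3)
```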
Mathematically, the measurement in SCI systems can be modeled by [5]
$$\boldsymbol{y} = \boldsymbol{\Phi} \boldsymbol{x} + \boldsymbol{e}, \qquad (1)$$
where $\boldsymbol{\Phi}$ is the sensing matrix, $\boldsymbol{x}$ is the desired signal, and $\boldsymbol{e}$ denotes the noise. Unlike traditional CS, the sensing matrix considered here is not a dense matrix. In SCI, e.g., video CS as in CACTI [5, 6] and spectral CS as in CASSI [11], the matrix $\boldsymbol{\Phi}$ has a very specific structure and can be written as
$$\boldsymbol{\Phi} = [\boldsymbol{D}_1, \boldsymbol{D}_2, \dots, \boldsymbol{D}_B], \qquad (2)$$
where $\{\boldsymbol{D}_b\}_{b=1}^{B}$ are diagonal matrices.
Taking the SCI in CACTI [5] as an example, consider that $B$ high-speed frames $\{\boldsymbol{X}_b\}_{b=1}^{B}$ are modulated by the masks $\{\boldsymbol{C}_b\}_{b=1}^{B}$, correspondingly (Fig. 2). The measurement $\boldsymbol{Y}$ is given by
$$\boldsymbol{Y} = \sum_{b=1}^{B} \boldsymbol{C}_b \odot \boldsymbol{X}_b + \boldsymbol{E}, \qquad (3)$$
where $\odot$ denotes the Hadamard (element-wise) product. All pixels (in the $B$ frames) at position $(i,j)$ are collapsed to form one pixel in the measurement (in one shot) as
$$y_{i,j} = \sum_{b=1}^{B} c_{i,j,b}\, x_{i,j,b} + e_{i,j}. \qquad (4)$$
By defining
$$\boldsymbol{x} = \left[\boldsymbol{x}_1^{\top}, \dots, \boldsymbol{x}_B^{\top}\right]^{\top}, \qquad (5)$$
where $\boldsymbol{x}_b = \mathrm{vec}(\boldsymbol{X}_b)$ and $\boldsymbol{D}_b = \mathrm{diag}(\mathrm{vec}(\boldsymbol{C}_b))$ for $b = 1, \dots, B$, we have the vector formulation of Eq. (1). The compressive sampling rate in SCI is thus equal to $1/B$.^{1} It is worth noting that, due to the special structure of $\boldsymbol{\Phi}$ in (2), $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}$ is a diagonal matrix. This fact will be useful to derive the efficient algorithm in Sec. 4.3 for handling the massive data in SCI. One natural question is: is it theoretically possible to recover $\boldsymbol{x}$ from the measurement $\boldsymbol{y}$ defined in Eq. (1) when $B > 1$? Most recently, this has been addressed in [16] using the compression-based compressive sensing regime [44] via the following theorem.

^{1} Multiple measurements have also been investigated in [42, 43] for hyperspectral image CS, and our algorithm can be used in those systems with minor modifications. The code design has also been investigated in [43] for multiple-shot hyperspectral image CS. However, this is out of the scope of this paper.
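To make the structure in (2) concrete, the following NumPy sketch (with illustrative sizes of our own choosing, not the paper's code) builds $\boldsymbol{\Phi}$ from diagonal mask matrices and verifies that $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}$ is indeed diagonal:

```python
import numpy as np

# Illustrative sizes: n measurement pixels, B coded frames.
rng = np.random.default_rng(0)
n, B = 16, 8

masks = rng.integers(0, 2, size=(B, n)).astype(float)   # binary coding masks
# Phi = [D_1, ..., D_B], each D_b a diagonal mask matrix (Eq. (2)).
Phi = np.hstack([np.diag(masks[b]) for b in range(B)])  # shape (n, n*B)

x = rng.standard_normal(n * B)    # vectorized high-speed frames, stacked
y = Phi @ x                       # single coded snapshot (Eq. (1), noiseless)

# Phi @ Phi.T = sum_b D_b^2 is diagonal: the key structural fact of SCI.
PPt = Phi @ Phi.T
assert np.allclose(PPt, np.diag((masks ** 2).sum(axis=0)))
```

Because $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}$ is diagonal, inverting $\rho\boldsymbol{I} + \boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}$ later in the algorithm costs only element-wise divisions.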
[16] Assume that the rate-$r$ compression code used in the compressible signal pursuit optimization achieves distortion $\delta$ on the signal class of interest (the precise assumptions, including the choice of a free parameter, are stated in [16]). Then, with high probability, the solution of the compressible signal pursuit optimization satisfies the error bound given as Eq. (6) in [16]. Details of the compressible signal pursuit optimization and the proof can be found in [16]. Most importantly, Theorem 1 characterizes the performance of SCI recovery by connecting the parameters of the (compression/decompression) code, its rate and its distortion, to the number of frames $B$ and the reconstruction quality. This theoretical finding strongly encourages our algorithmic design for SCI systems.
In this section, we first briefly review rank minimization algorithms; the joint model is then developed in Sec. 4.2, and the proposed algorithm is derived in Sec. 4.3.
In image/video processing, since the matrix formed by nonlocal similar patches of a natural image is of low rank, a series of low-rank matrix approximation methods have been proposed for various tasks [24, 45, 46, 47]. The main goal of low-rank matrix approximation is to recover the underlying low-rank structure of a matrix from its degraded/corrupted observation. Within these frameworks, nuclear norm minimization (NNM) [23] is the most representative one, and it will be used in our work. Specifically, given a data matrix $\boldsymbol{X}$, the goal of NNM is to find a low-rank matrix $\boldsymbol{L}$ which satisfies the following objective function,
$$\hat{\boldsymbol{L}} = \arg\min_{\boldsymbol{L}} \|\boldsymbol{X} - \boldsymbol{L}\|_F^2 + \lambda \|\boldsymbol{L}\|_*, \qquad (7)$$
where $\|\cdot\|_*$ is the nuclear norm, i.e., the sum of the singular values of a given matrix, $\|\cdot\|_F$ denotes the Frobenius norm, and $\lambda$ is the regularization parameter. Despite the theoretical guarantee of the singular value thresholding model [23], it has been observed that the recovery performance of such a convex relaxation degrades in the presence of noise, and the solution can deviate seriously from the original solution of the rank minimization problem [48]. To mitigate this issue, Gu et al. [24] proposed the weighted nuclear norm minimization (WNNM) model, which is essentially the reweighted $\ell_1$-norm [49] of the singular values of the desired matrix; WNNM has led to state-of-the-art image denoising results. Though the WNNM problem is nonconvex in general, for non-descendingly ordered weights, as used in this paper, the optimal solution can be obtained in closed form by the weighted soft-thresholding operator [25]. In the following, we investigate how to use WNNM in SCI systems. To be concrete, all video frames are divided into overlapping patches, each denoted in vectorized form. For each patch, its similar patches are selected from a surrounding (searching) window spanning both space and time (across frames) to form a set of similar patches. These patches are then stacked into a matrix, i.e.,
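As a concrete illustration of the NNM step, the following sketch (our own, not the paper's code) soft-thresholds the singular values of the observation; we assume the scaled objective $\frac{1}{2}\|\boldsymbol{X}-\boldsymbol{L}\|_F^2 + \lambda\|\boldsymbol{L}\|_*$, for which the closed-form solution shrinks each singular value by $\lambda$:

```python
import numpy as np

def nnm_denoise(X, lam):
    """Closed-form NNM: soft-threshold the singular values of X.
    Assumes the objective (1/2)||X - L||_F^2 + lam * ||L||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)   # uniform shrinkage of all values
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(1)
# A matrix of rank at most 10, observed without noise for simplicity.
X = rng.standard_normal((20, 10)) @ rng.standard_normal((10, 15))
L = nnm_denoise(X, lam=5.0)
# Shrinkage can only reduce (or keep) the rank and the nuclear norm.
assert np.linalg.matrix_rank(L) <= np.linalg.matrix_rank(X)
```

The uniform threshold is exactly the limitation the text describes: large (important) singular values are shrunk as much as small (noisy) ones, which WNNM remedies with value-dependent weights.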
$$\boldsymbol{G} = [\boldsymbol{g}_1, \boldsymbol{g}_2, \dots, \boldsymbol{g}_m]. \qquad (8)$$
This matrix consisting of patches with similar structures is thus called a group, where each column $\boldsymbol{g}_k$ denotes the $k$-th patch in the group. Since all patches in each data matrix have similar structures, the constructed data matrix is of low rank.
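The grouping step can be sketched as follows. This is a simplified exhaustive block matching of our own (the paper's search window, patch sizes, and matching details differ; all parameters here are illustrative):

```python
import numpy as np

def group_similar_patches(frames, ref_ij, psize=8, win=4, m=16):
    """Stack the m patches most similar (in Euclidean distance) to the
    reference patch into the columns of a group matrix, searching a
    spatio-temporal window around the reference location."""
    T, H, W = frames.shape
    i0, j0 = ref_ij
    ref = frames[0, i0:i0 + psize, j0:j0 + psize].ravel()
    candidates = []
    for t in range(T):                                   # across frames
        for i in range(max(0, i0 - win), min(H - psize, i0 + win)):
            for j in range(max(0, j0 - win), min(W - psize, j0 + win)):
                p = frames[t, i:i + psize, j:j + psize].ravel()
                candidates.append((np.sum((p - ref) ** 2), p))
    candidates.sort(key=lambda c: c[0])                  # most similar first
    return np.stack([p for _, p in candidates[:m]], axis=1)  # psize^2 x m

frames = np.random.default_rng(2).standard_normal((4, 32, 32))
G = group_similar_patches(frames, (10, 10))
assert G.shape == (64, 16)   # one column per patch in the group
```

Real implementations replace the exhaustive loop with fast block matching; the point here is only the data layout of the group matrix in (8).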
By using this rank minimization as a constraint, the SCI problem in (1) can be formulated as
$$\hat{\boldsymbol{x}} = \arg\min_{\boldsymbol{x}} \|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{x}\|_2^2 + \lambda \sum_i \|\boldsymbol{G}_i\|_{w,*}, \qquad (9)$$
where $\lambda$ is a parameter to balance the two terms, $\|\cdot\|_{w,*}$ denotes the weighted nuclear norm, and recall that each group $\boldsymbol{G}_i$ is constructed from $\boldsymbol{x}$.
Under the ADMM [27] framework, we introduce an auxiliary variable $\boldsymbol{\theta}$ to the problem in (10),
$$(\hat{\boldsymbol{x}}, \hat{\boldsymbol{\theta}}) = \arg\min_{\boldsymbol{x}, \boldsymbol{\theta}} \|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{x}\|_2^2 + \lambda \sum_i \|\boldsymbol{G}_{\boldsymbol{\theta},i}\|_{w,*} + \rho \|\boldsymbol{x} - \boldsymbol{\theta} + \boldsymbol{u}\|_2^2, \qquad (12)$$
where again each $\boldsymbol{G}_{\boldsymbol{\theta},i}$ is constructed from $\boldsymbol{\theta}$, $\boldsymbol{u}$ is the (scaled) dual variable, and $\rho$ is the penalty parameter. Eq. (12) can be translated into three sub-problems:
$$\boldsymbol{x}^{(k+1)} = \arg\min_{\boldsymbol{x}} \|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{x}\|_2^2 + \rho \|\boldsymbol{x} - \boldsymbol{\theta}^{(k)} + \boldsymbol{u}^{(k)}\|_2^2, \qquad (13)$$
$$\boldsymbol{\theta}^{(k+1)} = \arg\min_{\boldsymbol{\theta}} \lambda \sum_i \|\boldsymbol{G}_{\boldsymbol{\theta},i}\|_{w,*} + \rho \|\boldsymbol{x}^{(k+1)} - \boldsymbol{\theta} + \boldsymbol{u}^{(k)}\|_2^2, \qquad (14)$$
$$\boldsymbol{u}^{(k+1)} = \boldsymbol{u}^{(k)} + \boldsymbol{x}^{(k+1)} - \boldsymbol{\theta}^{(k+1)}. \qquad (15)$$
We derive the solutions to these sub-problems below, and without confusion, we discard the iteration index .
Solve $\boldsymbol{x}$: Given $\boldsymbol{\theta}$, Eq. (13) is a quadratic form and has a closed-form solution
$$\boldsymbol{x} = \left(\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi} + \rho\boldsymbol{I}\right)^{-1}\left(\boldsymbol{\Phi}^{\top}\boldsymbol{y} + \rho(\boldsymbol{\theta} - \boldsymbol{u})\right), \qquad (16)$$
where $\boldsymbol{I}$ is an identity matrix with the desired dimensions. Since $\boldsymbol{\Phi}$ is a fat matrix, $\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}$ will be a large matrix and thus the matrix inversion formula is employed to simplify the calculation:
$$\left(\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi} + \rho\boldsymbol{I}\right)^{-1} = \rho^{-1}\boldsymbol{I} - \rho^{-1}\boldsymbol{\Phi}^{\top}\left(\rho\boldsymbol{I} + \boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}\right)^{-1}\boldsymbol{\Phi}. \qquad (17)$$
Letting $\boldsymbol{b} = \boldsymbol{\theta} - \boldsymbol{u}$, the update (16) becomes
$$\boldsymbol{x} = \boldsymbol{b} + \boldsymbol{\Phi}^{\top}\left(\rho\boldsymbol{I} + \boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}\right)^{-1}(\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{b}). \qquad (18)$$
As mentioned earlier, $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}$ is a diagonal matrix in our imaging systems. Let
$$\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top} = \mathrm{diag}(\psi_1, \dots, \psi_n); \qquad (19)$$
we have
$$\rho\boldsymbol{I} + \boldsymbol{\Phi}\boldsymbol{\Phi}^{\top} = \mathrm{diag}(\rho + \psi_1, \dots, \rho + \psi_n), \qquad (20)$$
$$\left(\rho\boldsymbol{I} + \boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}\right)^{-1} = \mathrm{diag}\left(\tfrac{1}{\rho + \psi_1}, \dots, \tfrac{1}{\rho + \psi_n}\right). \qquad (21)$$
Let $[\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{b}]_j$ denote the $j$-th element of the vector $\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{b}$; (18) becomes
$$\boldsymbol{x} = \boldsymbol{b} + \boldsymbol{\Phi}^{\top}\left[\tfrac{[\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{b}]_1}{\rho + \psi_1}, \dots, \tfrac{[\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{b}]_n}{\rho + \psi_n}\right]^{\top}. \qquad (22)$$
Note that $\boldsymbol{x}$ can be updated in one shot by (22), and $\{\psi_j\}$ is pre-calculated and stored with
$$\psi_j = \sum_{b=1}^{B} \left[\mathrm{vec}(\boldsymbol{C}_b)\right]_j^2, \qquad (23)$$
with $\boldsymbol{C}_b$ defined in (3). In this way, the last term in (22) can be computed element-wise and $\boldsymbol{x}$ can thus be updated very efficiently.
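The element-wise update (22) can be sketched as follows. Mask arrays stand in for the diagonal blocks, so that matrix-vector products with $\boldsymbol{\Phi}$ and $\boldsymbol{\Phi}^{\top}$ become frame-wise multiplications; the variable names and sizes are our own, not the paper's code:

```python
import numpy as np

def sci_projection(y, masks, b, rho):
    """Solve argmin_x ||y - Phi x||^2 + rho ||x - b||^2 using the fact
    that Phi Phi^T = diag(psi) is diagonal (Eq. (22)).
    masks: (B, n) coding masks; b: (B, n) current estimate, frame-wise."""
    psi = (masks ** 2).sum(axis=0)        # diagonal of Phi Phi^T, Eq. (23)
    resid = y - (masks * b).sum(axis=0)   # y - Phi b, computed frame-wise
    scale = resid / (psi + rho)           # element-wise inverse, Eq. (21)
    return b + masks * scale              # x = b + Phi^T (...), frame-wise

rng = np.random.default_rng(3)
B, n = 8, 100
masks = rng.integers(0, 2, size=(B, n)).astype(float)
x_true = rng.standard_normal((B, n))
y = (masks * x_true).sum(axis=0)          # noiseless snapshot

x = sci_projection(y, masks, np.zeros((B, n)), rho=1e-6)
# As rho -> 0 the update projects onto the measurement manifold y = Phi x.
assert np.allclose((masks * x).sum(axis=0), y, atol=1e-3)
```

The whole update costs $O(nB)$ per iteration with no matrix inversion, which is what makes the projection step negligible next to the denoising step.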
Solve $\boldsymbol{\theta}$: Let $\boldsymbol{v} = \boldsymbol{x} + \boldsymbol{u}$. Eq. (14) can be considered as a WNNM denoising problem (however, for videos rather than images),
$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \rho \|\boldsymbol{\theta} - \boldsymbol{v}\|_2^2 + \lambda \sum_i \|\boldsymbol{G}_{\boldsymbol{\theta},i}\|_{w,*}. \qquad (24)$$
Recall that $\boldsymbol{G}_{\boldsymbol{\theta},i}$ is the $i$-th patch group constructed from $\boldsymbol{\theta}$; let $\boldsymbol{G}_{\boldsymbol{v},i}$ be the patch group constructed from $\boldsymbol{v}$ corresponding to $\boldsymbol{G}_{\boldsymbol{\theta},i}$. The structure of (24) is very complicated, and in order to achieve a tractable solution, a general assumption is made. Specifically, $\boldsymbol{v}$ is treated as a noisy version of $\boldsymbol{\theta}$, i.e., $\boldsymbol{v} = \boldsymbol{\theta} + \boldsymbol{n}$, where $\boldsymbol{n}$ denotes zero-mean white Gaussian noise whose elements independently follow $\mathcal{N}(0, \sigma_n^2)$. It is worth noting that we can recover $\boldsymbol{\theta}$ after reconstructing the patch groups $\{\boldsymbol{G}_{\boldsymbol{\theta},i}\}$. Invoking the law of large numbers, the following relation holds with a probability approaching 1 at each iteration [50]:
$$\|\boldsymbol{\theta} - \boldsymbol{v}\|_2^2 \;\propto\; \sum_i \|\boldsymbol{G}_{\boldsymbol{\theta},i} - \boldsymbol{G}_{\boldsymbol{v},i}\|_F^2, \qquad (25)$$
up to the scale factor induced by the patch overlap. The proof can be found in [51]. As both the number of pixels and the number of patch groups are large in our case and the overlapping patches are averaged, we can thus translate (24) into
$$\{\hat{\boldsymbol{G}}_{\boldsymbol{\theta},i}\} = \arg\min_{\{\boldsymbol{G}_{\boldsymbol{\theta},i}\}} \sum_i \left( \rho \|\boldsymbol{G}_{\boldsymbol{\theta},i} - \boldsymbol{G}_{\boldsymbol{v},i}\|_F^2 + \lambda \|\boldsymbol{G}_{\boldsymbol{\theta},i}\|_{w,*} \right). \qquad (26)$$
Note that there is a scale difference between the fidelity terms in (26) and (24). As mentioned above, $\boldsymbol{\theta}$ can be recovered after reconstructing $\{\hat{\boldsymbol{G}}_{\boldsymbol{\theta},i}\}$, and each group can be solved independently by
$$\hat{\boldsymbol{G}}_{\boldsymbol{\theta},i} = \arg\min_{\boldsymbol{G}} \rho \|\boldsymbol{G} - \boldsymbol{G}_{\boldsymbol{v},i}\|_F^2 + \lambda \|\boldsymbol{G}\|_{w,*}. \qquad (27)$$
Eq. (27) can be solved in closed form by the generalized soft-thresholding algorithm [52, 25] detailed below. Considering the singular value decomposition (SVD)
$$\boldsymbol{G}_{\boldsymbol{v},i} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^{\top}, \qquad (28)$$
and the weight vector $\boldsymbol{w} = [w_1, \dots, w_m]$, $\hat{\boldsymbol{G}}_{\boldsymbol{\theta},i}$ is given by
$$\hat{\boldsymbol{G}}_{\boldsymbol{\theta},i} = \boldsymbol{U}\, \mathcal{S}_{\boldsymbol{w}}(\boldsymbol{\Sigma})\, \boldsymbol{V}^{\top}, \quad \text{with} \quad \left[\mathcal{S}_{\boldsymbol{w}}(\boldsymbol{\Sigma})\right]_{jj} = \max\left(\Sigma_{jj} - w_j,\ 0\right). \qquad (29)$$
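The weighted soft-thresholding of (28)-(29), with the weight and noise-adaptive singular-value estimates discussed next, can be sketched for one patch group as follows (constants and names are illustrative, not the paper's code):

```python
import numpy as np

def wnnm_group(G, C=2.8, eps=1e-8, sigma=0.1):
    """Weighted nuclear norm minimization for one patch group.
    Weights are inversely proportional to the estimated singular values,
    so larger (more important) components are shrunk less."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    m = G.shape[1]                     # number of patches in the group
    # Estimate the clean singular values from the noisy ones.
    s_est = np.sqrt(np.maximum(s ** 2 - m * sigma ** 2, 0.0))
    w = C * np.sqrt(m) / (s_est + eps) # non-descending weights (s is sorted)
    s_new = np.maximum(s - w, 0.0)     # weighted soft-thresholding
    return U @ np.diag(s_new) @ Vt

G = np.random.default_rng(4).standard_normal((64, 60))
L = wnnm_group(G)
# Shrinkage can only decrease the nuclear norm of the estimate.
assert np.linalg.norm(L, 'nuc') <= np.linalg.norm(G, 'nuc')
```

Because NumPy returns singular values in descending order, the computed weights are automatically non-descending, which is exactly the regime in which the closed-form solution above is optimal.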
The remaining problem is to determine the weight vector $\boldsymbol{w}$. For natural images/videos, we have the general prior knowledge that the larger singular values of a group are more important than the smaller ones, since they represent the energy of the major components of the group. In denoising, the larger the singular values, the less they should be shrunk, i.e.,
$$w_j = \frac{C \sqrt{m}}{\sigma_j(\boldsymbol{G}_{\boldsymbol{\theta},i}) + \varepsilon}, \qquad (30)$$
where $C$ is a constant, $m$ is the number of patches in the group, and $\varepsilon$ is a tiny positive constant. Since $\boldsymbol{G}_{\boldsymbol{\theta},i}$ is not available, $\sigma_j(\boldsymbol{G}_{\boldsymbol{\theta},i})$ is not available either. We estimate it by
$$\hat{\sigma}_j(\boldsymbol{G}_{\boldsymbol{\theta},i}) = \sqrt{\max\left(\sigma_j^2(\boldsymbol{G}_{\boldsymbol{v},i}) - m \sigma_n^2,\ 0\right)}, \qquad (31)$$
where $\sigma_n$ is updated in each iteration. There are some heuristic approaches on how to determine $\sigma_n$ based on the current estimates [24]. In principle, as the iteration number increases, the measurement error decreases and $\sigma_n$ gets smaller. In our SCI problem, we have found that progressively decreasing the noise level, starting with a large value, performs well in our experiments (details in Sec. 5.1.1). This is consistent with the analysis in [24]. The complete algorithm, i.e., DeSCI, is summarized in Algorithm 1.
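The overall iteration of Algorithm 1 can be sketched as a skeleton in our own notation (not the paper's code): the `denoise` argument stands in for the full patch-group WNNM step, the dual-variable convention and the decreasing noise schedule are illustrative choices.

```python
import numpy as np

def desci(y, masks, denoise, rho=0.1, max_iter=60):
    """Skeleton of the DeSCI loop (ADMM form): alternate the efficient
    projection (Eq. (22)) with a WNNM-style video denoiser.
    denoise(v, sigma) is a stand-in for the patch-group WNNM step."""
    x = masks * y                       # Phi^T y as a frame-wise init
    theta, u = x.copy(), np.zeros_like(x)
    psi = (masks ** 2).sum(axis=0)      # diagonal of Phi Phi^T
    for it in range(max_iter):
        # x-update: closed-form projection via the diagonal structure
        b = theta - u / rho
        resid = y - (masks * b).sum(axis=0)
        x = b + masks * (resid / (psi + rho))
        # theta-update: denoise with a progressively decreasing noise level
        sigma = max(50.0 / (it + 1), 2.0) / 255.0
        theta = denoise(x + u / rho, sigma)
        # multiplier update
        u = u + rho * (x - theta)
    return theta

# A trivial stand-in denoiser keeps the sketch runnable end-to-end.
rng = np.random.default_rng(5)
masks = rng.integers(0, 2, size=(8, 64)).astype(float)
x_true = rng.standard_normal((8, 64))
y = (masks * x_true).sum(axis=0)
rec = desci(y, masks, denoise=lambda v, s: v)
assert rec.shape == x_true.shape
```

In the actual algorithm the denoiser is the block-matching plus weighted singular-value thresholding of Sec. 4.2, which dominates the per-iteration cost.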
Relation to generalized alternating projection. The generalized alternating projection (GAP) algorithm [53] has been investigated extensively for video CS [5, 6, 17] with different priors. The difference between the ADMM developed above and GAP lies in the constraint on the measurement. In ADMM, we aim to minimize $\|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{x}\|_2^2$, while GAP imposes $\boldsymbol{y} = \boldsymbol{\Phi}\boldsymbol{x}$. Specifically, by introducing a constant $C$, GAP reformulates the problem in (12) as
$$(\hat{\boldsymbol{x}}, \hat{\boldsymbol{\theta}}) = \arg\min_{\boldsymbol{x}, \boldsymbol{\theta}} \tfrac{1}{2}\|\boldsymbol{x} - \boldsymbol{\theta}\|_2^2 + \lambda \sum_i \|\boldsymbol{G}_{\boldsymbol{\theta},i}\|_{w,*}, \quad \text{s.t.} \quad \|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{x}\|_2^2 \le C. \qquad (32)$$
This is solved by a series of alternating projection problems:
$$(\boldsymbol{x}^{(k)}, \boldsymbol{\theta}^{(k)}) = \arg\min_{\boldsymbol{x}, \boldsymbol{\theta}} \tfrac{1}{2}\|\boldsymbol{x} - \boldsymbol{\theta}\|_2^2 + \lambda \sum_i \|\boldsymbol{G}_{\boldsymbol{\theta},i}\|_{w,*}, \quad \text{s.t.} \quad \|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{x}\|_2^2 \le C^{(k)}, \qquad (33)$$
where $C^{(k)}$ is a constant that changes in each iteration; $\boldsymbol{G}_{\boldsymbol{\theta},i}$ is constructed from $\boldsymbol{\theta}$, and $\boldsymbol{G}_{\boldsymbol{x},i}$ is constructed from $\boldsymbol{x}$, respectively.
Eq. (33) is equivalent to
$$(\boldsymbol{x}^{(k)}, \boldsymbol{\theta}^{(k)}) = \arg\min_{\boldsymbol{x}, \boldsymbol{\theta}} \tfrac{1}{2}\|\boldsymbol{x} - \boldsymbol{\theta}\|_2^2 + \tilde{\lambda} \sum_i \|\boldsymbol{G}_{\boldsymbol{\theta},i}\|_{w,*}, \quad \text{s.t.} \quad \boldsymbol{y} = \boldsymbol{\Phi}\boldsymbol{x}, \qquad (34)$$
where $\tilde{\lambda}$ is a parameter to balance the two terms. Eq. (34) is solved by alternately updating $\boldsymbol{x}$ and $\boldsymbol{\theta}$. Given $\boldsymbol{x}$, $\boldsymbol{\theta}$ is solved by a WNNM denoising algorithm as derived before.
Given $\boldsymbol{\theta}$, the update of $\boldsymbol{x}$ is simply a Euclidean projection of $\boldsymbol{\theta}$ onto the linear manifold $\boldsymbol{y} = \boldsymbol{\Phi}\boldsymbol{x}$; this is solved by
$$\boldsymbol{x} = \boldsymbol{\theta} + \boldsymbol{\Phi}^{\top} \left(\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}\right)^{-1} \left(\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta}\right), \qquad (35)$$
which, using the diagonal structure of $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}$ in (19), reduces to the element-wise update
$$\boldsymbol{x} = \boldsymbol{\theta} + \boldsymbol{\Phi}^{\top} \left[\tfrac{[\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta}]_1}{\psi_1}, \dots, \tfrac{[\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta}]_n}{\psi_n}\right]^{\top}. \qquad (36)$$
Comparing with the update equation of $\boldsymbol{x}$ in (22), after interchanging $\boldsymbol{b}$ and $\boldsymbol{\theta}$, we find that (22) and (36) are equivalent in the limit $\rho \to 0$ in (22). This is expected: in this limit, (13) imposes that $\boldsymbol{x}$ lies on the linear manifold $\boldsymbol{y} = \boldsymbol{\Phi}\boldsymbol{x}$, which is exactly the constraint used in GAP. This works well in the noiseless and low-noise cases. However, when the noise level is high, ADMM will outperform GAP, as the strong constraint will bias the results.
An accelerated GAP (GAP-acc) was also proposed by Liao et al. [53]. The linear manifold can be adaptively adjusted, so (35) can be modified as
$$\boldsymbol{y}^{(k)} = \boldsymbol{y}^{(k-1)} + \left(\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{\theta}^{(k)}\right), \qquad (37)$$
$$\boldsymbol{x}^{(k)} = \boldsymbol{\theta}^{(k)} + \boldsymbol{\Phi}^{\top} \left(\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}\right)^{-1} \left(\boldsymbol{y}^{(k)} - \boldsymbol{\Phi}\boldsymbol{\theta}^{(k)}\right). \qquad (38)$$
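One accelerated-GAP projection step can be sketched as follows, again with masks standing in for the diagonal blocks (our notation, not the paper's code; the zero-guard for empty mask positions is our own addition):

```python
import numpy as np

def gap_acc_step(y_acc, y, masks, theta, psi):
    """One accelerated-GAP projection: the working measurement is boosted
    by the current residual (Eq. (37)) before the Euclidean projection
    onto the linear manifold (Eq. (38))."""
    y_acc = y_acc + (y - (masks * theta).sum(axis=0))   # boost measurement
    resid = y_acc - (masks * theta).sum(axis=0)
    safe = np.where(psi > 0, psi, 1.0)   # guard pixels no mask ever covers
    return y_acc, theta + masks * (resid / safe)        # project

rng = np.random.default_rng(6)
masks = rng.integers(0, 2, size=(8, 64)).astype(float)
psi = (masks ** 2).sum(axis=0)
x_true = rng.standard_normal((8, 64))
y = (masks * x_true).sum(axis=0)

y_acc, theta = y.copy(), np.zeros_like(x_true)
for _ in range(5):
    y_acc, theta = gap_acc_step(y_acc, y, masks, theta, psi)
assert theta.shape == x_true.shape
```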
This accelerated GAP speeds up convergence in the noiseless case in our experiments. We have also tried the approximate message passing (AMP) algorithm [54] used in [21] to update $\boldsymbol{x}$. Unfortunately, AMP does not lead to good results due to the special structure of $\boldsymbol{\Phi}$; this may be because the convergence of AMP depends heavily on the Gaussianity of the sensing matrix in CS.
In this section, we validate the proposed DeSCI algorithm on simulation datasets, including the videos and hyperspectral images and compare it with other state-of-the-art algorithms for SCI reconstruction.
We first conduct simulations on the video SCI problem. The coding strategy employed in [5], specifically the shifting binary mask, is used in our simulation. Two datasets, namely Kobe and Traffic, used in [19] are employed in our simulation. We also use two additional datasets, Runner [55] and Drop [56], to show the wide applicability. Eight consecutive video frames are modulated by shifting masks and then collapsed to a single measurement. We compare our proposed DeSCI algorithm with other leading algorithms, including GMM-TP [19], MMLE-GMM, MMLE-MFA [20] and GAP-TV [17]. The performance of GAP-wavelet proposed in [6] is similar to GAP-TV, as demonstrated in [17], and is thus omitted here. Due to the special requirements on the mask in deep learning based methods [30], we do not compare with them. However, as mentioned earlier, only a limited improvement (around 2 dB) over GMM has been obtained using the deep learning techniques in [30], while our algorithm improves the results significantly over GMM. All algorithms are implemented in MATLAB, and the GMM and GAP-TV codes are downloaded from the authors' websites. The MATLAB code of the proposed DeSCI algorithm is available at https://github.com/liuyang12/DeSCI.
In our DeSCI algorithm, the parameter $\rho$ and the maximum iteration number Max-Iter are fixed for all noise levels. The image patch size, the searching window size, the number of adjacent frames used for denoising, and the number of patches per group are set according to the compression ratio $B$. The parameter $\lambda$ is set based on the signal-to-noise ratio (SNR) of the SCI measurements, with larger values for lower-SNR measurements. For noiseless measurements, $\lambda$ is set such that, as derived in Sec. 4.3, ADMM is equivalent to GAP. Our algorithm is terminated by a sufficiently small difference between successive updates of $\boldsymbol{x}$ or by reaching Max-Iter. When the SNR of the SCI measurement is larger than 30 dB, GAP-acc is used to update $\boldsymbol{x}$. Both PSNR and structural similarity (SSIM) [57] are employed to evaluate the quality of the reconstructed videos.
As mentioned earlier, the noise estimate $\sigma_n$ in each iteration of our DeSCI algorithm is important to the performance. We found that first setting $\sigma_n$ to a large value and then progressively decreasing it leads to good results. One example on the Kobe data used in [19] is shown in Fig. 4. We have found that this sequential setting consistently leads to good results in our experiments.
Algorithm | Kobe | Traffic | Runner | Drop | Average |
GMM-TP | 24.47, 0.5246 | 25.08, 0.7652 | 29.75, 0.6995 | 34.76, 0.6356 | 28.52, 0.6562 |
MMLE-GMM | 27.33, 0.6962 | 25.68, 0.7798 | 33.68, 0.8224 | 39.86, 0.7369 | 31.64, 0.7588 |
MMLE-MFA | 24.63, 0.5291 | 22.66, 0.6232 | 30.83, 0.7331 | 35.66, 0.6690 | 28.45, 0.6386 |
GAP-TV | 26.45, 0.8448 | 20.89, 0.7148 | 28.81, 0.9092 | 34.74, 0.9704 | 27.72, 0.8598 |
DeSCI | 33.25, 0.9518 | 28.72, 0.9251 | 38.76, 0.9693 | 43.22, 0.9925 | 35.99, 0.9597 |
We hereby demonstrate that the proposed DeSCI algorithm performs much better than current state-of-the-art algorithms. Comparison of DeSCI, GMM-TP, MMLE-GMM, MMLE-MFA and GAP-TV on four datasets is shown in Table I. It can be seen clearly that the proposed DeSCI outperforms other algorithms on every dataset. Specifically, the average gains of DeSCI over GMM-TP, MMLE-GMM, MMLE-MFA and GAP-TV are as much as {7.47, 4.35, 7.54, 8.27}dB on PSNR and {0.3035, 0.2009, 0.3211, 0.0999} on SSIM.
On average, MMLE-GMM is the runner-up on PSNR while GAP-TV is the runner-up on SSIM. This may be due to the fact that the TV prior is imposed globally and thus preserves more structural information of the video (leading to a higher SSIM). Exemplar reconstructed frames are shown in Fig. 5. It can be observed that the GMM based algorithms suffer from blurring artifacts and GAP-TV suffers from blocky artifacts. By contrast, our proposed DeSCI algorithm provides not only fine details but also large-scale sharp edges and clear motion. The reconstructed videos are shown in the supplementary material (SM).
One phenomenon in SCI reconstruction is a drop in image quality (PSNR and SSIM) for the first and last reconstructed frames of each measurement. One reason is that previous reconstruction algorithms do not consider the nonlocal self-similarity in video frames. In Fig. 6, we plot the frame-wise PSNR and SSIM of DeSCI as well as the other algorithms on the Kobe and Traffic datasets. It can be seen that the proposed DeSCI smooths out this quality drop. The self-similarity exists in every frame of the video and thus helps the reconstruction of these frames. We also notice that the PSNR of the reconstructed video frames drops in the last 16 frames of the Kobe data (Fig. 6(a)); this is because complicated motions, i.e., the slam dunk, exist in these frames. Though the PSNRs of all algorithms drop, the SSIM of our proposed DeSCI is much smoother (Fig. 6(b)) than that of the other algorithms.
As mentioned in Sec. 4.3, DeSCI using GAP as the projection can lead to fast results in the noiseless case, which has been verified in the previous results. However, GAP imposes that the solution $\boldsymbol{x}$ lie on the manifold $\boldsymbol{y} = \boldsymbol{\Phi}\boldsymbol{x}$, which is too strong in the noisy case. By contrast, ADMM minimizes the measurement error $\|\boldsymbol{y} - \boldsymbol{\Phi}\boldsymbol{x}\|_2^2$ and is thus robust to noise [58, 27]. To verify this, we perform experiments on the Kobe dataset by adding different levels of white Gaussian noise. The results are summarized in Fig. 7, where we can see that in the noiseless case, ADMM and GAP perform the same, consistent with our theoretical analysis. As the measurement SNR decreases, ADMM outperforms GAP in both PSNR and SSIM. Therefore, DeSCI using ADMM as the projection is recommended in realistic systems with noise.
Our proposed DeSCI is an iterative algorithm; in each iteration, we first perform the projection via ADMM or GAP and then perform denoising via the low-rank minimization method, namely WNNM. A natural question is how the performance changes if the derived WNNM is replaced by other state-of-the-art video denoising algorithms. We have already seen that DeSCI-WNNM performs much better than TV based denoising algorithms. The VBM4D [28] algorithm, a representative state-of-the-art video denoising algorithm, is incorporated into our proposed framework, dubbed DeSCI-VBM4D. We run DeSCI-VBM4D on the Kobe and Traffic datasets and obtain PSNRs of {30.60, 26.60}dB and SSIMs of {0.9260, 0.8958}, respectively. DeSCI-WNNM thus outperforms DeSCI-VBM4D by more than 2 dB on PSNR and more than 0.025 on SSIM. Exemplar reconstructed frames are shown in Fig. 8, where we can see that DeSCI-VBM4D still suffers from some undesired artifacts, while DeSCI-WNNM provides small-scale fine details as well as large-scale sharp edges. This clearly verifies the superiority of the proposed reconstruction framework along with WNNM denoising for SCI reconstruction.
We have also tried to integrate WNNM and VBM4D with other denoising priors, such as TV. We observe that the TV prior can marginally help DeSCI-VBM4D (by less than 0.2 dB) but hardly helps DeSCI-WNNM. This again verifies that our proposed DeSCI-WNNM exploits both the global and local priors of videos, through the nonlocal self-similarity and low-rank estimation.
As mentioned in Sec. 4.3, the projection step can be updated very efficiently; the time-consuming step is the weighted nuclear norm minimization for video denoising, in which both the block matching and the low-rank estimation via SVD require a high computational workload. While the block matching can be performed once per (about) 20 iterations, the SVD needs to be performed in every iteration for each patch group. Specifically, a single round of block matching and patch estimation via SVD takes 90 seconds and 11 seconds, respectively. Recent advances in deep learning have shown great potential in various image processing tasks, and we envision that the block matching might be sped up via generative adversarial networks (GANs) [59]. In this way, the time for block matching could be reduced significantly. Regarding the low-rank estimation via SVD, truncated approaches [60] can be employed to speed up the operation. Using all these advanced tools, we believe each iteration of our algorithm can be performed within 10 seconds, and DeSCI can provide good results in a few minutes.
Algorithm | bird (PSNR, SSIM) | toy (PSNR, SSIM)
---|---|---
GAP-TV | 30.36 dB, 0.9251 | 24.66 dB, 0.8608
DeSCI | 32.40 dB, 0.9452 | 25.91 dB, 0.9094
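For reference, the PSNR entries reported in tables like the one above follow the standard definition below (the peak value of 1.0 for normalized images is an assumption; the paper's exact evaluation code is not shown):

```python
import numpy as np

def psnr(ref, rec, peak=1.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    reconstruction; the tabulated values are averages of this quantity
    over all reconstructed frames."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(rec, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 0.1 on a normalized image gives about 20 dB.
value = psnr(np.zeros((64, 64)), np.full((64, 64), 0.1))
```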
We further demonstrate that DeSCI can be used for snapshot hyperspectral compressive imaging systems such as CASSI [9]. Both simulated data with shifting masks and real data from CASSI systems [9] are presented; we show the simulation results in this section, and real-data results are demonstrated in Sec. 6.3. We generate the simulated measurements by summing the spectral frames from the hyperspectral datasets (the bird spectral data is from [12]; the toy spectral data is from the CAVE multispectral image database, http://www1.cs.columbia.edu/CAVE/databases/multispectral/), each modulated by shifting random masks, following Eq. (3).
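The sensing process just described can be sketched as follows. This is an illustrative simplification of Eq. (3): the one-pixel-per-band shift of a single random base mask is an assumption for the sketch, and the mask sizes are arbitrary.

```python
import numpy as np

def shifted_masks(base_mask, B):
    """Generate B masks by shifting one random base mask by one pixel per
    band (assumed one-pixel-per-band shift, mimicking the CASSI
    shifting-mask structure)."""
    return np.stack([np.roll(base_mask, b, axis=1) for b in range(B)])

def sci_measurement(frames, masks):
    """SCI forward model: the single snapshot measurement is the sum of
    the frames, each modulated element-wise by its mask.
    frames, masks: arrays of shape (B, H, W)."""
    return (frames * masks).sum(axis=0)

# Toy usage: encode 8 spectral frames into one snapshot measurement.
rng = np.random.default_rng(2)
B, H, W = 8, 32, 32
masks = shifted_masks(rng.integers(0, 2, (H, W)).astype(float), B)
frames = rng.random((B, H, W))
y = sci_measurement(frames, masks)
```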
Bird simulated hyperspectral data. The bird hyperspectral images consist of 24 spectral bands, with each of size pixels. The reconstructed spectra of the simulated bird hyperspectral data and exemplar frames are shown in Fig. 9 and Fig. 10, respectively, and the average PSNR and SSIM are listed in Table II. It can be seen clearly that both GAP-TV and DeSCI recover the spectra of the four birds reliably, with correlation over 0.9999, as shown in the legends of Fig. 9. DeSCI provides more details in the reconstruction (see Fig. 10, where the feathers of the orange bird are less smoothed out than with GAP-TV) and higher quantitative indices (PSNR and SSIM in Table II).
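The spectral correlation reported above can be computed as a standard Pearson correlation between a reference spectrum and the reconstructed one (a common choice of metric; the paper's exact evaluation code is not shown):

```python
import numpy as np

def spectral_correlation(ref, rec):
    """Pearson correlation between a reference spectrum and a
    reconstructed spectrum, each a 1-D array over spectral bands."""
    ref = np.asarray(ref, float) - np.mean(ref)
    rec = np.asarray(rec, float) - np.mean(rec)
    return float(ref @ rec / (np.linalg.norm(ref) * np.linalg.norm(rec)))

# Toy usage with 24 bands (as in the bird data): a tiny perturbation
# leaves the correlation very close to 1.
gt = np.linspace(0.1, 1.0, 24)
perturbed = gt + 0.001 * np.sin(np.arange(24))
corr = spectral_correlation(gt, perturbed)
```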
Toy simulated hyperspectral data. The toy hyperspectral images consist of 31 spectral bands, with each of size pixels. The reconstructed spectra of the simulated toy hyperspectral data and exemplar frames are shown in Fig. 11 and Fig. 12, respectively, and the average PSNR and SSIM are listed in Table II. It can be seen clearly that DeSCI provides better reconstructed spectra than GAP-TV, as shown in the legends of Fig. 11. Besides, the reconstructed frames of DeSCI preserve fine details of the toy dataset, for example the resolution targets and the fringe on the clothes in Fig. 12, and achieve higher PSNR and SSIM, as shown in Table II.
In this section, we demonstrate the efficacy of the proposed DeSCI algorithm on real data captured by various SCI systems. Both grayscale and color video data and hyperspectral images are used for reconstruction, and DeSCI provides significantly better reconstruction results.
The grayscale high-speed video data are captured by the coded aperture compressive temporal imaging (CACTI) system [5]. A snapshot measurement of size pixels encodes frames of the same size. The same data were used in two GMM papers [19, 20]. We compare the results of DeSCI with three GMM-based algorithms (GMM-TP, MMLE-GMM, and MMLE-MFA) and re-use the figures from [20], for two reasons: first, we use exactly the same dataset as the GMM paper [20]; second, only the pre-trained model of is provided by the authors [19], whereas for the real data, for the grayscale high-speed video and for the color high-speed video.
Chopper wheel grayscale high-speed video. The results of the chopper wheel dataset are shown in Fig. 13. It can be seen clearly that all other leading algorithms suffer from motion blur artifacts (see the letter ‘D’ and the corresponding motion blur in each frame and the close-ups in Fig. 13). By contrast, DeSCI preserves both fine details and large-scale sharp edges.
Hand lens grayscale high-speed video. Similar results for the hand lens dataset are shown in Fig. 14. DeSCI preserves the clean background around the motionless hand (close-ups of the hand) and the sharp edges of the moving lens (close-ups of the lens).
UCF grayscale high-speed video. DeSCI can also boost the performance of other snapshot high-speed compressive imaging systems. A snapshot measurement of size pixels encodes 10 frames of the same size from [8]. We compare DeSCI with TwIST as used in [8]; results are shown in Fig. 15, with three of the 10 reconstructed frames displayed. DeSCI not only preserves the sharp edges of the scene (close-ups of the ‘C’ character and the moving ball), but also resolves fine details of the background. As shown in the middle close-ups, the characters on the book can be seen clearly in the DeSCI reconstruction, whereas TwIST blurs these details and no characters on the book can be identified.
The color high-speed video data are from the high-speed motion, color and depth system [6]. A snapshot Bayer RGB measurement of size pixels encodes frames of the same size. Since we do not have the pre-trained GMM model of , we do not compare DeSCI with the three GMM-based algorithms. As discussed in Sec. 6.1, GMM-based algorithms perform similarly to GAP-TV with respect to motion blur artifacts.
Triball color high-speed video. The results of the triball dataset are shown in Fig. 16. It can be seen clearly that the DeSCI results eliminate the motion blur artifacts of the GAP-TV results (see the close-up of the dark red ball in Fig. 16). Compared with the GAP-wavelet results shown in [6], our DeSCI results recover a much smoother background. The orange ball still suffers from motion blur because its motion direction coincides with the shifting direction of the mask in all frames, and shifting-mask-based SCI systems fail to resolve motion along the mask-shifting direction.
Hammer color high-speed video. The results of the hammer dataset are shown in Fig. 17. It can be seen clearly that the DeSCI results recover a much smoother background than GAP-wavelet used in [6] and sharper edges of the hammer than GAP-TV (see the close-ups of the hammer in Fig. 17).
We apply DeSCI to real snapshot hyperspectral compressive imaging data and show that DeSCI provides significantly better results by preserving the spectra and reducing the artifacts induced by the compressed measurements. The real hyperspectral data are from the CASSI system [9].
Bird hyperspectral data. The bird data [9] consist of 24 spectral bands, with each of size pixels. The reconstructed spectra and exemplar frames of the real bird hyperspectral data are shown in Fig. 18 and Fig. 19, respectively. We shift the reconstructed spectra by two bands to keep them aligned with the optical calibration, so only 21 spectral bands are shown in Fig. 18, compared with the 24 bands of the simulated reconstruction in Fig. 9. It can be seen clearly that DeSCI preserves the spectral properties of the scene with correlation over 0.99. Besides, the reconstructed frames of DeSCI are clearer than those of GAP-TV, whose results suffer from blurry artifacts resulting from the coded measurements. We notice some over-smoothing in the DeSCI results on this real data in Fig. 19, which might be due to system noise affecting the DeSCI reconstruction.
Object hyperspectral data. The object data [9] consist of 33 spectral bands, with each of size pixels. The reconstructed frames of DeSCI and a comparison of exemplar frames with TwIST as used in [9] and GAP-TV are shown in Fig. 20 and Fig. 21, respectively. Since there is no ground truth for the object data, we can compare neither the reconstructed spectra nor the correlation against a reference. Our reconstructed frames are clear and free of blur: each spectral-band frame contains either a whole object or nothing, as shown in Fig. 20. Because plastic objects of the same color in this scene are made of the same material, each whole object appears in the same spectral band. In contrast to the yellow-band results of TwIST and GAP-TV shown in Fig. 21, DeSCI reconstructs the banana without fragmentation. Overall, DeSCI significantly reduces the mask artifacts induced by the compressed measurements.
We have proposed a new algorithm to reconstruct multiple frames from a single measurement in snapshot compressive imaging systems, with video and hyperspectral imaging as two exemplar applications. The rank minimization approach is incorporated into the forward model of the snapshot compressive imaging system, and a joint optimization problem is formulated; an alternating minimization algorithm is developed to solve this joint model. Our proposed algorithm exploits the nonlocal self-similarity of video (hyperspectral) frames and leads to significant improvements over existing algorithms. Extensive results on both simulated and real data have demonstrated the superiority of the proposed algorithm.
Most recently, deep learning algorithms have been used for CS inversion [61, 62, 63] and, as mentioned in the introduction, for video CS [30, 31] (though with some constraints). While most of these algorithms learn an end-to-end inversion network via convolutional neural networks (CNNs), Yang et al. [62] incorporate the CNN into the ADMM framework. This inspires one direction of our future work: integrating the proposed ADMM framework with deep-learning-based denoising algorithms [64, 65]. Our proposed algorithm can also be used in other snapshot (or multi-shot) compressive imaging systems, for example, depth compressive imaging [66], polarization compressive imaging [67], X-ray compressive imaging [68, 69, 70], and three-dimensional high-speed compressive imaging [7]; this is another direction of our future work. We believe our results will encourage researchers and engineers to pursue compressive imaging further in real applications.
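The envisioned plug-and-play integration can be sketched generically: the projection step exploits the diagonal structure of Phi Phi^T that makes SCI projection cheap (Sec. 4.3), and the denoiser is a swappable module where WNNM, VBM4D, or a learned CNN denoiser would go. A GAP-style sketch under these assumptions, not the paper's exact update rules:

```python
import numpy as np

def gap_denoise(y, masks, denoiser, iters=10):
    """Generic projection-plus-denoising loop for SCI reconstruction.

    y: (H, W) snapshot measurement; masks: (B, H, W) modulation masks;
    denoiser: any frame-volume denoiser (the plug-and-play slot).
    The projection uses the fact that Phi Phi^T is diagonal for SCI,
    so enforcing measurement consistency costs one element-wise pass."""
    Phi_sum = (masks ** 2).sum(axis=0)          # diagonal of Phi Phi^T
    Phi_sum[Phi_sum == 0] = 1.0                 # guard pixels no mask covers
    x = masks * (y / Phi_sum)                   # initialize with scaled Phi^T y
    for _ in range(iters):
        residual = y - (masks * x).sum(axis=0)  # projection: measurement consistency
        x = x + masks * (residual / Phi_sum)
        x = denoiser(x)                         # prior step (swappable denoiser)
    return x

# With an identity "denoiser", the loop reaches exact measurement consistency.
rng = np.random.default_rng(3)
B, H, W = 4, 16, 16
masks = rng.integers(0, 2, (B, H, W)).astype(float)
frames = rng.random((B, H, W))
y = (frames * masks).sum(axis=0)
x_hat = gap_denoise(y, masks, denoiser=lambda x: x, iters=5)
```

Replacing the identity placeholder with a learned denoiser is exactly the integration described above; the surrounding loop does not change.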
The authors would like to thank Dr. Patrick Llull for capturing the real video data of CACTI, Dr. Tsung-Han Tsai for capturing the real hyperspectral image data of CASSI, and Mr. Yangyang Sun and Dr. Shuo Pang for providing the UCF video data. This work was supported by the National Natural Science Foundation of China (grant Nos. 61327902, 61722110, 61627804, and 61631009).