I. Introduction
Hyperspectral imaging aims at sampling the spectral reflectance of a scene to collect a three-dimensional (3D) dataset consisting of two spatial dimensions and one spectral dimension, called a datacube [1]. Compared with panchromatic images, the sensed spectral signature of each pixel in hyperspectral images (HSIs) covers a broad range of wavelengths with a high spectral resolution, so it can reveal more properties of the objects at the corresponding spatial positions in the scene. Hyperspectral imaging therefore enables more accurate object classification, and acts as a useful tool in many applications including environmental remote sensing [2], land cover classification [3, 4] and material identification [1, 5, 6]. Many different techniques have been developed for acquiring 3D hyperspectral cubes [7, 8, 9, 10, 11]. Typically, these imaging systems capture a one- or two-dimensional subset of the datacube and resort to temporal scanning to sense the remaining dimensions. For instance, spatially scanned hyperspectral imaging systems [12] measure slices of the datacube with a two-dimensional sensor array in pushbroom imaging spectrometers, or collect only one point of the datacube in whiskbroom imaging spectrometers, and complete the datacube by spatial scanning. Spectrally scanned hyperspectral imaging systems, such as fixed or tunable filter spectrometers, sense a single spectral band of the datacube at a time and cover all the spectral bands by scanning along the spectral dimension [13]. However, these spectrometers may suffer from motion artifacts during the scanning period. Furthermore, since the light collection efficiency of the entrance slit in these spectrometers is insufficient, the imaging quality may be degraded.
Different from the scanning hyperspectral imaging systems mentioned above [14, 10], snapshot imaging spectrometers collect both the spectral and spatial information in a single integration period. Therefore, motion artifacts can be avoided in snapshot imaging, and the light collection efficiency can also be significantly improved, enabling the capture of dynamic scenes. Coded aperture snapshot spectral imaging (CASSI) [15] is one of the well-known hyperspectral snapshot imaging systems. It takes advantage of compressed sensing (CS) technology and achieves a two-dimensional (2D) snapshot measurement by random linear projection. Specifically, the random projection in CASSI starts from encoding the scene through a binary coded aperture mask. There are mainly two encoding manners [16]: one encodes the optical field in the spatial dimension with a single disperser, like SD-CASSI [17]; the other encodes the optical field in both the spatial and spectral dimensions, like SS-CASSI [8] or DD-CASSI [15]. The encoded light field is then integrated by a 2D detector array. An optimization algorithm is then needed to reconstruct the spectral scenes from the 2D snapshot measurement, which contains far fewer samples than required by conventional scanning-based spectrometers.
Due to the underdetermined observations in CASSI, HSI reconstruction from the snapshot measurement is an ill-posed inverse problem. To deal with this issue, some handcrafted structures have been designed to represent hyperspectral images, including total variation (TV) [18], sparsity [17], low-rank [19], and non-local self-similarity. The reconstruction can be obtained by solving the optimization problems regularized by these priors. However, these prior structures are designed empirically and are therefore insufficient to represent the complicated spectral variation of real-world scenes. With the powerful learning capabilities of deep networks [20, 21, 22, 23], some works attempted to learn a parameterized network representation of HSIs in a data-driven manner [16]. However, they all require a large number of hyperspectral images for supervised learning. In practical scenarios, it is expensive to collect enough training data for network pretraining. In addition, due to the differences in the spectral response characteristics and spectral wavelength ranges of different spectral imaging devices, a network pretrained on some specific hyperspectral datasets usually cannot generalize well to other hyperspectral imagers.
To cope with these issues, we propose an unsupervised network that learns HSI reconstruction only from the compressive snapshot measurement, without pretraining. As shown in Fig. 1, the proposed network acts as a conditional generative network that generates the underlying hyperspectral image from a random code conditioned on the given snapshot measurement. Different from gray or color images, hyperspectral images present a joint correlation among the spatial and spectral dimensions. Therefore, the conditional generative network is equipped with specific modules to capture spatial-spectral correlations, which can effectively reconstruct the spatial-spectral information of HSIs. The network parameters are optimized to generate the optimal reconstruction that closely matches the given snapshot measurement according to the imaging model. We refer to our Hyperspectral Compressive SnapShot reconstruction Network as HCS-Net for short. Our main contributions can be summarized as:

- We propose an unsupervised HCS-Net for hyperspectral compressive snapshot reconstruction, which learns the reconstruction only from the snapshot measurement without pretraining. In practical scenarios, the unsupervised learning greatly enhances its adaptability and generalization.

- A spatial-spectral joint attention module is designed to capture the correlation between the spatial and spectral dimensions of HSIs. This module learns multi-scale 3D attention maps to adaptively weight each entry of the feature maps, which helps improve the reconstruction quality.

- The proposed HCS-Net is evaluated on multiple simulated and real datasets with both the SD-CASSI and SS-CASSI systems. The quantitative results show that HCS-Net achieves promising reconstructions and outperforms the state-of-the-art methods.
The remainder of this paper is organized as follows. In Section II, we review related works, especially the two popular families of methods, namely predefined prior-based and deep network-based methods. Section III introduces the CASSI systems, and Section IV describes the proposed HCS-Net, including the network architecture and network learning. We report the experimental results in Section V and conclude the paper in Section VI.
II. Related Work
Coding-based snapshot imaging methods rely on the principle of compressed sensing [24], and the number of entries in the snapshot measurement (such as SD-CASSI and SS-CASSI measurements) is much smaller than the original HSI size. Therefore, this underdetermined reconstruction problem usually leverages a proper prior representation of HSIs to achieve reliable reconstruction. Popular HSI compressive snapshot reconstruction methods can be mainly grouped into two categories: predefined prior-based methods and deep network-based methods.
Predefined Prior-based HSI Reconstruction. This kind of method seeks the reconstruction by optimizing an objective function consisting of a data fidelity term and a regularization term. The data fidelity term penalizes the mismatch between the unknown HSI and the given measurement according to the imaging observation model, and the regularization term enforces the prior structures of HSIs. Many prior structures have been exploited to represent the HSI, such as the sparsity prior, the total variation and the low-rank structure [18, 17, 19].
Many studies have been developed within this paradigm. [25] and [26] both choose the total variation (TV) prior as a regularization term for each band, and they employ the two-step iterative shrinkage/thresholding (TwIST) algorithm and the generalized alternating projection (GAP), respectively, for model optimization. [27] represents the unknown signal in the wavelet and DCT domains and solves the induced sparsity-regularized reconstruction problem by the Bregman iterative algorithm. Instead of using a predefined transformation, [28] learns a dictionary to represent the underlying HSI datacube. Liu et al. [29] proposed a method dubbed DeSCI to capture the non-local self-similarity of HSIs by minimizing the weighted nuclear norm. Compared with TV regularization, DeSCI achieves better reconstruction performance, but it is time-consuming due to the patch search and singular value decomposition it requires. Overall, all these prior structures are handcrafted based on empirical knowledge, so they lack the ability to adapt to the spectral diversity and nonlinear distribution of hyperspectral data. At the same time, these priors also involve the empirical setting of some parameters.
Deep Network-based HSI Reconstruction. In recent years, deep neural networks have been proven to achieve state-of-the-art results for a variety of image-related tasks, including image compressive sensing reconstruction [30, 31, 32]. Unlike the predefined prior-based algorithms, deep network-based methods attempt to directly learn the image prior by training the network on large datasets, thereby capturing the inherent statistical characteristics of HSIs. Several studies have assessed deep networks for hyperspectral compressive reconstruction from the CASSI measurement. Xiong et al. [33] upsampled the undersampled measurement into the same dimension as the original HSI, and enhanced the reconstruction by learning the incremental residual with a convolutional neural network. Choi et al. [16] designed a convolutional autoencoder to obtain a nonlinear spectral representation of HSIs; they adopted the learned autoencoder prior and the total variation prior as a compound regularization term of a unified variational reconstruction problem, which was optimized with the alternating direction method of multipliers to obtain the final reconstruction. Wang et al. [34] proposed a network named HyperReconNet to learn the reconstruction, in which a spatial network and a spectral network were concatenated to perform the spatial-spectral information prediction. [35] mainly exploits a network consisting of multiple dense residual blocks with residual channel attention modules to learn the reconstruction; to capture the complex and varied nature of HSIs, both an external dataset and the internal information of the input coded image are used in [35]. [36] proposes λ-net to learn the reconstruction mapping with a two-stage generative adversarial network, which uses a deep self-attention U-Net for the first-stage reconstruction and another U-Net to refine it.
Although these deep network-based methods can exploit the power of deep feature representation to boost the reconstruction accuracy, they all require large datasets for network pretraining. At the same time, these pretrained networks are dedicated to a single observation system. When a new coded aperture mask is used in the imaging system, the reconstruction networks need to be retrained for good reconstruction, so the pretrained networks have poor generalization ability to other HSI imaging systems. Unsupervised network learning is an effective way to cope with this issue. Motivated by the deep network prior [37], a non-locally regularized network was developed for compressed sensing of images without network pretraining [38]. However, the compressive imaging principle of HSIs is different from that of monochromatic images [38] in that HSIs have complex spatial-spectral joint structures. In this paper, we propose an unsupervised spatial-spectral network for hyperspectral compressive snapshot reconstruction, in which multi-scale spatial-spectral attention modules are designed to capture the complex spatial-spectral correlation, and HSIs are reconstructed only from the given snapshot measurement without network pretraining, thereby improving the applicability of the HSI reconstruction network.

III. Coded Aperture Snapshot Spectral Imaging
The CASSI system makes full use of a coded aperture (physical mask) and one or more dispersive elements to modulate the optical field of a target scene, and achieves the projection from the 3D HSI datacube onto a 2D detector according to the specific sensing process, as shown in Fig. 2. According to the different ways of encoding spectral signatures, CASSI can be mainly divided into two categories: SD-CASSI [17], which uses a single disperser and encodes in the spatial domain, and DD-CASSI [8] or SS-CASSI [15], which encode in both the spatial and spectral domains.
For concreteness, let $f(x, y, \lambda)$ indicate the discrete values of the source spectral intensity with wavelength $\lambda$ at location $(x, y)$. A coded aperture mask creates coded patterns by its transmission function $T(x, y)$, while a dispersive prism produces a shear along one spatial axis based on a wavelength-dependent dispersive function $\phi(\lambda)$. Here, $\phi(\lambda)$ is assumed to be linear.
For spatial encoding, the imaging systems, such as SD-CASSI, first create a coding of the incident light field and then shear the coded field through a dispersive element. The final snapshot measurement $g(x, y)$ at the 2D detector array can be represented as an integral over the spectral wavelength $\lambda$,

$$g(x, y) = \int T\big(x - \phi(\lambda),\, y\big)\, f\big(x - \phi(\lambda),\, y,\, \lambda\big)\, d\lambda. \quad (1)$$
For spatial-spectral encoding, the imaging systems, such as DD-CASSI and SS-CASSI, have two dispersive elements with a coded aperture placed between them. Specifically, the imaging system disperses the incident light field, creates a coded field through the coded aperture mask, and employs additional optics to unshear this coding. The final snapshot measurement can be presented as

$$g(x, y) = \int T\big(x - \phi(\lambda),\, y\big)\, f(x, y, \lambda)\, d\lambda. \quad (2)$$
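In discrete form, the spatial-spectral encoding above modulates each spectral band by a shifted code mask and sums the coded bands onto the detector, which also exposes the sparse structure of the resulting measurement matrix used in the equations below. The following is a minimal NumPy sketch of this process on a toy datacube; all sizes and function names are illustrative, not part of any actual CASSI implementation:

```python
import numpy as np

def sscassi_measure(hsi, mask):
    """Toy SS-CASSI snapshot: each spectral band is modulated by a
    (here pre-shifted) code mask and the coded bands are summed
    onto the 2D detector."""
    # hsi: (H, W, L) datacube; mask: (H, W, L) per-band shifted code
    return np.sum(mask * hsi, axis=2)

def build_phi(mask):
    """Assemble the sparse measurement matrix by concatenating the
    diagonal matrices diag(vec(mask_l)) for each band l."""
    H, W, L = mask.shape
    return np.hstack([np.diag(mask[:, :, l].ravel()) for l in range(L)])

rng = np.random.default_rng(0)
H, W, L = 4, 4, 3
f = rng.random((H, W, L))                       # toy datacube
m = (rng.random((H, W, L)) > 0.5).astype(float) # binary shifted masks

g_direct = sscassi_measure(f, m)                # band-wise Hadamard + sum
Phi = build_phi(m)                              # (HW, HW*L), very sparse
f_vec = f.transpose(2, 0, 1).reshape(L, -1).ravel()  # bands stacked as vec(F_1),...,vec(F_L)
g_matrix = (Phi @ f_vec).reshape(H, W)          # same measurement via the matrix form
assert np.allclose(g_direct, g_matrix)
```

The check at the end confirms that the per-band Hadamard formulation and the matrix formulation agree, which is exactly the relation between the element-wise and vectorized imaging models.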
In summary, the CASSI imaging process can be rewritten in the following standard form of an underdetermined system,

$$\mathbf{g} = \boldsymbol{\Phi} \mathbf{f} + \mathbf{n}, \quad (3)$$
where $\mathbf{f} \in \mathbb{R}^{N}$ is the vectorized representation of the underlying 3D HSI $F \in \mathbb{R}^{H \times W \times L}$, with $H \times W$ as its spatial resolution, $L$ as its number of spectral bands, and $N = HWL$; $\mathbf{g} \in \mathbb{R}^{M}$ denotes the vectorized formulation of the corresponding 2D snapshot measurement. For the SD-CASSI system, the sensed measurement has dimension $H \times (W + L - 1)$, so $M = H(W + L - 1)$. Since the second dispersive element in the DD-CASSI system can undo the dispersion caused by the first one, the sensed measurement of the DD-CASSI system has the same spatial dimension as $F$, and $M$ is accordingly computed as $M = HW$. $\boldsymbol{\Phi} \in \mathbb{R}^{M \times N}$ is the measurement matrix, and $\mathbf{n} \in \mathbb{R}^{M}$ is the measurement noise. Taking the CASSI system with spatial-spectral encoding as an example, the snapshot imaging process can be expressed by the measurement matrix $\boldsymbol{\Phi} \in \mathbb{R}^{HW \times HWL}$, and the measurement rate can be computed as $M/N = 1/L$. In order to demonstrate the intrinsic structure of $\boldsymbol{\Phi}$, the snapshot projection procedure can be externalized as the following, according to Eq. (2),

$$G = \sum_{l=1}^{L} M_l \odot F_l + N, \quad (4)$$
where $G$ is the matrix representation of the sensed snapshot measurement, $\odot$ is the Hadamard (element-wise) product, $F_l$ is the $l$-th spectral band, $M_l$ is the shifted code mask corresponding to the $l$-th band, and $N$ is the measurement noise. Specifically, each pixel with an $L$-dimensional spectral vector is collapsed to form one pixel in the snapshot measurement. Thus, the measurement matrix in Eq. (3) can be specialized as

$$\boldsymbol{\Phi} = [D_1, D_2, \ldots, D_L], \quad (5)$$
where $D_l \in \mathbb{R}^{HW \times HW}$ is the diagonal matrix formed by taking the vectorized $M_l$ as its diagonal entries. Thus, the matrix $\boldsymbol{\Phi}$ has a very sparse structure. Although it differs from the dense random matrices in conventional CS, [39] provides a theoretical analysis of snapshot CS systems in terms of the compression-based framework and demonstrates that the reconstruction error of the CASSI system is bounded.

IV. Hyperspectral Compressive Snapshot Reconstruction
This paper aims at learning the reconstruction only from the 2D CASSI measurement with an unsupervised generative network. To reach this goal, two issues need to be solved. One is how to construct the generative network of the unknown hyperspectral image, and the other is how to effectively estimate the network parameters from the given snapshot measurement. We discuss the details in the following subsections.
IV-A. Spatial-Spectral Reconstruction Network
Fig. 3 illustrates the architecture of the proposed HCS-Net for hyperspectral compressive snapshot reconstruction. To make the generative network aware of the CS task, it is conditioned on the snapshot measurement $\mathbf{g}$. Thus, we concatenate the feature maps of the latent random code $\mathbf{z}$ and the snapshot measurement $\mathbf{g}$ as the network inputs. The inputs are then processed through multiple bottleneck residual blocks and a spatial-spectral attention module, which are dedicated to capturing the spatial-spectral correlation in hyperspectral images. Finally, a convolution layer is employed to adjust the number of output channels to the number of hyperspectral bands (for example, 24 or 31 bands), and a sigmoid activation limits the output range to $[0, 1]$.
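As a rough PyTorch sketch of this conditional design, the input concatenation and the band-count/sigmoid output head can be written as follows. The layer widths, the number of code channels, and the placeholder trunk are illustrative assumptions, not the exact HCS-Net configuration:

```python
import torch
import torch.nn as nn

class ConditionalGeneratorSketch(nn.Module):
    """Minimal sketch of the conditional generator: the random code z and
    the snapshot measurement g are concatenated channel-wise, processed by
    a placeholder trunk (standing in for the BRBs and the attention
    module), and mapped to L spectral bands by a final convolution
    followed by a sigmoid."""
    def __init__(self, code_channels=16, bands=31, features=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(code_channels + 1, features, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(features, features, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Sequential(nn.Conv2d(features, bands, 1), nn.Sigmoid())

    def forward(self, z, g):
        x = torch.cat([z, g], dim=1)   # condition the generator on the measurement
        return self.head(self.trunk(x))

z = torch.rand(1, 16, 64, 64)          # latent random code maps
g = torch.rand(1, 1, 64, 64)           # 2D snapshot measurement
out = ConditionalGeneratorSketch()(z, g)
assert out.shape == (1, 31, 64, 64)    # one channel per spectral band, values in [0, 1]
```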
Bottleneck Residual Block. Residual blocks have been shown to perform well in image feature representation. Taking into account the close similarity between spectral bands, we specialize the residual blocks with a bottleneck connection and cascade three bottleneck residual blocks (dubbed BRBs) for feature extraction. Taking $X^0$ as the input of the first residual block, the skip connection fuses the correlation between bands as $\mathcal{S}_1(X^0)$, and the main path captures the remaining information $\mathcal{R}_1(X^0)$. Thus, the output of the first block is computed as

$$X^1 = \mathcal{S}_1(X^0) + \mathcal{R}_1(X^0), \quad (6)$$

where $\mathcal{S}_1(\cdot)$ and $\mathcal{R}_1(\cdot)$ denote the skip-path and main-path operations in the first block. The subsequent two blocks further extract features from the previous block's output, so the output of the $k$-th block can be expressed as

$$X^k = \mathcal{S}_k(X^{k-1}) + \mathcal{R}_k(X^{k-1}), \quad k = 2, 3. \quad (7)$$
This result is fed into the spatial-spectral attention module for further processing.
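A plausible PyTorch realization of one such block is sketched below. The paper does not spell out the exact layer composition of the BRB, so the 1x1/3x3 bottleneck layout and channel widths here are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class BottleneckResidualBlock(nn.Module):
    """Sketch of one BRB: a 1x1 convolution on the skip path fuses
    inter-band correlation, while a bottlenecked main path captures the
    remaining residual information (layer sizes are illustrative)."""
    def __init__(self, channels=64, bottleneck=16):
        super().__init__()
        self.skip = nn.Conv2d(channels, channels, 1)   # band-fusing skip path
        self.main = nn.Sequential(                     # bottlenecked main path
            nn.Conv2d(channels, bottleneck, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1),
        )

    def forward(self, x):
        # output = skip-path term + main-path term
        return self.skip(x) + self.main(x)

# cascade of three blocks, as in the text
blocks = nn.Sequential(*[BottleneckResidualBlock() for _ in range(3)])
x = torch.rand(1, 64, 32, 32)
assert blocks(x).shape == x.shape      # feature map shape is preserved
```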
Spatial-Spectral Attention Module. Spatial-spectral joint correlation is an inherent characteristic of the hyperspectral datacube. At the same time, hyperspectral data also has a multi-scale structure, just like gray and color images. Thus, we design the spatial-spectral attention module to predict three-dimensional (3D) attention maps at multiple scales, so that these characteristics can be exploited to represent HSIs more effectively.
As shown in Fig. 3, the spatial-spectral attention module operates on multi-scale features, and the 3D attention prediction is performed at each scale. The input of the spatial-spectral attention module is $X^3$; we omit the superscript in this subsection for simplicity and denote it as $Z^0$. Let $Z^s$ represent the feature maps at scale $s$; $Z^{s+1}$ is computed from $Z^s$ through a downsampling operation and convolution, $s = 0, 1, \ldots$,

$$Z^{s+1} = \mathrm{Conv}\big(\mathrm{Down}(Z^s)\big), \quad (8)$$

where $\mathrm{Down}(Z^s)$ denotes the feature maps downsampled from $Z^s$ by a convolution with stride 2. The spatial resolution of $Z^{s+1}$ is $\frac{H_s}{2} \times \frac{W_s}{2}$, where $H_s \times W_s$ is the spatial resolution of the feature maps $Z^s$. Then, the feature maps at the $(s+1)$-th scale are used to compute the attention map to enhance the feature maps at the $s$-th scale. The computation flow is defined as

$$A^s = \sigma\big(\mathrm{Conv}(\mathrm{Up}(Z^{s+1}))\big), \qquad \hat{Z}^s = A^s \odot Z^s, \quad (9)$$
where $\mathrm{Up}(Z^{s+1})$ denotes the feature maps upsampled by a factor of two from $Z^{s+1}$ by bilinear interpolation, $A^s$ is the three-dimensional attention map for the $s$-th scale feature maps, and $\odot$ denotes the Hadamard product. Specifically, we predict $A^s$ through convolution and sigmoid activation of $\mathrm{Up}(Z^{s+1})$. Different from a two-dimensional attention map, or the tensor product of a two-dimensional and a one-dimensional attention map, we directly learn the 3D attention map, so that each entry of the feature maps can be adaptively weighted. Correspondingly, we obtain the attention-enhanced feature maps $\hat{Z}^s$ by the Hadamard product between $A^s$ and $Z^s$. Furthermore, we concatenate $\hat{Z}^s$ with $\mathrm{Up}(Z^{s+1})$ to better fuse the spatial and spectral information among different scales. The final output at the $s$-th scale is computed by the formula below,

$$O^s = \mathrm{Conv}\big(\mathrm{Concat}(\hat{Z}^s, \mathrm{Up}(Z^{s+1}))\big). \quad (10)$$
The fused output is then fed into the succeeding operations.
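The per-scale computation described above can be sketched in PyTorch for a single pair of scales as follows; the channel counts and kernel sizes are illustrative assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSpectralAttentionSketch(nn.Module):
    """Two-scale sketch: coarser features are produced by a stride-2
    convolution, upsampled back by bilinear interpolation, turned into a
    full 3D (per-entry) attention map by convolution + sigmoid, and used
    to reweight the finer-scale features before concatenation and fusion."""
    def __init__(self, channels=64):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # downsampling conv
        self.att = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.Sigmoid())                             # 3D attention map
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)        # cross-scale fusion

    def forward(self, z):
        z_coarse = self.down(z)                                            # coarser scale
        up = F.interpolate(z_coarse, size=z.shape[-2:], mode='bilinear',
                           align_corners=False)                            # upsample x2
        a = self.att(up)                                                   # attention in [0, 1]
        enhanced = a * z                                                   # Hadamard reweighting
        return self.fuse(torch.cat([enhanced, up], dim=1))                 # concat + fuse

x = torch.rand(1, 64, 32, 32)
out = SpatialSpectralAttentionSketch()(x)
assert out.shape == x.shape
```

Because the attention map has the full channel-by-height-by-width shape of the feature tensor, every entry is weighted independently, which is the point of learning a 3D map rather than a separable spatial/spectral pair.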
IV-B. Unsupervised Network Learning
We define the conditional generative network in Fig. 1 as $G_{\theta}(\mathbf{z} \mid \mathbf{g})$, in which $\theta$ denotes the network parameters, $\mathbf{z}$ is the random code and $\mathbf{g}$ is the snapshot measurement. The generative network acts as a parametric mapping from the latent random code $\mathbf{z}$ to the reconstruction, conditioned on the snapshot measurement $\mathbf{g}$. With the aim of unsupervised learning, we try to learn the reconstruction only from the given snapshot measurement. According to the observation model in Eq. (4), the optimal reconstruction can be derived by minimizing the following loss function,

$$\hat{\theta} = \arg\min_{\theta} \Big\| \sum_{l=1}^{L} M_l \odot P_l\big(G_{\theta}(\mathbf{z} \mid \mathbf{g})\big) - G \Big\|_{1}, \quad (11)$$
where $P_l(\cdot)$ is the operation extracting the $l$-th band of the network output. Compared with the $\ell_2$ norm, the $\ell_1$ norm is more robust to outliers in the snapshot measurement [40]. As in [37], we optimize $\theta$ so that the generated image matches the given measurement as closely as possible. According to the chain rule of calculus, the error can be backpropagated from the loss function to the network parameters $\theta$. The Adam [41] algorithm is then used to find the optimal $\hat{\theta}$ for reconstruction. This optimization is tailored to the reconstruction task for one specific image, and the final reconstruction is computed from the optimal parameters as $\hat{F} = G_{\hat{\theta}}(\mathbf{z} \mid \mathbf{g})$. Fig. 4 plots the loss function versus the number of iterations, taking the Toy image from the CAVE dataset [42] as an example. It can be seen that during the first 1000 iterations the objective value decreases rapidly and eventually stabilizes. As the objective value decreases, the PSNR of the reconstruction continues to rise, and a good reconstruction is achieved.
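This learning procedure can be condensed into a short PyTorch loop. The tiny generator and the toy imaging operator below are illustrative stand-ins for HCS-Net and the CASSI model, used only to make the sketch self-contained:

```python
import torch

def fit_to_measurement(net, z, g, forward_op, iters=2500, lr=0.01):
    """Unsupervised fitting: only the network parameters are optimized so
    that the simulated measurement of the generated HSI matches the given
    snapshot g under the l1 norm. `forward_op` applies the imaging model
    (per-band mask modulation + summation)."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = (forward_op(net(z, g)) - g).abs().mean()  # l1 fidelity term
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(z, g)                                 # final reconstruction

# --- toy demo (all sizes illustrative) ---
torch.manual_seed(0)
H, W, L = 16, 16, 3
mask = (torch.rand(1, L, H, W) > 0.5).float()            # per-band code masks
forward_op = lambda f: (mask * f).sum(dim=1, keepdim=True)

class TinyGenerator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.body = torch.nn.Sequential(
            torch.nn.Conv2d(2, 16, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(16, L, 3, padding=1), torch.nn.Sigmoid())
    def forward(self, z, g):
        return self.body(torch.cat([z, g], dim=1))

g = forward_op(torch.rand(1, L, H, W))                   # simulated snapshot
z = torch.rand(1, 1, H, W)                               # latent random code
net = TinyGenerator()
res0 = (forward_op(net(z, g)) - g).abs().mean().item()   # residual before fitting
recon = fit_to_measurement(net, z, g, forward_op, iters=200)
res1 = (forward_op(recon) - g).abs().mean().item()       # residual after fitting
```

After a few hundred Adam steps on this toy problem, the measurement residual drops well below its initial value, mirroring the loss curve described above.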
V. Experimental Results
We evaluate the reconstruction performance of the proposed HCS-Net and compare it with multiple state-of-the-art methods, including predefined prior-based methods, i.e., TwIST [25], GAP-TV [26] and DeSCI [29], and deep network-based methods, i.e., AutoEncoder [16], HSCNN [33], HyperReconNet [34], λ-net [36] and the residual network (dubbed DEIL) in [35]. As in the above methods, two typical image quality metrics are used to evaluate the performance, namely the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [43]. PSNR reflects the spectral reflectance accuracy, and SSIM emphasizes the spatial accuracy of the reconstruction. The larger the PSNR and SSIM values, the better the reconstruction accuracy. For a comprehensive evaluation, the reconstruction experiments are conducted on both simulated and real CASSI measurements, and both the SS-CASSI and SD-CASSI systems are tested. The simulated CASSI measurements include two cases, namely synthetic coded aperture masks and the “Real-Mask-in-the-Loop” masks from real CASSI systems. The simulation experiments are conducted on two datasets, CAVE [42] and ICVL [44]. The CAVE dataset contains 32 scenes with a spatial resolution of $512 \times 512$ over 31 spectral bands; its wavelengths cover from 400 nm to 700 nm at uniform 10 nm intervals. The images in the ICVL dataset also have 31 spectral bands. As in [36], the spectra of the ICVL dataset are downsampled to 24 bands. Besides the CAVE and ICVL datasets, we also test the proposed HCS-Net on real data. The real data captured by the hyperspectral imaging camera [15] has 24 spectral bands with the corresponding wavelengths: 398.62, 404.40, 410.57, 417.16, 424.19, 431.69, 439.70, 448.25, 457.38, 467.13, 477.54, 488.66, 500.54, 513.24, 526.80, 541.29, 556.78, 573.33, 591.02, 609.93, 630.13, 651.74, 674.83 and 699.51 nm.
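For reference, PSNR (in dB, against a peak value of 1 for normalized images) can be computed as below; SSIM is more involved and is typically taken from an image-processing library. This is a generic sketch, not code from the paper:

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; larger means a more accurate
    reconstruction. For an HSI, the MSE is taken over the whole datacube."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# uniform error of 0.1 on a normalized image -> MSE = 0.01 -> PSNR = 20 dB
ref = np.zeros((8, 8, 3))
est = np.full((8, 8, 3), 0.1)
assert abs(psnr(ref, est) - 20.0) < 1e-9
```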
AutoEncoder [16], HSCNN [33], HyperReconNet [34] and DEIL [35] are four supervised networks, and they randomly select some HSIs from three datasets, CAVE, Harvard [45] and ICVL, for network pretraining. λ-net [36] is specifically designed for real hyperspectral data, and 150 hyperspectral images after spectral interpolation were selected from ICVL for its training. Different from the above deep networks, the proposed HCS-Net is an unsupervised deep network and does not require pretraining. We implement HCS-Net using the PyTorch framework, and all the experiments are performed on an NVIDIA GTX 1080 Ti GPU. In our experiments, the random code $\mathbf{z}$ is initialized as random noise maps with a uniform distribution, the learning rate is set to 0.01 and the maximum number of iterations is 2500.
V-A. Ablation Studies
| Setting | Ablation 1 | Ablation 2 | Ablation 3 | Ablation 4 | HCS-Net |
|---|---|---|---|---|---|
| only random code as input | ✓ | | | | |
| only measurement as input | | ✓ | | | |
| random code + measurement as inputs | | | ✓ | ✓ | ✓ |
| only residual blocks | | | ✓ | | |
| only attention module | | | | ✓ | |
| residual blocks + attention module | ✓ | ✓ | | | ✓ |
| PSNR | 34.629 | 36.026 | 35.607 | 36.703 | 39.219 |
| SSIM | 0.948 | 0.955 | 0.950 | 0.964 | 0.979 |
We first conduct ablation experiments to evaluate the influence of different settings of HCS-Net on the reconstruction results. These settings fall into two categories. One concerns the network inputs: only the random code as the input, only the snapshot measurement as the input, or both the random code and the snapshot measurement as the inputs. The other concerns the network architecture: only the bottleneck residual blocks, only the spatial-spectral attention module, or the complete architecture (Fig. 3). We choose appropriate settings to form four ablation methods, where ablation methods 1 and 2 verify the network inputs, and ablation methods 3 and 4 verify the network architecture. Table I reports the results of the different ablation settings on the CAVE dataset.

Based on the average PSNR and SSIM values of ablation methods 1 and 2 and of HCS-Net, we can see that taking both the snapshot measurement and the random code as the inputs obtains significant performance improvements. This improvement mainly comes from two aspects. One is that using the snapshot measurement as a conditional input makes the generator aware of the reconstruction task; the other is that the snapshot measurement contains the spatial structure of the underlying scene, so convolutions on the snapshot measurement can still extract useful features for reconstruction. The comparison of ablation methods 3 and 4 with HCS-Net verifies the contributions of the bottleneck residual blocks and the spatial-spectral attention module. From Table I, it can also be seen that the absence of either module results in performance degradation. The combination of the bottleneck residual blocks and the spatial-spectral attention module achieves the best reconstruction, so the ablation experiments support the design of HCS-Net.
| Methods | PSNR | SSIM |
|---|---|---|
| TwIST | 23.74 | 0.85 |
| GAP-TV | 27.15 | 0.89 |
| DeSCI | 29.20 | 0.91 |
| AutoEncoder | 32.46 | 0.95 |
| HCS-Net | 39.22 | 0.98 |
V-B. Results on Synthetic Measurements
Based on the CAVE dataset, we perform experiments under both cases of simulated measurements, i.e., SS-CASSI and SD-CASSI. Fig. 5 displays some representative color images of the scenes in the CAVE dataset. The coded masks used here are randomly generated according to the corresponding SD-CASSI or SS-CASSI principles.
Table II presents the average PSNR (dB) and SSIM values of the various methods over all 32 images in the CAVE dataset using the SS-CASSI measurement. HCS-Net shows significant superiority over the predefined prior-based methods and the pretrained networks in terms of both PSNR (dB) and SSIM. To visualize the results, the reconstructions of the five algorithms on the Toy and Beads images are shown in Fig. 6, Fig. 8 and Fig. 7, Fig. 9, respectively. We use a wavelength-to-RGB converter to display each band of the spectral reconstruction results, and display five spectral bands. The reconstructed spectral signatures at the four positions indicated in the RGB images are also presented, with the correlation coefficients between the reconstructed spectral signatures and the ground truths shown in the legends. By comparing the reconstructed spectral bands and signatures with the ground truths, it can be clearly seen that HCS-Net is superior to the other four algorithms. This shows that our method can effectively preserve the spatial structures and the spectral accuracy of the hyperspectral image during the snapshot reconstruction process, demonstrating the advantage of HCS-Net in exploiting the intrinsic properties of hyperspectral images and verifying the effectiveness of the unsupervised learning.
Table III lists the reconstruction results using the SD-CASSI measurement. As in [33, 34, 35], we adopt the same 10 HSIs from the CAVE dataset for a fair comparison. The results of the other four methods in Table III are cited from [35]. We can see that HCS-Net achieves results superior or comparable to the four state-of-the-art deep networks, even though they use large datasets for pretraining. The superior performance of HCS-Net mainly benefits from the conditional generative manner and the spatial-spectral attention module, which can more effectively capture the inherent spatial-spectral correlations of HSIs. In addition, the characteristics of unsupervised learning also enhance the adaptability and universality of our network in practical applications.
V-C. Results on “Real-Mask-in-the-Loop”
We further test the performance of the different methods using the “Real-Mask-in-the-Loop” coded masks; that is, the masks used here are from real CASSI systems. Compared with masks generated by simulation, real masks contain more noise, which makes the reconstruction more difficult. For the sake of fairness, we adopt the same 10 HSIs of the ICVL dataset used in λ-net [36] for comparison, which are shown in Fig. 10. These 10 HSIs have 24 spectral bands with a spatial resolution of $256 \times 256$ after spectral interpolation and cropping. As in λ-net, the SS-CASSI measurement is used in this group of experiments.
| Methods | Metrics | scene 1 | scene 2 | scene 3 | scene 4 | scene 5 | scene 6 | scene 7 | scene 8 | scene 9 | scene 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TwIST | PSNR | 25.62 | 18.41 | 21.75 | 21.24 | 23.78 | 20.58 | 24.23 | 20.20 | 27.01 | 18.92 | 22.14 |
| TwIST | SSIM | 0.856 | 0.826 | 0.826 | 0.828 | 0.799 | 0.744 | 0.870 | 0.784 | 0.888 | 0.747 | 0.817 |
| GAP-TV | PSNR | 30.66 | 22.41 | 23.49 | 22.27 | 26.98 | 23.09 | 24.86 | 22.91 | 29.10 | 21.50 | 24.73 |
| GAP-TV | SSIM | 0.892 | 0.869 | 0.863 | 0.829 | 0.792 | 0.802 | 0.877 | 0.841 | 0.912 | 0.796 | 0.847 |
| DeSCI | PSNR | 31.15 | 26.44 | 24.74 | 29.25 | 29.37 | 25.81 | 28.40 | 24.42 | 34.41 | 23.31 | 27.73 |
| DeSCI | SSIM | 0.937 | 0.947 | 0.898 | 0.949 | 0.907 | 0.906 | 0.921 | 0.872 | 0.971 | 0.834 | 0.914 |
| λ-Net | PSNR | 36.11 | 32.05 | 33.34 | 29.60 | 35.40 | 28.57 | 35.22 | 32.35 | 33.42 | 28.20 | 32.43 |
| λ-Net | SSIM | 0.949 | 0.975 | 0.974 | 0.937 | 0.942 | 0.902 | 0.969 | 0.951 | 0.916 | 0.924 | 0.944 |
| HCS-Net | PSNR | 39.94 | 36.74 | 36.30 | 37.43 | 32.07 | 24.06 | 39.59 | 35.70 | 32.57 | 30.79 | 34.52 |
| HCS-Net | SSIM | 0.990 | 0.992 | 0.981 | 0.984 | 0.963 | 0.899 | 0.991 | 0.978 | 0.974 | 0.955 | 0.970 |
Table IV lists the PSNR and SSIM values of these 10 scenes and their averages for the various methods. Because HSCNN, AutoEncoder, HyperReconNet and DEIL are trained to reconstruct HSIs with 31 spectral channels, they are not applicable to this group of experiments. Fig. 11 shows three spectral bands in the reconstructed hyperspectral images of the five algorithms. According to the results in Table IV, HCS-Net exceeds both the predefined prior-based and the deep network-based reconstruction algorithms, and obtains the best reconstruction quality in terms of the average PSNR and SSIM values. From the reconstructed spectral images of the four scenes in Fig. 11, we can see that HCS-Net reconstructs clearer structures and details than the competing methods.
V-D. Results on Real Compressive Snapshot Imaging Data
To demonstrate the superiority of HCS-Net more persuasively, we further perform experiments directly on real hyperspectral compressive snapshot imaging data, i.e., the bird hyperspectral image.¹ It is captured by the real CASSI system [15] and consists of 24 spectral bands. This poses a more daunting challenge, as the 2D CASSI coded images captured by a real snapshot compressive imaging system are accompanied by more noise and outliers.

¹The bird hyperspectral image is downloaded from the Github homepage of [29]: https://github.com/hust512/DeSCI.
The reconstructed spectral signatures and exemplar bands of the bird image are shown in Fig. 12 and Fig. 13. We also plot two reconstructed spectral signatures corresponding to the two positions indicated in the RGB images. HCSNet outperforms GAP-TV and DeSCI in terms of the PSNR and SSIM values. As shown in the reconstructed spectral bands, the GAP-TV result still contains noise, and DeSCI produces excessively smooth results, losing some details. In contrast, HCSNet recovers detailed structures, resulting in relatively good reconstruction quality. Moreover, HCSNet reconstructs more accurate spectral signatures than DeSCI. Accurate reconstruction of spectral signatures is important for applications such as material identification and classification, which implies that HCSNet has the potential to promote the development of hyperspectral compressive snapshot imaging technology.
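One simple way to quantify how faithfully a spectral signature is reconstructed at a marked pixel is the correlation between the reference and reconstructed spectra. The sketch below illustrates this on a toy cube; the metric and function names are our illustration, not necessarily what Fig. 12 plots:

```python
import numpy as np

def signature(cube, y, x):
    # Spectral signature at pixel (y, x) of an H x W x B datacube.
    return np.asarray(cube[y, x, :], dtype=np.float64)

def signature_correlation(sig_ref, sig_rec):
    # Pearson correlation between a reference and a reconstructed signature;
    # a value of 1.0 means the spectra match up to an affine rescaling.
    return float(np.corrcoef(sig_ref, sig_rec)[0, 1])

# Toy check: a rescaled-and-shifted copy of a signature correlates perfectly.
rng = np.random.default_rng(1)
cube = rng.random((32, 32, 24))   # 24 bands, like the bird data
ref = signature(cube, 10, 20)
rec = 0.9 * ref + 0.05            # affine distortion only
print(round(signature_correlation(ref, rec), 6))
```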
V-E Time Complexity
Methods   TwIST  GAP-TV  DeSCI  AutoEncoder  HCSNet
Time (s)  441    49      14351  414          367
We further analyze the time complexity of the proposed method and other baselines for hyperspectral compressive snapshot reconstruction. Table V shows the running time required by each method to reconstruct a 512×512 HSI from the CAVE dataset. The first three methods, TwIST, GAP-TV, and DeSCI, run on the CPU, while AutoEncoder and the proposed HCSNet run on the GPU. According to Table V, GAP-TV is relatively fast, but its reconstruction performance is far from satisfactory. DeSCI has the highest computational complexity, mainly due to the time-consuming block matching and weighted nuclear norm minimization in each iteration of the optimization. Compared with DeSCI, the running time of HCSNet is much shorter. The running time of AutoEncoder is comparable to that of our method, but its reconstruction quality is worse. In addition, AutoEncoder requires extra training time to learn the spectral prior, whereas HCSNet does not require pre-training on a large amount of hyperspectral data and thus consumes no training time. Therefore, while maintaining superior reconstruction performance, HCSNet also keeps the time complexity of network learning acceptable.
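Running times like those in Table V can be gathered with a simple wall-clock harness. The sketch below uses a dummy solver in place of an actual reconstruction algorithm (TwIST, GAP-TV, DeSCI, or a network forward pass), so only the timing scaffolding is shown; all names here are our own:

```python
import time
import numpy as np

def time_reconstruction(reconstruct, measurement, repeats=3):
    # Average wall-clock seconds for `reconstruct` over `repeats` runs,
    # returning the timing together with the last reconstruction.
    t0 = time.perf_counter()
    for _ in range(repeats):
        out = reconstruct(measurement)
    return (time.perf_counter() - t0) / repeats, out

# Stand-in solver: expands a 2-D snapshot into a 31-band cube by replication.
def dummy_solver(y):
    return np.repeat(y[..., None], 31, axis=-1)

elapsed, cube = time_reconstruction(dummy_solver, np.zeros((512, 512)))
print(cube.shape, elapsed >= 0.0)
```

For GPU-based methods, the device should be synchronized before reading the clock so that asynchronous kernel launches are not undercounted.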
VI Conclusion
In this paper, we proposed the unsupervised HCSNet for compressive snapshot reconstruction of HSIs. HCSNet serves as a parametric mapping from the combination of a latent random code and the snapshot measurement to the reconstruction. The bottleneck residual block and the spatial-spectral attention module help the network capture the inherent spatial-spectral correlation, thereby generating spatial-spectral structures more precisely. Furthermore, HCSNet is optimized to generate the reconstruction directly from the given snapshot measurement; this process is fully unsupervised and requires no training data. HCSNet is applicable to both the SD-CASSI and SS-CASSI systems. According to the experimental results, HCSNet can outperform state-of-the-art methods, including deep networks with pre-training.
References
 [1] N. A. Hagen and M. W. Kudenov, “Review of snapshot spectral imaging technologies,” Optical Engineering, vol. 52, no. 9, p. 090901, 2013.
 [2] M. Borengasser, W. S. Hungate, and R. Watkins, Hyperspectral remote sensing: principles and applications. CRC press, 2007.
 [3] X. Cao, F. Zhou, L. Xu, D. Meng, Z. Xu, and J. Paisley, “Hyperspectral image classification with Markov random fields and a convolutional neural network,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2354–2367, 2018.
 [4] Y. Xu, Z. Wu, J. Li, A. Plaza, and Z. Wei, “Anomaly detection in hyperspectral images based on low-rank and sparse representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 4, pp. 1990–2000, 2015.
 [5] L. Ojha, M. B. Wilhelm, S. L. Murchie, A. S. McEwen, J. J. Wray, J. Hanley, M. Massé, and M. Chojnacki, “Spectral evidence for hydrated salts in recurring slope lineae on Mars,” Nature Geoscience, vol. 8, no. 11, p. 829, 2015.
 [6] M. Attas, E. Cloutis, C. Collins, D. Goltz, C. Majzels, J. R. Mansfield, and H. H. Mantsch, “Near-infrared spectroscopic imaging in art conservation: investigation of drawing constituents,” Journal of Cultural Heritage, vol. 4, no. 2, pp. 127–136, 2003.
 [7] X. Cao, T. Yue, X. Lin, S. Lin, X. Yuan, Q. Dai, L. Carin, and D. J. Brady, “Computational snapshot multispectral cameras: Toward dynamic capture of the spectral world,” IEEE Signal Processing Magazine, vol. 33, no. 5, pp. 95–108, 2016.
 [8] X. Lin, Y. Liu, J. Wu, and Q. Dai, “Spatial-spectral encoded compressive hyperspectral imaging,” ACM Transactions on Graphics, vol. 33, no. 6, p. 233, 2014.
 [9] S.-H. Baek, I. Kim, D. Gutierrez, and M. H. Kim, “Compact single-shot hyperspectral imaging using a prism,” ACM Transactions on Graphics, vol. 36, no. 6, p. 217, 2017.
 [10] Y. Y. Schechner and S. K. Nayar, “Generalized mosaicing: Wide field of view multispectral imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1334–1348, 2002.

 [11] H. Du, X. Tong, X. Cao, and S. Lin, “A prism-based system for multispectral video acquisition,” in IEEE International Conference on Computer Vision (ICCV). IEEE, 2009, pp. 175–182.
 [12] P. Mouroulis and M. M. McKerns, “Pushbroom imaging spectrometer with high spectroscopic data fidelity: experimental demonstration,” Optical Engineering, vol. 39, no. 3, pp. 808–816, 2000.
 [13] G. Nahum, “Imaging spectroscopy using tunable filters: a review,” in Proc. SPIE, Wavelet Applications VII, vol. 4056, 2000, pp. 50–64.
 [14] W. M. Porter and H. T. Enmark, “A system overview of the airborne visible/infrared imaging spectrometer (AVIRIS),” in Imaging Spectroscopy II, vol. 834. International Society for Optics and Photonics, 1987, pp. 22–31.
 [15] M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, “Single-shot compressive spectral imaging with a dual-disperser architecture,” Optics Express, vol. 15, no. 21, pp. 14013–14027, 2007.
 [16] I. Choi, D. S. Jeon, G. Nam, D. Gutierrez, and M. H. Kim, “Highquality hyperspectral reconstruction using a spectral prior,” ACM Transactions on Graphics, vol. 36, no. 6, p. 218, 2017.
 [17] A. Wagadarikar, R. John, R. Willett, and D. Brady, “Single disperser design for coded aperture snapshot spectral imaging,” Applied Optics, vol. 47, no. 10, pp. B44–B51, 2008.
 [18] D. Kittle, K. Choi, A. Wagadarikar, and D. J. Brady, “Multiframe image estimation for coded aperture snapshot spectral imagers,” Applied Optics, vol. 49, no. 36, pp. 6824–6833, 2010.
 [19] G. Martín, J. M. Bioucas-Dias, and A. Plaza, “HYCA: A new technique for hyperspectral compressive sensing,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 5, pp. 2819–2831, 2014.

 [20] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [21] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2414–2423.
 [22] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4509–4522, 2017.
 [23] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
 [24] D. L. Donoho et al., “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
 [25] J. M. Bioucas-Dias and M. A. Figueiredo, “A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Transactions on Image Processing, vol. 16, no. 12, pp. 2992–3004, 2007.
 [26] X. Yuan, “Generalized alternating projection based total variation minimization for compressive sensing,” in IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 2539–2543.
 [27] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, “Bregman iterative algorithms for l1-minimization with applications to compressed sensing,” SIAM Journal on Imaging Sciences, vol. 1, no. 1, pp. 143–168, 2008.
 [28] X. Yuan, T.-H. Tsai, R. Zhu, et al., “Compressive hyperspectral imaging with side information,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 6, pp. 964–976, 2015.
 [29] Y. Liu, X. Yuan, J. Suo, D. Brady, and Q. Dai, “Rank minimization for snapshot compressive imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 12, pp. 2990–3006, 2019.

 [30] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “ReconNet: Non-iterative reconstruction of images from compressively sensed measurements,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 449–458.
 [31] Y. Sun, J. Chen, Q. Liu, and G. Liu, “Learning image compressed sensing with sub-pixel convolutional generative adversarial network,” Pattern Recognition, vol. 98, p. 107051, 2020.
 [32] Y. Yang, J. Sun, H. Li, and Z. Xu, “ADMM-CSNet: A deep learning approach for image compressive sensing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 3, pp. 521–538, 2020.
 [33] Z. Xiong, Z. Shi, H. Li, L. Wang, D. Liu, and F. Wu, “HSCNN: CNN-based hyperspectral image recovery from spectrally undersampled projections,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 518–525.
 [34] L. Wang, T. Zhang, Y. Fu, and H. Huang, “HyperReconNet: Joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2257–2270, 2018.
 [35] T. Zhang, Y. Fu, L. Wang, and H. Huang, “Hyperspectral image reconstruction using deep external and internal learning,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 8559–8568.
 [36] X. Miao, X. Yuan, Y. Pu, and V. Athitsos, “λ-net: Reconstruct hyperspectral images from a snapshot measurement,” in IEEE International Conference on Computer Vision (ICCV), 2019.
 [37] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9446–9454.
 [38] Y. Sun, Y. Yang, Q. Liu, et al., “Learning non-locally regularized compressed sensing network with half-quadratic splitting,” IEEE Transactions on Multimedia, online, 2020.
 [39] S. Jalali and X. Yuan, “Snapshot compressed sensing: performance bounds and algorithms,” IEEE Transactions on Information Theory, vol. 65, no. 12, pp. 8005–8024, 2019.
 [40] Y. Li and G. R. Arce, “A maximum likelihood approach to least absolute deviation regression,” EURASIP Journal on Advances in Signal Processing, 948982, 2004.
 [41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [42] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, “Generalized assorted pixel camera: Post-capture control of resolution, dynamic range and spectrum,” Technical Report CUCS-061-08, Department of Computer Science, Columbia University, 2008.
 [43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
 [44] B. Arad and O. Ben-Shahar, “Sparse recovery of hyperspectral signal from natural RGB images,” in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 19–34.
 [45] A. Chakrabarti and T. Zickler, “Statistics of real-world hyperspectral images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 193–200.