Unsupervised Spatial-spectral Network Learning for Hyperspectral Compressive Snapshot Reconstruction

12/18/2020, by Yubao Sun et al., National University of Singapore

Hyperspectral compressive imaging takes advantage of compressive sensing theory to achieve coded aperture snapshot measurement without temporal scanning, and the entire three-dimensional spatial-spectral data is captured by a two-dimensional projection during a single integration period. Its core issue is how to reconstruct the underlying hyperspectral image using compressive sensing reconstruction algorithms. Due to the diversity in the spectral response characteristics and wavelength range of different spectral imaging devices, previous works are often inadequate to capture complex spectral variations or lack the adaptive capacity to new hyperspectral imagers. In order to address these issues, we propose an unsupervised spatial-spectral network to reconstruct hyperspectral images only from the compressive snapshot measurement. The proposed network acts as a conditional generative model conditioned on the snapshot measurement, and it exploits the spatial-spectral attention module to capture the joint spatial-spectral correlation of hyperspectral images. The network parameters are optimized to make sure that the network output can closely match the given snapshot measurement according to the imaging model, thus the proposed network can adapt to different imaging settings, which can inherently enhance the applicability of the network. Extensive experiments upon multiple datasets demonstrate that our network can achieve better reconstruction results than the state-of-the-art methods.


I Introduction

Hyperspectral imaging aims at sampling the spectral reflectance of a scene to collect a three-dimensional (3D) dataset consisting of two spatial dimensions and one spectral dimension, called a data-cube [1]. Compared with panchromatic images, the sensed spectral signature of each pixel in hyperspectral images (HSIs) covers a broad range of wavelengths and has a high spectral resolution, so it can reveal more properties of objects at the corresponding spatial position in the scene. Hyperspectral imaging therefore enables more accurate object classification and acts as a useful tool in many applications, including environmental remote sensing [2], land cover classification [3], anomaly detection [4] and material identification [1, 5, 6].

Many different techniques have been developed for acquiring 3D hyperspectral cubes [7, 8, 9, 10, 11]. Typically, these imaging systems capture a one- or two-dimensional subset of the data-cube and resort to temporal scanning to sense the remaining dimensions. For instance, spatially scanned hyperspectral imaging systems [12] measure slices of the data-cube with a two-dimensional sensor array in push-broom imaging spectrometers, or collect only one point of the data-cube at a time in whisk-broom imaging spectrometers, and complete the data-cube by conducting spatial scanning. Spectrally scanned hyperspectral imaging systems, such as fixed or tunable filter spectrometers, sense a single spectral band of the data-cube at one time and cover all the spectral bands by scanning along the spectral dimension [13]. However, these spectrometers may suffer from motion artifacts during the scanning period. Furthermore, since the light collection efficiency of the entrance slit in these spectrometers is insufficient, the imaging quality may be degraded.

Fig. 1: The flowchart of the proposed hyperspectral compressive snapshot reconstruction method based on unsupervised network learning. The detailed architecture of the conditional generative network is shown in Fig. 3.

Different from the scanning hyperspectral imaging systems mentioned above [14, 10], snapshot imaging spectrometers collect both the spectral and spatial information in a single integration period. Therefore, motion artifacts can be avoided in snapshot imaging, and the light collection efficiency can also be significantly improved, enabling the capture of dynamic scenes. Coded aperture snapshot spectral imaging (CASSI) [15] is one of the well-known hyperspectral snapshot imaging systems. It takes advantage of compressed sensing (CS) theory and achieves a two-dimensional (2D) snapshot measurement by random linear projection. Specifically, the random projection in CASSI starts from encoding the scene through a binary coded aperture mask. There are mainly two encoding manners [16]: one encodes the optical field in the spatial dimension with a single disperser, as in SD-CASSI [17]; the other encodes the optical field in both the spatial and spectral dimensions, as in SS-CASSI [8] or DD-CASSI [15]. The encoded light field is then integrated by a 2D detector array. An optimization algorithm is then needed to reconstruct the spectral scene from the 2D snapshot measurement, which contains far fewer samples than the number required by conventional scanning-based spectrometers.

Due to the under-determined observations in CASSI, HSI reconstruction from the snapshot measurement is an ill-posed inverse problem. To deal with this issue, some hand-crafted structures have been designed to represent hyperspectral images, including total variation (TV) [18], sparsity [17], low-rank [19], and non-local self-similarity. The reconstruction can be obtained by solving the corresponding prior-regularized optimization problems. However, these prior structures are designed empirically and are therefore insufficient to represent the complicated spectral variation of real-world scenes. With the powerful learning capabilities of deep networks [20, 21, 22, 23], some works have attempted to learn parameterized network representations of HSIs in a data-driven manner [16]. However, they all require a large number of hyperspectral images for supervised learning. In practical scenarios, it is expensive to collect enough training data for network pre-training. In addition, due to the differences in the spectral response characteristics and spectral wavelength ranges of different spectral imaging devices, a network pre-trained on specific hyperspectral datasets usually cannot be applied well to other hyperspectral imagers.

In order to cope with these issues, we propose an unsupervised network that learns HSI reconstruction only from the compressive snapshot measurement, without pre-training. As shown in Fig. 1, the proposed network acts as a conditional generative network that generates the underlying hyperspectral image from a latent random code, conditioned on the given snapshot measurement. Different from gray or color images, hyperspectral images present a joint correlation between the spatial and spectral dimensions. Therefore, the conditional generative network is equipped with specific modules to capture spatial-spectral correlations, which can effectively reconstruct the spatial-spectral information of HSIs. The network parameters are optimized to generate the optimal reconstruction that closely matches the given snapshot measurement according to the imaging model. We refer to our Hyperspectral Compressive SnapShot reconstruction Network as HCS-Net for short. Our main contributions can be summarized as:

  1. We propose an unsupervised HCS-Net for hyperspectral compressive snapshot reconstruction, which learns the reconstruction only from the snapshot measurement without pre-training. Owing to its unsupervised nature, it offers greatly enhanced adaptability and generalization in practical scenarios.

  2. The spatial-spectral joint attention module is designed to capture the correlation between the spatial and spectral dimensions of HSIs. This module learns multi-scale 3D attention maps to adaptively weight each entry of the feature maps, which helps improve the reconstruction quality.

  3. The proposed HCS-Net is evaluated upon multiple simulated and real data sets with both the SD-CASSI and SS-CASSI systems. The quantitative results show that HCS-Net achieves promising reconstruction results, and outperforms the state-of-the-art methods.

The remainder of this paper is organized as follows. In Section II, we review related works, focusing on two popular categories, namely predefined prior-based and deep network-based methods. Section III introduces the CASSI systems, and Section IV describes the proposed HCS-Net, including its network architecture and network learning. We report the experimental results in Section V and conclude the paper in Section VI.

II Related Work

The coding-based snapshot imaging methods rely on the principle of compressed sensing [24], and the number of entries in the snapshot measurement (such as SD-CASSI and SS-CASSI measurements) is much smaller than the original HSI size. Therefore, this under-determined reconstruction problem usually leverages a proper prior representation of HSIs to achieve reliable reconstruction. The popular HSI compressive snapshot reconstruction methods can be mainly grouped into two categories: predefined prior-based methods and deep network-based reconstruction methods.

Predefined Prior-based HSI reconstruction. This kind of method seeks the reconstruction by optimizing an objective function consisting of a data fidelity term and a regularization term. The data fidelity term penalizes the mismatch between the unknown HSI and the given measurement according to the imaging observation model, and the regularization term constrains the prior structures of HSIs. Many prior structures have been exploited to represent the HSI, such as the sparsity prior, total variation and the low-rank structure [18, 17, 19].

Many studies have been developed within this paradigm. [25] and [26] both choose the total variation (TV) prior as the regularization term for each band, employing the two-step iterative shrinkage/thresholding (TwIST) algorithm and the generalized alternating projection (GAP), respectively, for model optimization. [27] represents the unknown signal in the wavelet and DCT domains and solves the induced sparsity-regularized reconstruction problem by the Bregman iterative algorithm. Instead of using a predefined transformation, [28] learns a dictionary to represent the underlying HSI data-cube. Liu et al. [29] proposed a method dubbed DeSCI to capture the nonlocal self-similarity of HSIs by minimizing the weighted nuclear norm. Compared with TV regularization, DeSCI achieves better reconstruction performance, but it spends a lot of time on patch search and singular value decomposition. Overall, all these prior structures are hand-crafted from empirical knowledge, so they lack the ability to adapt to spectral diversity and the nonlinear distribution of hyperspectral data. At the same time, these priors also involve the empirical setting of some parameters.

Deep Network-based HSI reconstruction. In recent years, deep neural networks have been proven to achieve state-of-the-art results for a variety of image-related tasks, including image compressive sensing reconstruction [30, 31, 32]. Unlike the predefined prior-based algorithms, deep network-based methods attempt to directly learn the image prior by training the network on large data sets, thereby capturing the inherent statistical characteristics of HSIs.

Several studies have applied deep networks to hyperspectral compressive reconstruction from the CASSI measurement. Xiong et al. [33] upsampled the undersampled measurement to the same dimension as the original HSI and enhanced the reconstruction by learning the incremental residual with a convolutional neural network. Choi et al. [16] designed a convolutional autoencoder to obtain a nonlinear spectral representation of HSIs; they adopted the learned autoencoder prior and a total variation prior as a compound regularization term of a unified variational reconstruction problem, which was optimized with the alternating direction method of multipliers to obtain the final reconstruction. Wang et al. [34] proposed a network named HyperReconNet to learn the reconstruction, in which a spatial network and a spectral network were concatenated to perform the spatial-spectral information prediction. [35] mainly exploits a network consisting of multiple dense residual blocks with residual channel attention modules to learn the reconstruction; in order to capture the complex variations of HSIs, both an external dataset and the internal information of the input coded image are used in [35]. [36] proposes λ-net to learn the reconstruction mapping via a two-stage generative adversarial network, using a deeper self-attention U-net for the first-stage reconstruction and another U-net to refine the first-stage result.

Although these deep network-based methods can exploit the power of deep feature representation to boost reconstruction accuracy, they all require large data sets for network pre-training. At the same time, these pre-trained networks are dedicated to a single observation system: when a new coded aperture mask is used in the imaging system, the reconstruction networks must be re-trained to obtain good reconstruction, so pre-trained networks generalize poorly to other HSI imaging systems. Unsupervised network learning is an effective way to cope with this issue. Motivated by the deep image prior [37], a non-locally regularized network was developed for compressed sensing of images without network pre-training [38]. However, the compressive imaging principle of HSIs differs from that of monochromatic images [38] in that HSIs have complex spatial-spectral joint structures. In this paper, we propose an unsupervised spatial-spectral network for hyperspectral compressive snapshot reconstruction, in which multi-scale spatial-spectral attention modules are designed to capture the complex spatial-spectral correlation and HSIs are reconstructed only from the given snapshot measurement without network pre-training, thereby improving the applicability of the HSI reconstruction network.

III Coded Aperture Snapshot Spectral Imaging

The CASSI system makes full use of a coded aperture (physical mask) and one or more dispersive elements to modulate the optical field of a target scene, and achieves the projection from the 3D HSI data-cube onto a 2D detector according to the specific sensing process, as shown in Fig. 2. According to the method of encoding spectral signatures, CASSI can be mainly divided into two categories: SD-CASSI [17], which uses a single disperser and encodes in the spatial domain, and SS-CASSI [8] or DD-CASSI [15], which encode in both the spatial and spectral domains.

Fig. 2: Illustration of the two optical coding principles in CASSI. The upper part is the SD-CASSI sensing process, and the lower part is the DD-CASSI sensing process. The main difference between the two processes is that SD-CASSI uses a single disperser and encodes only in the spatial domain, whereas DD-CASSI or SS-CASSI uses two dispersers and encodes in both the spatial and spectral domains.

For concreteness, let $f(x, y, \lambda)$ indicate the discrete values of the source spectral intensity at wavelength $\lambda$ and location $(x, y)$. A coded aperture mask creates coded patterns by its transmission function $T(x, y)$, while a dispersive prism produces a shear along one spatial axis based on a wavelength-dependent dispersive function $\phi(\lambda)$. Here, $\phi(\lambda)$ is assumed to be linear.

For spatial encoding, the imaging systems, such as SD-CASSI, first create a coding of the incident light field and then shear the coded field through a dispersive element. The final snapshot measurement $g(x, y)$ at the 2D detector array can be represented as an integral over the spectral wavelength range $\Lambda$,

$g(x, y) = \int_{\Lambda} T\big(x - \phi(\lambda), y\big)\, f\big(x - \phi(\lambda), y, \lambda\big)\, d\lambda.$    (1)

For spatial-spectral encoding, the imaging systems, such as DD-CASSI and SS-CASSI, have two dispersive elements, and a coded aperture is placed between them. Specifically, the imaging system disperses the incident light field, creates a coded field through the coded aperture mask, and employs additional optics to unshear this coding. The final snapshot measurement can be represented as

$g(x, y) = \int_{\Lambda} T\big(x + \phi(\lambda), y\big)\, f(x, y, \lambda)\, d\lambda.$    (2)

In summary, the CASSI imaging process can be rewritten in the following standard form of an underdetermined system,

$\mathbf{y} = \mathbf{\Phi}\mathbf{x} + \mathbf{n},$    (3)

where $\mathbf{x} \in \mathbb{R}^{N}$ is the vectorized representation of the underlying 3D HSI $\mathbf{X}$, with $N_x \times N_y$ as its spatial resolution, $N_\lambda$ as its number of spectral bands and $N$ computed as $N = N_x N_y N_\lambda$; $\mathbf{y} \in \mathbb{R}^{M}$ denotes the vectorized formulation of the corresponding 2D snapshot measurement. For the SD-CASSI system, the sensed measurement has the dimension $(N_x + N_\lambda - 1) \times N_y$, so $M$ is computed as $M = (N_x + N_\lambda - 1) N_y$. Since the second dispersive element in the DD-CASSI system can undo the dispersion caused by the first one, the sensed measurement of the DD-CASSI system has the same spatial dimension $N_x \times N_y$ as $\mathbf{X}$, and $M$ is accordingly computed as $M = N_x N_y$. $\mathbf{\Phi} \in \mathbb{R}^{M \times N}$ is the measurement matrix, and $\mathbf{n} \in \mathbb{R}^{M}$ is the measurement noise.

Taking the CASSI system with spatial-spectral encoding as an example, the snapshot imaging process can be expressed by the measurement matrix $\mathbf{\Phi}$, and the measurement rate can be computed as $M/N = 1/N_\lambda$. In order to demonstrate the intrinsic structure of $\mathbf{\Phi}$, the snapshot projection procedure can be expanded as follows according to Eq. (2),

$\mathbf{Y} = \sum_{k=1}^{N_\lambda} \mathbf{X}_k \odot \mathbf{M}_k + \mathbf{N},$    (4)

where $\mathbf{Y}$ is the matrix representation of the sensed snapshot measurement, $\odot$ is the Hadamard (element-wise) product, $\mathbf{X}_k$ is the $k$-th spectral band, $\mathbf{M}_k$ is the shifted code mask corresponding to the $k$-th band, and $\mathbf{N}$ is the measurement noise. Specifically, each pixel with an $N_\lambda$-dimensional spectral vector is collapsed to form one pixel in the snapshot measurement. Thus, the measurement matrix $\mathbf{\Phi}$ in Eq. (3) can be specialized as

$\mathbf{\Phi} = [\mathbf{D}_1, \mathbf{D}_2, \dots, \mathbf{D}_{N_\lambda}],$    (5)

where $\mathbf{D}_k = \mathrm{diag}(\mathrm{vec}(\mathbf{M}_k))$ is the diagonal matrix taking the vectorized $\mathbf{M}_k$ as its diagonal entries. Thus, the matrix $\mathbf{\Phi}$ has a very sparse structure. Although it differs from the dense random matrices in conventional CS, [39] provides a theoretical analysis of snapshot CS systems within a compression-based framework and demonstrates that the reconstruction error of the CASSI system is bounded.
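To make Eq. (4) concrete, the following minimal PyTorch sketch simulates the spatial-spectral snapshot operator: each band is modulated by a shifted slice of a wider coded aperture, and the modulated bands are summed on the detector. The one-pixel-per-band shift and the tensor shapes are illustrative assumptions, not the exact optical calibration of a real system.

```python
import torch

def ss_cassi_forward(x, mask):
    """Simulate the spatial-spectral snapshot measurement of Eq. (4):
    Y = sum_k X_k (*) M_k, with M_k a shifted slice of the coded aperture.

    x:    HSI cube of shape (L, H, W), values in [0, 1]
    mask: coded aperture of shape (H, W + L - 1); a one-pixel shift per
          band yields the band-wise codes M_k (an assumption here)
    """
    L, H, W = x.shape
    y = torch.zeros(H, W)
    for k in range(L):
        m_k = mask[:, k:k + W]   # shifted code mask M_k for band k
        y = y + m_k * x[k]       # Hadamard product, summed over bands
    return y

# Toy usage: a random 31-band cube and a random binary mask.
x = torch.rand(31, 64, 64)
mask = (torch.rand(64, 64 + 31 - 1) > 0.5).float()
y = ss_cassi_forward(x, mask)    # 2D snapshot measurement, shape (64, 64)
```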

IV Hyperspectral Compressive Snapshot Reconstruction

This paper aims at learning the reconstruction only from the 2D CASSI measurement with an unsupervised generative network. To reach this purpose, two issues need to be solved. One is how to construct the generative network of the unknown hyperspectral image, and the other is how to effectively estimate the network parameters based on the given snapshot measurement. We discuss the details in the following subsections.

IV-A Spatial-Spectral Reconstruction Network

Fig. 3 illustrates the architecture of the proposed HCS-Net for hyperspectral compressive snapshot reconstruction. To make the generative network CS task-aware, the network is conditioned on the snapshot measurement $\mathbf{Y}$. Thus, we concatenate feature maps of the latent random code $z$ and the snapshot measurement as the network inputs. The inputs are then processed through multiple bottleneck residual blocks and one spatial-spectral attention module, which are dedicated to capturing the spatial-spectral correlation in hyperspectral images. Finally, a convolution layer adjusts the number of output channels to match the number of hyperspectral bands (for example, 24 or 31 bands), and a sigmoid activation limits the output range to $[0, 1]$.

Fig. 3: The overall framework of the proposed HCS-Net. The snapshot measurement and the random code are taken as the inputs of HCS-Net, and multiple bottleneck residual blocks and one spatial-spectral attention module are exploited to reconstruct the original hyperspectral data-cube under the guidance of the loss function.

Bottleneck Residual Block. Residual blocks have been shown to perform well in image feature representation. Taking into account the close similarity between spectral bands, we specialize the residual blocks with a bottleneck connection and cascade three bottleneck residual blocks (dubbed BRBs) for feature extraction. Taking $F^{(0)}$ as the input of the first residual block, the skip connection fuses the correlation between bands carried by $F^{(0)}$, and the main path captures the remaining information $\mathcal{B}_1(F^{(0)})$. Thus, the output of the first block is computed as

$F^{(1)} = F^{(0)} + \mathcal{B}_1(F^{(0)}),$    (6)

where $\mathcal{B}_1$ denotes the operation in the first block. The subsequent two blocks further extract features from the previous block's output, so the output of the $i$-th block can be expressed as

$F^{(i)} = F^{(i-1)} + \mathcal{B}_i(F^{(i-1)}), \quad i = 2, 3.$    (7)

This result is fed into the spatial-spectral attention module for further processing.
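As a concrete illustration, one BRB of Eqs. (6)-(7) could be realized in PyTorch as below; the 1x1-3x3-1x1 layer pattern and the channel counts are our assumptions, since the exact hyperparameters are specified in Fig. 3.

```python
import torch.nn as nn

class BottleneckResidualBlock(nn.Module):
    """One BRB: the identity skip carries the inter-band correlation,
    while the bottleneck main path learns the residual (Eqs. (6)-(7))."""

    def __init__(self, channels=128, bottleneck=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1),   # squeeze channels
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1),   # expand back
        )

    def forward(self, f):
        return f + self.body(f)   # F^(i) = F^(i-1) + B_i(F^(i-1))

# Cascading three blocks, as in the paper:
brbs = nn.Sequential(*[BottleneckResidualBlock() for _ in range(3)])
```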

Spatial-Spectral Attention Module. Spatial-spectral joint correlation is an inherent characteristic of the hyperspectral data-cube. At the same time, hyperspectral data also has a multi-scale structure, just like gray and color images. Thus, we design the spatial-spectral attention module to predict three-dimensional (3D) attention maps at multiple scales, so that these characteristics can be exploited to represent HSIs more effectively.

As shown in Fig. 3, the spatial-spectral attention module operates on multi-scale features, and the 3D attention prediction is performed at each scale. The input of the spatial-spectral attention module is $F^{(3)}$; we omit the superscript in this subsection for simplicity. Let $F_s$ represent the feature maps at scale $s$; $F_s$ is computed from $F_{s-1}$ through a downsampling operation and convolution, $s = 1, \dots, S$,

$F_s = \mathrm{Conv}\big(\mathrm{Down}(F_{s-1})\big),$    (8)

where $\mathrm{Down}(F_{s-1})$ denotes the feature maps downsampled from $F_{s-1}$ by a convolution with stride 2. The spatial resolution of $F_s$ is $\frac{H}{2^s} \times \frac{W}{2^s}$, where $H \times W$ is the spatial resolution of the feature maps $F_0$. Then, the feature maps at the $s$-th scale are used to compute the attention map that enhances the feature maps at the $(s-1)$-th scale. The computation flow is defined as

$A_{s-1} = \mathrm{Sigmoid}\big(\mathrm{Conv}(\mathrm{Up}(F_s))\big), \qquad \tilde{F}_{s-1} = A_{s-1} \odot F_{s-1},$    (9)

where $\mathrm{Up}(F_s)$ denotes the feature maps doubly upsampled from $F_s$ by bilinear interpolation, $A_{s-1}$ is the three-dimensional attention map for the $(s-1)$-th scale feature maps, and $\odot$ denotes the Hadamard product. Specifically, we predict $A_{s-1}$ through convolution and sigmoid activation applied to $\mathrm{Up}(F_s)$. Different from a two-dimensional attention map or the tensor product of a two-dimensional attention map and a one-dimensional attention map, we directly learn the 3D attention map, so that each entry of the feature maps can be adaptively weighted. Correspondingly, we obtain the attention-enhanced feature maps $\tilde{F}_{s-1}$ by the Hadamard product between $A_{s-1}$ and $F_{s-1}$. Furthermore, we concatenate $\tilde{F}_{s-1}$ with $\mathrm{Up}(F_s)$ to better fuse the spatial and spectral information among different scales. The final output at the $(s-1)$-th scale is computed by

$O_{s-1} = \mathrm{Conv}\big(\mathrm{Concat}(\tilde{F}_{s-1}, \mathrm{Up}(F_s))\big).$    (10)

$O_{s-1}$ is then fed into the succeeding operations.
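A single-level sketch of this module in PyTorch is given below; the paper applies the scheme across multiple scales, and the kernel sizes here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSpectralAttention(nn.Module):
    """One scale pair of the spatial-spectral attention module, Eqs. (8)-(10)."""

    def __init__(self, channels=128):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # Eq. (8)
        self.att = nn.Conv2d(channels, channels, 3, padding=1)   # predicts A_{s-1}
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)        # Eq. (10)

    def forward(self, f):
        f_s = self.down(f)                                # coarser-scale features F_s
        f_up = F.interpolate(f_s, size=f.shape[-2:],
                             mode='bilinear', align_corners=False)         # Up(F_s)
        a = torch.sigmoid(self.att(f_up))     # 3D attention map A_{s-1}, Eq. (9)
        f_att = a * f                         # entry-wise (Hadamard) weighting
        return self.fuse(torch.cat([f_att, f_up], dim=1))  # concat + conv, Eq. (10)
```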

IV-B Unsupervised Network Learning

We define the conditional generative network in Fig. 1 as $G(\theta; z, \mathbf{Y})$, in which $\theta$ denotes the network parameters, $z$ is the random code and $\mathbf{Y}$ is the snapshot measurement. The generative network acts as a parametric mapping from the latent random code $z$ to the reconstruction, conditioned on the snapshot measurement $\mathbf{Y}$. With the aim of unsupervised learning, we try to learn the reconstruction only from the given snapshot measurement $\mathbf{Y}$. According to the observation model in Eq. (4), the optimal reconstruction can be derived by minimizing the following loss function,

$\theta^{*} = \arg\min_{\theta}\; \Big\| \mathbf{Y} - \sum_{k=1}^{N_\lambda} \mathcal{P}_k\big(G(\theta; z, \mathbf{Y})\big) \odot \mathbf{M}_k \Big\|_{1},$    (11)

where $\mathcal{P}_k$ is the operation that extracts the $k$-th band of the network output. Compared with the $\ell_2$ norm, the $\ell_1$ norm is more robust to outliers in the snapshot measurement [40]. As in [37], we optimize $\theta$ so that the generated image matches the given measurement $\mathbf{Y}$ as closely as possible.

According to the chain rule of calculus, the error can be backpropagated from the loss function to the network parameters $\theta$. The Adam [41] algorithm is then used to find the optimal $\theta^{*}$ for reconstruction. This optimization is tailored to the reconstruction task for one specific image, and the final reconstruction can be computed from the optimal parameters as $\hat{\mathbf{X}} = G(\theta^{*}; z, \mathbf{Y})$.
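The whole learning procedure thus amounts to fitting the generator to a single measurement, e.g. as in the sketch below; the `net(z, y)` interface and the mask tensor layout are assumptions about the implementation.

```python
import torch

def reconstruct(net, z, y, masks, iters=2500, lr=0.01):
    """Unsupervised fitting of the conditional generator to one
    snapshot measurement, following Eq. (11).

    net:   conditional generative network G(theta; z, y)
    z:     fixed random code (uniform noise maps)
    y:     2D snapshot measurement, shape (H, W)
    masks: shifted code masks M_k stacked as an (L, H, W) tensor
    """
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        x_hat = net(z, y)                    # candidate HSI, shape (L, H, W)
        y_hat = (masks * x_hat).sum(dim=0)   # re-project via the imaging model
        loss = (y - y_hat).abs().sum()       # l1 data-fidelity term of Eq. (11)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(z, y)                     # final reconstruction G(theta*; z, Y)
```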

Fig. 4: A sample result on the Toy image from the CAVE dataset. The left plot shows the loss value versus the number of iterations, and the right plot shows the PSNR value versus the number of iterations.

Fig. 4 plots the variation of the loss function versus the number of iterations, taking the Toy image from the CAVE dataset [42] as an example. It can be seen that during the first 1000 iterations, the objective value decreases rapidly and eventually stabilizes. As the objective value decreases, the PSNR of the reconstruction continues to rise, and a good reconstruction result is achieved.

V Experimental Results

We evaluate the reconstruction performance of the proposed HCS-Net and compare it with multiple state-of-the-art methods, including predefined prior-based methods, i.e., TwIST [25], GAP-TV [26] and DeSCI [29], and deep network-based methods, i.e., AutoEncoder [16], HSCNN [33], HyperReconNet [34], λ-net [36] and the residual network (dubbed DEIL) in [35]. As in the above methods, two typical image quality metrics are used to evaluate the performance, namely the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [43]. PSNR reflects the spectral reflectance accuracy, and SSIM emphasizes the spatial structural accuracy of the reconstruction. The larger the PSNR and SSIM values, the better the reconstruction accuracy.
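For reference, PSNR follows its standard definition; a minimal helper that averages the per-band values (one common convention for HSIs, assumed here) might look like:

```python
import numpy as np

def psnr(x_true, x_rec, peak=1.0):
    """PSNR in dB between two HSI cubes of shape (L, H, W),
    averaged over the spectral bands."""
    mse = np.mean((x_true - x_rec) ** 2, axis=(1, 2))        # per-band MSE
    return float(np.mean(10.0 * np.log10(peak ** 2 / mse)))
```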

For a comprehensive evaluation, the reconstruction experiments are conducted upon both simulated and real CASSI measurements, and both the SS-CASSI and SD-CASSI systems are tested. The simulated CASSI measurements include two cases, namely synthetic coded aperture masks and "Real-Mask-in-the-Loop" masks from real CASSI systems. The simulation experiments are conducted on two datasets, CAVE [42] and ICVL [44]. The CAVE dataset contains 32 scenes with a spatial resolution of 512×512 over 31 spectral bands. The wavelength range of CAVE covers 400nm to 700nm at uniform 10nm intervals. The images in the ICVL dataset have a spatial resolution of 1392×1300 over 31 spectral bands. As in [36], the spectral resolution of the ICVL dataset is downsampled to 24 bands. Besides the CAVE and ICVL datasets, we also test the proposed HCS-Net on real data. The real data captured by the hyperspectral imaging camera [15] has 24 spectral bands with the corresponding wavelengths: 398.62, 404.40, 410.57, 417.16, 424.19, 431.69, 439.70, 448.25, 457.38, 467.13, 477.54, 488.66, 500.54, 513.24, 526.80, 541.29, 556.78, 573.33, 591.02, 609.93, 630.13, 651.74, 674.83, 699.51 nm.

AutoEncoder [16], HSCNN [33], HyperReconNet [34] and DEIL [35] are four supervised networks; they randomly select some HSIs from three datasets (CAVE, Harvard [45] and ICVL) for network pre-training. λ-net [36] is specifically designed for real hyperspectral data, and 150 hyperspectral images after spectral interpolation processing were selected from ICVL for its training. Different from the above deep networks, the proposed HCS-Net is an unsupervised deep network and does not require pre-training. We implement HCS-Net using the PyTorch framework, and all experiments are performed on an NVIDIA GTX 1080 Ti GPU. In our experiments, the random code $z$ is initialized as random noise maps with a uniform distribution, the learning rate is set to 0.01, and the maximum number of iterations is 2500.

V-A Ablation Studies

Setting              | Ablation 1                  | Ablation 2                  | Ablation 3                | Ablation 4                | HCS-Net
Network input        | only random code            | only measurement            | random code + measurement | random code + measurement | random code + measurement
Network architecture | residual blocks + attention | residual blocks + attention | only residual blocks      | only attention module     | residual blocks + attention
PSNR                 | 34.629                      | 36.026                      | 35.607                    | 36.703                    | 39.219
SSIM                 | 0.948                       | 0.955                       | 0.950                     | 0.964                     | 0.979
TABLE I: Performance evaluation of our method under different settings.

We first conduct ablation experiments to evaluate the influence of different settings of HCS-Net on the reconstruction results. These settings fall into two categories. One concerns the network inputs: only the random code as input, only the snapshot measurement as input, or both the random code and the snapshot measurement as inputs. The other concerns the network architecture: only the bottleneck residual blocks, only the spatial-spectral attention module, or the complete architecture (Fig. 3). We choose appropriate settings to form four ablation methods, where ablation methods 1 and 2 verify the network inputs and ablation methods 3 and 4 verify the network architecture. Table I reports the experimental results of the different ablation settings on the CAVE dataset.

Based on the average PSNR and SSIM values of ablation methods 1 and 2 versus HCS-Net, we can see that taking both the snapshot measurement and the random code as inputs yields a significant performance improvement. We attribute this improvement to two aspects. First, using the snapshot measurement as a conditional input makes the generator aware of the reconstruction task. Second, the snapshot measurement retains the spatial structure of the underlying scene, so convolutions on it can still extract useful features for reconstruction. The comparison of ablation methods 3 and 4 with HCS-Net verifies the contributions of the bottleneck residual blocks and the spatial-spectral attention module. From Table I, it can also be seen that the absence of either module results in performance degradation, while their combination achieves the best reconstruction; the ablation experiments thus confirm that the design of HCS-Net is well-founded.

Methods PSNR SSIM
TwIST 23.74 0.85
GAP-TV 27.15 0.89
DeSCI 29.20 0.91
AutoEncoder 32.46 0.95
HCS-Net 39.22 0.98
TABLE II: Average PSNR (dB) and SSIM performance comparisons of various methods on the whole CAVE dataset using the SS-CASSI measurement. The best performance is labeled in bold.
Methods PSNR SSIM
HSCNN [33] 25.02 0.91
AutoEncoder [16] 25.74 0.91
HyperReconNet [34] 24.44 0.90
DEIL [35] 29.05 0.95
HCS-Net 29.33 0.93
TABLE III: Average PSNR (dB) and SSIM performance comparisons of various methods on the CAVE dataset using the SD-CASSI measurement.

V-B Results on Synthetic Measurement

Fig. 5: Some representative scenes in the CAVE dataset.

Based on the CAVE dataset, we perform the experiments under both cases of simulated measurements, i.e., SS-CASSI and SD-CASSI. Fig. 5 displays some representative color images of the scenes in the CAVE dataset. The coded masks used here are randomly generated according to the corresponding SD-CASSI or SS-CASSI principles.

Fig. 6: The RGB image of the Toy scene from the CAVE dataset and the corresponding snapshot measurement by the SS-CASSI encoding. The Toy scene has a spatial resolution of 512×512 and 31 spectral bands. The reconstructed spectral signatures corresponding to the four patches indicated in the RGB image and the corresponding ground truths are shown in the right two columns. Correlation coefficients are also calculated to quantitatively evaluate the accuracy of the reconstructed spectral signatures of the five methods.
Fig. 7: The RGB image of the Beads scene from the CAVE dataset and the corresponding snapshot measurement by the SS-CASSI encoding. The reconstructed spectral signatures of five methods and the corresponding ground truths are shown in the right two columns. Correlation coefficients are also calculated for quantitatively evaluating the accuracy of the reconstructed spectral signatures.
Fig. 8: Reconstructed spectral image of the Toy data from the CAVE dataset. The visualization of reconstruction on five spectral bands (wavelength: 400nm, 470nm, 540nm, 610nm and 700nm) of the Toy hyperspectral image by five methods are presented, and the corresponding PSNR and SSIM values of each method are TwIST (25.87/0.84), GAP-TV (27.91/0.92), DeSCI (28.35/0.92), AutoEncoder (33.52/0.98) and HCS-Net (39.79/0.99).
Fig. 9: Reconstructed spectral image of the Beads data from the CAVE dataset. The visualization of five reconstructed spectral bands (wavelength: 400nm, 470nm, 540nm, 610nm and 700nm) by five methods are presented, and the corresponding PSNR and SSIM values of each method are TwIST (15.80/0.47), GAP-TV (19.99/0.64), DeSCI (23.51/0.83), AutoEncoder (27.29/0.92) and HCS-Net (33.51/0.96).

Table II presents the average PSNR (dB) and SSIM values of various methods upon all 32 images in the CAVE dataset using the SS-CASSI measurement. HCS-Net has significant superiority over the predefined prior-based methods and pre-trained networks in terms of both PSNR (dB) and SSIM metrics. To visualize the experimental results, the reconstructed results of five algorithms on the Toy and Beads images are shown in Figs. 6 and 8, and Figs. 7 and 9, respectively. We use a wavelength-to-RGB converter to display each band of the spectral reconstruction results, and display five spectral bands. The reconstructed spectral signatures at four positions indicated in the RGB images are also presented. The correlation coefficients between the reconstructed spectral signatures and the ground truths are shown in the legends. By comparing the reconstructed spectral bands and spectral signatures with the ground truths, it can be clearly seen that HCS-Net is superior to the other four comparison algorithms. This shows that our method can effectively preserve the spatial structures and the spectral accuracy of the hyperspectral image during the snapshot reconstruction process, thereby demonstrating the advantage of HCS-Net in exploiting the intrinsic properties of hyperspectral images, and verifying the effectiveness of the unsupervised learning.

Table III lists the reconstruction results using the SD-CASSI measurement. As in [33, 34, 35], we adopt the same 10 HSIs from the CAVE dataset for a fair comparison. The results of the other four methods in Table III are cited from [35]. We can see that HCS-Net achieves superior or comparable results relative to the four state-of-the-art deep networks, even though they use large data sets for pre-training. The superior performance of HCS-Net mainly benefits from the conditional generative manner and the spatial-spectral attention module, which can more effectively capture the inherent spatial-spectral correlations of HSIs. In addition, the characteristics of unsupervised learning can also enhance the adaptability and universality of our network in practical applications.

V-C Results on “Real-Mask-in-the-Loop”

Fig. 10: 10 testing scenes in the ICVL dataset.
Fig. 11: Reconstructed spectral images of four scenes from the ICVL hyperspectral dataset. Three spectral bands (with wavelengths 448.25nm, 541.29nm and 699.51nm) of each reconstructed hyperspectral image are selected for visualization. The reconstructions of five methods (TwIST/GAP-TV/DeSCI/λ-net/HCS-Net) are shown from top to bottom, and the corresponding ground truths are on the last row. (It is recommended to zoom in for better visual effects.)

We further test the performance of different methods using the "Real-Mask-in-the-Loop" coded masks, that is, the masks used here are from the real CASSI systems. Compared with masks generated by simulation, real masks contain more noise, which makes the reconstruction more difficult. For the sake of fairness, we adopt the same 10 HSIs of the ICVL dataset used in λ-net [36] for comparison, which are shown in Fig. 10. These 10 HSIs have 24 spectral bands with a spatial resolution of 256×256, obtained by spectral interpolation and cropping operations. As in λ-net, the SS-CASSI measurement is used in this group of experiments.

Methods Metrics scene 1 scene 2 scene 3 scene 4 scene 5 scene 6 scene 7 scene 8 scene 9 scene 10 Average
TwIST PSNR 25.62 18.41 21.75 21.24 23.78 20.58 24.23 20.20 27.01 18.92 22.14
SSIM 0.856 0.826 0.826 0.828 0.799 0.744 0.870 0.784 0.888 0.747 0.817
GAP-TV PSNR 30.66 22.41 23.49 22.27 26.98 23.09 24.86 22.91 29.10 21.50 24.73
SSIM 0.892 0.869 0.863 0.829 0.792 0.802 0.877 0.841 0.912 0.796 0.847
DeSCI PSNR 31.15 26.44 24.74 29.25 29.37 25.81 28.40 24.42 34.41 23.31 27.73
SSIM 0.937 0.947 0.898 0.949 0.907 0.906 0.921 0.872 0.971 0.834 0.914
λ-Net PSNR 36.11 32.05 33.34 29.60 35.40 28.57 35.22 32.35 33.42 28.20 32.43
SSIM 0.949 0.975 0.974 0.937 0.942 0.902 0.969 0.951 0.916 0.924 0.944
HCS-Net PSNR 39.94 36.74 36.30 37.43 32.07 24.06 39.59 35.70 32.57 30.79 34.52
SSIM 0.990 0.992 0.981 0.984 0.963 0.899 0.991 0.978 0.974 0.955 0.970
TABLE IV: Average PSNR (dB) and SSIM values of five methods upon 10 scenes from the ICVL dataset by using the “Real-Mask-in-the-Loop” coded masks.

Table IV lists the average PSNR and SSIM values of these 10 scenes for multiple methods. Because HSCNN, AutoEncoder, HyperReconNet and DEIL are trained to reconstruct HSIs with 31 spectral channels, they are not applicable to this group of experiments. Fig. 11 shows three spectral bands of the hyperspectral images reconstructed by five algorithms. According to the results in Table IV, HCS-Net exceeds both the predefined prior-based reconstruction algorithms and the deep network-based reconstruction algorithms, and it obtains the best reconstruction quality in terms of the average PSNR and SSIM values. From the reconstructed spectral images of the four scenes in Fig. 11, we can see that HCS-Net reconstructs clearer structures and details than the competing methods.

V-D Results on Real Compressive Snapshot Imaging Data

In order to demonstrate the superiority of HCS-Net more persuasively, we further perform experiments directly on real hyperspectral compressive snapshot imaging data, i.e., the Bird hyperspectral image (downloaded from the DeSCI [29] GitHub repository, https://github.com/hust512/DeSCI). It is captured by the real CASSI system [15] and consists of 24 spectral bands. This poses a more daunting challenge, as the 2D CASSI coded images captured by the real snapshot compressive imaging system are accompanied by more noise and outliers.

Fig. 12: Real data results: the reconstructed spectral signatures of the real hyperspectral data captured by the real CASSI system. The correlation coefficients of the reconstructed spectra and the ground truth are shown in the legends.
Fig. 13: Real data results: Four spectral bands (wavelength: 488.67nm, 541.29nm, 573.33nm and 630.13nm) of the real hyperspectral data reconstructed by three methods, and the corresponding PSNR and SSIM values of each method are GAP-TV (21.99/0.67), DeSCI (24.50/0.67) and HCS-Net (24.69/0.70).

The reconstructed spectral signatures and exemplar bands of the Bird image are shown in Fig. 12 and Fig. 13. We also plot two reconstructed spectral signatures corresponding to the two positions indicated in the RGB image. HCS-Net performs better than GAP-TV and DeSCI in terms of the PSNR and SSIM values. As shown in the reconstructed spectral bands, GAP-TV still contains noise, and DeSCI produces excessively smooth results, losing some details. In contrast, HCS-Net can recover detailed structures, resulting in relatively good reconstruction quality. Moreover, HCS-Net reconstructs more accurate spectral signatures than DeSCI. Accurate reconstruction of spectral signatures is important for applications such as material identification and classification, which implies that HCS-Net has the potential to promote the development of hyperspectral compressive snapshot imaging technology.

V-E Time Complexity

Methods  | TwIST | GAP-TV | DeSCI | AutoEncoder | HCS-Net
Time (s) | 441   | 49     | 14351 | 414         | 367
TABLE V: Comparison of the running time of various methods on a 512×512 HSI from the CAVE dataset.

We further analyze the time complexity of the proposed method and the other baselines for hyperspectral compressive snapshot reconstruction. Table V shows the running time required for each method to reconstruct a 512×512 HSI from the CAVE dataset. The first three methods, TwIST, GAP-TV, and DeSCI, run on the CPU, while AutoEncoder and the proposed HCS-Net run on the GPU. According to Table V, GAP-TV is relatively fast, but its reconstruction performance is far from satisfactory. DeSCI has the highest computational complexity, mainly due to the time-consuming block matching and weighted nuclear norm minimization in each iteration of its optimization. Compared with DeSCI, the running time of HCS-Net is much shorter. The running time of AutoEncoder is comparable to our method, but its reconstruction quality is worse. In addition, the AutoEncoder method requires extra training time to learn the spectral prior, whereas HCS-Net does not require pre-training on a large amount of hyperspectral data and thus consumes no training time. Therefore, while maintaining superior reconstruction performance, our network learning also keeps an acceptable time complexity.

VI Conclusion

In this paper, we proposed the unsupervised HCS-Net for compressive snapshot reconstruction of HSIs. The proposed HCS-Net serves as a parametric mapping from the combination of the latent random code and the snapshot measurement to the reconstruction. The bottleneck residual blocks and the spatial-spectral attention module help the network capture the inherent spatial-spectral correlation, thereby generating spatial-spectral structures more precisely. Furthermore, HCS-Net is optimized to generate the reconstruction directly from the given snapshot measurement; this process is fully unsupervised and requires no training data. HCS-Net is applicable to both the SD-CASSI and SS-CASSI systems. According to the experimental results, it is striking that HCS-Net can outperform the state-of-the-art methods, including pre-trained deep networks.

References

  • [1] N. A. Hagen and M. W. Kudenov, “Review of snapshot spectral imaging technologies,” Optical Engineering, vol. 52, no. 9, p. 090901, 2013.
  • [2] M. Borengasser, W. S. Hungate, and R. Watkins, Hyperspectral remote sensing: principles and applications.   CRC press, 2007.
  • [3] X. Cao, F. Zhou, L. Xu, D. Meng, Z. Xu, and J. Paisley, “Hyperspectral image classification with markov random fields and a convolutional neural network,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2354–2367, 2018.
  • [4] Y. Xu, Z. Wu, J. Li, A. Plaza, and Z. Wei, “Anomaly detection in hyperspectral images based on low-rank and sparse representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 4, pp. 1990–2000, 2015.
  • [5] L. Ojha, M. B. Wilhelm, S. L. Murchie, A. S. McEwen, J. J. Wray, J. Hanley, M. Massé, and M. Chojnacki, “Spectral evidence for hydrated salts in recurring slope lineae on mars,” Nature Geoscience, vol. 8, no. 11, p. 829, 2015.
  • [6] M. Attas, E. Cloutis, C. Collins, D. Goltz, C. Majzels, J. R. Mansfield, and H. H. Mantsch, “Near-infrared spectroscopic imaging in art conservation: investigation of drawing constituents,” Journal of Cultural Heritage, vol. 4, no. 2, pp. 127–136, 2003.
  • [7] X. Cao, T. Yue, X. Lin, S. Lin, X. Yuan, Q. Dai, L. Carin, and D. J. Brady, “Computational snapshot multispectral cameras: Toward dynamic capture of the spectral world,” IEEE Signal Processing Magazine, vol. 33, no. 5, pp. 95–108, 2016.
  • [8] X. Lin, Y. Liu, J. Wu, and Q. Dai, “Spatial-spectral encoded compressive hyperspectral imaging,” ACM Transactions on Graphics, vol. 33, no. 6, p. 233, 2014.
  • [9] S.-H. Baek, I. Kim, D. Gutierrez, and M. H. Kim, “Compact single-shot hyperspectral imaging using a prism,” ACM Transactions on Graphics, vol. 36, no. 6, p. 217, 2017.
  • [10] Y. Y. Schechner and S. K. Nayar, “Generalized mosaicing: Wide field of view multispectral imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1334–1348, 2002.
  • [11] H. Du, X. Tong, X. Cao, and S. Lin, “A prism-based system for multispectral video acquisition,” in IEEE International Conference on Computer Vision (ICCV).   IEEE, 2009, pp. 175–182.
  • [12] P. Mouroulis and M. M. McKerns, “Pushbroom imaging spectrometer with high spectroscopic data fidelity: experimental demonstration,” Optical Engineering, vol. 39, no. 3, pp. 808–816, 2000.
  • [13] G. Nahum, “Imaging spectroscopy using tunable filters: a review,” in Proc. SPIE, Wavelet Applications VII, vol. 4056, 2000, pp. 50–64.
  • [14] W. M. Porter and H. T. Enmark, “A system overview of the airborne visible/infrared imaging spectrometer (aviris),” in Imaging Spectroscopy II, vol. 834.   International Society for Optics and Photonics, 1987, pp. 22–31.
  • [15] M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, “Single-shot compressive spectral imaging with a dual-disperser architecture,” Optics Express, vol. 15, no. 21, pp. 14013–14027, 2007.
  • [16] I. Choi, D. S. Jeon, G. Nam, D. Gutierrez, and M. H. Kim, “High-quality hyperspectral reconstruction using a spectral prior,” ACM Transactions on Graphics, vol. 36, no. 6, p. 218, 2017.
  • [17] A. Wagadarikar, R. John, R. Willett, and D. Brady, “Single disperser design for coded aperture snapshot spectral imaging,” Applied optics, vol. 47, no. 10, pp. B44–B51, 2008.
  • [18] D. Kittle, K. Choi, A. Wagadarikar, and D. J. Brady, “Multiframe image estimation for coded aperture snapshot spectral imagers,” Applied optics, vol. 49, no. 36, pp. 6824–6833, 2010.
  • [19] G. Martín, J. M. Bioucas-Dias, and A. Plaza, “Hyca: A new technique for hyperspectral compressive sensing,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 5, pp. 2819–2831, 2014.
  • [20] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [21] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2414–2423.
  • [22] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4509–4522, 2017.
  • [23] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [24] D. L. Donoho et al., “Compressed sensing,” IEEE Transactions on information theory, vol. 52, no. 4, pp. 1289–1306, 2006.
  • [25] J. M. Bioucas-Dias and M. A. Figueiredo, “A new twist: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Transactions on Image processing, vol. 16, no. 12, pp. 2992–3004, 2007.
  • [26] X. Yuan, “Generalized alternating projection based total variation minimization for compressive sensing,” in IEEE International Conference on Image Processing (ICIP).   IEEE, 2016, pp. 2539–2543.
  • [27] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, “Bregman iterative algorithms for l1-minimization with applications to compressed sensing,” SIAM Journal on Imaging Sciences, vol. 1, no. 1, pp. 143–168, 2008.
  • [28] X. Yuan, T.-H. Tsai, R. Zhu, et al., “Compressive hyperspectral imaging with side information,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 6, pp. 964–976, 2015.
  • [29] Y. Liu, X. Yuan, J. Suo, D. Brady, and Q. Dai, “Rank minimization for snapshot compressive imaging,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 12, pp. 2990–3006, 2019.
  • [30] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “Reconnet: Non-iterative reconstruction of images from compressively sensed measurements,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 449–458.
  • [31] Y. Sun, J. Chen, Q. Liu, and G. Liu, “Learning image compressed sensing with sub-pixel convolutional generative adversarial network,” Pattern Recognition, vol. 98, p. 107051, 2020.
  • [32] Y. Yang, J. Sun, H. Li, and Z. Xu, “ADMM-CSNet: A deep learning approach for image compressive sensing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 3, pp. 521–538, 2020.
  • [33] Z. Xiong, Z. Shi, H. Li, L. Wang, D. Liu, and F. Wu, “Hscnn: Cnn-based hyperspectral image recovery from spectrally undersampled projections,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 518–525.
  • [34] L. Wang, T. Zhang, Y. Fu, and H. Huang, “Hyperreconnet: Joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2257–2270, 2018.
  • [35] T. Zhang, Y. Fu, L. Wang, and H. Huang, “Hyperspectral image reconstruction using deep external and internal learning,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 8559–8568.
  • [36] X. Miao, X. Yuan, Y. Pu, and V. Athitsos, “λ-net: Reconstruct hyperspectral images from a snapshot measurement,” in IEEE International Conference on Computer Vision (ICCV), 2019.
  • [37] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9446–9454.
  • [38] Y. Sun, Y. Yang, Q. Liu, et al., “Learning non-locally regularized compressed sensing network with half-quadratic splitting,” IEEE Transactions on Multimedia, online, 2020.
  • [39] S. Jalali and X. Yuan, “Snapshot compressed sensing: performance bounds and algorithms,” IEEE Transactions on Information Theory, vol. 65, no. 12, pp. 8005–8024, 2019.
  • [40] Y. Li and G. R. Arce, “A maximum likelihood approach to least absolute deviation regression,” EURASIP Journal on Advances in Signal Processing, 948982, 2004.
  • [41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [42] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, “Generalized assorted pixel camera: Post-capture control of resolution, dynamic range and spectrum,” Technical Report CUCS-061-08, Department of Computer Science, Columbia University, 2008.
  • [43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [44] B. Arad and O. Ben-Shahar, “Sparse recovery of hyperspectral signal from natural rgb images,” in European Conference on Computer Vision (ECCV).   Springer, 2016, pp. 19–34.
  • [45] A. Chakrabarti and T. Zickler, “Statistics of real-world hyperspectral images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 193–200.