Hyperspectral (HS) imaging refers to densely sampling the spectral signature of a scene into many narrow bands. It combines imaging technology with spectral technology to detect the two-dimensional geometric space and one-dimensional spectral information of the target, obtaining continuous, narrow-band images with high spectral resolution. Most consumer cameras capture only three primary colors. HS spectrometers, however, can obtain the spectrum of each pixel in the scene and collect the information into a set of images. To visualize HS images, a response function is adopted to transform them into RGB format. Conversely, we can acquire HS images from RGB images by learning the inverse function. In this paper, we propose a general hierarchical regression network (HRNet) for spectral reconstruction from RGB images.
HS imaging technology has many advantages and particular characteristics. It has found many applications, e.g., remote sensing, pedestrian detection [17, 23], food processing, and medical imaging. However, in recent years the development of HS imaging has encountered a bottleneck, since it mainly depends on spectrometers. Traditional spectrometers save images of huge volume and need long operation times, which restricts the application of HS imaging to portable platforms and high-speed moving scenes. Although researchers have continuously optimized the traditional pipeline [7, 35], these hardware devices are still expensive and highly complex. Thus, we present a low-cost and automatic approach based only on RGB cameras. To address the problem, we propose HRNet, which learns the mapping from RGB images to their corresponding HS images.
In general, spectral reconstruction is an ill-posed problem. Moreover, unknown noise in the environment leads to degraded RGB images. However, there is a dense correspondence between RGB images and HS images, making it possible to exploit the correlation from many RGB-HS pairs. Since an RGB image carries much less information than an HS image, many plausible HS images may correspond to the same RGB image. The algorithm needs to learn a reasonable mapping function that produces high-quality HS images. With the development of deep convolutional neural networks (CNNs), it has become feasible to learn this blind mapping for spectral reconstruction.
Previous methods [32, 21, 33, 6, 36] mainly utilize an auto-encoder structure with residual blocks. The network often performs convolution at low spatial resolution, since the features are more compact and the computation is more efficient. However, as the network goes deeper, it fails to retain the original pixel information because down-sampling is performed by convolutions. To address this problem, we introduce a lossless and learnable sampling operator, PixelShuffle. To further boost the quality of generated images, we propose a hierarchical architecture that extracts features at different scales. At each level, the input is obtained by the reverse PixelShuffle (PixelUnShuffle), so that no pixel is lost. Moreover, we propose to use a residual dense block and a residual global block in HRNet for removing artifacts and noise and for modelling remote pixel correlation, respectively.
In general, there are three main contributions of this paper:
(1) We propose HRNet, which utilizes PixelUnShuffle and PixelShuffle layers for downsampling and upsampling without information loss. We also propose a residual dense block together with a residual global block to enlarge the receptive field and boost generation quality;
(2) We propose an 8-setting ensemble strategy to further enhance the generalization of HRNet;
(3) We evaluate the proposed HRNet on the NTIRE 2020 HS dataset. HRNet is the winning method of track 2 - real world images and ranks 3rd on track 1 - clean images.
2 Related work
Hyperspectral image acquisition. Conventional methods for hyperspectral image acquisition often adopt a spectrograph with spatial scanning or spectral scanning technology. Several types of scanners are used for capturing images, including the pushbroom scanner, the whiskbroom scanner, and the band sequential scanner. They have been widely used for decades in many applications such as detection, environmental monitoring, and remote sensing. For instance, pushbroom and whiskbroom scanners are used for photogrammetry and remote sensing by satellite sensors [28, 5]. However, those devices need to capture the spectral information of single points or bands separately and then scan the whole scene to get a full HS image, which makes it difficult to capture scenes with moving objects. In addition, they are physically too large and not suitable for portable platforms. To address these problems, many kinds of non-scanning spectrometers have been developed to suit dynamic scenes [10, 7, 35].
Hyperspectral image reconstruction from RGB images. Since traditional methods for hyperspectral image acquisition are either not portable or too time-consuming for many applications, current methods attempt to reconstruct hyperspectral images from RGB images. By learning the mapping from RGB images to hyperspectral images on a large RGB-HS dataset, it becomes more convenient to obtain many HS images. Recent years have witnessed various studies, including sparse coding and deep learning. In 2008, Parmar et al. proposed a data sparsity expanding method to recover the spatial spectral data cube. Arad et al. first leveraged an HS prior to create a sparse dictionary of HS signatures and their corresponding RGB projections. Aeschbacher et al. then improved the accuracy and runtime of Arad et al.’s method based on the A+ framework.
Beyond the dataset provided by Arad et al., many approaches proposed their own datasets. For instance, Yasuma et al. utilized a CCD camera (Apogee Alta U260) to capture 31-band multispectral images (400–700 nm, at 10 nm intervals) of several static scenes. Nguyen et al. captured a dataset of 64 images with Specim’s PFD-CL-65-V10E (400 nm to 1000 nm) spectral camera. Chakrabarti et al. explored a statistical model based on 55 HS images of indoor and outdoor scenes. As the scale and resolution of natural HS datasets improved, training deep learning methods became more feasible, and a number of algorithms based on convolutional neural networks were proposed [21, 33]. Jégou et al. proposed a fully convolutional densely connected “Tiramisu” network with one hundred layers for semantic segmentation. Galliani et al. enhanced it for spectral image super-resolution. Can et al. improved it to avoid overfitting to the training data and to obtain faster inference speed. Moreover, Xiong et al. proposed a unified HSCNN framework for hyperspectral recovery from both RGB and compressive measurements. To boost the performance, they developed a deep residual network named HSCNN-R, and another distinct architecture that replaces the residual block by the dense block with a novel fusion scheme, named HSCNN-D; the two are collectively called HSCNN+.
Convolutional neural networks. Convolutional neural networks have been successfully applied in many low-level vision tasks, e.g., colorization [39, 19], inpainting [18, 38], deblurring, denoising [13, 9], and demosaicking [9, 40]. Hyperspectral reconstruction, as a low-level task, has recently gained great performance improvements from deep convolutional neural networks. To facilitate convergence and extract features effectively, many well-known basic blocks are utilized in those frameworks, such as the residual block and the dense block. He et al. initially proposed the residual network for image classification. It improves the accuracy significantly compared with the traditional cascaded convolutional structure. Since then, the residual block has been widely used in the image enhancement field for maintaining low-level features through the shortcut connection. It was enhanced by DenseNet, proposed by Huang et al., to improve the feature fusion ability. Moreover, Hu et al. strengthened them with a squeeze-and-excitation network that includes a feature attention mechanism. It is implemented by MLP layers for modelling connections between pixels at different spatial locations. In general, our HRNet combines the advantages of the above methods and provides a more effective and accurate solution for HS reconstruction.
We train our approach on the HS dataset provided by the NTIRE 2020 challenge. This dataset consists of three parts: spectral images, clean RGB images (for track 1), and real world RGB images (for track 2). There are 450 RGB-HS pairs overall in the training set for both tracks, covering different scenes. Each spectral image contains 31 bands in the range of 400 nm to 700 nm. It is of spatial resolution. To generate its corresponding RGB image, a fixed response function is applied to the HS bands. The rendering process can be defined as:
$$x = \Phi(y),$$
where the RGB image $x$ and the HS image $y$ include 3 and 31 channels, respectively. The response function $\Phi$ maps each HS band to the visible channels R, G, and B by 93 parameters. The clean RGB images are constructed by a known response function and saved in an uncompressed format. The real world RGB images, however, are acquired by an unknown response function with additional blind noise and a demosaicking operation. Some examples are illustrated in Figure 1 (e.g. 1st band approximately covers the 395-405 nm range).
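As an illustration of the rendering step, the following minimal Python sketch applies a 3 × 31 response matrix (93 parameters) to the spectrum of a single pixel. The names `render_rgb` and `toy_response`, and the box-shaped toy response itself, are assumptions made for this example; they are not the response function used in the challenge.

```python
NUM_BANDS = 31     # HS bands per pixel
NUM_CHANNELS = 3   # R, G, B


def render_rgb(spectrum, response):
    """Project one 31-band spectrum to RGB with a 3 x 31 response matrix."""
    assert len(spectrum) == NUM_BANDS
    assert len(response) == NUM_CHANNELS
    assert all(len(row) == NUM_BANDS for row in response)
    # Each output channel is a weighted sum over all bands.
    return [sum(w * s for w, s in zip(row, spectrum)) for row in response]


def toy_response():
    """Made-up response (assumption): each channel averages one band range."""
    # Band index ranges for R (long), G (middle), and B (short wavelengths).
    ranges = [(21, 31), (10, 21), (0, 10)]
    response = []
    for start, stop in ranges:
        weight = 1.0 / (stop - start)
        response.append(
            [weight if start <= b < stop else 0.0 for b in range(NUM_BANDS)]
        )
    return response


if __name__ == "__main__":
    response = toy_response()
    print("parameters:", sum(len(row) for row in response))
    print("flat spectrum ->", render_rgb([0.5] * NUM_BANDS, response))
```

Because the projection reduces 31 values to 3, many different spectra render to the same RGB triple, which is why the inverse mapping has to be learned from data.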
3.2 HRNet architecture
PixelUnShuffle layers are utilized to downsample the input to each level without adding parameters. Therefore, the number of input pixels is fixed while the spatial resolution decreases. Conversely, the learnable PixelShuffle layers are adopted to upsample feature maps and reduce channels for inter-level connection. PixelShuffle only reshapes feature maps and does not introduce interpolation, unlike bilinear upsampling. It allows the network to learn the upsampling operation adaptively.
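The lossless property of this sampling pair can be checked with a small pure-Python sketch. It operates on nested lists in channel-height-width order and follows the common sub-pixel channel ordering; the function names and the toy image are assumptions for illustration and are not taken from the HRNet implementation.

```python
def pixel_unshuffle(x, r):
    """Rearrange (C, H*r, W*r) into (C*r*r, H, W) without dropping pixels."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    assert height % r == 0 and width % r == 0
    h, w = height // r, width // r
    # Output channel index is c*r*r + i*r + j for sub-pixel offset (i, j).
    return [
        [[x[c][row * r + i][col * r + j] for col in range(w)] for row in range(h)]
        for c in range(channels)
        for i in range(r)
        for j in range(r)
    ]


def pixel_shuffle(x, r):
    """Inverse rearrangement: (C*r*r, H, W) back to (C, H*r, W*r)."""
    channels, h, w = len(x), len(x[0]), len(x[0][0])
    assert channels % (r * r) == 0
    out_channels = channels // (r * r)
    return [
        [
            [
                x[c * r * r + (row % r) * r + (col % r)][row // r][col // r]
                for col in range(w * r)
            ]
            for row in range(h * r)
        ]
        for c in range(out_channels)
    ]


if __name__ == "__main__":
    # Toy 3-channel 4x4 image with a unique value at every position.
    image = [[[c * 100 + y * 10 + x for x in range(4)] for y in range(4)]
             for c in range(3)]
    down = pixel_unshuffle(image, 2)
    print("shape after unshuffle:", len(down), len(down[0]), len(down[0][0]))
    print("round trip is exact:", pixel_shuffle(down, 2) == image)
```

The spatial size halves while the channel count grows by a factor of four, so the total number of values is unchanged, in contrast to strided convolution or pooling.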
For each level, the process is decomposed into inter-level integration, artifact reduction, and global feature extraction. For inter-level integration, the output features of the subordinate level are pixel shuffled, concatenated to the current level, and finally processed by an additional convolutional layer to unify the channel number. To effectively reduce artifacts, we adopt the residual dense block [14, 16], which contains 5 densely connected convolutional layers and a residual connection. Moreover, the residual global block [14, 15], with a shortcut connection from the input, is used to extract attention between remote pixels through MLP layers.
Since the features are most compact at the bottom level, a convolutional layer is appended to the end of the bottom level to enhance tone mapping by weighting all channels. The two middle levels process features at different scales. Moreover, the top level uses the most blocks to effectively integrate features and reduce artifacts, and thus produces high-quality spectral images. These blocks are illustrated in Figure 3.
3.3 Implementation details
We only use the L1 loss in the training process, which is a PSNR-oriented optimization for the system. The L1 loss is defined as:
$$L_1 = \mathbb{E}\left[\lVert G(x) - y \rVert_1\right],$$
where $x$ is the input RGB image, $y$ is the corresponding spectral image, and $G$ is the proposed HRNet. Note that we use local patches for efficient training. The input RGB image and the output spectral image are cropped from the same spatial region.
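A minimal sketch of the patch-based L1 objective is given below, assuming images are stored as nested lists in channel-height-width order. The helper names `paired_random_crop` and `l1_loss` are hypothetical; the point is only that both images are cropped with the same offsets and that the loss is the mean absolute difference.

```python
import random


def paired_random_crop(rgb, hs, size, rng):
    """Crop the same size x size window from an RGB image and its HS image."""
    height, width = len(rgb[0]), len(rgb[0][0])
    top = rng.randint(0, height - size)    # inclusive bounds
    left = rng.randint(0, width - size)

    def crop(image):
        return [[row[left:left + size] for row in channel[top:top + size]]
                for channel in image]

    return crop(rgb), crop(hs)


def l1_loss(prediction, target):
    """Mean absolute difference over all channels and pixels."""
    diffs = [
        abs(p - t)
        for p_channel, t_channel in zip(prediction, target)
        for p_row, t_row in zip(p_channel, t_channel)
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data (assumption): every channel stores its own pixel index.
    rgb = [[[y * 8 + x for x in range(8)] for y in range(8)] for _ in range(3)]
    hs = [[[y * 8 + x for x in range(8)] for y in range(8)] for _ in range(31)]
    rgb_patch, hs_patch = paired_random_crop(rgb, hs, 4, rng)
    print("same region:", rgb_patch[0] == hs_patch[0])
    print("loss of a perfect prediction:", l1_loss(hs_patch, hs_patch))
```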
For network architecture, all the layers are LeakyReLU activated except the output layer. We do not use any normalization in HRNet, in order to maintain the data distribution. Reflect padding is adopted for each convolutional layer to reduce border effects. The weights of HRNet are initialized by the Xavier algorithm.
For training details, we use the entire NTIRE 2020 HS dataset (450 HS-RGB pairs for both tracks) for training. The whole HRNet is trained for 10000 epochs. The learning rate is halved every 3000 epochs. For optimization, we use the Adam optimizer with a batch size of 8. The image pairs are randomly cropped to 256 × 256 regions and normalized to the range [0, 1]. All experiments are run on 2 NVIDIA Titan Xp GPUs. The whole training process takes approximately 7 days.
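The step decay described above can be written in a few lines. In this sketch the function name `learning_rate` and the starting value `1e-4` are placeholders chosen for illustration only, not the setting used for HRNet.

```python
def learning_rate(epoch, base_lr, step=3000, factor=0.5):
    """Step decay: multiply the rate by `factor` after every `step` epochs."""
    return base_lr * factor ** (epoch // step)


if __name__ == "__main__":
    base_lr = 1e-4  # placeholder value (assumption)
    for epoch in (0, 2999, 3000, 6000, 9000, 9999):
        print(epoch, learning_rate(epoch, base_lr))
```

Over 10000 epochs this schedule halves the rate three times, so the final rate is one eighth of the starting value.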
3.4 Ensemble strategy
Since the solution space of spectral reconstruction is often large, multiple settings may achieve the same performance on the training set. A single network may therefore generalize poorly, since it tends to fall into local minima. We can reduce this risk by combining multiple network settings to enhance generalization and fuse their knowledge. To perform the ensemble strategy, we use 4 other hyper-parameter settings and train HRNet from scratch for both tracks. These settings can be summarized as:
(1) Re-train HRNet using the baseline training setting.
(2) Exchange the positions of the residual dense block and the residual global block in HRNet, and use the baseline training setting.
(3) Train the network with a different batch size (2 or 4), keeping the other hyper-parameters and the network architecture unchanged.
(4) Train the network with a different cropping patch size (320 × 320 or 384 × 384), keeping the other hyper-parameters and the network architecture unchanged.
Therefore, there are 8 training settings in total. All the settings used for the ensemble are trained for 10000 epochs. We record the MRAE (mean relative absolute error between all bands of the generated spectral images and the ground truth) every 1000 epochs, as shown in Table 1 and Figure 4. Finally, for each of the 8 settings we take the epoch with the best MRAE and use these models to compute the ensemble average.
| Setting | Track 1 | Track 2 |
|---|---|---|
| Re-train baseline (1st) | 0.043408 | 0.071044 |
| Re-train baseline (2nd) | 0.043487 | 0.070668 |
| Exchange position of blocks | 0.042418 | 0.071798 |
| Change batch size 8 to 4 | 0.041936 | 0.071259 |
| Change batch size 8 to 2 | 0.041507 | 0.072797 |
| Change patch size 256 to 320 | 0.042810 | 0.070502 |
| Change patch size 256 to 384 | 0.042166 | 0.072313 |
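The selection and averaging steps of the ensemble can be sketched as follows. This is a simplified illustration under the assumption that the average is taken element-wise over the predicted spectral values; `best_epoch`, `ensemble_average`, and the toy numbers are hypothetical and are not results from the paper.

```python
def best_epoch(mrae_log):
    """Return the recorded epoch with the lowest validation MRAE."""
    return min(mrae_log, key=mrae_log.get)


def ensemble_average(predictions):
    """Element-wise mean of several flattened predictions of equal length."""
    assert predictions, "need at least one prediction"
    assert len({len(p) for p in predictions}) == 1, "lengths must match"
    count = len(predictions)
    return [sum(values) / count for values in zip(*predictions)]


if __name__ == "__main__":
    # Made-up log of one setting: epoch -> MRAE.
    log = {1000: 0.050, 2000: 0.043, 3000: 0.044}
    print("selected epoch:", best_epoch(log))
    # Made-up flattened outputs of two selected models.
    print("ensemble:", ensemble_average([[0.25, 0.5], [0.75, 1.0]]))
```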
4.1 Experimental settings
We evaluate the proposed HRNet by comparing it with other network architectures and by conducting an ablation study on the NTIRE 2020 HS dataset. For each track, there are 10 validation RGB images. The evaluation metrics are defined as:
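MRAE. It computes the pixel-wise disparity (mean relative absolute error) between all bands of the generated spectral images and the ground truth. It explicitly represents the reconstruction quality of the network. It is defined as:
$$\mathrm{MRAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert G(x)_i - y_i \rvert}{y_i},$$
where $G(x)$ and $y$ denote the generated and ground truth spectral images, and $N$ denotes the overall pixels of spectral images.
RMSE. It computes the root mean square error between the generated and ground truth spectral images with 31 bands. It is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( G(x)_i - y_i \right)^2}.$$
Back Projection MRAE (BPMRAE). It evaluates the colorimetric accuracy of the RGB images recovered from the generated and ground truth spectral images by a fixed camera response function. It is defined as:
$$\mathrm{BPMRAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \Phi(G(x))_i - \Phi(y)_i \rvert}{\Phi(y)_i},$$
where $\Phi$ denotes the fixed camera response function and the sum runs over the values of the back-projected RGB images.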
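For concreteness, the three metrics can be sketched in plain Python as below. The sketch follows the usual definitions of mean relative absolute error and root mean square error and assumes strictly positive ground truth values; the function names, the flattened-list layout, and the toy two-band response in the usage example are assumptions for illustration, not the official evaluation code.

```python
import math


def mrae(prediction, ground_truth):
    """Mean relative absolute error over flattened values (ground truth > 0)."""
    return sum(abs(p - g) / g
               for p, g in zip(prediction, ground_truth)) / len(ground_truth)


def rmse(prediction, ground_truth):
    """Root mean square error over flattened values."""
    squared = sum((p - g) ** 2 for p, g in zip(prediction, ground_truth))
    return math.sqrt(squared / len(ground_truth))


def back_project(spectra, response):
    """Project per-pixel spectra to a flat list of RGB values."""
    return [sum(w * s for w, s in zip(row, spectrum))
            for spectrum in spectra
            for row in response]


def bpmrae(pred_spectra, gt_spectra, response):
    """MRAE computed after projecting both images back to RGB."""
    return mrae(back_project(pred_spectra, response),
                back_project(gt_spectra, response))


if __name__ == "__main__":
    prediction, ground_truth = [0.25, 0.25], [0.5, 0.25]
    print("MRAE:", mrae(prediction, ground_truth))
    print("RMSE:", rmse(prediction, ground_truth))
    # Toy response with 2 bands instead of 31, for readability (assumption).
    response = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print("BPMRAE:", bpmrae([prediction], [ground_truth], response))
```

Because MRAE divides by the ground truth, errors in dark regions are weighted more heavily than in RMSE.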
4.2 Comparison with other architectures
We utilize two common network architectures for comparison: U-Net and U-ResNet [30, 14]. Both have been widely used in previous low-level tasks [19, 18, 38, 22, 9, 40]. The first and last convolutional layers utilize convolution without changing the spatial resolution. The training scheme is the same for all methods. Other details are summarized as follows: (1) U-Net. The encoder layers perform convolution with a stride of 2. The spatial resolution of bottom feature map equals to. There are skip concatenations between each encoder layer and the decoder layer with the same resolution; (2) U-ResNet. The total number of encoder and decoder layers is half that of U-Net. Instead, 4 residual blocks are attached to the last layer of the encoder. The concatenations are preserved.
We train both networks with the same hyper-parameters as HRNet until convergence. No ensemble strategy is used. We generate the reconstructed spectral images using the best epoch of each network. The results are summarized in Table 2. We also visualize each method in Figures 5 and 6 with pseudo-color maps. The first three rows show the data distribution of the 3 methods and the last row shows the ground truth. We recommend that readers compare the background textures.
There are two reasons why the proposed HRNet outperforms the other two methods. The first is that HRNet utilizes PixelShuffle to connect the levels. Traditional nearest or bilinear upsampling introduces redundant information into the features, which is unnecessary for feature extraction. By combining PixelUnShuffle and PixelShuffle, HRNet can process high-level features more efficiently. The second is that HRNet adopts two residual-based blocks, which facilitate convergence and help each level exploit features at different scales. Moreover, the blocks with residual learning help remove artifacts. The residual global block enhances context information since it models the correlation between every pair of pixels.
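[Table 3: ablation study on track 1; columns: Method, w/o ResDB, w/o ResGB, w/o both, HRNet]
[Table 4: compressed HRNet variants; columns: Method and three reduced-channel HRNet settings]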
Table 5 (track 1 - clean images)

| Team | MRAE | Runtime / Image (seconds) | Compute Platform |
|---|---|---|---|
| HRNet | 0.03231183605 | 3.748 | 2 × NVIDIA Titan Xp |
| sunnyvick | 0.03516495956 | 0.7 | Tesla K80 12GB |

Table 6 (track 2 - real world images)

| Team | MRAE | Runtime / Image (seconds) | Compute Platform |
|---|---|---|---|
| HRNet | 0.06200744887 | 3.748 | 2 × NVIDIA Titan Xp |
| PARASITE | 0.06514769779 | 30 | NVIDIA Titan Xp |
4.3 Ablation study
To demonstrate the effectiveness of both the residual dense block (ResDB) and the residual global block (ResGB), we replace them with plain convolution layers of similar FLOPs. The results on track 1 - clean images are shown in Table 3. The baseline HRNet, shown in Table 1, performs better than all ablation settings. If we remove all ResDB or ResGB in HRNet, the MRAE worsens the most, which demonstrates that the combination of both blocks is significant for spectral reconstruction.
We conduct another experiment that shrinks the HRNet model size by decreasing the channels of each convolutional layer to one half, one fourth, and one eighth of the original numbers. This compresses the model greatly at the cost of pixel fidelity. To better compare these settings, we summarize the multiply–accumulate operations (MACs), total network parameters (Params), model size saved on disk (Weights), and the results of the 3 quantitative metrics in Table 4. The MACs, Params, and Weights of the baseline HRNet are 182.347 G, 31.705 M, and 123.879 MB, respectively. Users can choose the high-quality HRNet to obtain high pixel fidelity of spectral images (MRAE 0.042328) or the high-efficiency HRNet with a small size (Weights 2.410 MB).
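The effect of channel reduction on model size can be estimated with a short calculation. The sketch below counts the parameters of a single convolutional layer whose input and output widths are scaled together; the base width of 64 and the 3 × 3 kernel are hypothetical values chosen for illustration, not HRNet's actual configuration.

```python
def conv_params(c_in, c_out, kernel=3, bias=True):
    """Number of learnable parameters in one 2D convolutional layer."""
    weights = kernel * kernel * c_in * c_out
    return weights + (c_out if bias else 0)


if __name__ == "__main__":
    base_width = 64  # hypothetical channel width (assumption)
    full = conv_params(base_width, base_width)
    for divisor in (1, 2, 4, 8):
        width = base_width // divisor
        reduced = conv_params(width, width)
        print(f"width {width:2d}: {reduced:6d} params, "
              f"{full / reduced:.1f}x smaller than full width")
```

Since the weight count is proportional to the product of input and output channels, halving the width shrinks such a layer by roughly four times and an eighth of the width by roughly 64 times. The reported reduction in Weights, from 123.879 MB to 2.410 MB, is about 51 times; layers whose input or output channel count stays fixed do not shrink quadratically, which is one possible reason the whole-model ratio is below 64.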
4.4 Testing result on NTIRE 2020 challenge
The proposed HRNet ranks 3rd and 1st on track 1 and track 2, respectively, of the NTIRE 2020 Spectral Reconstruction from RGB Images Challenge. The comparison results on the testing set are summarized in Tables 5 and 6. HRNet performs better on track 2 since it adopts two effective blocks for removing artifacts and utilizes the learnable PixelShuffle upsampling operator. The ensemble strategy is clearly effective on both tracks; it improves the MRAE from 0.042328 to 0.039893 since it prevents HRNet from falling into local minima. In conclusion, both the HRNet architecture and the ensemble strategy contribute to spectral reconstruction performance.
5 Conclusion
In this paper, we presented a 4-level HRNet for automatically generating spectra from RGB images. Each level adopts both a residual dense block and a residual global block for effective feature extraction, while PixelShuffle is utilized for inter-level connection. We then proposed a novel 8-setting ensemble strategy to further enhance the quality of predicted spectral images. Finally, we validated that HRNet outperforms well-known low-level vision frameworks such as U-Net and U-ResNet on the NTIRE 2020 HS dataset. Furthermore, we presented 3 types of compressed HRNets and analyzed their reconstruction performance and computing efficiency. The proposed HRNet is the winning method of track 2 - real world images and ranks 3rd on track 1 - clean images.
References
[1] In defense of shallow learned spectral reconstruction from RGB images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 471–479.
[2] (1987) Multispectral system for medical fluorescence imaging. IEEE Journal of Quantum Electronics 23 (10), pp. 1798–1805.
[3] (2016) Sparse recovery of hyperspectral signal from natural RGB images. In Proceedings of the European Conference on Computer Vision, pp. 19–34.
[4] NTIRE 2020 challenge on spectral reconstruction from an RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[5] (2000) Geometric correction of airborne whiskbroom scanner imagery using hybrid auxiliary data. International Archives of Photogrammetry and Remote Sensing 33 (B3/1; PART 3), pp. 93–100.
[6] (2018) An efficient CNN for spectral reconstruction from RGB images. arXiv preprint arXiv:1804.04647.
[7] (2011) A prism-mask system for multispectral video acquisition. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (12), pp. 2423–2435.
[8] (2011) Statistics of real-world hyperspectral images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 193–200.
[9] (2018) Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3291–3300.
[10] (1995) Computed-tomography imaging spectrometer: experimental calibration and reconstruction results. Applied Optics 34 (22), pp. 4817–4826.
[11] (2017) Learned spectral super-resolution. arXiv preprint arXiv:1703.09470.
[12] Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
[13] (2019) Self-guided network for fast image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2511–2520.
[14] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[15] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[16] (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
[17] (2015) Multispectral pedestrian detection: benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[18] (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) 36 (4), pp. 1–14.
[19] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
[20] (2017) The one hundred layers tiramisu: fully convolutional DenseNets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11–19.
[21] (2018) 2D-3D CNN based architectures for spectral reconstruction from RGB images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 844–851.
[22] (2018) DeblurGAN: blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8183–8192.
[23] (2016) Multispectral deep neural networks for pedestrian detection. arXiv preprint arXiv:1611.02644.
[24] Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the International Conference on Machine Learning, Vol. 30, pp. 3.
[25] Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing 42 (8), pp. 1778–1790.
[26] (2014) Training-based spectral reconstruction from a single RGB image. In Proceedings of the European Conference on Computer Vision, pp. 186–201.
[27] (2008) Spatio-spectral reconstruction of the multispectral datacube using sparse recovery. In IEEE International Conference on Image Processing, pp. 473–476.
[28] (2012) Review of developments in geometric modelling for high resolution satellite pushbroom sensors. The Photogrammetric Record 27 (137), pp. 58–73.
[29] (2013) Hyperspectral and multispectral imaging for evaluating food safety and quality. Journal of Food Engineering 118 (2), pp. 157–171.
[30] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 234–241.
[31] (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883.
[32] (2018) HSCNN+: advanced CNN-based hyperspectral recovery from RGB images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 939–947.
[33] (2018) Reconstructing spectral images from RGB-images using a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 948–953.
[34] (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pp. 111–126.
[35] (2016) Adaptive nonlocal sparse representation for dual-camera compressive hyperspectral imaging. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (10), pp. 2104–2111.
[36] (2017) HSCNN: CNN-based hyperspectral image recovery from spectrally undersampled projections. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 518–525.
[37] (2010) Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing 19 (9), pp. 2241–2253.
[38] Free-form image inpainting with gated convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4471–4480.
[39] (2016) Colorful image colorization. In Proceedings of the European Conference on Computer Vision, pp. 649–666.
[40] (2019) Saliency map-aided generative adversarial network for RAW to RGB mapping. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3449–3457.