Accurate Spectral Super-resolution from Single RGB Image Using Multi-scale CNN

06/10/2018 ∙ by Yiqi Yan, et al. ∙ 0

Different from traditional hyperspectral super-resolution approaches that focus on improving the spatial resolution, spectral super-resolution aims at producing a high-resolution hyperspectral image from the RGB observation with super-resolution in spectral domain. However, it is challenging to accurately reconstruct a high-dimensional continuous spectrum from three discrete intensity values at each pixel, since too much information is lost during the procedure where the latent hyperspectral image is downsampled (e.g., with x10 scaling factor) in spectral domain to produce an RGB observation. To address this problem, we present a multi-scale deep convolutional neural network (CNN) to explicitly map the input RGB image into a hyperspectral image. Through symmetrically downsampling and upsampling the intermediate feature maps in a cascading paradigm, the local and non-local image information can be jointly encoded for spectral representation, ultimately improving the spectral reconstruction accuracy. Extensive experiments on a large hyperspectral dataset demonstrate the effectiveness of the proposed method.




1 Introduction

Hyperspectral imaging encodes the reflectance of the scene from hundreds or thousands of bands with a narrow wavelength interval (e.g., 10nm) into a hyperspectral image. Different from conventional images, each pixel in the hyperspectral image contains a continuous spectrum, thus allowing the acquisition of abundant spectral information. Such information has proven to be quite useful for distinguishing different materials. Therefore, hyperspectral images have been widely exploited to facilitate various applications in the computer vision community, such as visual tracking [20], image segmentation [18], face recognition [14], scene classification [5], and anomaly detection [10].


The acquisition of spectral information, however, comes at the cost of decreasing the spatial resolution of hyperspectral images. This is because fewer photons are captured by each detector due to the narrower width of the spectral bands. In order to maintain a reasonable signal-to-noise ratio (SNR), the instantaneous field of view (IFOV) needs to be increased, which makes it difficult to produce hyperspectral images with high spatial resolution. To address this problem, many efforts have been devoted to hyperspectral imagery super-resolution.

Most existing methods focus on enhancing the spatial resolution of the observed hyperspectral image. According to the input images, they can be divided into two categories: fusion-based methods, where a high-resolution conventional image (e.g., an RGB image) and a low-resolution hyperspectral image are fused together to produce a high-resolution hyperspectral image [22, 11], and single-image super-resolution, which directly increases the spatial resolution of a hyperspectral image [12, 24, 27, 25]. Although these methods have shown effective performance, acquiring the input hyperspectral image often requires specialized hyperspectral sensors as well as extensive imaging cost. To mitigate this problem, some recent works [4, 2, 13, 7] investigate a novel hyperspectral imagery super-resolution scheme, termed spectral super-resolution, which aims at improving the spectral resolution of a given RGB image. Since the input image can be easily captured by conventional RGB sensors, imaging cost can be greatly reduced.

However, it is challenging to accurately reconstruct a hyperspectral image from a single RGB observation, since mapping three discrete intensity values to a continuous spectrum is a highly ill-posed linear inverse problem. To address this problem, we propose to learn a complicated non-linear mapping function for spectral super-resolution with deep convolutional neural networks (CNNs). It has been shown that the 3-dimensional color vector for a specific pixel can be viewed as the downsampled observation of the corresponding spectrum. Moreover, for a candidate pixel, there often exist abundant locally and non-locally similar pixels (i.e., pixels exhibiting similar spectra) in the spatial domain. As a result, the color vectors corresponding to those similar pixels can be viewed as a group of downsampled observations of the latent spectra for the candidate pixel. Therefore, accurate spectral reconstruction requires explicitly considering both the local and non-local information from the input RGB image. To this end, we develop a novel multi-scale CNN. Our method jointly encodes the local and non-local image information through symmetrically downsampling and upsampling the intermediate feature maps in a cascading paradigm, thus enhancing the spectral reconstruction accuracy. We experimentally show that the proposed method can be easily trained in an end-to-end scheme and outperforms several state-of-the-art methods on a large hyperspectral image dataset with respect to various evaluation metrics.
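The ill-posedness mentioned above can be made concrete with a small numerical sketch. It assumes a purely hypothetical linear camera response (a random 3 x 31 matrix, not a real sensor) and a made-up spectrum, and shows that two different 31-band spectra can produce the same RGB value, so the RGB value alone cannot identify the spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical camera response: 3 RGB channels x 31 spectral bands.
# Real responses are sensor-specific; random values are for illustration only.
response = rng.uniform(0.0, 1.0, size=(3, 31))

# A made-up smooth, positive spectrum with 31 bands.
spectrum_a = 0.5 + 0.2 * np.sin(np.linspace(0.0, np.pi, 31))

# With a full SVD, the last rows of vt span the null space of `response`
# (assuming the response has full row rank 3).
_, _, vt = np.linalg.svd(response, full_matrices=True)
null_direction = vt[-1]

# Moving along a null-space direction changes the spectrum but not the RGB value.
spectrum_b = spectrum_a + 0.1 * null_direction

rgb_a = response @ spectrum_a
rgb_b = response @ spectrum_b
print(np.allclose(rgb_a, rgb_b), np.abs(spectrum_a - spectrum_b).max())
```

Because a 3 x 31 matrix has a 28-dimensional null space, additional cues such as the spatial context used by the proposed network are needed to pick a plausible spectrum.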

Our contributions are twofold:

  • We design a novel CNN architecture that is able to encode both local and non-local information for spectral reconstruction.

  • We perform extensive experiments on a large hyperspectral dataset and obtain state-of-the-art performance.

2 Related Work

This section gives a brief review of the existing spectral super-resolution methods, which can be divided into the following two categories.

Statistic based methods This line of research mainly focuses on exploiting the inherent statistical distribution of the latent hyperspectral image as priors to guide the super-resolution [26, 21]. Most of these methods involve building overcomplete dictionaries and learning sparse coding coefficients to linearly combine the dictionary atoms. For example, in [4], Arad et al. leveraged image priors to build a dictionary using K-SVD [3]. At test time, orthogonal matching pursuit [15] was used to compute a sparse representation of the input RGB image. [2] proposed a new method inspired by A+ [19], where sparse coefficients are computed by explicitly solving a sparse least-squares problem. These methods directly exploit the whole image to build the image prior, ignoring local and non-local structure information. Moreover, since the image prior is often handcrafted or heuristically designed with a shallow structure, these methods fail to generalize well in practice.

Learning based methods These methods directly learn a mapping function from the RGB image to the corresponding hyperspectral image. For example, [13] proposed a training-based method using a radial basis function network. The input data was pre-processed with a white balancing function to alleviate the influence of different illumination. The overall reconstruction accuracy is therefore affected by the performance of this pre-processing stage. Recently, given the great success of deep learning in many other ill-posed inverse problems such as image denoising [23] and single image super-resolution [6], it is natural to consider using deep networks (especially convolutional neural networks) for spectral super-resolution. In [7], Galliani et al. exploited a variant of fully convolutional DenseNets (FC-DenseNets [9]) for spectral super-resolution. However, this method is sensitive to the hyper-parameters, and its performance can still be further improved.

3 Proposed Method

In this section, we introduce the proposed multi-scale convolutional neural network in detail. First, we introduce the building blocks used in our network. Then, we describe the architecture of the proposed network.

3.1 Building Blocks

Double Conv: convolution, batch normalization, leaky ReLU, 2D dropout, convolution, batch normalization, leaky ReLU, 2D dropout
Downsample: max-pooling
Upsample: pixel shuffle
Table 1: Basic building blocks of our network

There are three basic building blocks in our network. Their structures are shown in Table 1.

Double convolution (Double Conv) block consists of two convolutions. Each of them is followed by batch normalization, leaky ReLU and dropout. We exploit batch normalization and dropout to deal with overfitting.

Downsample block contains a regular max-pooling layer. It reduces the spatial size of the feature map and enlarges the receptive field of the network.

Upsample block is utilized to upsample the feature map in the spatial domain. Many previous works adopt the transposed convolution for this purpose. However, it is prone to generating checkerboard artifacts. To address this problem, we use the pixel shuffle operation [17], which has been shown to alleviate checkerboard artifacts. In addition, since pixel shuffle does not introduce any learnable parameters, it also helps improve the robustness against overfitting.
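The two parameter-free resampling operations can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the 2x2 pooling window and the upscale factor r = 2 are assumptions, and the channel ordering follows the common sub-pixel convention of [17].

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2 on a (C, H, W) array; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def pixel_shuffle(x, r=2):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r) without any learnable weights."""
    c_in, h, w = x.shape
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)        # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)      # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)   # out[c, h*r+i, w*r+j] = in[c*r*r+i*r+j, h, w]

features = np.arange(4 * 2 * 3, dtype=np.float32).reshape(4, 2, 3)
upsampled = pixel_shuffle(features, r=2)   # shape (1, 4, 6)
pooled = max_pool_2x2(upsampled)           # shape (1, 2, 3)
print(upsampled.shape, pooled.shape)
```

Pixel shuffle only moves values from the channel axis to the spatial axes, which is why it adds no parameters: each output 2x2 block is filled from four input channels at the same spatial location.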

Figure 1: Diagram of the proposed method. “Conv ” represents convolutional layers with an output of feature maps. We use convolution in green blocks and convolution in the red block. Gray arrows represent feature concatenation.

3.2 Network Architecture

Our method is inspired by the well-known U-Net architecture for image segmentation [16]. The overall architecture of the proposed multi-scale convolutional neural network is depicted in Figure 1. The network follows the encoder-decoder pattern. For the encoder part, each downsampling step consists of a “Double Conv” block followed by a downsample block. The spatial size is progressively reduced, and the number of features is doubled at each step. The decoder is symmetric to the encoder path. Every step in the decoder path consists of an upsampling operation followed by a “Double Conv” block. The spatial size of the features is recovered, while the number of features is halved at every step. Finally, a convolution maps the output features to the reconstructed 31-channel hyperspectral image. In addition to the feedforward path, skip connections are used to concatenate the corresponding feature maps of the encoder and decoder.

Our method naturally fits the task of spectral reconstruction. The encoder can be interpreted as extracting features from RGB images. Through cascaded downsampling, the receptive field of the network is progressively enlarged, which allows the network to “see” more pixels in an increasingly larger field of view. By doing so, both the local and non-local information can be encoded to better represent the latent spectra. The symmetric decoder is employed to reconstruct the latent hyperspectral images based on these deep and compact features. The skip connections with concatenation are essential for introducing multi-scale information and yielding better estimation of the spectra.
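To make the encoder-decoder bookkeeping easier to follow, the sketch below traces feature-map shapes through a U-Net-like network built from the blocks described above. It is a hedged illustration only: `base_channels`, `depth`, the 2x pooling factor, and the pixel-shuffle factor of 2 are assumptions chosen for the example, and `trace_shapes` is a hypothetical helper, not part of any released code.

```python
def trace_shapes(height, width, base_channels=64, depth=3, out_bands=31):
    """Return (stage, channels, height, width) tuples for a U-Net-like network.

    base_channels and depth are illustrative assumptions, not values from the paper.
    """
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    stages = [("input", 3, height, width)]
    channels, h, w = base_channels, height, width
    skips = []

    # Encoder: Double Conv, then max-pooling halves the spatial size.
    for level in range(depth):
        stages.append((f"enc{level}_double_conv", channels, h, w))
        skips.append(channels)
        h, w = h // 2, w // 2
        stages.append((f"enc{level}_downsample", channels, h, w))
        channels *= 2  # the next Double Conv doubles the feature count

    stages.append(("bottleneck_double_conv", channels, h, w))

    # Decoder: pixel shuffle (r=2), concatenate the skip, then Double Conv.
    for level in reversed(range(depth)):
        channels //= 4            # pixel shuffle trades 4x channels for 2x size
        h, w = h * 2, w * 2
        stages.append((f"dec{level}_upsample", channels, h, w))
        channels += skips[level]  # skip connection by concatenation
        stages.append((f"dec{level}_concat", channels, h, w))
        channels = skips[level]   # Double Conv outputs half the features of the level below
        stages.append((f"dec{level}_double_conv", channels, h, w))

    stages.append(("output_conv", out_bands, h, w))
    return stages

for stage in trace_shapes(64, 64, base_channels=64, depth=2):
    print(stage)
```

Under these assumptions the output has the same spatial size as the input and 31 channels, and every decoder level sees both upsampled coarse features and the same-scale encoder features.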

4 Experiments

4.1 Datasets

In this study, all experiments are performed on the NTIRE2018 dataset [1]. This dataset is extended from the ICVL dataset [4]. The ICVL dataset includes images captured using Specim PS Kappa DX4 hyperspectral camera. Each image is of size in spatial resolution and contains spectral bands in the range of . In experiments, successive bands ranging from with interval are extracted from each image for evaluation. In the NTIRE2018 challenge, this dataset is further extended by supplementing extra images of the same spatial and spectral resolution. As a result, high-resolution hyperspectral images are collected as the training data. In addition, another hyperspectral images are further introduced as the test set. In the NTIRE2018 dataset, the corresponding RGB rendition is also provided for each image. In the following, we will employ the RGB-hyperspectral image pairs to evaluate the proposed method.

RMSE1
Method         BGU_00257  BGU_00259  BGU_00261  BGU_00263  BGU_00265  Average
Interpolation  1.8622     1.7198     2.8419     1.3657     1.9376     1.9454
Arad           1.7930     1.4700     1.6592     1.8987     1.2559     1.6154
A+             1.3054     1.3572     1.3659     1.4884     0.9769     1.2988
Galliani       0.7330     0.7922     0.8606     0.5786     0.8276     0.7584
Ours           0.6172     0.6865     0.9425     0.5049     0.8375     0.7177

RMSE2
Method         BGU_00257  BGU_00259  BGU_00261  BGU_00263  BGU_00265  Average
Interpolation  3.0774     2.9878     4.1453     2.0874     3.9522     3.2500
Arad           3.4618     2.3534     2.6236     2.5750     2.0169     2.6061
A+             2.1911     1.9572     1.9364     2.0488     1.3344     1.8936
Galliani       1.2381     1.2077     1.2577     0.8381     1.6810     1.2445
Ours           0.9768     1.3417     1.6035     0.7396     1.7879     1.2899

rRMSE1
Method         BGU_00257  BGU_00259  BGU_00261  BGU_00263  BGU_00265  Average
Interpolation  0.0658     0.0518     0.0732     0.0530     0.0612     0.0610
Arad           0.0807     0.0627     0.0624     0.0662     0.0560     0.0656
A+             0.0580     0.0589     0.0612     0.0614     0.0457     0.0570
Galliani       0.0261     0.0268     0.0254     0.0237     0.0289     0.0262
Ours           0.0235     0.0216     0.0230     0.0205     0.0278     0.0233

rRMSE2
Method         BGU_00257  BGU_00259  BGU_00261  BGU_00263  BGU_00265  Average
Interpolation  0.1058     0.0933     0.1103     0.0759     0.1338     0.1038
Arad           0.1172     0.0809     0.0819     0.0685     0.0733     0.0844
A+             0.0580     0.0589     0.0612     0.0614     0.0457     0.0610
Galliani       0.0453     0.0372     0.0331     0.0317     0.0562     0.0407
Ours           0.0357     0.0413     0.0422     0.0280     0.0598     0.0414

SAM
Method         BGU_00257  BGU_00259  BGU_00261  BGU_00263  BGU_00265  Average
Interpolation  3.9620     3.0304     4.2962     3.1900     3.9281     3.6813
Arad           4.2667     3.7279     3.4726     3.3912     3.3699     3.6457
A+             3.2952     3.5812     3.2952     3.0256     3.2952     3.2985
Galliani       1.4725     1.5013     1.4802     1.4844     1.8229     1.5523
Ours           1.3305     1.2458     1.7197     1.1360     1.9046     1.4673

Table 2: Quantitative results on each test image.
Figure 2: Sample results of spectral reconstruction by our method. Top row: RGB rendition. Bottom row: ground truth (solid) and reconstructed (dashed) spectral response of four pixels identified by the dots in the RGB images.

4.2 Comparison Methods & Implementation Details

To demonstrate the effectiveness of the proposed method, we compare it with four spectral super-resolution methods: spline interpolation, the sparse recovery method in [4] (Arad et al.), A+ [2], and the deep learning method in [7] (Galliani et al.). The methods in [4, 2] are run using the code released by the authors. Since no code has been released for [7], we reimplement it in this study. In the following, we give the implementation details of each method.

Spline interpolation The interpolation algorithm serves as the most primitive baseline in this study. Specifically, for each RGB pixel , we use spline interpolation to upsample it and obtain a -dimensional spectrum (). According to the visible spectrum, the , , values of an RGB pixel are assigned to , , and , respectively.

Arad et al. and A+ The low spectral resolution image is assumed to be a directly downsampled version of the corresponding hyperspectral image using some specific linear projection matrix. In [4, 2], this matrix is required to be perfectly known. In our experiments, we fit the projection matrix using training data with conventional linear regression.
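Fitting the projection matrix by linear regression can be sketched as an ordinary least-squares problem. The example below is an assumption-laden illustration: `fit_projection` is a hypothetical helper, and the data are synthetic (a made-up 31 x 3 projection applied to random spectra), not the NTIRE2018 images.

```python
import numpy as np

def fit_projection(spectra, rgb):
    """Least-squares fit of M in spectra @ M ~= rgb.

    spectra: (num_pixels, num_bands); rgb: (num_pixels, 3).
    Returns M with shape (num_bands, 3).
    """
    projection, *_ = np.linalg.lstsq(spectra, rgb, rcond=None)
    return projection

# Synthetic check with a made-up projection matrix (not real camera data).
rng = np.random.default_rng(0)
true_projection = rng.uniform(0.0, 1.0, size=(31, 3))
train_spectra = rng.uniform(0.0, 1.0, size=(500, 31))
train_rgb = train_spectra @ true_projection

fitted = fit_projection(train_spectra, train_rgb)
print(np.abs(fitted - true_projection).max())
```

With noiseless synthetic data the fit recovers the projection up to floating-point error; with real image pairs the residual would reflect noise and any non-linearity in the RGB rendering.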

Figure 3: Training and test curves. (a) Training curve; (b) to (f) test curves.

Galliani et al. and our method We experimentally find the optimal set of hyper-parameters for both methods. dropout is applied to Galliani et al., while our method utilizes dropout rate. All the leaky ReLU activation functions are applied with a negative slope of 0.2. We train the networks for 100 epochs using the Adam optimizer with regularization. Weight initialization and learning rate vary between methods. For Galliani et al., the weights are initialized via HeUniform [8], and the learning rate is set to for the first 50 epochs, decayed to for the next 50 epochs. As for our method, we use HeNormal initialization [8]. The initial learning rate is and is multiplied by 0.93 every 10 epochs. We perform data augmentation by extracting patches of size with a stride of 40 pixels from training data. The total amount of training samples is over . At the test phase, we directly feed the whole image to the network and get the estimated hyperspectral image in one single forward pass.
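The step-wise learning-rate decay and the strided patch extraction described above can be sketched as follows. The decay factor 0.93, the 10-epoch step, and the stride of 40 come from the text; `initial_lr`, `patch_size`, and the image size used in the example are placeholders, and both helper functions are hypothetical.

```python
def step_decay_lr(epoch, initial_lr, factor=0.93, step=10):
    """Learning rate after multiplying by `factor` once every `step` epochs."""
    return initial_lr * factor ** (epoch // step)

def patch_origins(height, width, patch_size, stride=40):
    """Top-left corners of all full patches extracted with the given stride."""
    rows = range(0, height - patch_size + 1, stride)
    cols = range(0, width - patch_size + 1, stride)
    return [(r, c) for r in rows for c in cols]

# initial_lr, patch_size, and the image size are placeholders, not the paper's values.
schedule = [step_decay_lr(e, initial_lr=1.0) for e in (0, 9, 10, 25, 99)]
origins = patch_origins(height=100, width=140, patch_size=60)
print(schedule)
print(len(origins), origins[0], origins[-1])
```

Because the stride is smaller than the patch size in this example, neighbouring patches overlap, which is what lets a modest number of images yield a large number of training samples.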

4.3 Evaluation Metrics

To quantitatively evaluate the performance of the proposed method, we adopt the following two categories of evaluation metrics.

Pixel-level reconstruction error We follow [2] in using the absolute and relative root-mean-square error (RMSE and rRMSE) as quantitative measurements of reconstruction accuracy. Let $I_h^{(i)}$ and $I_e^{(i)}$ denote the $i$-th element of the real and estimated hyperspectral images, let $\bar{I}_h$ be the average of $I_h$, and let $n$ be the total number of elements in one hyperspectral image. Two formulas are used for each of RMSE and rRMSE.
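As a hedged illustration of what two variants per metric could look like, the sketch below implements one plausible reading: variant 1 averages per-element errors, while variant 2 takes the root of the averaged squared error. These definitions and the function names are assumptions for illustration and may differ from the exact formulas used in the experiments.

```python
import numpy as np

def rmse_1(gt, est):
    """Mean of per-element absolute errors, i.e. mean(sqrt((gt - est)**2))."""
    return float(np.mean(np.abs(gt - est)))

def rmse_2(gt, est):
    """Root of the mean squared error."""
    return float(np.sqrt(np.mean((gt - est) ** 2)))

def rrmse_1(gt, est):
    """Per-element absolute error divided by the ground-truth value, then averaged."""
    return float(np.mean(np.abs(gt - est) / gt))

def rrmse_2(gt, est):
    """Root of the mean squared error, normalised by the squared ground-truth mean."""
    return float(np.sqrt(np.mean((gt - est) ** 2) / np.mean(gt) ** 2))

# Tiny made-up example: errors of 0, 2, 0, 2 on ground-truth values 1, 2, 3, 4.
gt = np.array([[1.0, 2.0], [3.0, 4.0]])
est = gt + np.array([[0.0, 2.0], [0.0, 2.0]])
print(rmse_1(gt, est), rmse_2(gt, est), rrmse_1(gt, est), rrmse_2(gt, est))
```

The second variant penalizes large individual errors more heavily than the first, so the two can rank methods differently on the same image.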

Spectral similarity Since the key to spectral super-resolution is reconstructing the spectra, we also use the spectral angle mapper (SAM) to evaluate the performance of different methods. SAM calculates the average spectral angle between the spectra of the real and estimated hyperspectral images. Let $p_h^{(j)}, p_e^{(j)} \in \mathbb{R}^{C}$ represent the spectra of the $j$-th hyperspectral pixel in the real and estimated hyperspectral images ($C$ is the number of bands), and let $m$ be the total number of pixels within an image. The SAM value can be computed as follows.

$$\mathrm{SAM} = \frac{1}{m} \sum_{j=1}^{m} \arccos\left( \frac{(p_h^{(j)})^{\top} p_e^{(j)}}{\|p_h^{(j)}\|_2 \, \|p_e^{(j)}\|_2} \right)$$
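A minimal NumPy sketch of the spectral angle mapper follows. It assumes the spectra are stored along the last axis and returns the angle in radians; whether the reported values use radians or degrees is not stated here, so the unit is an assumption, as is the small `eps` guard against zero-norm spectra.

```python
import numpy as np

def spectral_angle_mapper(gt, est, eps=1e-12):
    """Average angle in radians between spectra; gt and est have shape (..., bands)."""
    gt = gt.reshape(-1, gt.shape[-1])
    est = est.reshape(-1, est.shape[-1])
    dots = np.sum(gt * est, axis=1)
    norms = np.linalg.norm(gt, axis=1) * np.linalg.norm(est, axis=1)
    # Clip to the valid arccos domain to absorb floating-point round-off.
    cosines = np.clip(dots / np.maximum(norms, eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cosines)))

# Two made-up 2-band pixels: one orthogonal pair, one pair differing only in scale.
gt = np.array([[1.0, 0.0], [1.0, 1.0]])
est = np.array([[0.0, 1.0], [2.0, 2.0]])
angle = spectral_angle_mapper(gt, est)
print(angle)
```

The second pixel contributes an angle of zero even though its magnitude is doubled, which illustrates why SAM measures the shape of a spectrum rather than its intensity and can disagree with RMSE.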

4.4 Experimental Results

Convergence Analysis We plot the loss curve on the training set and the curves of the five evaluation metrics computed on the test set in Figure 3. It can be seen that both the training loss and the metric values gradually decrease and ultimately converge as training proceeds. This demonstrates that the proposed multi-scale convolutional neural network converges well.

Quantitative Results Table 2 provides the quantitative results of our method and all baseline methods. It can be seen that our model outperforms all competitors with regard to RMSE1 and rRMSE1, and produces results comparable to Galliani et al. on RMSE2 and rRMSE2. More importantly, our method surpasses all the others with respect to the spectral angle mapper. This indicates that our model reconstructs spectra more accurately than the other competitors. It is worth pointing out that reconstruction error (absolute and relative RMSE) is not necessarily positively correlated with the spectral angle mapper (SAM). For example, when the pixels of an image are shuffled, RMSE and rRMSE will remain the same, while SAM will change completely. According to the results in Table 2, our network enhances spectral super-resolution from both aspects, i.e., it yields better results on both average root-mean-square error and spectral angle similarity.

Figure 4: Visualization of absolute reconstruction error. From left to right: RGB rendition, A+, Galliani et al., and our method.

Visual Results To further illustrate the superiority in reconstruction accuracy, we show the absolute reconstruction error of test images in Figure 4. The error is summarized over all bands of the hyperspectral image. Since A+ outperforms Arad et al. on every evaluation metric, we use A+ to represent the sparse coding methods. It can be seen that our method yields smoother reconstructed images as well as lower reconstruction error than the other competitors.

In addition, we randomly choose three test images and plot the real and reconstructed spectra for four pixels in Figure 2 to further demonstrate the effectiveness of the proposed method in spectrum reconstruction. It can be seen that only slight differences exist between the reconstructed spectra and the ground truth.

From the results above, we conclude that the proposed method is effective in spectral super-resolution and outperforms several state-of-the-art competitors.

5 Conclusion

In this study, we show that leveraging both the local and non-local information of input images is essential for accurate spectral reconstruction. Following this idea, we design a novel multi-scale convolutional neural network, which employs a symmetrically cascaded downsampling-upsampling architecture to jointly encode the local and non-local image information for spectral reconstruction. In extensive experiments on a large hyperspectral image dataset, the proposed method outperforms several state-of-the-art methods in terms of reconstruction accuracy and spectral similarity.

6 Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (No. 61671385, 61571354), Natural Science Basis Research Plan in Shaanxi Province of China (No. 2017JM6021, 2017JM6001) and China Postdoctoral Science Foundation under Grant (No. 158201).


References

  • [1] NTIRE 2018 challenge on spectral reconstruction from rgb images.
  • [2] Aeschbacher, J., Wu, J., Timofte, R.: In defense of shallow learned spectral reconstruction from rgb images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 471–479 (2017)

  • [3] Aharon, M., Elad, M., Bruckstein, A.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing 54(11), 4311–4322 (2006)
  • [4] Arad, B., Ben-Shahar, O.: Sparse recovery of hyperspectral signal from natural rgb images. In: European Conference on Computer Vision. pp. 19–34. Springer (2016)
  • [5] Cheng, G., Yang, C., Yao, X., Guo, L., Han, J.: When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative cnns. IEEE Transactions on Geoscience and Remote Sensing (2018)
  • [6] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38(2), 295–307 (2016)
  • [7] Galliani, S., Lanaras, C., Marmanis, D., Baltsavias, E., Schindler, K.: Learned spectral super-resolution. CoRR abs/1703.09470 (2017)
  • [8] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision. pp. 1026–1034 (2015)

  • [9] Jégou, S., Drozdzal, M., Vazquez, D., Romero, A., Bengio, Y.: The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. pp. 1175–1183. IEEE (2017)
  • [10] Kang, X., Zhang, X., Li, S., Li, K., Li, J., Benediktsson, J.A.: Hyperspectral anomaly detection with attribute and edge-preserving filters. IEEE Transactions on Geoscience and Remote Sensing 55(10), 5600–5611 (2017)
  • [11] Loncan, L., de Almeida, L.B., Bioucas-Dias, J.M., Briottet, X., Chanussot, J., Dobigeon, N., Fabre, S., Liao, W., Licciardi, G.A., Simoes, M., et al.: Hyperspectral pansharpening: A review. IEEE Geoscience and remote sensing magazine 3(3), 27–46 (2015)
  • [12] Mei, S., Yuan, X., Ji, J., Zhang, Y., Wan, S., Du, Q.: Hyperspectral image spatial super-resolution via 3d full convolutional neural network. Remote Sensing 9(11),  1139 (2017)
  • [13] Nguyen, R.M., Prasad, D.K., Brown, M.S.: Training-based spectral reconstruction from a single rgb image. In: European Conference on Computer Vision. pp. 186–201. Springer (2014)
  • [14] Pan, Z., Healey, G., Prasad, M., Tromberg, B.: Face recognition in hyperspectral images. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12), 1552–1560 (2003)
  • [15] Pati, Y.C., Rezaiifar, R., Krishnaprasad, P.S.: Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In: Signals, Systems and Computers, 1993. 1993 Conference Record of The Twenty-Seventh Asilomar Conference on. pp. 40–44. IEEE (1993)
  • [16] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
  • [17] Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1874–1883 (2016)
  • [18] Tarabalka, Y., Chanussot, J., Benediktsson, J.A.: Segmentation and classification of hyperspectral images using watershed transformation. Pattern Recognition 43(7), 2367–2379 (2010)
  • [19] Timofte, R., De Smet, V., Van Gool, L.: A+: Adjusted anchored neighborhood regression for fast super-resolution. In: Asian Conference on Computer Vision. pp. 111–126. Springer (2014)
  • [20] Van Nguyen, H., Banerjee, A., Chellappa, R.: Tracking via object reflectance using a hyperspectral video camera. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. pp. 44–51. IEEE (2010)
  • [21] Yan, Q., Sun, J., Li, H., Zhu, Y., Zhang, Y.: High dynamic range imaging by sparse representation. Neurocomputing 269, 160–169 (2017)
  • [22] Yokoya, N., Grohnfeldt, C., Chanussot, J.: Hyperspectral and multispectral data fusion: A comparative review of the recent literature. IEEE Geoscience and Remote Sensing Magazine 5(2), 29–56 (2017)
  • [23] Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26(7), 3142–3155 (2017)
  • [24] Zhang, L., Wang, P., Shen, C., Liu, L., Wei, W., Zhang, Y., Hengel, A.v.d.: Adaptive importance learning for improving lightweight image super-resolution network. arXiv preprint arXiv:1806.01576 (2018)
  • [25] Zhang, L., Wei, W., Bai, C., Gao, Y., Zhang, Y.: Exploiting clustering manifold structure for hyperspectral imagery super-resolution. IEEE Transactions on Image Processing (2018)
  • [26] Zhang, L., Wei, W., Shi, Q., Shen, C., Hengel, A.v.d., Zhang, Y.: Beyond low rank: A data-adaptive tensor completion method. arXiv preprint arXiv:1708.01008 (2017)
  • [27] Zhang, L., Wei, W., Zhang, Y., Shen, C., van den Hengel, A., Shi, Q.: Cluster sparsity field: An internal hyperspectral imagery prior for reconstruction. International Journal of Computer Vision pp. 1–25 (2018)