Spectral Super-resolution from Single RGB Image Using Multi-scale CNN
Unlike traditional hyperspectral super-resolution approaches, which focus on improving spatial resolution, spectral super-resolution aims to produce a high-resolution hyperspectral image from an RGB observation by super-resolving in the spectral domain. However, it is challenging to accurately reconstruct a high-dimensional continuous spectrum from three discrete intensity values at each pixel, since too much information is lost when the latent hyperspectral image is downsampled in the spectral domain (e.g., by a factor of 10) to produce an RGB observation. To address this problem, we present a multi-scale deep convolutional neural network (CNN) that explicitly maps the input RGB image to a hyperspectral image. By symmetrically downsampling and upsampling the intermediate feature maps in a cascading paradigm, the local and non-local image information can be jointly encoded for spectral representation, ultimately improving the spectral reconstruction accuracy. Extensive experiments on a large hyperspectral dataset demonstrate the effectiveness of the proposed method.
Hyperspectral imaging encodes the reflectance of a scene in hundreds or thousands of bands with a narrow wavelength interval (e.g., 10 nm) into a hyperspectral image. Unlike conventional images, each pixel in a hyperspectral image contains a continuous spectrum, thus allowing the acquisition of abundant spectral information. Such information has proven quite useful for distinguishing different materials. Therefore, hyperspectral images have been widely exploited in the computer vision community for applications such as visual tracking, image segmentation, and anomaly detection [14, 5].
The acquisition of spectral information, however, comes at the cost of decreasing the spatial resolution of hyperspectral images. This is because fewer photons are captured by each detector due to the narrower width of the spectral bands. In order to maintain a reasonable signal-to-noise ratio (SNR), the instantaneous field of view (IFOV) needs to be increased, which makes it difficult to produce hyperspectral images with high spatial resolution. To address this problem, many efforts have been devoted to hyperspectral imagery super-resolution.
Most existing methods focus on enhancing the spatial resolution of the observed hyperspectral image. According to their input images, they can be divided into two categories: fusion-based methods, where a high-resolution conventional image (e.g., an RGB image) and a low-resolution hyperspectral image are fused together to produce a high-resolution hyperspectral image [22, 11], and single image super-resolution, which directly increases the spatial resolution of a hyperspectral image [12, 24, 27, 25]. Although these methods have shown effective performance, acquiring the input hyperspectral image often requires specialized hyperspectral sensors as well as extensive imaging cost. To mitigate this problem, some recent works [4, 2, 13, 7] investigate a novel hyperspectral imagery super-resolution scheme, termed spectral super-resolution, which aims at improving the spectral resolution of a given RGB image. Since the input image can be easily captured by conventional RGB sensors, the imaging cost can be greatly reduced.
However, it is challenging to accurately reconstruct a hyperspectral image from a single RGB observation, since mapping three discrete intensity values to a continuous spectrum is a highly ill-posed linear inverse problem. To address this problem, we propose to learn a complicated non-linear mapping function for spectral super-resolution with deep convolutional neural networks (CNN). It has been shown that the 3-dimensional color vector of a specific pixel can be viewed as a downsampled observation of the corresponding spectrum. Moreover, for a candidate pixel, there often exist abundant locally and non-locally similar pixels (i.e., pixels exhibiting similar spectra) in the spatial domain. As a result, the color vectors corresponding to those similar pixels can be viewed as a group of downsampled observations of the latent spectrum of the candidate pixel. Therefore, accurate spectral reconstruction requires explicitly considering both the local and non-local information from the input RGB image. To this end, we develop a novel multi-scale CNN. Our method jointly encodes the local and non-local image information by symmetrically downsampling and upsampling the intermediate feature maps in a cascading paradigm, thus enhancing the spectral reconstruction accuracy. We experimentally show that the proposed method can be easily trained in an end-to-end scheme and beats several state-of-the-art methods on a large hyperspectral image dataset with respect to various evaluation metrics.
Our contributions are twofold:
We design a novel CNN architecture that is able to encode both local and non-local information for spectral reconstruction.
We perform extensive experiments on a large hyperspectral dataset and obtain state-of-the-art performance.
This section gives a brief review of the existing spectral super-resolution methods, which can be divided into the following two categories.
Statistics-based methods This line of research mainly focuses on exploiting the inherent statistical distribution of the latent hyperspectral image as a prior to guide the super-resolution [26, 21]. Most of these methods involve building overcomplete dictionaries and learning sparse coding coefficients to linearly combine the dictionary atoms. For example, Arad et al. leveraged image priors to build a dictionary using K-SVD. At test time, orthogonal matching pursuit was used to compute a sparse representation of the input RGB image. Aeschbacher et al. proposed a new method inspired by A+, where sparse coefficients are computed by explicitly solving a sparse least squares problem. These methods directly exploit the whole image to build the image prior, ignoring local and non-local structure information. Moreover, since the image prior is often handcrafted or heuristically designed with a shallow structure, these methods fail to generalize well in practice.
Learning-based methods These methods directly learn a mapping function from the RGB image to the corresponding hyperspectral image. For example, one training-based method uses a radial basis function network. The input data is pre-processed with a white balancing function to alleviate the influence of different illumination, so the overall reconstruction accuracy depends on the performance of this pre-processing stage. Recently, given the great success of deep learning in many other ill-posed inverse problems such as image denoising and single image super-resolution, it is natural to consider using deep networks (especially convolutional neural networks) for spectral super-resolution. Galliani et al. exploited a variant of fully convolutional DenseNets (FC-DenseNets) for spectral super-resolution. However, this method is sensitive to its hyper-parameters, and its performance can still be improved.
In this section, we introduce the proposed multi-scale convolutional neural network in detail. First, we introduce the building blocks used in our network. Then, we describe the architecture of the proposed network.
There are three basic building blocks in our network. Their structures are shown in Table 1.
Double convolution (Double Conv) block consists of two convolutions. Each of them is followed by batch normalization, leaky ReLU and dropout. We exploit batch normalization and dropout to deal with overfitting.
Downsample block contains a regular max-pooling layer. It reduces the spatial size of the feature map and enlarges the receptive field of the network.
Upsample block is utilized to upsample the feature map in the spatial domain. Much previous literature adopts the transposed convolution for this purpose. However, it is prone to generating checkerboard artifacts. To address this problem, we use the pixel shuffle operation, which has been shown to alleviate checkerboard artifacts. In addition, because it introduces no learnable parameters, pixel shuffle also helps improve robustness against overfitting.
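The rearrangement performed by pixel shuffle can be illustrated with a minimal sketch in plain Python. It assumes the commonly used channel ordering, in which each group of r*r input channels forms one output channel; the function name `pixel_shuffle` and the nested-list layout are illustrative choices, not taken from the paper's implementation.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list.

    Assumed convention: out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w].
    """
    c_in = len(x)
    h, w = len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # The sub-pixel offset (i % r, j % r) selects the source channel.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out


# Four 1x1 channels become one 2x2 channel.
print(pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2))  # [[[0, 1], [2, 3]]]
```

The operation only moves values between the channel and spatial axes, which is why it adds no learnable parameters.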
Our method is inspired by the well-known U-Net architecture for image segmentation. The overall architecture of the proposed multi-scale convolutional neural network is depicted in Figure 1. The network follows the encoder-decoder pattern. In the encoder, each downsampling step consists of a “Double Conv” block with a downsample block. The spatial size is progressively reduced, and the number of features is doubled at each step. The decoder is symmetric to the encoder path. Every step in the decoder path consists of an upsampling operation followed by a “Double Conv” block. The spatial size of the features is recovered, while the number of features is halved at every step. Finally, a convolution maps the output features to the reconstructed 31-channel hyperspectral image. In addition to the feedforward path, skip connections are used to concatenate the corresponding feature maps of the encoder and decoder.
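The doubling and halving pattern described above can be traced with a small sketch. The base channel count, the network depth, and the function name `trace_shapes` are assumptions made for illustration; only the 3-channel input, the 31-channel output, and the doubling/halving rule come from the text. Input sizes are assumed to be divisible by 2 raised to the depth.

```python
def trace_shapes(height, width, base_channels=64, depth=3, out_bands=31):
    """Return (stage, channels, height, width) tuples for an encoder-decoder.

    base_channels and depth are hypothetical defaults, not values from the paper.
    """
    shapes = [("input", 3, height, width)]
    ch, h, w = base_channels, height, width
    shapes.append(("enc0", ch, h, w))
    for d in range(1, depth + 1):
        h, w = h // 2, w // 2   # downsample block halves the spatial size
        ch *= 2                 # number of features doubles at each step
        shapes.append((f"enc{d}", ch, h, w))
    for d in range(depth - 1, -1, -1):
        h, w = h * 2, w * 2     # upsample block restores the spatial size
        ch //= 2                # number of features halves at each step
        shapes.append((f"dec{d}", ch, h, w))
    shapes.append(("output", out_bands, h, w))
    return shapes


for stage in trace_shapes(64, 64, base_channels=16, depth=2):
    print(stage)
```

Each decoder stage `dec{d}` has the same shape as the encoder stage `enc{d}`, which is what allows the skip connections to concatenate the two feature maps.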
Our method naturally fits the task of spectral reconstruction. The encoder can be interpreted as extracting features from RGB images. Through cascaded downsampling, the receptive field of the network steadily increases, which allows the network to “see” more pixels in an increasingly larger field of view. In this way, both the local and non-local information can be encoded to better represent the latent spectra. The symmetric decoder is employed to reconstruct the latent hyperspectral images from these deep and compact features. The skip connections with concatenation are essential for introducing multi-scale information and yielding a better estimate of the spectra.
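How quickly the receptive field grows with cascaded downsampling can be estimated with the standard receptive-field recurrence. The sketch below assumes 3x3 convolutions, two convolutions per block, and 2x2 max-pooling with stride 2; these sizes and the function name `receptive_field` are assumptions, since the text does not state them.

```python
def receptive_field(depth, kernel=3, convs_per_block=2):
    """Receptive field (in input pixels) after each encoder level.

    Assumes stride-1 convolutions and 2x2, stride-2 pooling between levels.
    """
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent features
    sizes = []
    for level in range(depth + 1):
        for _ in range(convs_per_block):
            rf += (kernel - 1) * jump
        sizes.append(rf)
        if level < depth:
            rf += jump       # a 2x2 pooling window adds (2 - 1) * jump
            jump *= 2        # stride 2 doubles the spacing of later features
    return sizes


print(receptive_field(depth=2))  # [5, 14, 32]
```

Under these assumptions the field of view grows much faster with depth than it would with the same number of convolutions and no pooling, which matches the intuition that deeper levels capture non-local context.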
In this study, all experiments are performed on the NTIRE2018 dataset, which is extended from the ICVL dataset. The ICVL dataset includes images captured using a Specim PS Kappa DX4 hyperspectral camera. Each image is of size in spatial resolution and contains spectral bands in the range of . In experiments, successive bands ranging from with interval are extracted from each image for evaluation. In the NTIRE2018 challenge, this dataset is further extended by supplementing extra images of the same spatial and spectral resolution. As a result, high-resolution hyperspectral images are collected as the training data. In addition, another hyperspectral images are further introduced as the test set. In the NTIRE2018 dataset, the corresponding RGB rendition is also provided for each image. In the following, we employ the RGB-hyperspectral image pairs to evaluate the proposed method.
To demonstrate the effectiveness of the proposed method, we compare it with four spectral super-resolution methods: spline interpolation, the sparse recovery method of Arad et al., A+, and the deep learning method of Galliani et al. The methods of [4, 2] are run using the code released by their authors. Since no code has been released for the method of Galliani et al., we reimplement it in this study. In the following, we give the implementation details of each method.
Spline interpolation The interpolation algorithm serves as the most primitive baseline in this study. Specifically, for each RGB pixel , we use spline interpolation to upsample it and obtain a -dimensional spectrum (). According to the visible spectrum (http://www.gamonline.com/catalog/colortheory/visible.php), the R, G, and B values of an RGB pixel are assigned to , , and , respectively.
Arad and A+ In the original implementations of these methods, the projection matrix is required to be perfectly known. In our experiments, we fit the projection matrix using training data with conventional linear regression.
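Fitting a projection matrix by linear regression can be sketched as an ordinary least-squares problem solved through the normal equations. The sketch assumes the matrix maps a spectrum to an RGB triple (rgb ≈ M · spectrum); the function name `fit_projection`, the plain-Python solver, and the toy data are illustrative assumptions rather than the procedure actually used in the experiments.

```python
def fit_projection(spectra, rgbs):
    """Least-squares fit of a 3 x B matrix M such that rgb ≈ M @ spectrum.

    spectra: list of N spectra, each a list of B floats.
    rgbs:    list of N RGB triples paired with the spectra.
    """
    b = len(spectra[0])
    # Normal equations: (S^T S) M^T = S^T R.
    gram = [[sum(s[i] * s[j] for s in spectra) for j in range(b)] for i in range(b)]
    rhs = [[sum(s[i] * c[k] for s, c in zip(spectra, rgbs)) for k in range(3)]
           for i in range(b)]
    aug = [gram[i] + rhs[i] for i in range(b)]
    # Gauss-Jordan elimination with partial pivoting on the augmented system.
    for col in range(b):
        pivot = max(range(col, b), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular; add more training pixels")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for row in range(b):
            if row != col:
                f = aug[row][col]
                aug[row] = [vr - f * vc for vr, vc in zip(aug[row], aug[col])]
    # Column b + k of row i now holds M[k][i].
    return [[aug[i][b + k] for i in range(b)] for k in range(3)]


# Toy check with B = 3 bands and a known matrix.
true_m = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
spectra = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 2.0, 3.0]]
rgbs = [[sum(m * v for m, v in zip(row, s)) for row in true_m] for s in spectra]
print(fit_projection(spectra, rgbs))
```

With real data, a library least-squares routine would normally be preferred over explicit normal equations for numerical stability.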
Galliani and our method We experimentally find the optimal set of hyper-parameters for both methods. Dropout is applied to Galliani et al., while our method utilizes regularization. Weight initialization and learning rate vary between the methods. For Galliani et al., the weights are initialized via HeUniform, and the learning rate is set to for the first 50 epochs, decayed to for the next 50 epochs. For our method, we use HeNormal initialization. The initial learning rate is and is multiplied by 0.93 every 10 epochs. We perform data augmentation by extracting patches of size with a stride of 40 pixels from training data. The total amount of training samples is over. At the test phase, we directly feed the whole image to the network and obtain the estimated hyperspectral image in a single forward pass.
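The patch extraction and the step-wise learning rate decay can be sketched as follows. The stride of 40 pixels and the factor of 0.93 every 10 epochs come from the text; the patch size and the initial learning rate are left as parameters because their values are not given here, and the function names are hypothetical.

```python
def patch_origins(height, width, patch, stride=40):
    """Top-left corners of all full patches on a regular grid."""
    return [(y, x)
            for y in range(0, height - patch + 1, stride)
            for x in range(0, width - patch + 1, stride)]


def step_decay(initial_lr, epoch, factor=0.93, every=10):
    """Learning rate after multiplying by `factor` once per `every` epochs."""
    return initial_lr * factor ** (epoch // every)


# Illustrative values only: a 100 x 120 image and 64-pixel patches.
print(patch_origins(100, 120, patch=64))   # [(0, 0), (0, 40)]
print(step_decay(1.0, epoch=10))           # 0.93
```

Overlapping patches taken with a stride smaller than the patch size increase the number of training samples without requiring additional images.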
To quantitatively evaluate the performance of the proposed method, we adopt the following two categories of evaluation metrics.
Pixel-level reconstruction error Following prior work, we use the absolute and relative root-mean-square error (RMSE and rRMSE) as quantitative measurements of reconstruction accuracy. Let $I_h^{(i)}$ and $I_e^{(i)}$ denote the $i$-th element of the real and estimated hyperspectral images, $\bar{I}_h$ the average of $I_h$, and $n$ the total number of elements in one hyperspectral image. Two formulas are used for each of RMSE and rRMSE.
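The formulas themselves are not reproduced in this text. One plausible reading, offered as an assumption rather than as the authors' exact definitions, is a per-element variant and a whole-image variant of each measure, written here with $I_h^{(i)}$ and $I_e^{(i)}$ for the $i$-th element of the real and estimated images, $\bar{I}_h$ for the average of the real image, and $n$ for the number of elements:

```latex
\begin{align}
\mathrm{RMSE}_1  &= \frac{1}{n}\sum_{i=1}^{n}\sqrt{\left(I_h^{(i)} - I_e^{(i)}\right)^2}, &
\mathrm{RMSE}_2  &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(I_h^{(i)} - I_e^{(i)}\right)^2}, \\
\mathrm{rRMSE}_1 &= \frac{1}{n}\sum_{i=1}^{n}\frac{\sqrt{\left(I_h^{(i)} - I_e^{(i)}\right)^2}}{I_h^{(i)}}, &
\mathrm{rRMSE}_2 &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\frac{\left(I_h^{(i)} - I_e^{(i)}\right)^2}{\bar{I}_h^{\,2}}}.
\end{align}
```

In this reading, the first variant of each pair averages per-element errors, while the second takes the root of the mean squared error over the whole image, with the relative forms normalizing by the element value and by the image average respectively.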
Spectral similarity Since the key to spectral super-resolution is reconstructing the spectra, we also use the spectral angle mapper (SAM) to evaluate the performance of different methods. SAM calculates the average spectral angle between the spectra of the real and estimated hyperspectral images. Let $P_h^{(j)}, P_e^{(j)} \in \mathbb{R}^{C}$ denote the spectra of the $j$-th pixel in the real and estimated hyperspectral images ($C$ is the number of bands), and let $m$ be the total number of pixels within an image. The SAM value can be computed as follows.
$$\mathrm{SAM} = \frac{1}{m}\sum_{j=1}^{m}\arccos\left(\frac{\left(P_h^{(j)}\right)^{\top} P_e^{(j)}}{\left\|P_h^{(j)}\right\|_2 \left\|P_e^{(j)}\right\|_2}\right)$$
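The spectral angle computation can be sketched in plain Python as the mean arccosine of the normalized dot product between paired spectra. The function name `spectral_angle_mapper` and the list-of-lists input layout are illustrative assumptions; the result is in radians.

```python
import math


def spectral_angle_mapper(real, estimated):
    """Mean spectral angle (radians) between paired per-pixel spectra."""
    if not real or len(real) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, q in zip(real, estimated):
        dot = sum(a * b for a, b in zip(p, q))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
        if norm == 0.0:
            raise ValueError("spectral angle is undefined for an all-zero spectrum")
        # Clamp to [-1, 1] to guard against floating-point rounding.
        total += math.acos(max(-1.0, min(1.0, dot / norm)))
    return total / len(real)


# One identical pair and one orthogonal pair average to pi / 4.
print(spectral_angle_mapper([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Because the angle depends only on the direction of each spectrum, scaling an estimated spectrum by a positive constant leaves it essentially unchanged while changing RMSE, which is one reason the two kinds of metric need not agree.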
Convergence Analysis We plot the curve of the loss on the training set and the curves of five evaluation metrics computed on the test set in Figure 3. It can be seen that both the training loss and the metric values gradually decrease and ultimately converge as training proceeds. This demonstrates that the proposed multi-scale convolutional neural network converges well.
Quantitative Results Table 2 provides the quantitative results of our method and all baseline methods. It can be seen that our model outperforms all competitors on two of the four RMSE-based metrics and produces results comparable to Galliani et al. on the other two. More importantly, our method surpasses all the others with respect to the spectral angle mapper. This indicates that our model reconstructs spectra more accurately than the other competitors. It is worth pointing out that reconstruction error (absolute and relative RMSE) is not necessarily positively correlated with the spectral angle mapper (SAM). For example, when the pixels of an image are shuffled, RMSE and rRMSE will remain the same, while SAM will change completely. According to the results in Table 2, our carefully designed network enhances spectral super-resolution in both aspects, i.e., it yields better results on both average root-mean-square error and spectral angle similarity.
Visual Results To further demonstrate the superiority in reconstruction accuracy, we show the absolute reconstruction error of test images in Figure 4. The error is summarized over all bands of the hyperspectral image. Since A+ outperforms Arad et al. on every evaluation metric, we use A+ to represent the sparse coding methods. It can be seen that our method yields smoother reconstructed images as well as lower reconstruction error than the other competitors.
In addition, we randomly choose three test images and plot the real and reconstructed spectra for four pixels in Figure 2 to further demonstrate the effectiveness of the proposed method in spectrum reconstruction. It can be seen that only slight differences exist between the reconstructed spectra and the ground truth.
From the results above, we can conclude that the proposed method is effective in spectral super-resolution and outperforms several state-of-the-art competitors.
In this study, we show that leveraging both the local and non-local information of input images is essential for accurate spectral reconstruction. Following this idea, we design a novel multi-scale convolutional neural network, which employs a symmetrically cascaded downsampling-upsampling architecture to jointly encode the local and non-local image information for spectral reconstruction. In extensive experiments on a large hyperspectral image dataset, the proposed method clearly outperforms several state-of-the-art methods in terms of reconstruction accuracy and spectral similarity.
This work was supported in part by the National Natural Science Foundation of China (No. 61671385, 61571354), Natural Science Basis Research Plan in Shaanxi Province of China (No. 2017JM6021, 2017JM6001) and China Postdoctoral Science Foundation under Grant (No. 158201).
Aeschbacher, J., Wu, J., Timofte, R.: In defense of shallow learned spectral reconstruction from RGB images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 471–479 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034 (2015)