Almost all consumer cameras available today function by converting the light spectrum to match the trichromacy of the human eye (as Red, Green, and Blue channels). This is effective for presenting information to humans, but it ignores much of the visible spectrum. Hyperspectral images (HSI) and multispectral images (MSI), on the other hand, capture additional frequencies of the spectrum and often measure spectra with greater fidelity. This additional information can be used for many applications, including precision agriculture, food quality analysis, and aerial object tracking.
Typically, MSI have 4 to 10 channels spread over a large bandpass, and HSI have 30 to 600 channels with finer spectral resolution. Because of this increased spectral resolution, MSI and HSI data can enable discrimination tasks where RGB fails. However, MSI and HSI data have drawbacks: (1) MSI and HSI cameras are very expensive, and (2) HSI and MSI have significantly lower spatial and temporal resolution than RGB cameras (Fig. 1). As a result, the use of spectral imagery has been limited to domains where these drawbacks are mitigated. Given the high hardware costs of flying an HSI sensor, we explore the possibility of learning RGB to HSI mappings in low resolution spectral imagery and then applying those mappings to high spatial resolution RGB imagery to obtain images with both high spatial and high spectral resolution.
Spectral super-resolution (SSR) algorithms attempt to infer additional spectral bands in the 400 nm – 700 nm range, at an interval of 10 nm, from an RGB image and a low resolution HSI image. Recently, SSR algorithms using deep learning [6, 3, 30] have been proposed that attempt to solve this problem in natural images. These methods bypass the need for a low resolution HSI input by learning RGB to spectral mappings from a large sample of natural images.
Recently, generative adversarial networks (GANs) and their variants have shown tremendous success in generating realistic looking images by learning a generative model of the data. Conditional GANs are similar to conventional GANs, except that they learn the output distribution as a function of noise and the input, thus making them suitable for text-to-image and image-to-image translation.
This paper makes three major contributions:
We show that conditional GANs can learn the target distribution for 31 spectral bands from low spatial resolution RGB images.
We describe a new aerial spectral dataset called AeroCampus that contains a wide variety of objects, including, but not limited to, cars, roads, trees, and buildings.
We demonstrate that our conditional GAN achieves an effective root mean square error (RMSE) on AeroCampus of less than 3.0. We then use our model on RGB images with high spatial resolution to obtain images with both high spatial and high spectral resolution.
2 Related work
SSR is closely related to hyperspectral super-resolution [16, 2, 5]. Hyperspectral super-resolution involves inferring a high resolution HSI from two inputs: a low resolution HSI and a high resolution image (typically RGB). SSR is a harder task because it does not have access to the low resolution HSI, which can be expensive to obtain.
Nguyen et al. used a radial basis function (RBF) network that leverages RGB white-balancing to recover the mapping from color to spectral reflectance values. Their approach rests on two key assumptions that make it too restrictive: (1) the color matching function of the camera is known beforehand, and (2) the scene is lit by a uniform illumination. Their method includes stages for recovering both the object reflectance and the scene illumination, and training the RBF network depends heavily on these assumptions. Arad and Ben-Shahar proposed learning a sparse dictionary of hyperspectral signature priors and their corresponding RGB projections. They then used a many-to-one mapping technique for estimating hyperspectral signatures in the test image, while using all other images in the dataset for learning the dictionary. This approach yielded better results on domain-specific subsets than on the complete set, since the dictionary has access to many similar naturally occurring pixel instances in the training data and can be optimized for the target subset. Similar to Arad and Ben-Shahar, Aeschbacher et al. adapted the A+ method to the spectral reconstruction domain to achieve significantly better results without the need for online learning of the RGB-HSI dictionary (Arad and Ben-Shahar’s approach was inspired by the work of Zeyde et al.). However, these approaches tackle the mapping problem at the pixel level and fail to take advantage of the area around the pixel, which could yield better information for predicting signatures. For example, if a particular ‘blue’ pixel is to be spectrally up-sampled, does it belong to a blue car or to the sky? The above approaches cannot use this spatial information.
A number of papers that apply deep learning to SSR have been published this year. Galliani et al. proposed the use of the Tiramisu architecture, a fully convolutional version of DenseNet. They adapted the network to a regression problem by replacing the softmax cross-entropy loss for class segmentation with the Euclidean loss, and established the first state-of-the-art results in the field. Xiong et al. proposed using spectral interpolation techniques to first up-sample the RGB image in the channel space to a desired spectral resolution and then using CNNs to enhance the up-sampled spectral image. Similar to our work, Alvarez-Gila et al. recently used a pix2pix image-to-image translation framework for SSR using GANs on natural images. A key point about these deep learning methods is that, unlike dictionary based algorithms, they do not require information about the camera’s color matching functions.
3 AeroCampus RGB and HSI Data Sets
The AeroCampus data set (see Fig. 2) was generated by flying two types of camera systems over Rochester Institute of Technology’s university campus on August 8th, 2017. The systems were flown simultaneously in a Cessna aircraft. The first camera system consisted of an 80 megapixel (MP), RGB, framing-type silicon sensor, while the second system consisted of a visible/near infrared (VNIR) hyperspectral Headwall Photonics Micro Hyperspec E-Series CMOS sensor. The entire data collection took place over the span of a couple of hours during which the sky was completely free of cloud cover, with the exception of the last couple of flight lines at the end of the day.
The wavelength range for the 80 MP sensor was 400 to 700 nm, with typical band centers around 450, 550, and 650 nm and full-width-half-max (FWHM) values ranging from 60 to 90 nm. The hyperspectral sensor provided spectral data in the range of 397 to 1003 nm, divided into 372 spectral bands. The ground sample distance (GSD) is completely dependent on flying altitude. The aircraft was flown over the campus at an altitude of approximately 5,000 feet, yielding an effective GSD of about 5 cm for the RGB data and 40 cm for the hyperspectral imagery.
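To make the resolution gap concrete, the short sketch below works out the ratio between the two ground sample distances quoted above. The variable names are illustrative only, and the calculation assumes square pixels on a common ground plane.

```python
# GSD values as quoted in the text (centimetres per pixel).
rgb_gsd_cm = 5.0
hsi_gsd_cm = 40.0

# Linear ratio: how many RGB pixels span one HSI pixel along one axis.
linear_ratio = hsi_gsd_cm / rgb_gsd_cm

# Area ratio: how many RGB pixels fall inside one HSI pixel footprint,
# assuming square pixels (an assumption, not stated in the text).
rgb_pixels_per_hsi_pixel = linear_ratio ** 2

print(f"{linear_ratio:.0f}x finer per axis, "
      f"{rgb_pixels_per_hsi_pixel:.0f} RGB pixels per HSI pixel")
```

Under these assumptions, each hyperspectral pixel covers roughly an 8 by 8 block of RGB pixels, which is the spatial detail the learned mapping is meant to recover spectrally.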
Both data sets were ortho-rectified based on survey grade GPS. That is, camera distortion was removed along with uniform scaling and re-sampling using a nearest neighbor approach so as to preserve radiometric fidelity. The RGB data was ortho-rectified onto the Shuttle Radar Topography Mission (SRTM) v4.1 Digital Elevation Model (DEM) while the HSI was rectified onto a flat plane at the average terrain height of the flight line (i.e., a low resolution DEM). Both data sets were calibrated to spectral radiance in units of . To preserve the integrity of the training and testing data, we only use one of the six flight lines collected to record our results. There was significant overlap between the other flight lines and hence, the one with the largest spatial extent was chosen to obtain a considerable split in the dataset (Fig. 2).
Comparison to other datasets. To the best of our knowledge, AeroCampus is the first aerial spectral dataset of its kind. The closest contender would be the Kaggle DSTL Satellite Imagery Dataset, which has an 8-band multispectral channel between 400 nm and 1040 nm. The lack of a uniform, pre-defined split also makes it difficult to validate newly proposed models against the current state-of-the-art methods. For the ICVL dataset, Galliani et al. used a 50% global split of the available images and randomly sampled a set of image patches for training the Tiramisu network. At test time, they constructed the spectral signatures of a given image by tiling patches with an eight pixel overlap to avoid boundary artifacts. For the same dataset, Alvarez-Gila et al. trained their network using a different global split, making it difficult to compare other approaches in the absence of uniformly accepted data splits. For AeroCampus, we follow a simple split (Fig. 2): we use 60% of the data for training and the remaining 40% for testing. This is done to ensure that there is enough spectral variety present in the dataset with respect to key classes such as cars, roads, vegetation, and buildings.
4 AeroGAN for Aerial SSR
Problem statement. As shown in Fig. 1, we define our under-constrained problem as follows: Given a three band (RGB) image, is it possible to learn up-sampling in the spectral domain to regress information for 31 bands between 400 nm - 700 nm? To this end, we experiment with a conventional encoder-decoder network and extend the capacity by modeling the task as a target distribution learning problem.
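As a quick check on the band count in the problem statement, the sketch below enumerates band centers from 400 nm to 700 nm at a 10 nm interval. The helper name `band_centers` is hypothetical and is used here only for illustration.

```python
def band_centers(start_nm=400, stop_nm=700, step_nm=10):
    """Return evenly spaced band centers, inclusive of both end points."""
    return list(range(start_nm, stop_nm + 1, step_nm))


centers = band_centers()
# Three input channels (RGB) must be expanded to this many output bands.
print(len(centers), centers[0], centers[-1])
```

The mapping is therefore from 3 values per pixel to 31, which is why the problem is described as under-constrained.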
4.1 CNN Framework Analysis
The network architecture for aerial SSR is constrained by the following requirements: (1) it should be able to process low resolution features very well due to the nature of the data, (2) it should be able to propagate information to all layers of the network so that valuable information is not lost during sampling operations, and (3) it should be able to make the most of limited data samples. For our model, we use a variant of the UNet framework, since it is known to operate well on low resolution medical imagery and limited data samples. The network is modified to solve a regression problem by replacing the last softmax layer with a ReLU activation, whose output is then forwarded to another convolution layer for predicting the band values. The skip connections from encoder to decoder layers carry simple but useful information whose spatial positioning remains consistent at the output end, so that the available information is used as fully as possible.
Following popular approaches in spatial super-resolution, we use LeakyReLUs on the encoder side and normal ReLUs on the decoder side to avoid vanishing gradients. The last set of filters obtained is then given to a channel convolution layer to obtain the final set of 31 bands. The intuition behind using this filter is twofold: it forces the network to learn dimensionality reduction on the 64 channel space and, at the same time, gives each pixel location its own distinct signature, since the filters do not concern themselves with correlation in the spatial feature map space but rather look at variation across the feature map (channel) dimension. We regress the band values within a bounded range and found this to be important for achieving more stable predictions from the network. Dropout is applied to all but the last two layers of the CNN to ensure smooth gradient flow through the network while minimizing the loss. It is worth mentioning that both FC-DenseNet (used in ) and UNet failed to obtain a good representation of the mapping using conventional loss functions, possibly due to an insufficient number of training samples.
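The channel convolution described above can be illustrated with a minimal pure-Python sketch. It assumes the layer is a 1 × 1 convolution, that is, a per-pixel linear map from 64 feature channels to 31 bands; the function name `conv1x1`, the random weights, and the 4 × 4 spatial size are all hypothetical and stand in for the trained PyTorch layer.

```python
import random


def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution.

    feature_map: nested list shaped [C_in][H][W]
    weights:     nested list shaped [C_out][C_in]
    bias:        list of length C_out
    Returns a nested list shaped [C_out][H][W].
    """
    c_in = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    output = []
    for w_row, b in zip(weights, bias):
        # Each output value depends only on the input channels at (y, x).
        plane = [
            [b + sum(w_row[c] * feature_map[c][y][x] for c in range(c_in))
             for x in range(width)]
            for y in range(height)
        ]
        output.append(plane)
    return output


random.seed(0)
C_IN, C_OUT, H, W = 64, 31, 4, 4  # 64 feature channels reduced to 31 bands
features = [[[random.random() for _ in range(W)] for _ in range(H)]
            for _ in range(C_IN)]
weights = [[random.uniform(-0.1, 0.1) for _ in range(C_IN)]
           for _ in range(C_OUT)]
bias = [0.0] * C_OUT

spectra = conv1x1(features, weights, bias)
print(len(spectra), len(spectra[0]), len(spectra[0][0]))
```

Because no neighbouring pixels enter the sum, each pixel location receives its own 31-value signature, which matches the intuition given in the text.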
4.2 cGAN Framework Analysis
While a pixel-wise L1/MSE loss works for regressing the values of the spectral bands, we further improved the network by recasting the problem as a target distribution learning task. Conditional GANs, first proposed in , have been used widely for generating realistic looking synthetic images [14, 33, 17, 11]. To overcome the difficulty of dealing with pixel-wise MSE loss, Johnson et al. and Ledig et al. used similar loss functions based on the activations of the feature maps in the VGG network layers. No such network exists in the spectral domain that could provide feature map activations to compare in order to improve the quality of the generated samples. Our GAN is inspired by the image-to-image translation framework of Isola et al. Similar to their paper, where the task is to regress 2 or 3 channels depending on the problem, we formulate our objective for regressing 31 spectral bands as follows:
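A sketch of the objective in the notation of Isola et al., whose framework this formulation follows, is given here. The symbols x (input RGB image), y (target spectral cube), z (noise), and λ (weight of the reconstruction term) are assumptions taken from that paper rather than confirmed notation, and the equation numbering used in the surrounding text may differ.

```latex
\begin{align*}
\mathcal{L}_{cGAN}(G, D) &= \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_{1}\big] \\
G^{*} &= \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)
\end{align*}
```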
where the generator (G) tries to minimize the objective while the adversarial discriminator (D) tries to maximize it. The other loss in Eqn. 1 is an additional term imposed on the generator, which is now tasked with not only fooling the discriminator but also being as close to the ground truth output image as possible. We use an L1 loss for this term, after having tested L2 loss and similarity index based losses such as SSIM. L2 loss has been the most popular choice for pixel-wise reconstruction, and though it is effective at restoring low frequency content, it suppresses most of the high frequency detail, which is undesirable given how little high frequency content is available in the first place. Isola et al. proposed replacing the L2 loss with an L1 loss for correcting low frequency components, while using the PatchGAN discriminator to deal with high frequency components by penalizing structure at the patch level. PatchGAN is described in  in terms of the size of the discriminator’s receptive field, which determines the portion of the sample that is judged real or fake. For instance, a receptive field covering a single pixel judges each pixel value individually, while a larger receptive field judges whether each region of the rendered image is real or fake and then averages all the scores. This architecture works in our favor since the PatchGAN layers assess spectral data similarity inherently, without the need to specify a separate loss function. On the generator side, the L1 term in Eqn. 2 is weighted to normalize its contribution to the overall loss function. We found that the best results were obtained (Table 1, Fig. 7) by setting the discriminator’s receptive field to .
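The interplay between the averaged patch scores and the L1 term can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the training code: the function names are hypothetical, the adversarial term uses the log(1 − D) form of the pix2pix objective, and the weight `lam` is a placeholder because the value used in the paper is not given in this text.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def patch_average(patch_scores):
    """Average a 2-D grid of per-patch discriminator scores into one score."""
    flat = [s for row in patch_scores for s in row]
    return sum(flat) / len(flat)


def generator_loss(patch_scores, pred, target, lam):
    """Adversarial term (averaged over patches) plus weighted L1 term.

    patch_scores: discriminator outputs in (0, 1), one per patch, where a
                  higher score means the patch looks more real.
    lam:          hypothetical weight on the L1 term.
    """
    eps = 1e-12  # guards against log(0)
    flat = [s for row in patch_scores for s in row]
    adversarial = sum(math.log(max(1.0 - s, eps)) for s in flat) / len(flat)
    return adversarial + lam * l1_loss(pred, target)


scores = [[0.2, 0.4], [0.6, 0.8]]   # toy 2x2 grid of patch decisions
predicted = [0.10, 0.20, 0.30]      # toy predicted band values
reference = [0.10, 0.25, 0.40]      # toy ground truth band values
print(patch_average(scores))
print(generator_loss(scores, predicted, reference, lam=10.0))
```

The generator lowers this value either by raising the patch scores (fooling the discriminator) or by shrinking the L1 distance to the ground truth, which is the trade-off the text describes.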
5 Experiments and Results
Data Preparation. Finding the right alignment between RGB and HSI imagery captured at different altitudes is a difficult task for problems such as SSR. Following the work of other researchers [4, 6, 5], we synthesize the RGB images from the hyperspectral data using the standard camera sensitivity functions for the Canon 1D Mark III as collected by Jiang et al. This eliminates the need to establish the accurate spatial correspondence that would otherwise have been required. Camera sensitivity functions describe the image sensor’s relative efficiency of light conversion as a function of wavelength. They are used to find correspondences between the radiance in the actual scene and the RGB digital counts generated. In our case, the original hyperspectral scene contains images taken with 372 narrow filters, each separated by about 1 nm. Using ENVI (Exelis Visual Information Solutions, Boulder, Colorado), we first convert this data to 31 bands separated by 10 nm and ranging from 400 nm to 700 nm to form our hyperspectral cube. Using the camera sensitivity function at the corresponding 31 wavelengths, we then synthesize the RGB images. All images are normalized before being fed into the networks.
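The RGB synthesis step can be sketched as a weighted sum of the 31 band values per channel. The Gaussian curves below are hypothetical stand-ins and are not the measured Canon 1D Mark III sensitivities; dividing by the sum of the weights is likewise an assumption made so that a flat spectrum maps to equal channel values.

```python
import math

WAVELENGTHS_NM = list(range(400, 701, 10))  # the 31 band centers


def toy_sensitivity(center_nm, width_nm=40.0):
    """Hypothetical Gaussian response curve sampled at the 31 band centers."""
    return [math.exp(-0.5 * ((w - center_nm) / width_nm) ** 2)
            for w in WAVELENGTHS_NM]


# Placeholder curves; real use would load measured sensitivity data instead.
SENSITIVITY = {
    "R": toy_sensitivity(610),
    "G": toy_sensitivity(540),
    "B": toy_sensitivity(460),
}


def synthesize_rgb(spectrum, sensitivity=SENSITIVITY):
    """Project one pixel's 31-band spectrum onto R, G, and B."""
    rgb = {}
    for channel, weights in sensitivity.items():
        weighted = sum(w * s for w, s in zip(weights, spectrum))
        rgb[channel] = weighted / sum(weights)  # normalization is an assumption
    return rgb


flat = synthesize_rgb([1.0] * 31)
reddish = synthesize_rgb([1.0 if w >= 600 else 0.1 for w in WAVELENGTHS_NM])
print(flat)
print(reddish)
```

Because the RGB values are computed from the same pixels as the spectral cube, each training pair is aligned by construction, which is the benefit noted in the text.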
We used PyTorch for all our implementations. All models were initialized with HeUniform, and dropout was applied to avoid overfitting and as a replacement for noise in adversarial networks. For optimization, we used Adam with a learning rate that gradually decreases to a lower value halfway through the epochs. We found these to be the optimum parameters for all our results. All GANs were trained for 50 epochs to achieve optimal results. All max pooling and up-sampling layers were replaced with strided convolution and transposed convolution layers, respectively. Inspired by Galliani et al., we replaced all transposed convolutions with subpixel up-sampling, but did not achieve significant improvement. Thus transposed convolutions are retained in all our models.
Error metrics. We use two error metrics for judging the performance of our network: root mean square error (RMSE) and peak signal-to-noise ratio (PSNR). To avoid any discrepancy in the future, it is worth mentioning that the RMSE is computed on an 8-bit range by converting the corresponding values between (following approaches in [4, 6]) while the PSNR is measured in the range.
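A minimal sketch of the two metrics follows. It assumes predictions and targets are normalized to [0, 1], that the 8-bit conversion is a multiplication by 255, and that PSNR uses a peak value of 1.0; these ranges are assumptions for illustration, since the exact values are not preserved in this text.

```python
import math


def mean_squared_error(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def rmse_8bit(pred, target, scale=255.0):
    """RMSE after rescaling [0, 1] values to an 8-bit range (assumed 0-255)."""
    return scale * math.sqrt(mean_squared_error(pred, target))


def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for the given peak value (assumed 1.0)."""
    mse = mean_squared_error(pred, target)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


predicted = [0.5, 0.5]
reference = [0.5, 0.6]
print(rmse_8bit(predicted, reference))
print(psnr(predicted, reference))
```

Stating the range alongside each number matters because the same prediction gives an RMSE 255 times larger on the 8-bit scale than on the normalized one.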
Fig. 5 shows a set of different scenarios from the test data that were analyzed. The first row is a set of four different scenes from the test dataset, namely: running track, baseball field, vegetation, and parking lot. The scenes are picked such that the former two objects have never been seen by the network and the latter two are some permutation of instances in the training data. The network is able to generate significant band resemblances in all cases, supporting the viability of our method. Secondly, since the network is fully convolutional, we also test a scenario where it has to infer information in a patch of a different resolution (Fig. 7). We sample a set of four points as shown in Fig. 6 and analyze the plots for the three discriminator windows.
From Fig. 7, we observe that none of the models predicted the bump observed in the 400 - 420 nm range for the tree sample. This bump is caused mostly by a low signal-to-noise ratio at the sensor end and hence can be treated as noise, which the networks managed to ignore. The inferences for car, building, and asphalt also look smooth, and even though the discriminator does not get the right magnitude levels, the reconstructed spectra have similar key points for unique object identification, which is close to solving the reconstruction task.
Proof of concept. The main aim of this study is to determine whether neural networks can learn spectral pattern distributions that can be applied to high resolution RGB images to get the best of both. For validation, we sample a set of patches from the RGB images that were collected and present a proof of concept (Fig. 8) for aerial SSR. As observed, the network recovered significant spectral traits: (1) a bump at the higher end of the spectrum for the red car, and (2) a peak in green corresponding to the vegetation patch. This shows that it is indeed possible for neural networks to observe information over time and possibly learn a pattern, provided enough samples are present for training.
6 Discussion
In this section, we discuss other network architectures that were tried and also the limitations of using SSR with aerial imagery.
Other networks. Two additional network architectures were tested to reduce the under-constrained problem space: (1) a 31-channel GAN architecture similar to , where each band gets its own set of convolution layers before being concatenated for calculating reconstruction loss; and (2) an architecture inspired by  in which two consecutive GANs learn to first generate an image at a lower resolution and then upscale it to a higher resolution. In our case, we used two different GANs to first spectrally up-sample to 11 bands and then predict the remaining 20. However, we found both of these networks to be more unstable than the simpler one. We believe this is possibly because it is easier to learn an entire spectral distribution range than to learn it split by split, since there can be overlaps between objects of different categories in particular spectral ranges. We are continuing to develop these models.
Areas of development. SSR has its own set of limitations that cannot be resolved irrespective of the methods used. For example, one of the main motivations for this paper is to determine whether applied learning can be used instead of expensive hyperspectral cameras to predict light signatures in the hyperspectral space. While it is possible to model spectral signatures in the visible range, it is next to impossible to model signatures in the infrared and beyond, since they are not a function of just the RGB values. Here, we present two “solvable” limitations: water and shadows. Water does not have its own hyperspectral signature and instead takes on the signature of the sediments present in it; the signatures for clear water and turbid water are distinctly different. Detecting shadows is a known problem in spectral imaging, since they also do not exhibit a unique spectral signature. The question posed here is simple: given a vast amount of data, is it possible to have a network learn how water and shadows work and how they affect the spectral signatures of the objects under consideration? To this end, we sample a patch from another flight line (Fig. 10) that contains asphalt (road) under two different conditions: sunlight and shadow. The corresponding spectral prediction is shown in Fig. 11, where we observe that the network predicted a spectral signature for the shadowed patch similar to that of the sunlit patch, with a decrease in magnitude. This could be of importance in tasks where knowing the presence of shadows is required.
7 Conclusion
In this paper, we trained a conditional adversarial network to determine the 31 band visible spectra of an aerial color image. Our network is based on the Image-to-Image Translation framework, which we extend to predict 31 band values. We show that the network learns to extract features for determining an object’s spectra despite high noise interference in the spectral bands. Experimental results show an RMSE of 2.48, which indicates that the network is successfully recovering the spectral signatures of a color image. Furthermore, we introduce two modeling complexities, water and shadows, and release the AeroCampus dataset for other researchers to use.
This work was supported by the Dynamic Data Driven Applications Systems Program, Air Force Office of Scientific Research, under Grant FA9550-11-1-0348. We thank the NVIDIA Corporation for the generous donation of the Titan X Pascal that was used in this research.
References
[1] J. Aeschbacher, J. Wu, and R. Timofte. In defense of shallow learned spectral reconstruction from rgb images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017.
[2] N. Akhtar, F. Shafait, and A. Mian. Hierarchical beta process with gaussian process prior for hyperspectral image super resolution. In European Conference on Computer Vision, pages 103–120. Springer, 2016.
[3] A. Alvarez-Gila, J. van de Weijer, and E. Garrote. Adversarial networks for spatial context-aware spectral image reconstruction from rgb. 2017.
[4] B. Arad and O. Ben-Shahar. Sparse recovery of hyperspectral signal from natural rgb images. In European Conference on Computer Vision, pages 19–34. Springer, 2016.
[5] R. Dian, L. Fang, and S. Li. Hyperspectral image super-resolution via non-local sparse tensor factorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5344–5353, 2017.
[6] S. Galliani, C. Lanaras, D. Marmanis, E. Baltsavias, and K. Schindler. Learned spectral super-resolution. arXiv preprint arXiv:1703.09470, 2017.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
[9] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[10] W. Huang and M. Bu. Detecting shadows in high-resolution remote-sensing images of urban areas using spectral and spatial features. International Journal of Remote Sensing, 36(24):6224–6244, 2015.
[11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[12] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1175–1183. IEEE, 2017.
[13] J. Jiang, D. Liu, J. Gu, and S. Süsstrunk. What is the space of spectral sensitivity functions for digital color cameras? In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 168–179. IEEE, 2013.
[14] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] C. Lanaras, E. Baltsavias, and K. Schindler. Hyperspectral super-resolution by coupled spectral unmixing. In Proceedings of the IEEE International Conference on Computer Vision, pages 3586–3594, 2015.
[17] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[18] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[19] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[20] D. J. Mulla. Twenty five years of remote sensing in precision agriculture: Key advances and remaining knowledge gaps. Biosystems engineering, 114(4):358–371, 2013.
[21] R. M. Nguyen, D. K. Prasad, and M. S. Brown. Training-based spectral reconstruction from a single rgb image. In European Conference on Computer Vision, pages 186–201. Springer, 2014.
[22] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[23] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[25] P. L. Suárez, A. D. Sappa, and B. X. Vintimilla. Infrared image colorization based on a triplet dcgan architecture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 212–217. IEEE, 2017.
[26] D. Sun. Computer vision technology for food quality evaluation. Academic Press, 2016.
[27] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pages 111–126. Springer, 2014.
[28] B. Uzkent, A. Rangnekar, and M. J. Hoffman. Aerial vehicle tracking by adaptive fusion of hyperspectral likelihood maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 233–242. IEEE, 2017.
[29] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[30] Z. Xiong, Z. Shi, H. Li, L. Wang, D. Liu, and F. Wu. Hscnn: Cnn-based hyperspectral image recovery from spectrally undersampled projections. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017.
[31] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar. Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE transactions on image processing, 19(9):2241–2253, 2010.
[32] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
[33] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.