MXR-U-Nets for Real Time Hyperspectral Reconstruction

04/15/2020 ∙ by Atmadeep Banerjee, et al. ∙ 0

In recent times, CNNs have made significant contributions to applications in image generation, super-resolution and style transfer. In this paper, we build upon the work of Howard and Gugger, He et al. and Misra and propose a CNN architecture that accurately reconstructs hyperspectral images from their RGB counterparts. We also propose a much shallower version of our best model that is suited to real-time video applications at the cost of only a small drop in performance.




1 Introduction

Hyperspectral imagery captures information about objects across a wide range of the electromagnetic spectrum. These images carry far more useful information than standard RGB images and are especially useful in fields like remote sensing and medical diagnosis. The main problem with hyperspectral imagery is the expensive hardware required to capture it, which has led to a shortage of publicly available datasets. RGB images, on the other hand, are easy to obtain. A system that accurately reconstructs the spectral bands of an RGB image would therefore help further research into the applications of hyperspectral images.

It may seem problematic to convert RGB images to hyperspectral images, since the task requires generating information that was never captured by the RGB camera. However, hyperspectral pixel values correlate strongly [6] with their RGB counterparts. It is therefore possible to learn a mapping from RGB to hyperspectral images, given enough data. In fact, from visual inspection of the results obtained even with simple CNN based approaches, we believe this is an easier task than the analogous task of converting grayscale images to RGB.

Our approach consists of using a CNN to convert RGB to hyperspectral images. We use a U-Net [21] based model with several key improvements taken from recent advancements in the fields of image generation, super-resolution and style transfer. As the encoder, we use an XResnet model, as proposed by He et al. [10] (referred to as Resnet-D in [10]), with the Mish [16] activation function replacing ReLU [17]. In the decoder, we use sub-pixel convolutions [23] for upsampling. Finally, we incorporate blur layers (approach B in [27]) and a self-attention layer from [29] in our decoder. We adapted the general decoder architecture from Howard and Gugger's work in the Fastai [11] library and Antic's work in DeOldify [2]. We study the effect of adding each component on the accuracy and running time. We also study the effect of changing the depth of the encoder. Our proposed family of architectures is capable of accurately reconstructing spectral bands from RGB images with very low inference times on standard single-GPU systems. This paper describes a solution to the challenge posed in [4].

2 Related Work

The problem of spectral reconstruction from RGB images is an area of computer vision that has not been studied extensively. Current state-of-the-art approaches are mostly CNN based. Older methods made use of sparse coding [19], but recent advances in CNNs and the availability of relatively larger datasets for hyperspectral reconstruction have led to increasing research into neural network based approaches. Earlier CNN based approaches used relatively shallow networks [18] or even hybrid approaches combining sparse coding and neural networks [20]. The NTIRE 2018 spectral reconstruction challenge introduced the BGU hyperspectral dataset, which was much larger than existing datasets. The challenge saw various deep CNN based [3] and a few GAN based approaches. The new state of the art was achieved by Shi et al. [24]. Can and Timofte also did important work on lightweight real-time spectral reconstruction in [5].

Our model is a modified version of the U-Net [21] architecture. The U-Net is a popular deep learning architecture originally introduced to perform image segmentation. It has since been used for a wide range of image-to-image tasks. It improves over standard encoder-decoder architectures by incorporating skip connections between the encoder and decoder, allowing much deeper models to be trained. Over time the original U-Net architecture has seen several key improvements. These include incorporating skip connections in the encoder [7], using sub-pixel convolutions in the decoder [23] and using dilated convolution [30].

Our work is significantly inspired by Antic's work [2] on reconstructing RGB bands from grayscale images. We use a modified version of perceptual loss [12] in our network. This kind of loss function has proved useful in style-transfer [12] and super-resolution [13] applications. It makes networks focus on perceptual details in an image. These details are not easily captured by standard evaluation metrics like RMSE, PSNR or MRAE but are readily visible to humans. We make use of sub-pixel convolutions [23] for upsampling in our decoder. The sub-pixel convolution is an alternative to the deconvolution operation for learned upsampling and is extensively used in super-resolution applications. It performs the convolution in a low-resolution space and then rearranges the resulting channels into a higher-resolution output, instead of upsampling first. This approach is much more efficient while being mathematically equivalent to deconvolution.
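As an illustration of the rearrangement step in a sub-pixel convolution, the following is a minimal NumPy sketch. It is not the implementation used in our experiments: the function name `pixel_shuffle`, the upscale factor `r` and the channel ordering are assumptions chosen for the example. The convolution itself is omitted; only the channel-to-space rearrangement that follows it is shown.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into a (C, H*r, W*r) array."""
    c_rr, h, w = x.shape
    assert c_rr % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_rr // (r * r)
    # Split the channel axis into (C, r, r): one r-by-r block per output channel.
    x = x.reshape(c, r, r, h, w)
    # Interleave so that output[c, h*r + i, w*r + j] = input[c*r*r + i*r + j, h, w].
    x = x.transpose(0, 3, 1, 4, 2)  # axes become (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# Four low-resolution channels of size 2x2 become one 4x4 channel.
low_res = np.arange(4 * 2 * 2, dtype=np.float32).reshape(4, 2, 2)
high_res = pixel_shuffle(low_res, r=2)
print(high_res.shape)  # (1, 4, 4)
```

Because every multiplication happens on the low-resolution grid, the cost of the preceding convolution does not grow with the upscale factor squared, which is the efficiency argument made above.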

3 Method

3.1 Architecture

Our proposed architecture uses models from the XResnet family with the Mish activation function as the encoder of the U-Net (this encoder architecture will be referred to as mxresnet [28]). Skip connections between the encoder and decoder are made at the four positions where the encoder subsamples the image. The final encoder output is passed through two successive convolutions and fed into the decoder.
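For reference, the Mish activation [16] is defined as x * tanh(softplus(x)). The snippet below is a small scalar sketch in plain Python, intended only to make the definition concrete; the helper names are assumptions, and a real network would apply the same formula elementwise to tensors.

```python
import math

def softplus(x):
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    # Mish: x * tanh(softplus(x)). Smooth and non-monotonic near zero.
    return x * math.tanh(softplus(x))

for value in (-2.0, 0.0, 1.0, 5.0):
    print(value, mish(value))
```

Unlike ReLU, Mish lets small negative values pass through, and it approaches the identity for large positive inputs.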

Figure 1: A U-Net Block

The decoder consists of 4 upsampling blocks, each of which receives two input tensors and produces one output. The input from the previous decoder block is upsampled with a sub-pixel convolution initialized with the ICNR [1] scheme. Sub-pixel convolution combined with ICNR initialization has been credited with producing high-quality super-resolution free of checkerboard artifacts. The upsampling is followed by a blur [27] layer, which applies average pooling to the upsampled feature map. This operation also aims to reduce artifacts in generated images. The upsampled feature map is concatenated with the second input, which comes from an encoder skip connection. The final output is formed by passing the concatenated feature map through two successive convolutions. The second decoder block is followed by a self-attention layer as proposed by Zhang et al. in [29]. This layer helps the network to focus on the relevant parts of the image.
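The idea behind ICNR initialization and the blur layer can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not our training code: a 1x1 convolution stands in for the real convolution, the upscale factor is taken to be 2, and the blur is assumed to be a 2x2 average pool with stride 1 and edge padding. All function names are hypothetical.

```python
import numpy as np

def pixel_shuffle(x, r):
    # (C*r*r, H, W) -> (C, H*r, W*r), same channel ordering as in the ICNR note [1].
    c = x.shape[0] // (r * r)
    h, w = x.shape[1:]
    return x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

def icnr_weights(c_out, c_in, r, rng):
    # Draw one sub-kernel per output channel, then copy it to the r*r channels
    # that pixel_shuffle maps onto the same output channel.
    sub = rng.normal(size=(c_out, c_in))
    return sub, np.repeat(sub, r * r, axis=0)

def blur(x):
    # Assumed blur: 2x2 average pooling, stride 1, edge padding on top and left
    # so that the spatial size is unchanged.
    p = np.pad(x, ((0, 0), (1, 0), (1, 0)), mode="edge")
    return (p[:, :-1, :-1] + p[:, 1:, :-1] + p[:, :-1, 1:] + p[:, 1:, 1:]) / 4.0

rng = np.random.default_rng(0)
r = 2
sub, full = icnr_weights(c_out=3, c_in=5, r=r, rng=rng)
x = rng.normal(size=(5, 4, 4))

# 1x1 convolution with ICNR weights, followed by pixel shuffle.
up = pixel_shuffle(np.einsum("oc,chw->ohw", full, x), r)
# Same sub-kernel at low resolution, followed by nearest-neighbour upsampling.
low = np.einsum("oc,chw->ohw", sub, x)
nearest = np.repeat(np.repeat(low, r, axis=1), r, axis=2)

print(np.allclose(up, nearest))  # True
print(blur(up).shape)            # (3, 8, 8)
```

At initialization the sub-pixel convolution therefore behaves like nearest-neighbour upsampling, which is why it starts out free of checkerboard patterns; training then moves the weights away from this starting point.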

The decoder output is upsampled once more so that its resolution matches that of the input image. This feature map is concatenated with the original RGB image and passed through a standard Resnet block. We find that this concatenation operation provides significant improvements to our results. Finally, a convolution is used to reduce the number of channels to the desired number.

3.2 Loss

We use a slightly modified version of the loss function described by Johnson et al. in [12]. Perceptual loss refers to a loss function that measures the dissimilarity between a generated image and the ground truth based on perceptually relevant characteristics. Our loss function is the weighted sum of a feature reconstruction loss, a style reconstruction loss and a pixel loss.


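Feature Reconstruction Loss. We use a VGG16 [25] network pretrained on Imagenet [22] (also called the loss network) to compute the features of the model outputs. We modify the first layer of the network so that its filters have as many input channels as the hyperspectral images, by copying over the weights of the existing channels. Let $\phi_j(x)$ be the activations of the $j$th layer of the network $\phi$ when processing the image $x$; if $j$ is a convolutional layer then $\phi_j(x)$ will be a feature map of shape $C_j \times H_j \times W_j$. The feature reconstruction loss is the mean L1 distance between feature representations:

$$\ell_{feat}^{\phi,j}(\hat{y}, y) = \frac{1}{C_j H_j W_j} \lVert \phi_j(\hat{y}) - \phi_j(y) \rVert_1$$

where $\hat{y}$ is the model output and $y$ is the target.
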
We use the activations before the second, third and fourth max pool layers in the loss network to calculate our feature reconstruction loss.
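The two ingredients described above, widening the first layer of the loss network and taking a mean L1 distance between activations, can be sketched in NumPy as follows. This is a hedged illustration rather than our implementation: the cyclic copying order in `expand_input_channels`, the summation over layers and both function names are assumptions, and the activations are passed in as plain arrays instead of being computed by a real VGG16.

```python
import numpy as np

def expand_input_channels(weight, n_channels):
    """Copy (out, 3, k, k) conv weights to (out, n_channels, k, k).

    Assumption: existing input channels are reused cyclically (0, 1, 2, 0, 1, ...).
    """
    idx = np.arange(n_channels) % weight.shape[1]
    return weight[:, idx]

def feature_reconstruction_loss(feats_out, feats_target):
    """Sum over the chosen layers of the mean L1 distance between activations."""
    return sum(np.mean(np.abs(a - b)) for a, b in zip(feats_out, feats_target))

# Toy first-layer weights: 2 filters, 3 input channels, 1x1 kernels.
w = np.arange(2 * 3, dtype=np.float32).reshape(2, 3, 1, 1)
w_wide = expand_input_channels(w, n_channels=7)
print(w_wide.shape)  # (2, 7, 1, 1)

# Toy activations from two layers of the loss network.
out_feats = [np.ones((4, 8, 8)), np.ones((8, 4, 4))]
tgt_feats = [np.zeros((4, 8, 8)), np.ones((8, 4, 4))]
print(feature_reconstruction_loss(out_feats, tgt_feats))  # 1.0
```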

Figure 2: The mxresnet50 model

Style Reconstruction Loss. This loss was proposed by Gatys et al. in [8] and adapted in [12]. It involves calculating the Gram matrices of the loss network activations for the output and target images. The modified style reconstruction loss is then the mean absolute difference between the Gram matrices of the output and target images. The Gram matrix can be computed efficiently by reshaping the layer-$j$ activations $\phi_j(x)$, of shape $C_j \times H_j \times W_j$, into a matrix $\psi$ of shape $C_j \times H_j W_j$; then $G_j^{\phi}(x) = \psi \psi^T / (C_j H_j W_j)$.
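The Gram-matrix computation described above can be written in a few lines of NumPy. This is a minimal sketch, assuming the normalization by the number of activation elements used in [12]; the function names are hypothetical and the activations are supplied directly as arrays.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) activation map, normalized by C*H*W."""
    c, h, w = feat.shape
    psi = feat.reshape(c, h * w)      # one row per channel
    return psi @ psi.T / (c * h * w)  # (C, C) channel co-activation matrix

def style_reconstruction_loss(feat_out, feat_target):
    """Mean absolute difference between the two Gram matrices."""
    return np.mean(np.abs(gram_matrix(feat_out) - gram_matrix(feat_target)))

feat = np.ones((2, 3, 3))
print(gram_matrix(feat))                       # every entry is 0.5
print(style_reconstruction_loss(feat, feat))   # 0.0
```

Because the Gram matrix sums over all spatial positions, this term compares which channels fire together rather than where they fire.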

Figure 3: This figure visualizes the model outputs of mxresnet18 U-Net(our smallest model) and mxresnet50 U-Net (our largest model) on a Validation Image. On the Clean track, model outputs are visually indistinguishable from each other and the ground truth. On the Real World images, however, the differences are more clearly visible. The larger model produces visibly cleaner outputs.

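Pixel Loss. The pixel loss is the mean Euclidean distance between the output image $\hat{y}$ and the target $y$. If both have shape $C \times H \times W$, then the pixel loss is defined as

$$\ell_{pixel}(\hat{y}, y) = \frac{1}{CHW} \lVert \hat{y} - y \rVert_2^2$$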
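A minimal sketch of the pixel term and of the weighted combination of the three terms is given below. The squared, size-normalized form of the pixel loss follows the definition in [12] and is an assumption here, as are the function names and the default weights, which are placeholders rather than the values used in training.

```python
import numpy as np

def pixel_loss(output, target):
    """Squared Euclidean distance normalized by the number of elements (C*H*W)."""
    return np.sum((output - target) ** 2) / output.size

def total_loss(feature_term, style_term, pixel_term,
               w_feature=1.0, w_style=1.0, w_pixel=1.0):
    """Weighted sum of the three loss terms; the weights are hypothetical."""
    return w_feature * feature_term + w_style * style_term + w_pixel * pixel_term

output = np.full((2, 2, 2), 3.0)
target = np.ones((2, 2, 2))
print(pixel_loss(output, target))               # 4.0
print(total_loss(1.0, 2.0, 3.0, w_style=0.5))   # 5.0
```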


4 Training

We normalize the images and use the following data augmentation techniques: random flipping, random rotations, and brightness and contrast jitter. All networks are trained for 200 epochs using the AdamW [14] optimizer (Adam with decoupled weight decay) with a weight decay of 1e-3. The training follows the One Cycle schedule [26]. Under this schedule, the learning rate starts at 1e-5 and is increased to 1e-3 over 60 epochs following a half cosine curve. After the learning rate peaks, it is reduced to 1e-9 over another 140 epochs following a similar half cosine curve. The momentum of the optimizer goes through a similar but opposite cycle. It starts at 0.95, is reduced to 0.85 over 60 epochs and is then increased again to 0.95 over 140 epochs. We also use mixed-precision training [15] to lower training time and memory requirements. A single V100 GPU was used for all training runs. The entire training schedule takes hours (s per epoch) for an mxresnet34 encoder based U-Net.
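The schedule described above can be sketched in plain Python as follows. This is an illustration only: it evaluates the schedule once per epoch, whereas practical implementations usually update it every batch, and the function and parameter names are assumptions. The numeric defaults are the values quoted in the text.

```python
import math

def cosine_anneal(start, end, fraction):
    """Half-cosine interpolation: returns start at fraction=0 and end at fraction=1."""
    return end + (start - end) * (1 + math.cos(math.pi * fraction)) / 2

def one_cycle(epoch, warmup=60, total=200,
              lr_start=1e-5, lr_max=1e-3, lr_end=1e-9,
              mom_high=0.95, mom_low=0.85):
    """Return (learning_rate, momentum) for a given epoch in [0, total]."""
    if epoch <= warmup:
        # Phase 1: learning rate rises while momentum falls.
        f = epoch / warmup
        return cosine_anneal(lr_start, lr_max, f), cosine_anneal(mom_high, mom_low, f)
    # Phase 2: learning rate decays while momentum recovers.
    f = (epoch - warmup) / (total - warmup)
    return cosine_anneal(lr_max, lr_end, f), cosine_anneal(mom_low, mom_high, f)

for epoch in (0, 30, 60, 130, 200):
    lr, mom = one_cycle(epoch)
    print(f"epoch {epoch:3d}: lr={lr:.3e}, momentum={mom:.3f}")
```

Running momentum opposite to the learning rate keeps the effective step size from growing too quickly while the learning rate is at its peak.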

Figure 4: To demonstrate the effect of perceptual loss, we visualize the outputs of a model with an mxresnet34 encoder trained with MSE loss, on a Real World validation image, and compare it with the outputs of our models trained with perceptual loss. On zooming in, it can be seen that the MSE model produces a considerably higher amount of artifacts.

Approach                          Clean       Real World
Resnet34 (pretrained)             0.055625    0.092532
Resnet34 (no pretraining)         0.052844    0.090132
+ Self Attention + Blur           0.052818    0.089132
+ Self Attention with MSE Loss    0.169942    0.162509
+ Self Attention + Blur           0.052089    0.088589
+ Self Attention + Blur           0.045434    0.083993

Table 1: Quantitative comparisons on both datasets for different approaches. Notable results are in bold or italics.

5 Experiments

Here we present some ablation studies comparing different variations of our proposed method. More specifically, we compare results for these encoder backbones: resnet34 [9], mxresnet34 [28], mxresnet18 and mxresnet50. We also vary the presence of the self-attention layer along with the blur layer in the decoder networks. All networks have sub-pixel convolution layers in their decoders. The loss function for every experiment is the aforementioned perceptual loss combination unless otherwise specified.

5.1 Dataset

The dataset was provided in the New Trends in Image Restoration and Enhancement (NTIRE) Challenge on Spectral Reconstruction from RGB Images at CVPR 2020 [4]. The datasets for both competition tracks (Clean and Real World) consist of training images and validation images. The dataset for the Clean track consists of uncompressed RGB images with their hyperspectral counterparts as ground truth. For the Real World track, the model input is JPEG-compressed RGB images. In our experiments, the training and validation splits were used as provided in the original datasets.

5.2 Results

As Table 1 indicates, a U-Net with an mxresnet50 encoder, self-attention and blur in the decoder, and perceptual losses produces the best results. We note that using a pre-trained model causes a slight reduction in performance compared to the approaches that did not use any Imagenet [22] pretraining. Further, adding self-attention and blur to the decoder, along with the modifications to the original resnet [9] architecture (xresnet [10] with the mish [16] activation function), gave a small performance boost. Most surprisingly, though, an mxresnet18 based encoder was able to outperform all our other approaches except the mxresnet50 encoder, while being significantly shallower and more computationally efficient than the other approaches.
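For readers unfamiliar with the metrics mentioned in this paper, the following NumPy sketch shows the usual definitions of MRAE (mean relative absolute error) and RMSE. It assumes the standard formulas and strictly positive ground-truth values; the function names are hypothetical and this is not the official challenge scoring code.

```python
import numpy as np

def mrae(pred, target):
    """Mean relative absolute error: mean(|pred - target| / target)."""
    return np.mean(np.abs(pred - target) / target)

def rmse(pred, target):
    """Root mean squared error."""
    return np.sqrt(np.mean((pred - target) ** 2))

target = np.full((3, 4, 4), 2.0)
pred = np.full((3, 4, 4), 2.2)
print(mrae(pred, target))  # approximately 0.1
print(rmse(pred, target))  # approximately 0.2
```

Lower is better for both metrics. MRAE weights errors relative to the ground-truth intensity, so mistakes in dark regions count more than under RMSE.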

5.3 Inference Time

We measure the inference time per image for every model. The model with the mxresnet50 encoder is compared against all other approaches. Notably, the network with an mxresnet18 backbone is fast enough at inference to be suitable for real-time video. The mxresnet18 encoder model also has far fewer parameters than the mxresnet50 encoder model, while reducing performance only slightly on the Clean and Real World tracks with respect to the MRAE metric.

6 Conclusion

In this paper, we use an encoder-decoder network based on the U-Net architecture together with some recent advances in deep learning. We use the resnet-d [10] architecture with the mish [16] activation function as the encoder. In the decoder, we use sub-pixel convolution [23] layers for upsampling to help increase efficiency, blur [27] layers to reduce checkerboard artifacts and a self-attention layer [29] to focus the network on finer details. All these improvements allow our approach to produce good results even with a relatively shallow encoder network such as the mxresnet18. We note that many of these improvements were originally conceived and implemented in the FastAI [11] library by Howard and Gugger and in Antic's contributions. Our contribution has been to combine the mxresnet base architecture with these improvements. We also introduce a model based on the mxresnet18 encoder that is suitable for real-time video applications, with a low inference time and without any significant drop in performance.


  • [1] A. P. Aitken, C. Ledig, L. Theis, J. Caballero, Z. Wang, and W. Shi (2017) Checkerboard artifact free sub-pixel convolution: a note on sub-pixel convolution, resize convolution and convolution resize. ArXiv abs/1707.02937.
  • [2] J. Antic (2020-03) DeOldify.
  • [3] B. Arad, O. Ben-Shahar, and R. Timofte (2018) NTIRE 2018 challenge on spectral reconstruction from rgb images. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1042–104209.
  • [4] B. Arad, R. Timofte, O. Ben-Shahar, Y. Lin, G. Finlayson, et al. (2020-06) NTIRE 2020 challenge on spectral reconstruction from an rgb image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • [5] Y. B. Can and R. Timofte (2018) An efficient cnn for spectral reconstruction from rgb images. ArXiv abs/1804.04647.
  • [6] A. Chakrabarti and T. Zickler (2011) Statistics of real-world hyperspectral images. In CVPR 2011, pp. 193–200.
  • [7] A. Chaurasia and E. Culurciello (2017) LinkNet: exploiting encoder representations for efficient semantic segmentation. 2017 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4.
  • [8] L. A. Gatys, A. S. Ecker, and M. Bethge (2015) A neural algorithm of artistic style. ArXiv abs/1508.06576.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [10] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li (2018) Bag of tricks for image classification with convolutional neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 558–567.
  • [11] J. Howard and S. Gugger (2020) Fastai: a layered api for deep learning. ArXiv abs/2002.04688.
  • [12] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV.
  • [13] C. Ledig, L. Theis, F. Huszár, J. A. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi (2016) Photo-realistic single image super-resolution using a generative adversarial network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 105–114.
  • [14] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In ICLR.
  • [15] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu (2017) Mixed precision training. ArXiv abs/1710.03740.
  • [16] D. Misra (2019) Mish: a self regularized non-monotonic neural activation function. ArXiv abs/1908.08681.
  • [17] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In ICML.
  • [18] R. M. Nguyen, D. K. Prasad, and M. S. Brown (2014) Training-based spectral reconstruction from a single rgb image. In European Conference on Computer Vision, pp. 186–201.
  • [19] M. Parmar, S. Lansel, and B. A. Wandell (2008) Spatio-spectral reconstruction of the multispectral datacube using sparse recovery. 2008 15th IEEE International Conference on Image Processing, pp. 473–476.
  • [20] A. Robles-Kelly (2015) Single image spectral reconstruction for multimedia applications. In MM '15.
  • [21] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. ArXiv abs/1505.04597.
  • [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • [23] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1874–1883.
  • [24] Z. Shi, C. Chen, Z. Xiong, D. Liu, and F. Wu (2018) HSCNN+: advanced cnn-based hyperspectral recovery from rgb images. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1052–10528.
  • [25] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
  • [26] L. N. Smith (2018) A disciplined approach to neural network hyper-parameters: part 1 - learning rate, batch size, momentum, and weight decay. ArXiv abs/1803.09820.
  • [27] Y. Sugawara, S. Shiota, and H. Kiya (2018) Super-resolution using convolutional neural networks without any checkerboard artifacts. 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 66–70.
  • [28] L. Wright and E. Shalnov (2020-03) Lessw2020/mish: mxresnet release. Zenodo.
  • [29] H. Zhang, I. J. Goodfellow, D. N. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. ArXiv abs/1805.08318.
  • [30] L. Zhou, C. Zhang, and M. Wu (2018) D-linknet: linknet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 192–1924.