Multi-FAN: Multi-Spectral Mosaic Super-Resolution Via Multi-Scale Feature Aggregation Network

09/17/2019 ∙ by Mehrdad Shoeiby, et al. ∙ CSIRO 4

This paper introduces a novel method to super-resolve multi-spectral images captured by modern real-time single-shot mosaic image sensors, also known as multi-spectral cameras. Our contribution is two-fold. Firstly, we super-resolve multi-spectral images from mosaic images rather than image cubes, which helps to take into account the spatial offset of each wavelength. Secondly, we introduce an external multi-scale feature aggregation network (Multi-FAN) which concatenates the feature maps with different levels of semantic information throughout a super-resolution (SR) network. A cascade of convolutional layers then implicitly selects the most valuable feature maps to generate a mosaic image. This mosaic image is then merged with the mosaic image generated by the SR network to produce a quantitatively superior image. We apply our Multi-FAN to RCAN (Residual Channel Attention Network), which is the state-of-the-art SR algorithm. We show that Multi-FAN improves both quantitative results and inference time.




1 Introduction

Recent developments in real-time snapshot mosaic image sensors have given rise to fast and portable multi-spectral camera devices, with performance comparable to modern trichromatic (RGB) cameras [45, 5]. Despite the great interest in these multi-spectral imaging devices, with applications ranging from astronomy [3] to object detection in autonomous vehicles [41, 30], they suffer from an inherent constraint: a trade-off between spatial and spectral resolution. The reason is the limited physical space on a 2D camera image sensor. A higher spatial resolution (smaller pixel size) reduces the number of wavelength bins that can fit on each pixel of the image sensor. This creates a limitation in applications where size and portability are essential factors, for instance, on a UAV [7]. A more portable (smaller and lighter) camera suffers more from low spatial and spectral resolution.

Although multi-spectral cameras suffer from low spatial resolution compared to their RGB counterparts, and despite the vast interest in these devices, there is a lack of scientific literature on multi-spectral image super-resolution (SR). Therefore, in this article, we aim to develop an SR model for multi-spectral cameras, i.e., SR for mosaic image sensors, to address this gap in the literature. It should be noted that our method is general and robust, as it can be applied to images from any mosaic snapshot sensor. In this paper, we claim the following contributions:

Figure 1: Illustration of mosaic pattern generation procedure. Given the HR and LR cubes, we generate mosaic pattern images according to their wavelength position on the image sensor depicted in [36], which are then used to train our super-resolution network.


  • We propose a multi-spectral SR method that improves on the state-of-the-art RCAN network in terms of both performance and inference time.

  • We propose a simple, yet novel, data representation that exploits the mosaic pattern of a snapshot sensor and allows super-resolving from the mosaic image rather than from image cubes, as is customary in the SR literature. Our work is, to the best of our knowledge, the first CNN-based technique to super-resolve multi-spectral images from mosaic images directly.

2 Related Work

Here, we provide details about the current state-of-the-art RGB and multi-spectral SR algorithms. Dong et al. [8] provided the pioneering work in CNN-based SR. Their linear network, the super-resolution convolutional neural network (SRCNN) [8], is composed of only three convolutional layers with various filter sizes. SRCNN [8] improved significantly over the traditional SR methods. Due to the success of SRCNN, many CNN-based algorithms [26, 9, 46] were proposed.

Initially, the focus was on simple linear networks [9, 46], i.e., networks without any skip-connections that learn the image itself. Although comprised of eight convolutional layers, the fast super-resolution convolutional neural network (FSRCNN) [9] speeds up the SR process by using the original low-resolution patch as input instead of a bicubically up-sampled one, highlighting the fact that using interpolation to scale up images deteriorates SR performance. Recently, another linear network, the super-resolution network for multiple degradations (SRMD) [46], was proposed, which can handle multiple degradations, such as blur and unknown downscaling operators.

Other approaches use residual connections [19, 26, 1]. For example, Kim et al. [19] introduced very deep super-resolution (VDSR), which has a single global skip-connection from the input to the final output. Similarly, enhanced deep super-resolution (EDSR) [26] employs residual blocks (RBs). More recently, the cascading residual network (CARN) [1] was proposed, which also employs a variant of RBs with cascading connections. CARN [1] lags behind EDSR [26] in terms of PSNR; however, its focus is on efficiency and speed.

With the rise of skip-connections, Kim et al. [20] utilized recursive connections in the deep recursive convolutional network (DRCN), which applies the same convolutions several times, as this helps in keeping the number of parameters small. Similar to [20], the deep recursive residual network (DRRN) [40] is a deeper network that replicates the necessary skip-connection modules many times to realize a multi-path architecture. Furthermore, the persistent memory network for image super-resolution (MemNet) [39] relies on a recursive block that is similar to the RBs defined in [26]. The performances of the recursive skip-connection networks are comparable to each other.

More recently, motivated by the success of DenseNet [15], CNN-based SR networks have concentrated on the dense connection model. For example, SRDenseNet [42] uses dense connections to learn compact models, avoiding the problem of vanishing gradients and easing the flow from low-level features to high-level features. The authors of [42] proposed a sequential arrangement of the dense modules followed by deconvolutional layers to reconstruct the final output from the high-level features only. Recently, the residual-dense network (RDN) [49] employed dense connections to learn the local representations from the patches at hand. Likewise, the dense deep back-projection network (DDBPN) [14] aims to model a feedback mechanism with a feed-forward procedure. To this end, its authors proposed a series of densely connected upsampling and downsampling layers and combined the intermediate outputs to predict the resulting SR image.

To improve the perceptual quality of the images, SR networks based on GANs [33, 13] have been attempted. An interesting work in this regard is SRResNet [24], which combines three different losses, including perceptual and adversarial losses. The generator is composed of RBs and long skip-connections. SRResNet outperformed its competitors in terms of perceived quality by a significant margin. More recently, SRFeat [31] was proposed, which uses an additional discriminator to assist the generator in creating more realistic images. ESRGAN [43] is inspired by [24]; batch normalization is removed, and dense connections between five consecutive convolutional layers are incorporated. Furthermore, to enforce residual learning, ESRGAN [43] also has a global skip-connection. Moreover, instead of using the traditional discriminator, ESRGAN [43] uses an enhanced discriminator called Relativistic GAN [18].

The concept of visual attention [28] was brought to SR by RCAN [48], which models the inter-channel dependencies using a channel attention (CA) mechanism. This process is coupled with a very deep network composed of groups of RBs. Following in the footsteps of [48], the residual attention module (SRRAM) of [21] employed a dual attention approach, i.e., both spatial attention and CA, to model the intra-channel and the inter-channel dependencies, respectively; however, SRRAM [21] lags behind RCAN [48] and DRLN [2].

The SR algorithms mentioned above focus mainly on super-resolving RGB images, even though multi-spectral images are comparatively more adversely affected by resolution constraints. These constraints are due to the inherent characteristics of multi-spectral cameras. Moreover, current SR algorithms are designed for demosaiced images, and not for mosaic/cube images, which require them to take into account the spectral and spatial correlation of multiple channels. The scarcity of multi-spectral SR algorithms may be due to the absence of multi-spectral SR benchmarking platforms as well as the difficulty of accessing suitable spectral SR datasets. For example, Li et al. [25] aim to improve the quality of hyperspectral (not multi-spectral) images, and theirs is one of the few CNN-based spectral SR methods. To the best of our knowledge, the only multi-spectral SR methods [23, 34] were submitted to the PIRM2018 multi-spectral SR challenge [37, 36]. To super-resolve the images, Lahoud et al. [23] adopted an image completion technique that requires images down-sampled at more than one scale as input to a 12-layer convolutional network. While achieving good results, it addresses the problem of SR given several down-sampled images, rather than single-image SR. It is also not an end-to-end CNN-based implementation. The best end-to-end CNN-based method in the challenge was proposed by Shi et al. [34], which implicitly employed RCAN [48].

None of the methods in the challenge directly takes into account any spectral correlation or considers the spatial offsets of each wavelength channel. These methods rely only on the network to learn the inherent geometrical relationship of each pixel. To elaborate, the input data consist of multi-spectral image cubes, which is the customary way to represent multi-spectral images. However, unlike hyperspectral cubes, where each pixel location contains information for the complete measurable spectral range, with multi-spectral mosaic sensors each raw pixel location contains information about only one specific wavelength band. In other words, representing a mosaic image in the form of a multi-spectral cube implies that the precise geometrical information inherent to each raw pixel is lost during the conversion.

Preserving the aforementioned geometrical information can be achieved by creating the multi-spectral cubes via an interpolation or demosaicing method [29, 17, 9, 47] at the dimensions of the original mosaic image. However, demosaicing or interpolation has been shown to deteriorate SR performance [10, 50, 35]. A couple of related works exploit mosaic images instead of interpolation or demosaicing. Fu et al. [10] utilize a variational approach that exploits mosaic RGB images to super-resolve hyperspectral images. This was preceded by Zhou et al. [50], who presented a deep residual network for super-resolving RGB images that utilizes generated (estimated) mosaic images. Both works highlight the fact that demosaicing involves interpolation (e.g., bicubic) and introduces artifacts that can deteriorate the performance of the algorithms.

3 Proposed Method

Figure 2: Overview of our proposed SR network. The model consists of two main blocks: RIRN/RCAN and Multi-FAN. While RIRN/RCAN super-resolves the input from the sum of the outputs of the first convolutional layer and the last residual group (RG), Multi-FAN fuses features at multiple levels (including the ones from shallower RGs) to perform SR. The final output is then computed from the information derived from these two networks. Note that the numbers of RGs and RBs for RCAN correspond to the network of [34]. Channel-wise concatenation is denoted by ©. 2D convolutional layers and 2D transposed convolutional layers are denoted by blue and red blocks, respectively.

As explained in section 2, multi-spectral cubes provided in the PIRM2018 challenge do not contain the geometric information of each specific wavelength. To introduce this geometric information into the dataset without having to interpolate each channel, we propose the generation of mosaic LR and HR from raw multi-spectral cubes to carry out SR on mosaic images rather than image cubes. Moreover, we propose to utilize a Multi-FAN mechanism on top of features of RCAN [34, 47] at multiple semantic scales. This allows the model to effectively consider high-level semantic features (with larger receptive fields), as well as the ones from shallower layers, which capture more details (e.g., textures and edges, and multi-spectral intra-channel dependencies) but in local regions of the input images (with smaller receptive fields). The final super-resolved image is then computed given the information from multiple semantic scales and information from the deepest convolution layer. In addition, we empirically observed a trade-off between the effectiveness and the inference time of the CA mechanism. Hence, we also propose that removing CAs and using our Multi-FAN can improve inference time significantly while improving/maintaining PSNR performance.

3.1 Residual in Residual Channel Attention Network (RCAN)

In our experiments, we use RCAN, which is the state of the art in SR, as our baseline. Since we also study the effect of the CA mechanism, we refer to the baseline as RCAN or RIRN (RCAN without CA) where appropriate. The RCAN/RIRN network block in Figure 2 is similar in structure to the RCAN in [34, 47]; in particular, the RCAN structure used is that of [34] for multi-spectral SR on the same dataset. The body is comprised of a number of sequential residual groups (RGs), each of which contains a number of residual blocks (RBs). We choose the numbers of RGs and RBs to be consistent with the parameters of RCAN in [34]. The head of the network is a 2D convolutional layer that creates feature channels from the input LR mosaic image. The tail of the network is the reconstruction part, which consists of a transposed convolutional layer to up-sample the features, followed by a convolutional layer to produce the HR mosaic image. Given the input LR multi-spectral mosaic image and the HR ground-truth image, the network is trained to super-resolve the input image.

3.2 Multi-scale Feature Aggregation Network (Multi-FAN)

Through a general observation of multi-spectral pixels, one could intuitively infer/observe a general relationship between the wavelengths corresponding to a single multi-spectral pixel [6]. As can be seen in Figures 1 and 2, each multi-spectral pixel is a filter pattern capturing 16 different wavelengths. These wavelengths contain information about the material in the scene and their relationship, and considering the pattern of a multi-spectral pixel, shallower semantic features could contain this intra-channel dependency information.

As illustrated in Figure 2, the RCAN/RIRN output is generated after the residual connection; thus, it is produced from the sum of the output of the first convolutional layer and that of the last RG. While such features are rich enough to super-resolve the input image, we argue that lower-level features, i.e., the ones from shallower RGs, account for fine-grained details, such as edges and texture, and possibly, as explained earlier, the intra-wavelength dependencies, which are essential to consider when performing SR. To achieve this within our network, we concatenate the outputs of the last three RGs in the network with the final feature map, right before the transposed convolutional layer in RIRN. The three feature maps from these RGs contain valuable information at various levels of semantic features with different receptive fields. The final feature map, thanks to the skip-connection, contains very low-level semantic information via smaller receptive fields as well as the highest-level semantic information with wider receptive fields.

After concatenating the four feature maps, we have a feature map with a larger number of channels. To extract the most valuable information from this feature representation and to facilitate feature reduction, we feed the concatenated feature channels through a cascade of a 2D convolutional layer, a ReLU activation layer, a 2D transposed convolutional layer, a ReLU activation layer, and a final 2D convolutional layer to produce a mosaic image. This mosaic image is then concatenated with the mosaic image produced by RCAN/RIRN, and a 2D convolutional layer learns to merge the two super-resolved images, which leads to a lower value of the cost function and a quantitatively superior image.
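The aggregation path described above can be traced as a sequence of tensor shapes. The following is a minimal sketch in plain Python, not the authors' implementation: the 64 feature channels per map, the scale factor of 3, the 3 × 3 kernels, and the choice of a transposed convolution with kernel equal to stride are all illustrative assumptions; only the 64, 32, and 1 output channels are taken from the implementation details.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Output length of a 2D convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose2d_out(size, kernel, stride=1, padding=0):
    """Output length of a 2D transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel


def multi_fan_shapes(height, width, feat_channels=64, scale=3, kernel=3):
    """Trace (name, channels, height, width) through the aggregation path.

    feat_channels, scale and kernel are illustrative assumptions.
    """
    pad = kernel // 2  # "same" padding for odd kernels
    trace = []
    # Concatenate the last three RG outputs with the final feature map.
    c, h, w = 4 * feat_channels, height, width
    trace.append(("concat", c, h, w))
    # 2D convolution + ReLU: reduce channels, keep the spatial size.
    c, h, w = 64, conv2d_out(h, kernel, 1, pad), conv2d_out(w, kernel, 1, pad)
    trace.append(("conv+relu", c, h, w))
    # Transposed convolution + ReLU: up-sample (kernel = stride = scale).
    c = 32
    h = conv_transpose2d_out(h, scale, scale)
    w = conv_transpose2d_out(w, scale, scale)
    trace.append(("deconv+relu", c, h, w))
    # 2D convolution: produce the single-channel Multi-FAN mosaic.
    c, h, w = 1, conv2d_out(h, kernel, 1, pad), conv2d_out(w, kernel, 1, pad)
    trace.append(("multi_fan_mosaic", c, h, w))
    # Concatenate with the RCAN/RIRN mosaic, then merge with one convolution.
    trace.append(("concat_with_rirn", 2, h, w))
    h, w = conv2d_out(h, kernel, 1, pad), conv2d_out(w, kernel, 1, pad)
    trace.append(("merged_output", 1, h, w))
    return trace
```

Calling `multi_fan_shapes(8, 6)` lists each stage with its channel count and spatial size, which makes it easy to check that the Multi-FAN mosaic and the RCAN/RIRN mosaic have matching sizes before they are merged.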

3.3 Loss functions

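In the CNN-based SR literature, a simple loss function such as L1 [47] or L2 [34, 23], or a perceptual loss function such as SSIM [44], is usually utilized to train models. Here, for consistency, we choose the L1 loss as our baseline loss function, since an L1 function is less sensitive to outliers than an L2 function. For our baseline function, we use the SmoothL1 [38] implementation in PyTorch [32], which is more stable than vanilla L1. Following the PyTorch definition, SmoothL1 can be expressed as

$$\mathcal{L}_{\text{SmoothL1}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} z_i,$$

where

$$z_i = \begin{cases} 0.5\,(x_i - y_i)^2, & \text{if } |x_i - y_i| < 1, \\ |x_i - y_i| - 0.5, & \text{otherwise,} \end{cases}$$

and $x$ and $y$ denote the super-resolved and ground-truth images, each with $n$ elements.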

For simplicity, we also refer to the SmoothL1 function used here as L1. Moreover, inspired by Goodfellow's GAN paper [12], which suggests modifying the loss function to increase derivatives at the beginning of training, we apply a modest modification to improve performance. More specifically, we choose to minimise the cost function on a logarithmic scale, that is, we minimise log(L1). As the cost function approaches zero, following the thought process in [12], using a logarithmic-scale loss function could deepen or stretch the cost function surface by changing the minimum possible value from 0 to −∞. In other words, as opposed to the GAN paper [12], where derivatives at the beginning of training increase, here the derivative around convergence could increase, facilitating further training.
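The logarithmic loss can be illustrated with a small plain-Python sketch. It assumes the PyTorch SmoothL1 definition with a threshold of 1 and operates on flat lists of pixel values; the `eps` term that guards against log(0) is our own assumption and is not described in the text.

```python
import math


def smooth_l1(pred, target):
    """Mean SmoothL1 loss over two equally long sequences (threshold of 1)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(pred)


def log_smooth_l1(pred, target, eps=1e-12):
    """Logarithmic-scale loss; eps is a hypothetical guard against log(0)."""
    return math.log(smooth_l1(pred, target) + eps)
```

Since the derivative of log(L) with respect to the network parameters equals the derivative of L divided by L, the gradient is amplified as L becomes small, which matches the intuition that the derivative around convergence could increase.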

3.4 Implementation Details

                       PIRM2018 (SoTA)   Ours
Method      Bicubic    RCAN*             RCAN       RIRN       RIRN+Multi-FAN   RIRN+Multi-FAN
PSNR (dB)   28.63      32.65             33.27      33.30      33.32            33.36
            (3.502)    (0.01)            (0.01)     (0.01)     (0.01)           (0.01)
SSIM        0.4552     0.6367            0.6480     0.6485     0.6498           0.6506
            (0.0691)   (0.0005)          (0.0005)   (0.0007)   (0.0009)         (0.0005)
Table 1: PSNR and SSIM obtained using different models. Mean and standard deviation (in parentheses) are calculated over the 11-fold cross-validation experiments.

Now we specify the implementation details of our proposed Multi-FAN. As mentioned before, the numbers of RGs and RBs in the RIRN/RCAN part of our network follow [34]. All our convolutional layers share the same kernel size. The convolutional layers in the shallow feature extraction and the body share the same number of filters, except at the tail of the RCAN, where the channels are reduced. The following layers have 64, 32, and 1 output channels, respectively. Our loss function is applied at the output of the RCAN/RIRN block, at the output of Multi-FAN, and at the final output.

3.5 Dataset: Generating LR mosaic images

As explained in the related work, performing SR on demosaiced images gives rise to artifacts related to demosaicing and interpolation [50, 10]. For example, CNN-based SR methods such as VDSR [19] and SRCNN [8], which first interpolate the input LR images up to the scale of the HR images, suffer from these artifacts by losing information and decreasing computational efficiency [47]. In the more closely related work on multi-spectral SR [34, 23] (from the PIRM2018 spectral SR challenge), SR is carried out on multi-spectral cubes. The images were not interpolated, likely to avoid interpolation artifacts. However, not interpolating the images is likely to have led to the loss of the spatial information contained in each wavelength. This is due to the spatial offset of each wavelength in a multi-spectral pattern [36] (see Figure 1). In theory, a CNN can learn the geometry of the wavelength channels using the ground truth images. However, our empirical results show that this is not the case.

Inspired by the ideas in [50, 10], which both in a way take advantage of the high-frequency information of mosaic images, we propose our own procedure, which is to generate HR-LR mosaic image pairs from image cubes. As demonstrated in Figure 1, we generate HR and LR mosaic images from HR multi-spectral cubes in the StereoMSI dataset [36] by taking into account the spatial offset of each wavelength in a multi-spectral pixel [36]. The HR multi-spectral cubes have a dimension of , and LR have a dimension of . According to [36], the multi-spectral pixels have a 4 × 4 pattern, meaning 16 channels. However, two of these channels are redundant, leaving us with 14 channels. We transform this 14-channel cube to its mosaic pattern following the spatial locations provided in [36]. For the two redundant wavelengths, we assign a value of zero. In Figure 1, these two wavelengths are shown in black. The resulting HR and LR mosaic patterns have dimensions , and respectively.
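The cube-to-mosaic conversion can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure: the mapping `POSITIONS` from channel index to location inside the 4 × 4 pattern is hypothetical (row-major, with the last two cells left unused), whereas the actual locations are those given in [36]; the sketch also assumes that each cube pixel expands into one full 4 × 4 block of the mosaic.

```python
def cube_to_mosaic(cube, positions, pattern=4):
    """Scatter cube[c][y][x] into a single-channel mosaic image.

    Each cube pixel becomes one pattern x pattern block; cells without an
    assigned channel (the redundant wavelengths) stay at zero.
    """
    height, width = len(cube[0]), len(cube[0][0])
    mosaic = [[0.0] * (width * pattern) for _ in range(height * pattern)]
    for c, channel in enumerate(cube):
        dy, dx = positions[c]  # offset of this wavelength inside the block
        for y in range(height):
            for x in range(width):
                mosaic[y * pattern + dy][x * pattern + dx] = channel[y][x]
    return mosaic


# Hypothetical layout: 14 bands in row-major order, last two cells unused.
POSITIONS = [(i // 4, i % 4) for i in range(14)]
```

Applying the same function to the HR and LR cubes yields HR-LR mosaic pairs in which every value keeps the spatial offset of its wavelength, which is the property the cube representation discards.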

4 Experiments

4.1 Settings


To evaluate the effectiveness of our approach, we make use of the PIRM2018 spectral SR challenge dataset [36, 37]. The dataset is comprised of 350 multi-spectral pairs of and images. The training set consists of 300 images plus 30 and 20 images set aside for validation and testing respectively. and cubes exhibit resolutions of , and respectively. Following our proposed mosaic generation procedure in Section 3.5, we turn cubes to mosaic of size and cubes to mosaics of size .

Evaluation metrics:

The 20 test images were super-resolved and evaluated using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM). For the SSIM metric, a window size of 7 is used, and the metric is calculated for each channel and then averaged. It is important to note that, while we present SSIM results as is customary in the SR literature, SSIM is a relatively less descriptive metric in multi-spectral image SR than in RGB SR. This is because the amount of light absorbed by a specific material/pixel is more closely related to PSNR than to a perceptual metric such as SSIM.
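For reference, PSNR follows the standard definition 10 · log10(MAX² / MSE). The sketch below is a plain-Python version operating on flat lists of pixel values; the default `max_value` of 1.0 is an assumption about the data range and should match the normalisation actually used.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally long sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```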

11-fold cross-validation:

To verify the stability of our methods on the PIRM2018 dataset, we carried out an 11-fold cross-validation experiment. The 330 training and validation images were randomly divided into 11 folds, with one fold iteratively selected for validation and the rest used for training. The first fold corresponds to the original dataset division. This cross-validation was performed only for the results presented in Table 1, with model parameters corresponding more closely to the state-of-the-art algorithm from the challenge. Having confirmed the stability of our methods by obtaining very low standard deviations across the 11 experiments, we carried out the rest of our ablation studies on the first fold, which corresponds to the dataset division in the challenge.
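One way to build such folds is sketched below. It assumes that images are identified by integer indices, that the original validation images are the last 30 indices unless `first_fold` is given, and that the remaining images are shuffled with a fixed seed; all of these are illustrative choices, not details from the experiments.

```python
import random


def make_folds(num_images=330, num_folds=11, first_fold=None, seed=0):
    """Split image indices into equal folds; fold 0 is the original split."""
    fold_size = num_images // num_folds
    if first_fold is None:
        # Assumption: the original validation images are the last ones.
        first_fold = list(range(num_images - fold_size, num_images))
    held_out = set(first_fold)
    rest = [i for i in range(num_images) if i not in held_out]
    random.Random(seed).shuffle(rest)
    folds = [list(first_fold)]
    for k in range(num_folds - 1):
        folds.append(rest[k * fold_size:(k + 1) * fold_size])
    return folds


def train_val_split(folds, k):
    """Use fold k for validation and all other folds for training."""
    val = folds[k]
    train = [i for j, fold in enumerate(folds) if j != k for i in fold]
    return train, val
```

With the defaults, each of the 11 splits has 300 training and 30 validation images, and `train_val_split(folds, 0)` reproduces the assumed original division.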

Training settings:

During training, we performed data augmentation on our mini-batches of 16 images drawn from the 300 training images, which included random cropping of the input image, random rotation, and random horizontal flipping. Our model is trained with the ADAM optimizer [22]. The learning rate is halved every 2500 epochs. To implement our models we used PyTorch [32]; in particular, the SmoothL1 function [38] was used as the main building block of our loss function. To test our algorithms, we selected the models with the best performance on the validation dataset and present the testing results for those models.
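The step schedule for the learning rate can be written as a one-line function. In this sketch the initial value of `1e-4` is a placeholder assumption; only the halving every 2500 epochs is taken from the text.

```python
def learning_rate(epoch, initial_lr=1e-4, step=2500, factor=0.5):
    """Step decay: multiply the learning rate by `factor` every `step` epochs.

    initial_lr is a hypothetical placeholder value.
    """
    return initial_lr * factor ** (epoch // step)
```

In PyTorch, the same behaviour is provided by `torch.optim.lr_scheduler.StepLR` with `step_size=2500` and `gamma=0.5`, stepped once per epoch.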

4.2 Effect of using mosaic images

To evaluate the effectiveness of mosaic images in SR, we compare RCAN trained on our mosaic images with RCAN trained on the original data format as in [34]. Note that the data format used in [34] does not consider the spatial offset of each multi-spectral pixel. In Table 1, we provide the results of this comparison as RCAN* (original data format) and RCAN (mosaic images). A considerable improvement in PSNR (from 32.65 dB to 33.27 dB) is achieved solely by using our generated mosaic images. We can conclude that the network is not fully capable of learning the geometrical relationship between the wavelengths (sub-pixels) in a multi-spectral pixel, which highlights the effect of our data representation in improving PSNR and SSIM results.

4.3 Effect of the CA

Before we demonstrate the effect of our proposed Multi-FAN module, we first study the effect of CA in the architecture. Our implementation of RCAN, which follows the parameters in [34] for multi-spectral SR, contains 15 CA mechanisms. Interestingly, CA seems to slightly deteriorate PSNR performance, as shown by the results in Table 1 for RCAN (33.27 dB) and RIRN, i.e., RCAN without CA (33.30 dB). We hypothesize that this is because our implementation of RCAN is not very deep, whereas CA can help to train very deep networks such as the one in [47]. In fact, according to [47], CA in a very deep RCAN with 200 CAs can lead to a PSNR improvement. Hence, we also train the very deep RCAN of [47], with results presented in Table 3, showing that CA can improve PSNR with a very deep network. The remainder of the results in Tables 1 and 3 are discussed in the following sections.

4.4 Effect of the Multi-FAN module

Metric      SR (RIRN output)   SR (Multi-FAN output)   SR (final output)
PSNR (dB)   33.296             33.357                  33.362
            (0.012)            (0.009)                 (3.698)
SSIM        0.6478             0.6502                  0.6506
            (0.0007)           (0.0004)                (0.0005)
Table 2: Effect of Multi-FAN demonstrated by assessing PSNR and SSIM at three points in our network: the output of RIRN, the output of Multi-FAN, and the final output of RIRN+Multi-FAN. These results are averaged over our 11-fold cross-validation experiments.

PSNR (dB)   33.39      33.37      33.39      33.42       33.44       33.44
            (3.742)    (3.763)    (3.791)    (3.790)     (3.771)     (3.777)
SSIM        0.6556     0.6535     0.6537     0.6572      0.6559      0.6569
            (0.0614)   (0.0614)   (0.0619)   (0.06108)   (0.06147)   (0.0612)
Table 3: Mean and standard deviation (in parentheses) of PSNR and SSIM obtained using different models with the very deep RCAN as the baseline.

As explained earlier, the motivation behind our Multi-FAN is to exploit multi-scale semantic features for the task of multi-spectral SR. To quantitatively assess the benefit of the Multi-FAN module, we trained RIRN+Multi-FAN, with results presented in Table 1. Using our proposed Multi-FAN improves PSNR over RIRN (33.32 dB versus 33.30 dB) and over RCAN (33.32 dB versus 33.27 dB). To study the effect of Multi-FAN in more detail, Table 2 provides PSNR and SSIM results for the image at the output of RIRN, at the output of the Multi-FAN module, and at the final output of RIRN+Multi-FAN. Note that the results obtained via the Multi-FAN module are superior to the results at the output of RIRN. Moreover, the final result is superior to the results at both intermediate outputs, showing that the two networks (Multi-FAN and RIRN) learn complementary information.

When we changed our loss to its logarithmic version, the PSNR improved further, to 33.36 dB. In comparison to the RCAN baseline, our Multi-FAN module and logarithmic loss improve PSNR from 33.27 dB to 33.36 dB. Note that, due to recent progress in CNN-based SR, in particular the contribution of the RCAN paper [47], further improvement in PSNR is a substantially challenging task. Nevertheless, because PSNR is measured in dB on a logarithmic scale, this improvement, which is purely due to the architectural contribution and logarithmic loss, amounts to a multiplicative improvement in the underlying ratio. Overall, with our mosaic data representation, Multi-FAN module, and logarithmic loss, a generous PSNR improvement is achieved, from 32.65 dB to 33.36 dB. Also note that the very small PSNR and SSIM standard deviations of our 11-fold cross-validation experiment in Table 1 are evidence of the reliability of our results and the stability of our method.
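Because PSNR is expressed in dB, a gain in dB maps to a multiplicative change in the underlying ratio. The sketch below assumes that "ratio" refers to the power ratio MAX² / MSE inside the PSNR definition, and uses the 0.09 dB gap between the 33.27 dB and 33.36 dB entries of Table 1 as an example.

```python
def db_gain_to_ratio(delta_db):
    """Multiplicative change in the power ratio for a gain of delta_db dB."""
    return 10.0 ** (delta_db / 10.0)


# Gap between RCAN (33.27 dB) and the last column of Table 1 (33.36 dB).
ratio = db_gain_to_ratio(0.09)
print(f"power ratio multiplied by {ratio:.4f}")
```

Under this reading, a gain of 0.09 dB corresponds to the ratio growing by a factor of roughly 1.02, or equivalently to the mean squared error shrinking by the same factor.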

Our algorithm clearly outperforms the top PIRM2018 algorithm. No self-ensemble algorithm was used in the post-processing stage to achieve further improvements. These results therefore purely demonstrate the effect of our data representation, our network architecture, and our logarithmic loss function.

For the sake of completeness, we further investigated the effect of Multi-FAN and the logarithmic loss on the very deep RCAN of [47], with results presented in Table 3. Since the results in Table 1 exhibit very low standard deviations across the 11-fold cross-validation experiments, here, and for the rest of the paper, the experiments are carried out only on the first fold. Looking at the results in Table 3, the Multi-FAN module improves PSNR both when used with RIRN and when used with RCAN. The logarithmic loss has again been shown to be effective in improving the results.

4.5 Effect of different loss combinations and their logarithmic versions

Loss function   SSIM       log(SSIM)   L1 + SSIM   log(L1) + log(SSIM)
PSNR (dB)       33.256     33.27       33.26       33.28
                (3.69)     (3.68)      (3.69)      (3.69)
SSIM            0.6567     0.6575      0.6570      0.6554
                (0.0606)   (0.0609)    (0.0610)    (0.0607)
Table 4: Effect of the logarithmic loss for different loss functions for RCAN.

To further validate the effectiveness of the logarithmic version of a loss function, and also to investigate the effect of different loss functions on training, we carried out the following experiments. We train our RIRN+Multi-FAN with 1. SSIM loss function and its logarithmic version, 2. the summation of SmoothL1 and SSIM loss functions, and 3. the summation of the logarithmic versions of both loss functions. The results are presented in Table 4. From these results, it is apparent that a logarithmic loss improves performance both on PSNR and SSIM. Our best PSNR is achieved when using the logarithmic version of both L1 and SSIM loss functions. Our best SSIM result is obtained when using the logarithmic version of the SSIM loss function on its own.

4.6 Qualitative results

Figure 3: Qualitative results at . RCAN* indicates the approach from PIRM2018 challenge. The bounding box contains, from left to right, the results demonstrating the effect of our data representation, our Multi-FAN network and logarithmic scale loss function respectively.

The qualitative results are presented in Figures 3 and 4, which display different wavelength channels of images from the test dataset. The improvements introduced by the mosaic representation can be seen in both figures by comparing the RCAN* results to RCAN. In all images, the edges are better defined. This improves further, yet more subtly, when we introduce our Multi-FAN and the logarithmic version of the loss function. This effect can be better appreciated by assessing the edges in the colour checker. Note that our ground truth images, due to the low spatial resolution of multi-spectral cameras, do not enjoy the high resolution of the ground truth images usually used in the RGB SR literature [16, 11, 4, 27]. Moreover, due to the grayscale nature of the images, visual improvements regarding the colours (intensity of the images) are more difficult to assess.

Figure 4: Qualitative results at (left), and (right). RCAN* indicates the approach from the PIRM2018 challenge. The bounding box contains, from left to right, the results demonstrating the effect of our data representation, our Multi-FAN network and logarithmic scale loss function respectively.
             Network of [34]                     Very deep network of [47]
             RCAN     RIRN     RIRN+Multi-FAN    RCAN      RIRN      RIRN+Multi-FAN
Params       1.37M    1.36M    1.54M             15.33M    15.21M    15.57M
Time (sec)   7        3        4                 2.5       0.35      0.46
PSNR (dB)    33.27    33.30    33.32             33.39     33.37     33.39
Table 5: Effect of CA and Multi-FAN on inference time (seconds), number of parameters, and PSNR (dB).

4.7 Inference time

Since CA involves multiplying the input feature map by a computed weight vector, it can be relatively expensive computationally if used in abundance. In situations where one is forced to compromise between speed and accuracy, our Multi-FAN module can be exploited to improve performance while still maintaining a short inference time. To demonstrate this, Table 5 presents the number of parameters, inference time, and PSNR results for the same loss function (SmoothL1) for RCAN, RIRN and RIRN+Multi-FAN for the two different network sizes discussed in this paper, that is, one with and [34], and the other with and [47].
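To make the cost of CA concrete, the following is a minimal sketch of a channel attention step in the squeeze-and-rescale style commonly used in RCAN-like networks: each channel is averaged to a single value, a small two-layer mapping turns those values into per-channel weights, and the feature map is multiplied by the weights. It is written in plain Python for readability and omits biases; the function name `channel_attention` and the weight matrices `w_down` and `w_up` are illustrative assumptions, not the implementation used in the experiments.

```python
import math


def channel_attention(feature_map, w_down, w_up):
    """Rescale each channel of a C x H x W feature map (nested lists).

    w_down: (C/r) x C matrix, w_up: C x (C/r) matrix (hypothetical weights).
    Returns the rescaled feature map and the per-channel weights.
    """
    # Squeeze: global average pooling gives one value per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Channel reduction followed by ReLU.
    hidden = [
        max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down
    ]
    # Channel expansion followed by a sigmoid gives weights in (0, 1).
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_up
    ]
    # Rescale: every pixel of channel c is multiplied by weights[c].
    scaled = [
        [[weights[c] * value for value in row] for row in channel]
        for c, channel in enumerate(feature_map)
    ]
    return scaled, weights


# Two channels of size 1 x 2, reduced to one hidden unit.
features = [[[1.0, 3.0]], [[2.0, 2.0]]]
scaled, weights = channel_attention(features, [[1.0, 0.0]], [[0.0], [0.0]])
```

The pooling and the final rescaling both touch every element of the feature map, so the work is repeated for each CA in the network, which is why many CAs add up at inference time.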

In the case where and , we have 15 CA mechanisms. Removing these 15 CAs and employing a Multi-FAN module instead not only improves the fidelity of the images but also reduces inference time. Because the inference times in this scenario are small, we averaged them across our 11-fold cross-validation experiments. In the case of a very deep residual network with and , we have 200 CA mechanisms. Removing these causes a reduction in PSNR. However, adding a Multi-FAN module recovers the PSNR while still achieving a significant reduction in inference time.
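As a rough check on how much of the parameter budget the CA mechanisms account for, the parameter count of a single CA can be worked out by hand. The sketch below assumes an RCAN-style CA made of two 1×1 convolutions with biases (C → C/r → C), with C = 64 channels and reduction ratio r = 16. These values are assumptions based on common RCAN settings, not figures stated in this section, and the function name `ca_param_count` is hypothetical.

```python
def ca_param_count(channels=64, reduction=16):
    """Parameters of one channel attention block (two 1x1 convs with biases)."""
    hidden = channels // reduction
    down = channels * hidden + hidden    # C -> C/r: weights plus biases
    up = hidden * channels + channels    # C/r -> C: weights plus biases
    return down + up


# Total CA parameters for the two network sizes discussed above.
small_total = 15 * ca_param_count()
large_total = 200 * ca_param_count()
```

Under these assumptions, 15 CAs account for roughly 0.01M parameters and 200 CAs for roughly 0.12M, which is in line with the differences between the RCAN and RIRN columns of Table 5 (1.37M versus 1.36M, and 15.33M versus 15.21M).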

5 Conclusions

In this paper, we proposed a new architecture for the task of super-resolution of multi-spectral images. Experimental results show the effectiveness of our approach as well as the data format we generate. In particular, we show that 1) incorporating features at multiple semantic levels (e.g., from different RGs) is beneficial and 2) our generated mosaic images are more effective than the original raw multi-spectral cubes. Quantitatively, this work improves on [36] by a total of in PSNR with improvement being due to mosaic pattern generation, and was due to our proposed Multi-FAN network and our choice of logarithmic scale loss function.

Regarding inference time, we showed that removing CA mechanisms and using the Multi-FAN network instead can result in significant improvement of inference time, a decrease for the very deep RCAN [47] with no compromise in PSNR results.


  • [1] N. Ahn, B. Kang, and K. Sohn (2018) Fast, accurate, and lightweight super-resolution with cascading residual network. arXiv preprint arXiv:1803.08664. Cited by: §2.
  • [2] S. Anwar and N. Barnes (2019) Densely residual laplacian super-resolution. arXiv preprint arXiv:1906.12021. Cited by: §2.
  • [3] J. F. Bell, D. Wellington, C. Hardgrove, A. Godber, M. S. Rice, J. R. Johnson, and A. Fraeman (2016) Multispectral imaging of mars from the mars science laboratory mastcam instruments: spectral properties and mineralogic implications along the gale crater traverse. In AAS/Division for Planetary Sciences Meeting Abstracts, Vol. 48. Cited by: §1.
  • [4] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. Cited by: §4.6.
  • [5] M. Bigas, E. Cabruja, J. Forest, and J. Salvi (2006) Review of cmos image sensors. Microelectronics journal 37 (5), pp. 433–451. Cited by: §1.
  • [6] C. Chang (2000) An information-theoretic approach to spectral variability, similarity, and discrimination for hyperspectral image analysis. IEEE Transactions on information theory 46 (5), pp. 1927–1932. Cited by: §3.2.
  • [7] D. Doering, M. Vizzotto, C. Bredemeier, C. da Costa, R. Henriques, E. Pignaton, and C. Pereira (2016) MDE-based development of a multispectral camera for precision agriculture. IFAC-PapersOnLine 49 (30), pp. 24–29. Cited by: §1.
  • [8] C. Dong, C. C. Loy, K. He, and X. Tang (2016) Image super-resolution using deep convolutional networks. TPAMI. Cited by: §2, §3.5.
  • [9] C. Dong, C. C. Loy, and X. Tang (2016) Accelerating the super-resolution convolutional neural network. In ECCV, Cited by: §2, §2, §2.
  • [10] Y. Fu, Y. Zheng, H. Huang, I. Sato, and Y. Sato (2018) Hyperspectral image super-resolution with a mosaic rgb image. IEEE Transactions on Image Processing 27 (11), pp. 5539–5552. Cited by: §2, §3.5, §3.5.
  • [11] A. Fujimoto, T. Ogawa, K. Yamamoto, Y. Matsui, T. Yamasaki, and K. Aizawa (2016) Manga109 dataset and creation of metadata. In Proceedings of the 1st International Workshop on coMics ANalysis, Processing and Understanding, pp. 2. Cited by: §4.6.
  • [12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §3.3.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §2.
  • [14] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep backprojection networks for super-resolution. In CVPR, Cited by: §2.
  • [15] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, Cited by: §2.
  • [16] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206. Cited by: §4.6.
  • [17] S. P. Jaiswal, L. Fang, V. Jakhetiya, J. Pang, K. Mueller, and O. C. Au (2016) Adaptive multispectral demosaicking based on frequency-domain analysis of spectral correlation. IEEE Transactions on Image Processing 26 (2), pp. 953–968. Cited by: §2.
  • [18] A. Jolicoeur-Martineau (2018) The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734. Cited by: §2.
  • [19] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, Cited by: §2, §3.5.
  • [20] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Deeply-recursive convolutional network for image super-resolution. In CVPR, Cited by: §2.
  • [21] J. Kim, J. Choi, M. Cheon, and J. Lee (2018) RAM: residual attention module for single image super-resolution. arXiv preprint arXiv:1811.12043. Cited by: §2, §2.
  • [22] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [23] F. Lahoud, R. Zhou, and S. Süsstrunk (2018) Multi-modal spectral image super-resolution. In European Conference on Computer Vision, pp. 35–50. Cited by: §2, §3.3, §3.5.
  • [24] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. Cited by: §2.
  • [25] Y. Li, J. Hu, X. Zhao, W. Xie, and J. Li (2017) Hyperspectral image super-resolution using deep convolutional neural network. Neurocomputing 266, pp. 29–41. Cited by: §2.
  • [26] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In CVPRW, Cited by: §2, §2, §2.
  • [27] D. Martin, C. Fowlkes, D. Tal, J. Malik, et al. (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Cited by: §4.6.
  • [28] V. Mnih, N. Heess, A. Graves, et al. (2014) Recurrent models of visual attention. In Advances in neural information processing systems, pp. 2204–2212. Cited by: §2.
  • [29] Y. Monno, S. Kikuchi, M. Tanaka, and M. Okutomi (2015) A practical one-shot multispectral imaging system using a single image sensor. IEEE Transactions on Image Processing 24 (10), pp. 3048–3059. Cited by: §2.
  • [30] M. Najafi, S. T. Namin, and L. Petersson (2013) Classification of natural scene multi spectral images using a new enhanced crf. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3704–3711. Cited by: §1.
  • [31] S. Park, H. Son, S. Cho, K. Hong, and S. Lee (2018) SRFeat: single image super-resolution with feature discrimination. In ECCV, Cited by: §2.
  • [32] Pytorch v0.4.0. Note: accessed: 2019-03-22 Cited by: §3.3, §4.1.
  • [33] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §2.
  • [34] Z. Shi, C. Chen, Z. Xiong, D. Liu, Z. Zha, and F. Wu (2018) Deep residual attention network for spectral image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §2, Figure 2, §3.1, §3.3, §3.5, §3, §4.2, §4.3, §4.7.
  • [35] M. Shoeiby, P. Lars, S. Aliakbarian, A. Armin, and A. Robles-Kelly. Super-resolved chromatic mapping of snapshot mosaic image sensors via a texture sensitive residual network. Cited by: §2.
  • [36] M. Shoeiby, A. Robles-Kelly, R. Wei, and R. Timofte (2019) PIRM2018 challenge on spectral image super-resolution: dataset and study. External Links: 1904.00540 Cited by: Figure 1, §2, §3.5, §3.5, §4.1, §5.
  • [37] M. Shoeiby, A. Robles-Kelly, R. Zhou, F. Lahoud, S. Süsstrunk, Z. Xiong, Z. Shi, C. Chen, D. Liu, Z. Zha, F. Wu, K. Wei, T. Zhang, L. Wang, Y. Fu, Z. Zhong, K. Nagasubramanian, A. K. Singh, A. Singh, S. Sarkar, and G. Baskar (2018) PIRM2018 challenge on spectral image super-resolution: methods and results. In European Conference on Computer Vision Workshops (ECCVW), Cited by: §2, §4.1.
  • [38] SmoothL1 loss function. Note: /nn.html#smoothl1loss. Last accessed: 2019-03-22. Cited by: §3.3, §4.1.
  • [39] Y. Tai, J. Yang, X. Liu, and C. Xu (2017) Memnet: a persistent memory network for image restoration. In CVPR, Cited by: §2.
  • [40] Y. Tai, J. Yang, and X. Liu (2017) Image super-resolution via deep recursive residual network. In CVPR, Cited by: §2.
  • [41] K. Takumi, K. Watanabe, Q. Ha, A. Tejero-De-Pablos, Y. Ushiku, and T. Harada (2017) Multispectral object detection for autonomous vehicles. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pp. 35–43. Cited by: §1.
  • [42] T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In ICCV, Cited by: §2.
  • [43] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang (2018) ESRGAN: enhanced super-resolution generative adversarial networks. arXiv preprint arXiv:1809.00219. Cited by: §2.
  • [44] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §3.3.
  • [45] D. Wu and D. Sun (2013) Advanced applications of hyperspectral imaging technology for food quality and safety analysis and assessment: a review—part i: fundamentals. Innovative Food Science & Emerging Technologies 19, pp. 1–14. Cited by: §1.
  • [46] K. Zhang, W. Zuo, and L. Zhang (2018) Learning a single convolutional super-resolution network for multiple degradations. In CVPR, Cited by: §2, §2.
  • [47] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, pp. 294–310. Cited by: §2, §3.1, §3.3, §3.5, §3, §4.3, §4.4, §4.4, §4.7, §5.
  • [48] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. arXiv preprint arXiv:1807.02758. Cited by: §2, §2, §2.
  • [49] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In CVPR, Cited by: §2.
  • [50] R. Zhou, R. Achanta, and S. Süsstrunk (2018) Deep residual network for joint demosaicing and super-resolution. In Color and Imaging Conference, Vol. 2018, pp. 75–80. Cited by: §2, §3.5, §3.5.