Super-resolution (SR) algorithms serve the purpose of reconstructing high-resolution (HR) images from either single or multiple low-resolution (LR) images. Due to constraints such as sensor limitations and exceedingly high acquisition costs, it is often challenging to obtain HR images. In this regard, SR algorithms provide viable opportunity to enhance and reconstruct HR images from LR images recorded by the sensors. Over more than three decades, progress has steadily been observed in the development of Super-resolution, as both multi-frame and single-frame SR now have substantial applications that can use the image generation purposefully.
SR is very significant to Remote Sensing because it provides opportunity to enhance LR images despite the inherent problems often faced in remote-sensing scenarios. The hardware and material costs for smaller missions around data accumulation are very high. Additionally, onboard instruments on satellites continue to generate ever-increasing data as spatial and spectral resolutions also increase, and this has progressively become challenging for compression algorithms , as they try to meet the bandwidth restrictions [2, 55]. Remote sensing is fundamental in obtaining images covering most of the globe, permitting many vital projects such as disaster monitoring, military surveillance, urban maps, and vegetation growth monitoring. It is thus imperative that enhancements and progress be made in post-processing techniques to overcome obstacles of increasing spatial resolution.
There are two main methods used in Super-resolution: Single-image SR (SISR) and Multi-image SR (MISR). SISR employs a single image to reconstruct a HR version of it. However, a single image is quite limited in the amount of information that it provides, mainly post the LR image formation process. Contrastingly, MISR involves multiple LR images of the same scene acquired from the same or different sensors to construct an HR image. The significant advantage MISR holds over SISR is in how it can draw out otherwise unavailable information from the different image observations of the same scene. It consequently constructs high spatial resolution image. However, to achieve the additional benefits of MISR, a multitude of problems need to be solved. Conventionally, multiple images are obtained by either a satellite during its multiple orbits or by different satellites at different times or different sensors acquiring images at the same time. With so many variables involved, many complications need to be considered, such as cloud coverage, time variance in scene content, and invariance to absolute brightness variability.
There has been significant progress in Single-image SR as deep learning methods and deep neural networks have been brought into use, allowing a better efficient generation of non-linear maps to deal with complex degradation models. However, there has not been any similar progress in MISR.
In this paper, building over the latest breakthroughs in SISR [9, 38, 61, 63, 6], we propose a deep learning MISR solution for remote-sensing applications that exploits both spatial and temporal correlations to combine multiple low-resolution acquisitions smartly. Indeed, our model provides a real end-to-end efficient solution to recover high-resolution images, overcoming limitations of previous similar methodologies, and providing enhanced reconstruction results. So, the presented research constitutes an exceptional opportunity, easily replicable, to access better quality and more useful information for the remote-sensing community. In particular, the main contribution of our work lies in:
The use of 3D convolutions to efficiently extract, directly from the stack of multiple low-resolution images, high-level representations, simultaneously exploiting spatial and temporal correlations.
The introduction of a novel feature attention mechanism for 3D convolutions that lets the network focus on most promising high-frequency information largely overcoming main locality limitations of convolutional operations. Moreover, the concurrent use of multiple nested residuals, inside the network, let low-frequency components flow directly to the output of the model.
The conceptualization and development of an efficient, highly replicable, deep learning neural network for MISR that makes use of 2D and 3D convolutions exclusively in the low-resolution space. It has been extensively evaluated on a major multi-frame open-source remote-sensing dataset proving state-of-the-art results with a considerable margin. So, it constitutes an exceptional tool and opportunity for the remote-sensing research community.
The remainder of this paper is structured as follows. Section II covers the related work on SR and its developments in techniques for both SISR and MISR. Section III explains the overall methodology, network architecture and its subsequent blocks, and training process. Section IV discusses the experimentation, the Proba-V dataset, data pre-processing, and results. Section V draws some conclusions and future directions.
Ii Related Work
Related literature is organized as follows. Firstly, a wide range of studies related to SISR are discussed which involve state-of-the-art methods and recent developments in SISR techniques, which is the basis of every SR method. Secondly, studies performed for SR in remote sensing domain are discussed. Lastly, MISR related studies, which are rarely addressed, are discussed including latest developments.
Ii-a Single-image Super-resolution
Ever since the late eighties and the early nineties, there has been an eager interest in SR, comprehensively reviewed by Borman and Stevenson . Following forth in the works of Tsai and Huang  and afterward, Kim et al. 
, the new approaches considered processing images in the frequency domain to recover lost information of higher-frequency. These first works had certain drawbacks, like the level of difficulty observed in successfully incorporating the prior available spatial information. Several studies performed by Irani and Peleg[19, 20, 18] focused over the spatial domain, proposing methods for SR reconstruction.
Learning-based methods build upon the relation between LR-HR images, and there have been many recent advancements in this approach, mostly due to deep convolutional neural networks (CNNs) [10, 27, 9]. The leading force for this was Dong et al. , who achieved superior results by proposing a Super-resolution CNN (SRCNN) framework. Kim et al. introduced residual learning and suggested very deep SR (VDSR)  and deeply recursive CN (DRCN)  with 20 layers. Later, Tai et al. pioneered deep recursive residual network (DRRN)  and memory blocks in MemNet . So going forth, particular emphasis has been placed on proper upscaling of spatial resolutions at network tail-ends, as well as extracting information of the original scale LR inputs. To that end, some enhancements have been proposed for accelerating the testing and training needed for SRCNN, a faster network structure FSRCNN . Ledig et al.  proposed SRGAN, a generative adversarial network (GAN) for photo-realistic SR with perceptual losses , and K. He et al. introduced ResNet  for image SR and to make a deeper network SRResNet. EnhanceNet  also used a GAN based model to merge perceptual loss with automated texture synthesis. Though, the predicted results can produce some artifacts and may not be a faithful reconstruction.
In recent past years, enhancements in deep networks have been proposed and showed promising results for SISR, for example, in , an Enhanced Deep Super-resolution (EDSR) network was developed to improve the performance by removing unnecessary modules and expanding the model size with the stable training process in conventional residual networks. Yu et. al 
demonstrated better results in terms of accurate SR by generating models with a wide range of features before ReLU activation and training with normalized weights. Zhang et. al proposed residual channel attention networks (RCAN) that exploits very deep network structure based on residual in residual (RIR) which bypass excessive low-frequency information through multiple skip connections.
Ii-B SR for Remotely Sensed Imagery
With the increasing availability of recent satellite-based multispectral sensors and transmission bandwidth restrictions , it is possible to obtain images at different spatial resolutions with multiple spectral bands. Keen attention is being paid to developing better methods of super-resolving the lower-resolution bands but simultaneously keeping the images at a high spatial resolution. An example can be seen in , where - through the utilization of only lower resolution bands – SR of multispectral remote sensing images is applied with convolutional layers. 
shows the integration of residual connections into a single image SR based architecture to achieve better SR performance. The performance of image enhancement methods in computer vision can also be increased prominently through generative adversarial networks (GANs)[34, 58]. Moreover, GANs have also been exploited to super-resolve remote sensing images. For example, Ma et. al.  developed a dense residual generative adversarial network (DRGAN)-based SISR method to super resolve remote sensing images. By designing a dense residual network as the generative network in GAN, their method makes full use of the hierarchical features from low-resolution (LR) images.
Dong et. al. 
proposed a novel multi-perception attention network (MPSR) for Super-resolution of low resolution remotely sensed images, which achieved better results by incorporating the proposed enhanced residual block (ERB) and residual channel attention group (RCAG). Their methodology is capable of dealing with low-resolution remote sensing images via multi-perception learning and multi-level information adaptive weighted fusion. They claimed that, a pre-train and transfer learning strategy can improve the SR performance and stabilize the training procedure. Gargiulo et. al. proposed a CNNs based approach to provide a 10 m super-resolved image of the original 20 m bands of remotely sensed Sentinel-2 images. In their experimental results, they claimed that the proposed solution can achieve better performance with respect to most of the state-of-the-art methods, including other deep learning based ones with a considerable saving of computational burden. Recently methods to enhance spatial resolution of remotely sensed images used Parallel Residual Network , Bidirectional Convolutional LSTMs , Deep Residual Squeez and Excitation Network .
Ii-C Multi-image Super-resolution
Multi-image SR (MISR) involves the extraction of information from many LR observations of the same scene to reconstruct HR images . The earliest work for MISR was proposed by Tsai and Huang  using a frequency-domain technique, by combining multiple images with sub-pixel displacements to improve the spatial resolution of images. Due to the some weaknesses of the first proposed method related to incorporate prior information of HR images, several spatial domain MISR techniques were considered . These include projection onto convex sets (POCS) 
, non-uniform interpolation, regularized methods [52, 47], and sparse coding 
. With the availability of more data from the multiple observations of the scene, it is possible to obtain a more accurate reconstruction than through single-image methods. MISR techniques involve different ways of degrading the original image by following an image model, and these involve blurring, warping, noise contamination, and decimation. Then the degradation is reversed by solving an ill-posed optimization problem. To this end, Bayesian reconstruction in the gradient projection algorithm was used alongside subpixel displacement estimation. An enhanced Fast and Robust SR (FRSR)  employs estimation of maximum likelihood analysis and simplified regulation. Another proposal in SR was for the Adaptive detail enhancement (SR-ADE) , which reconstructs satellite images with the use of a bilateral filter for decomposing input images while also amplifying high-frequency detail information.
Another approach Iterative Back Projection (IBP), introduced by Irani and Peleg , used a back-projection of the difference between the actual LR images obtained and the simulated LR images to the SR image. The forward imaging process is inverted and iteratively attempted in updates. As with MISR, there are apparent drawbacks when prior images are difficult to be included, or it is difficult to model an image’s degradation process.
In the past few years, many deep learning-based approaches have been exploited to address the MISR problems in the context of enhancing video sequences [23, 4, 21]. However, MISR is rarely exploited for remotely sensed satellite imagery. Kawulok et al  demonstrated the potential benefits of information fusion offered by multiple satellite images reconstruction and learning-based SISR approaches. In their work, EvoNet framework  based on several deep CNNs was adopted to exploit SISR in the preprocessing phase of the input data for MISR.
Recently, a challenge was set by the European Space Agency (ESA) to super-resolved multi-temporal PROBA-V satellite imagery111https://kelvins.esa.int/proba-v-super-resolution.. In this context, a new CNN-based architecture DeepSUM was proposed by Molini et. al  to super resolve multi-temporal PROBA-V imagery. An end-to-end learning approach was established by exploiting both spatial and temporal correlations. Most recently, Deudon et. al presented HighRes-net based on deep learning to deal with the MISR of remotely sensed PROBA-V satellite imagery . They proposed an end-to-end mechanism that learns the sub-tasks involved in MISR, that are co-registration, fusion, upsampling, and registration-at-the-loss.
MISR aims at recovering an HR image from a set of LR images of the same scene acquired in a certain temporal window. In contrast to SISR, MISR can simultaneously benefit from spatial and temporal correlations, being able to achieve far better reconstruction results theoretically. Either way, SR is an inherently ill-posed problem since a multiplicity of solutions exist for any given set of low-resolution images. So, it is an underdetermined inverse problem, of which solution is not unique. Our proposed methodology, based on a representation learning model, aims at generating a super-resolved image applying a function to the set of images:
where are model parameters learned with an iterative optimization process.
In other words, we propose a fully convolutional Residual Attention Multi-image Super-resolution network (RAMS) that can efficiently extract high-level features concurrently from LR images and fuse them exploiting a built-in visual attention mechanism. Attention directs the focus of the model only on most promising extracted features, reducing the importance of less relevant ones and mostly transcending limitations of the local region of convolutional operations. Moreover, extensive use of nested residual connections lets all the redundant low-frequency information, present in the set of LR images, flow through the network, leaving the model focusing its computation only on high-frequency components. Indeed, high-frequency features are more informative for HR reconstruction, while LR images contain abundant low-frequency information that can directly be forwarded to the final output . Finally, as the majority of the model for single-image super-resolution [38, 61, 10, 6], all computations in our network are efficiently performed in the LR feature space requiring only an upsample operation at the final stage of the model.
In the following paragraphs, we present the overall architecture of the network with a detailed overview of the main blocks. Finally, we conclude the methodology section with precise details of the optimization process for training the network.
Iii-a Network architecture
An overview of the RAMS network, with its main three blocks and two branches, is depicted in Fig. 1. As a high-level description, the model takes as input a single set of low-resolution images that can be represented as a tensor with shape where , and are the height, width, and channels of the single low-resolution images, respectively. The upper global residual path proposes a simple SR solution, making an aware fusion of the input images. On the other hand, the central branch exploits 3D convolutions residual-based blocks in order to extract spatial and temporal correlations from the same set of LR images and provide a refinement to the residual simple SR image.
More in detail, in the first part of the main path of the model, we use a simple 3D convolutional layer, with each filter of size , to extract shallow features from the input set of LR images. Then, we apply a cascade of N residual feature attention blocks that increasingly extract higher-level features, exploiting local and non-local, temporal, and spatial correlations. Moreover, we make use of a long skip connection for the shallow features and several short skip connections inside each feature attention block to let flow all redundant low-frequency signals and let the network focus on more valuable high-frequency components. The three dimensions , and
are always preserved through reflecting padding. The first part of the main branch can be modeled as a single functionthat maps each tensor to a new higher dimensional one with shape :
In the second part of the main branch, we further process the output tensor with temporal reduction blocks. In each block, we intersperse a residual feature attention block with 3D convolutional layer without padding on the temporal dimension (TR-Conv). So, , and remain invariant and only the temporal dimension is reduced. The output of this second block is a new tensor with shape , where the temporal dimension is reduced to :
Finally, the output tensor is processed by a further TR-Conv layer that reduces to one and an upscale function that generates a tensor of shape where is the scaling factor.
The overall output of the main branch sums with the trivial solution provided by the global residual. Indeed, the global path simply weights the LR images of the input tensor with a residual temporal attention block with filters of size . Then it produces an output tensor of shape that is added to the one of the main branch. So, the final SR prediction of the network is the sum of the two contribution:
The upscaling procedure is identical for both branches; after several trials with different methodologies, such as transposed convolutions , bi-linear resizing and nearest-neighbor upsampling , we adopted a sub-pixel convolution layer as explained in detail in . So, for either branch, the last 2D or 3D convolution generates features in order to produce the final tensors of shape for the residual sum.
In conclusion, the overall model takes as input a tensor with shape , works always efficiently in the LR space and generates only at the final stage an output tensor with shape .
In the following sub-paragraphs, the three major functional blocks, residual feature attention, residual temporal attention, and temporal reduction blocks are further explained and analyzed.
Iii-B Residual attention blocks
Residual attention blocks are at the core of the RAMS model; their specific architecture allows it to focus on relevant high-frequency components and let redundant, low-frequency information flow through the residual connections of the network. Inter-dependencies among features, in the case of feature attention blocks, or temporal steps, in the case of temporal attention blocks, are taken into account computing for each of them, relevant statistics that take into account local and non-local, temporal and spatial correlations. Indeed, either 3D or 2D convolution filters operate with local receptive fields loosing the possibility to exploit contextual information outside of their limited region of view.
Iii-B1 Residual feature attention
Except for the global residual path, all residual attention blocks are residual feature attention blocks, as shown in Fig. 1. Indeed, each block of features is weighted up in order to trace most promising high-frequency components, and a residual connection lets low-frequency information flow through the network.
More formally, the output of a residual feature attention block with a generic tensor, , is equal to:
where is the feature attention function and is the output of two stacked 3D convolutional layers.
where , and , represent the filters with size and biases respectively and, ’’ denotes the 3D convolution operation. The number of filters is always equal to as the ones extracted by the first 3D convolutional layer.
So, all low-frequency components in can flow through the residual connection and can focus the attention of the network to more valuable high-frequency signals. To this end, the feature attention block takes the feature-wise global spatial and temporal information into a feature descriptor by using a global average pooling. So, from the tensor with shape we extract feature statistics using the following equation:
Statistics of the feature can be viewed as a collection of descriptors, whose values contribute to express the whole stack of temporal images .
In Fig.2, it is possible to observe the global pooling operation which output is a tensor with shape and last dimension equal to . In addition, the output tensor is further processed by a stack of two 3D convolutional layers with a ReLU 
and sigmoid activation function, respectively. Indeed, as discussed in, the stack of two convolutional layers with the filter of size concur to create a non-linear mapping function which is able to deeply capture feature-wise dependencies from the aggregated information extracted by the global pooling operation. The first 3D convolutional layer reduces the feature size by a factor of
, and then the second layer restores the original dimension and constraints its values from zero to one with a sigmoid function in a non-mutually exclusive relationship.
Finally, the original tensor is weighted up by the processed attention statistics as shown in Eq. 5.
Iii-B2 Residual temporal attention
The primary purpose of the global residual path is to generate a starting trivial solution for the upsampling problem. More accurate is this starting prediction, and more simplified is the role of the main branch of the network, leading to a lower reconstruction error. However, the input of the model has different LR images that have to be merged. Intuitively, for each input sample , there are some LR images more similar to each other. So, giving them more relevance when merging the
LR images would most probably lead to higher quality predictions. In this context, the aim of the residual temporal attention block is to make an aware weighing of the different input temporal images, letting the network to make an upsample solution with primarily the most similar temporal steps. That is accomplished with an asymmetrical mechanism to the one employed in the residual feature attention blocks and can be summarized by the following formula:
where is the temporal attention function and is the product of a stack of two 2D convolutional operations as depicted in Fig. 3 with and as filter size and number of features, respectively. Then, as already introduced with the feature attention blocks, the temporal block takes the temporal-wise global spatial information into a feature descriptor by using a global average pooling operation. Finally those statistical descriptors are processed by a stack of 2D convolutional layers with ReLU and sigmoid as activation function, respectively, scaling the channels of the input tensor, as shown in Eq. 8. As for feature attention blocks, the first convolutional layer reduces the number of the last dimension by a factor of , giving the network the possibility to fully capture temporal-wise dependencies from the aggregated output information of the global average pooling operation.
Iii-C Temporal reduction blocks
The aim of the last block of the main branch is to slowly reduce the number of temporal steps so that the temporal depth eventually reduces to one. Indeed, the output tensor of the N residual feature attention blocks has temporal dimensions that need to be merged. To this end, we further process the incoming tensors with temporal reduction blocks. Each one is composed of a residual feature attention block and a 3D convolutional layer without any reflecting padding in the temporal dimension, denoted TR-Conv. So, at each TR-Conv layer we reduce of . The attention blocks allow the network to learn the best space to decouple image features, ”highlighting” more promising features to maintain when reducing the temporal dimension. The output of the last temporal reduction block is a tensor with shape where the temporal dimension is reduced to . The last TR-Conv, before the upsampling function , reduces to one the number of temporal steps and generates features for the sub-pixel convolutional layer.
Iii-D Training process
Learning the end-to-end mapping function requires the estimation of model parameters . That is achieved by minimizing a loss between the reconstructed super-resolved images and the corresponding ground truth high-resolution images .
Several loss functions have been proposed and investigated for the SISR problem, such as[31, 38, 32, 61], [41, 9, 50, 29] and perceptual and adversarial losses [34, 22]. However, in typical MISR remote-sensing problems, LR images are taken within a certain time window and they could have an undefined spatial misalignment one to each other. So, we must take into account that the super-resolved output of the model will be inherently not registered with the target image . Moreover, since we can have very different conditions among the images part of the same scene, it is important to make the loss function independent from possible intensity biases between the super-resolved and the target . Indeed, if we get a super-resolved image , with constant and low enough to avoid numerical saturation, we can consider its reconstruction perfect, since it represents the scene with the same level of detail of the ground truth.
With these premises, inspired by the metric proposed in , we defined as the super-resolved output cropped of pixels on each border and we consider each possible patch of size extracted from the ground truth . We compute the mean biases between the cropped and the patches as follows:
The loss is then defined as the minimum mean absolute error ( loss) between and each possible alignment patch . We use the MAE instead of the most used MSE since we experimentally find that provides better results for image restoration problems, as proved by the previous works[64, 38, 63].
where represents the norm of a matrix, i.e. the sum of its absolute values.
In this section, we test the proposed methodology in an experimental context, training it on a dataset of real-world satellite images and evaluating its performance in comparison with other approaches, including a state-of-the-art SISR algorithm, to demonstrate the superiority of Multi-image models. We first present the dataset and the preprocessing stages, we define all the parameters used during the experimentation, and then we propose quantitative and qualitative results. We also perform an ablation study to demonstrate the contribution of the global residual branch that implements a temporal attention mechanism. To implement our network, we use the TensorFlow framework. The complete code with a pre-trained version of our model is available online222https://github.com/EscVM/RAMS.
Iv-a The Proba-V Dataset
To train our model, we exploit the dataset released by the Advanced Concept Team of the European Space Agency (ESA) . This dataset has been specifically conceived for MISR problems, and it is composed of several images taken by the Proba-V satellite 333https://esa.int/Applications/Observing_the_Earth/Proba-V. in the two different spectral bands RED and NIR (near-infrared). Proba-V satellite has been launched by ESA in 2013 and is specifically designed for land covering and vegetation growth monitoring across almost the entire globe. The satellite provides images in two resolutions with different revisit frequency. HR images have a 100m per pixel spatial resolution and are released roughly every five days, while LR images have 300m per pixel resolution and are available almost daily. The characteristics of the Proba-V imagery make it particularly suitable for MISR algorithms since it provides both resolutions natively, allowing for the application of the SR process without the need for artificially degrading and downsampling the HR images.
The dataset has been released for the Proba-V Super Resolution challenge 444https://kelvins.esa.int/proba-v-super-resolution. and is composed of two main parts: the train part provides both LR and HR images, while the test part LR images, only. In order to verify the effectiveness of our approach, we consider the train part and not the test part, since it has been conceived for the challenge evaluation only and it does not include the ground truths. Thus, we subdivide the train part in training and validation sets. To ease the comparison with previous methods, we use the same validation images used in . In total, we have 415 scenes for training and 176 for validation for the RED band and 396 for training and 170 for validation for NIR.
Each scene is composed of several LR images (from 9 to 35, depending on the scene) with a dimension of 128x128 pixels and a single HR ground truth with a dimension of 384x384 pixels. The images are encoded as 16-bit png files, even if the actual signal bit-depth is 14 bits. Additionally, each image features a binary mask that distinguishes reliable pixels from unreliable ones (e.g., due to cloud coverage). This information is vital since the images are not taken in the same weather and temporal conditions, but a maximum period of 30 days can be covered in a single scene. For this reason, non-trivial changes in the landscape can occur between different LR images and their HR counterpart and are essential to understand which pixels carry meaningful information and which do not. Trying to infer the value of pixels that are concealed by clouds would mean being able to predict the weather in an unknown time by merely looking at the condition in other unspecified moments. For this reason, it is essential to train the network so that unreliable pixels do not influence the SR process. To assess the quality of each image, we defineas the clearance of the image, i.e. the fraction of reliable pixels in the correspondent binary mask.
Iv-B Data pre-processing
Before training the model, we pre-process the dataset with the following steps:
register each LR image using as reference the one with maximum clearance
select the clearest images from each scene that are above a certain clearance threshold
pre-augment the training dataset with temporal permutations of the LR input images
normalize the images by subtracting the dataset mean intensity value and dividing by the standard deviation
Since each LR image is taken at a different time and with some intrinsic spatial misalignment with respect to the others, it is important to resample each pixel value in order to have a coherent reference frame. For each scene of the dataset, we consider as a reference image the one with the maximum clearance . During the registration process, we consider translation as transformation model, which computes the necessary shifts to register each image for both the axes. Masks are taken into consideration during this process in order to avoid bad registration caused by unreliable pixels. The registration is performed in the Fourier domain using normalized cross-correlation as in . After computing the shifts, both LR images and the correspondent masks are shifted accordingly. We use a reflect padding to add pixels to LR images and a constant zero padding for masks. In this way, these extra pixels will be considered unreliable.
For each scene, we must select some LR images in order to match the temporal dimension of the network. We set a threshold on the clearance for an image to be accepted to avoid using awful images that can worsen the SR performance. The acceptable images are then sorted in order of clearance, and the best are selected. In the case of a scene with less than images, we sample randomly from the set of acceptable images until are reached. If a scene is only composed of clearances under , it is entirely removed from the dataset. This selection process is performed after the registration so that heavily bad registered images are also removed, even if they had an initial clearance above the threshold. Since each scene of the dataset contains at least 9 LR images, we set to fully exploit all the available information for most of the scenes.
One of the characteristics of the Proba-V dataset is that the LR images of a particular scene have no clear temporal order. Therefore, there is no reason to prefer a specific order in the input images to another. The training dataset is, therefore, pre-augmented by performing random temporal permutations of the selected input images to help generalization. In this way, we can train the algorithm to identify the best temporal image independently on the position inside the input tensor. We set this permutation parameter to , reaching a total of 2905 training data-points for RED and 2751 for NIR.
Finally, each image is normalized by subtracting the mean pixel intensity value computed on the entire dataset and dividing by the standard deviation. After the forward pass in the network, all the tensors are then denormalized, and the subsequent evaluations are performed on the 16 bits unsigned integer arrays.
Iv-C Experimental settings
The scaling factor of the Proba-V dataset is . Since we have different scenes for RED and NIR data, we treat the problem for the two bands separately. For this reason, we have , since we consider images with a single channel. We set and as number of filters and kernel size respectively for each convolutional layer. Therefore, the number of temporal reduction blocks is , since each block reduces the temporal dimension of 2. In all the residual attention blocks, we set as the reduction factor. After testing different values with a grid search, we set as the number of residual feature attention blocks in the main branch of the network. We find that decreasing this number causes a loss of performance while increasing it gives a little improvement in the results at the cost of a high increase in the number of parameters. is, therefore, the best compromise between network size and prediction results. In total, our model has slightly less than 1M parameters.
In most of the SR applications present in literature, LR images are obtained from the artificial degradation of the target HR images. In contrast, the real-world nature of the dataset, in which LR images are obtained independently from HR images, causes an unavoidable misalignment between the super-resolved output and the ground truth. To take into account this problem, the authors of the dataset consider a maximum shift of pixels on each axis between and target , computed on the basis of the geolocation accuracy of the Proba-V satellite . When computing the loss function presented in Sec. III-D, we can therefore set . Besides, since the Proba-V dataset also provides binary mask that marks with one reliable pixel and with 0 unreliable (e.g., concealed by clouds) ones, we adapt the loss function to use this information to refine the training process. During the loss computation, we want pixels marked as unreliable in the target binary mask not to influence the loss computation. Practically, we can simply multiply the cropped super-resolved image , and the HR patch by the correspondent cropped mask and average all the quantities over the number of clear pixels. The bias computation is therefore adapted from Eq. 9 as:
where represents the norm of a matrix, i.e. the sum of its absolute values. In the same way, the loss is adapted from Eq. 10 as:
To train the model, we extract from each LR image 16 patches with a size of pixels and the corresponding HR and masks patches with a size of . We further check every single patch and remove those that have a target mask with less than 0.85 clearance. The total number of training data points obtained is 41678 for RED and 40173 for NIR. During the training process, we further perform data augmentation with random rotations of , and and random horizontal flips.
We set the batch size to . Therefore, during training, we have an input tensor with shape and an output tensor with shape . We optimize the loss function with Adam algorithm  with default parameters , and . We set an initial learning rate and we reduce it with a linear decay down to
. We train two different networks for RED and NIR spectral bands on a workstation with an Nvidia RTX 2080Ti GPU with 11GB of memory and 64GB of DDR4 SDRAM. We use the TensorFlow 2.0 framework with CUDA 10. In total, we train the models for 100 training epochs for about 16 hours.
Iv-D Quantitative results
To evaluate the obtained results, we need to use a slightly modified version of PSNR and SSIM  criteria to take into consideration all the aspects we considered in the previous section to obtain a proper loss function. Martens et al.  propose a corrected version of the PSNR, called cPSNS, that is obtained from a corrected mean squared error (cMSE). The computation of the cMSE is performed in the same way as we did for the loss in Eq. 12: it is the minimum MSE between and the HR patches :
where represents the mean squared error computed only on pixels marked as clear in the binary mask . Again, we can simply multiply the matrices by the mask to make unreliable pixels irrelevant:
where represents the Frobenius () norm of a matrix, i.e. the square root of the sum of its squared values. We can then compute the cPSNR as:
where is the maximum pixel intensity for an image encoded on 16 bits.
In the same way, we can define a corrected version of the SSIM metric: cSSIM is the maximum SSIM between and the HR patches , again multiplied for the mask .
Iv-D1 Temporal self-ensemble (RAMS+)
As in Sec. IV-B, during the training process images are augmented with random permutation in the temporal axis. For this reason, it is possible to maximize the performance of the model, by adopting a self-ensemble mechanism during inference, similarly to what done in previous super-resolution works [38, 53, 63]. For each input scene, we consider a certain number of random permutations on the temporal axis and we denote as the resulting set of inputs. The output of the inference process is therefore the average of the predictions on the whole set. We call this methodology RAMS+, where is the number of random permutations performed:
Fig. 4 shows cPSNR results on the testing dataset for a different number of permutated predictions. The trend clearly shows how increasing results in better performance on both the spectral bands, with an effect that tends to saturate for . For the following evaluation, we select to present the results for RAMS+. It is worth noting that, even if this method allows to increase the performance of the network sharply, inference time grows linearly with , with RAMS+ taking roughly 20 times as long as RAMS. Another aspect to highlight is that the permutations are performed randomly, so different results can be achieved even with the same value of .
Iv-D2 Comparison with state-of-the-art methods
Tab. I shows the comparison of cPSNR and cSSIM metrics with several methods on the validation set. We consider as the baseline the bicubic interpolation of the best image of the scene selected considering the clearance, i.e., the number of clear pixels as marked by the binary masks.
IBP and BTV methods are tested with the same methodology presented in Molini et al. . They achieve slightly better results than the baseline with both the metrics.
RCAN  is currently one of the Single-image Super-resolution state-of-the-art networks. We trained from scratch two models, one for each spectral band, setting and , as the number of residual groups and residual channel attention blocks respectively, for a total of about 2 million parameters. We train the two models from scratch on the Proba-V dataset, selecting the best image per scene as input. RCAN shows better performance with respect to classical methods but is far beyond the other MISR networks, showing how the additional information coming from both spatial and temporal correlations is vital to boost the super-resolution process.
VSR-DUF has been developed to upsample video signals using a temporal window of several frames. We train two models from scratch, one for each spectral bands, using 9 LR images as input as in our methodology. The authors consider three different architectures depending on the number of convolutional layers and find better results, increasing the depth of the model. We select the baseline 16 layers deep architecture, that already has more than double parameters with respect to RAMS, with the same number of input images.
HighRes-net algorithm got the second place in the Proba-V challenge and featured a single network for both spectral bands that recursively reduce the temporal dimension to fuse the input LR images. We train the model on our training dataset with default architectures. Since the authors designed the architecture to have an input temporal dimension multiple of 2, we set it to 16, as it is closest to 9.
DeepSUM  is the algorithm winner of the original Proba-V challenge, and the authors have further developed it with DeepSUM++. We train our RAMS on the same training dataset as these two works.
Results clearly show how the proposed methodology can obtain the best results with the two metrics on both the spectral bands and thus represents the current state-of-the-art for Multi-image Super-resolution for remote sensing applications. Using temporal self-ensemble, RAMS+ is able to achieve even higher performance. We show the value for RAMS+, setting as the size of the ensemble, which is the value at which we experimentally find that the resulting gain starts to saturate. However, further increasing the ensemble size can result in even better performance, though at the cost of a significant inference speed drop.
It is worth mentioning that our methodology reaches a result of 0.9336790819983855 on the test set of the Proba-V challenge as provided by the official site and places at the top of the leaderboard available after the end of the official challenge555https://kelvins.esa.int/proba-v-super-resolution-post-mortem/leaderboard.. This score is computed as the mean ratio between the cPSNR values of the challenge baseline on each testing scene, and the correspondent submitted cPSNR for both the spectral bands. This result has been obtained by retraining the two networks with both training and validation datasets together.
Fig. 5 shows a direct comparison between the cPSNR results of RAMS and the bicubic interpolation baseline and RCAN (SISR state-of-the-art). Each cross represents a scene of the validation dataset of the corresponding spectral band. The graphs on the left show how our method strongly beats the bicubic upsampling on almost all the scenes, 98% for RED and 91% for NIR. That is coherent with a general worse behavior of all the methods on the NIR images, probably due to an intrinsic worse information quality of the NIR dataset. The graphs on the right show, on the other hand, the enormous potential of MISR with respect to SISR methods. It can be observed how again RAMS outperforms RCAN an almost all the scenes, with results only slightly worse than to bicubic interpolation, 92% for RED, and 91% for NIR. That is reasonable since RCAN results are someway in the middle between bicubic and RAMS.
Iv-D3 Importance of the residual temporal attention branch
As a final analysis, we perform an ablation study to demonstrate the importance of the global residual branch that implements a temporal attention mechanism. We train two alternative networks, one for each spectral band, that have the same architecture of RAMS, except that we delete the residual temporal attention (RTA) branch. These reduced networks are trained from scratch independently from the complete ones. Tab. II shows a significant drop in the results obtained without the global residual branch and demonstrates the importance of selecting the best temporal views to ease the super-resolution process of the main branch. We find this difference particularly relevant for the RED band, since the training repeatedly failed without the RTA branch, with a diverging behavior after some epochs. The result reported in the table is computed with the last valuable parameters before the divergence starts.
|without RTA||with RTA|
Iv-E Qualitative results
A visual comparison between some of the methods taken in the exam is shown in Fig. 6 and 7 for a RED and NIR image respectively. We provide a zoomed patch of the best LR input image of the scene, its bicubic interpolation, and the inference output of RCAN, VSR-DUF, DeepSUM, RAMS and RAMS+, together with the target HR ground truth. cPSNR and cSSIM scores for the image under analysis are also provided. From this comparison, MISR methods clearly show a better performance with respect to bicubic and SISR (RCAN). However, it is not trivial to understand which method is the better among MISR algorithms with a visual inspection of the results, only. As found by Ledig et al. , the task of achieving pleasant-looking results is a different optimization problem from maximizing the fidelity of the reconstructed information. Therefore, results with high content-related metrics as PSNR and SSIM frequently appear less photo-realistic to a human eye. However, in the context of remote sensing, the fidelity of the pixels content is vital to ensure that the super-resolved image are meaningful, thus the quality of results should be inferred by using content-related metrics, rather than by visual inspection.
In this paper, we proposed a novel representation learning model to super-resolve remotely sensed multi-temporal LR images by exploiting concurrently spatial and temporal correlations. We introduced a feature and temporal attention mechanisms with 3D convolutions that, coupled with nestled residual connections, let the network focus on high-frequency components, flow redundant low-frequency information and transcend the local region of convolutional operations. Extensive experiments on the open-source Proba-V MISR dataset, either with single image and multi-image SR methodologies, demonstrated the effectiveness of our proposed methodology. In both NIR and RED spectral bands, our efficient and straightforward solution achieved considerably better results than other literature methodologies obtaining 48.51 dB and 50.44 dB of cPSNR, respectively for the two channels. That is further proved by the score of the official post-mortem Prova-V challenge where RAMS claimed the first place in the leaderboard. Future work may investigate the performance of the RAMS architecture on hyperspectral remote sensing imaging.
Conceptualization, V.M., M.C.; Methodology, V.M.; Software, F.S. and V.M.; Validation, F.S. and V.M.; Data curation, F.S. and V.M.; Writing-original draft, F.S., V.M., and A.K.; Writing-review and editing, F.S., V.M., A.K. and M.C.; Project administration, V.M. and M.C.; Funding acquisition, M.C. All authors have read and agreed to the published version of the manuscript.
-  (2010) Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 898–916. Cited by: §II-C.
-  (2018) Evaluating super-resolution reconstruction of satellite images. Acta Astronautica 153, pp. 15–25. Cited by: §I.
-  (1998) Super-resolution from image sequences-a review. In 1998 Midwest Symposium on Circuits and Systems (Cat. No. 98CB36268), pp. 374–378. Cited by: §II-A.
Real-time video super-resolution with spatio-temporal networks and motion compensation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4778–4787. Cited by: §II-C.
-  (2019) Bidirectional convolutional lstm neural network for remote sensing image super-resolution. Remote Sensing 11 (20), pp. 2333. Cited by: §II-B.
-  (2019) Second-order attention network for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11065–11074. Cited by: §I, §III.
-  (2020) HighRes-net: recursive fusion for multi-frame super-resolution of satellite imagery. arXiv preprint arXiv:2002.06460. Cited by: §II-C, §III-A, §IV-D2, TABLE I.
-  (2014) Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pp. 184–199. Cited by: §II-A.
-  (2015) Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38 (2), pp. 295–307. Cited by: §I, §II-A, §III-D.
-  (2016) Accelerating the super-resolution convolutional neural network. In European conference on computer vision, pp. 391–407. Cited by: §II-A, §III.
-  (2019) Transferred multi-perception attention networks for remote sensing image super-resolution. Remote Sensing 11 (23), pp. 2857. Cited by: §II-B.
-  (2001) A fast super-resolution reconstruction algorithm for pure translational motion and common space-invariant blur. IEEE Transactions on image Processing 10 (8), pp. 1187–1193. Cited by: §II-C.
-  (2004) Fast and robust multiframe super resolution. IEEE transactions on image processing 13 (10), pp. 1327–1344. Cited by: §II-C, §IV-D2, TABLE I.
-  (2019) Fast super-resolution of 20 m sentinel-2 bands using convolutional neural networks. Remote Sensing 11 (22), pp. 2635. Cited by: §II-B.
-  (2019) Deep residual squeeze and excitation network for remote sensing image super-resolution. Remote Sensing 11 (15), pp. 1817. Cited by: §II-B.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §II-A.
-  (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §III-B1, §III-B1.
-  (1993) Motion analysis for image enhancement: resolution, occlusion, and transparency. J. Visual Communication and Image Representation 4 (4), pp. 324–335. Cited by: §II-A.
-  (1990) Super resolution from image sequences. In  Proceedings. 10th International Conference on Pattern Recognition, Vol. 2, pp. 115–120. Cited by: §II-A.
-  (1991) Improving resolution by image registration. CVGIP: Graphical models and image processing 53 (3), pp. 231–239. Cited by: §II-A, §II-C, §IV-D2, TABLE I.
-  (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3224–3232. Cited by: §II-C, Fig. 6, Fig. 7, §IV-D2, TABLE I.
-  (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §II-A, §III-D.
-  (2016) Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2 (2), pp. 109–122. Cited by: §II-C.
-  (2017) Double sparsity for multi-frame super resolution. Neurocomputing 240, pp. 115–126. Cited by: §II-C.
Evolving imaging model for super-resolution reconstruction.
Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 284–285. Cited by: §II-C.
-  (2019) Deep learning for multiple-image super-resolution. IEEE Geoscience and Remote Sensing Letters. Cited by: §II-C.
-  (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1646–1654. Cited by: §II-A.
-  (2016) Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1637–1645. Cited by: §II-A.
-  (1990) Recursive reconstruction of high resolution image from noisy undersampled multiframes. IEEE Transactions on Acoustics, Speech, and Signal Processing 38 (6), pp. 1013–1027. Cited by: §II-A, §III-D.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-C.
-  (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632. Cited by: §III-D.
-  (2018) Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE transactions on pattern analysis and machine intelligence 41 (11), pp. 2599–2613. Cited by: §III-D.
-  (2018) Super-resolution of sentinel-2 images: learning a globally applicable deep neural network. ISPRS Journal of Photogrammetry and Remote Sensing 146, pp. 305–319. Cited by: §II-B.
-  (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §II-A, §II-B, §III-D, §IV-E.
-  (2017) Super-resolution for remote sensing images via local–global combined network. IEEE Geoscience and Remote Sensing Letters 14 (8), pp. 1243–1247. Cited by: §II-B.
-  (2002) High resolution image formation from low resolution frames using delaunay triangulation. IEEE Transactions on Image Processing 11 (12), pp. 1427–1441. Cited by: §II-C.
-  (2016) Single-image super resolution for multispectral remote sensing data using convolutional neural networks. ISPRS-International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 41, pp. 883–890. Cited by: §II-B.
-  (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §I, §II-A, §III-D, §III-D, §III, §IV-D1.
-  (2019) Super-resolution of remote sensing images via a dense residual generative adversarial network. Remote Sensing 11 (21), pp. 2578. Cited by: §II-B.
-  (2019) Super-resolution of proba-v images using convolutional neural networks. Astrodynamics 3 (4), pp. 387–402. Cited by: §III-D, §IV-A, §IV-C, §IV-D.
-  (2019) DeepSUM: deep neural network for super-resolution of unregistered multitemporal images. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §II-C, §III-D, Fig. 6, Fig. 7, §IV-A, §IV-D2, §IV-D2, TABLE I.
-  (2020) DeepSUM++: non-local deep neural network for super-resolution of unregistered multitemporal images. arXiv preprint arXiv:2001.06342. Cited by: §IV-D2, TABLE I.
Rectified linear units improve restricted boltzmann machines.
Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §III-B1.
-  (2011) Masked object registration in the fourier domain. IEEE Transactions on image processing 21 (5), pp. 2706–2718. Cited by: §IV-B.
-  (2017) Enhancenet: single image super-resolution through automated texture synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4491–4500. Cited by: §II-A.
-  (1996) Extraction of high-resolution frames from video sequences. IEEE transactions on image processing 5 (6), pp. 996–1011. Cited by: §II-C.
-  (2009) Super-resolution reconstruction algorithm to modis remote sensing images. The Computer Journal 52 (1), pp. 90–100. Cited by: §II-C.
-  (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883. Cited by: §III-A.
-  (1989) High-resolution image recovery from image-plane arrays, using convex projections. JOSA A 6 (11), pp. 1715–1726. Cited by: §II-C.
-  (2017) Memnet: a persistent memory network for image restoration. In Proceedings of the IEEE international conference on computer vision, pp. 4539–4547. Cited by: §II-A, §III-D.
-  (2017) Image super-resolution via deep recursive residual network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3147–3155. Cited by: §II-A.
-  (2007) Kernel regression for image processing and reconstruction. IEEE Transactions on image processing 16 (2), pp. 349–366. Cited by: §II-C.
-  (2016) Seven ways to improve example-based single image super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1865–1873. Cited by: §IV-D1.
-  (1984) Multiframe image restoration and registration. Advance Computer Visual and Image Processing 1, pp. 317–339. Cited by: §II-A, §II-C.
-  (2016) Universal encoding of multispectral images. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4453–4457. Cited by: §I.
-  (2014) A novel rate control algorithm for onboard predictive coding of multispectral and hyperspectral images. IEEE Transactions on Geoscience and Remote Sensing 52 (10), pp. 6341–6355. Cited by: §I.
-  (2017) A learned representation of artist-specific colourisation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 2907–2915. Cited by: §III-A.
-  (2018) Esrgan: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §II-B.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §IV-D.
-  (2020) Sentinel-2 sharpening via parallel residual network. Remote Sensing 12 (2), pp. 279. Cited by: §II-B.
-  (2018) Wide activation for efficient and accurate image super-resolution. arXiv preprint arXiv:1808.08718. Cited by: §I, §II-A, §III-D, §III.
-  (2016) Image super-resolution: the techniques, applications, and future. Signal Processing 128, pp. 389–408. Cited by: §II-C.
-  (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 286–301. Cited by: §I, §II-A, §III-D, §III, Fig. 6, Fig. 7, §IV-D1, §IV-D2, TABLE I.
-  (2016) Loss functions for image restoration with neural networks. IEEE Transactions on computational imaging 3 (1), pp. 47–57. Cited by: §III-D.