Multi-Spectral Multi-Image Super-Resolution of Sentinel-2 with Radiometric Consistency Losses and Its Effect on Building Delineation

by Muhammed Razzak, et al.
University of Oxford

High-resolution remote sensing imagery is used in a broad range of tasks, including the detection and classification of objects. High-resolution imagery is, however, expensive, while lower-resolution imagery is often freely available and can be used by the public for a range of social good applications. To that end, we curate a multi-spectral multi-image super-resolution dataset, using PlanetScope imagery from the SpaceNet 7 challenge as the high-resolution reference and multiple Sentinel-2 revisits of the same scenes as the low-resolution imagery. We present the first results of applying multi-image super-resolution (MISR) to multi-spectral remote sensing imagery. Additionally, we introduce a radiometric consistency module into the MISR model to preserve the high radiometric resolution of the Sentinel-2 sensor. We show that MISR is superior to single-image super-resolution and other baselines on a range of image fidelity metrics. Furthermore, we conduct the first assessment of the utility of multi-image super-resolution for building delineation, showing that utilising multiple images results in better performance on these downstream tasks.




1 Introduction

Generative deep learning has sparked a new wave of super-resolution (SR) algorithms that enhance the spatial resolution of images with impressive aesthetic results wang2020deep. Although the perceptual quality of those images is high, it is well known that some of these SR models introduce artefacts into the SR image that are not present in real images bhadra2020hallucinations. In addition, most of these models do not enforce physically based consistency between the super-resolved image and its low-resolution counterpart bahat2020explorable. This limits the applicability of SR models in domains such as remote sensing, where safety and consistency are critical, e.g. for scientific instrumentation and decision making.

Super-resolution models are divided into single-image super-resolution (SISR) and multi-image super-resolution (MISR), also known as multi-frame or multi-temporal super-resolution. The former uses only one low-resolution image as input, while the latter takes several low-resolution images of the same scene. MISR seeks to further constrain the ill-posed problem of SR by conditioning on several low-res input images (also known as revisits). MISR is therefore expected to produce better SR images, to be more robust and to produce fewer artifacts than SISR. In addition, MISR can be naturally applied in Earth observation, since satellites often revisit an area of interest frequently. These multiple revisits can be fused with MISR to produce a super-resolved image. Despite its clear applicability, MISR has scarcely been applied in remote sensing and, as of yet, there are no studies that quantitatively compare MISR and SISR. Thus far, MISR for remote sensing has only been demonstrated on the RED and NIR bands of PROBA-V, a tiny fraction of the Sentinel-2 operating spectrum deudon_highres-net_2020. In addition, the practical utility of these super-resolved images for downstream tasks has been largely unexplored in real-world applications shermeyer2019effects.

Figure 1: Different co-aligned retrievals from PlanetScope and Sentinel-2 from the SpaceNet 7 dataset. First row: PlanetScope RGB. Second and third rows: Sentinel-2 RGB revisits.

In this work, we apply MISR to the multi-spectral multi-temporal satellite imagery from the European Space Agency’s Copernicus Sentinel-2 (S2) archive, and study the downstream utility of the super-resolved images. In particular, we train super-resolution models (both SISR and MISR) on the RGB bands of Sentinel-2 images, using co-registered RGB PlanetScope images as reference, with all imagery taken from within the same two-month period. This setting differs from the vast majority of previous remote sensing applications of SR, where low-res images are obtained by artificially downsampling the high-res counterpart shermeyer2019effects. The main benefit of our setting is that the trained model can be applied to new S2 RGB images to enhance their nominal resolution to that of the PlanetScope reference, i.e. it provides out-of-sample SR results without requiring simultaneous and co-registered VHR images. We compare the MISR model with SISR models on an independent test set, showing better qualitative and quantitative performance in terms of PSNR (Peak Signal to Noise Ratio) and SSIM (Structural Similarity Index Measure), see sec. 6.1. In addition, we demonstrate that this resolution enhancement produces significant gains in tasks such as building delineation (sec. 4).

One of the drawbacks of training super-resolution models when the low-res and high-res images come from different sensors is the difference in the spectral characteristics of the sensing instruments and in the calibration of their output images. This poses the additional challenge of disentangling the SR task from the cross-instrument calibration task. Figure 1 shows several co-located Sentinel-2 and PlanetScope images (for each PlanetScope image, we show four S2 revisits within the same two-month period). The colours of these images differ: in particular, Sentinel-2 images are brighter, have higher contrast, and their colours seem better defined. This is because Sentinel-2 images benefit from a higher radiometric and spectral resolution than PlanetScope, and the two products undergo different atmospheric correction procedures to recover surface reflectance. In order to produce super-resolved images while preserving the spectra of the low-res Sentinel-2 imagery, we add a radiometric consistency module to the super-resolution model, which helps to maintain the radiometric consistency of a super-resolved image with its low-res counterpart.

The contributions of our work are summarized as follows:

  1. We curate a new multi-temporal dataset from many revisits of Sentinel-2 imagery co-located with PlanetScope imagery, originally sourced from the SpaceNet-7 competition van_etten_multi-temporal_2021, which includes a geo-diverse set of scenes from around the globe.

  2. We demonstrate, for the first time, the multi-image super-resolution of RGB satellite imagery.

  3. We show that MISR is better at super-resolving Sentinel-2 RGB images than SISR, both quantitatively in terms of the image fidelity metrics (PSNR and SSIM) and qualitatively.

  4. We demonstrate the downstream utility of super resolution, through the task of building semantic segmentation and instance segmentation.

  5. Finally, we propose a radiometric consistency module which can be added to any vanilla super-resolution model. We show that this module helps to maintain the radiometric consistency of Sentinel-2 while enhancing its spatial resolution, and we show several instances of good and bad consistency.

2 Materials and Methods

2.1 Datasets

In order to learn a super-resolution model that improves the spatial resolution of S2, we need higher-resolution images to use as a reference. Since VHR (Very High Resolution) images are not free, we restricted our search to previously released, publicly available datasets of high-resolution images. Among those, we chose the recently launched Multi-temporal urban development SpaceNet dataset of PlanetScope images (also known as SpaceNet-7, see sec. 2.1.1) van_etten_multi-temporal_2021. For this dataset, we acquired co-located time series of Sentinel-2 images for each PlanetScope acquisition (sec. 2.1.2). Subsection 2.1.3 gives a brief analysis of the S2-Planet dataset, and subsection 2.1.4 details the train-test splits that we used for the results.

2.1.1 PlanetScope SpaceNet-7 dataset

Figure 2: Location of SpaceNet-7 image time series. Figure taken from van_etten_multi-temporal_2021.

SpaceNet-7 has monthly time series of PlanetScope images over a two-year period for approximately 100 different areas of interest (AOI) all over the world (see figure 2). Images are provided at their nominal resolution (the resolution reported in the GeoTIFF metadata) with only three spectral channels (RGB). In addition, each of those images has manually derived polygon labels of building footprints. The challenge accompanying the release of this dataset consisted of delineating those buildings and identifying when new buildings were constructed in those areas. Figure 3 shows some PlanetScope acquisitions with their corresponding building annotations over different locations. We can see that delineating the buildings in some of these scenes is quite challenging even at the resolution of PlanetScope. In this study we restricted ourselves to images from December 2019 and January 2020 from the training set (building footprints in the test set have not been released yet). In total there are 46 different PlanetScope scenes for each month.

Figure 3: PlanetScope acquisitions (first row) and their corresponding building polygon masks (second row) from the SpaceNet-7 dataset.

2.1.2 Sentinel-2 acquisitions

The Sentinel-2 mission consists of two satellites carrying the same multi-spectral optical sensor, which acquires images in 13 different bands of the electromagnetic spectrum, from the visible to the short-wave infrared. The nominal spatial resolution of those images is different for each set of bands: 4 bands (visible and near infra-red) have 10 m resolution; 6 bands in the very near infrared and short-wave infrared have 20 m resolution; the remaining 3 bands are used mainly for atmospheric correction and have a spatial resolution of 60 m. Level 2A Sentinel-2 products, often referred to as analysis ready data, consist of atmospherically corrected, ortho-corrected 12-band images with bottom-of-atmosphere (BOA) calibrated reflectances. Providing good BOA calibrated images is one of the main goals of the Sentinel-2 mission, since accurate radiometric and spectral calibration has a large impact on ocean (see e.g. ruescas_machine_2018) and vegetation products (e.g. wolanin_estimating_2019; svendsen_joint_2018). In section 5, we propose an SR model that seeks to maintain the spectral calibration of Sentinel-2.

Sentinel-2 images were downloaded from the ESA Open Access Hub. In order to obtain co-aligned time series of Sentinel-2 and PlanetScope images, we developed a custom pipeline which consists of the following steps:

  1. Download all Sentinel-2 level 2A products overlapping with each of the 46 PlanetScope scenes over December 2019 and January 2020.

  2. Crop all Sentinel-2 images to the PlanetScope scene bounds.

  3. Reproject all bands of S2 to the coordinate reference system of PlanetScope products at spatial resolution.

  4. When more than one S2 product was found for the same date and scene, we mosaiced those images.

Figure 1 shows different Planet (top) and Sentinel-2 (bottom) retrievals. As we can see, some of the Sentinel-2 images contain clouds. Sentinel-2 products have additional quality assessment bands that include cloud probabilities for each pixel. In this work, we used the SCL (scene classification) band to assess the presence of clouds; in particular, we used the value 9, which encodes "clouds high probability". This cloud indicator is used to inform the fusion model when merging different Sentinel-2 revisits.
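The cloud indicator described above can be turned into a per-revisit cloud fraction, which in turn gives a simple way to rank revisits from clearest to cloudiest. The following is a minimal sketch, assuming each revisit comes with a scene classification array of integer class values; the function names and the toy arrays are illustrative and not taken from the authors' pipeline.

```python
import numpy as np

# Class value quoted in the text for "clouds high probability".
SCL_CLOUD_HIGH_PROBABILITY = 9


def cloud_fraction(scl):
    """Fraction of pixels flagged as high-probability cloud in one revisit."""
    scl = np.asarray(scl)
    return float(np.mean(scl == SCL_CLOUD_HIGH_PROBABILITY))


def rank_revisits_by_clearness(scl_stack):
    """Return revisit indices ordered from least to most cloudy."""
    fractions = [cloud_fraction(scl) for scl in scl_stack]
    return sorted(range(len(fractions)), key=lambda i: fractions[i])


# Three toy 2x2 revisits: half cloudy, fully clear, fully cloudy.
scl_stack = [
    [[9, 9], [4, 4]],
    [[4, 5], [4, 4]],
    [[9, 9], [9, 9]],
]
ranking = rank_revisits_by_clearness(scl_stack)
```

The first index of such a ranking would be a natural choice when a single clearest revisit is needed, and the first few indices when several clear revisits are needed.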

Figure 4: We utilised a within-scene split, which allocates the top 80% of a scene as a source of training patches; the bottom-left and right 10% for validation and testing patches respectively.

2.1.3 Dataset Analysis and training splits

The images come from 46 locations, with a geodiverse set of features including vegetation, bare earth (flats, hills, ridges), desert, urban, and agricultural infrastructure (see Appendix Table 5). The number of revisits between December 2019 and January 2020 ranges from 5 to 13, but the percentage of usable revisits (by cloud coverage) ranges from 23% to 100%. Finally, we use a fixed partitioning of the scenes into training, validation and test datasets.

2.1.4 Training, validation and test sets splits

A well-designed split of the data into training and testing sets is critical to demonstrate the capacity of machine learning models to generalize. In remote sensing scenarios, extra care must be taken to avoid train-test leakage due to spatial correlation. For instance, ploton_spatial_2020 recently showed that a lack of consideration for spatial correlation leads to over-inflated results for ML models that monitor forest biomass. Our approach splits each scene into patches, avoiding spatial overlap between patches in the different subsets; with this approach we seek to explore the performance of the models in ideal conditions, when training and testing patches come from similar distributions. Figure 4 shows the dataset partition for one scene.
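As an illustration of the within-scene split, the sketch below assigns a square patch to a subset from its position in the scene. It assumes one reading of Figure 4: the top 80% of rows is the training region, and the bottom 20% of rows is halved into a left (validation) and a right (test) region. Patches that straddle a boundary are discarded so that no patch overlaps two subsets. The function name and arguments are hypothetical.

```python
def assign_split(row, col, patch_size, scene_height, scene_width):
    """Assign a patch with top-left corner (row, col) to a subset.

    Returns "train", "val", "test", or None when the patch straddles a
    region boundary (such patches are dropped to avoid spatial overlap).
    """
    train_end = int(0.8 * scene_height)  # first row of the bottom 20%
    half_width = scene_width // 2        # boundary between val and test
    bottom = row + patch_size
    right = col + patch_size

    if bottom <= train_end:
        return "train"
    if row >= train_end:
        if right <= half_width:
            return "val"
        if col >= half_width:
            return "test"
    return None
```

For a 100 x 100 scene and 10 x 10 patches, a patch at (0, 0) falls in the training region, one at (80, 0) in validation, one at (90, 50) in test, and one at (75, 0) is dropped because it crosses the training boundary.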

2.2 Metrics

In this work, we look at two broad tasks: super-resolution and building delineation.

For super-resolution, the primary quantitative metrics of performance are the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) wang_image_2004 between the super-resolved image and the reference high-resolution image. Both measure the fidelity of the images compared, with SSIM more focused on the structures contained in the images. Mathematically, PSNR is computed as follows:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}(I_{HR}, I_{SR})}\right)$$

where $\mathrm{MAX}$ is the maximum pixel value of the images (e.g. for 8-bit images this is 255) and $\mathrm{MSE}$ is the mean square error between the high-resolution image ($I_{HR}$) and the super-resolved image ($I_{SR}$). For the formula of the SSIM the reader is referred to the original work of wang_image_2004.
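For concreteness, the standard PSNR definition (ten times the base-10 logarithm of the squared maximum pixel value over the mean square error) can be sketched in a few lines of NumPy. The function name and the default of 255 for 8-bit imagery are illustrative choices, not the authors' evaluation code.

```python
import numpy as np


def psnr(hr, sr, max_value=255.0):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    hr = np.asarray(hr, dtype=np.float64)
    sr = np.asarray(sr, dtype=np.float64)
    mse = np.mean((hr - sr) ** 2)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because PSNR is logarithmic in the MSE, reducing the error by a factor of ten raises the score by 10 dB, which helps when reading differences of a few dB between methods.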

3 Multi-spectral multi-image super-resolution

In the context of remote sensing, our work is the first to tackle the MISR problem in multiband images (RGB bands). Although the ultimate goal is multi-spectral MISR in all 13 bands of Sentinel-2, such a task is predicated on the coverage of the same bands by the high-resolution instrument, at least in the proposed supervised learning setting. Unfortunately, most VHR (very high resolution) products available are limited to R,G,B, and NIR (near-infrared) bands.

In traditional on-the-ground imaging, MISR has seen several successful applications on colour photos. Most notably, wronski2019handheld achieved real-time on-board multi-image super-resolution in handheld Google Pixel cameras, by leveraging the user’s hand motion jitter during a burst-frame shot. With intimate knowledge of the handheld camera’s specifications (physical model), and no feature learning involved, the authors could anticipate the amount of aliasing and phase shifting that images undergo after optical zoom-out and motion jitter respectively. The fusion of burst frames is ultimately a convex optimization problem (ADMM, see boyd2011distributed). More recently, a deep learning approach to burst-frame MISR was proposed in bhat2021deep, which is agnostic to camera specs.

3.1 MISR in Earth Observation

In handheld camera imaging, MISR is always learned with high-res reference images that come from the same sensor. In Earth Observation, however, the fact that any one instrument orbits at a fixed altitude rules out obtaining simultaneous low-res/high-res image pairs from the same sensor. Hence, a different, higher-resolution instrument is needed if MISR is to be learned in a supervised way. So unless the low-res and high-res pairs are acquired with different lenses on board the same satellite (e.g. PROBA-V, a unique example; see mrtens2019superresolution), any ML-based MISR model must also learn to calibrate its output to the spectra and noise of the high-res instrument. In addition, in remote sensing the multiple images are spaced over days and weeks, rather than over a few seconds as with handheld imagery.

Recently, molini_deepsum_2020 and deudon_highres-net_2020 tackled the MISR problem in Earth Observation, in single-band imagery, and on the rather controlled use-case of PROBA-V that was the topic of ESA’s MISR competition ending Q2-2019. In particular, deudon_highres-net_2020 was the first approach to tackle the different problems in MISR (input co-registration, fusion, and registration-at-the-loss) in an end-to-end manner, and with a small memory footprint (due to its reused fusion operator in the low-res domain). Since then, several deep learning approaches with refined architectures have repeatedly beaten the state-of-the-art in the PROBA-V "post-mortem" leaderboard; most notably salvetti2020multi.

3.1.1 HighRes-net

This work applies the modified HighRes-net architecture of deudon_highres-net_2020 to the Sentinel-2 and PlanetScope RGB images described in sections 2.1.2 and 2.1.1. For a complete description of HighRes-net we refer the reader to the original paper deudon_highres-net_2020. Table 6 is a modular breakdown of the HighRes-net architecture. The hallmark feature of HighRes-net is its shared fusion operator, which can be applied pairwise and recursively to an arbitrary set of encodings of low-res revisits. This technique can be easily parallelized on a GPU through careful use of the torch.Tensor.view operator, which treats the different encoding pairs as instances of a mini-batch.
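The recursive pairwise fusion can be illustrated with a short NumPy sketch. It makes two simplifying assumptions that are not from the paper: the number of revisits is a power of two, and the learned fusion operator is replaced by a plain mean (`fuse_pair` is a hypothetical stand-in). The point is only the control flow, in which the stack is halved at every step and all pairs are processed in one vectorised call, much like instances of a mini-batch.

```python
import numpy as np


def fuse_pair(a, b):
    """Stand-in for the shared, learned fusion operator (here: a plain mean)."""
    return 0.5 * (a + b)


def recursive_fusion(encodings):
    """Fuse a stack of shape (T, C, H, W) down to a single (C, H, W) encoding.

    Assumes T is a power of two; the same operator is reused at every level.
    """
    encodings = np.asarray(encodings, dtype=np.float64)
    t = encodings.shape[0]
    assert t > 0 and t & (t - 1) == 0, "sketch assumes a power-of-two number of revisits"
    while encodings.shape[0] > 1:
        half = encodings.shape[0] // 2
        # The two halves form `half` pairs that are fused in a single call.
        encodings = fuse_pair(encodings[:half], encodings[half:])
    return encodings[0]
```

With T revisits the loop runs log2(T) times, and because the operator is shared across levels the parameter count does not grow with the number of revisits.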

The next step, ShiftNet (a reduction of HomographyNet from 8 parameters for homographies to 2 for translations, see detone2016deep), is not intrinsic to HighRes-net but an ancillary learned registration operator that accounts for the inevitable misalignments that are explained by shifts.

3.2 Single-Image Super-Resolution

Single-image super-resolution (SISR) has progressed significantly with the advent of deep learning. Initial approaches to super-resolution with deep architectures were CNN-based: first SRCNN dong_image_2015, followed by FSRCNN dong_accelerating_2016, which improved the speed of SRCNN along with a minor gain in performance. These approaches used some form of upsampling (transposed convolutions or bicubic upsampling) interspersed with convolutional layers to produce the super-resolved imagery. The objective function used to train these networks is the mean square error between the high-resolution ground truth image and the super-resolved output image.

The work of ledig_photo-realistic_2017 provided a significant step forward in terms of photo-realistic and perceptually pleasing super-resolution. They introduced a far better CNN-based super-resolution network called Super-Resolution Residual Network (SRResNet). SRResNet was deeper than prior works and included residual blocks. Additionally, instead of transposed convolutions or bicubic upsampling, SRResNet made use of pixel shuffling to upsample the imagery. However, the major contribution of ledig_photo-realistic_2017 was the introduction of a generative adversarial network for super-resolution (SRGAN) that achieved perceptually pleasing results. SRGAN uses SRResNet as the generator and a simple CNN as the discriminator. In ledig_photo-realistic_2017, the authors note that while SRResNet, trained with an MSE loss, achieves superior performance in terms of PSNR and SSIM to all other methods including SRGAN, SRGAN is able to achieve more perceptually pleasing results (capturing high-frequency content) as evaluated by the Mean Opinion Score (MOS) of a panel of human evaluators.

More recently, there has been further work on improving super-resolution through improved GAN training wang2018esrgan; anwar_deep_2020; wang2020deep, and flow-based approaches lugmayr2020srflow.

3.2.1 SRResNet

While we examined many architectures as a SISR baseline to compare HighRes-Net against, we decided to restrict the comparison to SRResNet for a couple of reasons:

  1. It is amongst the best performing SISR methods in terms of PSNR and SSIM. These are the primary metrics on which we evaluate our results. We are interested in accuracy over perceptually pleasing results, and also had no time or budget to run an evaluation with a panel of humans to obtain an MOS.

  2. The representational capacity of SRResNet is similar to that of HighRes-Net. Both networks rely upon residual blocks to encode the images. SRResNet does, however, make use of pixel shuffling instead of transposed convolution. Pixel shuffling has been shown to upsample better in ledig_photo-realistic_2017.
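To make the pixel shuffling mentioned above concrete: the operation rearranges a feature map with C·r² channels and size H × W into one with C channels and size rH × rW, so that the upsampling itself has no learnable parameters and the preceding convolution does the work. The NumPy sketch below follows the channel ordering used by common deep learning libraries; it is an illustration of the operation, not the SRResNet implementation.

```python
import numpy as np


def pixel_shuffle(x, upscale):
    """Rearrange (C * r * r, H, W) into (C, H * r, W * r).

    Output pixel (c, h * r + i, w * r + j) is taken from input channel
    c * r * r + i * r + j at position (h, w).
    """
    x = np.asarray(x)
    channels, height, width = x.shape
    r = upscale
    assert channels % (r * r) == 0, "channel count must be divisible by r^2"
    c_out = channels // (r * r)
    x = x.reshape(c_out, r, r, height, width)  # axes: (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)             # axes: (c, h, i, w, j)
    return x.reshape(c_out, height * r, width * r)
```

For example, four 1 × 1 channels holding the values 0, 1, 2 and 3 become a single 2 × 2 channel with rows [0, 1] and [2, 3].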

Figure 5: SRResNet model architecture as illustrated in ledig_photo-realistic_2017.

Single-Image Selection

Given the high revisit frequency of Sentinel-2, one has to choose which revisit to super-resolve with a SISR model. One could choose a random revisit; however, we found that choosing the least cloudy revisit gave the best results.

Objective Function

We evaluated three objective functions during training: Mean Square Error (MSE), Mean Absolute Error (MAE) and Structural Similarity Index Measure (SSIM). While MSE obtained superior performance in terms of PSNR, the resulting imagery was far less sharp. Using SSIM as the objective function resulted in excellent PSNR performance, superior SSIM performance, and imagery that was sharper and perceptually pleasing.

SRResNet’s model architecture is illustrated in Figure 5; the rest of the hyper-parameters used to train the network are described in appendix C.1.

4 Utility of MISR for Building Delineation

The interplay between super-resolution techniques and performance on downstream tasks (object detection, instance segmentation, semantic segmentation) remains largely unexplored, particularly in the context of satellite imagery. In shermeyer_effects_2019, the authors performed single-image super-resolution on synthetically downscaled VHR WorldView-3 images and then compared object detection performance on images at different resolutions (real, downsampled and super-resolved). They showed an increase in performance when using super-resolved images instead of artificially downsampled ones. However, this is far from a real-world scenario given the synthetic data used, and it was assessed on only a single task and dataset.

Detecting objects such as buildings from Sentinel-2 or even PlanetScope imagery (with resolutions of 10m and 4m respectively in the visible spectrum) is a difficult task, as these objects often cover only a few pixels. Buildings in urban areas are also densely packed, which makes delineating individual buildings even more difficult. Increasing the spatial resolution should thus lead to better detection and delineation of these objects.

In this study, we apply the previously trained super-resolution models to the downstream task of the SpaceNet dataset: semantic and instance segmentation of buildings. We compare the model's performance on the native (ground-truth) imagery, on super-resolved imagery of the same ground sampling distance, and on low-resolution imagery with bicubic upsampling.

Ours is the first study to assess the utility of MISR, and the first study to assess the utility of SISR or MISR on real-world data, as opposed to synthetically downsampled imagery.

4.1 Downstream models and processing pipeline

The state-of-the-art for building footprint segmentation has converged on a common two-step algorithm: identify instances, then extract polygons. The instance identification step has been approached either as an instance segmentation task, in which instances are directly obtained from an instance segmentation network, or more commonly as a semantic segmentation task, in which instances are obtained from the connected components of the mask predicted by a semantic segmentation network. In both approaches, a polygon is then extracted for each instance by a simple vectorization routine.

We follow the second avenue (semantic segmentation), as this was the approach used by all four winners of the SpaceNet-7 challenge.

4.1.1 Input imagery

As input imagery we use the outputs of the MISR and SISR models, trained as previously described in section 3.1, in addition to a number of baselines.

  1. Bicubic upsampling: The clearest (least cloudy) revisit of Sentinel-2 imagery was bicubically upsampled.

  2. SISR: The clearest revisit of low-resolution Sentinel-2 imagery was fed through a trained SRResNet.

  3. Multi-Image Upsampling: The clearest four (4) revisits of low-resolution Sentinel-2 imagery were concatenated and bicubically upsampled.

  4. MISR: All revisits of the low-resolution Sentinel-2 imagery were fed through HighRes-Net.

  5. PlanetScope: The high-resolution PlanetScope imagery.

4.1.2 Semantic Segmentation

Semantic segmentation is the task of assigning each pixel a semantically meaningful label. In this setting, it is specifically the task of deciding whether a pixel belongs to a building or not. The aim of this evaluation is to compare the same semantic segmentation model across the different inputs listed above, in order to assess the utility of the super-resolved images. For this purpose, we chose HRNet hrnet as the semantic segmentation model, which is trained on each of the input configurations of subsec. 4.1.1. We chose HRNet because it is the network used by the winning solution in the SpaceNet-7 challenge, and it is amongst the state-of-the-art networks as evaluated on a variety of benchmarks hrnet_v2. The architecture of the network is illustrated in figure 6. We list in appendix C.2 the hyper-parameters that we used for training the models; note that we did not perform a comprehensive hyper-parameter sweep.

Figure 6: HRNet Architecture: high-resolution representations for Semantic Segmentation hrnet.

Semantic segmentation metrics

The primary metric of semantic segmentation is Intersection-over-Union (IoU), which is evaluated for every class. The mean IoU over all the classes is the final reported metric. IoU is the area of overlap between the predicted per-pixel segmentation and the ground truth, divided by the area of their union. IoU can be calculated as follows using either set or confusion matrix notation:

$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|} = \frac{TP}{TP + FP + FN}$$

where $P$ is the predicted mask of the buildings and $G$ is the ground-truth mask, and $TP$, $FP$ and $FN$ are the per-pixel true positives, false positives and false negatives.
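The two notations for IoU, the set form (intersection over union of the masks) and the confusion-matrix form (true positives over the sum of true positives, false positives and false negatives), give the same number, as the following NumPy sketch shows. The function names are illustrative, and returning 1.0 when both masks are empty is a convention chosen here, not one stated in the paper.

```python
import numpy as np


def iou_sets(pred, truth):
    """IoU as |P intersect G| / |P union G| on boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, truth).sum() / union)


def iou_confusion(pred, truth):
    """IoU as TP / (TP + FP + FN), counted per pixel."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    denominator = tp + fp + fn
    return 1.0 if denominator == 0 else float(tp / denominator)


def mean_iou(pred, truth):
    """Mean of the building-class IoU and the background-class IoU."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    return 0.5 * (iou_sets(pred, truth) + iou_sets(~pred, ~truth))
```

With only two classes (building and background), the mean IoU is the average of the IoU of the mask and the IoU of its complement.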

4.1.3 Instance Segmentation

In the instance segmentation task, we aim to delineate each building instance. We do this through a simple process of vectorization that derives polygons using the connected components of the output of the semantic segmentation model. This is the standard method of polygon extraction and has been used within the SpaceNet challenge along with industrial uses by the likes of Microsoft for their maps.
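The instance extraction step can be sketched as a connected-component labelling of the binary building mask. The version below uses a breadth-first search in plain Python and assumes 4-connectivity, which is a choice made for this illustration (the connectivity used in the paper is not stated). It returns each instance as a set of pixel coordinates; converting those pixel sets into polygons is left out.

```python
from collections import deque


def connected_components(mask):
    """Split a binary mask (list of rows) into 4-connected pixel sets."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = set()
    instances = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Flood-fill one building instance starting from (r, c).
            component = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                component.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            instances.append(component)
    return instances
```

A consequence of this approach is that touching buildings in the predicted mask merge into a single instance, which is one reason densely packed urban areas are hard to delineate.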

Instance segmentation metrics

The primary metrics for instance segmentation are precision, recall and F1. These metrics are calculated on the produced vector instances (polygons) rather than in the pixels. For a positive detection (true positive), the IoU between the prediction and ground-truth polygons must be greater than a threshold. For this particular dataset, the IoU threshold in the competition was set at 0.25. Consequently, we used that threshold in our evaluation too.
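The instance-level metrics can be sketched as follows, representing each instance as a set of pixel coordinates rather than a vector polygon, which is a simplification for illustration. The matching rule is an assumption: each prediction is greedily matched to the unmatched ground-truth instance with the highest IoU and counted as a true positive if that IoU exceeds the threshold. The official competition scoring code may differ in its details.

```python
def instance_iou(a, b):
    """IoU between two instances given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def instance_scores(predictions, ground_truths, iou_threshold=0.25):
    """Return (precision, recall, F1) under greedy one-to-one matching."""
    unmatched = list(ground_truths)
    true_positives = 0
    for pred in predictions:
        best_index, best_iou = None, 0.0
        for index, truth in enumerate(unmatched):
            iou = instance_iou(pred, truth)
            if iou > best_iou:
                best_index, best_iou = index, iou
        if best_index is not None and best_iou > iou_threshold:
            true_positives += 1
            unmatched.pop(best_index)  # each ground truth is matched once
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A threshold as low as 0.25 tolerates fairly loose outlines, which is reasonable when a building spans only a handful of pixels.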

5 Color matching HighRes-net for Sentinel-2 MISR

The Sentinel-2 level 2A products consist of ortho-corrected images of BOA reflectance (see sec. 2.1.2). There are several land and ocean S2 applications that rely on well-calibrated reflectance to provide meaningful outputs (see for instance ruescas_machine_2018; wolanin_estimating_2019; svendsen_joint_2018). However, the proposed MISR training scheme is optimized to minimize the disagreement between the PlanetScope image and the super-resolved output image. Since PlanetScope images have a different color calibration than S2 (PlanetScope images from the SpaceNet-7 dataset are not BOA corrected), we observed that the SR images look like PlanetScope images in terms of color (see Figure 1 to appreciate the difference in color between S2 and PlanetScope). Hence, by using the proposed MISR model, we lose the radiometric calibration of S2. In this section we propose a modification of the MISR model and training procedure to produce spectrally consistent S2 super-resolved images.

Figure 7: HighRes-net with a consistency loss between a downscaled version of the SR output and the low-res reference frame.

Figure 7 shows a diagram of the modified forward pass with the consistency loss. At the top of the figure is HighRes-Net as proposed in deudon_highres-net_2020-1: this network fuses a set of low-res revisits using the encoder (recursive fusion) and then applies the decoder to the fused feature maps (upsample). However, instead of being compared directly with the high-resolution image to compute the loss, this output (SR in the figure) feeds two losses that are averaged together. For the first loss, which we call the (color/spectral) consistency loss, we downsample the SR image back to the low-res size and compare it with the S2 reference frame. For the second loss, called the super-resolution loss, we apply a pixel-by-pixel fully connected neural network (implemented as 2 layers of 1x1 convolutions) to produce the PlanetScope-like output. We refer to this network as the color matching network. By using this additional network we seek to disentangle MISR (controlled by HighRes-Net) from color matching (learned by the color matching network). We were inspired by the works of tasar_colormapgan_2020 and mateo-garcia_cross-sensor_2020, which use similar ideas in the context of domain adaptation.

5.1 Implementation Details

For the consistency loss, we used the MSE between the downsampled image and the low-res reference; the downsampling step uses adaptive average pooling. The color matching network consists of a 1x1 convolutional layer with 64 output channels, a ReLU activation and a 1x1 conv layer back to 3 output channels. This output is shifted using the ShiftNet network, explained in sec. 3.1.1, and compared with the PlanetScope high-res image using an SSIM loss. Finally, the consistency loss and the super-resolution loss are averaged together using a convex combination of the two.
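The pieces of this section can be sketched in NumPy as follows. Several simplifications are assumptions of this sketch rather than details from the paper: average pooling is done over non-overlapping blocks with an integer scale factor (adaptive pooling also handles other ratios), the 1x1 convolutions are written as per-pixel matrix products with hypothetical weight arrays `w1`, `b1`, `w2`, `b2`, the super-resolution (SSIM) loss is passed in as a precomputed number, and the mixing weight `lam` is a placeholder value.

```python
import numpy as np


def average_pool(image, factor):
    """Downsample (H, W, C) by averaging non-overlapping factor x factor blocks."""
    height, width, channels = image.shape
    assert height % factor == 0 and width % factor == 0
    blocks = image.reshape(height // factor, factor, width // factor, factor, channels)
    return blocks.mean(axis=(1, 3))


def consistency_loss(sr, lr_reference):
    """MSE between the downsampled SR image and the low-res reference frame."""
    factor = sr.shape[0] // lr_reference.shape[0]
    return float(np.mean((average_pool(sr, factor) - lr_reference) ** 2))


def color_match(sr, w1, b1, w2, b2):
    """Per-pixel network: 1x1 conv (3 -> 64), ReLU, 1x1 conv (64 -> 3)."""
    hidden = np.maximum(sr @ w1 + b1, 0.0)  # a 1x1 conv is a per-pixel linear map
    return hidden @ w2 + b2


def total_loss(consistency, super_resolution, lam=0.5):
    """Convex combination of the two losses; lam is a placeholder weight."""
    return lam * consistency + (1.0 - lam) * super_resolution
```

Because the consistency term is computed on the SR image before the color matching network, only the color-matched copy is pushed towards PlanetScope colors, while the SR image itself is tied to the Sentinel-2 radiometry.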

6 Results

6.1 Multi-Image Super-Resolution Results

Quantitative results

This section focuses on performance of the models in terms of Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). These are the most commonly used metrics for measuring super-resolution performance and image reconstruction quality.

Table 1 shows the super-resolution performance of our evaluated models with the within-scene split described in sec. 2.1.4. The Bicubic column reports the scores obtained when LR images are simply upscaled using bicubic interpolation to match the HR image size.

Subset      Bicubic (PSNR / SSIM)   SRResNet (PSNR / SSIM)   HighResNet (PSNR / SSIM)
Train       17.85 / 0.612           27.90 / 0.827            30.16 / 0.88
Validation  18.11 / 0.70            26.06 / 0.78             28.52 / 0.84
Test        18.54 / 0.70            26.83 / 0.80             29.40 / 0.85
Table 1: Average PSNR (dB) and SSIM scores. Higher scores are better.

In addition, we tested the same models on acquisitions from a different time period over the same regions. In Table 2, we notice a drop in performance across all methods, but the super-resolution models still exceed the performance of bicubic upsampling. MISR also still outperforms SISR, although the margin narrows (on the validation subset, 22.56 vs. 22.49 dB PSNR with equal SSIM).

Subset      Bicubic (PSNR / SSIM)   SRResNet (PSNR / SSIM)   HighResNet (PSNR / SSIM)
Train       17.27 / 0.63            22.66 / 0.74             23.28 / 0.76
Validation  18.14 / 0.69            22.49 / 0.73             22.56 / 0.73
Test        16.99 / 0.66            22.12 / 0.73             23.10 / 0.77
Table 2: Average PSNR (dB) and SSIM scores for testing on a different time period.

6.2 Building Detection Results

In this subsection, we present the results of the two downstream tasks, semantic and instance segmentation. Because of how the tasks are carried out, their results are closely linked.

6.2.1 Semantic Segmentation

In the semantic segmentation results, we see that, as expected, the model using MISR matches or exceeds the performance of single-revisit bicubic upsampling and exceeds that of SISR. However, the model using multiple revisits (with bicubic upsampling) is able to match the performance of the MISR model. This result shows the utility of using multiple revisits, while at the same time showing that MISR (or at least our particular implementation) does not necessarily result in the optimal representation for the downstream task. Optimizing directly using the best 4 revisits results in equal or better performance.

Additionally, we see that SISR underperforms against bicubic upsampling. We suspect this is due to SISR introducing artifacts, which result in the lower performance.

Finally, as expected, the model trained on the PlanetScope ground-truth imagery performed far better than any of the other methods.

Model IoU (val) IoU (test)
S-2 (best revisit): Bicubic 0.60 0.60
S-2 (best revisit): SISR 0.58 0.56
S-2 (best 4 revisits): concat + bicubic 0.62 0.62
S-2: MISR (all revisits) 0.61 0.60
PlanetScope 0.66 0.69
Table 3: Performance of semantic segmentation models trained with different input imagery. On-par top accuracy shown in bold.
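As a reminder of how the IoU metric in Table 3 is defined, the sketch below computes intersection over union for a pair of binary building masks. The function name and the handling of two empty masks are illustrative conventions, not details taken from the evaluation code of this work.

```python
import numpy as np

def binary_iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(np.logical_and(pred, target).sum() / union)

# One shared building pixel out of three labelled pixels in total.
pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [1, 0]])
score = binary_iou(pred, target)
```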

An example of the resulting semantic segmentation output given the inputs is shown in Figure 8.

Figure 8: A sample result of semantic segmentation on SpaceNet 7 dataset using MISR imagery from Sentinel-2.

6.2.2 Instance Segmentation

The results on the instance segmentation task largely mirror those of the semantic segmentation task. We can, however, see that the performance margin grows between the model using the PlanetScope imagery and those using Sentinel-2, indicating the importance of the fine spatial accuracy required for correct delineation in instance segmentation on this particular dataset.

We also see that precision is generally lower with the super-resolution imagery than with the equivalent bicubically upsampled baselines. This indicates that the models trained on super-resolved imagery produce a larger share of false-positive detections.

One important note is that we used all revisits with the MISR model (including cloudy revisits), while the bicubically upsampled model only used the best four revisits. While the MISR model was shown to be relatively robust to this, an avenue for further investigation is to use only the best four revisits for the MISR model.
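The "concat + bicubic" baseline stacks several revisits along the channel dimension before segmentation. The sketch below shows one way to select and stack revisits. Ranking by cloud fraction is an assumption made for illustration, since this section does not define how the best revisits are chosen, and the bicubic upsampling step is omitted.

```python
import numpy as np

def stack_best_revisits(revisits, cloud_fractions, k=4):
    """Concatenate the k least-cloudy revisits along the channel axis.

    revisits: array of shape (T, C, H, W).
    cloud_fractions: length-T sequence of values in [0, 1].
    Returns an array of shape (k * C, H, W).
    """
    order = np.argsort(cloud_fractions, kind="stable")[:k]
    order = np.sort(order)  # keep the selected revisits in acquisition order
    return np.concatenate([revisits[t] for t in order], axis=0)

# Six toy RGB revisits; revisit t is filled with the value t for easy inspection.
revisits = np.stack([np.full((3, 4, 4), float(t)) for t in range(6)])
clouds = [0.9, 0.1, 0.5, 0.0, 0.2, 0.8]
stacked = stack_best_revisits(revisits, clouds)  # uses revisits 1, 2, 3 and 4
```

With four RGB revisits, the segmentation network would receive a 12-channel input instead of a 3-channel one.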

Model (IoU Threshold: 0.25) Precision Recall F1
S-2 (best revisit): Bicubic 0.24 0.56 0.33
S-2 (best revisit): SISR 0.21 0.56 0.30
S-2 (best 4 revisits): concat + bicubic 0.28 0.57 0.37
S-2: MISR (all revisits) 0.26 0.57 0.36
PlanetScope 0.41 0.68 0.51
Table 4: Performance of instance segmentation algorithm with different input imagery on the test set
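To make the metrics in Table 4 concrete, the following sketch computes precision, recall and F1 for building instances matched at an IoU threshold of 0.25. The greedy one-to-one matching and the representation of each building as a set of pixels are assumptions made for illustration; the exact matching procedure used in the evaluation is not described in this section.

```python
def match_instances(pred, truth, iou_threshold=0.25):
    """Greedily match predicted to ground-truth instances.

    pred, truth: lists of sets of (row, col) pixels, one set per building.
    Returns (precision, recall, f1).
    """
    unmatched = list(range(len(truth)))
    tp = 0
    for p in pred:
        best_iou, best_idx = 0.0, None
        for idx in unmatched:
            union = len(p | truth[idx])
            iou = len(p & truth[idx]) / union if union else 0.0
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            tp += 1
            unmatched.remove(best_idx)  # each ground-truth building matches once
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

truth = [{(0, 0), (0, 1)}, {(5, 5)}]
pred = [{(0, 0), (0, 1), (0, 2)}, {(9, 9)}, {(8, 8)}]
precision, recall, f1 = match_instances(pred, truth)
```

In this toy case one of three predictions is correct and one of two buildings is found, so two spurious predictions lower precision without affecting recall. This is the same pattern as the gap between precision and recall for the Sentinel-2 models in Table 4.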

Overall, MISR proved useful for the downstream tasks of semantic and instance segmentation. These results suggest that MISR images encode the information contained in the Sentinel-2 time series.

6.3 Colour Consistency Results

Averaged spatial Fourier power spectrum. Figure 9 shows the spatial Fourier power spectrum averaged over the images in the test and validation datasets. In particular, we show the spectrum of PlanetScope images, Sentinel-2 images upscaled with bicubic interpolation and the super-resolved images of the MISR and SISR methods. We can see that super-resolved images (red and green) exhibit more high-frequency components than Sentinel-2 images (orange). This shows that the SR models are adding high-frequency content to the output. However, the amount of high-frequency information is still lower than in PlanetScope (blue).

Figure 9: Averaged power spectrum of PlanetScope, Sentinel-2 and Sentinel-2 super-resolved images over the train and validation dataset. We can see that super-resolved images using MISR and SISR (green and red) have more high-frequency components than Sentinel-2 (orange).
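An averaged power spectrum of this kind can be sketched in NumPy as below. The radial binning by integer frequency radius and the function names are assumptions made for illustration; the exact procedure used to produce Figure 9 is not specified in the text.

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged 2-D power spectrum of a single-band image (H, W)."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))  # zero frequency at the centre
    power = np.abs(spectrum) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    sums = np.bincount(radius.ravel(), weights=power.ravel())
    counts = np.bincount(radius.ravel())
    return sums / counts  # index = integer spatial-frequency radius

def mean_spectrum(images):
    """Average the radial spectra of several images of identical shape."""
    return np.mean([radial_power_spectrum(im) for im in images], axis=0)

# A constant image has all of its power at zero frequency.
flat = radial_power_spectrum(np.ones((8, 8)))
```

Comparing such curves between bicubically upsampled and super-resolved images indicates how much energy a model adds at high spatial frequencies. It does not by itself show that the added detail is correct.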

Radiometric consistency. To evaluate the radiometric consistency method, Figure 10 shows the histograms of each of the RGB bands over the test and validation datasets. In these histograms we can see that the color distribution of PlanetScope (first row) is very different from that of Sentinel-2 (last row). In the middle we show the histograms of the Sentinel-2 super-resolved images using the MISR method with the consistency loss (Sec. 5). We can see that the color distribution of the SR images matches the color distribution of Sentinel-2.

Figure 10: Normalized histograms of the image values over the test and val dataset for each of the RGB bands (columns). First row shows the values of PlanetScope images, second row shows the values of the super-resolved Sentinel-2 images using the consistency loss. Third row shows the values of the Sentinel-2 images. We see that, using the color consistency loss, the color distribution of Sentinel-2 is maintained in the super-resolved images.
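Per-band normalized histograms such as those in Figure 10 can be computed along the following lines. The number of bins, the value range and the function name are illustrative assumptions.

```python
import numpy as np

def band_histograms(images, bins=64, value_range=(0.0, 1.0)):
    """Normalized histogram per band over a stack of images (N, C, H, W).

    Returns (hists, edges), where hists has shape (C, bins) and each row
    sums to one when at least one value falls inside value_range.
    """
    images = np.asarray(images, dtype=float)
    hists = []
    for band in range(images.shape[1]):
        counts, edges = np.histogram(
            images[:, band].ravel(), bins=bins, range=value_range
        )
        hists.append(counts / counts.sum())
    return np.stack(hists), edges

rng = np.random.default_rng(0)
toy_images = rng.random((4, 3, 16, 16))  # four toy RGB images in [0, 1)
hists, edges = band_histograms(toy_images, bins=32)
```

Normalizing each histogram to sum to one makes distributions comparable across image sets of different sizes, for example super-resolved outputs versus the original Sentinel-2 frames.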

6.4 Qualitative results

Figure 17 shows a few examples of the visual quality of our MISR approach and Figure 30 shows the effect of the consistency loss in HighRes-net on a few test-set instances. Firstly, Figure 17 shows a Sentinel-2 low-res image, a super-resolved image using HighRes-net and the PlanetScope HR image. In particular, the first row shows an urban area where the super-resolved image (middle) is significantly sharper than the low-res Sentinel-2 image (left); in this image, it is clear that counting buildings should be easier in the former than in the latter. Figure 30 shows examples of SR with colour consistency (second column); we can see that the proposed model enhances the spatial resolution of Sentinel-2 while preserving the color content. The third column shows the output after the color matching step, which, as we can see, is more similar to the PlanetScope ground truth.

(a) Low-res (S-2, 10m)
(b) Super-res (4.7m)
(c) High-res (PlanetScope, 4.7m)
Figure 17: Patch from a validation-set scene. How many buildings lie within the green polygon? Beyond being a pretty picture, the super-resolved output of HighRes-net (b) also better delineates the buildings in urban scenes, enabling downstream tasks such as building segmentation with improved accuracy compared to prediction on a single S-2 image. This is evidenced qualitatively by the fact that manually counting the buildings in the green polygon is easier in (b) than in (a). (a)-(f): Note that in both examples/rows, the spectra of the super-res output are similar to those of the high-res PlanetScope reference, an undesirable side-effect if the spectral information of the source low-res instrument is better than that of the high-res instrument.
(a) LR (S2, 10m)
(b) SR with S2 spectra (4.7m)
(c) SR with PS spectra (4.7m)
(d) HR ground-truth (4.7m)
Figure 30: Example outputs of HighRes-net with the consistency loss, which preserves the spectra of Sentinel-2. (a)-(d): Buildings are best delineated in (c), but the spectra are no longer consistent with Sentinel-2. (b) presents a trade-off that still delineates buildings better than (a) while being consistent with S-2 spectra. (e)-(h): Example patch from Figure 17, now with S-2 spectra preserved in (f). (i)-(l): Example of a failure in an agricultural scene, where HighRes-net with the consistency loss (j) has failed to capture the lack of vegetation originally shown in the left side of (i). We suspect this is because the high-res ground-truth contains vegetation. In other words, this instance highlights the compromise between the spectra of S-2 and the content of PlanetScope.

7 Discussion and conclusions

In this work, we propose the first MISR model for multi-spectral imagery, adapting the HighRes-net model of Deudon et al. deudon_highres-net_2020. We demonstrate this model on a newly compiled dataset of high-res PlanetScope images (4.77m) and lower-res Sentinel-2 (10m) time series over 46 different locations, using the SpaceNet-7 dataset as a base. Using this data, we show that the HighRes-net model produces better super-resolved images than SISR or bicubic upsampling in terms of the PSNR and SSIM metrics. These results are aligned with the outcomes of the PROBA-V contest organized by ESA, which highlighted the potential of deep-learning-based super-resolution for Earth observation.

Additionally, we propose a modification of the HighRes-net architecture and training procedure to deal with the domain shift between PlanetScope and Sentinel-2 images. Previous super-resolution works have overlooked the spectral differences between instruments (e.g. in the SISR work of salgueiro_romero_super-resolution_2020 the authors ignore this issue when working with WorldView and Sentinel-2 images). This matters because the radiometric resolution of Sentinel-2 images (12-bit depth) is higher than that of the PlanetScope data (8-bit), and Sentinel-2 atmospheric correction is also more accurate due to the dedicated atmospheric correction bands (B1, B9 and B10). Hence, bottom-of-atmosphere (BOA) surface reflectance values for Sentinel-2 images are more accurate than those of PlanetScope. With our proposed solution we seek to have the best of both worlds: high spatial resolution and well-calibrated radiometric data.

Another important contribution of this work is the demonstration of MISR in a downstream task. In particular, we showed that MISR-fused revisits produce training images that yield better performance of building segmentation models, compared to SISR and bicubic upsampling, in terms of IoU, recall, precision and F1 scores. Interestingly, SISR performs worse than bicubic upsampling, possibly due to artefacts caused by the deep SISR image generator. We also tried a simpler approach to fusion, concatenating multiple revisits along the channels dimension and forward-passing them through a Fully Convolutional Net (a CNN with no pooling on the spatial dimensions) for segmentation. We found that even with this rather simplistic approach, multiple revisits still assist the segmentation task (although we are using the 4 best revisits, while HighRes-net receives all the inputs and has to learn to discard the cloudy acquisitions).

7.1 Future work

To attempt a fully multi-spectral MISR approach, on all bands of Sentinel-2, ideally a higher resolution reference with the same bands would be needed, at least in a supervised learning setting. Further research into unsupervised MISR is needed to unlock the super-resolution potential of any revisit archive, without depending on near co-located and co-temporal higher resolution imagery.

Although the colour consistency loss shows promising results in the preservation of Sentinel-2 spectra, we conclude that a more flexible approach is needed for fusing revisits from dynamic scenes: one that attends to and fuses features that are static with respect to a certain anchor revisit, possibly chosen by the user. This anchor revisit can be any one within the set of available revisits, and the anchored fusion model should enhance only the parts of the image that are static with respect to the anchor image, and ignore any dynamic features (e.g. due to weather, vegetation, urban development).

Conceptualization, M.R., F.K. and Y.G.; methodology, M.R., F.K., G.M.-G. and Y.G.; data curation, M.R., F.K. and G.M.-G.; writing (original draft preparation), M.R. and F.K.; writing (review and editing), M.R., F.K., G.M.-G., L.G.-C. and Y.G.; supervision, F.K., Y.G. and L.G.-C.

This work has been enabled by Frontier Development Lab (FDL) Europe, a public partnership between the European Space Agency (ESA) at Phi-Lab (ESRIN), Trillium Technologies and the University of Oxford; the project has been also supported by Google Cloud. G.M.-G. and L.G.-C. are funded by the Spanish ministry of Science DOI MCIN/AEI/10.13039/501100011033/ project TEC2016-77741-R.

The PlanetScope images used in this work are publicly available. Sentinel-2 images were obtained from the Copernicus Open Access Hub. The co-registered dataset produced in this study will be made publicly available.

The authors would like to thank James Parr and Jodie Hughes from the Trillium team and Nicolas Longépé from ESA PhiLab for their support, discussions and comments throughout the development of this work. The authors declare no conflict of interest.

Appendix A Dataset Area of Interest Breakdown

id Scene %clouds desert agri urban veg bare revisits (usable, total, %) val test
1 0358E-1220N_1433_3310 70 1 1 1 8 13 62 1
2 1389E-1284N_5557_3054 69 1 1 1 8 13 62 1
3 0361E-1300N_1446_2989 66 1 1 8 13 62 1
4 1848E-0793N_7394_5018 66 1 1 1 7 13 54
5 0357E-1223N_1429_3296 65 1 1 1 6 13 46
6 1716E-1211N_6864_3345 62 1 5 13 38
7 1025E-1366N_4102_2726 55 1 1 1 5 13 38
8 1672E-1207N_6691_3363 53 1 1 5 13 38
9 1298E-1322N_5193_2903 56 1 1 1 4 13 31
10 1014E-1375N_4056_2688 49 1 1 1 4 13 31
11 1703E-1219N_6813_3313 58 1 1 1 3 13 23
12 1617E-1207N_6468_3360 24 1 1 3 13 23
13 1439E-1134N_5759_3655 61 1 1 1 5 10 50
14 0566E-1185N_2265_3451 96 1 1 6 7 86 1
15 0586E-1127N_2345_3680 83 1 1 1 6 7 86 1
16 1481E-1119N_5927_3715 81 1 1 1 6 7 86
17 0571E-1075N_2287_3888 81 1 1 6 7 86
18 1200E-0847N_4802_4803 80 1 1 1 6 7 86
19 1210E-1025N_4840_4088 95 1 1 1 5 7 71
20 1335E-1166N_5342_3524 84 1 1 1 5 7 71
21 1204E-1204N_4819_3372 74 1 1 1 5 7 71
22 0632E-0892N_2528_4620 67 1 1 1 5 7 71
23 1479E-1101N_5916_3785 67 1 1 5 7 71
24 0434E-1218N_1736_3318 84 1 4 7 57
25 1138E-1216N_4553_3325 71 1 1 4 7 57
26 0331E-1257N_1327_3160 68 1 1 4 7 57
27 1049E-1370N_4196_2710 59 1 1 4 7 57
28 1185E-0935N_4742_4450 33 1 1 1 2 7 29
29 0614E-0946N_2459_4406 29 1 1 2 7 29
30 0595E-1278N_2383_3079 34 1 1 7 14 NA NA
31 1209E-1113N_4838_3737 100 1 1 1 1 6 6 100 1
32 0977E-1187N_3911_3441 100 1 1 1 6 6 100
33 1289E-1169N_5156_3514 99 1 1 6 6 100
34 0368E-1245N_1474_3210 67 1 1 6 6 100
35 1015E-1062N_4061_3941 100 1 1 5 6 83
36 1438E-1134N_5753_3655 92 1 1 5 6 83 1
37 1276E-1107N_5105_3761 91 1 1 1 5 6 83 1
38 1296E-1198N_5184_3399 87 1 1 1 5 6 83
39 0924E-1108N_3699_3757 67 1 4 6 67
40 0487E-1246N_1950_3207 98 1 1 3 6 50
41 1538E-1163N_6154_3539 63 1 1 3 6 50
42 1748E-1247N_6993_3202 57 1 1 3 6 50
43 1172E-1306N_4688_2967 56 1 1 1 3 6 50
44 1709E-1112N_6838_3742 44 1 1 1 3 6 50
45 0683E-1006N_2732_4164 58 1 1 2 5 40
Table 5: The subset (45) of SpaceNet-7 AOIs that we used in this work, acquired between December 2019 and January 2020. The high-level breakdown of the types of terrain contained in each scene shows the overall geo-diversity of the dataset. Left to right: % clouds is the average cloud coverage (SCL=9) across all revisits; desert, agri(culture), urban, veg(etation) and bare (soils) indicate the type of terrain; a usable revisit is at least 50% cloud-free; val/test indicate whether the scene is part of the validation or testing dataset (NA=not used).
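The revisit columns of Table 5 can be derived from per-revisit scene classification (SCL) maps, as in the sketch below. The function name and array layout are assumptions made for illustration; only the cloud class (SCL=9) and the 50% cloud-free threshold come from the caption.

```python
import numpy as np

CLOUD_CLASS = 9  # SCL value counted as cloud in the caption of Table 5

def revisit_stats(scl_stack, min_clear=0.5):
    """Summarize a stack of scene-classification maps of shape (T, H, W).

    Returns (usable, total, percent_usable, mean_cloud_percent), where a
    revisit is usable if at least min_clear of its pixels are cloud-free.
    """
    cloud_frac = (scl_stack == CLOUD_CLASS).mean(axis=(1, 2))
    usable = int(np.sum(1.0 - cloud_frac >= min_clear))
    total = int(scl_stack.shape[0])
    percent_usable = round(100 * usable / total)
    mean_cloud = round(float(100 * cloud_frac.mean()))
    return usable, total, percent_usable, mean_cloud

# Toy stack: 8 cloud-free revisits (class 4) and 5 fully cloudy ones (class 9).
clear = np.full((8, 2, 2), 4)
cloudy = np.full((5, 2, 2), CLOUD_CLASS)
stats = revisit_stats(np.concatenate([clear, cloudy]))
```

The percentage column of the table is consistent with the ratio of usable to total revisits: for example, 8 usable out of 13 gives approximately 62%, as in the first rows.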

Appendix B HighRes-net Architecture

Module Layers parameters
encode Conv2d(in=6, out=64, k=3, s=1, p=1) 1216
ResidualBlock(64) 73,858
ResidualBlock(64) 73,858
Conv2d(in=64, out=64, k=3, s=1, p=1) 36,928
fuse ResidualBlock(128) 295,170
Conv2d(in=128, out=64, k=3, s=1, p=1) 73,792
decode ConvTranspose2d(in=64, out=64, k=3, s=1) 36,928
PReLU 1
Conv2d(in=64, out=3, k=1, s=1) 65
(optional) Upsample(scale_factor=3.0, mode=‘bicubic’) 0
591,818 (total)
Table 6: HighRes-net architecture. The ENCODE module converts each input low-res revisit into an encoding. It takes 6 input channels, that is, 3 channels (RGB) per input image (low-res revisit and reference frame concatenated along the channels dimension). The FUSE module is an operator that is applied recursively to pairs of encodings until one encoding remains. An encoding can be the output of either the ENCODE or the FUSE module. FUSE takes 64 channels per input encoding, that is, 128 channels for a pair of encodings concatenated along the channels dimension. The DECODE module is a learned upsampling operator (in contrast to non-learned bilinear or bicubic upsampling), built around a transpose convolution layer, with a final convolution that outputs 3 channels (RGB). Note that the ConvTranspose2d stride also decides the upsampling factor, which must be an integer; hence an optional Upsample layer can further upscale the super-resolved image by a fractional factor, if needed.
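The parameter counts in Table 6 can be checked with the standard weights-plus-bias formula for convolutions, sketched below. The helper names are hypothetical, and the internal structure assumed for `ResidualBlock` (two 3x3 convolutions, each followed by a single-parameter PReLU) is an assumption that happens to reproduce the listed counts; it is not confirmed by the text.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a 2-D convolution with a k x k kernel.

    The same count applies to a transposed convolution with the same
    channel numbers and kernel size.
    """
    return c_out * (c_in * k * k + (1 if bias else 0))

def residual_block_params(channels, k=3):
    """Assumed block: two k x k convolutions, each followed by a one-parameter PReLU."""
    return 2 * (conv2d_params(channels, channels, k) + 1)

counts = {
    "Conv2d(64, 64, k=3)": conv2d_params(64, 64, 3),     # 36,928
    "Conv2d(128, 64, k=3)": conv2d_params(128, 64, 3),   # 73,792
    "ResidualBlock(64)": residual_block_params(64),      # 73,858
    "ResidualBlock(128)": residual_block_params(128),    # 295,170
    "Conv2d(6, 64, k=3)": conv2d_params(6, 64, 3),       # 3,520
    "Conv2d(64, 3, k=1)": conv2d_params(64, 3, 1),       # 195
}
```

Under this formula, the listed values of 1,216 and 65 for the first and last convolutions correspond to 2 input channels and 1 output channel, whereas 6 input and 3 output channels would give 3,520 and 195. One possible reading is that these two entries were carried over from a single-band configuration; this is an inference from the arithmetic, not something stated in the source.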

Appendix C Hyper-parameters

C.1 Hyper-Parameters of SRResNet

We list a number of the other hyper-parameters used in training. We did not perform a comprehensive hyper-parameter sweep.

  1. Optimiser: Adam

  2. Learning Rate: 0.0007

  3. Learning Rate Decay: The learning rate was reduced when the loss on the validation set plateaued for 2 epochs.

  4. Epochs: 50

C.2 Hyper-Parameters of HRNet

We list here a number of the hyper-parameters used for training the HRNet hrnet. We did not perform a comprehensive hyper-parameter sweep.

  1. Objective Function: Binary Cross Entropy

  2. Optimizer: Adam

  3. Learning Rate: 0.0007

  4. Learning Rate Decay: The learning rate was reduced when the loss on the validation set plateaued for 2 epochs.

  5. Epochs: 50
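The learning rate decay listed for both networks (reduce the rate when the validation loss plateaus for 2 epochs) can be sketched as a small scheduler. The class name is hypothetical and the decay factor of 0.5 is an assumption, since the factor is not stated; only the initial learning rate of 0.0007 and the 2-epoch patience come from the lists above.

```python
class PlateauDecay:
    """Reduce the learning rate when the validation loss stops improving."""

    def __init__(self, lr=0.0007, patience=2, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor  # hypothetical: the decay factor is not given in the source
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

scheduler = PlateauDecay()
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.92, 0.8]]
```

In this toy run the loss fails to improve on its best value (0.9) for two consecutive epochs, so the rate is halved at the fourth epoch and then stays at the reduced value.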