One of the main issues in observing and analyzing Earth Observation (EO) images is estimating their quality. This issue is twofold. Firstly, images are captured with distinct modifications and distortions, such as optical diffraction and aberrations, detector spacing and footprints, atmospheric turbulence, platform vibrations, blurring, target motion, and postprocessing. Secondly, EO image resolution is very limited, due to the sensor's optical resolution, the satellite's connection capacity to send high-quality images to ground, as well as the captured Ground Sampling Distance (GSD) [Leachtenauer and Driggers, 2022]. These limitations make image quality assessment (IQA) particularly hard for EO, as there are no comparable fine-grained baselines in broad EO domains.
We tackle these problems by defining a network that acts as a no-reference (blind) metric, assessing quality and optimizing super-resolution of EO images at any scale and modification.
We briefly summarize below our main contributions:
We train and validate a novel network (QMRNet) for EO imagery, able to predict the type and level of modification of an image based on its quality and distortion
(Case 1) We benchmark distinct Super-Resolution Models with QMRNet and compare results with Full-Reference, No-Reference and Feature-based Metrics
(Case 2) We benchmark distinct EO datasets with QMRNet scores
(Case 3) We propose to use QMRNet as a loss for optimizing quality of super-resolution models
Super-resolution (SR) consists of estimating a high-resolution (HR) image given a low-resolution (LR) one. Initial work from [Zeyde et al., 2012] defined a technique named Sparse Coding, which consisted of defining dictionaries of image-specific patches which, combined, were able to reconstruct an HR image. Here the reconstruction error consists of the difference between the reconstructed image (SR), obtained from the LR image, and the original HR image; the coefficients for every patch of the dictionary are then calculated for that particular image in order to reconstruct it.
The first algorithms used for analyzing satellite images were based on multiscale or bilateral filters, only extracting low-level features of the image. Here the denoising problem (similar to what we want to obtain with SR) was tackled by FSRS [Farsiu et al., 2004], with an architecture based on bilateral filters trying to minimize the variance error with respect to the HR. GA-FRSR [Benecki et al., 2018] utilized an algorithm able to tune the hyperparameters and kernels of the FRSR. SR-ADE [Zhu et al., 2016] utilized a low-level algorithm able to amplify high-frequency features in order to obtain better image resolution. Another model, RFSR [Shermeyer and Etten, 2019], utilized 100 random forest regressors with a tree depth of 12.
Auto-Encoder architectures use convolutional layers (feature extractor) to encode the patches of the image into a feature vector (encoder), and then add deconvolutional layers to reconstruct the original image (decoder). The predicted images are compared with the original ones in order to re-train the Auto-Encoder network until converging to the HR objective.
SRCNN/FSRCNN are based on a network of 3 blocks (patch extraction and representation, nonlinear mapping, and reconstruction) with 64 filters per layer and 3x3 kernels, using patches of 33x33. The authors also use rotation, scaling and noise transformations as data augmentation prior to training the network. SRCNN/FSRCNN is said to converge given 100 images, using a total of 24,800 patches (crops). The authors downscale with a low-pass filter to obtain LR images and use bicubic interpolation for the upscaling during reconstruction to obtain the SR image (the predicted HR image). In contrast, VDSR [Kim et al., 2016] uses a network of 20 convolutional layers (64 filters of 3x3 kernels and ReLU) and a residual map as an additional layer, which is able to represent the high-level feature differences between the HR and the interpolated SR image. This residual map is summed to the original LR image in order to obtain the SR image, calculate the reconstruction error, and re-train the network. SRCNN has been used by MC-SRCNN [Müller et al., 2020] to super-resolve multi-spectral images, by changing the architecture's input channels and adding pan-sharpening filters (modulating the smoothing/sharpening intensity).
These design principles used in autoencoders have, however, a drawback: they work differently over feature frequencies and features at distinct resolutions. For that reason, multi-scale architectures have been proposed. The Multi-Scale Residual Network (MSRN) [Li et al., 2018] uses residual connections in multiple residual blocks at different scale bands (non-exclusive to ResNets). It helps to equalize the information bottleneck in deeper layers (high-level features), where spatial information in some cases tends to diminish or even vanish. Traditional convolutional filters in primary layers have a fixed and narrow field of view, which creates dependencies between the learning of spatial long-range connections and deeper layers. Multiscale blocks cope with this drawback by analyzing the image at different resolution scales, later merged in a high-dimensional multiband latent space. This allows a better abstraction at deeper layers and therefore the reconstruction of spatial information. This is a remarkable advantage when using EO images, which come at distinct resolutions and GSDs.
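The multi-scale idea described above can be sketched as a block with parallel convolutional paths of different receptive fields whose outputs are fused and added back residually. This is a hedged simplification, not the MSRN reference implementation (which interleaves 3x3/5x5 paths across several stages):

```python
# Minimal sketch of a multi-scale residual block in the spirit of MSRN.
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Two parallel paths with different receptive fields (scale bands)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        # Fuse both scale bands back to the original channel count
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        a = self.act(self.conv3(x))
        b = self.act(self.conv5(x))
        merged = self.fuse(torch.cat([a, b], dim=1))
        return merged + x  # residual connection preserves spatial information

x = torch.randn(1, 64, 32, 32)
y = MultiScaleBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 32, 32])
```

The 1x1 fusion layer merges the scale bands into the latent space before the residual sum, so deeper layers still receive the original spatial content.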
Novel state-of-the-art SR models are based on Generative Adversarial Networks (GANs). These networks are composed of two sub-networks: a generator that generates an image estimate (SR), and a discriminator which decides whether the generated image is real or fake under a certain categorical or metric objective with respect to the classification of a set of images "x". Usually, the generator is a deconvolutional network fed with a latent vector "z", which represents the distribution for each image. First, the discriminator is trained using an existing database (in order to define the discrimination objective). Then, freezing the discriminator, the generator generates the SR image estimates, initializing the latent vector at a random distribution. After that, the discriminator is fine-tuned given the SR generated from the latent space. Finally, the generator is re-trained given the loss obtained from the discriminator. The main objective of GANs is to maximize the probability of the generator fooling the discriminator. In the SR problem, the LR image is considered as the input latent space "z" while the HR image is considered as the real image "x" to obtain the adversarial loss. For the case of the popular SRGAN [Ledig et al., 2017]
, it has been designed with an adversarial loss through VGG and ResNet (SRResnet), with residual connections and a perceptual loss. Its generator uses 16 residual blocks; each residual block has 2 convolutional layers of 64 filters with 3x3 kernels, batch normalization and ReLU. The discriminator has 8 convolutional layers with 3x3 kernels and filters scaled by a factor of 2, from 64 to 512 filters, followed by fully-connected layers and a sigmoid for the calculation of the adversarial loss. The ESRGAN [Wang et al., 2019] is an improved version of SRGAN, although it relaxes the adversarial loss, adds training upon a perceptual loss, and includes additional residual connections in its architecture.
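The alternating adversarial procedure described above (discriminator step on real HR vs. generated SR, then generator step to fool the discriminator) can be sketched as follows. The toy generator and discriminator are illustrative placeholders, not the SRGAN architecture, and input/output sizes are kept equal for simplicity:

```python
# Hedged sketch of one adversarial SR training step.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))           # toy generator
D = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1),           # toy discriminator
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Sigmoid())
bce = nn.BCELoss()
opt_g = torch.optim.SGD(G.parameters(), lr=1e-3)
opt_d = torch.optim.SGD(D.parameters(), lr=1e-3)

lr_img = torch.rand(1, 3, 32, 32)   # LR input plays the role of latent "z"
hr_img = torch.rand(1, 3, 32, 32)   # HR image plays the role of real "x"

# 1) Discriminator step: real HR -> 1, generated SR (detached) -> 0
sr_img = G(lr_img).detach()
loss_d = bce(D(hr_img), torch.ones(1, 1)) + bce(D(sr_img), torch.zeros(1, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# 2) Generator step: only G's optimizer steps, so D acts as a fixed critic
sr_img = G(lr_img)
loss_g = bce(D(sr_img), torch.ones(1, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In a real SR setting the generator would also upsample the LR input, and the generator loss would combine this adversarial term with content/perceptual terms.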
The main intrinsic difference between GANs and other architectures resides in that the image probability distributions are intrinsically learned. This makes these architectures suffer from unknown artifacts and hallucinations; however, their SR estimates are usually sharper than those of autoencoder-type architectures. The mentioned generative techniques for SR, such as SRGAN/SRResnet, ESRGAN, or Enlighten-GAN [Jiang et al., 2021], and convolutional SR autoencoders, such as VDSR, SRCNN/FSRCNN, or MSRN, do not adapt their feature generation to optimize a loss based on a specific quality standard that considers all quality properties of the image (both structural and pixel-to-pixel). Rather, predictions show typical distortions such as blurring (from downscaling the input) or GAN artifacts from the training domain objective. Most of these GAN-based models build the LR inputs of the network by downsampling the original HR data. This LR generation from downsampled HR limits the training of these models to performing the reverse transformation of that modification; however, the distortions and variations of any test image may be a combination of much more diverse modifications. The only way to mitigate this limitation, and only partially due to overfitting, is to augment the LR samples with distinct transformations simultaneously.
Some self-supervised techniques can learn to solve the ill-posed inverse problem from the observed measurements, without any knowledge of the underlying distribution, assuming its invariance to the transformations. The Content Adaptive Resampler (CAR) [Sun and Chen, 2020] was proposed, in which a jointly-learnable downscaling pre-step block and an upscaling block (SRNet) are trained separately. It is able to learn the downscaling step (through a ResamplerNet) by learning the statistics of kernels from the HR image, and then learns the upscaling blocks with another net (SRNet/EDSR) to obtain the SR images. CAR has been able to improve experimental results on SR by considering the intrinsic divergences between LR and HR.
The Local Implicit Image Function (LIIF) [Chen et al., 2020] is able to generate super-resolved pixels at continuous coordinates, considering the 2D deep features around these coordinates as inputs. In LIIF, an encoder is jointly trained in a self-supervised super-resolution task, maintaining high fidelity at higher resolutions. Since the coordinates are continuous, LIIF can be presented at any arbitrary resolution. The main advantage here is that the SR is represented in a resolution without resizing the HR, making it invariant to the transformations performed on the LR. This enables LIIF to extrapolate SR up to factors of x30.
1.2 Image Quality Assessment
In order to assess the quality of an image, there are distinct strategies. Full-Reference metrics consider the difference between an estimated or modified image (SR) and the reference image (HR). In contrast, no-reference metrics assess the specific statistical properties of the estimated image (SR) without any reference image. Other, more novel metrics calculate high-level characteristics of the estimated (SR) image by comparing its distribution distance with respect to a preprocessed dataset or the reference (HR) image in a feature-based space.
Full-Reference Pixel-Level metrics
The similarity between predicted images (SR) and the reference high-resolution images (HR) is estimated by looking at the pixel-wise differences responsive to reflectance, sharpness, structure, noise, etc. Well-known examples of pixel-level metrics are the Root-Mean-Square Error (RMSE) [Pradham et al., 2008]
, the Peak Signal-to-Noise Ratio (PSNR)[Huynh-Thu and Ghanbari, 2008], Structural Similarity Metric (SSIM/MSSIM) [Wang et al., 2004], Haar Perceptual Similarity Index (HAARPSI) [Reisenhofer et al., 2018], Gradient Magnitude Similarity Deviation (GMSD) [Xue et al., 2014], Mean Deviation Similarity Index (MDSI) [Nafchi et al., 2016], Spectral Angle Mapper (SAM) [Kruse et al., 1993], Universal Image Quality Index (UQI/UIQ/UIQI) [Wang and Bovik, 2002], Human Visual System-based (HVS) [Sheikh and Bovik, 2005], or Visual Information Fidelity Criterion (IFC/VIF) [Sheikh and Bovik, 2006].
The RMSE evaluates the absolute pixel error between SR and HR. The PSNR estimates the power of the signal (SR) considering the noise error with respect to the HR image. Some metrics such as SSIM specifically measure the means and covariances locally for each region at a specific size (e.g. 8x8 patches; multi-scale patches for MSSIM), affecting the overall metric score. The GMSD calculates the global variation similarity of gradients based on a local quality map combined with a pooling strategy. Most comparative studies use these metrics to measure the actual SR quality, mostly relying on PSNR, although there is no evidence that these measurements are the best for EO cases, as some of them are not sensitive to local perturbations (i.e. blurring, over-sharpening) and local changes (i.e. artifacts, hallucinations) in the image. The HAARPSI and IFC/VIF calculate an index based on the difference (absolute or in mutual information) using the sum of a set of wavelet coefficients computed over the SR-HR images. Other metrics combine several of the pinpointed parameters simultaneously. For instance, the MDSI jointly compares gradient similarity, chromaticity similarity, and deviation pooling. Other cases such as the NQM [Damera-Venkata et al., 2000] and UQI/UIQ/UIQI consider luminance, contrast, and structural statistics of the image, or VMAF [Aaron et al., 2015], which combines measurements of VIF, detail loss and luminance pixel differences.
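The two simplest metrics above have direct closed forms, sketched here for 8-bit images (peak value 255):

```python
# Minimal numpy implementations of RMSE and PSNR as described above.
import numpy as np

def rmse(sr, hr):
    return float(np.sqrt(np.mean((sr.astype(float) - hr.astype(float)) ** 2)))

def psnr(sr, hr, peak=255.0):
    mse = np.mean((sr.astype(float) - hr.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10 * np.log10(peak ** 2 / mse))

hr = np.full((8, 8), 100, dtype=np.uint8)
sr = hr.copy(); sr[0, 0] = 110   # single-pixel error of 10 grey levels
print(round(rmse(sr, hr), 2))    # 1.25
print(round(psnr(sr, hr), 2))    # 46.19
```

The example illustrates why PSNR is insensitive to local changes: a single wrong pixel barely moves the mean squared error, yet could correspond to a visible artifact.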
Pixel-reference metrics have a main requirement: ground-truth HR images are needed to assess a specific quality standard. For no-reference (or blind) metrics, no explicit reference is needed. These rely on a parametric characterization of the enhanced signal based on statistical descriptors, usually linked to noise or sharpness, embedded in high-frequency bands. Some examples are variance, entropy (He), or the high-end spectrum (FFT). The most popular metric in EO is the Modulation Transfer Function (MTF), which measures impulse responses in the spatial domain and transfer functions in the frequency domain. This varies upon overall local pixel characteristics, mostly present at contours, corners and sharp features in general [Lim et al., 2018]. The MTF is thus very sensitive to local changes such as those aforementioned (e.g. optical diffractions and aberrations, blurring, motions, etc.).
Other metrics use statistics from image patches in combination with multivariate filtering methods to propose score indexes for a given predefined image given its geo-referenced parameter standards. Such methods include NIQE [Mittal et al., 2013], PIQE [Venkatanath et al., 2015] and GIQE [Leachtenauer et al., 1997]. The latter is considered for official evaluation of NIIRS ratings (https://irp.fas.org/imint/niirs.htm), considering the Ground Sampling Distance (GSD), signal-to-noise ratio (SNR) and the relative edge response (RER) at distinct effective focal lengths of EO images [Thurman and Fienup, 2008, Kim et al., 2008, Li et al., 2014]. Note that RER measures the Line Spread Function (LSF), which corresponds to the absolute impulse response also computed by the MTF.
The Relative Edge Response measures the slope of the edge response (transition): the lower the metric, the blurrier the image. Taking the derivative of the normalized Edge Response produces the Line Spread Function (LSF). The LSF is a 1-D representation of the system's Point Spread Function (PSF). The width of the LSF at half its height is called the Full Width at Half Maximum (FWHM). The Fourier Transform of the LSF produces the Modulation Transfer Function (MTF). The MTF is determined across all spatial frequencies, but can be evaluated at a single spatial frequency, such as the Nyquist frequency. The value of the MTF at Nyquist provides a measure of resolvable contrast at the highest 'alias-free' spatial frequency.
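The chain described above (normalized edge response → derivative → LSF → Fourier transform → MTF) can be sketched on a synthetic blurred edge; the tanh edge profile and the one-sample RER approximation are illustrative assumptions, not a calibrated procedure:

```python
# Sketch of the edge-based chain: ESF -> LSF (derivative) -> MTF (FFT).
import numpy as np

# Synthetic blurred edge (1-D normalized edge spread function, 0 -> 1)
x = np.linspace(-5, 5, 101)
esf = 0.5 * (1 + np.tanh(x / 1.5))

lsf = np.gradient(esf, x)          # Line Spread Function
lsf /= lsf.sum()                   # normalize area to 1

mtf = np.abs(np.fft.rfft(lsf))     # Modulation Transfer Function
mtf /= mtf[0]                      # MTF(0) = 1 by convention

# RER approximated here as the edge-response rise around the edge center
center = len(x) // 2
rer = esf[center + 1] - esf[center - 1]
print(rer > 0, mtf[0] == 1.0)
```

A blurrier edge yields a flatter ESF slope (lower RER), a wider LSF (larger FWHM), and a faster-decaying MTF, which is why these quantities are interchangeable views of the same impulse response.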
Feature-based (ML) perceptual metrics
In [Benecki et al., 2018], the authors argued that conventional IQA evaluation methods are not valid for EO, as the degradation functions and hardware conditions do not meet operational conditions. From there, the keypoint feature similarity (KFS) [Liu et al., 2021] was defined, which measures edge and keypoint detector statistics to extract information concerned with local features. Through advances in DL in that respect, deeper network representations have been shown to improve perceptual quality estimates on images, although with higher requirements. The existence of hallucinations and artifacts in predicted SR images is due to several factors related to insufficient training data, learning limitations and optimization functions of the network architecture itself, or simply common overfitting problems. The concept of perceptual similarity is defined by the score reference on these trained features (i.e. of the generator or reconstruction network). These metrics compare distances between latent features of the predicted image and the reference image. Some SoTA methods of perceptual similarity include the Learned Perceptual Image Patch Similarity (LPIPS) [Zhang et al., 2018], which takes the feature maps obtained by the n-th convolution after activation (image-reference layer n) and then calculates similarity using the Euclidean distance between the predicted SR model features and the reference image features. Other metrics such as the Sliced Wasserstein Distance (SWD) [Kolouri et al., 2019] or the Fréchet Inception Distance (FID) [Salimans et al., 2016] assume a non-linear space modelling of the feature representations to compare, and can therefore adapt better to larger variability or lack of samples in the training image domains.
1.3 EO Datasets and Related Work
Most non-feature-based metrics shown above are fully unsupervised; namely, there are no current models that can specifically assess image quality invariantly to the specific modifications made on images, especially for EO cases. The most novel strategy, ProxIQA [Chen et al., 2021], tries to evaluate the quality of an image by adapting the underlying distribution of a GAN given a compressed input. This method has been shown to improve quality when tested on images from the Kodak, Teknick and NFLX compression datasets, although results may vary upon trained image distributions, as shown by the JPEG2000, VMAFp and HEVC metrics.
Very few studies on SR use EO images obtained from current worldwide satellites such as DigitalGlobe WorldView-4 (https://earth.esa.int/eogateway/missions/worldview-4), SPOT (https://earth.esa.int/eogateway/missions/spot), Sentinel-2 (https://sentinels.copernicus.eu/web/sentinel/missions/sentinel-2), Landsat-8 (https://www.usgs.gov/landsat-missions/landsat-8), Hyperion/EO-1 (https://www.usgs.gov/centers/eros/science/usgs-eros-archive-earth-observing-one-eo-1-hyperion), SkySat (https://earth.esa.int/eogateway/missions/skysat), Planetscope (https://earth.esa.int/eogateway/missions/planetscope), RedEye (https://space.skyrocket.de/docs_dat/red-eye.htm), QuickBird (https://earth.esa.int/eogateway/missions/quickbird-2), CBERS (https://www.satimagingcorp.com/satellite-sensors/other-satellite-sensors/cbers-2/), Himawari-8 (https://www.data.jma.go.jp/mscweb/data/himawari/), DSCOVR EPIC (https://epic.gsfc.nasa.gov/) or PRISMA (https://www.asi.it/en/earth-science/prisma/). In our study we selected a variety of subsets (see Table 1) from distinct online General Public Domain satellite imagery datasets with high resolution (around 30 cm/px). Most of these are used for land use classification tasks, with coverage category annotations and some with object segmentation. The Inria Aerial Image Labeling Dataset (Inria-AILD, https://project.inria.fr/aerialimagelabeling/) contains 180 train and 180 test images, covering 405+405 km of US (Austin, Chicago, Kitsap County, Bellingham, Bloomington, San Francisco) and Austria (Innsbruck, Eastern/Western Tyrol, Vienna) regions. Inria-AILD was used for a semantic segmentation of buildings contest. Some land cover categories are considered for aerial scene classification in DeepGlobe (http://deepglobe.org/; Urban, Agriculture, Rangeland, Forest, Water or Barren), USGS (https://data.usgs.gov/datacatalog/) and UCMerced (http://weegee.vision.ucmerced.edu/datasets/landuse.html), the latter with 21 classes (i.e. agricultural, airplane, baseballdiamond, beach, buildings, chaparral, denseresidential, forest, freeway, golfcourse, harbor, intersection, mediumresidential, mobilehomepark, overpass, parkinglot, river, runway, sparseresidential, storagetanks and tenniscourt), captured over many US regions, i.e. Birmingham, Boston, Buffalo, Columbus, Dallas, Harrisburg, Houston, Jacksonville, Las Vegas, Los Angeles, Miami, Napa, New York, Reno, San Diego, Santa Barbara, Seattle, Tampa, Tucson and Ventura. XView (http://xviewdataset.org/) contains 1,400 RGB pan-sharpened images from DigitalGlobe WorldView-3 with 1 million labeled objects and 60 classes (e.g. Building, Hangar, Train, Airplane, Vehicle, Parking Lot), annotated both with bounding boxes and segmentation. Kaggle Shipsnet (https://www.kaggle.com/datasets/rhammell/ships-in-satellite-imagery) contains 7 San Francisco Bay harbor images and 4000 individual crops of ships captured in the dataset.
| Dataset-Subset | #Set / #Total | GSD | Resolution | Spatial Coverage | Year | Provider |
|---|---|---|---|---|---|---|
| USGS | 279/279 | 30 cm/px | 5000x5000 | 349 km (US regions) | 2000 | USGS (LandSat) |
| UCMerced-380 | 380/2100 | 30 cm/px | 256x256 | 1022/5652 (US regions) | 2010 | USGS (LandSat) |
| UCMerced-2100 | 2100/2100 | 30 cm/px | 232x232 | 5652 km (US regions) | 2010 | USGS (LandSat) |
| Inria-AILD-10-train | 10/360 | 30 cm/px | 5000x5000 | 22/810 km (US & Austria) | 2017 | arcGIS |
| Inria-AILD-180-train | 180/360 | 30 cm/px | 5000x5000 | 405/810 km (US & Austria) | 2017 | arcGIS |
| Inria-AILD-180-test | 180/360 | 30 cm/px | 5000x5000 | 405/810 km (US & Austria) | 2017 | arcGIS |
| Shipsnet-Scenes | 7/7 | 3 m/px | 3000x1500 | 28 km (San Francisco Bay) | 2018 | Open California |
| Shipsnet-Ships | 4000/4000 | 3 m/px | 80x80 | 28 km (San Francisco Bay) | 2018 | (Planetscope) |
| DeepGlobe | 469/1146 | 31 cm/px | 2448x2448 | 703/1,717 km (Germany) | 2018 | Worldview-3 |
| Xview-train | 846/1127 | 30 cm/px | 5000x5000 | 1050/1,400 km (Global) | 2018 | WorldView-3 |
| Xview-validation | 281/1127 | 30 cm/px | 5000x5000 | 349/1,400 km (Global) | 2018 | WorldView-3 |
2 Proposed Method
2.1 IQUAFLOW Modifiers and Metrics
In Table 2 we describe the 5 modifiers we developed for our experimentation, 3 of which have been integrated from common libraries (Pytorch, https://pytorch.org/vision/stable/transforms.html; PIL, https://pillow.readthedocs.io/en/stable/reference/Image.html), namely Blur, Sharpness and Resolution (GSD), and 2 which we specifically developed to represent RER and SNR blind metric modifications. For Blur, we build a gaussian filter with a 7x7 kernel and parametrize its sigma. For Sharpness, similarly, we build a function that is modulated by a gaussian factor. If the factor is higher than 1.0 (i.e. from 1.0 to 10.0), the image is sharpened (high-pass filter, with negative values on the sides of the kernel). However, if the factor is lower than 1.0 (i.e. from 0.0 to 1.0), then the image is blurred through a gaussian function (low-pass filter with gaussian shape). For GSD (Ground Sampling Distance), we apply a bilinear interpolation on the original image at a specific scaling (e.g. x1.5, x2), which will increase the GSD. In this case, an interpolated version of a 5000x5000 image of GSD 30 cm/px will be 10000x10000 with a GSD of 60 cm/px, as its resolution has changed but the (oversampled / fake) sampling distance is doubled (worse). For RER, we get the real RER value from the ground truth and calculate the LSF and the maximum value of the edge response. From that, we build a gaussian function adapted to the expected RER coefficients and then filter the image. For SNR, similarly to RER, we require the annotation of the base SNR from the original dataset. From that, we build a randomness regime adapted to a gaussian shape, which is summed to the original image (adding randomness with a specific slope probability).
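The Blur modifier described above can be sketched as a direct 7x7 gaussian convolution parametrized by sigma. This is an illustrative numpy implementation, not the iquaflow modifier code:

```python
# Minimal sketch of the Blur modifier: a 7x7 gaussian kernel, parametrized
# by sigma, applied by 2-D convolution with reflect padding.
import numpy as np

def gaussian_kernel(size=7, sigma=1.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()  # normalized so the filter preserves mean brightness

def blur(img, sigma):
    k = gaussian_kernel(7, sigma)
    padded = np.pad(img.astype(float), 3, mode="reflect")
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + 7, j:j + 7] * k)
    return out

img = np.zeros((16, 16))
img[8, 8] = 255.0                 # single bright pixel (impulse)
out = blur(img, 2.0)
print(out.max() < img.max())      # True: blurring spreads the impulse energy
```

Applying this filter with different sigma values (Table 2 lists the range used) produces the annotated training samples; the Sharpness modifier follows the same scheme with a high-pass kernel when the factor exceeds 1.0.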
| Modifier | Parameter | #Intervals | Range |
|---|---|---|---|
| Gaussian Blur | Sigma (σ) | 50 | 1.0 to 2.5 |
| Gaussian Sharpness | Sharpness Factor | 9 | 1.0 to 10.0 |
| Ground Sampling Distance | GSD (or scaling) | 10 | |
| Relative Edge Response | RER (MTF-Sharpness) | 40 | 0.15 to 0.55 |
| Signal-to-Noise Ratio | Noise (Gaussian) Ratio | 40 | 15 to 30 |
2.2 QMRNet: Classifier of Modifier Metric Parameters
We have designed the Quality Metric Regression Network (QMRNet), able to regress quality parameters upon the modification or distortion (see Table 2 and Figure 3) applied to single images (check the code for QMRNet at https://github.com/satellogic/iquaflow/tree/main/iquaflow/quality_metrics). Given a set of images, modified through a gaussian blur (sigma), sharpness (gaussian factor), a rescaling to a distinct GSD, noise (SNR), or any kind of distortion, images are annotated with that parameter. These annotations can be used for training and validating the network upon classifying the intervals corresponding to the annotated parameters.
QMRNet is a feedforward neural network with a classifier architecture and a parametrizable head (see Figure 1) over numerical class intervals (which can be set binary, categorical or continuous according to the N intervals). It trains upon the differences between predicted intervals and the annotated parameters of the ground truth (GT), and requires a HEAD for each parameter to predict. We designed 2 mechanisms for assessing quality from several parameters (multiparameter prediction): Multibranch (MB) and Multihead (MH). MB requires an encoder and a head for each parameter to predict, while MH requires a head for each parameter but only one encoder. MH predicts all parameters simultaneously (faster) but has lower capacity in the encoder part (which can lead to lower accuracy).
For our experiments with QMRNet we have used an Encoder based on ResNet18 (backbone), composed of a convolutional layer (3x3) and 4 residual blocks (each composed of 4 convolutional layers) with 64, 128, 256 and 512 filters, respectively. Our network is scalable to distinct crop resolutions as well as regression parameters (N intervals) by adapting the HEAD to the number of classes to predict. The output of the HEAD after pooling is a continuous probability value for each class interval, and through softmax and thresholding we can filter (one-hot) which class or classes have been predicted (1) and which not (0) for each image sample crop. By default we use the Binary Cross Entropy Loss (BCELoss) as the classification error and Stochastic Gradient Descent as the optimizer. For the case of multiclass regression, in QMRNet-MB we train each network individually with its set of parametrized modification intervals for each sample.
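The Multihead design described above (one encoder, one interval-classification HEAD per quality parameter) can be sketched as follows. A small convolutional stack stands in for the ResNet18 backbone, and all names are illustrative, not the iquaflow API:

```python
# Hedged sketch of the QMRNet Multihead (MH) design.
import torch
import torch.nn as nn

class QMRNetSketch(nn.Module):
    def __init__(self, n_intervals):
        super().__init__()
        # Toy encoder standing in for ResNet18 (outputs a 512-dim feature)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 512, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One HEAD per parameter, each predicting N interval probabilities
        self.heads = nn.ModuleDict(
            {name: nn.Linear(512, n) for name, n in n_intervals.items()}
        )

    def forward(self, x):
        z = self.encoder(x)
        # Sigmoid (plus thresholding) yields one-hot interval predictions
        return {name: torch.sigmoid(h(z)) for name, h in self.heads.items()}

net = QMRNetSketch({"sigma": 50, "sharpness": 9, "gsd": 10, "rer": 40, "snr": 40})
out = net(torch.randn(2, 3, 64, 64))
print({k: tuple(v.shape) for k, v in out.items()})
```

The Multibranch (MB) variant would instead instantiate one full encoder-plus-head network per parameter, trading speed for capacity.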
Note that, for processing irregular or non-equivalent crops with our design of the network input, in the case the encoder input resolution is lower than the input image crops (e.g. 5000x5000 for the GT and 256x256 for the network input), we split the image into C crops matching the QMRNet input, where C is the number of crops to generate for each sample (e.g. 10, 20, 50, 100, 200). In the case the crops are smaller than the encoder backbone input (e.g. 232x232 for the GT and 512x512 for the network input), we apply a circular padding on each border (width and/or height) to obtain a real image that preserves scaling and domain. The total number of hyperparameters to specify to design a specific QMRNet architecture is NxR, and it can be trained with distinct combinations of hyperparameters (NxCxR).
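The circular padding step above can be sketched with numpy; using `mode="wrap"` is an assumption matching "circular padding", and the toy sizes are illustrative (the iquaflow implementation may differ):

```python
# Sketch of circular padding: a crop smaller than the encoder input is
# padded with wrapped image content, preserving scale and domain statistics.
import numpy as np

crop = np.arange(12).reshape(3, 4)          # toy 3x4 crop
target_h, target_w = 5, 6                   # toy encoder input size
pad_h, pad_w = target_h - crop.shape[0], target_w - crop.shape[1]
padded = np.pad(crop, ((0, pad_h), (0, pad_w)), mode="wrap")
print(padded.shape)  # (5, 6)
```

Unlike zero padding, the wrapped borders contain real image content at the original scale, so the encoder does not see an artificial flat region.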
To train the QMRNet regressor, we select a training set and generate a set of distorted cases, which are parameterizable through our modifiers. The total number of training samples (dataset size) can be calculated as the product of the number of dataset images (I) and NxC (the number of parameter intervals and crops per sample). We can set distinct hyperparameters specific to training and validation, such as the number of epochs (e), batch size (bs), learning rate (lr), weight decay (wd), momentum, soft thresholding, etc.
2.3 QMRLoss: Learning Quality Metric Regression as Loss in SR
We designed a novel objective function that is able to optimize super-resolution algorithms towards a specific quality objective using QMRNet (see Figure 2).
Given a GAN or AutoEncoder network, we can add an ad-hoc module based on a specific parameter (or several parameters) of QMRNet. The QMRLoss is obtained by computing the classification error between the SR prediction and the original HR image. This classification error determines whether the SR image differs, in terms of a quality parameter objective (i.e. blur sigma, sharpness factor, GSD, RER or SNR), with respect to the HR. The QMRLoss has been designed to use any classification error (i.e. BCE, L1 or L2) and can be added to the Perceptual or Content Loss of the Generator (or the Decoder for Autoencoders) in order to tune the SR to the quality objective.
The objective function for image generation algorithms is based on minimizing the Generator (G) error (which compares HR and SR) while maximizing the Discriminator (D) error (which tests whether the SR image is real or fake).
During training, G is optimized upon the generator loss $\mathcal{L}_G$, which considers the content loss $\mathcal{L}_{content}$ and the adversarial loss $\mathcal{L}_{adv}$. We added a new term, $\mathcal{L}_{QMR}$, which will be our loss function based on quality objectives (QMRNet). Note that here we consider the generator output $SR = G(LR)$ as the prediction image. The term $\mathcal{L}_{QMR}$ calculates the quality parameter difference between HR images and SR images, regularized by the constant $\alpha$. This is done by computing the classification error $\ell$ (L1, L2 or BCE) between the output of the QMRNet heads for each case:

$\mathcal{L}_{QMR} = \alpha \, \ell\big(\mathrm{QMRNet}(SR), \mathrm{QMRNet}(HR)\big)$
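The QMRLoss computation can be sketched as follows; the toy "QMRNet" here is a placeholder predicting 10 interval probabilities, and `qmr_loss` with its `alpha` default are hypothetical names, not the iquaflow API:

```python
# Hedged sketch of QMRLoss: the classification error between the quality
# intervals predicted for SR and for HR, scaled by a constant alpha.
import torch
import torch.nn as nn

def qmr_loss(qmrnet, sr, hr, alpha=0.2, criterion=nn.BCELoss()):
    with torch.no_grad():
        target = qmrnet(hr)    # HR quality intervals act as the target
    pred = qmrnet(sr)          # SR must match the HR quality profile
    return alpha * criterion(pred, target)

# Toy quality regressor: 10 interval probabilities per image
toy_qmr = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10), nn.Sigmoid())
sr = torch.rand(1, 3, 8, 8)
hr = torch.rand(1, 3, 8, 8)
loss = qmr_loss(toy_qmr, sr, hr)
print(loss.item() >= 0.0)
```

In training, this term would be added to the generator's content/perceptual loss, so gradients push the SR output toward the quality intervals measured on the HR reference.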
3.1 Experimental Setup
For training the QMRNet we collected 30 cm/pixel data from the INRIA Aerial Image Labeling Dataset (both training and validation using the Inria-AILD sets). For testing our network, we selected all 11 subsets from distinct EO datasets: USGS, UCMerced, Inria, DeepGlobe, Shipsnet and Xview (see Table 1).
In order to validate the training regime, we set several evaluation metrics that provide interval dependencies for each prediction; namely, intervals that are closer to the target interval are considered better predictions than further ones. This means that, given an unblurred image, a predicted blur interval far from the GT will be a worse prediction than predictions closer to the GT. For this, we considered retrieval metrics such as medR or recall rate at K (R@K) [Carvalho et al., 2018, Salvador et al., 2017], as well as performance statistics (Precision, Recall, Accuracy, F-Score) at different intervals close to the target (Precision@K, Recall@K, Accuracy@K, F-Score@K) and the overall Area Under ROC (AUC). The retrieval metric medR measures the median absolute interval difference between classes; for example, with 10 classes for the GSD modifier, predictions that are one interval away from their targets yield a medR of 1.0, while predictions nine intervals away yield a medR of 9.0. R@K measures the total recall (whether the prediction is within an interval distance from the target lower than K) over a target window (i.e. if there are 40 classes and K is 10, only the 10 classes around the target label are considered for evaluation).
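Both retrieval metrics reduce to simple operations on the absolute interval distance between targets and predictions. A minimal numpy sketch (`medr` and `recall_at_k` are hypothetical helper names):

```python
# medR: median absolute interval distance between targets and predictions.
# R@K: share of predictions within K intervals of the target.
import numpy as np

def medr(targets, preds):
    return float(np.median(np.abs(np.asarray(targets) - np.asarray(preds))))

def recall_at_k(targets, preds, k):
    d = np.abs(np.asarray(targets) - np.asarray(preds))
    return float(np.mean(d <= k))

targets = [3, 3, 3, 3]
preds   = [2, 4, 3, 8]                 # predicted interval indices
print(medr(targets, preds))            # 1.0
print(recall_at_k(targets, preds, 1))  # 0.75
```

Both metrics reward near-misses: the prediction of interval 8 for a target of 3 hurts medR and R@1, while the off-by-one predictions barely do.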
In Table 5 we add another quality metric in addition to the modifier-based ones: an overall quality score. For this we defined a basis that describes the overall quality ranking (set from 0.0 to 1.0) of an image or dataset. This is calculated by measuring the weighted mean of the metrics, each metric with its own objective target (min or max), as described in Table 2, columns 5-6.
For a specific quality metric we define the total range of the metric (i.e. for the blur sigma it would be 2.5-1.0, namely 1.5), an objective value (i.e. for sigma it would be the minimum, namely 1.0) and an importance weight that defines the weights for the total weighted sum (by default, keeping the same importance for each metric, the weight is the inverse of the total number of modifiers; in our case, 5).
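The weighted score described above can be sketched by normalizing each metric by its range toward its objective and averaging with equal weights. The ranges and objectives below are illustrative assumptions taken from the modifier table, not the exact configuration used in the experiments:

```python
# Sketch of the overall quality score: per-metric range normalization
# toward a min/max objective, then an equally-weighted mean in [0, 1].
import numpy as np

# (value, range_min, range_max, objective) per modifier metric
metrics = {
    "sigma":     (1.3, 1.0, 2.5, "min"),     # lower blur is better
    "sharpness": (2.0, 1.0, 10.0, "min"),    # closer to unsharpened is better
    "rer":       (0.45, 0.15, 0.55, "max"),  # higher edge response is better
    "snr":       (27.0, 15.0, 30.0, "max"),  # higher SNR is better
}

def quality_score(metrics):
    w = 1.0 / len(metrics)                   # equal importance per metric
    score = 0.0
    for value, lo, hi, objective in metrics.values():
        norm = (value - lo) / (hi - lo)      # normalize to [0, 1]
        score += w * (norm if objective == "max" else 1.0 - norm)
    return score

s = quality_score(metrics)
print(0.0 <= s <= 1.0)  # True
```

Metrics whose objective is a minimum contribute their complement, so a score of 1.0 means every metric sits exactly at its objective value.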
Training and Validation
We trained our network with the Inria-AILD sets of 10 and 180 images, respectively, for the short, train and test subsets (Inria-AILD-10, Inria-AILD-train, Inria-AILD-test), with splits of 80/20%, selecting 100 images for training and 20 for validation (proportionally 45% and 12% of the total, respectively). We processed all samples of the dataset with distinct intervals for each modifier (thus, we annotated each sample with that modification interval) and built our network with a distinct head for each parameter (blur sigma, sharpness factor, GSD, RER, SNR). We selected a distinct set of crops for each resolution (CxR): in this case 10 crops of 1024x1024, 20 crops of 512x512, 50 crops of 256x256, 100 crops of 128x128 and 200 crops of 64x64. Thus, we generate datasets with different input resolutions while adapting the total domain capacity. The total number of trained images becomes 180xNxC (e.g. the 64x64 image set contains 1.8M crop samples).
We ran our training and validation experiments for 200 epochs with distinct hyperparameters: learning rate, weight decay, momentum and a soft threshold of 0.3 (to filter soft labels into hard/one-hot labels). Due to computational capacity, the training batch sizes were selected according to the resolution of each set.
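One plausible reading of the soft-threshold step can be sketched as follows. The exact mechanism is not fully specified here, so this is an assumption: probabilities below the threshold are zeroed and the remainder renormalized, collapsing toward a one-hot target; the function name is ours:

```python
def harden_labels(soft_labels, threshold=0.3):
    """Filter a soft class distribution toward a hard/one-hot target:
    probabilities below the threshold are zeroed, the rest renormalized."""
    kept = [p if p >= threshold else 0.0 for p in soft_labels]
    total = sum(kept)
    if total == 0.0:  # fall back to an argmax one-hot if everything was filtered
        hard = [0.0] * len(soft_labels)
        hard[soft_labels.index(max(soft_labels))] = 1.0
        return hard
    return [p / total for p in kept]
```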
In Table 4 we show validation results (on Inria-AILD-180-test) for QMRNet with a ResNet18 backbone trained on Inria-AILD-180-train. We can observe that the overall medRs are around 1.0 (predictions are about one interval away from their targets) and recall rates are around 70% for exact-match top-1 (R@1) and around 100% for R@5 and R@10 (prediction within an interval distance of 5 or 10 from the target, respectively). This means our network is able to predict each modifier parameter (blur σ, sharpness, GSD, RER and SNR) with very high retrieval precision, even when there are 50 interval classes. In terms of crop size, the best results mostly correspond to the higher input resolutions.
In Figure 3 we can see that the worst predictions for blur, sharpness, RER and SNR appear mainly on crops with sparse features, namely, when most of the image carries limited pixel information (i.e. similar pixel values), such as sea or flat terrain surfaces. This is because the preprocessed samples show few or no dissimilarities across the modifier parameters. This also affects dataset evaluation: the sparser the surfaces, the harder the predictions.
3.2 Results on QMRNet for IQA: Benchmarking Image Datasets
We ran our QMRNet with ResNet18 backbone over all the sets described in Table 1 (see the EO dataset evaluation use case at https://github.com/dberga/iquaflow-qmr-eo). Although trained uniquely on Inria-AILD-180-train, our network is able to adapt, predicting feasible quality metrics (blur σ, GSD, sharpness, RER and SNR) over each of the distinct datasets. With QMRNet finetuned on Inria-AILD-180-train, the overall predicted blur σ for most of the datasets corresponds to the (originally unblurred) ground truth, except for USGS279 and Inria-AILD-test180, which score slightly higher. For the sharpness factor, most datasets score as unblurred/unsharpened, but cases such as UCMerced380 and Shipsnet appear to be oversharpened. Most datasets present consistent overall predictions for RER and SNR. The highest-scoring datasets are Inria-AILD-180-train, UCMerced2100 and USGS279, here considering the same weight for each modifier metric.
3.3 Results on QMRNet for IQA: Benchmarking Image Super-Resolution
Here we selected a set of super-resolution algorithms previously tested on high-quality real-image SR benchmarks such as BSD100, Urban100 or Set5x2 (https://paperswithcode.com/task/image-super-resolution), which we now apply to EO data and metrics. We benchmark their performance considering Full-Reference, No-Reference and our QMRNet-based metrics (see the super-resolution benchmark use case at https://github.com/dberga/iquaflow-qmr-sisr). QMRNet allows us to check the amount of each distortion for every transformation (LR) applied to the original image (HR), whether it is the usual x2, x3, x4 downsampling or a specific distortion such as blurring.
Concretely, we tested our UCMerced subset of 380 images with 256x256 crops on autoencoder algorithms (FSRCNN and MSRN) and on GAN-based and self-supervised architectures such as SRGAN, ESRGAN, CAR and LIIF. All model checkpoints are vanilla (default hyperparameter settings) except for input scaling (x2, x3, x4); for MSRN we additionally computed the vanilla (non-finetuned) model and two finetuned variants: one finetuned on Inria-AILD-train180 (architecture with 4 scales) and one finetuned with added noise.
In Table 6 we evaluate each modifier parameter for every super-resolution algorithm, as well as the overall Score across all quality metric regressions. We tested the algorithms considering x2, x3 and x4 downsampled input, as well as the case of adding a blur filter with a fixed sigma. QMRNet is able to predict that LR gives the worst ranking for most metrics. FSRCNN and SRGAN give similar results on most metrics, with SRGAN slightly better on the blur and sharpness metrics. MSRN shows the best SNR results in most resolution cases, similarly to SRGAN. For overall Scores, CAR presents the best results on most metrics, with the highest Score ranking in most downsampling cases. However, CAR ranks worst on noise metrics: as mentioned earlier, it presents oversharpening and hallucinations, which can trick metrics that measure blur but worsen those that predict signal-to-noise ratios. In contrast, LIIF presents the worst results on blur and RER metrics, as its output appears slightly blurred, but shows good metrics overall for the rest of the modifiers.
In Table 7 we show a benchmark of known Full-Reference metrics. For x2 super-resolution, MSRN (concretely, its finetuned variants) gets the best results on Full-Reference metrics, including SSIM, PSNR, SWD, FID, MSSIM, HAARPSI and MDSI. For x3 and x4, LIIF and CAR get the best results on most Full-Reference metrics, including PSNR, FID, GMSD and MDSI, ranking top-3 on most metric evaluations. We have to pinpoint that LIIF does not perform as well when the input (LR) has been blurred; here CAR is able to deblur the input better than the other algorithms. In Table LABEL:tab:sisr_snr we show No-Reference metric results, namely SNR, RER, MTF and FWHM. SRGAN, MSRN and LIIF present significantly better SNR results than the other algorithms. This means these algorithms in general do not add noise to the input, namely, the generated images do not contain artifacts that were not present in the original HR image. In this case, CAR gets the worst results for SNR but the best for RER, MTF and FWHM.
In our results for downsampled LR x3 (used for training and validation), we can qualitatively see in the road, building and land crops shown in Figures LABEL:fig:srimages1-LABEL:fig:srimages2 that FSRCNN and SRGAN present blurred outputs, similar to the LR. ESRGAN presents a much sharper output but seems to add extra noise at the edges. CAR seems to achieve better results, although in some cases it appears oversharpened (see the tennis courts in Figure LABEL:fig:srimages2, columns 8-10). In contrast, LIIF presents a better output with a slight blur.
In Figures LABEL:fig:srrimages1-LABEL:fig:srrimages2 we super-resolve the original UCMerced images x3 and observe that some algorithms, such as FSRCNN, SRGAN, ESRGAN and LIIF, present an output similar (blurred) to the GT, while others, such as CAR, present higher noise and oversharpening of borders, trying to enhance features of the image (here attempting to generate features with a GSD lower than 30 cm/px).
In this study, we implement an open-source tool (integrated in the IQUAFLOW framework) developed for assessing quality and modifying EO images. We propose a network architecture (QMRNet using VGG19) that predicts the amount of distortion for each parameter as a no-reference metric. We also benchmark distinct super-resolution algorithms and datasets with both full-reference and no-reference metrics, and propose a novel mechanism for optimizing super-resolution training regimes using QMRLoss, integrating QMRNet metrics with SR algorithm objectives.
On assessing the image quality of datasets, we observe a similar overall Score for most datasets, with dissimilarities in the SNR and RER scores. On assessing single-image super-resolution, we see significantly better results for CAR and LIIF. Optimizing MSRN with QMRLoss (on snr, rer and blur) improves results on both full-reference and no-reference metrics with respect to the default vanilla MSRN.
We have to pinpoint that our proposed method can be applied to any other kind of distortion or modification. QMRNet can predict any parameter of the image, including several parameters simultaneously. For instance, training QMRNet to assess compression parameters could be another use case of interest. We tested the usage of QMRNet as a loss for optimizing the SR of MSRN, but it could be extended to distinct algorithm architectures and uses, as QMRLoss allows reversing or denoising any modification of the original image. In addition, it is also possible to implement a variation of the QMRLoss objective by forcing the head prediction to a specific interval (with maximum quality and minimal distortion for each parameter), so that the algorithm optimizes toward a specific metric or score regardless of the output of QMRNet on the GT.
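The two QMRLoss variants above (matching QMRNet's prediction on the GT vs. forcing the best-quality interval) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: we assume the network returns a probability distribution over distortion intervals for one modifier (class 0 = least distorted), and the weighting factor `alpha` is hypothetical:

```python
import math

def cross_entropy(pred, target, eps=1e-12):
    """-sum target * log(pred), with eps for numerical stability."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

def qmr_loss(qmrnet, sr_image, gt_image=None, best_class=0, alpha=0.2):
    """QMRLoss sketch. If gt_image is given, pull the SR prediction toward
    QMRNet's prediction on the ground truth; otherwise pull it toward the
    best-quality interval regardless of QMRNet's output on the GT."""
    pred = qmrnet(sr_image)
    if gt_image is not None:
        target = qmrnet(gt_image)
    else:
        target = [0.0] * len(pred)
        target[best_class] = 1.0  # one-hot on the minimal-distortion interval
    return alpha * cross_entropy(pred, target)
```

In training, this term would be added to the SR algorithm's own reconstruction objective, so the generator is penalized whenever QMRNet detects residual distortion in its output.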
This project was financed by the Ministry of Science and Innovation (MICINN) and by the European Union (FEDER) through the RETOS-Collaboration call (RTC2019-007434-7) of the State Program of Research, Development and Innovation Oriented to the Challenges of Society, within the State Plan for Scientific and Technical Research and Innovation 2017-2020, whose main objective is to promote technological development, innovation and quality research.
- [Aaron et al., 2015] Aaron, A., Li, Z., Manohara, M., Lin, J. Y., Wu, E. C.-H., and Kuo, C.-C. J. (2015). Challenges in cloud based ingest and encoding for high quality streaming media. In 2015 IEEE International Conference on Image Processing (ICIP). IEEE.
- [Benecki et al., 2018] Benecki, P., Kawulok, M., Kostrzewa, D., and Skonieczny, L. (2018). Evaluating super-resolution reconstruction of satellite images. Acta Astronautica, 153:15–25.
- [Carvalho et al., 2018] Carvalho, M., Cadène, R., Picard, D., Soulier, L., Thome, N., and Cord, M. (2018). Cross-modal retrieval in the cooking context. In The 41st International ACM SIGIR Conference on Research Development in Information Retrieval. ACM.
- [Chen et al., 2021] Chen, L.-H., Bampis, C. G., Li, Z., Norkin, A., and Bovik, A. C. (2021). ProxIQA: A proxy approach to perceptual optimization of learned image compression. IEEE Transactions on Image Processing, 30:360–373.
- [Chen et al., 2020] Chen, Y., Liu, S., and Wang, X. (2020). Learning continuous image representation with local implicit image function. arXiv preprint arXiv:2012.09161.
- [Damera-Venkata et al., 2000] Damera-Venkata, N., Kite, T., Geisler, W., Evans, B., and Bovik, A. (2000). Image quality assessment based on a degradation model. IEEE Transactions on Image Processing, 9(4):636–650.
- [Dong et al., 2016] Dong, C., Loy, C. C., He, K., and Tang, X. (2016). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307.
- [Farsiu et al., 2004] Farsiu, S., Robinson, D., Elad, M., and Milanfar, P. (2004). Advances and challenges in super-resolution. International Journal of Imaging Systems and Technology, 14(2):47–57.
- [Gallés et al., 2022] Gallés, P., Takáts, K., Hernández-Cabronero, M., Berga, D., Pega, L., Riordan-Chen, L., Garcia-Moll, C., Becker, G., Garriga, A., Bukva, A., Serra-Sagristà, J., Vilaseca, D., and Marín, J. (2022). Iquaflow: A new framework to measure image quality. arXiv preprint arXiv:XXXX.XXXXX.
- [Huynh-Thu and Ghanbari, 2008] Huynh-Thu, Q. and Ghanbari, M. (2008). Scope of validity of PSNR in image/video quality assessment. Electronics Letters, 44(13):800.
- [Jiang et al., 2021] Jiang, Y., Gong, X., Liu, D., Cheng, Y., Fang, C., Shen, X., Yang, J., Zhou, P., and Wang, Z. (2021). EnlightenGAN: Deep light enhancement without paired supervision. IEEE Transactions on Image Processing, 30:2340–2349.
- [Kim et al., 2016] Kim, J., Lee, J. K., and Lee, K. M. (2016). Accurate image super-resolution using very deep convolutional networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
- [Kim et al., 2008] Kim, T., Kim, H., and Kim, H.-D. (2008). Image-based estimation and validation of NIIRS for high-resolution satellite images.
- [Kolouri et al., 2019] Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. (2019). Generalized sliced wasserstein distances. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- [Kruse et al., 1993] Kruse, F., Lefkoff, A., Boardman, J., Heidebrecht, K., Shapiro, A., Barloon, P., and Goetz, A. (1993). The spectral image processing system (SIPS)—interactive visualization and analysis of imaging spectrometer data. Remote Sensing of Environment, 44(2-3):145–163.
- [Leachtenauer and Driggers, 2022] Leachtenauer, J. C. and Driggers, R. G. (2022). Surveillance and Reconnaissance Imaging Systems.
- [Leachtenauer et al., 1997] Leachtenauer, J. C., Malila, W., Irvine, J., Colburn, L., and Salvaggio, N. (1997). General image-quality equation: GIQE. Applied Optics, 36(32):8322.
- [Ledig et al., 2017] Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., and Shi, W. (2017). Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
- [Li et al., 2018] Li, J., Fang, F., Mei, K., and Zhang, G. (2018). Multi-scale residual network for image super-resolution. In Computer Vision – ECCV 2018, pages 527–542. Springer International Publishing.
- [Li et al., 2014] Li, L., Luo, H., and Zhu, H. (2014). Estimation of the image interpretability of ZY-3 sensor corrected panchromatic nadir data. Remote Sensing, 6(5):4409–4429.
- [Lim et al., 2018] Lim, P.-C., Kim, T., Na, S.-I., Lee, K.-D., Ahn, H.-Y., and Hong, J. (2018). Analysis of UAV image quality using edge analysis. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-4:359–364.
- [Liu et al., 2021] Liu, C., Xu, J., and Wang, F. (2021). A review of keypoints’ detection and feature description in image registration. Scientific Programming, 2021:1–25.
- [Mittal et al., 2013] Mittal, A., Soundararajan, R., and Bovik, A. C. (2013). Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212.
- [Müller et al., 2020] Müller, M. U., Ekhtiari, N., Almeida, R. M., and Rieke, C. (2020). Super-resolution of multispectral satellite images using convolutional neural networks. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, V-1-2020:33–40.
- [Nafchi et al., 2016] Nafchi, H. Z., Shahkolaei, A., Hedjam, R., and Cheriet, M. (2016). Mean deviation similarity index: Efficient and reliable full-reference image quality evaluator. IEEE Access, 4:5579–5590.
- [Pradham et al., 2008] Pradham, P., Younan, N. H., and King, R. L. (2008). Concepts of image fusion in remote sensing applications. In Image Fusion, pages 393–428. Elsevier.
- [Reisenhofer et al., 2018] Reisenhofer, R., Bosse, S., Kutyniok, G., and Wiegand, T. (2018). A haar wavelet-based perceptual similarity index for image quality assessment. Signal Processing: Image Communication, 61:33–43.
- [Salimans et al., 2016] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., and Chen, X. (2016). Improved techniques for training GANs. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
- [Salvador et al., 2017] Salvador, A., Hynes, N., Aytar, Y., Marin, J., Ofli, F., Weber, I., and Torralba, A. (2017). Learning cross-modal embeddings for cooking recipes and food images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
- [Sheikh and Bovik, 2006] Sheikh, H. and Bovik, A. (2006). Image information and visual quality. IEEE Transactions on Image Processing, 15(2):430–444.
- [Sheikh and Bovik, 2005] Sheikh, H. R. and Bovik, A. C. (2005). Information theoretic approaches to image quality assessment. In Handbook of Image and Video Processing, pages 975–989. Elsevier.
- [Shermeyer and Etten, 2019] Shermeyer, J. and Etten, A. V. (2019). The effects of super-resolution on object detection performance in satellite imagery. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE.
- [Sun and Chen, 2020] Sun, W. and Chen, Z. (2020). Learned image downscaling for upscaling using content adaptive resampler. IEEE Transactions on Image Processing, 29:4027–4040.
- [Thurman and Fienup, 2008] Thurman, S. T. and Fienup, J. R. (2008). Analysis of the general image quality equation. In ur Rahman, Z., Reichenbach, S. E., and Neifeld, M. A., editors, SPIE Proceedings. SPIE.
- [Venkatanath et al., 2015] Venkatanath, N., Praneeth, D., Bh, M. C., Channappayya, S. S., and Medasani, S. S. (2015). Blind image quality evaluation using perception based features. In 2015 Twenty First National Conference on Communications (NCC). IEEE.
- [Wang et al., 2019] Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., and Loy, C. C. (2019). ESRGAN: Enhanced super-resolution generative adversarial networks. In Lecture Notes in Computer Science, pages 63–79. Springer International Publishing.
- [Wang and Bovik, 2002] Wang, Z. and Bovik, A. (2002). A universal image quality index. IEEE Signal Processing Letters, 9(3):81–84.
- [Wang et al., 2004] Wang, Z., Bovik, A., Sheikh, H., and Simoncelli, E. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612.
- [Xue et al., 2014] Xue, W., Zhang, L., Mou, X., and Bovik, A. C. (2014). Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing, 23(2):684–695.
- [Yamanaka et al., 2017] Yamanaka, J., Kuwashima, S., and Kurita, T. (2017). Fast and accurate image super resolution by deep CNN with skip connection and network in network. In Neural Information Processing, pages 217–225. Springer International Publishing.
- [Zeyde et al., 2012] Zeyde, R., Elad, M., and Protter, M. (2012). On single image scale-up using sparse-representations. In Curves and Surfaces, pages 711–730. Springer Berlin Heidelberg.
- [Zhang et al., 2018] Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE.
- [Zhu et al., 2016] Zhu, H., Song, W., Tan, H., Wang, J., and Jia, D. (2016). Super resolution reconstruction based on adaptive detail enhancement for ZY-3 satellite images. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, III-7:213–217.