Satellite imagery has been widely used in various areas including traffic monitoring [eslami2010automatic], land use/land cover change analysis [viana2019long], precision agriculture [yang2012using], natural disaster warning and management [verstappen1995aerospace], etc. For all these applications, spatial resolution of the imagery is a key factor.
So far, the lowest ground sample distance (GSD), which corresponds to the highest spatial resolution, of commercial satellite imagery products is 30cm. At the time of writing, 30cm GSD is only available from the WorldView-3 satellite. Other sub-meter imagery products typically have a GSD of 50cm (e.g. WorldView-2, GeoEye-1, Pleiades) or 80cm (e.g. IKONOS, SkySat series). Although these are all considered very high resolution (VHR) satellites, their GSD is still not low enough for the applications mentioned above. For example, in traffic monitoring, vehicles are represented by only a small number of pixels, and hence the detection algorithm is very sensitive to the surrounding context [eslami2010automatic]. On the other hand, enhancing resolution through imaging hardware improvements is expensive and technically challenging [hajlaoui2010satellite], which makes software-based image super-resolution (SR) techniques attractive in practice.
In recent years, the most successful single image SR algorithms have been learning based [yang2019tmm]. The first convolutional neural network (CNN) based SR method was developed by Dong et al. [dong2015image]; it contains only 3 convolutional layers and outputs a high-resolution (HR) image directly from its low-resolution (LR) input. Kim et al. [kim2016accurate] use a deep CNN with 20 layers to generate the high-frequency components (residual image) of the HR output. Later, a generative adversarial network (GAN) was introduced into the training process to make the outputs photo-realistic [ledig2017photo]. Johnson et al. [Johnson2016eccv] proposed perceptual losses to obtain visually pleasing SR results. These works train models on synthetic data and thus do not generalize well to real-world applications, such as zoom for mobile phone cameras, or to complicated degradations. The current trend in single image SR is therefore to solve real-world problems [Chen_2019_CVPR] or to consider more sophisticated degradations [Zhang_2018_CVPR].
There are also several SR approaches specifically designed for satellite imagery. Some of them are essentially implementations of existing learning based approaches with some modifications, using satellite images as training data [liebel2016single, pouliot2018landsat]. GANs are also used to improve low-resolution texture restoration [reshad2019deep, bosch2018super]. Jiang et al. [jiang2019edge] developed a method that combines residual image enhancement with a GAN. However, none of them pays much attention to the actual degradation model of satellite images. For learning based SR, the degradation model is embedded in the training data, especially in the LR image simulation. In [liebel2016single, reshad2019deep] the LR training data is created via simple down-scaling of HR images, which essentially assumes the following degradation model:
$$y = (x \otimes k)\downarrow_s \tag{1}$$
where $y$ is the observed LR image, and $x$ denotes the latent HR image. $k$ represents a fixed blur kernel (e.g. bicubic kernel), $\downarrow_s$ represents the down-sampling operation with factor $s$, and $\otimes$ is a convolution operator. This model has been widely used in many natural image SR approaches, which treat SR as a non-blind deconvolution problem.
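As a minimal sketch of the simple down-scaling model in (1), the snippet below blurs an HR image with a fixed kernel and then keeps every s-th pixel. The function names (`conv2d_same`, `degrade_simple`), the edge-replicated border handling, and the 3×3 box kernel in the usage lines are illustrative assumptions, not part of the cited methods; a bicubic kernel would be passed in the same way.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Convolve a 2-D image with an odd-sized 2-D kernel, keeping the image size."""
    kh, kw = kernel.shape
    assert kh % 2 == 1 and kw % 2 == 1, "kernel dimensions must be odd"
    ph, pw = kh // 2, kw // 2
    # Replicate border pixels so the output has the same size as the input.
    padded = np.pad(img.astype(np.float64), ((ph, ph), (pw, pw)), mode="edge")
    flipped = kernel[::-1, ::-1]  # flip the kernel for true convolution
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + h, j:j + w]
    return out

def degrade_simple(hr, kernel, s):
    """Model (1): blur with a fixed kernel, then down-sample by an integer factor s."""
    return conv2d_same(hr, kernel)[::s, ::s]

# Usage with a hypothetical 3x3 box kernel and a random stand-in for an HR patch.
rng = np.random.default_rng(0)
hr_patch = rng.random((64, 64))
box = np.full((3, 3), 1.0 / 9.0)
lr_patch = degrade_simple(hr_patch, box, s=2)
```

Note that nothing in this pipeline adds noise or varies the kernel, so every LR sample it produces is degraded in exactly the same way.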
Fig. 1: Satellite image noise analysis. (a) Noise sample extracted from a Pleiades image. (b) FFT of (a). (c) Blur kernel estimated from (b). (d) Noise simulated by convolving WGN with (c).
Unfortunately, the model in (1) is not realistic for commercial satellite images. First, it lacks a noise term, whereas most satellite images are highly noisy. Second, the point spread function (PSF) of the satellite imaging system needs to be considered, and a bicubic kernel is not a good approximation of real PSFs. In fact, the PSFs of most time delay and integration (TDI) sensors on satellites are spatially variant due to imaging hardware limitations [hajlaoui2010satellite]. There is also motion blur caused by satellite movement or sensor scanning. In other words, satellite image super-resolution is closer to a blind deconvolution problem. Third, the model ignores the fact that most commercial satellite image products are processed by their providers after imaging. This processing usually includes resampling, which further blurs the images and changes the noise distribution so that the noise can no longer be treated as white Gaussian noise (WGN). All these factors need to be considered in training data generation.
In this paper, we propose a realistic training data generation model with spatially variant PSFs, based on our analysis of commercial satellite images. We also propose a CNN-based super-resolution model that is able to handle variant PSFs and different degrees of aliasing. We use a residual CNN architecture similar to [ledig2017photo], with some modifications to make the model more efficient. In the experiments section we show its performance on real satellite imagery products.
2.1 Degradation Model
We use the following model to describe the degradation process of commercial satellite images:
$$y = \mathcal{R}\Big(\big((x \otimes p)\downarrow_s + n\big) \otimes r\Big) \tag{2}$$
Here $y$ and $x$ again denote the observed LR image and the latent HR image, $\otimes$ denotes convolution, and $\downarrow_s$ denotes down-sampling. $p$ is the PSF of the imaging system, and it is assumed to vary spatially and across images, though it should vary slowly and can be viewed as invariant in small local regions of $x$ [hajlaoui2010satellite]. $n$ denotes additive noise from the imaging system, which can be approximated as WGN. $r$ represents a resampling kernel introduced in the post-process on the ground, and $\mathcal{R}(\cdot)$ denotes a resampling operation. Resampling is needed in the post-process in order to align pixels from multiple channels to a target coordinate grid, which is associated with the image's camera model. Locally, $\mathcal{R}$ can be approximated as spatial shifting.
By merging the post-process and imaging model together we can rewrite (2) as
$$y = (\tilde{x} \otimes h)\downarrow_s + \tilde{n} \tag{3}$$
where $\tilde{x}$ denotes the HR image on the up-sampled target coordinate grid, and $h$ represents a kernel mixing the effect of the PSF $p$ and the resampling kernel $r$ together. $h$ is spatially varying within and across images from the same satellite sensor, but in a local image area it can be treated as invariant. $\tilde{n}$ denotes the final noise effect, and can be treated as WGN convolved by $r$.
Fig. 1(a) shows a noise sample extracted from a flat area of a 16-bit Pleiades image. Its 2-D Fourier transform in Fig. 1(b) indicates that the noise is colored. Our analysis further shows that such noise can be simulated by convolving WGN with an estimated kernel (see Fig. 1(c) and (d)), which matches the proposed model in (2).
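The noise analysis above can be imitated numerically. The sketch below, a hedged illustration rather than the estimation procedure behind Fig. 1, convolves WGN with a kernel via the FFT and checks that the result is colored, i.e. that neighbouring pixels become correlated. The 3×3 box kernel is a hypothetical stand-in for an estimated noise kernel, and the circular (wrap-around) boundary handling is a simplification.

```python
import numpy as np

def colored_noise(shape, kernel, sigma, rng):
    """Simulate colored noise by circularly convolving WGN with a kernel."""
    wgn = rng.normal(0.0, sigma, size=shape)
    # Multiplication in the Fourier domain equals circular convolution in space.
    spectrum = np.fft.fft2(wgn) * np.fft.fft2(kernel, s=shape)
    return np.fft.ifft2(spectrum).real

def neighbor_correlation(noise):
    """Correlation coefficient between horizontally adjacent pixels."""
    return np.corrcoef(noise[:, :-1].ravel(), noise[:, 1:].ravel())[0, 1]

rng = np.random.default_rng(0)
kernel = np.full((3, 3), 1.0 / 9.0)  # hypothetical stand-in for an estimated kernel
white = rng.normal(0.0, 1.0, size=(128, 128))
colored = colored_noise((128, 128), kernel, sigma=1.0, rng=rng)

# White noise has uncorrelated neighbours; the filtered noise does not.
print(neighbor_correlation(white), neighbor_correlation(colored))
```

In practice the kernel would be estimated per satellite product from flat image regions; the box kernel here only demonstrates the mechanism.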
PSFs are also analyzed. We estimated PSFs from local image regions via a shock filter based method similar to [money2008total]. Three PSFs estimated from GeoEye-1 images are shown in Fig. 2; they vary not only in their spread but also in their shape. Mild motion blur is observed (see the 3rd PSF in Fig. 2), and since it is along the column direction it could be introduced by sensor scanning.
2.2 Training Data Generation
Instead of using satellite images with relatively low SNR and potential motion blur to generate training data, as most existing methods do, we choose Google-owned aerial images as the source for creating synthetic LR images. LR images are generated via (3). Noise is simulated by convolving WGN with an estimated kernel (such as the one in Fig. 1(c)). Each satellite product type has a corresponding noise kernel.
The blur kernel in (3), denoted $h$, is simulated via a 2-D elliptical Gaussian mixture model:
$$h(\mathbf{u}) = \frac{1}{Z}\sum_{i} \alpha_i \exp\!\left(-\tfrac{1}{2}\,\mathbf{u}^{T}\Sigma_i^{-1}\mathbf{u}\right) \tag{4}$$
where $\mathbf{u}$ is the 2-D pixel coordinate, each $\Sigma_i$ controls the shape of the $i$-th Gaussian component, and $\alpha_i$ denotes its contribution. $Z$ is the overall normalization factor. Several PSFs with various shapes are estimated from real images, and for each PSF a set of $\{\Sigma_i, \alpha_i\}$ is derived via least-squares fitting. When generating an LR image, a set of $\{\Sigma_i, \alpha_i\}$ is randomly selected. We then add a little noise to the selected parameters to further vary the shape of the synthetic PSF. We also vary the down-scaling factor within a small range to simulate the blur and aliasing variation we observed in real satellite images.
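A 2-D elliptical Gaussian mixture kernel of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the fitting or sampling code used for the experiments: the components are assumed to be centered, each shape is parameterized by a 2×2 covariance matrix, and the random perturbation is implemented as a multiplicative rescaling of the axes (which keeps the covariance positive definite). The function names and the example parameters are hypothetical.

```python
import numpy as np

def gaussian_mixture_psf(covs, weights, size=15):
    """Build a normalized PSF as a weighted sum of centered elliptical Gaussians."""
    assert size % 2 == 1, "size must be odd so the kernel has a center pixel"
    half = size // 2
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    coords = np.stack([xx.ravel(), yy.ravel()], axis=1).astype(np.float64)
    psf = np.zeros(coords.shape[0])
    for cov, w in zip(covs, weights):
        inv = np.linalg.inv(np.asarray(cov, dtype=np.float64))
        # Quadratic form u^T inv u evaluated at every pixel coordinate u.
        quad = np.einsum("ni,ij,nj->n", coords, inv, coords)
        psf += w * np.exp(-0.5 * quad)
    psf = psf.reshape(size, size)
    return psf / psf.sum()  # overall normalization

def jitter_covariance(cov, rng, sigma=0.05):
    """Randomly rescale the axes of a covariance; the result stays positive definite."""
    d = np.diag(np.exp(rng.normal(0.0, sigma, size=2)))
    return d @ np.asarray(cov, dtype=np.float64) @ d

rng = np.random.default_rng(0)
covs = [[[4.0, 0.0], [0.0, 1.0]], [[1.5, 0.5], [0.5, 2.0]]]  # hypothetical fit
weights = [0.7, 0.3]
psf = gaussian_mixture_psf([jitter_covariance(c, rng) for c in covs], weights, size=11)
```

Drawing a fresh jitter for every training patch gives each synthetic LR image a slightly different PSF, which is the variation the generation process aims for.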
2.3 Network Architecture
Our CNN architecture is shown in Fig. 3. This CNN is similar to the model used by Ledig et al. [ledig2017photo]. All intermediate layers are convolutional and are followed by Leaky ReLU activations. We use a stack of identical residual blocks, which accounts for a large filter footprint. This allows for a more effective model, especially when dealing with large PSFs from satellite images.
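The link between depth and footprint follows from simple arithmetic: each stride-1 convolution with a k×k filter widens the footprint by k−1 pixels per side. The sketch below illustrates this; the 3×3 filter size, the two convolutions per block, and the block counts are assumptions chosen for illustration, not the configuration of the network in Fig. 3.

```python
def receptive_field(num_conv_layers, kernel_size):
    """Footprint (in pixels per side) of a stack of stride-1 convolutions."""
    return 1 + num_conv_layers * (kernel_size - 1)

# Hypothetical configuration: 3x3 filters, 2 convolutions per residual block.
for num_blocks in (1, 4, 16):
    side = receptive_field(2 * num_blocks, kernel_size=3)
    print(f"{num_blocks} blocks -> {side}x{side} footprint")
```

The footprint grows linearly with the number of blocks, so a deep stack of small filters can cover a PSF that spans many pixels.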
Input and output of our network are 3-channel pan-sharpened RGB images in linear space. We also tried YCbCr color space, where only Y channel was fed to the network. It turns out that using RGB images in our training leads to better noise suppression.
Note that the input image needs to be upscaled to the target HR image size with bicubic interpolation before being fed to the CNN model. This enables the network to handle flexible upscaling factors, and hence to accurately adjust the GSD of its output images.
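Because upscaling happens before the network, the output size depends only on the ratio of input GSD to target GSD. A small sketch of that arithmetic follows; the function name and the rounding to whole pixels are assumptions made for illustration.

```python
def upscaled_shape(shape, input_gsd_cm, target_gsd_cm):
    """Output (rows, cols) after resampling an image to a finer target GSD."""
    scale = input_gsd_cm / target_gsd_cm
    rows, cols = shape
    return round(rows * scale), round(cols * scale)

# A 50cm image resampled to a 25cm GSD doubles in size along each axis.
print(upscaled_shape((512, 512), input_gsd_cm=50, target_gsd_cm=25))
```

Non-integer ratios, such as 50cm to 30cm, are handled the same way, which is what makes the pre-upscaling design flexible.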
2.4 Training Loss
Similar to the framework of [Talebi2018noref, talebi2018learned], our training loss has two terms, a fidelity loss and a perceptual loss:
$$\mathcal{L}(W) = \rho\big(f_W(\hat{y}) - z\big) + \lambda\,\phi\big(f_W(\hat{y})\big) \tag{5}$$
The fidelity loss $\rho$ enforces closeness of the output image to the ground truth high-resolution image z. $f_W$ denotes our CNN model with trainable weights W, $\hat{y}$ is the bicubic upscaled input image, and $\lambda$ weights the perceptual term. The fidelity loss is a pseudo-Huber function [pseudoHuber]
$$\rho(e) = \delta^2\left(\sqrt{1 + (e/\delta)^2} - 1\right)$$
with scale parameter $\delta$, which is a smooth approximation of the Huber loss function: it behaves like the squared loss for small differences and like the absolute loss for large ones, while remaining strictly convex.
The perceptual loss $\phi$ is a neural network trained for no-reference image quality assessment [Talebi2018noref]. This network is differentiable, and can be plugged into our training framework. The function $\phi$ is inversely related to the predicted quality score, which is positive and bounded by its maximum possible value. We observed that setting an appropriate weight $\lambda$ can improve upon the fidelity loss by adding more fine-grained details into the output image. We keep $\lambda$ fixed during training.
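The behaviour of the pseudo-Huber fidelity term can be checked with a few lines of code. The sketch below uses the standard pseudo-Huber definition and combines it with a perceptual penalty supplied by the caller. The scale parameter `delta`, the averaging over pixels, the `weight` argument, and the function names are assumptions for illustration; the quality-assessment network itself is not modelled here.

```python
import numpy as np

def pseudo_huber(residual, delta=1.0):
    """Standard pseudo-Huber penalty: quadratic near zero, linear for large residuals."""
    r = np.asarray(residual, dtype=np.float64)
    return delta ** 2 * (np.sqrt(1.0 + (r / delta) ** 2) - 1.0)

def total_loss(output, target, perceptual_penalty, weight, delta=1.0):
    """Fidelity term (mean pseudo-Huber) plus a weighted perceptual penalty."""
    fidelity = pseudo_huber(np.asarray(output) - np.asarray(target), delta).mean()
    return fidelity + weight * perceptual_penalty

# Small residuals are penalized roughly as r^2 / 2, large ones roughly as |r|.
small, large = pseudo_huber(1e-3), pseudo_huber(1e4)
```

Because the penalty grows only linearly for large residuals, occasional outlier pixels influence the gradient less than they would under a squared loss.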
3 Experimental Results
Training data is generated via the process described in Section 2.2. Aerial images are in linear color space, and they are divided into patches. Each satellite sensor has a corresponding training dataset with the estimated noise kernel and PSF parameters, and each dataset contains 20,000 patches.
We first show the results with synthetic images generated through the same process as the training data. Three examples with different blur and aliasing conditions are given in Fig. 4. All the restored images look close to their HR version. This illustrates our neural network’s ability to blindly remove blur (including motion blur) and aliasing artifacts.
The trained neural network is then applied to real satellite images. Results from several 50cm GeoEye-1 images are shown in Fig. 5. The target GSD is 25cm, which is lower than that of any existing commercial satellite. Many useful high-frequency image components, including vehicle details, pedestrian crossing lines, solar panel cells, and pipe infrastructure on building roofs, have been successfully restored. Results from the same network trained with the bicubic down-scaling model (1) are also given for comparison. Although their sharpness also improves compared with bicubic interpolation, the improvement is much more limited.
To quantitatively evaluate the performance of the proposed training data generation model, we randomly sampled 364 GeoEye-1 images, which were then up-scaled using bicubic interpolation, SR with generation model (1), and SR with the proposed model (3), respectively. We sent the images to ordinary viewers for visual quality evaluation, and the mean opinion score (MOS) of each image was derived from 30 responses. The MOS histograms of the three methods are shown in Fig. 6, where the proposed model significantly outperforms the other two.
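For readers unfamiliar with MOS, the per-image score is simply the mean of the viewer ratings collected for that image. A minimal sketch follows; the image identifiers, the 1 to 5 rating scale, and the three ratings per image are hypothetical placeholders, not data from the study.

```python
from statistics import mean

def mean_opinion_scores(responses):
    """Map each image id to the mean of its viewer ratings."""
    return {image_id: mean(ratings) for image_id, ratings in responses.items()}

# Hypothetical ratings on a 1-5 scale (three per image here for brevity).
responses = {
    "img_001": [4, 5, 4],
    "img_002": [2, 3, 3],
}
mos = mean_opinion_scores(responses)
```

Computing this dictionary once per method and histogramming its values gives a plot of the kind summarized in the evaluation.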
We have proposed a realistic SR training data generation model for commercial satellite images. The model includes not only the imaging process on satellites but also the post-process on the ground. An SR neural network is also developed to apply this model. Experiments show that our method is able to recover fine details from real satellite images.
So far, the parameters of the training data generation model need to be manually estimated and tuned from satellite image samples. In the future, we will explore using a GAN to automatically generate the parameters given HR source images and the target LR image samples.