Super-resolution reconstruction (SRR) aims at generating a high-resolution (HR) image from a low-resolution (LR) observation (a single image or multiple images). SRR is a thoroughly explored research topic of considerable practical potential, as effective SRR techniques may allow for overcoming the spatial resolution limitations of imaging sensors, which is a common problem in remote sensing.
1.1 Related work
Existing single-image SRR methods can be categorized into: (i) frequency-domain techniques, (ii) reconstruction-based methods, which exploit prior knowledge on the object appearance, and (iii) algorithms that learn the mapping between LR and HR. Recently, we have witnessed a breakthrough in learning-based single-image SRR, attributed to the use of deep convolutional neural networks (CNNs). Deep learning SRR originates from sparse coding, aimed at creating a dictionary of LR patches associated with their HR counterparts. The reconstruction consists in exploiting that dictionary to convert each LR patch from the input image into its HR counterpart.
Super-resolution CNN (SRCNN), followed by its faster version (FSRCNN), was proposed for learning the LR-to-HR mapping from a number of LR–HR image pairs. Despite its relatively simple architecture, SRCNN outperforms the state-of-the-art example-based methods. SRCNN has also been trained with Sentinel-2 images, which, according to the authors, improved its capacity to enhance satellite data. Certain limitations of SRCNN were addressed with a very deep super-resolution network, trained relying on fast residual learning. Domain expertise was exploited in a sparse coding network, achieving high training speed and model compactness. Recently, generative adversarial networks have been actively explored for SRR. They are composed of a generator (a ResNet), trained to perform SRR, and a discriminator, which tries to distinguish the reconstruction outcomes of the generator from real HR images.
Deep CNNs for SRR are trained from a dataset of corresponding LR–HR patches. As deep networks commonly require huge amounts of training data, LR images are obtained by subjecting the original HR images to a degradation procedure based on an assumed imaging model. In most works [3, 7, 8], bicubic downsampling is applied to transform HR into LR, and in some cases the training set is additionally augmented with translation, rotation, and scaling. However, it has not been analyzed whether and how the training set (including the degradation procedure used to create it) influences the reconstruction accuracy.
In this paper, our contribution consists in addressing the aforementioned research gap. We investigate the influence of the training set used for training a CNN on the reconstruction performance. We trained two different CNNs (Section 2) with natural images from the DIV2K single-image SRR benchmark, and with Sentinel-2 images. The trained CNNs are tested in two settings: for reconstructing artificially-degraded satellite images (the original images are treated as reference HR data), as well as in a real-world scenario, for original Sentinel-2 images matched with SPOT and Digital Globe WorldView-4 images of the same region. The results of our extensive experiments (reported in Section 3) indicate that the degradation procedure used for creating the training set plays a pivotal role here. Not only does it have a larger impact on the SRR performance than the domain of images exploited for training (natural vs. satellite), but it is also more important than the choice of the CNN architecture.
2 Deep learning for super-resolution
In this work, we exploit two CNNs of different complexity, namely FSRCNN, which is a relatively shallow CNN, and a much deeper residual network (SRResNet), to investigate their behavior in different training scenarios.
Figure 1 shows the architecture of FSRCNN. The network is composed of five major parts aimed at: (i) feature extraction, realized by the first convolutional layer (denoted as Conv), (ii) shrinking, which reduces the number of extracted features, (iii) non-linear mapping, performed by multiple convolutional layers, (iv) expansion, which inverses the shrinking and restores the original dimensionality of the feature vectors, and (v) deconvolution, which produces the reconstructed HR image. FSRCNN can be trained faster than SRCNN and it offers real-time performance after training [3, 4].
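The five-stage pipeline above can be sketched in Keras (the framework used in Section 3). The hyper-parameter values below (number of features d, shrunk dimensionality s, mapping depth m, and the kernel sizes) follow the original FSRCNN paper and are placeholders only; the values used in this study are not stated in this excerpt.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_fsrcnn(scale=2, d=56, s=12, m=4, channels=1):
    """Sketch of the five-part FSRCNN architecture; d, s, m are placeholder values."""
    inp = layers.Input(shape=(None, None, channels))
    x = layers.Conv2D(d, 5, padding='same', activation='relu')(inp)   # (i) feature extraction
    x = layers.Conv2D(s, 1, padding='same', activation='relu')(x)     # (ii) shrinking
    for _ in range(m):                                                # (iii) non-linear mapping
        x = layers.Conv2D(s, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(d, 1, padding='same', activation='relu')(x)     # (iv) expansion
    out = layers.Conv2DTranspose(channels, 9, strides=scale,
                                 padding='same')(x)                   # (v) deconvolution
    return keras.Model(inp, out)
```

Note that only the final deconvolution enlarges the image, so the costly convolutions operate on the small LR grid, which is the source of FSRCNN's speed advantage over SRCNN.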
SRResNet benefits from residual connections between the layers. The residual blocks (RBs) are groups of layers stacked together, with the input of the block added to the output of its final layer. In SRResNet, each block encompasses two convolutional layers, each followed by a batch normalization (BN) layer that reduces the internal covariate shift. The upsampling blocks (UBs) allow for image enlargement through pixel shuffling (PS) layers that increase the resolution of the features. The number of both RBs and UBs is variable: by increasing the number of RBs, the network may model a better mapping, whereas by changing the number of UBs, we may tune its scaling factor. However, with more blocks, the architecture of the network becomes increasingly complex, which makes it harder to train.
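The PS operation used in the upsampling blocks rearranges channel depth into spatial resolution. A pure-NumPy sketch (the channel ordering here follows TensorFlow's depth_to_space convention; actual implementations may differ):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (H, W, C*r*r) feature map into a (H*r, W*r, C) map,
    as done by the PS layers in SRResNet's upsampling blocks."""
    h, w, c = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    out_c = c // (r * r)
    x = x.reshape(h, w, r, r, out_c)     # split channels into an r-by-r sub-pixel grid
    x = x.transpose(0, 2, 1, 3, 4)       # interleave sub-pixels with spatial positions
    return x.reshape(h * r, w * r, out_c)
```

Because PS is a pure rearrangement, a UB with r = 2 turns a feature map with 4C channels into a 2x-larger map with C channels at no arithmetic cost; stacking two such UBs yields a 4x scaling factor.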
3 Experimental study
We trained FSRCNN and SRResNet using natural images from the DIV2K dataset (available at https://data.vision.ee.ethz.ch/cvl/DIV2K) and Sentinel-2 images. From these images, patches were extracted randomly to create the training and validation sets, as specified in Table 1. LR images were obtained from HR ones using different downsampling techniques: nearest neighbor (NN), bilinear, bicubic, and Lanczos. We also created a mixed set, in which the downsampling technique was randomly selected for each image. For Lanczos, we additionally applied Gaussian blur (Lanczos-B), Gaussian noise (Lanczos-N), and both blur and noise (Lanczos-BN). Examples of the resulting patches are shown in Fig. 3. We used Python with Keras to implement the CNNs. The experiments were run on an Intel i9 4 GHz computer with 64 GB RAM and two RTX 2080 8 GB GPUs. We used the ADAM optimizer; the optimization stops if the accuracy over the validation set does not increase for 50 epochs.
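The degradation procedures described above can be sketched with Pillow. The Gaussian blur and noise parameters are placeholders, as the sigma values used in the study are not given in this excerpt; we also assume blur is applied before downsampling and noise after it.

```python
import random
import numpy as np
from PIL import Image, ImageFilter

# The four interpolation kernels used to produce the LR training sets.
KERNELS = {'nn': Image.NEAREST, 'bilinear': Image.BILINEAR,
           'bicubic': Image.BICUBIC, 'lanczos': Image.LANCZOS}

def degrade(hr, scale=2, kernel='bicubic', blur_sigma=None, noise_sigma=None):
    """Produce an LR patch from an HR patch; blur_sigma/noise_sigma are placeholders."""
    img = hr
    if blur_sigma:                       # Lanczos-B / Lanczos-BN variants
        img = img.filter(ImageFilter.GaussianBlur(blur_sigma))
    lr = img.resize((hr.width // scale, hr.height // scale), KERNELS[kernel])
    if noise_sigma:                      # Lanczos-N / Lanczos-BN variants
        arr = np.asarray(lr, dtype=np.float32)
        arr += np.random.normal(0.0, noise_sigma, arr.shape)
        lr = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    return lr

def degrade_mixed(hr, scale=2):
    # Mixed set: the interpolation kernel is picked at random per image.
    return degrade(hr, scale, random.choice(list(KERNELS)))
```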
Table 1. Number of patches in the training and validation sets, with the LR and HR patch sizes, for each dataset.
Table 2. Reconstruction accuracy of FSRCNN and SRResNet for artificially-degraded (AD) and real satellite (RS) images.
After training, the networks were tested using two kinds of test sets: (i) artificially-degraded (AD) images, i.e., 10 HR images bicubically downsampled, and (ii) real satellite (RS) images acquired at different resolutions: we used three Sentinel-2 scenes as LR, two of which are matched with SPOT images and one with a Digital Globe WorldView-4 image. We evaluate the reconstruction accuracy using peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), visual information fidelity (VIF), universal image quality index (UIQI), and keypoint features similarity (KFS). For all these metrics, higher values indicate higher similarity between the reconstruction outcome and the reference image.
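Of the listed metrics, PSNR is the simplest to reproduce. A minimal NumPy sketch, assuming 8-bit imagery (peak value 255):

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a reference image and a reconstruction."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')               # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, a reconstruction that is off by exactly one gray level everywhere scores 20*log10(255), roughly 48.13 dB.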
In Table 2, we show the reconstruction accuracy obtained with FSRCNN and SRResNet trained using different training sets. We highlight the PSNR and SSIM scores (in gray) for AD after training with a bicubically downsampled set, as this is the scenario most often reported in the literature. Judging from these scores alone, SRResNet is slightly better than FSRCNN, and using satellite data for training appears to be beneficial. However, from all the scores, it is clear that the degradation procedure is more significant than both the type of images in the training set and the network architecture. Actually, the networks trained with sets based on NN downsampling perform best, which can also be assessed qualitatively from Fig. 4. If the training set is blurred (Lanczos-B), then the image sharpening is too strong, resulting in many high-frequency artifacts. A surprising outcome can be observed for SRResNet (Bicubic, Sentinel): the details in the sea area are lost after reconstruction, but the land area is reliably restored.
For RS images, it is not clear from the reported metrics (Table 2) which training set is the best. The values are much lower than for AD, as the HR images used for reference were acquired with a different sensor, so even a well-reconstructed image differs substantially from its HR counterpart. From Fig. 5, it can be seen that NN downsampling (best for AD) results in a blurry outcome. Interestingly, Lanczos-B (very poor for AD) delivers better results in this case, and it is consistently picked by the KFS metric, as the similarity to HR in the domain of detected keypoints is the highest here. Similarly to AD, severe artifacts in the sea area can be observed for SRResNet trained with some sets (Mixed and NN, for Sentinel). In our opinion, the visually most plausible results are obtained using bilinear downsampling (for both Sentinel and DIV2K), which is also reflected in the highest UIQI scores in Table 2.
In this paper, we reported our experimental study on preparing the data to train deep CNNs for satellite image SRR. The results indicate that the degradation procedure used to generate the training data has a tremendous impact on the SRR performance, which is usually neglected in the literature. Furthermore, it is worth noting that the much deeper architecture of SRResNet does not seem to outperform the relatively simple FSRCNN when an appropriate training set is used.
Currently, we are exploring how to combine different degradation procedures, including data augmentation techniques, to create training sets which better reflect the actual imaging conditions. We expect that this will allow deep CNNs to increase their performance for real satellite images.
-  P. Benecki, M. Kawulok, D. Kostrzewa, and L. Skonieczny, “Evaluating super-resolution reconstruction of satellite images,” Acta Astronautica, vol. 153, pp. 15–25, 2018.
-  H. Demirel and G. Anbarjafari, “Discrete wavelet transform-based satellite image resolution enhancement,” IEEE TGRS, vol. 49, no. 6, pp. 1997–2004, 2011.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE TPAMI, vol. 38, no. 2, pp. 295–307, 2016.
-  C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Proc. ECCV. Springer, 2016, pp. 391–407.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, 2016, pp. 770–778.
-  J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. IEEE CVPR, 2016, pp. 1646–1654.
-  C. Ledig, L. Theis, F. Huszár et al., “Photo-realistic single image super-resolution using a generative adversarial network.” in Proc. CVPR, vol. 2, no. 3, 2017, p. 4.
-  L. Liebel and M. Körner, “Single-image super resolution for multispectral remote sensing data using CNNs,” in Proc. ISPRSC, 2016, pp. 883–890.
-  D. Liu, Z. Wang, B. Wen et al., “Robust single image super-resolution via deep networks with sparse prior,” IEEE TIP, vol. 25, no. 7, pp. 3194–3207, 2016.
-  K. Nasrollahi and T. B. Moeslund, “Super-resolution: a comprehensive survey,” Machine vision and applications, vol. 25, no. 6, pp. 1423–1468, 2014.
-  J. Sun, Z. Xu, and H.-Y. Shum, “Image super-resolution using gradient profile prior,” in Proc. IEEE CVPR, 2008, pp. 1–8.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE TIP, vol. 19, no. 11, pp. 2861–2873, 2010.