The Effects of Super-Resolution on Object Detection Performance in Satellite Imagery

12/10/2018 · Jacob Shermeyer, et al. · In-Q-Tel, Inc.

We explore the application of super-resolution techniques to satellite imagery, and the effects of these techniques on object detection algorithm performance. Specifically, we enhance satellite imagery beyond its native resolution and test whether we can identify various types of vehicles, planes, and boats with greater accuracy than at native resolution. Using the Very Deep Super-Resolution (VDSR) framework and a custom Random Forest Super-Resolution (RFSR) framework, we generate enhancement levels of 2x, 4x, and 8x over five distinct resolutions ranging from 30 cm to 4.8 meters. Using both native and super-resolved data, we then train several custom detection models using the SIMRDWN object detection framework. SIMRDWN combines a number of popular object detection algorithms (e.g. SSD, YOLO) into a unified framework designed to rapidly detect objects in large satellite images. This approach allows us to quantify the effects of super-resolution techniques on object detection performance across multiple classes and resolutions. We also quantify the performance of object detection as a function of native resolution and object pixel size. For our test set we note that performance degrades from mAP = 0.5 at 30 cm resolution down to mAP = 0.12 at 4.8 m resolution. Super-resolving native 30 cm imagery to 15 cm yields the greatest benefit, a 16-20% improvement in mAP. Super-resolution is less beneficial at coarser resolutions, though it still provides a 3-10% improvement in mAP.

Code Repositories

RFSR: Random Forest Super-Resolution (RFSR)

VDSR4Geo: TensorFlow implementation of "Accurate Image Super-Resolution Using Very Deep Convolutional Networks" adapted for working with geospatial data

1 Introduction

The interplay between super-resolution techniques and object detection frameworks remains largely unexplored, particularly in the context of satellite or overhead imagery. Intuitively, super-resolution methods should increase object detection performance, as an increase in resolution should add more distinguishable features that an object detection algorithm can use for discrimination. Detecting small objects such as vehicles in satellite imagery remains an exceedingly difficult task for multiple reasons [37], and an artificial increase in resolution may help to alleviate some of these issues, which include:

  1. Objects such as cars in satellite imagery have a small spatial extent (as low as 10 pixels) and are often densely clustered.

  2. All objects exhibit complete rotation invariance and can have any orientation.

  3. Training example frequency is low compared with other computer vision disciplines. Few datasets exist with appropriate labels for objects in satellite imagery. The most notable are: SpaceNet [38], A Large-scale Dataset for Object DeTection in Aerial Images (DOTA) [41], Cars Overhead With Context (COWC) [27], and xView [18].

  4. Most satellite imaging sensors cover a broad area and produce images containing hundreds of megapixels, the equivalent of an ultra-high-resolution image. For example, the native imagery used in this study was on average far larger than the benchmark super-resolution datasets Set5, Set14, BSD100, and Urban100. When working with modern neural network architectures, these images must be tiled into smaller chunks for both training and inference.

Although several studies have used SR as a pre-processing step [1, 11, 12, 33, 43, 3, 10, 5], none have quantified its effect on object detection performance in satellite imagery across multiple resolutions. This study aims to accomplish that task by training multiple custom object detection models to identify vehicles, boats, and planes in both native and super-resolved data. We then test the models' performance on the native (ground-truth) imagery and on super-resolved imagery of the same Ground Sample Distance (GSD: the distance between pixels measured on the ground). Additionally, this is the first study to demonstrate the output of super-resolved 15 cm GSD satellite imagery. Although no native 15 cm satellite imagery exists for comparison, this data can be compared against coarser resolutions to test the benefits provided by super-resolution.

Figure 1: The effects of super-resolution on a plane and neighboring objects. As resolution degrades, super-resolution becomes a less tractable solution.

The cost-benefit implications of such a study are enormous. Satellite manufacturers spend the majority of their budget on the design and launch of satellites. For example, the DigitalGlobe WorldView-4 satellite cost an estimated several hundred million dollars when one includes the spacecraft, insurance, and launch [8]. Ideally, one could couple an effective SR enhancement algorithm with a smaller, cheaper satellite that captures images at coarser resolution. The process of capturing and subsequently enhancing coarser data could drastically reduce launch cost, expand satellite field of view, reduce the number of satellites in orbit, and improve downlink speeds between satellites and ground control stations.

2 Related Work

2.1 Super-Resolution Techniques and Application to Overhead Imagery

Single-image Super-Resolution (SR) is the process of deriving a high-resolution (HR) image from a single low-resolution (LR) input. Although super-resolution remains an ill-posed and difficult problem, recent advances in neural networks and machine learning have enabled more robust SR algorithms with effective performance. These techniques use paired LR and HR training images to learn the most likely HR features corresponding to given LR features, and use this mapping to create an output SR product.

Over the past five years, convolutional neural network approaches have produced state-of-the-art super-resolution results. Dong et al. [7] were the first to establish a deep learning approach with SRCNN. This has been followed by several successive approaches, major alterations, and improvements. Very Deep Super-Resolution (VDSR) [15] exhibited state-of-the-art performance and was one of the first to modify the SRCNN approach, creating a deeper, 20-layer network that learns a residual image to transform LR images into HR images. Developed concurrently, the Deeply-Recursive CNN (DRCN) [16] introduced a recursive neural network approach to super-resolve imagery. The Deeply Recursive Residual Network (DRRN) [34] builds upon the VDSR and DRCN advancements, combining the residual-layers approach and recursive learning in a compact network.

More complex methods followed, such as the Laplacian Pyramid Super-Resolution Network (LapSRN) [17]. Adversarial training has also been employed: the SR Generative Adversarial Network (SRGAN) [19] produces photo-realistic enhanced images. The use of wider and deeper networks has also been proposed, most notably by Lim et al. [21] with the Enhanced Deep Residual Network (EDSR). Most recently, the Deep Back-Projection Network (DBPN) [9] showed state-of-the-art performance for 8x enhancement by connecting a series of iterative up- and down-sampling stages. Newer block-based methods such as the Information Distillation Network (IDN) [14] offer compact networks that gradually extract common features for fast reconstruction of HR images. In another example, the Residual Dense Network (RDN) [44] uses residual dense blocks to produce strong performance.

Although new and powerful single-image SR techniques continue to be developed, these techniques have been infrequently applied to overhead imagery. One of the most notable applications of super-resolution to satellite and overhead imagery remains the recent paper by Bosch et al. [2], who analyze several sources of satellite imagery and quantify their success in terms of PSNR for GAN-based enhancement. In another example, [22] use deep neural networks for simultaneous super-resolution and colorization of satellite imagery. Several papers [42, 25, 35, 20, 28] modify or leverage SRCNN [7] and/or VDSR [15] to successfully super-resolve Jilin-1, SPOT, Pleiades, Sentinel-2, and Landsat imagery.

Ultimately, a few specific papers are direct precursors of this work. In the first, [3] use fine-resolution aerial imagery and coarser satellite imagery with a coupled dictionary-learning approach to super-resolve vehicles and detect them with a simple linear Support Vector Machine model. Their results showed that object detection performance improves when using SR as a pre-processing step versus using the native coarser imagery. Xu et al. [43] use sparse dictionary learning to generate synthetic super-resolved imagery from Landsat and MODIS image pairs; their results show an increase in performance for land-cover change mapping when using the super-resolved imagery. Although these approaches are similar to ours, they do not use newer neural-network-based approaches and are narrower in scope. Finally, [10] super-resolve imagery using DBPN [9] and detect various objects in traditional photography using SSD [23]. They quantify their success in terms of mAP and add the novel element of designing a loss function that optimizes SR for object detection performance. Their results show that end-to-end training of these algorithms gives a performance boost for object detection tasks, a promising avenue for future research.

Overall, one would assume that SR techniques would naturally improve object detection performance, particularly for satellite imagery; however, no such study has been conducted. To address this question, our study investigates the relationship between object detection performance and resolution, spanning five unique GSD resolutions with six SR outputs per resolution (two techniques at three enhancement levels). Ultimately, we investigate 35 separate resolution profiles (five native plus 30 super-resolved) for object detection performance.

2.2 Object Detection Techniques

A number of recent papers have applied advanced machine learning techniques to aerial or satellite imagery, yet have focused on a slightly different problem than the one we address. For example, [24] demonstrated the ability to localize objects in overhead imagery, yet application to larger areas would be problematic, with an inference speed of 10 - 40 seconds per image chip. Efforts to localize surface-to-air missile sites [26] with satellite imagery and sliding-window classifiers work if one is only interested in a single object size of hundreds of meters. Running a sliding-window classifier across a large satellite image to search for small objects of interest quickly becomes computationally intractable, however, since multiple window sizes are required for each object size. For perspective, one must evaluate over one million sliding-window cutouts if the target is a 10 meter boat in a DigitalGlobe image. Application of rapid object detection algorithms to the remote sensing sphere is still relatively nascent, as evidenced by the lack of reference to SSD [23], Faster R-CNN [30], R-FCN [6], or YOLO [29] in a recent survey of object detection in remote sensing [4]. While tiling a large image is still necessary, the larger field of view of these frameworks (a few hundred pixels) compared to simple classifiers (as low as 10 pixels) reduces the number of tiles required by a factor of over 1000, with a corresponding marked increase in inference speed. In addition, object detection frameworks often have much improved background differentiation compared to sliding-window classifiers, since the network encodes contextual information for each object.
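As a rough illustration of that million-cutout count, assume (hypothetically; these figures are ours, not the survey's) a 16,000 × 16,000 pixel image at 30 cm GSD and 50% window overlap:

$$ \text{boat extent} \approx \frac{10\ \text{m}}{0.30\ \text{m/px}} \approx 33\ \text{px}, \qquad N_{\text{windows}} \approx \left(\frac{16{,}000\ \text{px}}{33\ \text{px}/2}\right)^{2} \approx 9 \times 10^{5}. $$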

As we seek to study the effect of super-resolution on object detection performance in real-world satellite imagery, and for all of the reasons listed above, rapid object detection frameworks are the logical choice for this study. The premier rapid object detection algorithms (SSD, Faster R-CNN, R-FCN, and a modified version of YOLO called YOLT [36]) were recently incorporated into the unified SIMRDWN framework [37], which is optimized for ingesting satellite imagery that is typically several hundred megapixels in size. The SIMRDWN paper reported that the highest performance stemmed from the YOLT algorithm, followed by SSD, with Faster R-CNN and R-FCN significantly behind.

3 Dataset

(a)
(b)
(c)
Figure 2: Issues with xView ground truth labels. Red = car, green = truck, orange = bus, yellow = airplane, purple = boat. Note the incorrectly sized cars in (a), the erroneous "boat" ground truth labels in (b), and the missing cars in (c).

The xView dataset [18] was chosen for the application of super-resolution techniques and the quantification of object detection performance. The imagery consists of 1,415 km² of DigitalGlobe WorldView-3 pan-sharpened RGB imagery at 30 cm native GSD, spread across 56 distinct global locations. The labeled dataset contains 1 million object instances across 60 classes annotated with bounding boxes, including various types of buildings, vehicles, planes, trains, and boats. For our purposes, we ultimately discarded classes such as "Building," "Hangar," and "Vehicle Lot" because we found that such objects are better represented by polygonal labels than by bounding boxes for foundational mapping [38] purposes.

Figure 3: Object size histograms (in pixels); recall that each pixel is 30 cm in extent.

We chose an aggregation schema due to inconsistent labeling within the dataset. Unfortunately, many objects are mislabeled or simply missed by labelers (see Figure 2). This leads to an increase in false positive rates, with correct detections inaccurately tagged as mis-classifications after inference. In addition, many xView classes have a very low number of training examples (e.g. Truck w/Liquid has only 149 examples) that are poorly differentiated from similar classes (e.g. Truck w/Box has 3653 examples and looks very similar to Truck w/Liquid). The question of how many training examples are necessary to disentangle similar classes is beyond the scope of this paper.

Category | Mean Size (meters) | Train | Test | Total
Boat | 16.5 | 2379 | 2347 | 4726
Large Aircraft | 36.9 | 424 | 294 | 718
Small Aircraft | 13.2 | 264 | 178 | 442
Bus/Truck | 8.8 | 19337 | 13269 | 32606
Small Vehicle | 4.7 | 129438 | 89923 | 219361
Table 1: Object Counts

Our classes ultimately consist of the following (original xView classes listed in parentheses): Small Aircraft (Fixed-wing Aircraft, Small Aircraft), Large Aircraft (Cargo Plane), Small Vehicle (Passenger Vehicle, Small Car, Pickup Truck, Utility Truck), Bus/Truck (Bus, Truck, Cargo Truck, Truck w/Box, Truck w/Flatbed, Truck w/Liquid, Dump Truck, Haul Truck, Cement Mixer, Truck Tractor), and Boat (Motorboat, Sailboat, Yacht, Maritime Vessel, Tugboat, Barge, Fishing Vessel, Ferry). See Table 1 for dataset details and Figure 3 for object size histograms.
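For reference, this aggregation amounts to a simple lookup from raw xView labels to our five classes; a minimal sketch mirroring the grouping above (the dictionary and helper names are ours, not from a released codebase):

```python
from typing import Optional

# Raw xView class name -> aggregated class (grouping from Section 3);
# classes absent from this map (e.g. "Building") are discarded.
XVIEW_TO_AGGREGATED = {
    "Fixed-wing Aircraft": "Small Aircraft", "Small Aircraft": "Small Aircraft",
    "Cargo Plane": "Large Aircraft",
    "Passenger Vehicle": "Small Vehicle", "Small Car": "Small Vehicle",
    "Pickup Truck": "Small Vehicle", "Utility Truck": "Small Vehicle",
    "Bus": "Bus/Truck", "Truck": "Bus/Truck", "Cargo Truck": "Bus/Truck",
    "Truck w/Box": "Bus/Truck", "Truck w/Flatbed": "Bus/Truck",
    "Truck w/Liquid": "Bus/Truck", "Dump Truck": "Bus/Truck",
    "Haul Truck": "Bus/Truck", "Cement Mixer": "Bus/Truck",
    "Truck Tractor": "Bus/Truck",
    "Motorboat": "Boat", "Sailboat": "Boat", "Yacht": "Boat",
    "Maritime Vessel": "Boat", "Tugboat": "Boat", "Barge": "Boat",
    "Fishing Vessel": "Boat", "Ferry": "Boat",
}

def aggregate_label(xview_class: str) -> Optional[str]:
    """Map a raw xView class to an aggregated class; None means 'discard'."""
    return XVIEW_TO_AGGREGATED.get(xview_class)
```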

3.1 Simulation of Optics and Sensors

All data were preprocessed consistently to simulate coarser-resolution imagery and test the effects of our SR techniques on a range of resolutions. We intend our results to showcase what can be reasonably accomplished given coarser satellite imagery, rather than simply what is possible under the ideal settings (no blurring, bicubic decimation) in which most SR algorithms are introduced. We attempt to simulate coarser-resolution satellite imagery as accurately as possible by simulating the optical point-spread function (PSF) and using a more robust decimation algorithm. This is important because the optics of the telescope greatly impact the appearance of very small objects. The common practice of simply resizing an image by reducing its dimensions by a factor of two simulates a different sensor containing one-quarter the number of pixels, yet this approach ignores the different optics present in a properly designed telescope that would be coupled to such a sensor. A properly designed sensor should have its pixel size determined by the Nyquist sampling rate: half the size of the mirror resolution determined by the diffraction limit. Given the cost and complexity of launching satellite imaging constellations to orbit, we assume that all imaging satellites have properly designed sensors. We can use the assumption of Nyquist sampling to determine the PSF of the telescope optics, which can be approximated by a Gaussian whose full width at half maximum spans two output pixels:

$$ \sigma_{\mathrm{PSF}} \approx \frac{2}{2.355}\,\frac{\mathrm{GSD}_{\mathrm{out}}}{\mathrm{GSD}_{\mathrm{in}}} \qquad (1) $$

where σ_PSF is expressed in units of input pixels and the factor 2.355 converts a Gaussian FWHM to a standard deviation.

For our study, data were degraded from the native 30 cm GSD using a variable Gaussian blur kernel to simulate the point-spread function of the coarser sensor, with the kernel width set by the desired output resolution (Equation 1). We then used inter-area decimation to reduce the dimensions of the blurred imagery to the appropriate output size (e.g. 60 cm imagery has one-quarter the number of pixels of 30 cm imagery over the same field of view). We repeat the above procedure to simulate resolutions of 60, 120, 240, and 480 cm. The ground truth data and the outputs from the super-resolution algorithms were randomly split into training (60%) and validation (40%) categories for object detection. The same image split is used at every resolution to maintain consistency when comparing validation scores.
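A minimal sketch of this degradation pipeline, assuming the Gaussian width of Equation 1 and OpenCV's inter-area decimation (the function name and defaults are ours, not the paper's code):

```python
import cv2
import numpy as np

def simulate_coarser_sensor(img: np.ndarray, gsd_in: float = 0.30,
                            gsd_out: float = 0.60) -> np.ndarray:
    """Degrade fine imagery to mimic a coarser, Nyquist-sampled sensor."""
    ratio = gsd_out / gsd_in
    # Equation 1: PSF FWHM spans two output pixels; FWHM = 2.355 * sigma.
    sigma = 2.0 * ratio / 2.355
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)  # kernel size derived from sigma
    h, w = img.shape[:2]
    new_size = (int(round(w / ratio)), int(round(h / ratio)))
    # Inter-area decimation, as described in Section 3.1.
    return cv2.resize(blurred, new_size, interpolation=cv2.INTER_AREA)

# e.g. simulate 60, 120, 240, and 480 cm from native 30 cm imagery:
# coarse = [simulate_coarser_sensor(img, 0.30, g) for g in (0.6, 1.2, 2.4, 4.8)]
```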

4 Super-Resolution Techniques

For this study, super-resolution is conducted with two techniques at enhancement levels of 2×, 4×, and 8× over five distinct resolutions ranging from 30 cm to 4.8 meters. We also create 15 cm GSD output imagery using the models trained to super-resolve imagery from 60 cm to 30 cm and from 120 cm to 30 cm.

Our first method is a convolutional neural network technique called Very Deep Super-Resolution (VDSR) [15]. VDSR has featured as a baseline in the majority of recent super-resolution research and was one of the first methods to modify the originally proposed convolutional neural network approach, SRCNN [7]. This architecture was chosen for its ease of implementation, its ability to train for multiple levels of enhancement, its use as a standard baseline when new techniques are introduced, and its favorable past performance. We use the standard network parameters from the original paper [15] and train for 60 epochs. We chose a patch size of 41×41 pixels and augment by rotations (four) and flips (two) for eight unique combinations per patch. This process is repeated for each enhancement level (2×, 4×, and 8×), and each is fed into the same network for concurrent training. Average training time for a 2×, 4×, and 8× enhancement on ~200 million pixel examples is 55.9 hours. Inference speed on a 544×544 pixel image is 0.16 seconds (Table 2).
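A minimal sketch of the eight-fold augmentation (four 90-degree rotations, each with and without a flip), assuming square patches:

```python
import numpy as np

def eightfold_augment(patch: np.ndarray):
    """Yield the eight unique rotation/flip variants of a square patch."""
    for k in range(4):                 # four 90-degree rotations
        rotated = np.rot90(patch, k)
        yield rotated
        yield np.fliplr(rotated)       # ...each also flipped horizontally
```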

The second method is an approach we call Random Forest Super-Resolution (RFSR), designed for this work; it requires minimal training time and exhibits high inference speeds. RFSR is an adaptation of other random forest super-resolution techniques such as SRF [32] and SRRF [13], and can process both georeferenced satellite imagery and traditional photography. We include this simpler, less computationally intensive algorithm, which does not require GPUs, to test its effectiveness against a near state-of-the-art SR solution. The hypothesis is that even a simple technique may improve object detection performance.

Our method uses a random forest regressor with a few standard parameters: the number of estimators is set to 100, the maximum depth to 12, and the minimum number of samples required to split an internal node to 200. Finally, we use bootstrapping and out-of-bag samples to estimate the error and R² scores on randomly selected unseen data during training. These parameters were tuned empirically to maximize PSNR scores (see Section 6 for details on metrics) while maintaining minimal training time (4 hours or less per level of enhancement on a 64 GB RAM CPU). It should be noted that PSNR scores could be mildly improved with deeper trees and more estimators, at the cost of training time.
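In scikit-learn terms, the configuration above corresponds roughly to the following; a sketch under those stated parameters, not the released RFSR code:

```python
from sklearn.ensemble import RandomForestRegressor

rfsr_regressor = RandomForestRegressor(
    n_estimators=100,       # number of estimators
    max_depth=12,           # maximum tree depth
    min_samples_split=200,  # minimum samples to split an internal node
    bootstrap=True,         # bootstrapping, with out-of-bag samples used to
    oob_score=True,         # estimate error / R^2 on unseen data during training
    n_jobs=-1,
)
```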

Like several other SR techniques, RFSR is trained only on the luminance component of a YCbCr-converted image. HR images are degraded to create LR and HR image pairs. The degraded LR image is then shifted by one and then two pixels in each direction relative to the HR image and compressed into a 3-dimensional array. The original up-sampled LR image is then subtracted from the 3-D LR array, and from the HR image, for a residual training schema. This normalizes the LR stack and HR image pair and also removes homogeneous areas, emphasizing important edge effects. After training and inference, the interpolated LR image is added back to the model's output image to create the super-resolved output. RFSR can only produce one level of enhancement (2×, 4×, or 8×) at a time. Average training time for all three enhancements on ~200 million pixel examples is 10.8 hours. Average inference speed on a 544×544 pixel image is 0.7 seconds (Table 2).
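A sketch of one reading of this residual feature construction (the exact shift pattern may differ in RFSR; here we take every 1- and 2-pixel shift in a 5×5 neighborhood, i.e. 25 bands):

```python
import numpy as np

def lr_residual_stack(lr_up: np.ndarray) -> np.ndarray:
    """Shifted-LR residual features from an upsampled LR luminance channel.

    lr_up: LR luminance (Y of YCbCr), bicubically upsampled to HR size.
    Returns an (H, W, 25) float array: shifted copies minus the center image.
    """
    lr_up = lr_up.astype(np.float32)
    shifts = [np.roll(lr_up, (dy, dx), axis=(0, 1))
              for dy in (-2, -1, 0, 1, 2) for dx in (-2, -1, 0, 1, 2)]
    stack = np.stack(shifts, axis=-1)
    return stack - lr_up[..., None]   # residual normalization

# Training target is the HR residual (hr - lr_up); at inference the predicted
# residual is added back to lr_up to form the super-resolved output.
```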

 | VDSR | RFSR
Inference time (per image) | 0.16 seconds | 0.7 seconds
Training time (for 2×, 4×, and 8×) | 55.9 hours | 10.8 hours
Table 2: Average inference time per 544×544 pixel image and training time for a set of 1,500 images at native 30 cm GSD resolution. RFSR used a 64 GB RAM CPU and VDSR used an NVIDIA Titan Xp GPU for inference and training.

5 Object Detection Techniques

As discussed in Section 2.2, advanced object detection frameworks have only recently been applied to large satellite images via the SIMRDWN framework. The SIMRDWN paper reported that the highest performance stemmed from the YOLT algorithm, followed by SSD, with Faster R-CNN and R-FCN significantly behind. We therefore opt to utilize the YOLT and SSD models within SIMRDWN for this study. For the YOLT model we adopt the dense 22-layer network of [36], with a momentum of 0.9 and a decay rate of 0.0005. We use a 544 pixel training input size (corresponding to roughly 164 meters at native 30 cm GSD) and train for 150 epochs. For the SSD model we follow the TensorFlow Object Detection API implementation with the Inception V2 architecture. We adopt a base learning rate of 0.004 and a decay rate of 0.95. We train for 30,000 iterations with a batch size of 16, and use the same 544 pixel input size as YOLT. For both YOLT and SSD we train models on the "native" imagery (the original 30 cm data and the convolved and resized imagery described in Section 3.1), as well as on the outputs of RFSR and VDSR applied to the object detection training set. This approach yields a multitude of models across the myriad architectures, super-resolution techniques, and resolutions, enabling a detailed study of performance.
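For reference, the detection hyperparameters above can be collected as follows; the key names are illustrative, not the SIMRDWN or TensorFlow Object Detection API configuration schema:

```python
# Hyperparameters reported in Section 5 (values from the text; keys are ours).
DETECTOR_CONFIGS = {
    "yolt": {
        "network_layers": 22,
        "momentum": 0.9,
        "weight_decay": 0.0005,
        "input_size_px": 544,
        "epochs": 150,
    },
    "ssd_inception_v2": {
        "base_learning_rate": 0.004,
        "learning_rate_decay": 0.95,
        "iterations": 30000,
        "batch_size": 16,
        "input_size_px": 544,
    },
}
```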

6 Metrics

Overall, super-resolution remains an active field of research with rather limited direct focus on end applications. Typical performance metrics include the Peak Signal-to-Noise Ratio (PSNR) or the Structural SIMilarity (SSIM) index (which we report in Section 7.1); however, these measures do not quantify any enhancement to object detection performance [40]. The perceptions of humans and of machine learning techniques can vary widely, and although images may be more visually appealing as a result of super-resolution, such techniques may have little impact on object detection performance.

For object detection metrics, we compare the ground truth bounding boxes to the predicted bounding boxes for each test image. For comparison of predictions to ground truth we define a true positive as having an intersection over union (IoU) greater than a given threshold. An IoU of 0.5 is often used as the threshold for a correct detection, though we adopt a lower threshold of 0.25 since most of our objects are very small (e.g. cars are only ~10 pixels in extent). This mimics Equation 5 of ImageNet [31], which sets an IoU threshold of 0.25 for objects 10 pixels in extent. Precision-recall curves are computed by evaluating test images over a range of probability thresholds. At each of 30 evenly spaced thresholds between 0.05 and 0.95, we discard all detections below the given threshold; non-max suppression for each object class is subsequently applied to the remaining bounding boxes, and the precision and recall at that threshold are tabulated from the summed true positives, false positives, and false negatives of all test images. Finally, we compute the average precision (AP) for each object class and each model, along with the mean average precision (mAP) for each model. One-sigma error bars are computed via bootstrap resampling, using 500 samples for each scenario.
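To make the evaluation concrete, a minimal sketch of the IoU test and an average precision computation (the trapezoidal AP convention here is our choice; the exact implementation in SIMRDWN may differ):

```python
import numpy as np

IOU_THRESHOLD = 0.25  # lowered from the usual 0.5 for very small objects

def iou(a, b):
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(precisions, recalls):
    """Area under the precision-recall curve (trapezoidal integration)."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions, dtype=float)[order],
                          np.asarray(recalls, dtype=float)[order]))
```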

7 Experimental Results

7.1 Super-Resolution Performance

As expected, super-resolution performance was strongest for the VDSR method, although RFSR produces comparable results in some circumstances (Table 3). As in other studies, the metrics degrade as the amount of enhancement increases. Both techniques performed strongest on the 60 cm imagery, likely because the initial bicubic interpolation scores are already high and because this resolution sits between the coarse and fine scales, where image features are easier to detect and enhance.

Figure 4: Examples of 15 cm GSD super-resolved output from RFSR and VDSR versus the original 30 cm GSD native imagery.
GSD Scale Bicubic VDSR RFSR
30cm ×2 38.68 / 0.8108 42.39 / 0.8925 39.79 / 0.8582
30cm ×4 35.86 / 0.6610 38.79 / 0.7795 35.85 / 0.7064
30cm ×8 33.82 / 0.5394 35.69 / 0.6117 34.32 / 0.5874
60cm ×2 41.26 / 0.9275 45.08 / 0.9635 43.03 / 0.9408
60cm ×4 36.98 / 0.8082 40.50 / 0.8904 37.41 / 0.8330
60cm ×8 33.99 / 0.6771 35.44 / 0.7293 33.78 / 0.6799
1.2m ×2 36.73 / 0.9151 39.33 / 0.9497 38.17 / 0.9448
1.2m ×4 32.49 / 0.7738 35.25 / 0.8633 33.47 / 0.8332
1.2m ×8 29.41 / 0.6097 30.58 / 0.6709 29.84 / 0.6700
2.4m ×2 35.26 / 0.8848 41.50 / 0.9624 36.67 / 0.9250
2.4m ×4 31.09 / 0.6898 33.75 / 0.8117 32.00 / 0.7659
2.4m ×8 28.46 / 0.5004 30.78 / 0.6089 28.87 / 0.5572
4.8m ×2 34.14 / 0.8404 37.01 / 0.9097 35.45 / 0.8953
4.8m ×4 30.42 / 0.6079 33.13 / 0.7527 31.24 / 0.6934
4.8m ×8 27.98 / 0.4013 30.22 / 0.5110 28.39 / 0.4488
Table 3: Average PSNR / SSIM scores for scales ×2, ×4, and ×8 across five super-resolution output GSDs. All test imagery is from the xView validation dataset (281 images). Bicubic indicates the scores if LR images are simply upscaled using bicubic interpolation to match the HR image size.

A few specific examples of super-resolution performance are visible in Figure 1, where we test the effects of our algorithms on a large object like a plane. Visually, VDSR and RFSR both perform strongly at 30 cm for both a 2× (60 cm input → 30 cm SR output) and a 4× (120 cm input → 30 cm SR output) enhancement, where both the fine details of the plane and small neighboring objects can be accurately recovered. Recovering the plane at coarser resolutions is extremely difficult, particularly at 4.8 m with an 8× enhancement; in this case the input for the SR algorithm is 38.4 m GSD, and at this resolution the sensor is simply insufficiently sensitive to resolve finer objects. Overall, we observe that when the imagery possesses fewer fine features to identify at coarser resolutions, the algorithms are unable to hallucinate and recover all object types. A different algorithm such as a GAN may be able to hallucinate visually finer features; however, previous studies [2] have shown that such algorithms are unable to exactly recover specific features of various object types.

Finally, in Figure 4 we demonstrate the visual enhancement provided by simulated 15 cm super-resolved output from both VDSR and RFSR. Both methods improve the visual quality by reducing pixelization and enhancing the clarity of features and characters. RFSR appears to produce slightly brighter edge effects than VDSR.

7.2 Object Detection Performance

Figure 5: Example output of YOLT model at native 30 cm resolution. Cars are in green, buses/trucks in blue, and airplanes in orange.

For each model we compute mean average precision (mAP) on a 338-image test set at each resolution. We also train and test a model on native-resolution imagery at double the sampling rate (via bicubic upsampling), giving a window size of 82 meters (versus 164 meters for the 30 - 480 cm resolutions). We perform this test to disentangle the effects of resolution and window size for the 15 cm super-resolved data: comparing the 15 cm super-resolved predictions to the oversampled imagery indicates whether any difference in performance is due to the smaller window size or to the super-resolution technique at 15 cm. Example precision-recall curves are shown in Figure 6. The YOLT model is clearly superior to SSD, particularly for small objects.

(a) YOLT
(b) SSD
Figure 6: Precision-recall curves for native 30 cm imagery for both YOLT and SSD.

Repeating the computation shown in Figure 6 for all models allows us to determine the degradation of performance as a function of resolution, as shown in Figure 7. In this plot we display bootstrap error bars for each model group. Sensor resolution is reported on the lower X-axis of this figure; for the super-resolution models this is the resolution of the input data, which is subsequently enhanced 2×. The performance of the "native" imagery (solid blue line) demonstrates how performance degrades with decreasing sensor resolution. We also plot the results of the super-resolution models. RFSR is slightly more robust than both native imagery and VDSR at lower resolutions, though VDSR provides a significant boost at the highest resolution. We observe worse performance when training on native imagery and testing on SR data (purple and red curves), or when training on SR data and testing on native imagery (not shown). We also train YOLT models for 4× enhancement, and include their results in Table 4 for input resolutions of 60, 120, 240, and 480 cm. We note results similar to the 2× enhancement.

Figure 7: Performance of various YOLT models as a function of sensor resolution. The lower axis indicates the sensor resolution, while the upper axis indicates the pixel extent of an object of average size. The 2×-sampled native imagery is plotted at 20 cm GSD.
Model | Data | 30 cm (2× sample) | 30 cm | 60 cm | 120 cm | 240 cm | 480 cm
YOLT | Native | (baseline) | | | | |
YOLT | RFSR ×2 | (+0.1) | (+0.5) | (+0.6) | (+0.5) | (+1.7) | (+0.5)
YOLT | VDSR ×2 | (+2.2) | (+2.7) | (+0.2) | (+0.4) | (-1.4) | (+0.6)
YOLT | RFSR ×4 | | | (+1.5) | (+0.5) | (+1.0) | (+0.5)
YOLT | VDSR ×4 | | | (+1.8) | (+0.2) | (+1.8) | (-0.7)
SSD | Native | (baseline) | | | | |
SSD | RFSR ×2 | (+2.6) | (+0.7) | (+0.7) | (+1.1) | (+3.2) | (+7.0)
SSD | VDSR ×2 | (+3.5) | (+2.3) | (-0.0) | (+2.3) | (+3.9) | (+8.7)
Table 4: Performance for each data type. For RFSR and VDSR at each resolution we note, in parentheses, the statistical difference in sigma from the baseline model (e.g. +0.5); the underlying mAP values are plotted in Figures 7 and 8. The ×4 rows occupy their input resolutions of 60 - 480 cm. For the 30 cm (2× sample) column we note the sigma difference between the native oversampled imagery and the 15 cm SR imagery.
Figure 8: Performance of SSD at each resolution.

Results for SSD models are significantly worse than for YOLT models (see Figure 8), with a mAP of 0.34 at native 30 cm resolution. The YOLT model (mAP = 0.50) at this resolution is 47% better than SSD, which aligns fairly well with the findings of [37]. SSD struggles with small objects, but super-resolution helps significantly at the coarsest resolutions (highest GSDs); this is primarily due to a boost from super-resolution for large aircraft identification.

While the performance improvements are often of modest statistical significance, Table 4 indicates that super-resolution techniques do provide an improvement at most resolutions. For YOLT, the greatest benefit is achieved at the highest resolutions, as super-resolving native 30 cm imagery to 15 cm with VDSR yields a 2.7σ improvement. VDSR yields little improvement at lower resolutions, averaging only a marginal gain for 60 - 480 cm; VDSR combined with the YOLT model yields a +7% improvement averaged over all resolutions. For the YOLT model, RFSR yields a +16% improvement at 30 cm and provides a +10% improvement on average for all lower resolutions; RFSR combined with the YOLT model yields a +9% improvement averaged over all resolutions.

For SSD, the VDSR model performs at least 2.3σ better than native imagery for all but 60 cm resolution. For SSD the improvement at 480 cm is statistically quite significant, though this is primarily due to the mAP of 0.0 for native imagery. Inspection of Figure 7 indicates that performance increases significantly once objects exceed a minimum pixel extent (see the upper axis of Figure 7). This trend extends across object classes, as shown in the performance curves for individual object classes (Figure 9; see the Supplemental Material for further details). Also apparent in Figure 9(b) is that a smaller field of view is preferred for detection of densely packed objects such as trucks, as evidenced by the sharp uptick at 15 cm and at the 2×-sampled native 30 cm point (blue line).

An intentional byproduct of this study is the establishment of object detection performance curves as a function of sensor resolution. The solid lines of Figures 7 and 9 indicate that object detection performance drops substantially when resolution degrades from 30 cm to 120 cm, and drops again from 120 cm to 480 cm, when looking across broad object classes.

(a) Small Aircraft
(b) Buses/Trucks
Figure 9: Performance curves for individual object classes.

8 Conclusions

In this paper we undertook a rigorous study of the utility provided by super-resolution techniques for the detection of objects in satellite imagery. We paired two super-resolution techniques (VDSR and RFSR) with advanced object detection methods and searched for objects in a satellite imagery dataset with over 250,000 labeled objects in a diverse set of environments. To establish super-resolution effects at multiple sensor resolutions, we degraded this imagery from 30 cm to 60, 120, 240, and 480 cm resolutions. Our baseline tests with both the YOLT and SSD models of the SIMRDWN object detection framework indicate that object detection performance decreases sharply as resolution degrades from 30 cm to 120 cm.

While super-resolution is not a direct replacement for actual imagery, the application of SR techniques as a pre-processing step does provide an improvement in object detection performance at most resolutions (Table 4). For both models, the greatest benefit is achieved at the highest resolutions, as super-resolving native 30 cm imagery to 15 cm yields a substantial improvement in mAP. Super-resolution provides lesser gains at coarser resolutions for 2× enhancement, though 4× enhancement still yields a measurable improvement. These findings indicate that if the data input to SR algorithms is too coarse, the algorithms are less effective and cannot find enough unique discriminating features to adequately reconstruct higher-resolution images. It is apparent from this research that super-resolution for satellite imagery should be quantified in terms of GSD gained, not in terms of enhancement level: these techniques can effectively improve resolution by tens of centimeters, up to 1 or perhaps 2 meters. Attempting an enhancement of roughly 5 or more meters of GSD is unrealistic using SR, and consequently not an ideal approach for improving the value of coarser satellite imagery.

Given the relative ease of applying SR techniques, the general improvement observed in this study is noteworthy, and SR could be a valuable pre-processing step for future object detection applications with satellite imagery, particularly when searching for objects with few distinguishing features (specifically boats, small aircraft, and buses/trucks; see the Supplemental Material). However, at the highest resolutions these techniques are less effective for small vehicles and large airplanes. This is likely due to the relationship between the window size (the amount of imagery an object detection algorithm sees at once), the types of objects, and the number of pixels per object: cars require less fine detail, while large planes begin to be clipped in half and lose the required neighboring context at the finest resolutions. Detailed statistics and plots for each class are provided in the supplemental material.

Furthermore, this study showcases the value of data quality. Previous research [37] has shown that an average precision of 0.91 can be achieved for small vehicles with YOLT given similar 30 cm oversampled imagery. In this study, in a nearly identical testing scenario with 30 cm oversampled imagery, the average precision for detecting small vehicles with YOLT declined to 0.81. This drop-off is likely indicative of the labeling issues common to the xView dataset, and shows that precise and exhaustive labeling is a requirement for the best and most consistent object detection performance.

In conclusion, additional testing of different super-resolution methods on different datasets should provide further insight into the precise relationship between super-resolution and object detection. For example, newer techniques such as ESRGAN [39] optimize for human visual perception, in contrast to traditional PSNR-oriented approaches; such techniques may be more effective at enhancing object detection performance. Alternatively, task-driven super-resolution [10] and end-to-end training of a joint super-resolution and object detection pipeline may be the most useful approach: it should enhance the discriminating features that an object detection framework finds most useful, and could provide a significant performance boost in this domain. Finally, working with a dataset featuring different objects or locations would provide further context about the value and interplay of these techniques.

References

  • [1] E. Bilgazyev, B. Efraty, S. Shah, and I. Kakadiaris. Sparse Representation-Based Super Resolution for Face Recognition At a Distance. In Proceedings of the British Machine Vision Conference 2011, pages 52.1–52.11, Dundee, 2011. British Machine Vision Association.
  • [2] M. Bosch, C. M. Gifford, and P. A. Rodriguez. Super-Resolution for Overhead Imagery Using DenseNets and Adversarial Learning. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1414–1422, Mar. 2018.
  • [3] L. Cao, C. Wang, and J. Li. Vehicle detection from highway satellite images via transfer learning. Information Sciences, 366:177–187, Oct. 2016.
  • [4] G. Cheng and J. Han. A survey on object detection in optical remote sensing images. CoRR, abs/1603.06201, 2016.
  • [5] D. Dai, Y. Wang, Y. Chen, and L. Van Gool. Is Image Super-resolution Helpful for Other Vision Tasks? arXiv:1509.07009 [cs], Sept. 2015. arXiv: 1509.07009.
  • [6] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: object detection via region-based fully convolutional networks. CoRR, abs/1605.06409, 2016.
  • [7] C. Dong, C. C. Loy, K. He, and X. Tang. Image Super-Resolution Using Deep Convolutional Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, Feb. 2016.
  • [8] Eric Shear. ULA preparing to launch WorldView-4 satellite from Vandenberg, Sept. 2016.
  • [9] M. Haris, G. Shakhnarovich, and N. Ukita. Deep back-projection networks for super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [10] M. Haris, G. Shakhnarovich, and N. Ukita. Task-Driven Super Resolution: Object Detection in Low-resolution Images. arXiv:1803.11316 [cs], Mar. 2018. arXiv: 1803.11316.
  • [11] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar. Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
  • [12] P. H. Hennings-Yeomans, B. V. Kumar, and S. Baker. Robust low-resolution face identification and verification using high-resolution features. In Image Processing (ICIP), 2009 16th IEEE International Conference on, pages 33–36. IEEE, 2009.
  • [13] J. Huang and W. Siu. Practical application of random forests for super-resolution imaging. In 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2161–2164, May 2015.
  • [14] Z. Hui, X. Wang, and X. Gao. Fast and accurate single image super-resolution via information distillation network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [15] J. Kim, J. K. Lee, and K. M. Lee. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1646–1654, Las Vegas, NV, USA, June 2016. IEEE.
  • [16] J. Kim, J. K. Lee, and K. M. Lee. Deeply-Recursive Convolutional Network for Image Super-Resolution. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1637–1645, Las Vegas, NV, USA, June 2016. IEEE.
  • [17] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5835–5843, Honolulu, HI, July 2017. IEEE.
  • [18] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and B. McCord. xView: Objects in Context in Overhead Imagery. arXiv:1802.07856 [cs], Feb. 2018. arXiv: 1802.07856.
  • [19] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 105–114, Honolulu, HI, July 2017. IEEE.
  • [20] L. Liebel and M. Körner. Single-Image Super Resolution for Multispectral Remote Sensing Data Using Convolutional Neural Networks. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLI-B3:883–890, June 2016.
  • [21] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced Deep Residual Networks for Single Image Super-Resolution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1132–1140, Honolulu, HI, USA, July 2017. IEEE.
  • [22] H. Liu, Z. Fu, J. Han, L. Shao, and H. Liu. Single satellite imagery simultaneous super-resolution and colorization using multi-task deep neural networks. Journal of Visual Communication and Image Representation, 53:20–30, May 2018.
  • [23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
  • [24] Y. Long, Y. Gong, Z. Xiao, and Q. Liu. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 55(5):2486–2498, May 2017.
  • [25] Y. Luo, L. Zhou, S. Wang, and Z. Wang. Video Satellite Imagery Super Resolution via Convolutional Neural Networks. IEEE Geoscience and Remote Sensing Letters, 14(12):2398–2402, Dec. 2017.
  • [26] R. A. Marcum, C. H. Davis, G. J. Scott, and T. W. Nivin. Rapid broad area search and detection of Chinese surface-to-air missile sites using deep convolutional neural networks. Journal of Applied Remote Sensing, 11(4):042614, Oct. 2017.
  • [27] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye. A Large Contextual Dataset for Classification, Detection and Counting of Cars with Deep Learning. Computer Vision – ECCV 2016, 9907:785–800, 2016.
  • [28] D. Pouliot, R. Latifovic, J. Pasher, and J. Duffe. Landsat Super-Resolution Enhancement Using Convolution Neural Networks and Sentinel-2 for Training. Remote Sensing, 10(3):394, Mar. 2018.
  • [29] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
  • [30] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
  • [31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [32] S. Schulter, C. Leistner, and H. Bischof. Fast and accurate image upscaling with super-resolution forests. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3791–3799, Boston, MA, USA, June 2015. IEEE.
  • [33] S. Shekhar, V. M. Patel, and R. Chellappa. Synthesis-based recognition of low resolution faces. In Biometrics (IJCB), 2011 International Joint Conference on, pages 1–6. IEEE, 2011.
  • [34] Y. Tai, J. Yang, and X. Liu. Image Super-Resolution via Deep Recursive Residual Network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2798, Honolulu, HI, July 2017. IEEE.
  • [35] C. Tuna, G. Unal, and E. Sertel. Single-frame super resolution of remote-sensing images by convolutional neural networks. International Journal of Remote Sensing, 39(8):2463–2479, Apr. 2018.
  • [36] A. Van Etten. You Only Look Twice: Rapid Multi-Scale Object Detection In Satellite Imagery. ArXiv e-prints, May 2018.
  • [37] A. Van Etten. Satellite Imagery Multiscale Rapid Detection with Windowed Networks. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), page In Press, Jan. 2019. arXiv: 1809.09978.
  • [38] A. Van Etten, D. Lindenbaum, and T. M. Bacastow. SpaceNet: A Remote Sensing Dataset and Challenge Series. arXiv:1807.01232 [cs], July 2018. arXiv: 1807.01232.
  • [39] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang. Esrgan: Enhanced super-resolution generative adversarial networks, 2018.
  • [40] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4):600–612, Apr. 2004.
  • [41] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang. Dota: A large-scale dataset for object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [42] A. Xiao, Z. Wang, L. Wang, Y. Ren, A. Xiao, Z. Wang, L. Wang, and Y. Ren. Super-Resolution for “Jilin-1” Satellite Video Imagery via a Convolutional Network. Sensors, 18(4):1194, Apr. 2018.
  • [43] Y. Xu, L. Lin, D. Meng, Y. Xu, L. Lin, and D. Meng. Learning-Based Sub-Pixel Change Detection Using Coarse Resolution Satellite Imagery. Remote Sensing, 9(7):709, July 2017.
  • [44] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

1 Super-Resolution Outputs

Figure S1: Super-resolution example as applied to images containing cars. Smaller objects like cars rapidly degrade and become amorphous pixelated blobs as resolution degrades. For the 30 cm data and super-resolved outputs, VDSR and RFSR perform favorably using a 2× enhancement (60 cm input → 30 cm SR output). VDSR can generally recover the car shape of the 30 cm data with a 4× enhancement (120 cm input → 30 cm SR output), whereas RFSR struggles to do so.
Figure S2: Ground truth 30 cm imagery (right), and simulated 60 cm input imagery (left).
Figure S3: Super-resolution with the 2× models (left = 2× RFSR, right = 2× VDSR).

2 Super-Resolution Scores: Bicubic Decimation with No Blurring

GSD Scale Bicubic VDSR RFSR
30cm ×2 40.30 / 0.8734 42.95 / 0.9104 40.90 / 0.8885
30cm ×4 36.83 / 0.7265 39.07 / 0.7939 36.81 / 0.7419
30cm ×8 34.69 / 0.5930 36.86 / 0.6605 35.06 / 0.6092
60cm ×2 43.69 / 0.9594 45.66 / 0.9717 44.25 / 0.9496
60cm ×4 38.61 / 0.8656 41.07 / 0.9055 38.89 / 0.8729
60cm ×8 35.20 / 0.7324 37.61 / 0.7861 35.18 / 0.7439
1.2m ×2 40.73 / 0.9640 43.14 / 0.9777 41.70 / 0.9716
1.2m ×4 34.73 / 0.8562 37.39 / 0.9006 35.23 / 0.8773
1.2m ×8 30.92 / 0.6862 33.16 / 0.7500 31.19 / 0.7197
2.4m ×2 39.33 / 0.9529 42.15 / 0.9720 40.36 / 0.9640
2.4m ×4 33.24 / 0.7998 36.00 / 0.8638 33.74 / 0.8326
2.4m ×8 29.68 / 0.5802 31.90 / 0.6566 29.92 / 0.6177
4.8m ×2 37.99 / 0.9320 40.94 / 0.9592 38.98 / 0.9491
4.8m ×4 32.23 / 0.7294 34.87 / 0.8070 32.66 / 0.7703
4.8m ×8 29.01 / 0.4786 31.15 / 0.5552 29.19 / 0.5131
Table 1: Average PSNR / SSIM scores for scales ×2, ×4, and ×8 across five super-resolution output GSDs. Here the test imagery is not blurred; it is decimated bicubically and then bicubically upsampled (the ideal or traditional super-resolution setting). As in the main text, the xView validation dataset is used (281 images). Bicubic indicates the scores if LR images are simply upscaled using bicubic interpolation to match the HR image size.

3 Object Detection Performance

The figures below illustrate bounding boxes output by YOLT models at various resolutions. Cars are green, buses/trucks are blue, small aircraft are red, and large aircraft are yellow.

Figure S4: Performance of YOLT model trained and tested on native 30 cm imagery for image_id=1114 and a low detection threshold of 0.1 (this detection threshold yields fewer false negatives but more false positives).
Figure S5: Performance of YOLT model trained and tested on super-resolved 15 cm imagery from the VDSR model. We use a low detection threshold of 0.1 (this detection threshold yields fewer false negatives but more false positives).
Figure S6: Performance of YOLT model trained and tested on super-resolved 30 cm imagery from the RFSR model. We use a low detection threshold of 0.1 (this detection threshold yields fewer false negatives but more false positives).
Figure S7: Performance of YOLT model trained and tested on super-resolved 60 cm imagery from the RFSR model. We use a low detection threshold of 0.1 (this detection threshold yields fewer false negatives but more false positives).
Figure S8: Performance of YOLT model trained and tested on super-resolved 120 cm imagery from the VDSR model. We use a low detection threshold of 0.1 (this detection threshold yields fewer false negatives but more false positives).

4 Object Detection Performance Curves and Tables

To compute the statistical difference (Δσ) between the super-resolved and baseline models, we follow the procedure used in Table 4 of the main text. The joint error estimate between the baseline (σ_b) and super-resolved (σ_s) data points can be estimated as:

$$ \sigma_{\mathrm{joint}} = \sqrt{\sigma_b^2 + \sigma_s^2} \qquad (S1) $$

The statistical difference between the two models is then simply:

$$ \Delta\sigma = \frac{\mathrm{mAP}_s - \mathrm{mAP}_b}{\sigma_{\mathrm{joint}}} \qquad (S2) $$
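As a worked check against the boat class (Table 3 below): native 60 cm gives 0.26 ± 0.03 and the ×4 model gives 0.31 ± 0.04, so σ_joint = √(0.03² + 0.04²) = 0.05 and Δσ = (0.31 − 0.26)/0.05 = +1.0, consistent with the tabulated (+1.1) up to rounding.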

The tables and plots below show the performance of various models for each object class.

Figure S9: Performance of YOLT models on all object categories as a function of sensor resolution. The lower axis indicates the sensor resolution, while the upper axis indicates the pixel extent of an object of average size. We plot the oversampled native imagery (solid blue line) at 20 cm.
Model | Data | 30 cm (2× sample) | 30 cm | 60 cm | 120 cm | 240 cm | 480 cm
YOLT | Native | (baseline) | | | | |
YOLT | RFSR ×2 | (+0.1) | (+0.5) | (+0.6) | (+0.5) | (+1.7) | (+0.5)
YOLT | VDSR ×2 | (+2.2) | (+2.7) | (+0.2) | (+0.4) | (-1.4) | (+0.6)
YOLT | RFSR ×4 | | | (+1.5) | (+0.5) | (+1.0) | (+0.5)
YOLT | VDSR ×4 | | | (+1.8) | (+0.2) | (+1.8) | (-0.7)
SSD | Native | (baseline) | | | | |
SSD | RFSR ×2 | (+2.6) | (+0.7) | (+0.7) | (+1.1) | (+3.2) | (+7.0)
SSD | VDSR ×2 | (+3.5) | (+2.3) | (-0.0) | (+2.3) | (+3.9) | (+8.7)
Table 2: Performance for all classes. For RFSR and VDSR at each resolution we note, in parentheses, the statistical difference in sigma from the baseline model (e.g. +0.5); the underlying mAP values are plotted in Figure S9. The ×4 rows occupy their input resolutions of 60 - 480 cm. For the 30 cm (2× sample) column we note the sigma difference between the native oversampled imagery and the 15 cm SR imagery.
Figure S10: Performance of YOLT models on boats as a function of sensor resolution. The lower axis indicates the sensor resolution, while the upper axis indicates the pixel extent of an object of average size.
Model | Data | 30 cm (2× sample) | 30 cm | 60 cm | 120 cm | 240 cm | 480 cm
YOLT | Native | 0.39 ± 0.05 | 0.32 ± 0.03 | 0.26 ± 0.03 | 0.14 ± 0.03 | 0.06 ± 0.02 | 0.01 ± 0
YOLT | RFSR ×2 | (+0.3) | 0.42 ± 0.05 (+1.6) | 0.31 ± 0.03 (+1.1) | 0.16 ± 0.03 (+0.5) | 0.07 ± 0.02 (+0.3) | 0.01 ± 0 (-0.1)
YOLT | VDSR ×2 | (+0.6) | 0.44 ± 0.05 (+2) | 0.29 ± 0.04 (+0.5) | 0.19 ± 0.04 (+1) | 0.03 ± 0.01 (-1.8) | 0.03 ± 0.01 (+2.1)
YOLT | RFSR ×4 | | | 0.31 ± 0.04 (+1.1) | 0.16 ± 0.03 (+0.6) | 0.08 ± 0.02 (+0.7) | 0.03 ± 0.01 (+1.4)
YOLT | VDSR ×4 | | | 0.31 ± 0.05 (+0.9) | 0.2 ± 0.03 (+1.4) | 0.09 ± 0.03 (+0.8) | 0.03 ± 0.01 (+1.8)
SSD | Native | 0.13 ± 0.03 | 0.12 ± 0.02 | 0.14 ± 0.03 | 0.03 ± 0.01 | 0.01 ± 0 | 0 ± 0
SSD | RFSR ×2 | (+0.2) | 0.14 ± 0.03 (+0.6) | 0.09 ± 0.02 (-1.3) | 0.05 ± 0.01 (+0.9) | 0.02 ± 0.01 (+1.1) | 0.01 ± 0 (+1.8)
SSD | VDSR ×2 | (-0.5) | 0.11 ± 0.02 (-0.2) | 0.1 ± 0.02 (-0.9) | 0.04 ± 0.01 (-2.9) | 0.03 ± 0.01 (+2.8) | 0.01 ± 0 (+2.1)
Table 3: Performance for the boat class. For RFSR and VDSR at each resolution we note the AP ± one-sigma error and, in parentheses, the statistical difference from the baseline model (e.g. +0.5). For the 30 cm (2× sample) column we note the sigma difference between the native oversampled imagery and the 15 cm SR imagery.
Figure S11: Performance of YOLT models on small aircraft as a function of sensor resolution. The lower axis indicates the sensor resolution, while the upper axis indicates the pixel extent of an object of average size.
Model | Data | 30 cm (2× sample) | 30 cm | 60 cm | 120 cm | 240 cm | 480 cm
YOLT | Native | 0.68 ± 0.09 | 0.57 ± 0.12 | 0.56 ± 0.12 | 0.47 ± 0.11 | 0.15 ± 0.06 | 0 ± 0
YOLT | RFSR ×2 | (+0.3) | 0.72 ± 0.1 (+1) | 0.56 ± 0.11 (+0) | 0.49 ± 0.1 (+0.1) | 0.25 ± 0.1 (+0.9) | 0.01 ± 0.01 (+1.6)
YOLT | VDSR ×2 | (+0.7) | 0.78 ± 0.1 (+1.3) | 0.56 ± 0.13 (+0) | 0.48 ± 0.11 (+0) | 0.09 ± 0.04 (-0.8) | 0 ± 0 (+1.2)
YOLT | RFSR ×4 | | | 0.73 ± 0.1 (+1.1) | 0.53 ± 0.12 (+0.3) | 0.13 ± 0.04 (-0.3) | 0.01 ± 0.01 (+1.2)
YOLT | VDSR ×4 | | | 0.73 ± 0.09 (+1.1) | 0.47 ± 0.12 (+0) | 0.2 ± 0.07 (+0.5) | 0.01 ± 0.01 (+1.1)
SSD | Native | 0.13 ± 0.05 | 0.14 ± 0.04 | 0.14 ± 0.04 | 0.05 ± 0.03 | 0 ± 0 | 0 ± 0
SSD | RFSR ×2 | (-1.4) | 0.05 ± 0.03 (-1.8) | 0.24 ± 0.07 (+1.2) | 0.04 ± 0.02 (-0.1) | 0 ± 0 (0) | 0 ± 0 (0)
SSD | VDSR ×2 | (+1.8) | 0.37 ± 0.12 (+1.8) | 0.15 ± 0.07 (+0.1) | 0.08 ± 0.03 (-1.2) | 0.01 ± 0.01 (+0.9) | 0 ± 0 (0)
Table 4: Performance for the small aircraft class. For RFSR and VDSR at each resolution we note the AP ± one-sigma error and, in parentheses, the statistical difference from the baseline model (e.g. +0.5). For the 30 cm (2× sample) column we note the sigma difference between the native oversampled imagery and the 15 cm SR imagery.
Figure S12: Performance of YOLT models on large aircraft as a function of sensor resolution. The lower axis indicates the sensor resolution, while the upper axis indicates the pixel extent of an object of average size. Performance dips at the highest resolution because for the ×2 models we super-resolve 30 cm to 15 cm and halve the field of view; for the ×4 model we halve the field of view of the 60 cm model.
Model | Data | 30 cm (2× sample) | 30 cm | 60 cm | 120 cm | 240 cm | 480 cm
YOLT | Native | 0.25 ± 0.05 | 0.67 ± 0.05 | 0.7 ± 0.04 | 0.69 ± 0.04 | 0.59 ± 0.04 | 0.48 ± 0.05
YOLT | RFSR ×2 | (-0.6) | 0.21 ± 0.06 (-6.3) | 0.74 ± 0.04 (+0.7) | 0.69 ± 0.04 (+0) | 0.67 ± 0.04 (+1.4) | 0.57 ± 0.05 (+1.3)
YOLT | VDSR ×2 | (+4.3) | 0.54 ± 0.04 (-1.9) | 0.68 ± 0.04 (-0.4) | 0.65 ± 0.04 (-0.8) | 0.53 ± 0.05 (-1) | 0.56 ± 0.05 (+1.3)
YOLT | RFSR ×4 | | | 0.5 ± 0.04 (-3.4) | 0.75 ± 0.04 (+0.9) | 0.68 ± 0.05 (+1.5) | 0.56 ± 0.05 (+1.1)
YOLT | VDSR ×4 | | | 0.54 ± 0.04 (-2.8) | 0.7 ± 0.04 (+0.1) | 0.71 ± 0.04 (+2.2) | 0.43 ± 0.05 (-0.7)
SSD | Native | 0.46 ± 0.04 | 0.7 ± 0.04 | 0.68 ± 0.04 | 0.69 ± 0.04 | 0.36 ± 0.05 | 0 ± 0
SSD | RFSR ×2 | (+2.5) | 0.58 ± 0.03 (-2.4) | 0.67 ± 0.04 (-0.2) | 0.65 ± 0.04 (-0.7) | 0.56 ± 0.05 (+2.7) | 0.34 ± 0.05 (+6.9)
SSD | VDSR ×2 | (+1.2) | 0.53 ± 0.05 (-2.7) | 0.63 ± 0.04 (-1) | 0.65 ± 0.04 (-0.5) | 0.55 ± 0.04 (+2.8) | 0.41 ± 0.05 (+8.6)
Table 5: Performance for the large aircraft class. For RFSR and VDSR at each resolution we note the AP ± one-sigma error and, in parentheses, the statistical difference from the baseline model (e.g. +0.5). For the 30 cm (2× sample) column we note the sigma difference between the native oversampled imagery and the 15 cm SR imagery.
Figure S13: Performance of YOLT models on small vehicles as a function of sensor resolution. The lower axis indicates the sensor resolution, while the upper axis indicates the pixel extent of an object of average size. The uptick at 60 cm for the ×4 model is caused primarily by the smaller field of view when super-resolving to 15 cm (this is mimicked by the solid blue point at 20 cm, which is the oversampled native imagery).
Model | Data | 30 cm (2× sample) | 30 cm | 60 cm | 120 cm | 240 cm | 480 cm
YOLT | Native | 0.81 ± 0.01 | 0.6 ± 0.02 | 0.57 ± 0.02 | 0.39 ± 0.02 | 0.19 ± 0.01 | 0.03 ± 0
YOLT | RFSR ×2 | (-0.5) | 0.81 ± 0.01 (+10.2) | 0.59 ± 0.02 (+0.6) | 0.44 ± 0.02 (+1.8) | 0.2 ± 0.01 (+0.5) | 0.02 ± 0 (-2.5)
YOLT | VDSR ×2 | (-2.7) | 0.78 ± 0.01 (+8.6) | 0.59 ± 0.02 (+0.8) | 0.45 ± 0.02 (+1.9) | 0.19 ± 0.01 (-0.1) | 0.02 ± 0 (-2.1)
YOLT | RFSR ×4 | | | 0.74 ± 0.01 (+7.3) | 0.36 ± 0.02 (-0.9) | 0.19 ± 0.01 (-0.1) | 0.02 ± 0 (-1.5)
YOLT | VDSR ×4 | | | 0.75 ± 0.01 (+7.6) | 0.35 ± 0.02 (-1.3) | 0.19 ± 0.01 (+0.2) | 0.02 ± 0 (-1.5)
SSD | Native | 0.61 ± 0.02 | 0.48 ± 0.01 | 0.46 ± 0.02 | 0.27 ± 0.02 | 0.04 ± 0.01 | 0 ± 0
SSD | RFSR ×2 | (+6.2) | 0.75 ± 0.01 (+14.3) | 0.49 ± 0.02 (+1.2) | 0.31 ± 0.02 (+1.7) | 0.07 ± 0.01 (+2.4) | 0 ± 0 (-2.5)
SSD | VDSR ×2 | (+4.4) | 0.72 ± 0.02 (+11.1) | 0.49 ± 0.02 (+1.2) | 0.38 ± 0.02 (-3.2) | 0.08 ± 0.01 (+3.4) | 0 ± 0 (-2.5)
Table 6: Performance for the small vehicle class. For RFSR and VDSR at each resolution we note the AP ± one-sigma error and, in parentheses, the statistical difference from the baseline model (e.g. +0.5). For the 30 cm (2× sample) column we note the sigma difference between the native oversampled imagery and the 15 cm SR imagery.
Figure S14: Performance of YOLT models on buses and trucks as a function of sensor resolution. The lower axis indicates the sensor resolution, while the upper axis indicates the pixel extent of an object of average size. The uptick at 60 cm for the ×4 model is caused primarily by the smaller field of view when super-resolving to 15 cm (this is mimicked by the solid blue point at 20 cm, which is the oversampled native imagery).
Model | Data | 30 cm (2× sample) | 30 cm | 60 cm | 120 cm | 240 cm | 480 cm
YOLT | Native | 0.44 ± 0.02 | 0.34 ± 0.02 | 0.32 ± 0.01 | 0.21 ± 0.01 | 0.07 ± 0.01 | 0.01 ± 0
YOLT | RFSR ×2 | (+0.3) | 0.44 ± 0.02 (+4.2) | 0.33 ± 0.01 (+0.6) | 0.21 ± 0.01 (-0.2) | 0.08 ± 0.01 (+1.1) | 0.01 ± 0 (+1.1)
YOLT | VDSR ×2 | (+0.9) | 0.46 ± 0.02 (+5) | 0.35 ± 0.01 (+1.2) | 0.22 ± 0.01 (+0.8) | 0.08 ± 0.01 (+1.4) | 0.01 ± 0 (+1.8)
YOLT | RFSR ×4 | | | 0.4 ± 0.02 (+3.9) | 0.2 ± 0.01 (-0.4) | 0.08 ± 0.01 (+0.9) | 0.01 ± 0 (+1.9)
YOLT | VDSR ×4 | | | 0.42 ± 0.02 (+4.5) | 0.21 ± 0.01 (+0) | 0.07 ± 0.01 (+0.7) | 0.01 ± 0 (+2.6)
SSD | Native | 0.18 ± 0.01 | 0.28 ± 0.01 | 0.16 ± 0.01 | 0.06 ± 0.01 | 0 ± 0 | 0 ± 0
SSD | RFSR ×2 | (+3.8) | 0.26 ± 0.02 (-0.8) | 0.17 ± 0.01 (+0.9) | 0.14 ± 0.01 (+6.1) | 0 ± 0 (+3.2) | 0 ± 0 (+0)
SSD | VDSR ×2 | (+6.2) | 0.33 ± 0.02 (+2) | 0.2 ± 0.01 (+2.7) | 0.15 ± 0.01 (-0.5) | 0.01 ± 0 (+4.4) | 0 ± 0 (+0)
Table 7: Performance for the truck and bus class. For RFSR and VDSR at each resolution we note the AP ± one-sigma error and, in parentheses, the statistical difference from the baseline model (e.g. +0.5). For the 30 cm (2× sample) column we note the sigma difference between the native oversampled imagery and the 15 cm SR imagery.