You Only Look Twice: Rapid Multi-Scale Object Detection In Satellite Imagery

by Adam Van Etten, et al.
In-Q-Tel, Inc.

Detection of small objects in large swaths of imagery is one of the primary problems in satellite imagery analytics. While object detection in ground-based imagery has benefited from research into new deep learning approaches, transitioning such technology to overhead imagery is nontrivial. Among the challenges is the sheer number of pixels and geographic extent per image: a single DigitalGlobe satellite image encompasses >64 km² and over 250 million pixels. Another challenge is that objects of interest are minuscule (often only 10 pixels in extent), which complicates traditional computer vision techniques. To address these issues, we propose a pipeline (You Only Look Twice, or YOLT) that evaluates satellite images of arbitrary size at a rate of >0.5 km²/s. The proposed approach can rapidly detect objects of vastly different scales with relatively little training data over multiple sensors. We evaluate large test images at native resolution, and obtain scores of F1 > 0.8 for vehicle localization. We further explore resolution and object size requirements by systematically testing the pipeline at decreasing resolution, and conclude that objects only 5 pixels in size can still be localized with high confidence. Code is available at




1. Introduction

Computer vision techniques have made great strides in the past few years since the introduction of convolutional neural networks (Krizhevsky et al., 2012) in the ImageNet (Russakovsky et al., 2015) competition. The availability of large, high-quality labelled datasets such as ImageNet (Russakovsky et al., 2015), PASCAL VOC (Everingham et al., 2010) and MS COCO (Lin et al., 2014) has helped spur a number of impressive advances in rapid object detection that run in near real-time; three of the best are Faster R-CNN (Ren et al., 2015), SSD (Liu et al., 2015), and YOLO (Redmon et al., 2015) (Redmon and Farhadi, 2016). Faster R-CNN typically ingests 1000 × 600 pixel images, whereas SSD uses 300 × 300 or 512 × 512 pixel input images, and YOLO runs on either 416 × 416 or 544 × 544 pixel inputs. While the performance of all these frameworks is impressive, none can come remotely close to ingesting the input sizes typical of satellite imagery. Of these three frameworks, YOLO has demonstrated the greatest inference speed and highest score on the PASCAL VOC dataset. The authors also showed that this framework is highly transferable to new domains by demonstrating superior performance to other frameworks (i.e., SSD and Faster R-CNN) on the Picasso Dataset (Ginosar et al., 2014) and the People-Art Dataset (Cai et al., 2015). Due to the speed, accuracy, and flexibility of YOLO, we accordingly leverage this system as the inspiration for our satellite imagery object detection framework.

The application of deep learning methods to traditional object detection pipelines is non-trivial for a variety of reasons. The unique aspects of satellite imagery necessitate algorithmic contributions to address challenges related to the spatial extent of foreground target objects, complete rotation invariance, and a large scale search space. Excluding implementation details, algorithms must adjust for:

Small spatial extent:

In satellite imagery, objects of interest are often very small and densely clustered, rather than the large and prominent subjects typical in ImageNet data. In the satellite domain, resolution is typically defined as the ground sample distance (GSD), which describes the physical size of one image pixel. Commercially available imagery varies from 30 cm GSD for the sharpest DigitalGlobe imagery, to 3-4 meter GSD for Planet imagery. This means that for small objects such as cars each object will be only ~15 pixels in extent even at the highest resolution.

Complete rotation invariance:

Objects viewed from overhead can have any orientation (e.g. ships can have any heading between 0 and 360 degrees, whereas trees in ImageNet data are reliably vertical).

Training example frequency:

There is a relative dearth of training data (though efforts such as SpaceNet are attempting to ameliorate this issue).

Ultra high resolution:

Input images are enormous (often hundreds of megapixels), so simply downsampling to the input size required by most algorithms (a few hundred pixels) is not an option (see Figure 1).

The contribution in this work specifically addresses each of these issues separately, while leveraging the relatively constant distance from sensor to object, which is well known and is typically ~400 km. This, coupled with the nadir-facing sensor, results in a consistent pixel size for objects.

Figure 1. DigitalGlobe 8 × 8 km (approximately 16,000 × 16,000 pixels) image at 50 cm GSD near the Panama Canal. One 416 × 416 pixel sliding window cutout is shown in red. For an image this size, there are roughly 1,500 unique cutouts.

Section 2 details in further depth the challenges faced by standard algorithms when applied to satellite imagery. The remainder of this work is broken up to describe the proposed contributions as follows. To address small, dense clusters, Section 3.1 describes a new, finer-grained network architecture. Sections 3.2 and 3.3 detail our method for splitting, evaluating, and recombining large test images of arbitrary size at native resolution. With regard to rotation invariance and small labelled training dataset sizes, Section 4 describes data augmentation and size requirements. Finally, the performance of the algorithm is discussed in detail in Section 6.

2. Related Work

Deep learning approaches have proven effective for ground-based object detection, though current techniques are often still suboptimal for overhead imagery applications. For example, small objects in groups, such as flocks of birds, present a challenge (Redmon et al., 2015), caused in part by the multiple downsampling layers of all three convolutional network approaches listed above (YOLO, SSD, Faster R-CNN). Further, these multiple downsampling layers result in relatively coarse features for object differentiation; this poses a problem if objects of interest are only a few pixels in extent. For example, consider the default YOLO network architecture, which downsamples by a factor of 32 and returns a 13 × 13 prediction grid; this means that object differentiation is problematic if object centroids are separated by less than 32 pixels. Accordingly, we implement a unique network architecture with a denser final prediction grid. This improves performance by yielding finer-grained features to help differentiate between classes. This finer prediction grid also permits classification of smaller objects and denser clusters.
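The relationship between input size, downsampling factor, and grid density can be checked with a few lines of Python. This is an illustrative sketch only: the function name `prediction_grid` is hypothetical, and the 416 pixel input side is assumed from the cutout size used elsewhere in this paper. The factor of 16 corresponds to the denser architecture described in Section 3.1.

```python
def prediction_grid(input_px: int, downsample: int) -> int:
    """Number of prediction cells per side for a square input image."""
    if input_px % downsample != 0:
        raise ValueError("input side must be a multiple of the downsampling factor")
    return input_px // downsample


for factor in (32, 16):
    cells = prediction_grid(416, factor)  # 416 px input side is an assumption
    # Each cell spans `factor` input pixels, so centroids closer together
    # than that tend to fall into the same cell.
    print(f"downsample {factor}: {cells} x {cells} grid, {factor} px per cell")
```

Halving the downsampling factor doubles the number of cells per side, which is why the denser grid helps separate closely packed objects.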

Another reason object detection algorithms struggle with satellite imagery is that they have difficulty generalizing to objects in new or unusual aspect ratios or configurations (Redmon et al., 2015). Since objects can have arbitrary heading, this limited range of invariance to rotation is troublesome. Our approach remedies this complication with rotations and augmentation of data. Specifically, we rotate training images about the unit circle to ensure that the classifier is agnostic to object heading, and also randomly scale the images in HSV (hue-saturation-value) to increase the robustness of the classifier to varying sensors, atmospheric conditions, and lighting conditions.
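The two augmentations can be sketched per label point and per pixel using only the Python standard library. This is a minimal illustration, not the pipeline's implementation: the function names, the `max_scale` range of 1.5, and the choice to scale only saturation and value (leaving hue untouched) are all assumptions made for the example.

```python
import colorsys
import math
import random


def rotate_point(x, y, angle_deg, cx, cy):
    """Rotate (x, y) about (cx, cy); used to move a label with its image."""
    t = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(t) - dy * math.sin(t),
            cy + dx * math.sin(t) + dy * math.cos(t))


def scale_hsv(rgb, sat_scale, val_scale):
    """Scale saturation and value of one RGB pixel (channels in [0, 1])."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    s = min(1.0, s * sat_scale)
    v = min(1.0, v * val_scale)
    return colorsys.hsv_to_rgb(h, s, v)


def random_augmentation(rng, max_scale=1.5):
    """Draw one random heading and a pair of HSV scale factors (assumed ranges)."""
    angle = rng.uniform(0.0, 360.0)
    sat = rng.uniform(1.0 / max_scale, max_scale)
    val = rng.uniform(1.0 / max_scale, max_scale)
    return angle, sat, val


rng = random.Random(0)
angle, sat, val = random_augmentation(rng)
print(rotate_point(10.0, 0.0, angle, 0.0, 0.0))
print(scale_hsv((0.2, 0.4, 0.6), sat, val))
```

In image coordinates, where the y axis points down, the same formula rotates in the opposite visual sense; what matters for augmentation is that image pixels and label coordinates are rotated consistently.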

In advanced object detection techniques the network sees the entire image at train and test time. While this greatly improves background differentiation since the network encodes contextual (background) information for each object, the memory footprint on typical hardware (NVIDIA Titan X GPUs with 12GB RAM) is infeasible for a 256 megapixel image.

We also note that the large sizes of satellite images preclude simple approaches to some of the problems noted above. For example, upsampling the image to ensure that objects of interest are large and dispersed enough for standard architectures is infeasible, since this approach would also increase runtime many-fold. Similarly, running a sliding window classifier across the image to search for objects of interest quickly becomes computationally intractable, since multiple window sizes will be required for each object size. For perspective, one must evaluate over one million sliding window cutouts if the target is a 10 meter boat in a DigitalGlobe image. Our response is to leverage rapid object detection algorithms to evaluate satellite imagery with a combination of local image interpolation on reasonably sized image chips (approximately 200 meters) and a multi-scale ensemble of detectors.

To demonstrate the challenges of satellite imagery analysis, we train a YOLO model with the standard network architecture (13 × 13 grid) to recognize cars in 416 × 416 pixel cutouts of the COWC overhead imagery dataset (Mundhenk et al., 2016) (see Section 4 for further details on this dataset). Naively evaluating a large test image (see Figure 2) with this network yields a very high false positive rate, due to the downsampling of the test image. Even appropriately sized image chips are problematic (again, see Figure 2), as the standard YOLO network architecture cannot differentiate objects with centroids separated by less than 32 pixels. Therefore, even if one restricts attention to a small cutout, performance is often poor in high density regions with the standard architecture.

Figure 2. Challenges of the standard object detection network architecture when applied to overhead vehicle detection. Each image uses the same standard YOLO architecture model trained on 416 × 416 pixel cutouts of cars from the COWC dataset. Left: Model applied to a large test image downsampled to a size of 416 × 416; none of the 1142 cars in this image are detected. Right: Model applied to a small 416 × 416 pixel cutout; the excessive false negative rate is due to the high density of cars that cannot be differentiated by the 13 × 13 grid.

3. You Only Look Twice

In order to address the limitations discussed in Section 2, we implement an object detection framework optimized for overhead imagery: You Only Look Twice (YOLT). We extend the Darknet neural network framework (Redmon, 2017) and update a number of the C libraries to enable analysis of geospatial imagery and integrate with external python libraries. We opt to leverage the flexibility and large user community of python for pre- and post-processing. Between the updates to the C code and the pre and post-processing code written in python, interested parties need not have any knowledge of C to train, test, or deploy YOLT models.

Figure 3. Limitations of the YOLO framework (left column, quotes from (Redmon et al., 2015)), along with YOLT contributions to address these limitations (right column).

3.1. Network Architecture

To reduce model coarseness and accurately detect dense objects (such as cars or buildings), we implement a network architecture that uses 22 layers and downsamples by a factor of 16. Thus, a 416 × 416 pixel input image yields a 26 × 26 prediction grid. Our architecture is inspired by the 30-layer YOLO network, though this new architecture is optimized for small, densely packed objects. The dense grid is unnecessary for diffuse objects such as airports, but crucial for high density scenes such as parking lots (see Figure 2). To improve the fidelity of small objects, we also include a passthrough layer (described in (Redmon and Farhadi, 2016), and similar to identity mappings in ResNet (He et al., 2015)) that concatenates the final layer onto the last convolutional layer, allowing the detector access to finer-grained features of this expanded feature map.

Each convolutional layer is batch normalized with a leaky rectified linear activation, save the final layer, which utilizes a linear activation. The final layer provides predictions of bounding boxes and classes, and has size:

$N_f = N_{boxes} \times (N_{classes} + 5)$

where $N_{boxes}$ is the number of boxes per grid (5 by default), and $N_{classes}$ is the number of object classes (Redmon et al., 2015).

Layer  Type           Filters  Size/Stride  Output Size
0      Convolutional  32       3×3 / 1      416×416×32
1      Maxpool                 2×2 / 2      208×208×32
2      Convolutional  64       3×3 / 1      208×208×64
3      Maxpool                 2×2 / 2      104×104×64
4      Convolutional  128      3×3 / 1      104×104×128
5      Convolutional  64       1×1 / 1      104×104×64
6      Convolutional  128      3×3 / 1      104×104×128
7      Maxpool                 2×2 / 2      52×52×64
8      Convolutional  256      3×3 / 1      52×52×256
9      Convolutional  128      1×1 / 1      52×52×128
10     Convolutional  256      3×3 / 1      52×52×256
11     Maxpool                 2×2 / 2      26×26×256
12     Convolutional  512      3×3 / 1      26×26×512
13     Convolutional  256      1×1 / 1      26×26×256
14     Convolutional  512      3×3 / 1      26×26×512
15     Convolutional  256      1×1 / 1      26×26×256
16     Convolutional  512      3×3 / 1      26×26×512
17     Convolutional  1024     3×3 / 1      26×26×1024
18     Convolutional  1024     3×3 / 1      26×26×1024
19     Passthrough    10 → 20               26×26×1024
20     Convolutional  1024     3×3 / 1      26×26×1024
21     Convolutional  N_f      1×1 / 1      26×26×N_f
Table 1. YOLT Network Architecture
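As a consistency check on Table 1, the sketch below walks the layer sequence and tracks the spatial size, then computes the depth of the final layer. The helper names are hypothetical, and the depth formula (four box coordinates plus one objectness score plus the class scores, per box) is assumed from the YOLOv2 convention of (Redmon and Farhadi, 2016) rather than taken from the YOLT source code.

```python
# Layer types 0..21 in the order listed in Table 1.
LAYERS = (
    ["conv", "pool", "conv", "pool", "conv", "conv", "conv", "pool",
     "conv", "conv", "conv", "pool"]
    + ["conv"] * 7
    + ["passthrough", "conv", "conv"]
)


def output_side(input_px, layers):
    """Spatial side length after all layers; only 2x2/2 maxpools shrink it."""
    side = input_px
    for kind in layers:
        if kind == "pool":
            side //= 2
    return side


def final_layer_depth(n_boxes, n_classes):
    """Assumed YOLOv2-style depth: each box carries 4 coords + 1 score + classes."""
    return n_boxes * (n_classes + 5)


print(len(LAYERS), "layers")
print("grid side for a 416 px input:", output_side(416, LAYERS))
print("final depth for 5 boxes, 5 classes:", final_layer_depth(5, 5))
```

Four stride-2 maxpool layers give the overall downsampling factor of 2^4 = 16 quoted in the text.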

3.2. Test Procedure

At test time, we partition testing images of arbitrary size into manageable cutouts and run each cutout through our trained model. Partitioning takes place via a sliding window with user-defined bin sizes and overlap (15% by default); see Figure 4. We record the position of each sliding window cutout by naming each cutout according to the schema:

ImageName|row_column_height_width.ext

For example:

panama50cm|1370_1180_416_416.tif

Figure 4. Graphic of testing procedure for large image sizes, showing a sliding window going from left to right across Figure 1. The overlap of the bottom right image is shown in red. Non-maximal suppression of this overlap is necessary to refine detections at the edge of the cutouts.
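The partitioning and naming steps can be sketched as follows. This is not the YOLT source: the function names are hypothetical, the underscore field separator and the 0.15 overlap fraction are assumptions, and placing the last window flush with the image edge is one of several reasonable ways to guarantee full coverage.

```python
def window_starts(extent, window, stride):
    """Start offsets along one axis; the last window is flush with the edge."""
    if extent <= window:
        return [0]
    starts = list(range(0, extent - window, stride))
    starts.append(extent - window)
    return starts


def sliding_windows(height, width, window=416, overlap=0.15):
    """Return (row, col, h, w) tuples that together cover the whole image."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(window * (1.0 - overlap)))
    h, w = min(window, height), min(window, width)
    return [(row, col, h, w)
            for row in window_starts(height, window, stride)
            for col in window_starts(width, window, stride)]


def cutout_name(image_name, row, col, h, w, ext="tif", sep="_"):
    """Encode the cutout position in its file name (separator is assumed)."""
    return f"{image_name}|{row}{sep}{col}{sep}{h}{sep}{w}.{ext}"


def parse_cutout_name(name, sep="_"):
    """Recover (image_name, row, col, h, w) from a cutout file name."""
    image_name, rest = name.split("|", 1)
    fields = rest.rsplit(".", 1)[0].split(sep)
    row, col, h, w = (int(f) for f in fields)
    return image_name, row, col, h, w


for win in sliding_windows(1000, 1000):
    print(cutout_name("example", *win))
```

Because the position is stored in the file name, post-processing needs no separate index file to map a cutout back to the original image.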

3.3. Post-Processing

Much of the utility of satellite (or aerial) imagery lies in its inherent ability to map large areas of the globe. Thus, small image chips are far less useful than the large field of view images produced by satellite platforms. The final step in the object detection pipeline therefore seeks to stitch together the hundreds or thousands of testing chips into one final image strip.

For each cutout the bounding box position predictions returned from the classifier are adjusted according to the row and column values of that cutout; this provides the global position of each bounding box prediction in the original input image. The overlap ensures all regions will be analyzed, but also results in overlapping detections on the cutout boundaries. We apply non-maximal suppression to the global matrix of bounding box predictions to alleviate such overlapping detections.
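A minimal sketch of these two post-processing steps is given below, assuming detections are stored as `(x_min, y_min, x_max, y_max, score)` tuples in cutout-local pixel coordinates. The function names, the greedy suppression scheme, the IOU threshold of 0.5, and the sample detections are illustrative assumptions, not values taken from the YOLT code.

```python
def to_global(box, row, col):
    """Shift a cutout-local box by the cutout's row/column offset."""
    x0, y0, x1, y1, score = box
    return (x0 + col, y0 + row, x1 + col, y1 + row, score)


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1, ...) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_thresh for k in kept):
            kept.append(box)
    return kept


# Hypothetical detections: (cutout row, cutout col, local box).
detections = [
    (0, 0, (400, 100, 416, 116, 0.6)),   # car clipped at the edge of cutout A
    (0, 353, (45, 100, 63, 116, 0.9)),   # same car seen whole in cutout B
    (0, 0, (10, 10, 26, 26, 0.8)),       # unrelated car
]
global_boxes = [to_global(box, row, col) for row, col, box in detections]
print(non_max_suppression(global_boxes))
```

In this example the clipped, lower-scoring detection from cutout A is suppressed in favor of the complete detection from the overlapping cutout B.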

4. Training Data

Training data is collected from small chips of large images from three sources: DigitalGlobe satellites, Planet satellites, and aerial platforms. Labels consist of a bounding box and category identifier for each object. We initially focus on five categories: airplanes, boats, building footprints, cars, and airports. For objects of very different scales (e.g. airplanes vs airports) we show in Section 6.2 that using two different detectors at different scales is very effective.

Figure 5. YOLT Training data. The top row displays imagery and labels for vehicles. The top left panel shows airplanes labels overlaid on DigitalGlobe imagery, while the middle panel displays boats overlaid on DigitalGlobe data. The top right panel shows aerial imagery of cars from the COWC dataset (Mundhenk et al., 2016), with the red dot denoting the COWC label and the purple box our inferred 3 meter bounding box. The lower left panel shows an airport (orange) in downsampled Planet imagery. The lower middle panel shows SpaceNet building footprints in yellow, and the lower right image displays inferred YOLT bounding box labels in red.

Cars:

The Cars Overhead with Context (COWC) (Mundhenk et al., 2016) dataset is a large, high quality set of annotated cars from overhead imagery collected over multiple locales. Data is collected via aerial platforms, but at a nadir view angle such that it resembles satellite imagery. The imagery has a resolution of 15 cm GSD, approximately double the current best resolution of commercial satellite imagery (30 cm GSD for DigitalGlobe). Accordingly, we convolve the raw imagery with a Gaussian kernel and reduce the image dimensions by half to create the equivalent of 30 cm GSD images. Labels consist of simply a dot at the centroid of each car, and we draw a 3 meter bounding box around each car for training purposes. We reserve the largest geographic region (Utah) for testing, leaving 13,303 labelled training cars.
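Converting a centroid label into a 3 meter box is a one-line calculation once the GSD is known. The sketch below is illustrative only: the function name is hypothetical, and clipping the box to the image bounds is an assumption added for robustness near chip edges, not something stated in the text.

```python
def centroid_to_box(x, y, gsd_cm, box_cm=300, width=None, height=None):
    """Square box of side box_cm (in pixels at gsd_cm) centered on (x, y)."""
    half = box_cm / gsd_cm / 2.0
    x0, y0, x1, y1 = x - half, y - half, x + half, y + half
    # Optional clipping to the image extent (an assumption of this sketch).
    if width is not None:
        x0, x1 = max(0.0, x0), min(float(width), x1)
    if height is not None:
        y0, y1 = max(0.0, y0), min(float(height), y1)
    return x0, y0, x1, y1


# At 30 cm GSD a 3 m box is 10 pixels on a side.
print(centroid_to_box(100, 50, gsd_cm=30))
# Near the image corner the box is clipped.
print(centroid_to_box(2, 2, gsd_cm=30, width=416, height=416))
```

Working in centimeters keeps the arithmetic exact for the GSD values used in this paper.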

Building Footprints:

The second round of SpaceNet data consists of 30 cm GSD DigitalGlobe imagery and labelled building footprints over four cities: Las Vegas, Paris, Shanghai, and Khartoum. The labels are precise building footprints, which we transform into bounding boxes encompassing 90% of the extent of the footprint. Image segmentation approaches show great promise for this challenge; nevertheless, we explore YOLT performance on building outline detection, acknowledging that since YOLT outputs bounding boxes it will never achieve perfect building footprint detection for complex building shapes. Between the four cities there are 221,336 labelled buildings.


Airplanes:

We label eight DigitalGlobe images over airports for a total of 230 objects in the training set.


Boats:

We label three DigitalGlobe images taken over coastal regions for a total of 556 boats.


Airports:

We label airports in 37 Planet images for training purposes, each with a single airport per chip. For objects the size of airports, some downsampling is required, as runways can exceed 1000 pixels in length even in low resolution Planet imagery; we therefore downsample Planet imagery by a factor of four for training purposes.

The raw training datasets for airplanes, airports, and watercraft are quite small by computer vision standards, and a larger dataset may improve the inference performance detailed in Section 6.

We train with stochastic gradient descent and maintain many of the hyperparameters of (Redmon and Farhadi, 2016): 5 boxes per grid, an initial learning rate of 10^-3, a weight decay of 0.0005, and a momentum of 0.9. Training takes 2-3 days on a single NVIDIA Titan X GPU.

5. Test Images

To ensure evaluation robustness, all test images are taken from different geographic regions than training examples. For cars, we reserve the largest geographic region of Utah for testing, yielding 19,807 test cars. Building footprints are split 75/25 train/test, leaving 73,778 test footprints. We label four airport test images for a total of 74 airplanes. Four boat images are labelled, yielding 771 test boats. Our dataset for airports is smaller, with ten Planet images used for testing. See Table 2 for the train/test split for each category.

Object Class   Training Examples   Test Examples
Airport*       37                  10
Airplane*      230                 74
Boat*          556                 100
Car†           19,807              13,303
Building†      221,336             73,778

* Internally Labelled
† External Dataset

Table 2. Train/Test Split

6. Object Detection Results

6.1. Universal Classifier Object Detection Results

Initially, we attempt to train a single classifier to recognize all five categories listed above, both vehicles and infrastructure. We note a number of spurious airport detections in this example (see Figure 6), as downsampled runways look similar to highways at the wrong scale.

Figure 6. Poor results of the universal model applied to DigitalGlobe imagery at two different scales (200m, 1500m). Airplanes are in red. The cyan boxes mark spurious detections of runways, caused in part by confusion from small scale linear structures such as highways.

6.2. Scale Confusion Mitigation

There are multiple ways one could address the false positive issues noted in Figure 6. Recall from Section 4 that for this exploratory work our training set consists of only a few dozen airports, far smaller than usual for deep learning models. Increasing this training set size might improve our model, particularly if the background is highly varied. Another option would be to use post-processing to remove any detections at the incorrect scale (e.g. an airport detection far smaller than any real airport). A third option is to simply build dual classifiers, one for each relevant scale.

We opt to utilize the scale information present in satellite imagery and run two different classifiers: one trained for vehicles + buildings, and the other trained only to look for airports. Running the second airport classifier on downsampled images has a minimal impact on runtime performance, since in a given image there are approximately 100 times more 200 meter chips than 2000 meter chips.

6.3. Dual Classifier Results

For large validation images, we run the classifier at two different scales: 200m, and 2500m. The first scale is designed for vehicles and buildings, and the larger scale is optimized for large infrastructure such as airports. We break the validation image into appropriately sized image chips and run each image chip on the appropriate classifier. The myriad results from the many image chips and multiple classifiers are combined into one final image, and overlapping detections are merged via non-maximal suppression. We find a detection probability threshold of between 0.3 and 0.4 yields the highest F1 score for our validation images.

We define a true positive as having an intersection over union (IOU) of greater than a given threshold. An IOU of 0.5 is often used as the threshold for a correct detection, though as in Equation 5 of ImageNet (Russakovsky et al., 2015) we select a lower threshold of 0.25 for vehicles since we are dealing with very small objects. For SpaceNet building footprints and airports we use an IOU of 0.5.
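One way to see why a lower threshold is reasonable for small objects is to compute the IOU of two equal squares offset by a couple of pixels. The sketch below is a worked illustration under assumed numbers (a 10 pixel box and a 2 pixel localization error in each axis); the function name is hypothetical.

```python
def iou_of_shifted_square(size, dx, dy):
    """IOU of two size x size squares offset by (dx, dy) pixels."""
    overlap_x = max(0, size - abs(dx))
    overlap_y = max(0, size - abs(dy))
    inter = overlap_x * overlap_y
    union = 2 * size * size - inter
    return inter / union


small = iou_of_shifted_square(10, 2, 2)    # car-sized box
large = iou_of_shifted_square(100, 2, 2)   # much larger object
print(f"10 px box, 2 px offset:  IOU = {small:.3f}")
print(f"100 px box, 2 px offset: IOU = {large:.3f}")
```

For the 10 pixel box the 2 pixel offset already drops the IOU to 64/136, which fails a 0.5 threshold but passes 0.25, while the same offset barely affects the 100 pixel box.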

Figure 7. Car detection performance on a meter aerial image over Salt Lake City () at 30 cm GSD with 1389 cars present. False positives are shown in red, false negatives are yellow, true positives are green, and blue rectangles denote ground truth for all true positive detections. F1 = 0.95 for this test image, and GPU processing time is second.
Figure 8. YOLT classifier applied to a SpaceNet DigitalGlobe 50 cm GSD image containing airplanes (blue), boats (red), and runways (orange). In this image we note the following F1 scores: airplanes = 0.83, boats = 0.84, airports = 1.0.

Table 3 displays object detection performance and speed over all test images for each object category. YOLT performs relatively well on airports, airplanes, and boats, despite small training set sizes. YOLT is not optimized for building footprint extraction, though it performs somewhat competitively on the SpaceNet dataset; the top score on the recent SpaceNet challenge achieved an F1 score of 0.69, while the YOLT score of 0.61 puts it in the top 3. We report inference speed in terms of GPU time to run the inference step. Inference runs rapidly on the GPU. Currently, pre-processing (i.e., splitting test images into smaller cutouts) and post-processing (i.e., stitching results back into one global image) are not fully optimized and are performed on the CPU, which adds to the total run time. At the inference rates of Table 3, localizing all vehicles in an area the size of Washington DC takes on the order of minutes, and localizing airports over this area takes on the order of seconds. DigitalGlobe's WorldView3 satellite covers a maximum of 680,000 km² per day, so at YOLT inference speed a 16 GPU cluster would provide real-time inference on satellite imagery.

Object Class   F1 Score   Run Time (km²/min)
Car†                      32
Airplane†                 32
Boat†                     32
Building‡                 32
Airport‡                  6000

† IOU = 0.25
‡ IOU = 0.5

Table 3. YOLT Performance and Speed
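The GPU cluster estimate can be reproduced with simple arithmetic. The sketch below assumes the vehicle run time in Table 3 is 32 km² per minute per GPU (consistent with the >0.5 km²/s quoted in the abstract) and uses the 680,000 km² per day coverage figure from the text; the function name is hypothetical.

```python
import math


def gpus_for_realtime(daily_km2, km2_per_min_per_gpu):
    """Minimum number of GPUs needed to keep up with one day of collection."""
    per_gpu_per_day = km2_per_min_per_gpu * 60 * 24  # km^2 per GPU per day
    return math.ceil(daily_km2 / per_gpu_per_day)


# 32 km^2/min per GPU is 46,080 km^2 per GPU per day.
print(gpus_for_realtime(680_000, 32))  # → 15
```

Fifteen GPUs is the bare minimum under these assumptions, so a 16 GPU cluster leaves a small margin.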

6.4. Detailed Performance Analysis

The large test set of cars in the nine Utah images of the COWC dataset enables detailed performance analyses. The majority of the cars lie in the image over central Salt Lake City, so we split this image into sixteen smaller regions to even out the number of cars per image. We remove one test scene that has only 61 cars, leaving 23 test scenes. We apply a YOLT model trained to find cars on these test scenes.

In Figure 9 we display the F1 score for each scene, along with the car count accuracy. Total car count in a specified region may be a more valuable metric in the commercial realm than F1 score. Accordingly, we compute the number of predicted cars for each scene as a fraction of the ground truth number. Like the F1 score, a value of 1.0 denotes perfect prediction for the fractional car count metric. The COWC (Mundhenk et al., 2016) authors sought to count (rather than localize) the number of cars in test images, and achieved an error of . Total count error for YOLT on the COWC data is .
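The two metrics measure different things, which a small sketch makes concrete. The function names and the counts used below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def count_fraction(n_predicted, n_truth):
    """Predicted object count as a fraction of the ground truth count."""
    return n_predicted / n_truth


# Hypothetical scene: 100 true cars, 100 detections, 90 of them correct.
tp, fp, fn = 90, 10, 10
print("F1:", round(f1_score(tp, fp, fn), 3))
print("count fraction:", count_fraction(tp + fp, tp + fn))
```

In this made-up scene the false positives and false negatives cancel in the count, so the count fraction is exactly 1.0 even though the localization F1 is only 0.9.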

Figure 9. Top: F1 score per COWC test scene. Bottom: Number of detections as a fraction of ground truth number. Dot colors correspond to the test scene, with the multiple red dots indicating central Salt Lake City cutouts. The dotted orange line denotes the weighted mean, with the yellow band displaying the weighted standard deviation.

Inspection of Figure 9 reveals that the F1 score and ground truth fraction are quite high for typical urban scenes (e.g. the scene shown in Figure 7). The worst outlier in Figure 9 has an F1 score of 0.67, with 2860 cars present. This location corresponds to an automobile junkyard, an understandably difficult region to analyze.

7. Resolution Performance Study

The uniformity of object sizes in the COWC (Mundhenk et al., 2016) dataset enables a detailed resolution study. To study the effects of resolution on object detection, we convolve the raw 15 cm imagery with a Gaussian kernel and reduce the image dimensions to create additional training and testing corpora at [0.30, 0.45, 0.60, 0.75, 0.90, 1.05, 1.20, 1.50, 1.80, 2.10, 2.40, 3.00] meters.
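The link between GSD, downsampling factor, and object size in pixels can be tabulated directly. The sketch below works in centimeters to keep the arithmetic exact; the 3 meter car box comes from Section 4, while the function and constant names are hypothetical.

```python
NATIVE_GSD_CM = 15
TARGET_GSD_CM = [30, 45, 60, 75, 90, 105, 120, 150, 180, 210, 240, 300]
CAR_BOX_CM = 300  # 3 meter bounding box drawn around each car


def resolution_table(native_cm, targets_cm, object_cm):
    """Rows of (GSD in meters, downsample factor, object size in pixels)."""
    rows = []
    for gsd in [native_cm] + list(targets_cm):
        rows.append((gsd / 100.0, gsd / native_cm, object_cm / gsd))
    return rows


for gsd_m, factor, pixels in resolution_table(NATIVE_GSD_CM, TARGET_GSD_CM, CAR_BOX_CM):
    print(f"GSD {gsd_m:.2f} m  downsample x{factor:.0f}  car ~{pixels:.1f} px")
```

Including the native 15 cm imagery, this gives thirteen resolutions, with the car box shrinking from 20 pixels to a single pixel.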

Figure 10. COWC (Mundhenk et al., 2016) training data convolved and resized to various resolutions from the original 15 cm resolution (top left); bounding box labels are plotted in blue.

Initially, we test the multi-resolution test data on a single model (trained at 0.30 meters), and in Figure 11 demonstrate that the ability of this model to extrapolate to multiple resolutions is poor. Subsequently, we train a separate model for each resolution, for thirteen models total. Creating a high quality labelled dataset at low resolution (2.4m GSD, for example) is only possible because we downsample from already labelled high resolution 15 cm data; typically low resolution data is very difficult to label with high accuracy.

Figure 11. Performance of the 0.3 m model applied to various resolutions. The 23 thin lines display the performance of each individual test scene; most of these lines are tightly clustered about the mean, denoted by the solid red line. The red band displays the standard deviation. The model performs best at the trained resolution of 0.3 m, and rapidly degrades when evaluated with lower resolution data; it also degrades somewhat for higher resolution 0.15 m data.

Figure 12. Object detection results at different resolutions on the same Salt Lake City cutout of COWC data. The cutout on the left is at 15 cm GSD, with an F1 score of 0.94, while the cutout on the right is at 90 cm GSD, with an F1 score of 0.84.

Figure 13. Object detection F1 score for ground sample distances of 0.15 to 3.0 meters (bottom axis), corresponding to car sizes of 20 to 1 pixel(s) (top axis). At each of the thirteen resolutions we evaluate test scenes with a unique model trained at that resolution. The 23 thin lines display the performance of the individual test scenes; most of these lines are tightly clustered about the mean, denoted by the blue dashed line. The red band displays the standard deviation. We fit a piecewise linear model to the data, shown as the dotted cyan line. Below the inflection point (large cyan dot) of 0.61 meters (corresponding to a car size of 5 pixels) the F1 score degrades slowly; between 0.60 m and 3.0 m GSD the slope is steeper. The F1 scores at 0.15 m, 0.60 m, and 3.0 m GSD are 0.92, 0.87, and 0.27, respectively.

Figure 14. Fraction of predicted number of cars to ground truth, with a unique model for each resolution (bottom axis) and object pixel size (top axis). A fraction of 1.0 means that the correct number of cars was predicted, while if the fraction is below 1.0 too few cars were predicted. The thin bands denote the performance of the 23 individual scenes, with the dashed blue line showing the weighted mean and the red band displaying STD. We fit a piecewise linear model to the data, shown as the dotted cyan line. Below the inflection point (large cyan dot) of meters the slope is essentially flat with a slope of ; between m and m GSD the slope is steeper at . For resolutions sharper than meters the predicted number of cars is within of ground truth.

For objects 3 meters in size we observe from Figure 13 that object detection performance degrades from F1 = 0.92 for objects 20 pixels in size to F1 = 0.27 for objects 1 pixel in size, with a mean error of 0.09. Interestingly, the F1 score degrades by only about 5% (from 0.92 to 0.87) as objects shrink from 20 to 5 pixels in size (0.15 m to 0.60 m GSD). At least for cars viewed from overhead, one can conclude that object sizes of 5 pixels or more yield object detection scores of F1 > 0.85. The curves of Figure 11 degrade far faster than those of Figures 13 and 14, illustrating that a single model fit at high resolution is inferior to a series of models trained at each respective resolution.

8. Conclusions

Object detection algorithms have made great progress as of late in localizing objects in ImageNet style datasets. Such algorithms are rarely well suited to the object sizes or orientations present in satellite imagery, however, nor are they designed to handle images with hundreds of megapixels.

To address these limitations we implemented a fully convolutional neural network pipeline (YOLT) to rapidly localize vehicles, buildings, and airports in satellite imagery. We noted poor results from a combined classifier due to confusion between small and large features, such as highways and runways. Training dual classifiers at different scales (one for buildings/vehicles, and one for infrastructure), yielded far better results.

This pipeline yields an object detection F1 score of approximately 0.6 to 0.9, depending on category. While the F1 scores may not be at the level many readers are accustomed to from ImageNet competitions, object detection in satellite imagery is still a relatively nascent field and has unique challenges. In addition, our training dataset for most categories is relatively small for supervised learning methods, and the F1 scores could possibly be improved with further post-processing of detections.

We also demonstrated the ability to train on one sensor (e.g. DigitalGlobe), and apply our model to a different sensor (e.g. Planet). We show that at least for cars viewed from overhead, object sizes of 5 pixels or more yield object detection scores of F1 > 0.85. The detection pipeline is able to evaluate satellite and aerial images of arbitrary input size at native resolution, and processes vehicles and buildings at a rate of 32 km² per minute, and airports at a rate of 6,000 km² per minute. At this inference speed, a 16 GPU cluster could provide real-time inference on the DigitalGlobe WorldView3 satellite feed.

We thank Karl Ni for very helpful comments.