You Only Look Twice: Rapid Multi-Scale Object Detection In Satellite Imagery
Detection of small objects in large swaths of imagery is one of the primary problems in satellite imagery analytics. While object detection in ground-based imagery has benefited from research into new deep learning approaches, transitioning such technology to overhead imagery is nontrivial. Among the challenges is the sheer number of pixels and geographic extent per image: a single DigitalGlobe satellite image encompasses >64 km² and over 250 million pixels. Another challenge is that objects of interest are minuscule (often only 10 pixels in extent), which complicates traditional computer vision techniques. To address these issues, we propose a pipeline (You Only Look Twice, or YOLT) that evaluates satellite images of arbitrary size at a rate of >0.5 km²/s. The proposed approach can rapidly detect objects of vastly different scales with relatively little training data over multiple sensors. We evaluate large test images at native resolution, and yield scores of F1 > 0.8 for vehicle localization. We further explore resolution and object size requirements by systematically testing the pipeline at decreasing resolution, and conclude that objects only 5 pixels in size can still be localized with high confidence. Code is available at https://github.com/CosmiQ/yolt.
The availability of large, high-quality labelled datasets such as ImageNet (Russakovsky et al., 2015), PASCAL VOC (Everingham et al., 2010) and MS COCO (Lin et al., 2014) has helped spur a number of impressive advances in rapid object detection that run in near real-time; three of the best are: Faster R-CNN (Ren et al., 2015), SSD (Liu et al., 2015), and YOLO (Redmon et al., 2015) (Redmon and Farhadi, 2016). Faster R-CNN typically ingests 1000×600 pixel images, whereas SSD uses 300×300 or 512×512 pixel input images, and YOLO runs on either 416×416 or 544×544 pixel inputs. While the performance of all these frameworks is impressive, none can come remotely close to ingesting the input sizes typical of satellite imagery. Of these three frameworks, YOLO has demonstrated the greatest inference speed and highest score on the PASCAL VOC dataset. The authors also showed that this framework is highly transferable to new domains by demonstrating superior performance to other frameworks (i.e., SSD and Faster R-CNN) on the Picasso Dataset (Ginosar et al., 2014) and the People-Art Dataset (Cai et al., 2015). Due to the speed, accuracy, and flexibility of YOLO, we accordingly leverage this system as the inspiration for our satellite imagery object detection framework.
The application of deep learning methods to traditional object detection pipelines is non-trivial for a variety of reasons. The unique aspects of satellite imagery necessitate algorithmic contributions to address challenges related to the spatial extent of foreground target objects, complete rotation invariance, and a large scale search space. Excluding implementation details, algorithms must adjust for:
In satellite imagery, objects of interest are often very small and densely clustered, rather than the large and prominent subjects typical in ImageNet data. In the satellite domain, resolution is typically defined by the ground sample distance (GSD), which describes the physical size of one image pixel. Commercially available imagery varies from 30 cm GSD for the sharpest DigitalGlobe imagery to 3-4 meter GSD for Planet imagery. This means that for small objects such as cars, each object will be only ~15 pixels in extent even at the highest resolution.
Objects viewed from overhead can have any orientation (e.g. ships can have any heading between 0 and 360 degrees, whereas trees in ImageNet data are reliably vertical).
There is a relative dearth of training data (though efforts such as SpaceNet (https://aws.amazon.com/public-datasets/spacenet/) are attempting to ameliorate this issue).
Input images are enormous (often hundreds of megapixels), so simply downsampling to the input size required by most algorithms (a few hundred pixels) is not an option (see Figure 1).
The contributions in this work specifically address each of these issues separately, while leveraging the relatively constant distance from sensor to object, which is well known and is typically hundreds of kilometers. This, coupled with the nadir-facing sensor, results in consistent pixel sizes for objects.
Section 2 details in further depth the challenges faced by standard algorithms when applied to satellite imagery. The remainder of this work is broken up to describe the proposed contributions as follows. To address small, dense clusters, Section 3.1 describes a new, finer-grained network architecture. Sections 3.2 and 3.3 detail our method for splitting, evaluating, and recombining large test images of arbitrary size at native resolution. With regard to rotation invariance and small labelled training dataset sizes, Section 4 describes data augmentation and size requirements. Finally, the performance of the algorithm is discussed in detail in Section 6.
Deep learning approaches have proven effective for ground-based object detection, though current techniques are often still suboptimal for overhead imagery applications. For example, small objects in groups, such as flocks of birds, present a challenge (Redmon et al., 2015), caused in part by the multiple downsampling layers of all three convolutional network approaches listed above (YOLO, SSD, Faster R-CNN). Further, these multiple downsampling layers yield relatively coarse features for object differentiation; this poses a problem if objects of interest are only a few pixels in extent. For example, consider the default YOLO network architecture, which downsamples by a factor of 32 and returns a 13×13 prediction grid for a 416×416 pixel input; this means that object differentiation is problematic if object centroids are separated by less than 32 pixels. Accordingly, we implement a unique network architecture with a denser final prediction grid. This improves performance by yielding finer-grained features that help differentiate between classes, and the finer prediction grid also permits classification of smaller objects and denser clusters.
Another reason object detection algorithms struggle with satellite imagery is that they have difficulty generalizing to objects in new or unusual aspect ratios or configurations (Redmon et al., 2015). Since objects in overhead imagery can have arbitrary heading, this limited range of invariance to rotation is troublesome. Our approach remedies this complication with rotation and augmentation of data. Specifically, we rotate training images about the unit circle to ensure that the classifier is agnostic to object heading, and also randomly scale the images in HSV (hue-saturation-value) space to increase the robustness of the classifier to varying sensors, atmospheric conditions, and lighting conditions.
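The rotation and HSV augmentation described above can be sketched as follows. This is a minimal illustration using scipy and matplotlib color utilities; the `augment` helper name, the rotation sampling, and the HSV jitter factors are illustrative assumptions, not the exact values used in YOLT.

```python
import numpy as np
from scipy.ndimage import rotate
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def augment(image, rng):
    """Rotate an RGB chip (floats in [0, 1]) by a random heading and jitter
    saturation/value. A hedged sketch of the augmentation described above."""
    # Uniform random heading so the detector sees all object orientations.
    angle = rng.uniform(0.0, 360.0)
    # reshape=False keeps the chip size fixed; borders fill with zeros.
    rotated = rotate(image, angle, reshape=False, order=1, mode="constant")
    hsv = rgb_to_hsv(np.clip(rotated, 0.0, 1.0))
    # Randomly scale saturation and value to mimic varying sensors,
    # atmospheric conditions, and lighting (factors are illustrative).
    hsv[..., 1] = np.clip(hsv[..., 1] * rng.uniform(0.7, 1.3), 0.0, 1.0)
    hsv[..., 2] = np.clip(hsv[..., 2] * rng.uniform(0.7, 1.3), 0.0, 1.0)
    return hsv_to_rgb(hsv)
```

In practice such a function would be applied on the fly during training so that each epoch sees differently rotated and color-jittered copies of the same chips.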
In advanced object detection techniques the network sees the entire image at train and test time. While this greatly improves background differentiation since the network encodes contextual (background) information for each object, the memory footprint on typical hardware (NVIDIA Titan X GPUs with 12GB RAM) is infeasible for a 256 megapixel image.
We also note that the large size of satellite images precludes simple approaches to some of the problems noted above. For example, upsampling the image to ensure that objects of interest are large and dispersed enough for standard architectures is infeasible, since this approach would also increase runtime many-fold. Similarly, running a sliding window classifier across the image to search for objects of interest quickly becomes computationally intractable, since multiple window sizes would be required for each object size. For perspective, one must evaluate over one million sliding window cutouts if the target is a 10 meter boat in a DigitalGlobe image. Our response is to leverage rapid object detection algorithms to evaluate satellite imagery with a combination of local image interpolation on reasonably sized image chips (~200 meters) and a multi-scale ensemble of detectors.
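The "over one million cutouts" figure can be checked with back-of-envelope arithmetic. The GSD, scene size, and stride below are illustrative assumptions chosen to be consistent with the >250 million pixel scenes described earlier, not exact values from the text.

```python
# Back-of-envelope count of sliding-window evaluations for one DigitalGlobe
# scene. The GSD, image size, and stride are illustrative assumptions.
gsd_m = 0.3           # ground sample distance (meters per pixel)
image_px = 16_000     # side length of a ~256 megapixel scene
target_m = 10.0       # boat length in meters

window_px = round(target_m / gsd_m)   # ~33 pixel window for a 10 m boat
stride_px = window_px // 2            # 50% overlap between adjacent windows

steps = (image_px - window_px) // stride_px + 1
total_windows = steps * steps
print(total_windows)                  # on the order of one million cutouts
```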
To demonstrate the challenges of satellite imagery analysis, we train a YOLO model with the standard network architecture (13×13 grid) to recognize cars in 416×416 pixel cutouts of the COWC overhead imagery dataset (Mundhenk et al., 2016) (see Section 4 for further details on this dataset). Naively evaluating a large test image (see Figure 2) with this network yields very poor results, due to the downsampling of the test image. Even appropriately sized image chips are problematic (again, see Figure 2), as the standard YOLO network architecture cannot differentiate objects with centroids separated by less than 32 pixels. Therefore, even if one restricts attention to a small cutout, performance is often poor in high density regions with the standard architecture.
In order to address the limitations discussed in Section 2, we implement an object detection framework optimized for overhead imagery: You Only Look Twice (YOLT). We extend the Darknet neural network framework (Redmon, 2017) and update a number of the C libraries to enable analysis of geospatial imagery and integration with external python libraries. We opt to leverage the flexibility and large user community of python for pre- and post-processing. Between the updates to the C code and the pre- and post-processing code written in python, interested parties need not have any knowledge of C to train, test, or deploy YOLT models.
To reduce model coarseness and accurately detect dense objects (such as cars or buildings), we implement a network architecture that uses 22 layers and downsamples by a factor of 16. Thus, a 416×416 pixel input image yields a 26×26 prediction grid. Our architecture is inspired by the 30-layer YOLO network, though this new architecture is optimized for small, densely packed objects. The dense grid is unnecessary for diffuse objects such as airports, but crucial for high density scenes such as parking lots (see Figure 2). To improve the fidelity of small objects, we also include a passthrough layer (described in (Redmon and Farhadi, 2016), and similar to identity mappings in ResNet (He et al., 2015)) that concatenates an earlier, finer-grained 52×52 layer onto the last convolutional layer, allowing the detector access to finer grained features of this expanded feature map.
Each convolutional layer save the last is batch normalized with a leaky rectified linear activation; the final layer utilizes a linear activation. The final layer provides predictions of bounding boxes and classes, and has size Nf = Nboxes × (Nclasses + 5), where Nboxes is the number of boxes per grid cell (5 by default) and Nclasses is the number of object classes (Redmon et al., 2015).
| Layer | Type | Filters | Size / Stride | Output |
| 0 | Convolutional | 32 | 3×3 / 1 | 416×416×32 |
| 1 | Maxpool | | 2×2 / 2 | 208×208×32 |
| 2 | Convolutional | 64 | 3×3 / 1 | 208×208×64 |
| 3 | Maxpool | | 2×2 / 2 | 104×104×64 |
| 4 | Convolutional | 128 | 3×3 / 1 | 104×104×128 |
| 5 | Convolutional | 64 | 1×1 / 1 | 104×104×64 |
| 6 | Convolutional | 128 | 3×3 / 1 | 104×104×128 |
| 7 | Maxpool | | 2×2 / 2 | 52×52×128 |
| 8 | Convolutional | 256 | 3×3 / 1 | 52×52×256 |
| 9 | Convolutional | 128 | 1×1 / 1 | 52×52×128 |
| 10 | Convolutional | 256 | 3×3 / 1 | 52×52×256 |
| 11 | Maxpool | | 2×2 / 2 | 26×26×256 |
| 12 | Convolutional | 512 | 3×3 / 1 | 26×26×512 |
| 13 | Convolutional | 256 | 1×1 / 1 | 26×26×256 |
| 14 | Convolutional | 512 | 3×3 / 1 | 26×26×512 |
| 15 | Convolutional | 256 | 1×1 / 1 | 26×26×256 |
| 16 | Convolutional | 512 | 3×3 / 1 | 26×26×512 |
| 17 | Convolutional | 1024 | 3×3 / 1 | 26×26×1024 |
| 18 | Convolutional | 1024 | 3×3 / 1 | 26×26×1024 |
| 19 | Passthrough | 10 → 20 | | 26×26×1024 |
| 20 | Convolutional | 1024 | 3×3 / 1 | 26×26×1024 |
| 21 | Convolutional | Nf | 1×1 / 1 | 26×26×Nf |
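As a sanity check on the architecture above: only the four stride-2 maxpool layers change the spatial dimensions (padded 3×3/1 and 1×1/1 convolutions preserve them), giving the downsampling factor of 16. The sketch below walks a simplified layer list and also computes the final layer depth per grid cell from the YOLO formulation (boxes per cell times classes plus five box parameters); the class count is an arbitrary example.

```python
# Simplified layer sequence from the table: only "pool" entries matter for
# spatial size, since padded convolutions leave dimensions unchanged.
layers = ["conv", "pool", "conv", "pool", "conv", "conv", "conv", "pool",
          "conv", "conv", "conv", "pool"] + ["conv"] * 9

size = 416
for layer in layers:
    if layer == "pool":   # each 2x2/2 maxpool halves the spatial dimensions
        size //= 2

n_boxes, n_classes = 5, 3             # 5 boxes per cell; 3 classes is an example
n_f = n_boxes * (n_classes + 5)       # final layer depth per grid cell

print(size, n_f)                      # 26 40
```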
At test time, we partition testing images of arbitrary size into manageable cutouts and run each cutout through our trained model. Partitioning takes place via a sliding window with user defined bin sizes and overlap (15% by default), see Figure 4. We record the position of each sliding window cutout by naming each cutout according to the schema:
ImageName|row column height width.ext
panama50cm|1370 1180 416 416.tif
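The partition-and-name step can be sketched as below, assuming a 416 pixel bin and 15% overlap (illustrative defaults). The `partition` helper is hypothetical, and a production version would need extra handling for chips at the right and bottom image edges.

```python
from pathlib import Path

def partition(image_name, width, height, bin_px=416, overlap=0.15, ext=".tif"):
    """Yield (name, row, col) for each sliding-window cutout, naming each
    chip per the ImageName|row column height width.ext schema above."""
    stride = int(bin_px * (1.0 - overlap))
    stem = Path(image_name).stem
    for row in range(0, max(height - bin_px, 0) + 1, stride):
        for col in range(0, max(width - bin_px, 0) + 1, stride):
            name = f"{stem}|{row} {col} {bin_px} {bin_px}{ext}"
            yield name, row, col

chips = list(partition("panama50cm.tif", 1664, 832))
print(chips[0][0])   # panama50cm|0 0 416 416.tif
```

Encoding the row and column offsets in the file name lets the post-processing stage recover each chip's global position without any auxiliary index.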
Much of the utility of satellite (or aerial) imagery lies in its inherent ability to map large areas of the globe. Thus, small image chips are far less useful than the large field of view images produced by satellite platforms. The final step in the object detection pipeline therefore seeks to stitch together the hundreds or thousands of testing chips into one final image strip.
For each cutout the bounding box position predictions returned from the classifier are adjusted according to the row and column values of that cutout; this provides the global position of each bounding box prediction in the original input image. The overlap ensures all regions will be analyzed, but also results in overlapping detections on the cutout boundaries. We apply non-maximal suppression to the global matrix of bounding box predictions to alleviate such overlapping detections.
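The offset-and-suppress merge can be sketched as follows. The `global_nms` helper is hypothetical, and the greedy IoU-based suppression shown is the standard formulation rather than YOLT's exact implementation.

```python
import numpy as np

def global_nms(chip_dets, iou_thresh=0.5):
    """chip_dets: list of (row, col, boxes, scores) per cutout, with boxes as
    [x1, y1, x2, y2] in chip-local pixels. Returns globally positioned boxes
    and scores after greedy non-maximal suppression."""
    boxes, scores = [], []
    for row, col, b, s in chip_dets:
        b = np.asarray(b, dtype=float)
        b[:, [0, 2]] += col            # shift x by the cutout's column offset
        b[:, [1, 3]] += row            # shift y by the cutout's row offset
        boxes.append(b)
        scores.append(np.asarray(s, dtype=float))
    boxes, scores = np.concatenate(boxes), np.concatenate(scores)

    keep = []
    order = scores.argsort()[::-1]     # highest confidence first
    while order.size:
        i = order[0]
        keep.append(i)
        # IoU of the top box against the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                  (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]   # drop overlapping duplicates
    return boxes[keep], scores[keep]
```

Because adjacent cutouts overlap, the same object near a chip boundary is often detected twice; after shifting to global coordinates the two boxes coincide and the lower-confidence one is suppressed.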
Training data is collected from small chips of large images from three sources: DigitalGlobe satellites, Planet satellites, and aerial platforms. Labels are comprised of a bounding box and category identifier for each object. We initially focus on five categories: airplanes, boats, building footprints, cars, and airports. For objects of very different scales (e.g. airplanes vs airports) we show in Section 6.2 that using two different detectors at different scales is very effective.
The Cars Overhead with Context (COWC) (Mundhenk et al., 2016) dataset is a large, high quality set of annotated cars from overhead imagery collected over multiple locales. Data is collected via aerial platforms, but at a nadir view angle such that it resembles satellite imagery. The imagery has a resolution of 15 cm GSD, approximately double the current best resolution of commercial satellite imagery (30 cm GSD for DigitalGlobe). Accordingly, we convolve the raw imagery with a Gaussian kernel and reduce the image dimensions by half to create the equivalent of 30 cm GSD images. Labels consist simply of a dot at the centroid of each car, and we draw a 3 meter bounding box around each car for training purposes. We reserve the largest geographic region (Utah) for testing, leaving 13,303 labelled training cars.
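The blur-and-downsample step can be sketched as below; the Gaussian sigma and the `degrade_resolution` helper name are illustrative assumptions, not the paper's exact kernel.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade_resolution(image, factor=2, sigma=1.0):
    """Simulate coarser GSD: blur with a Gaussian kernel, then subsample.
    The sigma here is an illustrative choice."""
    blurred = gaussian_filter(image.astype(float), sigma=sigma)
    return blurred[::factor, ::factor]

chip = np.random.rand(416, 416)       # a simulated 15 cm GSD chip
coarse = degrade_resolution(chip)     # simulated 30 cm GSD equivalent
print(coarse.shape)                   # (208, 208)
```

Blurring before subsampling matters: it suppresses high-frequency content that would otherwise alias into the lower-resolution image.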
The second round of SpaceNet data consists of 30 cm GSD DigitalGlobe imagery and labelled building footprints over four cities: Las Vegas, Paris, Shanghai, and Khartoum. The labels are precise building footprints, which we transform into bounding boxes encompassing 90% of the extent of the footprint. Image segmentation approaches show great promise for this challenge; nevertheless, we explore YOLT performance on building outline detection, acknowledging that since YOLT outputs bounding boxes it will never achieve perfect building footprint detection for complex building shapes. Between the four cities there are 221,336 labelled buildings.
We label eight DigitalGlobe images taken over airports for a total of 230 airplanes in the training set.
We label three DigitalGlobe images taken over coastal regions for a total of 556 boats.
We label airports in 37 Planet images for training purposes, each with a single airport per chip. For objects the size of airports, some downsampling is required, as runways can exceed 1000 pixels in length even in low resolution Planet imagery; we therefore downsample Planet imagery by a factor of four for training purposes.
The raw training datasets for airplanes, airports, and watercraft are quite small by computer vision standards, and a larger dataset may improve the inference performance detailed in Section 6.
To ensure evaluation robustness, all test images are taken from different geographic regions than training examples. For cars, we reserve the largest geographic region of Utah for testing, yielding 19,807 test cars. Building footprints are split 75/25 train/test, leaving 73,778 test footprints. We label four airport test images for a total of 74 airplanes. Four boat images are labelled, yielding 771 test boats. Our dataset for airports is smaller, with ten Planet images used for testing. See Table 2 for the train/test split for each category.
| Object Class | Training Examples | Test Examples |
| Airport | 37 | 10 |
| Airplane | 230 | 74 |
| Boat | 556 | 771 |
| Car | 13,303 | 19,807 |
| Building | 221,336 | 73,778 |
Initially, we attempt to train a single classifier to recognize all five categories listed above, both vehicles and infrastructure. We note a number of spurious airport detections in this example (see Figure 6), as downsampled runways look similar to highways at the wrong scale.
There are multiple ways one could address the false positive issues noted in Figure 6. Recall from Section 4 that for this exploratory work our training set consists of only a few dozen airports, far smaller than usual for deep learning models. Increasing this training set size might improve our model, particularly if the backgrounds are highly varied. Another option would be to use post-processing to remove any detections at the incorrect scale (e.g. a putative airport only tens of meters in extent). A third option is to simply build dual classifiers, one for each relevant scale.
We opt to utilize the scale information present in satellite imagery and run two different classifiers: one trained for vehicles + buildings, and the other trained only to look for airports. Running the second airport classifier on downsampled images has a minimal impact on runtime performance, since in a given image there are approximately 100 times more 200 meter chips than 2000 meter chips.
For large validation images, we run the classifier at two different scales: 200m, and 2500m. The first scale is designed for vehicles and buildings, and the larger scale is optimized for large infrastructure such as airports. We break the validation image into appropriately sized image chips and run each image chip on the appropriate classifier. The myriad results from the many image chips and multiple classifiers are combined into one final image, and overlapping detections are merged via non-maximal suppression. We find a detection probability threshold of between 0.3 and 0.4 yields the highest F1 score for our validation images.
We define a true positive as having an intersection over union (IOU) greater than a given threshold. An IOU of 0.5 is often used as the threshold for a correct detection, though as in Equation 5 of ImageNet (Russakovsky et al., 2015), we select a lower threshold of 0.25 for vehicles since we are dealing with very small objects. For SpaceNet building footprints and airports we use an IOU of 0.5.
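The evaluation criterion can be made concrete with a small sketch. The category names and the threshold table are illustrative, with the threshold values taken from the text above.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Per-category thresholds as described above: a looser criterion for very
# small vehicles, the conventional 0.5 for buildings and airports.
THRESHOLDS = {"car": 0.25, "building": 0.5, "airport": 0.5}

def is_true_positive(pred, truth, category):
    return iou(pred, truth) > THRESHOLDS[category]
```

For example, two 10×10 boxes offset by half their width have IoU = 1/3: a true positive under the vehicle threshold, a false positive under the building threshold.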
Table 3 displays object detection performance and speed over all test images for each object category. YOLT performs relatively well on airports, airplanes, and boats, despite small training set sizes. YOLT is not optimized for building footprint extraction, though it performs somewhat competitively on the SpaceNet dataset; the top score on the recent SpaceNet challenge achieved an F1 score of 0.69 (https://spacenetchallenge.github.io/Competitions/Competition2.html), while the YOLT score of 0.61 puts it in the top 3. We report inference speed in terms of GPU time to run the inference step; inference runs rapidly on the GPU. Currently, pre-processing (i.e., splitting test images into smaller cutouts) and post-processing (i.e., stitching results back into one global image) are not fully optimized and are performed on the CPU, which increases total run time. The inference speed translates to a runtime of minutes to localize all vehicles in an area the size of Washington, DC, and seconds to localize airports over this area. DigitalGlobe's WorldView3 satellite (http://worldview3.digitalglobe.com) covers a maximum of 680,000 km² per day, so at YOLT inference speed a 16 GPU cluster would provide real-time inference on satellite imagery.
| Object Class | F1 Score | Run Time |
(IOU = 0.25 for vehicles; IOU = 0.5 for buildings and airports)
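The real-time claim follows from simple arithmetic, using the >0.5 km²/s inference rate quoted in the abstract:

```python
# Arithmetic behind the real-time inference claim, using the >0.5 km^2/s
# per-GPU rate quoted in the abstract.
rate_km2_per_s = 0.5
seconds_per_day = 24 * 3600
gpus = 16

cluster_km2_per_day = rate_km2_per_s * seconds_per_day * gpus
worldview3_km2_per_day = 680_000      # WorldView3 maximum daily coverage

print(cluster_km2_per_day)            # 691200.0, just above 680,000 km^2/day
assert cluster_km2_per_day >= worldview3_km2_per_day
```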
The large test set of cars in the nine Utah images of the COWC dataset enables detailed performance analyses. The majority of the cars lie in the image over central Salt Lake City, so we split this image into sixteen smaller regions to even out the number of cars per image. We remove one test scene that has only 61 cars, leaving 23 test scenes. We apply a YOLT model trained to find cars on these test scenes.
In Figure 9 we display the F1 score for each scene, along with the car count accuracy. Total car count in a specified region may be a more valuable metric in the commercial realm than F1 score. Accordingly, we compute the number of predicted cars for each scene as a fraction of the ground truth number. Like the F1 score, a value of 1.0 denotes perfect prediction for the fractional car count metric. The COWC (Mundhenk et al., 2016) authors sought to count (rather than localize) the number of cars in test images and achieved a very low count error; the total count error for YOLT on the COWC data is similarly small.
The worst outlier in Figure 9, with an F1 score of 0.67 and 2860 cars present, corresponds to an automobile junkyard, an understandably difficult region to analyze.
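The fractional car count metric reduces to a one-line computation; the scene counts in the example are illustrative, not values from the paper.

```python
def fractional_count(n_predicted, n_truth):
    """Predicted car count as a fraction of ground truth; 1.0 is perfect.
    Values above 1.0 indicate over-counting, below 1.0 under-counting."""
    return n_predicted / n_truth

# e.g. predicting 1045 cars in a scene containing 1100 ground-truth cars:
print(fractional_count(1045, 1100))   # 0.95
```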
The uniformity of object sizes in the COWC (Mundhenk et al., 2016) dataset enables a detailed resolution study. To study the effects of resolution on object detection, we convolve the raw 15 cm imagery with a Gaussian kernel and reduce the image dimensions to create additional training and testing corpora at [0.30, 0.45, 0.60, 0.75, 0.90, 1.05, 1.20, 1.50, 1.80, 2.10, 2.40, 3.00] meters.
Initially, we test the multi-resolution test data on a single model (trained at 0.30 meters), and in Figure 11 demonstrate that the ability of this model to extrapolate to multiple resolutions is poor. Subsequently, we train a separate model for each resolution, for thirteen models total. Creating a high quality labelled dataset at low resolution (2.4m GSD, for example) is only possible because we downsample from already labelled high resolution 15 cm data; typically low resolution data is very difficult to label with high accuracy.
For objects ~3 meters in size (i.e., cars), we observe from Figure 13 that object detection performance degrades as objects shrink from 20 pixels to a single pixel in extent, with a mean error of 0.09. Interestingly, the F1 score degrades only slightly as objects shrink from 20 to 5 pixels in size (0.15 m to 0.60 m GSD). At least for cars viewed from overhead, one can conclude that object sizes of ≥5 pixels yield object detection scores of F1 > 0.8. The curves of Figure 11 degrade far faster than those of Figures 13 and 14, illustrating that a single model fit at high resolution is inferior to a series of models trained at each respective resolution.
Object detection algorithms have made great progress as of late in localizing objects in ImageNet style datasets. Such algorithms are rarely well suited to the object sizes or orientations present in satellite imagery, however, nor are they designed to handle images with hundreds of megapixels.
To address these limitations we implemented a fully convolutional neural network pipeline (YOLT) to rapidly localize vehicles, buildings, and airports in satellite imagery. We noted poor results from a combined classifier due to confusion between small and large features, such as highways and runways. Training dual classifiers at different scales (one for buildings/vehicles, and one for infrastructure), yielded far better results.
This pipeline yields an object detection F1 score of approximately 0.6-0.9, depending on category. While the F1 scores may not be at the level many readers are accustomed to from ImageNet competitions, object detection in satellite imagery is still a relatively nascent field with unique challenges. In addition, our training dataset for most categories is relatively small for supervised learning methods, and the F1 scores could likely be improved with further post-processing of detections.
We also demonstrated the ability to train on one sensor (e.g. DigitalGlobe) and apply our model to a different sensor (e.g. Planet). We show that, at least for cars viewed from overhead, object sizes of ≥5 pixels yield object detection scores of F1 > 0.8.
The detection pipeline is able to evaluate satellite and aerial images of arbitrary input size at native resolution, and processes vehicles and buildings at a rate of roughly 30 km² per minute (>0.5 km²/s), and airports at a far greater rate. At this inference speed, a 16 GPU cluster could provide real-time inference on the DigitalGlobe WorldView3 satellite feed.