Visual search, the ability to find images that are visually similar to a query image, is a form of pattern recognition applied to the problem of image retrieval. An image retrieval system allows the user to find images that have some logical connection to a set of query parameters such as keywords, captions, or the in the case of content-based image retrieval (CBIR), the visual content of the image itself(gudivada1995content; rui1999image; smeulders2000content; liu2007survey; datta2008image). CBIR systems use computer vision techniques (e.g. the SIFT descriptor, lowe2004distinctive) to quantify texture, color, and shapes present in an image and thereby define image similarity. Convolutional neural networks (lecun1995convolutional) trained on large image datasets (deng2009imagenet) provide another means to extract rich feature representations of photographic imagery, and can be used in CBIR systems for visual search (sharif2014cnn; babenko2014neural; yue2015exploiting; tolias2015particular; zheng2018sift).
Visual search is an increasingly common offering from large software companies. Some of these systems are focused on consumer products, e.g. Pinterest Lens (jing2015visual; zhai2017visual), while other search systems from Google, Microsoft, Baidu, or Yandex operate on photographic imagery more generally.
Aerial and satellite imagery provides another large dataset on which to build a visual search system, although there is much less prior work relative to visual search over photographic imagery. A significant milestone came in May 2016, when a team from Carnegie Mellon University released Terrapattern111http://www.terrapattern.com, an interface for visual search over seven metropolitan areas. The work presented in this paper was largely inspired by Terrapattern. We sought to scale a visual search system beyond metropolitan areas to an entire country or the whole world.
Going beyond searching a single imagery layer, it is interesting to consider what visual search over all publicly available aerial and satellite imagery would look like. As organizations like Descartes Labs222https://descarteslabs.com, Google Earth Engine333https://earthengine.google.com, and Sentinel Hub444https://www.sentinel-hub.com/ continue to build systems to ingest this flood of remote sensing data, what information do they store and in what format?
To begin with, there is the image data itself, which is currently at the petabyte scale and will cross into tens of petabytes in the coming decade. Beyond that, metadata such as the geographical extent of the image, the time at which the image was acquired, and the sensor used to acquire the image are clearly valuable; one can construct databases to efficiently index and search over these metadata. One can also store metadata related to the image data itself: basic summary statistics or the fraction of pixels covered in clouds. What is lacking is metadata encoding a deeper visual or semantic understanding of the image content. Ideally, this “visual metadata” would be significantly smaller than the original image data, say at least 100X smaller, and, like the spatial and temporal metadata described above, would encode some meaningful notion of “distance” between images. The feature vectors presented in this paper provide a step in that direction.
How will such visual metadata be used for real applications? Possibilities include:
interactive exploration of the earth’s surface: a user wants to quickly search for visually similar objects, and that is the end goal.
ground truth: a user could efficiently gather ground truth to be used to train a downstream computer vision model.
pre-processing filter: a user needs to run a computer vision model over a large region but, in order to reduce computational expense, would like to limit the processing to only those images that might contain the object of interest.
This paper is organized as follows. In Section 2 we describe the two image datasets used in this work. In Section 3 we describe how we modified a convolutional neural network to extract useful binary features from this imagery, and in Section 4 we detail the two search methods used to search over these features. Finally, we show example search results in Section 5 and conclude with a summary of future research directions in Section 6.
We used two imagery sources in the work presented here:
Aerial over USA - We use aerial imagery from the National Agriculture Imagery Program555https://lta.cr.usgs.gov/NAIP (NAIP) and the Texas Orthoimagery Program666https://tnris.org/2015-statewide-orthoimagery-project/. These datasets provide coverage of the lower 48 states of the United States.
Landsat 8 over Earth - We created a custom, pan-sharpened, 15-meter global composite (warren2016data) using data from Landsat 8, one of the workhorses of NASA’s satellite-based earth observation program.
We built a visual search system for each of these datasets, but the systems differ in how the features were generated, as described in Section 3.
We maintain the projection of both datasets in the 60 Universal Transverse Mercator (UTM) projections. These conformal projections preserve angles and have weak area distortion within a UTM zone. Both of these properties are desirable for visual search, as we would only be making the search problem more difficult if we used a projection that introduced strong distortion of angles or areas.
We chunked this imagery into square image tiles, 128 pixels on a side. Adjacent tiles overlap by 64 pixels in the vertical and horizontal directions. While this overlapping tiling scheme requires two times more data than a non-overlapping scheme, it gives any object the chance to be roughly centered in at least one image tile.
We used the Descartes Labs Platform777http://docs.descarteslabs.com to access and process this imagery. This platform allows the user to search for the existence of imagery, quickly load that imagery into memory in the form of a python ndarray, and parallelize computations over tens of thousands of CPUs.
3 Feature Vectors
We start with a convolutional neural network with a 50-layer ResNet architecture (he2016deep)
, pre-trained on ImageNet888http://www.image-net.org
. This network is conveniently provided pre-trained in the Keras999https://keras.iodeep learning package, which we used with a Tensorflow101010https://www.tensorflow.org/ back-end.
In our initial experiments we used the “out of the box” features generated in the last few layers of the ImageNet-trained network. These layers give surprisingly good similarity search results for satellite imagery, despite being trained on photographic imagery of animals, plants, vehicles, etc. But in the end we modified the network in two ways, as described below.
An overview of the feature-generation process is shown in Figure 1.
3.1 Binary Features
We decided that we ultimately wanted to search over binary features, due to their smaller data footprint. To that end, we encouraged the network to make floating-point features very close to 0.0 or 1.0 at the layer of interest by injecting noise with an amplitude comparable to the width of the layer’s activation function(salakhutdinov2009semantic)
. The noise is injected during training but not during inference. The network learns to make almost-binary features at this layer; otherwise the noise destroys the information that the layer is trying to pass on. Finally, we binarize the floating-point features by thresholding at 0.5.
3.2 Customizing for Aerial and Satellite Imagery
We customized this network to work with each source of aerial and satellite imagery. For NAIP, we followed the strategy used by Terrapattern and fine-tuned the network to classify images into 130 object classes from OpenStreetMap (OSM), such as parking lots or golf courses(zheng2016good). Prior to fine-tuning, we added two fully-connected layers to the end of the network, and extracted 512 binary features from one of them (vo2019generalization)
. We emphasize that this supervised learning step is not used to produce a network that is good at identifying particular object classes (although it does ultimately do better on these classes), but rather to create a network that produces features that are useful for generic visual search on aerial and satellite imagery.
For Landsat 8, the OSM classes were less useful due to the much coarser spatial resolution, so we took an unsupervised approach instead. Specifically we used an autoencoder to compress the 2048 features from the penultimate layer of ResNet50 into 512 binary features using a 2048-1024-512-1024-2048 autoencoder architecture. The autoencoder was trained by passing Landsat 8 imagery through the original ResNet50 network.
We also note that we do not directly optimize on the binary features used for look-up, unlike many feature hashing approaches (norouzi2012hamming). Instead, we rely on the fact that images of similar classes do map to similar outputs of late layers in the neural net.
At the end of this process, we have mapped the original 128x128x3 8-bit image (or 16-bit image in the case of Landsat 8) to 512 bits. This corresponds to a 768X (1536X) reduction in data size. These binary features form a compact representation of the visual information present in each image.
We pre-compute the feature vectors for all of the tiles in each dataset: approximately 2 billion tiles for NAIP and 200 million tiles for Landsat 8. We distributed this computation across tens of thousands of CPUs in the Google Cloud Platform.
We store the binary features vectors in Redis, an in-memory data store. The data is stored in key:value pairs. The keys are strings that uniquely identify each image tile, and the values are the corresponding 512-bit feature vectors, each stored as 64 bytes.
Now that we have feature vectors, how do we search for similar vectors?
We first define a distance between vectors. We use the Hamming distance: the number of bits that differ between the two binary vectors. A small distance is associated with visual similarity.
Next we need to find the -nearest vectors to a query vector. We use two methods: an exact, brute-force search, and an approximate, hash-based search, as described below.
4.1 Direct Search
By taking advantage of low-level instructions for comparing bits (XOR) and counting bits (__builtin_popcountll), we can perform a direct, exact, brute-force search over the 200 million Landsat 8 images. Search for similar vectors is a nearest-neighbors search in the 512-bit space. This works as follows.
A query image tile is selected.
We retrieve the binary feature vector for this image tile using Redis.
We compute the Hamming distance between the query vector and all vectors using a pre-computed binary file that contains all binary feature vectors. This sub-program is written in C, using the GCC compiler with -Ofast optimization.
We keep the -nearest feature vectors and determine their corresponding images tiles using a second index file.
This direct, brute-force method searches across the 200 million Landsat 8 images in approximately 2 seconds using a single thread. While an individual query is single-threaded, we run a server (16 virtual CPUs, Haswell architecture, 32 GB of RAM) which allows for up to 16 concurrent queries that share the feature vector data in memory.
4.2 Hash-based Search
The NAIP dataset is approximately ten times larger than the Landsat 8 dataset, with approximately 2 billion images. While the direct search method works, it is too slow for interactive use. Instead we use an approximate, hash-based search method. The general idea is to use hash functions to quickly return a list of candidate neighbors, and then run a brute-force search on that relatively small list of candidates.
More specifically, we use bit sampling, a simple form of locality-sensitive hashing. We use a family of 32 hash functions, each of which simply selects a subset of 16 bits from the full, 512-bit feature vector. Within each hash table, the keys are specific realizations of those 16 bits, e.g. 0101001111101110, and the values are lists of strings (image tile ids) corresponding to tiles that contain those specific realizations of those 16 bits. We store the hash tables in Bigtable, a scalable NoSQL database service from Google Cloud. An overview of the hash-based search system is shown in Figure 2 and works as follows.
A query image tile is selected.
We retrieve the binary feature vector for this image tile using Redis.
We apply each of the 32 hash functions to this feature vector, each of which returns a set of candidate neighbors.
We compute the unique set of candidate neighbors and fetch their corresponding feature vectors using Bigtable.
We perform a brute-force distance calculation between the query vector and the candidate vectors.
We keep the -nearest feature vectors and their corresponding images tile ids.
The search takes approximately 0.1 seconds per query over approximately 2 billion images.
5.1 Search Results
We expect to have relatively good search results for many of the queries shown here (e.g. wind turbines) because the query object was one of the 130 OSM classes used to fine-tune the NAIP network. However, the system also provides good search results over generic images that were not part of the supervised learning phase, such as the sinuous waterways or the trees changing color.
Search results from Landsat 8 are shown in Figure 5. The Landsat 8 features were generated without any supervised learning or reference to OSM classes, so the search results demonstrate generic visual search.
We show low-quality search results in Figure 6. While these results do contain images that are visually similar to the query image, they are failures in the sense that they do not contain the query object class. This highlights the fact that there is a balance to strike between providing good search results for common object classes and providing good search results for generic, class-agnostic search.
Finally, we have made a quantitative assessment of search quality for 20 query types. For each query we have manually recorded the precision (fraction of true positives) out of the top-30 search results. We have not attempted to assess recall, given the lack of unified, comprehensive ground truth for these object types.
Our top-30 precision results are as follows. For the 10 object types that were present during the network fine-tuning stage, the average top-30 precision is 86%. The individual results for this category are: airplanes (36%), baseball diamonds (100%), football fields (50%), golf courses (100%), highway overpasses (70%), marinas (100%), parking lots (100%), railyards (100%), storage tanks (100%), and taxiways (100%).
For the 10 object types that were not present during the fine-tuning stage, the average top-30 precision is 58%. The individual results for this category are: barges (10%), CAFOs (97%), center pivot irrigation (97%), cul-de-sacs (33%), dry docks (3%), feed lots (50%), roller coasters (3%), row housing (90%), suburban homes (100%), and shipping containers (93%). As one would expect, the search results are generally of higher quality for object types that were present during the network fine-tuning stage (86% vs 58%), although the precision of the new object types (58%) is still high enough to be useful for some applications, e.g. efficiently gathering ground truth to be used to train a downstream computer vision model.
5.2 User Interface
We built a browser-based user interface on top of the search back-end described above. This interface was released publicly in March 2017 at https://search.descarteslabs.com. We encourage the reader to try it out, but the basic structure is: a user clicks anywhere on the map, the top 1000 search results are displayed in thumbnail image form, and their geographical locations are shown on a map.
A screenshot of the user interface is shown in Figure 7. This particular search query highlights the benefit of showing the geographical distribution of the search results. The user clicked on a single example of piñon-juniper woodlands, a landscape common to the southwestern United States, and the full geographical extent of piñon-juniper woodlands is visible immediately.
6 Conclusion and Future Directions
We have presented a system for visual search over billions of aerial and satellite images. Binary features encoding visual information are extracted with a convolutional neural network that has been trained on aerial and satellite imagery. Using a hash-based search method we are able to search over approximately 2 billion images in 0.1 seconds. An interactive demo is available at https://search.descarteslabs.com
We see three clear future directions for extending the work presented here.
Multi-scale search - The current system searches at one spatial scale, namely square images that are 128 pixels across. Instead we would like to enable search over a handful of spatial scales, both smaller and larger than 128 pixels across.
Geospatial filtering - The current system searches across all images that have been indexed, regardless of their geographical location. In the future we would like to enable geospatial filtering, e.g. only return results from Japan.
Temporal filtering - The current system searches across a single image layer. Although the images within this layer were acquired at different times, that temporal information has not been used in the search. In the future we would like to enable temporal filtering, e.g. only return results that were acquired in May 2018.
With these changes, it should be possible, for example, to process and search over all Landsat 8 scenes, rather than searching over a single Landsat 8 composite image. And while the work presented here used RGB imagery, the underlying technique should work with non-visual bands (infrared, SWIR, SAR, etc.) or a different number of bands. Such a system should provide a more flexible way to compactly encode visual information for large collections of aerial and satellite imagery.