Geospatial data lets us derive a large amount of information from location alone. When samples (expressed as points, polygons, or rasters with real-world coordinates) are coupled with large-scale datasets such as OpenStreetMap (OSM) osm2017, the result is an information-rich dataset from which to derive insights. For example, given your current position, it is possible to obtain the number of malls within 1.5 km, the distance to the nearest supermarket, or the frequency of traffic jams, all of which can later be used in downstream machine learning tasks.
However, engineering features for geospatial data is a challenging task that requires a significant amount of compute and storage nargesian2017learning; nargesian2018dataset; storcheus2015survey. Important considerations include (1) the storage capacity to house geospatial data sources, (2) the compute complexity of querying those sources, and (3) the ease of extracting information from them klien2005requirements.
In this paper, we introduce Geomancer, an open-source framework to perform geospatial feature engineering at scale. It leverages a data warehouse, geospatial datasets, and a Python library to pull out information from spatial datasets. In addition, Geomancer provides a way to version feature transforms and share them with other users. It is released under the MIT license. Geomancer has been used in production machine learning use cases such as area valuation, poverty mapping, and real-estate price estimation.
The fundamental unit in Geomancer is a logical feature smith2017ballet called a Spell. It maps a coordinate into a vector of feature values, $f_j : \mathcal{Z} \to \mathbb{R}^{q_j}$, where $\mathcal{Z}$ is the set of feasible coordinates (latitude and longitude in EPSG:4326 WGS84 EPS46:online) and $q_j$ is the dimensionality of the $j$-th feature vector. A collection of spells, i.e., a SpellBook, is then defined as a set of feature functions $\mathcal{F} = \{f_j \mid j = 1, \dots, m\}$.
Figure 1 shows the information flow in the Geomancer framework. Given a reference data source $\mathcal{D}$, Geomancer allows users to define feature transforms $\mathcal{F}$ and apply these functions to a dataset containing spatial coordinates $Z$. The result is a feature matrix $X$ smith2017ballet that can be used for downstream machine learning tasks:
$$X = \mathcal{F}(Z)$$
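As a minimal sketch of this information flow (not Geomancer's implementation), each feature function can be modelled as a plain Python callable from a (latitude, longitude) pair to a list of floats, and a SpellBook as a list of such callables. The names `approx_km`, `nearest_km`, `counts_within`, and `feature_matrix`, the sample coordinates, and the equirectangular distance approximation are all assumptions made for illustration; in the framework itself the computation would run in the data warehouse.

```python
import math
from typing import Callable, List, Sequence, Tuple

Coord = Tuple[float, float]               # (latitude, longitude) in EPSG:4326
Feature = Callable[[Coord], List[float]]  # one feature function


def approx_km(a: Coord, b: Coord) -> float:
    """Equirectangular approximation of distance in km (adequate for short ranges)."""
    mean_lat = math.radians((a[0] + b[0]) / 2)
    dx = math.radians(b[1] - a[1]) * math.cos(mean_lat)
    dy = math.radians(b[0] - a[0])
    return 6371.0 * math.hypot(dx, dy)


def nearest_km(reference: Sequence[Coord]) -> Feature:
    """Feature with one output value: distance to the nearest reference point."""
    return lambda c: [min(approx_km(c, r) for r in reference)]


def counts_within(reference: Sequence[Coord], radii_km: Sequence[float]) -> Feature:
    """Feature with len(radii_km) output values: reference points inside each radius."""
    return lambda c: [
        float(sum(approx_km(c, r) <= k for r in reference)) for k in radii_km
    ]


def feature_matrix(spellbook: Sequence[Feature], coords: Sequence[Coord]) -> List[List[float]]:
    """Apply every feature function to every coordinate and concatenate the outputs."""
    return [[v for f in spellbook for v in f(c)] for c in coords]


# Toy reference data source and sample coordinates (made-up values).
reference = [(1.3000, 103.8000), (1.3050, 103.8050), (1.4000, 103.9000)]
coords = [(1.3000, 103.8000), (1.3900, 103.8900)]
spellbook = [nearest_km(reference), counts_within(reference, [1.0, 3.0])]

# One row per coordinate, 1 + 2 = 3 columns.
X = feature_matrix(spellbook, coords)
```

Each row of `X` is the concatenation of the outputs of all feature functions for one coordinate, so the number of columns is the sum of the individual feature dimensionalities.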
There are three main components in the Geomancer framework: a Python library client, a data warehouse server, and a reference data source (Figure 2).
Python library client. The geomancer library (https://pypi.org/projects/geomancer) serves as the framework's user interface. Users can define feature functions (Spells or SpellBooks), export and read SpellBooks, and apply transforms to any given spatial dataset. New features are created via the factory design pattern, whereas the SpellBook mechanism is implemented using the builder pattern gamma1995design.
Data warehouse server. The data warehouse provides the storage and compute capacity in the framework. The library client can connect to multiple databases at the same time, and can handle both online transactional (OLTP) and analytical (OLAP) processing workloads. The extracted features can be stored inside the warehouse or exported as a dataframe for immediate consumption.
Reference data source. A reference data source is loaded inside the warehouse as the basis for feature engineering. For example, if we want to obtain the number of malls within a 1.5 km radius, we need to know the locations of all malls within the area in question. Fortunately, open datasets such as OpenStreetMap (OSM) osm2017 exist to provide such information. Usually, we create Extract-Transform-Load (ETL) pipelines to deliver timely, rich, and accurate data from external sources.
In practice, Geomancer enables researchers to (1) define geospatial features for extraction, (2) connect to various data warehouses, and (3) replicate and version features. The following sections will demonstrate how this can be done in the framework.
Feature functions for geospatial feature engineering
A Spell provides a declarative interface for defining logical features smith2017ballet. Once instantiated, it can be cast on a set of coordinates. For example, a single Spell can express the distance to the nearest embassy for a sample of coordinates.
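To make the idea concrete, the following is a self-contained sketch of a declarative spell for the distance to the nearest point of interest. It does not use the geomancer package: the class `DistanceToNearestSketch`, its fields, the in-memory `pois` list, and the coordinates are hypothetical stand-ins for a spell whose query would actually run in the data warehouse.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371008.8 * math.asin(math.sqrt(a))


@dataclass(frozen=True)
class DistanceToNearestSketch:
    """Declares what to compute; `cast` performs the computation on given points."""

    on: str            # point-of-interest class to look for, e.g. "embassy"
    feature_name: str  # name of the output column

    def cast(
        self,
        points: Sequence[Tuple[float, float]],
        pois: Sequence[Tuple[str, float, float]],
    ) -> List[Dict[str, float]]:
        # Keep only reference points of the requested class.
        targets = [(lat, lon) for fclass, lat, lon in pois if fclass == self.on]
        if not targets:
            raise ValueError(f"no reference points of class {self.on!r}")
        return [
            {
                "lat": lat,
                "lon": lon,
                self.feature_name: min(
                    haversine_m(lat, lon, t_lat, t_lon) for t_lat, t_lon in targets
                ),
            }
            for lat, lon in points
        ]


# Made-up reference points (class, latitude, longitude) and sample coordinates.
pois = [
    ("embassy", 14.5547, 121.0244),
    ("embassy", 14.5995, 120.9842),
    ("cafe", 14.5600, 121.0200),
]
points = [(14.5547, 121.0244), (14.6000, 120.9850)]

spell = DistanceToNearestSketch(on="embassy", feature_name="dist_embassy")
result = spell.cast(points, pois)
```

The spell object only carries the declaration (what to look for and what to name the output), which is what makes it straightforward to serialize and share later.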
Connect to various data warehouses
Geomancer can connect to any supported warehouse given a valid database URL. In practice, this has been helpful when engineering features across tables stored in different locations (for example, an OSM dataset in BigQuery and a traffic dataset in PostGIS). So far, Geomancer supports the following database backends:
BigQuery, an analytics data warehouse from the Google Cloud Platform google2012bigquery; melnik2010dremel.
PostGIS, a geospatial extension for PostgreSQL stonebraker1987postgres; stonebraker1986design.
SpatiaLite, a geospatial extension for SQLite bhosale2015sqlite; spatialite.
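One way such URL-based backend selection could work is to dispatch on the URL scheme. The sketch below is an assumption for illustration rather than Geomancer's code: the scheme names, the `BACKENDS` table, and the `backend_for` helper are hypothetical, and no database connection is opened.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from database URL scheme to the backends listed above.
BACKENDS = {
    "bigquery": "BigQuery",
    "postgresql": "PostGIS",
    "sqlite": "SpatiaLite",
}


def backend_for(dburl: str) -> str:
    """Return the backend name for a database URL, based on its scheme."""
    # Drop an optional driver suffix such as "+psycopg2".
    scheme = urlsplit(dburl).scheme.split("+")[0]
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported database URL scheme: {scheme!r}")
    return BACKENDS[scheme]


selected = [
    backend_for("bigquery://project-name"),
    backend_for("postgresql+psycopg2://user@localhost/gis"),
    backend_for("sqlite:///reference.sqlite"),
]
```

Keeping the selection keyed on the URL means that two features in the same collection can point at different warehouses simply by carrying different URLs.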
Save and share feature functions
Features can be grouped together to form a SpellBook, allowing us to cast multiple Spells at once. In addition, SpellBooks can be exported to a JSON file that carries metadata about the feature collection, such as its author and description.
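A rough sketch of what such an export could look like is given below. The JSON layout, the `SpellSpec` and `SpellBookSpec` classes, and all field names are assumptions chosen for illustration; they are not the file format used by the geomancer package.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class SpellSpec:
    """Serializable description of one feature (hypothetical schema)."""

    type: str
    on: str
    feature_name: str
    options: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SpellBookSpec:
    """A collection of spell descriptions plus metadata."""

    author: str
    description: str
    spells: List[SpellSpec]

    def to_json(self) -> str:
        # asdict() recursively converts the nested dataclasses to dicts.
        return json.dumps(asdict(self), indent=2)

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(self.to_json())


book = SpellBookSpec(
    author="Jane Doe",  # placeholder metadata
    description="Points-of-interest features",
    spells=[
        SpellSpec("DistanceToNearest", on="embassy", feature_name="dist_embassy"),
        SpellSpec("NumberOf", on="mall", feature_name="num_malls",
                  options={"within_m": 1500}),
    ],
)
exported = book.to_json()
```

Because the export is plain text, it can be diffed and committed like any other source file.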
Once a SpellBook is exported to a file, it can be version-controlled, shared, and reused on other datasets. For instance, the Spells in an exported SpellBook, my_features.json, can be cast on a new set of points.
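The reuse step can be sketched as follows, again under stated assumptions: the JSON string stands in for the contents of an exported file, and the schema, the `REGISTRY` of spell types, and the `cast_spellbook` helper are hypothetical. The registry plays the role of a factory that turns each stored description back into a callable feature.

```python
import json
import math
from typing import Any, Callable, Dict, List, Sequence, Tuple

Coord = Tuple[float, float]     # (latitude, longitude) in EPSG:4326
Poi = Tuple[str, float, float]  # (class, latitude, longitude)


def haversine_m(a: Coord, b: Coord) -> float:
    """Great-circle distance in metres between two coordinates."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dlam = math.radians(b[1] - a[1])
    h = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371008.8 * math.asin(math.sqrt(h))


# Hypothetical factory: spell type name -> builder of a per-point function.
REGISTRY: Dict[str, Callable[[List[Coord], Dict[str, Any]], Callable[[Coord], float]]] = {
    "DistanceToNearest": lambda targets, opts: (
        lambda p: min(haversine_m(p, t) for t in targets)
    ),
    "NumberOf": lambda targets, opts: (
        lambda p: float(sum(haversine_m(p, t) <= opts["within_m"] for t in targets))
    ),
}


def cast_spellbook(spellbook_json: str, points: Sequence[Coord],
                   pois: Sequence[Poi]) -> List[Dict[str, float]]:
    """Rebuild every stored spell and add one column per spell to each point."""
    spec = json.loads(spellbook_json)
    rows = [{"lat": lat, "lon": lon} for lat, lon in points]
    for s in spec["spells"]:
        targets = [(lat, lon) for fclass, lat, lon in pois if fclass == s["on"]]
        if not targets:
            raise ValueError(f"no reference points of class {s['on']!r}")
        feature = REGISTRY[s["type"]](targets, s.get("options", {}))
        for row in rows:
            row[s["feature_name"]] = feature((row["lat"], row["lon"]))
    return rows


# Stand-in for the text of an exported SpellBook file (assumed schema).
SPELLBOOK_JSON = """
{
  "author": "Jane Doe",
  "description": "Points-of-interest features",
  "spells": [
    {"type": "DistanceToNearest", "on": "embassy", "feature_name": "dist_embassy"},
    {"type": "NumberOf", "on": "mall", "feature_name": "num_malls",
     "options": {"within_m": 1500}}
  ]
}
"""

# Made-up reference points and a new set of points to cast on.
pois = [("embassy", 1.3000, 103.8000), ("mall", 1.3050, 103.8050), ("mall", 1.4000, 103.9000)]
new_points = [(1.3000, 103.8000)]
rows = cast_spellbook(SPELLBOOK_JSON, new_points, pois)
```

Since the file stores only declarations, the same SpellBook yields comparable features on any dataset for which the reference data is available.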
Case study: property value estimation in Singapore
We used Geomancer to predict residential prices per square foot in Singapore. The raw data was acquired from the Urban Redevelopment Authority’s open listing of apartment and condominium sales in the last four years ura2019property. We used this information to compile a dataset containing the locations and unit price per square foot for over transactions.
Thus, we are given a raw dataset $\{(z_i, y_i)\}$, where $z_i$ is the property's spatial coordinate in EPSG:4326 and $y_i$ is the unit price per square foot. We then used Geomancer, coupled with OSM data, to define logical features such as the number of restaurants within 3 km, the distance to the nearest bus stop, or the distance to the nearest nightclub. This resulted in a feature matrix $X$ that was used for model training.
For this dataset, we were able to extract geospatial features using points of interest in OpenStreetMap (OSM) and logical features from Geomancer. After performing a holdout split, we fed these features to a random forest regressor model breiman2001random; geurts2006extremely to predict the unit price per square foot. Running the trained model on aggregated areas of Singapore produced the heatmap shown in Figure 3. It is apparent that the value of an area increases when it is nearer to common points of interest. By zooming in on two selected properties in Figure 4, we can see how the features obtained from Geomancer directly influenced a property's value.
We also compared model predictions to the actual selling price of each property. Using Geomancer-based features, the model performed well on mid-range prices but underestimated properties that are extremely cheap or expensive. Still, we have reliable predictions with error not exceeding SGD per square foot as shown in Figure 4(a).
Lastly, we used Geomancer-based features to explain how the model predicts a property's price. From Figure 4(b), we can see that over half of the predictive power comes from the postal district where a unit is located. Within each district, however, it is possible to improve our price estimation by percentage points () by adding Geomancer-based features such as distance to restaurants, hotels, and ATMs. For more information, an interactive map is available at https://thinkdatasci.carto.com/builder/ed486c74-f19e-4a51-862d-a117785e121c.
In this paper, we introduced Geomancer, an open-source framework to perform geospatial feature engineering at scale. We described the Spell, a logical feature that serves as the basic building block of the framework. Then, we showed how it integrates with the overall architecture and demonstrated how it can be used through the Python client library. Lastly, we provided a sample production use case of Geomancer for predicting residential prices per square foot in Singapore. Using only Geomancer-based features and OpenStreetMap data, we were able to achieve accuracy with an error margin of SGD 100.
For future research, we plan to evaluate user-efficiency and system robustness in more detail. Finally, we also hope to expand the number of database connections (e.g. Amazon Athena, Redshift, etc.) and primitive features to accommodate different cloud providers and other advanced use-cases.
This work was supported by the UNICEF Innovation Fund. We would like to thank our mentors for the insightful discussions and valuable guidance. We would also like to thank Tiffani Gamboa, Cara Evangelista, and Niek van Veen for the Singapore case study.