Map Generation from Large Scale Incomplete and Inaccurate Data Labels

05/20/2020 ∙ by Rui Zhang, et al. ∙ IBM

Accurately and globally mapping human infrastructure is an important and challenging task with applications in routing, regulation compliance monitoring, natural disaster response management, etc. In this paper we present progress in developing an algorithmic pipeline and distributed compute system that automates the process of map creation using high resolution aerial images. Unlike previous studies, most of which use datasets that are available only in a few cities across the world, we utilize publicly available imagery and map data, both of which cover the contiguous United States (CONUS). We approach the technical challenge of inaccurate and incomplete training data by adopting state-of-the-art convolutional neural network architectures such as the U-Net and the CycleGAN to incrementally generate maps with increasingly more accurate and more complete labels of man-made infrastructure such as roads and houses. Since scaling the mapping task to CONUS calls for parallelization, we then adopted an asynchronous distributed parallel stochastic gradient descent training scheme to distribute the computational workload onto a cluster of GPUs with nearly linear speed-up.




1. Introduction

Generating maps of roads and houses from high resolution imagery is critical to keep track of our ever changing planet. Furthermore, such a capability becomes vital in disaster response scenarios where pre-existing maps are often rendered useless after destructive forces have struck man-made infrastructure. New maps of accessible roads as well as indications of destroyed buildings will greatly help disaster response teams in rescue planning.

Due to the significant advancement in computer vision by deep learning during the last couple of years, in parallel to the explosive growth in available high-resolution imagery, there has been a growing interest in pushing research in automated map generation from high resolution remote sensing imagery (24; Microsoft, 2018; Kaiser et al., 2017). Several challenges in the field of detecting houses or roads using high-resolution remote sensing data have been established, two of them being SpaceNet (24) and DeepGlobe (Demir et al., 2018). SpaceNet hosted a series of challenges including three rounds of building detection competitions and two rounds of road network detection contests. The SpaceNet building detection challenge provides 5 cities (plus one additional city, namely Atlanta, GA, USA, in the last round) across different continents, including Las Vegas, NV, USA; Paris, France; Rio de Janeiro, Brazil; Shanghai, China; and Khartoum, Sudan. The corresponding building and house labels are provided along with the high resolution imagery. The DeepGlobe challenge used the same set of data, but employed a slightly different evaluation measure.

The public availability of the SpaceNet and DeepGlobe training datasets attracted researchers globally to address the challenge of automated map generation. However, as is commonly known, infrastructure varies significantly from one location to another. High-rise apartment buildings have become typical in China, whereas most of the US population lives in single family houses. A model trained on US data can therefore be expected to perform poorly in China. Even within the US, the geo-spatial variation from state to state is evident. From this perspective, training a model to generate a global map requires more geographically diverse training data. Indeed, the past decade has shown a dramatic increase in the amount of open geo-spatial datasets made available by government agencies such as the United States Department of Agriculture (USDA), NASA, and the European Space Agency. Specifically, in this paper we employ two publicly available datasets to train the model of map generation from imagery: one from the corpus of OpenStreetMap (OSM) data and another from the National Agriculture Imagery Program (NAIP) distributed by USDA.

Most state-of-the-art automated programs rely on a significant amount of high-resolution image data with a correspondingly well labeled map (Kaiser et al., 2017; Maggiori et al., 2016). There have been few studies utilizing the freely available OSM data, because its accuracy fluctuates with the OSM community's activity in a given geo-location. In this work we attempt to utilize the inaccurate and incomplete labels of OSM to train neural network models to detect map features beyond OSM.

In addition to the challenge in data quality, the data volume needed to train a model that can map the whole United States is daunting, too. The total area of the CONUS is about million square kilometers. For NAIP imagery of resolution, the amount of data sums up to about TB. In view of the very large scale training dataset, we adopted an asynchronous distributed parallel stochastic gradient descent algorithm to speed up the training process nearly linearly on a cluster of 16 GPUs.

Figure 2. Sample OSM data near Dallas, TX and San Antonio, TX: The plot on the upper left shows an urban area well labeled with most roads (white/yellow lines) and houses (red boxes) accurately marked. The sample on the upper right is drawn from a rural location with less accurate labels. The lower left plot represents another rural region with partially labeled houses. Finally, the lower right figure exemplifies a highly populated area without any house record in the OSM dataset, yet.

We propose a training framework to incrementally generate maps from inaccurate and incomplete OSM data with high resolution aerial images, using data from four cities in Texas drawn from OSM and NAIP. Moreover, Las Vegas data from SpaceNet have been employed as well. The salient contributions are the following:

  • We propose a training framework to incrementally generate maps from inaccurate and incomplete OSM maps.

  • To the best of our knowledge, this is the first time OSM data is employed as a source of rasterized map imagery in order to train a pixel-to-pixel mapping from high resolution imagery.

  • We analyze the completeness of the OSM data and discuss an evaluation metric given the inaccuracy and incompleteness of the OSM labels.

  • We numerically investigate the transfer of a model from one geo-location to another to quantify how well the models under consideration generalize.

  • We test an asynchronous parallel distributed stochastic gradient descent algorithm to speed up the training process nearly linearly on a cluster of 16 GPUs.

  • Through a publicly accessible tool we showcase how we make our generated maps available.

2. Related Work

Employing remote sensing imagery to generate maps has become more popular due to the rapid progress in deep neural networks in the past decade, particularly in the arena of computer vision. Approaches include:

  • encoder-decoder-type convolutional neural networks for population estimation from house detection (Tiecke et al., 2017), or road detection with distortion tolerance (Zhou et al., 2018), or ensembles of map-information assisted U-Net models (Ozaki, 2019)

  • ResNet-like down-/upsampling for semantic segmentation of houses (Microsoft, 2018)

  • image inpainting for aerial imagery for semi-supervised segmentation (Singh et al., 2018)

  • human infrastructure feature classification in overhead imagery with data prefiltering (Bonafilia et al., 2019)

Distributed deep learning is the de-facto approach to accelerate the training of deep neural networks. Until recently, it was believed that the asynchronous parameter-server-based distributed deep learning method is able to outperform synchronous distributed deep learning. However, researchers demonstrated that synchronous training is superior to asynchronous parameter-server-based training, both from a theoretical and an empirical perspective (Zhang et al., 2016; Gupta et al., 2016; Goyal et al., 2017). Nevertheless, the straggler problem remains a major issue in synchronous distributed training, in particular in a large scale setting. Decentralized deep learning (Lian et al., 2017) was recently proposed to reduce latency issues in synchronous distributed training, and researchers demonstrated that decentralized deep learning is guaranteed to have the same convergence rate as synchronous training. Asynchronous decentralized deep learning was further proposed to improve runtime performance while guaranteeing the same convergence rate as the synchronous approach (Lian et al., 2018). Both scenarios have been verified in theory and practice over 100 GPUs on standard computer vision tasks such as ImageNet.

3. Data Corpus

Since geo-spatial data comes in different geo-projections, spatial and temporal resolution, it is critical to correctly reference it for consistent data preparation. In this study, we used three sources of data: NAIP, OSM, and SpaceNet. To obtain consistency in the generated training data, a tool developed by IBM was utilized: PAIRS (Klein et al., 2015; Lu et al., 2016), shorthand for Physical Analytics Integrated data Repository and Services. The following sections describe in detail how PAIRS processes each dataset to generate a uniform, easily accessible corpus of training data.

3.1. Big Geo-Spatial Data Platform

Figure 3. Sample training data generated from PAIRS retrieval. The curation and geo-indexing of the PAIRS system aligns the aerial imagery and the rasterized OSM map to exactly match pixel by pixel.

Both the NAIP imagery and OSM rasterized data are first loaded into IBM PAIRS. PAIRS is a big spatio-temporal data platform that curates, indexes and manages a plurality of raster and vector datasets for easy consumption by cross-datalayer queries and geospatial analytics. Among others, datasets encompass remote sensing data including satellite/aerial images, point cloud data like LiDAR (Light Detection And Ranging) (Yan et al., 2015), weather forecast data, sensor measurement data (cf. Internet of Things (IEEE, 2015)), geo-survey data, etc.

PAIRS (Lu et al., 2016; Klein et al., 2015) is based on the scalable key-value store HBase (Vora, 2011). For users it masks various complex aspects of the geo-spatial domain where, e.g., hundreds of formats and geo-projections exist, provided by dozens of data sources. PAIRS employs a uniform geo-projection across all datasets with nested indexing (cf. QuadTree (Finkel and Bentley, 1974)). Raster data such as satellite images are cut into cells of size 32 by 32 pixels. Each cell is stored in HBase indexed by a 16-byte key encoding its spatial and temporal information. Among others, the key design ensures efficient data storage and fast data retrieval.
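To make the idea of a nested, fixed-width cell key more concrete, the following is a minimal sketch and not the actual PAIRS key layout, which is not described here. It assumes, purely for illustration, that the 16 bytes are split into an 8-byte Z-order (Morton) index of the 32 by 32 pixel cell and an 8-byte integer timestamp; the function names `interleave_bits` and `cell_key` are hypothetical.

```python
import struct

CELL_SIZE = 32  # pixels per cell edge, as stated in the text


def interleave_bits(col: int, row: int, bits: int = 32) -> int:
    """Interleave the bits of col and row into one Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)      # even bit positions: column
        code |= ((row >> i) & 1) << (2 * i + 1)  # odd bit positions: row
    return code


def cell_key(pixel_x: int, pixel_y: int, timestamp: int) -> bytes:
    """Build a hypothetical 16-byte key: 8 bytes spatial index, 8 bytes time."""
    col, row = pixel_x // CELL_SIZE, pixel_y // CELL_SIZE
    spatial = interleave_bits(col, row)
    # Big-endian packing keeps the byte-wise key order equal to numeric order.
    return struct.pack(">QQ", spatial, timestamp)


if __name__ == "__main__":
    key = cell_key(pixel_x=70, pixel_y=40, timestamp=1_580_000_000)
    print(len(key), key.hex())
```

With such a layout, all pixels of one cell share a key, and cells that are close in space tend to be close in key order, which is the property a QuadTree-like index is meant to provide.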

3.2. Training Data Characteristics

USDA offers NAIP imagery products which are available either as digital ortho quarter quad tiles or as compressed county mosaics (United States Department of Agriculture, 2020). Each individual image tile within the mosaic covers a 3.75 by 3.75 minute quarter quadrangle plus a 300 meter buffer on all four sides. The imagery comes with 4 spectral bands covering a red, green, blue, and near-infrared channel. The spatial resolution effectively varies from half a meter up to about two meters. The survey is done over most of the territory of the CONUS such that each location is revisited about every other year.

The open-source project OpenStreetMap (OSM) (19) is a collaborative project to create a freely available map of the world. The data of OSM is under tremendous growth with over one million registered contributors (as of February 2020) editing and updating the map. On the one hand, OSM data stem from GPS traces collected by voluntary field surveys such as hiking and biking. On the other hand, a web-browser-based graphical user interface with high resolution satellite imagery provides contributors across the world an online tool to generate vector datasets annotating and updating information on roads, buildings, land cover, points of interest, etc.

Both approaches have limitations. Non-military GPS is only accurate within meters: roads labeled by GPS can be off by up to 20 meters. By contrast, manual annotation is time consuming and typically focused on densely populated areas like cities, towns, etc. Fig. 2 shows examples of OSM labels with satellite imagery as background for geo-spatial reference. By visual inspection, it is evident that in most of the densely populated areas, roads are relatively well labeled. Concerning houses the situation is similar. However, compared to surrounding roads, more geo-spatial reference points need to be inserted into the OSM database per unit area in order to label all buildings. Thus, building coverage tends to lag behind road labeling. Shifting attention to rural and remote areas, the situation becomes even worse, because there are fewer volunteers available on site, or there is simply less attention and demand to build an accurate map in such geographic regions.

Since our deeply-learnt models are conceptually based on a series of convolutional neural networks that perform auto-encoding-type operations for the translation of satellite imagery into maps, the problem under consideration can be rendered in terms of image segmentation. Therefore, we pick a rasterized representation of the OSM vector data, essentially projecting the information assembled by the OSM community into an RGB-channel image employing the Mapnik framework (Mapnik, 2005). This way each color picked for, e.g., roads (white), main roads (yellow), highways (blue), and houses (brown) becomes a unique pixel segmentation label.

The rasterized OSM data is then geo-referenced and indexed in PAIRS such that for each NAIP pixel there exists a corresponding unique label. The label generation by the OSM rasterization is performed such that the OSM feature with the highest z-order attribute (this index specifies an on-top order) is picked. E.g., if a highway crosses a local road by bridge, the highway color label is selected for the RGB image rasterization. This procedure resembles the top-down view of the satellite capturing the NAIP imagery. A notable technical aspect of PAIRS is its efficient storage: besides data compression, no-data pixels (cf. the GeoTiff standard (Consortium, 2019)) are not explicitly stored. In terms of uploading the rasterized OSM data, this approach significantly reduced the amount of disk space needed. By definition, we declare the OSM sandy background color as no-data.
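The z-order rule can be illustrated with a small sketch. It is not the Mapnik or PAIRS implementation; it assumes, for illustration only, that each feature is given as a label, a z-order value, and the set of pixel coordinates it covers, and that `None` plays the role of the no-data background. The name `rasterize` and the data layout are hypothetical.

```python
def rasterize(features, width, height, no_data=None):
    """Assign to each pixel the label of the covering feature with highest z-order.

    features: iterable of (label, z_order, pixels) with pixels a set of (x, y).
    Pixels not covered by any feature keep the no_data value.
    """
    grid = [[no_data] * width for _ in range(height)]
    top = [[None] * width for _ in range(height)]  # highest z-order seen so far
    for label, z_order, pixels in features:
        for x, y in pixels:
            if 0 <= x < width and 0 <= y < height:
                if top[y][x] is None or z_order > top[y][x]:
                    top[y][x] = z_order
                    grid[y][x] = label
    return grid


if __name__ == "__main__":
    # A local road along row 1 and a highway bridge along column 1.
    road = ("road", 0, {(0, 1), (1, 1), (2, 1)})
    highway = ("highway", 1, {(1, 0), (1, 1), (1, 2)})
    for row in rasterize([highway, road], width=3, height=3):
        print(row)
```

At the crossing pixel the highway label wins regardless of the order in which the features are processed, mirroring the top-down view described above.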

After both NAIP and rasterized OSM data are successfully loaded into PAIRS, we use Apache Spark (Zaharia et al., 2016) to export the data into tiles of pixels. The uniform geo-indexing of the PAIRS system guarantees that aerial imagery and rasterized maps match exactly at the same resolution and the same tile size. Fig. 3 shows an illustration of training samples. For this study, we focused on the state of Texas, where we exported a total of about million images with total volume of TB. However, we limit our investigation to the big cities in Texas with rich human infrastructure, namely: Dallas, Austin, Houston, and San Antonio.

3.3. SpaceNet: Data Evaluation Reference

Since OSM data bear inaccuracies and are incomplete in labeling, we picked a reference dataset to test our approach against a baseline that has been well established within the remote sensing community over the last couple of years. SpaceNet offers, e.g., accurately labeled building polygons for the city of Las Vegas, NV. Given a set of GeoJSON vector data files, PAIRS rasterized these into the same style as employed by the OSM map discussed above. Then, the corresponding result is curated and ingested as a separate raster layer, ready to be retrieved from PAIRS the same way as illustrated in Fig. 3. This setup allows us to apply the same training and evaluation procedure to be discussed in the sections below.

4. Models

In this study we have used two existing deep neural network architectures with further improvement on the baseline models in view of the characteristics of the data available (cf. Sect. 1). Details of the models employed are described in the following.

4.1. U-Net

U-Net (Ronneberger et al., 2015) is a convolutional neural network encoder-decoder architecture. It is a state-of-the-art approach for image segmentation tasks. In particular, U-Net is the winning algorithm in the 2nd round of the SpaceNet (Etten et al., 2018) challenge on house detection using high resolution satellite imagery. The U-Net architecture consists of a contracting path to capture context, and employs a symmetric expanding path that enables precise localization. The contracting path is composed of a series of convolution and max-pooling layers to coarse-grain the spatial dimensions. The expanding path uses up-sampling layers or transposed convolution layers to expand the spatial dimensions in order to generate a segmentation map with the same spatial resolution as the input image. Since the expanding path is symmetric to the contracting path, with skip connections wiring the two together, the architecture is termed U-Net. Training the U-Net in a supervised manner for our remote sensing scenario requires a pixel-by-pixel matching rasterized map for each satellite image.

4.2. CycleGAN

Figure 4. Two pairs of samples of rasterized OSM maps (RGB color, left) with their corresponding feature mask (bi-color, right) next to each other.
Figure 5. Given satellite imagery data $x$ and corresponding map data $y$: illustration of our CycleGAN architecture showing the data flow from image input $x$, to generated map $\hat{y} = G(x)$, to recreated image $\hat{x} = F(\hat{y})$. This cycle is used to compute a consistency loss which is weighted by the feature map $M$, yielding the FW-loss contribution during training.

The CycleGAN model was introduced in 2017 (Zhu et al., 2017) for the task of image-to-image translation, a class of computer vision problems whose goal is to learn the mapping between two distributions of unpaired data sets $X$ and $Y$. Given images $x$ from a source distribution $p_{\mathrm{data}}(x)$ and maps $y$ from a target distribution $p_{\mathrm{data}}(y)$, the task is to learn a mapping $G: X \to Y$ such that the distribution of $G(x)$ is as close as possible to the distribution of $y$. In addition, a mapping $F: Y \to X$ is established to further regulate the network's learning by a so-called cycle consistency loss enforcing $F(G(x)) \approx x$. Starting off with $y$ and repeating this line of reasoning, a second cycle consistency loss pushes the CycleGAN's numerical optimization towards $G(F(y)) \approx y$.

The paper introducing CycleGAN (Zhu et al., 2017) provided a showcase that translated satellite imagery to maps by way of example. Some rough, qualitative measurement of CycleGAN's ability to convert overhead imagery to maps was provided. In our paper, CycleGAN is evaluated quantitatively for the first time in terms of house and road detection.

For the discussion to follow, $x$ represents imagery input and $y$ map images. The generators taking them as input are $G$ and $F$, respectively.

4.3. Feature-Weighted CycleGAN

In generating maps from aerial images, the focus is to precisely extract well defined features such as houses and roads from the geographic scene. Thus, we added one more loss to the CycleGAN's training procedure which we refer to as feature-weighted cycle consistency loss, FW loss for short. The FW loss is an attention mechanism putting more weight on those pixel differences of the cycle consistency loss which correspond to map features under consideration. In the following we describe in detail how features are defined and extracted, and how the FW loss is computed.

As mentioned in Sect. 3.2, for the rasterized OSM data, houses and roads are labeled using a set of fixed colors $C$. For example, houses (mostly residential) are labeled by a brownish color, cf. Fig. 4. In contrast, roads can take a few colors such as plain white, orange, light blue, etc. We applied a color similarity-based method to extract the pixels that are detected as features according to the predefined list of feature colors. More specifically, the feature pixels from the OSM map $y$ and its generated counterpart $\hat{y}$ are extracted by a color similarity function $\delta$ generating a pixel-wise feature mask $M$. $\delta$ is a color-difference formula defined by the International Commission on Illumination (Sharma, 2003). The binary feature mask $M(y)$ for a map $y$ is generated from a three-dimensional matrix $y_{ijk}$ with indices $i, j$ representing gridded geospatial dimensions such as longitude and latitude, and $k$ indexing the color channel of the map image:

$$M(y)_{ij} = \begin{cases} 1 & \text{if } \delta(y_{ij}, c) < \Delta \text{ for some } c \in C, \\ 0 & \text{otherwise,} \end{cases}$$

where $y_{ij}$ denotes the color vector $(y_{ijk})_k$ of pixel $(i, j)$ and $\Delta$ is a predefined threshold for the set of colors $C$.

Given $M$, we added a FW loss to the generators' loss function in the CycleGAN model.
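As a rough illustration of the idea, and explicitly not the authors' formula, the sketch below builds a binary feature mask from a color distance and uses it to weight an L1 cycle-consistency difference. Several simplifications are assumptions made here: plain Euclidean distance in RGB stands in for the CIE color-difference formula, the feature colors and the threshold are placeholder values, and the loss is normalized by the number of feature pixels. All names (`feature_mask`, `feature_weighted_l1`, `FEATURE_COLORS`) are hypothetical.

```python
import math

# Placeholder feature colors (a road-like white, a house-like brown); assumed values.
FEATURE_COLORS = [(255, 255, 255), (200, 170, 150)]


def color_distance(a, b):
    """Euclidean RGB distance, a simple stand-in for a CIE color difference."""
    return math.dist(a, b)


def feature_mask(rgb_map, colors=FEATURE_COLORS, threshold=30.0):
    """Mark a pixel with 1 if it is within `threshold` of any feature color."""
    return [
        [1 if min(color_distance(px, c) for c in colors) < threshold else 0
         for px in row]
        for row in rgb_map
    ]


def feature_weighted_l1(image, recreated, mask):
    """Absolute difference summed over channels, averaged over masked pixels."""
    total, count = 0.0, 0
    for img_row, rec_row, mask_row in zip(image, recreated, mask):
        for p, q, m in zip(img_row, rec_row, mask_row):
            total += m * sum(abs(a - b) for a, b in zip(p, q))
            count += m
    return total / count if count else 0.0


if __name__ == "__main__":
    osm_map = [[(255, 255, 255), (0, 128, 0)]]    # one road pixel, one non-feature pixel
    image = [[(10, 10, 10), (50, 50, 50)]]        # input image x
    recreated = [[(12, 10, 10), (0, 0, 0)]]       # cycle output F(G(x))
    mask = feature_mask(osm_map)
    print(mask, feature_weighted_l1(image, recreated, mask))
```

In this toy case only the first pixel contributes to the loss, so the large reconstruction error on the non-feature pixel is ignored, which is the attention effect the FW loss is meant to achieve.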


5. Evaluation Metrics

Figure 6. Sample scenario from which to compute the score: On the top row, (a), (b) and (c) are the OSM map $y$, the generated map $\hat{y}$, and the NAIP image $x$, respectively. On the bottom row, (d), (e) and (f) are houses as labeled by the OSM map and houses detected. This results in the following color-coded cases: correctly detected houses (TP, red), missed houses (FN, pink), and false positive houses (FP, cyan).

We adopted a feature-level detection score, similar to the SpaceNet evaluation approach: each house or road detected by the generated map $\hat{y}$ is evaluated against the OSM map $y$ using a binary classification score which consists of both precision and recall. In the following we detail how each score is computed.

In a first step, detected feature pixels in the maps (both the OSM map $y$ and the generated map $\hat{y}$) are extracted using the same method as described in Sect. 4.3. Then sets of polygons $\{p_i\}$ and $\{\hat{p}_j\}$ are generated from the extracted features. A feature like a house in the real map $y$ represented by a polygon $p_i$ is correctly detected if there exists a corresponding polygon $\hat{p}_j$ in the generated map $\hat{y}$, such that the Intersection over Union (IoU)

$$\mathrm{IoU}(p_i, \hat{p}_j) = \frac{|p_i \cap \hat{p}_j|}{|p_i \cup \hat{p}_j|}$$

is greater than a given threshold $T$, where we used $T = 0.3$ throughout our experiments. The case counts as a true positive ($tp_i = 1$, $tp_i = 0$ otherwise) for our evaluation. If there does not exist any $\hat{p}_j$ that exceeds the IoU threshold $T$, a false negative ($fn_i = 1$, $fn_i = 0$ otherwise) is registered. Vice versa, if there does not exist any $p_i$ of $\{p_i\}$ such that $\mathrm{IoU}(p_i, \hat{p}_j) > T$, then the polygon $\hat{p}_j$ is counted as a false positive ($fp_j = 1$, $fp_j = 0$ otherwise).

Fig. 6 demonstrates examples of all three situations. The procedure is repeated for all pairs of geo-referenced test data maps $y$ with corresponding generated map $\hat{y}$.

The true positive count of the data set, $TP = \sum_{y} \sum_i tp_i$, is the total number of true positives for all samples $y$. In the same manner, the false positive count is computed according to $FP = \sum_{\hat{y}} \sum_j fp_j$ from all $\hat{y}$. Finally, we have the false negative count determined by $FN = \sum_{y} \sum_i fn_i$.

Once the integrated quantities $TP$, $FP$, and $FN$ are obtained, precision and recall are computed by their standard definitions:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}.$$

In addition, the F1-score, which is the harmonic mean of precision and recall, is defined through

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$


As already discussed, the OSM labels are neither complete in terms of houses, nor accurate in terms of roads for rural areas. By way of experiment, however, we found that most of the house labels are accurate, if existing. Therefore, we assume house labels to be incomplete, but accurate, and hence, we restrict ourselves to the recall score as a measure to evaluate model performance for detecting houses. We provide a discussion on the precision score as a complement in Sect. 7.4, employing human visual inspection.
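The feature-level scoring described in this section can be sketched as follows. This is a simplified illustration rather than the evaluation code used for the experiments: it assumes axis-aligned bounding boxes `(xmin, ymin, xmax, ymax)` in place of general polygons, and the function names are hypothetical. The IoU threshold of 0.3 is the value quoted in the text.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(inter_w, 0) * max(inter_h, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_counts(labeled, detected, threshold=0.3):
    """Count true positives, false positives and false negatives."""
    # A labeled feature is a true positive if some detection overlaps it enough.
    tp = sum(1 for p in labeled
             if any(box_iou(p, q) > threshold for q in detected))
    fn = len(labeled) - tp
    # A detection is a false positive if it matches no labeled feature.
    fp = sum(1 for q in detected
             if not any(box_iou(p, q) > threshold for p in labeled))
    return tp, fp, fn


def scores(tp, fp, fn):
    """Precision, recall and F1 from the aggregated counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    labeled = [(0, 0, 10, 10), (20, 20, 30, 30)]
    detected = [(1, 1, 10, 10), (50, 50, 60, 60)]
    print(scores(*detection_counts(labeled, detected)))
```

In this toy example one labeled house is found, one is missed, and one detection has no counterpart in the labels. With incomplete labels, such an unmatched detection may well be a real house, which is why recall is the more trustworthy score here.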

6. Experimental Setup

Our experiments were developed and performed on a cluster of 4 servers. Each machine has 14-core Intel Xeon E5-2680 v4 2.40GHz processors, 1TB main memory, and 4 Nvidia P100 GPUs. GPUs and CPUs are connected via PCIe Gen3 bus with 16GB/s peak bandwidth in each direction. The servers are connected by 100Gbit/s ethernet.

PyTorch version 1.1.0 is the underlying deep learning framework in use. We use Nvidia’s CUDA 9.2 API model, the CUDA-aware OpenMPI v3.1.1, and the GNU C++ compiler version 4.8.5 to build our communication library, which connects with PyTorch via a Python-C interface.

7. Results and Discussion

As discussed above, in this study we focused on four cities in Texas, namely: Austin, Dallas, San Antonio, and Houston. After all data tiles had been exported from PAIRS, in a first step, we applied an entropy threshold on the tile’s pixel value distribution. It enabled us to filter out tiles dominated by bare land with few features such as houses and roads. Then, each collection of tiles is randomly split into training and testing with split ratio .
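The entropy-based tile filter can be sketched as below. The exact entropy definition and threshold are not given in the text, so this is an assumption-laden illustration: it uses the Shannon entropy of the histogram of single-channel pixel values, and the cut-off `min_entropy=1.0` is an arbitrary placeholder. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (in bits) of the distribution of pixel values."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def keep_tile(pixels, min_entropy=1.0):
    """Keep a tile only if its pixel value distribution is varied enough."""
    return shannon_entropy(pixels) >= min_entropy


if __name__ == "__main__":
    bare_land = [7] * 16            # uniform tile, entropy 0
    mixed = [0, 1, 2, 3] * 4        # four equally frequent values, entropy 2 bits
    print(keep_tile(bare_land), keep_tile(mixed))
```

A tile showing nothing but uniform background has an entropy near zero and is dropped, while a tile with roads and houses mixes several pixel values and passes the filter.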

Among the four cities in Texas extracted for this study, the house density (defined as the average number of houses labeled per square kilometer) varies significantly, as summarized in Table 1. Given that the four cities are located in the same state, one would expect the density of houses to be relatively similar, yet the number of house labels as provided by the OSM map varies by more than one order of magnitude. For our setting, we consider the SpaceNet dataset as most complete in terms of house labels. Although Las Vegas, NV is not in the state of Texas, for lack of a better alternative we used its house density for book-keeping purposes to compute the completeness score. We define the completeness score as the ratio of the house density of any Texas city to the house density in Las Vegas. The house densities and corresponding completeness scores are listed in Table 1. House density is a critical variable for model performance, as a less complete dataset generates a more biased model, thus impacting overall accuracy. After we present our findings on model performance in view of data completeness below, we detail how we incrementally fill in missing data in order to improve overall model accuracy.

City House Density Completeness Score
Vegas 3283 100%
Austin 1723 52%
Dallas 1285 39%
San Antonio 95 3%
Houston 141 4%
Table 1. House density (average number of labeled houses per square kilometer) and completeness score for each dataset.
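The completeness scores in Table 1 follow directly from the definition given in the text, as the short check below shows. The densities are copied from Table 1; the dictionary and function names are introduced here for illustration only.

```python
# House densities (labeled houses per square kilometer) taken from Table 1.
HOUSE_DENSITY = {
    "Vegas": 3283,
    "Austin": 1723,
    "Dallas": 1285,
    "San Antonio": 95,
    "Houston": 141,
}


def completeness(city, reference="Vegas"):
    """Ratio of a city's house density to that of the reference city."""
    return HOUSE_DENSITY[city] / HOUSE_DENSITY[reference]


if __name__ == "__main__":
    for city in HOUSE_DENSITY:
        print(f"{city}: {completeness(city):.0%}")
```

Rounded to whole percent, the ratios reproduce the 100%, 52%, 39%, 3% and 4% listed in the table.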

7.1. Model Comparison on Datasets with Different Level of Completeness

In a first step, we established a comparison of U-Net vs. CycleGAN using the most accurate and complete dataset from Las Vegas. Results are summarized in Table 2. As expected, the winning architecture in the SpaceNet building detection challenge performs much better than the CycleGAN model. We note that for the SpaceNet Las Vegas dataset, houses are the only labels, i.e. in the rasterized map used for training, no road labels exist. In our experiments we observed that CycleGAN is challenged by translating satellite images with rich features such as roads, parks, lots, etc. into void on the generated map. Thus the task is not necessarily suited for such an architecture.

Model Train City Test City Precision Recall F1
U-Net Vegas Vegas 0.829 0.821 0.825
CycleGAN Vegas Vegas 0.700 0.414 0.520
Table 2. U-Net and FW-CycleGAN Comparison on SpaceNet Vegas Dataset.

7.2. Model comparison on generalization

In a next step, we investigated the generalization capability of the U-Net model trained on the accurate and complete SpaceNet data in Las Vegas. If the model were able to generalize from one geo-location to another, the amount of data needed to train a model for the entire area of CONUS would be significantly reduced. After training a U-Net model on the SpaceNet Las Vegas data, inference was performed on the Austin, TX dataset from OSM. The results are summarized in Table 3. We observe a drop in recall from 82% in Las Vegas, NV down to 25% in Austin, TX. Hence, the result underlines the need for a training dataset with a wide variety of scenes across different geo-locations in order to be able to generate accurate maps.

Model Train City Test City Precision Recall F1
U-Net Vegas Vegas 0.829 0.821 0.825
U-Net Vegas Austin 0.246 0.253 0.250
Table 3. The U-Net model generalization test cases.

Last, we compared the CycleGAN, the FW-CycleGAN and the U-Net models using the Austin dataset. Corresponding results are shown in Table 4. The additional FW loss significantly improved the recall of the CycleGAN, increasing it to 74.0% from a baseline value of 46.4%. Also, the FW-CycleGAN model slightly outperformed the U-Net, which achieved a recall of 73.2%.

Model Train City Test City Precision Recall F1
CycleGAN Austin Austin 0.546 0.464 0.501
FW-CycleGAN Austin Austin 0.641 0.740 0.687
U-Net Austin Austin 0.816 0.732 0.772
Table 4. CycleGAN, FW-CycleGAN and U-Net Comparison on PAIRS dataset.

In yet another case study, we numerically determined the recall of the two best performing models, namely FW-CycleGAN and U-Net, on other cities. The results are summarized in Table 5. As demonstrated, the FW-CycleGAN model consistently generates better recall values across all the three cities in Texas other than Austin.

Model Train City Test City Precision Recall F1
FW-CycleGAN Austin Austin 0.641 0.740 0.687
U-Net Austin Austin 0.816 0.732 0.772
FW-CycleGAN Austin Dallas 0.495 0.626 0.553
U-Net Austin Dallas 0.512 0.498 0.505
FW-CycleGAN Austin San Antonio 0.034 0.546 0.063
U-Net Austin San Antonio 0.032 0.489 0.059
FW-CycleGAN Austin Houston 0.040 0.470 0.074
U-Net Austin Houston 0.045 0.370 0.080
Table 5. Generalization comparison between FW-CycleGAN and U-Net

7.3. Incremental Data Augmentation

Given the assumption that maps of cities from the same state follow the same statistics regarding features, we propose a data augmentation scheme to incrementally fill in missing labels in less complete datasets. The incremental data augmentation scheme uses a model trained on a more completely labeled dataset, e.g., the Austin area, to generate maps for less complete datasets, e.g., the geo-locations of Dallas and San Antonio.

More specifically, due to the consistently better generalization scores shown in Table 5, we used the FW-CycleGAN model trained on Austin data to generate augmented maps. The houses labeled as false positives in the generated maps are added to the original maps to create the augmented maps. In this study, we also generated the augmented map for Austin. A sample of the augmented map compared to the original map as well as the corresponding satellite imagery is shown in Fig. 7.

By data augmentation, the average house density of Austin increased by about 25% to 2168 per square kilometer. The house density of Dallas and San Antonio has been increased to a level close to Austin. Numerical details are provided in Table 6. Note that data augmentation is only performed on the training dataset; the testing dataset remains the same throughout the experiment.

City          OSM House Density   Augmented House Density
Austin        1723                2168
Dallas        1285                1678
San Antonio   95                  1259
Table 6. House density comparison with data augmentation.
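The augmentation step, adding detected houses that have no counterpart in the original labels, can be sketched as follows. As a simplifying assumption, houses are represented by axis-aligned boxes `(xmin, ymin, xmax, ymax)` rather than polygons or raster pixels, and the matching reuses the IoU threshold of 0.3 from the evaluation; the function names are hypothetical and this is not the authors' implementation.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(inter_w, 0) * max(inter_h, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def augment_labels(original, generated, threshold=0.3):
    """Return the original labels plus generated houses matching none of them."""
    unmatched = [q for q in generated
                 if not any(box_iou(p, q) > threshold for p in original)]
    return original + unmatched


if __name__ == "__main__":
    osm_houses = [(0, 0, 10, 10)]
    model_houses = [(1, 1, 10, 10), (50, 50, 60, 60)]
    print(augment_labels(osm_houses, model_houses))
```

Houses that the model detects on top of an existing label are left untouched, so the original OSM geometry is preserved and only the missing houses are filled in.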

The model accuracy obtained with both the original OSM map and its augmented counterpart for the three cities under consideration is shown in Table 7. The models trained using the augmented map achieve higher recall than the models trained using the original OSM map, in particular for cities with fewer labels in OSM. For example, the recall score was lifted from 11.8% to 51.4% for the U-Net in the San Antonio area. Even for Austin, where the data are most complete, recall improved from 73.2% to 84.9%, which is close to the SpaceNet winning solution with an F1 score of 88.5% for Las Vegas. We note that, compared to our training data, the SpaceNet winning solution was achieved by using an 8-band multispectral data set plus OSM data, trained on a more completely and accurately labeled dataset. We also note that we employ a smaller IoU threshold than the SpaceNet metrics. Nevertheless, there are certainly more trees in Texas, which strongly impact the detected area of a house. Tree cover is the main reason we reduce the IoU threshold to a smaller value of 0.3.

Figure 7. Sample data augmentation. There are six samples in total, each consisting of three figures, with two samples per row. The first figure in each sample is the original OSM map label, the second figure is the augmented map, and the last one is the high resolution imagery as a reference to show the validity of the augmented map labels. As shown, only false positive houses in the generated map are added back to the original OSM map to generate the augmented map.

City          Model         Precision          Recall             F1
                            OSM    Augmented   OSM    Augmented   OSM    Augmented
Austin        FW-CycleGAN   0.641  0.614       0.74   0.769       0.687  0.683
              U-Net         0.816  0.7         0.732  0.849       0.772  0.768
Dallas        FW-CycleGAN   0.524  0.536       0.51   0.761       0.517  0.629
              U-Net         0.765  0.633       0.772  0.830       0.768  0.718
San Antonio   FW-CycleGAN   0.082  0.020       0.133  0.409       0.101  0.039
              U-Net         0.179  0.026       0.118  0.514       0.142  0.049
Table 7. Model accuracy comparison using original OSM map and augmented map.

7.4. Discussion on Precision

The precision score evaluated using OSM labels as ground truth is negatively impacted by the incompleteness of the OSM data. Indeed, we observed that OSM labels frequently miss houses whose presence is evident from NAIP imagery. To provide a quantitative understanding of the true model performance, we took the Austin test dataset as a case study: we randomly picked 100 samples of the aerial images and manually surveyed them to correct the incomplete OSM labels. Using the corrected labels as the ground truth, the counts of true positives (TP), false positives (FP), and false negatives (FN), and the resulting precision, recall, and F1 scores, were computed for the subset.

The corresponding results are shown in Table 8. Comparing Table 8 with Table 4, the F1 scores of both U-Net and FW-CycleGAN improved with the manual label correction. While the recall scores of the two models are the same, the U-Net yields a higher precision score and thus a higher F1 score of 87.5%. The improved scores are expected, as many false positives turned into true positives after the correction of incomplete OSM labels. Moreover, models trained with incomplete labels will likely under-estimate the positives, which results in a higher precision and a lower recall score for both models. Overall, U-Net outperforms FW-CycleGAN on this dataset, which is relatively more accurate than those of the other cities under study.

Remarkably, this limited preliminary work showed that after data augmentation, even though the U-Net is trained using inaccurate and incomplete OSM training labels, its F1 score is on par with the winning SpaceNet solution, which was trained and validated on accurate datasets. We plan to perform more comprehensive manual checking on other datasets in the future.

Model         TP     FP    FN    Precision   Recall   F1
FW-CycleGAN   4112   758   956   0.844       0.811    0.828
U-Net         3817   203   889   0.950       0.811    0.875
Table 8. Corrected scores for FW-CycleGAN and U-Net with manual count on the Austin dataset.
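The scores in Table 8 follow from the counts by the standard definitions of precision, recall, and F1. The short sketch below recomputes them; the function name is hypothetical, and the counts are those listed in the table.

```python
def detection_scores(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from Table 8 (manually corrected Austin subset).
counts = {
    "FW-CycleGAN": (4112, 758, 956),
    "U-Net": (3817, 203, 889),
}
scores = {model: detection_scores(*c) for model, c in counts.items()}
for model, (p, r, f1) in scores.items():
    print(f"{model}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

The two models reach nearly the same recall, so the difference in F1 comes almost entirely from the much smaller number of false positives of the U-Net.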

7.5. Distributed training experiment

We used Decentralized Parallel SGD (DPSGD) (Lian et al., 2017) to accelerate training. We rearranged the weight update steps for both the generators and the discriminators such that the step function of the stochastic gradient descent algorithm executes simultaneously for both. This way, the weight updating and averaging step of the two sets of networks (generators and discriminators) remains the same as for an architecture that employs one network and one objective function, as originally proposed in (Lian et al., 2017). All learners (GPUs) are placed in a communication ring. After every mini-batch update, each learner randomly picks another learner in the communication ring to perform weight averaging, as proposed in (Zhang et al., 2020). In addition, we overlap the gradient computation with the weight communication and averaging step to further improve runtime performance. As a result, we achieved a speed-up of 14.7 with 16 GPUs, reducing the training time of CycleGAN from roughly 122 hours on one GPU to 8.28 hours with decentralized parallel training on 16 GPUs.
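The update-then-average pattern described above can be sketched as a small single-process simulation. This is an illustration only and not the implementation used in the experiment: it assumes a toy quadratic loss per learner, runs the learners sequentially rather than on separate GPUs, does not overlap communication with computation, and all function names are hypothetical.

```python
import numpy as np


def local_sgd_step(weights, targets, lr):
    """Each learner takes one gradient step on its own loss 0.5 * ||w - target||^2."""
    grads = weights - targets  # one gradient row per learner
    return weights - lr * grads


def pairwise_average(weights, rng):
    """Each learner picks a random other learner and both adopt the average."""
    n = len(weights)
    for i in range(n):
        j = (i + rng.integers(1, n)) % n  # any learner except i
        avg = 0.5 * (weights[i] + weights[j])
        weights[i] = avg
        weights[j] = avg
    return weights


def train(n_learners=16, dim=4, steps=200, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    targets = rng.normal(size=(n_learners, dim))  # each learner sees different data
    weights = rng.normal(size=(n_learners, dim))  # each learner holds its own copy
    for _ in range(steps):
        weights = local_sgd_step(weights, targets, lr)
        weights = pairwise_average(weights, rng)
    return weights, targets


weights, targets = train()
```

Because pairwise averaging leaves the mean of all weight copies unchanged, the averaged model in this toy setting follows the same trajectory as gradient descent on the mean loss, while the individual copies are repeatedly pulled towards each other.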

8. Public Access to Generated Maps

Using the big geo-spatial data platform PAIRS discussed in Sect. 3.1, we make the maps inferred by our models available to the public. The open-source Python module ibmpairs is available as both a pip and a conda package, i.e. running pip install ibmpairs or conda install -c conda-forge ibmpairs is a way to get the Python module. In order to retrieve the generated map features as a colored overlay with the geo-referenced NAIP RGB image in the background, the following JSON load

{ "layers": [
    { "id": "50155"}, {"id": "50156"}, {"id": "50157"},
    { "id": "49238", "aggregation": "Max",
      "temporal": {"intervals": [{"start": "2014-1-1","end": "2016-1-1"}]}},
    { "id": "49239", "aggregation": "Max",
      "temporal": {"intervals": [{"start": "2014-1-1","end": "2016-1-1"}]}},
    { "id": "49240", "aggregation": "Max",
      "temporal": {"intervals": [{"start": "2014-1-1","end": "2016-1-1"}]}}
  ],
  "temporal": {"intervals": [{"snapshot": "2019-4-16"}]},
  "spatial": {
    "coordinates": [32.6659568,-97.4756499, 32.6790701,-97.4465533],
    "type": "square"
  }
}

can be submitted as a query to PAIRS. The Python sub-module paw of ibmpairs utilizes the core query RESTful API of PAIRS. Example code reads:

# set up connection to PAIRS
from ibmpairs import paw
import os
# placeholder value: replace with your own PAIRS user name
os.environ['PAW_PAIRS_DEFAULT_USER'] = '<PAIRS user name>'
# retrieve generated map and associated imagery from PAIRS
query = paw.PAIRSQuery(queryJSON)

with queryJSON the JSON load listed above, assuming a fictitious user. Overall, six PAIRS raster layers are queried, corresponding to 3 RGB channels for the satellite imagery and 3 RGB channels for the generated, rasterized map.
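As a convenience, the JSON load can also be assembled programmatically before it is handed to the query object. The sketch below uses only the Python standard library and does not call ibmpairs; the helper name and its parameters are hypothetical, and placing the per-layer time interval under a "temporal" key is an assumption about the query format.

```python
import json


def build_query(snapshot_layer_ids, aggregated_layer_ids, snapshot,
                interval_start, interval_end, bbox):
    """Assemble a query dictionary with the layer, time, and area settings."""
    # Layers that are read at the query-wide snapshot date.
    layers = [{"id": str(layer_id)} for layer_id in snapshot_layer_ids]
    # Layers that are aggregated (maximum) over their own time interval.
    for layer_id in aggregated_layer_ids:
        layers.append({
            "id": str(layer_id),
            "aggregation": "Max",
            # Assumption: the per-layer interval sits under a "temporal" key.
            "temporal": {"intervals": [{"start": interval_start,
                                        "end": interval_end}]},
        })
    return {
        "layers": layers,
        "temporal": {"intervals": [{"snapshot": snapshot}]},
        "spatial": {"type": "square", "coordinates": list(bbox)},
    }


queryJSON = build_query(
    snapshot_layer_ids=[50155, 50156, 50157],
    aggregated_layer_ids=[49238, 49239, 49240],
    snapshot="2019-4-16",
    interval_start="2014-1-1",
    interval_end="2016-1-1",
    bbox=(32.6659568, -97.4756499, 32.6790701, -97.4465533),
)
print(json.dumps(queryJSON, indent=2))
```

Building the dictionary in code makes it easy to change the bounding box or dates for other areas while keeping the six layer entries consistent.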

9. Conclusion and Future Work

In this paper, we performed a case study to investigate the quality of publicly available data in the context of map generation from high-resolution aerial/satellite imagery by applying deep learning architectures. In particular, we utilized the aerial images from the NAIP program to characterize rasterized maps based on the crowdsourced OSM dataset.

We confirmed that geographic, economic, and cultural heterogeneity leads to significant differences in man-made infrastructure, which calls for more broadly available training data like the data used in this study. We employed two state-of-the-art deep convolutional neural network models, namely U-Net and CycleGAN. Furthermore, based on the objective, we introduced the Feature-Weighted CycleGAN, which significantly improved binary classification accuracy for house detection. Although OSM is not accurate in rural areas and is incomplete in urban areas, we assumed that once a house is labeled, it is accurately labeled. Consequently, we focused our model performance evaluation on recall. In addition, we provided manually obtained evaluation metrics to show that both precision and recall increase on accurately labeled subsets.

For scenarios where the incompleteness of OSM labels is significant, we proposed an incremental data augmentation scheme that significantly improved model accuracy in such areas. Even for cities which are relatively complete in terms of labeling, the data augmentation scheme helped lift the best recall to 84.9%, and our manual count of binary classification on the Austin dataset shows that the precision score is above 84%, yielding an F1 score close to that of the corresponding SpaceNet winning solution, which exploits more input data than our approach.

Obviously, to finally map the entire world we need to deal with enormous amounts of training data. To this end, we applied a Decentralized Parallel Stochastic Gradient Descent (DPSGD) training scheme that is scalable to hundreds of GPUs with near linear speed-up. At the same time, it offers the same level of convergence as non-parallel training schemes. We demonstrated an implementation of the DPSGD scheme and achieved a speed-up of 14.7 times on a cluster of 16 GPUs. Nevertheless, in our experiments we observed a gap in convergence. Tuning the code for further improvement is the subject of our ongoing work.


References

  • D. Bonafilia, J. Gill, S. Basu, and D. Yang (2019) Building high resolution maps for humanitarian aid and development with weakly-and semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–9. Cited by: 4th item.
  • O. G. Consortium (2019) External Links: Link Cited by: §3.2.
  • I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, F. Hughes, D. Tuia, and R. Raskar (2018) DeepGlobe 2018: a challenge to parse the earth through satellite images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. Cited by: §1.
  • A. V. Etten, D. Lindenbaum, and T. M. Bacastow (2018) SpaceNet: a remote sensing dataset and challenge series. External Links: 1807.01232 Cited by: §4.1.
  • R. A. Finkel and J. L. Bentley (1974) Quad trees a data structure for retrieval on composite keys. Acta informatica 4 (1), pp. 1–9. Cited by: §3.1.
  • P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. External Links: 1706.02677 Cited by: §2.
  • S. Gupta, W. Zhang, and F. Wang (2016) Model accuracy and runtime tradeoff in distributed deep learning: a systematic study. pp. 171–180. External Links: Document Cited by: §2.
  • IEEE (2015) External Links: Link Cited by: §3.1.
  • P. Kaiser, J. D. Wegner, A. Lucchi, M. Jaggi, T. Hofmann, and K. Schindler (2017) Learning aerial image segmentation from online maps. IEEE Transactions on Geoscience and Remote Sensing 55 (11), pp. 6054–6068. Cited by: §1, §1.
  • L. J. Klein, F. J. Marianno, C. M. Albrecht, M. Freitag, S. Lu, N. Hinds, X. Shao, S. B. Rodriguez, and H. F. Hamann (2015) PAIRS: a scalable geo-spatial data analytics platform. In 2015 IEEE International Conference on Big Data (Big Data), pp. 1290–1298. Cited by: §3.1, §3.
  • X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu (2017) Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5330–5340. External Links: Link Cited by: §2, §7.5.
  • X. Lian, W. Zhang, C. Zhang, and J. Liu (2018) Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 3049–3058. External Links: Link Cited by: §2.
  • S. Lu, X. Shao, M. Freitag, L. J. Klein, J. Renwick, F. J. Marianno, C. Albrecht, and H. F. Hamann (2016) IBM pairs curated big data service for accelerated geospatial data analytics and discovery. In 2016 IEEE International Conference on Big Data (Big Data), pp. 2672–2675. External Links: Document Cited by: §3.1, §3.
  • E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez (2016) Convolutional neural networks for large-scale remote-sensing image classification. IEEE Transactions on Geoscience and Remote Sensing 55 (2), pp. 645–657. Cited by: §1.
  • Mapnik (2005) Mapnik. GitHub. Cited by: §3.2.
  • Microsoft (2018) US building footprints. GitHub. Cited by: §1, 2nd item.
  • U.S. D. of Agriculture (2020) External Links: Link Cited by: §3.2.
  • [19] Open street map. Note: https://www.openstreetmap.org Accessed: 2020-02-01 Cited by: §3.2.
  • K. Ozaki (2019) External Links: Link Cited by: 1st item.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241. External Links: ISBN 9783319245744, ISSN 1611-3349, Link, Document Cited by: §4.1.
  • G. Sharma (2003) Digital color imaging handbook. CRC Press. External Links: ISBN 0-8493-0900-X Cited by: §4.3.
  • S. Singh, A. Batra, G. Pang, L. Torresani, S. Basu, M. Paluri, and C. Jawahar (2018) Self-supervised feature learning for semantic segmentation of overhead imagery. In BMVC, Vol. 1, pp. 4. Cited by: 3rd item.
  • [24] SpaceNet – accelerating geospatial machine learning. Note: 2020-02-03 Cited by: §1.
  • T. G. Tiecke, X. Liu, A. Zhang, A. Gros, N. Li, G. Yetman, T. Kilic, S. Murray, B. Blankespoor, E. B. Prydz, et al. (2017) Mapping the world population one building at a time. arXiv preprint arXiv:1712.05839. Cited by: 1st item.
  • M. N. Vora (2011) Hadoop-hbase for large-scale data. In Proceedings of 2011 International Conference on Computer Science and Network Technology, Vol. 1, pp. 601–605. Cited by: §3.1.
  • W. Y. Yan, A. Shaker, and N. El-Ashmawy (2015) Urban land cover classification using airborne lidar data: a review. Remote Sensing of Environment 158, pp. 295–310. Cited by: §3.1.
  • M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, et al. (2016) Apache spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. Cited by: §3.2.
  • W. Zhang, X. Cui, A. Kayi, M. Liu, U. Finkler, B. Kingsbury, G. Saon, Y. Mroueh, A. Buyuktosunoglu, P. Das, D. Kung, and M. Picheny (2020) Improving efficiency in large-scale decentralized distributed training. In ICASSP’2020. Cited by: §7.5.
  • W. Zhang, S. Gupta, X. Lian, and J. Liu (2016) Staleness-aware async-sgd for distributed deep learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pp. 2350–2356. External Links: ISBN 9781577357704 Cited by: §2.
  • L. Zhou, C. Zhang, and M. Wu (2018) D-linknet: linknet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In CVPR Workshops, pp. 182–186. Cited by: 1st item.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on. Cited by: §4.2, §4.2.