Quantitative Comparison of Open-Source Data for Fine-Grain Mapping of Land Use

11/09/2017 ∙ by Xueqing Deng, et al. ∙ University of California, Merced

This paper performs a quantitative comparison of open-source data available on the Internet for the fine-grain mapping of land use. Three point-of-interest (POI) data sources (Google Places, Bing Maps, and the Yellow Pages) and one volunteered geographic information data source, Open Street Map (OSM), are compared with each other at the parcel level for San Francisco with respect to a proposed fine-grain land-use taxonomy. The sources are also compared to coarse-grain authoritative data which we consider to be the ground truth. Results show limited agreement among the data sources as well as limited accuracy with respect to the authoritative data even at coarse class granularity. We conclude that POI and OSM data do not appear to be sufficient alone for fine-grain land-use mapping.




1. Introduction

Land use information plays an important role in urban planning and can inform city design and utility distribution (Arsanjani et al., 2013). Land use refers to the function of the land, which is shaped by human activities (Mao et al., 2016), such as education, retail, etc. It is different from land cover, such as vegetation, built-up areas, etc., which is determined by the land’s physical attributes. Remote sensing can be used to determine land cover; mapping land use, however, is much more challenging. The most accurate method for assessing land use has traditionally been through surveys. This is labor intensive and time consuming, and is soon outdated. More automated methods for mapping land use are needed.

Land use and land cover are often treated together. There are many combined land use and land cover (LULC) classification systems but they typically blur the distinction between the two super classes and tend to be relatively coarse grain. The European Urban Atlas (UA) project (https://www.eea.europa.eu/data-and-maps/data/urban-atlas#tab-gis-data) is one example. UA provides consistent LULC data for urban zones with more than one hundred thousand people across Europe. It has a well defined mapping methodology and a hierarchical taxonomy of 17 urban and 10 rural classes. To our knowledge, no LULC mapping effort at this scale and even this relatively coarse granularity exists in the United States. The evaluation performed in this paper is a step towards an automated method for fine-grain LU classification in the United States and beyond. Significantly, we undertake the key step in this paper of establishing a LU class taxonomy that is finer grained than any previous system and whose classes are distinct from LC.

A range of techniques have been developed for automated LULC classification, including using remote sensing imagery (Cheng et al., 2015), social media (Zhu and Newsam, 2015), cell phone data (Toole et al., 2012), and points of interest (Yao et al., 2017), or combinations of sources (Liu et al., 2017). Classification based on remote sensing imagery has perhaps the longest history but the resulting products tend to confuse land use and land cover and are coarse grain (Adam et al., 2014; Manandhar et al., 2009; Saadat et al., 2011). More recently, ground-level imagery has been investigated for LU classification (Zhu and Newsam, 2015; Leung and Newsam, 2012). The different and close-up perspective of this imagery has the potential to detect function, particularly indoors. However, this approach is limited by the availability of georeferenced ground-level images.

Points of interest (POI) data is a particularly promising source of data for LU mapping. It is readily available online, often through well-developed application programming interfaces (APIs), and typically consists of geographic coordinates and a specific type or category such as restaurant, bank, etc. Previous work has investigated POI data for LU mapping (Yao et al., 2017; Jiang et al., 2015) as well as other applications such as mapping population (Bakillah et al., 2014). A key challenge in evaluating LU classification is the lack of ground truth. POI data has therefore also been used as a reference set (Mao et al., 2016) although its validity as ground truth is not clear.

Another source of data for land use mapping is volunteered geographic information (VGI), a term introduced by Goodchild (Goodchild, 2007) in 2007 to refer to geographic data that is created, assembled, and disseminated voluntarily by individuals. Open Street Map (OSM) is perhaps the most well-known example of VGI. The LU information available in OSM has been compared with authoritative LU data (Arsanjani et al., 2015) but this study was limited to Germany where OSM data is more complete. OSM is much less complete outside Europe, particularly in the United States (Zielstra et al., 2013) and in China, where 94% of the country had little or no data as of 2014 (Zheng and Zheng, 2014).

A wide range of open-source data has been used for mapping LU. However, it is not clear how these sources differ. We therefore undertake the first comparison, to our knowledge, of these different sources. We do this with respect to a new, fine-grain LU class taxonomy which we introduce. We focus on POI and VGI data as these seem to be the most promising sources for LU mapping. We compare the sources to each other as well as to a coarse-grain authoritative LU map.

We summarize our contributions as follows:

  • We introduce a new, fine-grain LU class taxonomy based on the American Planning Association’s Land Based Classification Standard (Association, 2010). This taxonomy characterizes function. It is hierarchical with 9 level-one classes, 47 level-two classes, and 159 level-three classes. We refer to these as the LBCS LU classes. The LBCS hierarchy relevant to this study is shown in the first four columns of table 5 and the first two columns of table 6.

  • We compare three POI sources, Google Places, Bing Maps, and the Yellow Pages, and one VGI source, OSM, with respect to mapping the LBCS classes at the parcel footprint level for the city of San Francisco. We compare the sources to each other as well as to a coarse-grain authoritative LU map. This is the first time, to our knowledge, that such a number of sources has been compared.

2. Overview of the Study

A data source can be deficient in a number of ways for mapping LU with respect to a particular class taxonomy over a given geographic region. The source’s classes might not align with the target classes. That is, classes could be missing or not at the same taxonomic level. The location information of the data might not be accurate. And the spatial coverage could be sparse. Ground truth would allow a data source to be quantitatively assessed along these three dimensions. No ground truth exists for our LBCS classes and so we instead compare our sources to each other to provide insight into their individual deficiencies.

We first align each source with the LBCS taxonomy. This is a difficult undertaking since the sources were not created for LU classification. They also differ significantly from each other with respect to their taxonomic structure. We then use the geographic locations of the source data to assign LBCS classes to the parcel footprints. This allows us to assess the spatial and taxonomic coverage of the individual data sources. We also quantitatively compare them at the footprint scale.

We do have access to coarse-grain authoritative LU information at the parcel footprint level for the study region. We compare each of the sources with this information.

3. Datasets

This section describes the datasets used in the study. We download POI data from Google Places (Inc, 2017) using its API, from Bing Maps (Microsoft, 2017) using its API, and from the Yellow Pages website (https://www.yellowpages.com/). We download OSM points and polygons in ESRI shape format using QGIS (http://www.qgis.org/en/site/). Finally, we download the authoritative LU data including the parcel footprints from DataSF (DataSF, 2017). Thus our dataset can be divided into three categories: POI, OSM features including points and polygons, and authoritative data.

3.1. POI

We obtain 55,126 records from the Google Places API for 74 relevant place types (out of 91) for San Francisco City. Examples of relevant place types include “bank”, “museum”, and “restaurant”. We obtain 7,601 records from the Bing Maps API using 39 relevant entity types (out of 69). Examples of entity types include “shopping”, “hotel”, and “ATM”. We obtain 42,183 records from the Yellow Pages website by searching for the 74 Google Places place types. We wrote our own script to parse the Yellow Pages search results.

3.2. OSM

OSM data can be accessed either through its own API or third-party open-source tools. We used QGIS, an application which can download OSM data in XML format and convert it to ESRI shapefiles. We extract 31,784 points and 161,285 polygons in a bounding box of San Francisco City. Unlike the POI data, OSM attributes do not have a fixed set of values. Instead, contributors are free to use any description and even create new attributes. Examples of OSM attributes for San Francisco include “landuse”, “building”, “railway”, and “shop”. The values assigned to these attributes often overlap. Of the 193,069 records downloaded from OSM, only 10,439 have non-empty relevant attribute values.

3.3. Authoritative Data

We download the parcel footprints as ESRI shapefiles for San Francisco from DataSF (DataSF, 2017). There are a total of 245,003 parcel polygons. This data also includes coarse-grain LULC labels for each footprint. This flat taxonomy contains 12 classes, 10 of which are relevant to land use. These 10 classes are listed in table 1.

Name | Description
CIE | Cultural, Institutional, Educational
MED | Medical
MIPS | Office (Management, Information, Professional Services)
MIXED | Mixed Use (Without Residential)
MIXRES | Mixed Use (With Residential)
PDR | Industrial (Production, Distribution, Repair)
RETAIL/ENT | Retail, Entertainment
RESIDENT | Residential
VISITOR | Hotels, Visitor Services
Table 1. Land use classes from DataSF

4. Methodology

This section describes how the POI and OSM data is used to assign LBCS classes to the parcel footprints. Figure 1 shows a flowchart of the overall study.

Figure 1. Flowchart of the study

4.1. Mapping POI Data

We first manually align the POI data with the LBCS classes. This alignment is shown in tables 5 and 6. The first few columns of these tables show the target LBCS classes. The last three columns show the assignment of the Google Places, Bing Maps, and Yellow Pages POI data.

Once we have associated an LBCS class with each POI, we label the footprints using the workflow shown in figure 2. The POI’s LBCS class is assigned to whichever footprint the POI falls in. If the POI does not fall in any footprint, we assign its class to the nearest footprint in a 10m radius. POIs that do not fall within 10m of a footprint are ignored. Note that a footprint can thus be labeled with multiple LBCS classes by a single source. This makes sense because a parcel can have more than one land use.

Figure 2. Labeling parcel footprints with POI data
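
The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: coordinates are assumed to be already projected to a planar system in metres, footprints are simple polygons given as vertex lists, and the names `label_footprints`, `point_in_polygon`, and `dist_to_polygon` are hypothetical. A production workflow would more likely rely on a GIS library with a spatial index.

```python
import math

def point_in_polygon(pt, poly):
    """Ray-casting test. poly is a list of (x, y) vertices (assumed metres)."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does the horizontal ray from pt cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def dist_to_segment(pt, a, b):
    """Euclidean distance from pt to the segment a-b."""
    px, py = pt
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def dist_to_polygon(pt, poly):
    """Distance from an outside point to the polygon boundary."""
    n = len(poly)
    return min(dist_to_segment(pt, poly[i], poly[(i + 1) % n]) for i in range(n))

def label_footprints(pois, footprints, radius_m=10.0):
    """pois: list of ((x, y), lbcs_class); footprints: dict id -> polygon.

    Returns dict id -> set of LBCS classes, so one footprint can carry
    several classes from a single source.
    """
    labels = {fid: set() for fid in footprints}
    for pt, lbcs in pois:
        # Step 1: the footprint the POI falls in, if any.
        hit = next((fid for fid, poly in footprints.items()
                    if point_in_polygon(pt, poly)), None)
        # Step 2: otherwise the nearest footprint, if within the radius.
        if hit is None and footprints:
            fid, d = min(((f, dist_to_polygon(pt, p))
                          for f, p in footprints.items()), key=lambda t: t[1])
            if d <= radius_m:
                hit = fid
        # Step 3: POIs farther than the radius from every footprint are ignored.
        if hit is not None:
            labels[hit].add(lbcs)
    return labels
```

Returning a set of classes per footprint mirrors the observation that a parcel can have more than one land use.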

4.2. Mapping OSM Data

We again first manually align the OSM data with the LBCS classes. The challenge here is that OSM data does not have a fixed set of attributes (keys) and values. We therefore first identify a set of commonly used keys relevant to our application. This includes keys such as “amenity”, “building”, and “landuse”. We then identify a set of relevant values for these keys and associate them with the LBCS classes. Table 2 shows the alignment between OSM keys and key values and LBCS classes.

OSM data consists of points and polygons. Points are used to label footprints the same way as the POI data above. Polygons are used to label footprints using shape intersection. Labels are assigned if there is a non-zero intersection between the OSM polygon and a footprint.
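
The non-zero intersection test for polygons can be illustrated with the sketch below. It is not the study's implementation: it assumes planar coordinates and, as a simplification, that each footprint is a convex polygon with counter-clockwise vertices, so that Sutherland-Hodgman clipping applies. Real parcels can be non-convex, in which case a general polygon-overlay routine from a GIS library would be needed. The function names are hypothetical.

```python
def shoelace_area(poly):
    """Unsigned area of a polygon given as a list of (x, y) vertices."""
    n = len(poly)
    s = 0.0
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def clip_convex(subject, clip):
    """Sutherland-Hodgman: clip `subject` by a convex, counter-clockwise `clip`."""
    def inside(p, a, b):
        # True if p lies on or to the left of the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Point where segment p-q meets the infinite line through a and b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        denom = dx1 * dy2 - dy1 * dx2
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    n = len(clip)
    for i in range(n):
        a, b = clip[i], clip[(i + 1) % n]
        inp, output = output, []
        if not inp:
            break
        prev = inp[-1]
        for cur in inp:
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    output.append(intersect(prev, cur, a, b))
                output.append(cur)
            elif inside(prev, a, b):
                output.append(intersect(prev, cur, a, b))
            prev = cur
    return output

def label_by_intersection(osm_polygons, footprints):
    """osm_polygons: list of (polygon, lbcs_class); footprints: dict id -> polygon.

    A footprint receives the class of every OSM polygon whose intersection
    with it has non-zero area. Polygons that only share an edge or a corner
    produce a degenerate clip of zero area and are not labeled.
    """
    labels = {fid: set() for fid in footprints}
    for poly, lbcs in osm_polygons:
        for fid, fp in footprints.items():
            if shoelace_area(clip_convex(poly, fp)) > 0.0:
                labels[fid].add(lbcs)
    return labels
```

With real coordinates, a small positive area threshold instead of `> 0.0` may be preferable so that numerical slivers along shared boundaries do not create labels; that choice is an assumption of this sketch and is not discussed in the text.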

Figure 3. Labeling parcel footprints with OSM data

OSM Key | OSM Key Value | LBCS
amenity (point) | bicycle repair station, car wash, clothes stores, corner market, fuel, grocery, market place | 2100
amenity (point) | bank&atm | 2200
amenity (point) | car rental | 2300
amenity (point) | animal shelter, embassy, laundry, pet grooming, post office, veterinary, conference center | 2400
amenity (point) | bar, café, fast food, night club, restaurant | 2500
amenity (point) | bicycle parking, bus station, parking | 4100
amenity (point) | library | 4200
amenity (point) | arts center, cinema, music venue | 5100
amenity (point) | gym | 5300
amenity (point) | college, kindergarten, music school, university | 6100
amenity (point) | fire station, police | 6400
amenity (point) | clinic, community center, hospital, dentist, doctors, doctors office, nursing home | 6500
amenity (point) | place of worship | 6600
building (point) | apartment, house, residential | 1100
building (point) | commercial, retail | 2000
building (point) | school | 6100
landuse (point) | residential | 1100
landuse (point) | commercial, retail | 2000
landuse (point) | recreation | 5000
amenity (polygon) | commercial | 2000
amenity (polygon) | car wash, fuel, market place, pharmacy | 2100
amenity (polygon) | bank&atm | 2200
amenity (polygon) | car rental, boat rental | 2300
amenity (polygon) | animal shelter, embassy, post office, veterinary, conference center | 2400
amenity (polygon) | bar, café, fast food, night club, restaurant | 2500
amenity (polygon) | bicycle parking, bus station, parking | 4100
amenity (polygon) | library, studio | 4200
amenity (polygon) | arts center, cinema, theater | 5100
amenity (polygon) | college, kindergarten, school, preschool, university | 6100
amenity (polygon) | fire station, police | 6400
amenity (polygon) | clinic, community center, hospital, dentist, doctors, doctors office, nursing home | 6500
amenity (polygon) | place of worship | 6600
building (polygon) | apartment, house, residential | 1100
building (polygon) | hotel | 1300
building (polygon) | commercial, retail | 2000
building (polygon) | train station | 4100
building (polygon) | library, museum | 4200
building (polygon) | school, college, kindergarten, university | 6100
building (polygon) | hospital | 6500
building (polygon) | church | 6600
landuse (polygon) | residential | 1100
landuse (polygon) | commercial, retail | 2000
landuse (polygon) | recreation | 5000
Table 2. Alignment of OSM keys/values to LBCS classes
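
One way to apply an alignment like Table 2 in code is a lookup keyed by OSM key and geometry type. The sketch below is illustrative only: it copies a small subset of the rows in Table 2, and the names `OSM_TO_LBCS` and `lbcs_for` are hypothetical, as is the representation of an OSM record as a plain dictionary of tags.

```python
# Subset of the Table 2 alignment: (OSM key, geometry type) -> {value: LBCS class}.
OSM_TO_LBCS = {
    ("amenity", "point"): {"library": 4200, "gym": 5300, "place of worship": 6600},
    ("building", "polygon"): {"hotel": 1300, "hospital": 6500, "church": 6600},
    ("landuse", "polygon"): {
        "residential": 1100,
        "commercial": 2000,
        "retail": 2000,
        "recreation": 5000,
    },
}

def lbcs_for(tags, geometry):
    """Return the set of LBCS classes implied by a record's tags.

    tags: dict of OSM key -> value (assumed record layout).
    geometry: "point" or "polygon".
    Records with no relevant key/value pair yield an empty set.
    """
    classes = set()
    for key, value in tags.items():
        table = OSM_TO_LBCS.get((key, geometry), {})
        if value in table:
            classes.add(table[value])
    return classes
```

Returning a set reflects that contributors can attach several keys to the same feature, so one record may map to more than one LBCS class.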

5. Quantitative evaluation

This section presents our quantitative evaluation. This includes comparing the different sources with each other as well as with the authoritative data.

5.1. Spatially Valid Data

Figure 4 shows the number of records that remain after spatially mapping the data to the footprints. A record is valid if it falls within 10m of a parcel. The reduction in valid data is likely due to several factors. First, there are simple errors in the location information. Second, some of the records correspond to features which do not fall within footprints, such as taxi stands and bus stops. Significantly, none of the data sources contains more than 50,000 valid records. This means that no more than roughly 20% of our target set of 245,003 footprints can be labeled with any one source. Our first finding is thus that both the POI and OSM datasets are sparse at the footprint scale.

Columns 2 through 5 of table 4 show the breakdown of valid data for each of the sources with respect to the LBCS hierarchy.

Figure 4. Records before and after mapping to footprints

5.2. Pairwise Comparisons of Data Sources

We perform pairwise comparisons between the data sources to determine their level of agreement. In the absence of ground truth, a high level of agreement between multiple data sources can indicate that each is accurate.

Columns 2 through 5 of table 4 show the number of parcels labeled by each of the datasets for each class. Columns 6 through 11 show the agreement between pairs of data sources and column 12 shows the agreement between all of the sources. These columns indicate the number of parcels labeled consistently by the combinations of sources (their agreement). The number in parentheses is this number divided by the total number of parcels labeled by either or both datasets (intersection over union), reported as a percentage. For example, according to the first row, Google Places labeled 586 parcels and Bing Maps labeled 167 parcels with class 1000. 113 of these are in agreement. This represents 17.66% of the parcels labeled as class 1000 by either Google or Bing or by both.
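
The intersection-over-union figure in the example can be reproduced from the three counts alone: the union is 586 + 167 - 113 = 640 parcels, and 113 / 640 = 17.66%. The short sketch below shows the calculation; the function name `agreement_iou` is hypothetical.

```python
def agreement_iou(n_a, n_b, n_both):
    """Agreement between two sources for one class, as a percentage.

    n_a, n_b: parcels labeled with the class by source A and by source B.
    n_both:   parcels labeled with the class by both sources.
    """
    union = n_a + n_b - n_both  # parcels labeled by either or both sources
    return 100.0 * n_both / union if union else 0.0

# Counts from the class 1000 example: Google Places 586, Bing Maps 167, both 113.
print(round(agreement_iou(586, 167, 113), 2))  # 17.66
```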

We make the following observations based on the results in table 4. Google Places and the Yellow Pages tend to have the highest agreement. This is as high as 49.33% agreement at level-one of the class hierarchy for class 2000 General Sales or Services. They also have agreements above 20% for many level-two and level-three classes.

There is little agreement between Bing Maps and the other data sources. This is mostly due to the small size of the Bing Maps dataset.

Class | DataSF | Bing Results | Bing Precision | Bing Recall | Google Results | Google Precision | Google Recall | YP Results | YP Precision | YP Recall | OSM Results | OSM Precision | OSM Recall
CIE | 3701 | 38/68 | 0.56 | 0.01 | 606/1413 | 0.43 | 0.16 | 433/978 | 0.44 | 0.12 | 808/1949 | 0.41 | 0.22
RETAIL/ENT | 4513 | 181/677 | 0.27 | 0.04 | 1295/7274 | 0.17 | 0.28 | 914/4311 | 0.21 | 0.20 | 529/2104 | 0.25 | 0.12
VISITOR | 508 | 58/167 | 0.35 | 0.11 | 150/586 | 0.26 | 0.30 | 131/397 | 0.33 | 0.26 | 47/53 | 0.89 | 0.09
MED | 359 | 1/12 | 0.08 | 0.00 | 151/2055 | 0.07 | 0.42 | 104/1004 | 0.10 | 0.29 | 66/901 | 0.07 | 0.18
MIPS | 2138 | 0/3 | 0.00 | 0.00 | 73/339 | 0.22 | 0.03 | 132/499 | 0.26 | 0.06 | 2/7 | 0.29 | 0.00
RESIDENT | 179028 | - | - | - | - | - | - | - | - | - | 28309/34818 | 0.81 | 0.16
Table 3. Quantitative comparison with authoritative data

The agreement between OSM and Google Places or the Yellow Pages is mixed. This agreement can be high for some classes. However, there is a clear mismatch between OSM and the other sources in terms of the class taxonomies. For example, OSM labels a large number of parcels with class 1000 Residence or Accommodation but these are nearly all in subclass 1100 Private Household. In contrast, all of the parcels labeled by Google Places or the Yellow Pages with class 1000 are in subclass 1300 Hotels, Motels, or Other Accommodation Services. This difference reflects the fact that Google Places and the Yellow Pages provide POIs while OSM contains information about residential areas.

5.3. Comparison with Authoritative Data

We here compare the data sources with the coarse-grain authoritative data. This requires aligning our LBCS classes with the 10 classes of the authoritative data shown in table 1. To do this, we assign classes 2100, 5200, and 5300 to RETAIL/ENT; classes 6100 and 6600 to CIE; classes 6200 and 6300 to MIPS; class 1100 to RESIDENT; class 1300 to VISITOR; and class 6500 to MED. Some of the LBCS classes are not assigned due to the mismatch between the two taxonomies. We do not use authoritative classes MIXED and MIXRES due to how broad and ambiguous they are. We also do not use class PDR since our data sources do not cover this class.
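
The class alignment just described amounts to a small lookup table. The sketch below encodes it directly from the assignments listed above; the names `LBCS_TO_DATASF` and `to_authoritative` are hypothetical, and representing LBCS classes as integer codes is an assumption.

```python
# LBCS class -> authoritative (DataSF) class, as listed in the text.
LBCS_TO_DATASF = {
    2100: "RETAIL/ENT",
    5200: "RETAIL/ENT",
    5300: "RETAIL/ENT",
    6100: "CIE",
    6600: "CIE",
    6200: "MIPS",
    6300: "MIPS",
    1100: "RESIDENT",
    1300: "VISITOR",
    6500: "MED",
}

def to_authoritative(lbcs_classes):
    """Map a footprint's LBCS classes to authoritative classes.

    LBCS classes with no counterpart are dropped, which mirrors the
    unassigned classes mentioned in the text.
    """
    return {LBCS_TO_DATASF[c] for c in lbcs_classes if c in LBCS_TO_DATASF}
```

For example, a footprint labeled with LBCS classes 2100, 2500, and 6600 would map to RETAIL/ENT and CIE, with 2500 left unassigned.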

Table 3 shows the comparison between each of the data sources and the authoritative data. These results are calculated differently from those in table 4 since we treat the authoritative data as the ground truth. The second column shows the counts of parcels labeled with the authoritative data classes. For each data source, we report the number of parcels labeled correctly by that source as well as the precision and recall. For example, 38 of 68 parcels labeled by Bing Maps as CIE are correct according to the authoritative data. This represents a precision of 0.56 and a recall of 0.01.
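
The precision and recall in this example follow from three counts: the parcels a source labeled correctly, the parcels it labeled in total, and the parcels carrying that class in the authoritative data. The sketch below reproduces the Bing Maps CIE figures; the function name `precision_recall` is hypothetical.

```python
def precision_recall(n_correct, n_labeled, n_truth):
    """Precision and recall for one source and one authoritative class.

    n_correct: parcels the source labeled with the class that match the
               authoritative data.
    n_labeled: all parcels the source labeled with the class.
    n_truth:   all parcels with the class in the authoritative data.
    """
    precision = n_correct / n_labeled if n_labeled else 0.0
    recall = n_correct / n_truth if n_truth else 0.0
    return precision, recall

# Bing Maps, class CIE: 38 correct out of 68 labeled; 3701 CIE parcels in DataSF.
p, r = precision_recall(38, 68, 3701)
print(round(p, 2), round(r, 2))  # 0.56 0.01
```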

Google Places and the Yellow Pages are seen to be the best datasets in terms of precision and recall. However, even at this coarse granularity, neither of them achieves precision or recall rates above 0.5 for any class. Bing has very low recall due to its small size. OSM has higher precision than recall and is able to achieve precision above 0.8 for classes VISITOR and RESIDENT. This again emphasizes its difference from the POI data.

6. Conclusion and Future Work

We compared four open-source data sources for fine-grain land-use mapping at the parcel level for San Francisco. We observed limited agreement among the data sources as well as limited accuracy with respect to coarse-grain authoritative data. These results suggest that the four sources considered are, at least on their own, not sufficient for mapping land use over a large geographic region, particularly with respect to the proposed fine-grain land-use taxonomy.

This motivates future work on investigating and integrating additional data sources especially ones with dense spatial coverage.

7. Acknowledgments

This work was funded in part by a National Science Foundation CAREER grant, #IIS-1150115.


References

  • Adam et al. (2014) Elhadi Adam, Onisimo Mutanga, John Odindi, and Elfatih M. Abdel-Rahman. 2014. Land-use/cover classification in a heterogeneous coastal landscape using RapidEye imagery: evaluating the performance of random forest and support vector machines classifiers. International Journal of Remote Sensing 35, 10 (2014), 3440–3458.
  • Arsanjani et al. (2013) Jamal Jokar Arsanjani, Marco Helbich, Mohamed Bakillah, Julian Hagenauer, and Alexander Zipf. 2013. Toward mapping land-use patterns from volunteered geographic information. International Journal of Geographical Information Science 27, 12 (2013), 2264–2278.
  • Arsanjani et al. (2015) Jamal Jokar Arsanjani, Peter Mooney, Alexander Zipf, and Anne Schauss. 2015. Quality assessment of the contributed land use information from OpenStreetMap versus authoritative datasets. In OpenStreetMap in GIScience. Springer, 37–58.
  • Association (2010) American Planning Association. 2010. Land Based Classification Standards. (2010). https://www.planning.org/lbcs/
  • Bakillah et al. (2014) Mohamed Bakillah, Steve Liang, Amin Mobasheri, Jamal Jokar Arsanjani, and Alexander Zipf. 2014. Fine-resolution population mapping using OpenStreetMap points-of-interest. International Journal of Geographical Information Science 28, 9 (2014), 1940–1963.
  • Cheng et al. (2015) G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren. 2015. Effective and Efficient Midlevel Visual Elements-Oriented Land-Use Classification Using VHR Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 53, 8 (2015), 4238–4249.
  • DataSF (2017) DataSF. 2017. Open data: land use. (2017). https://data.sfgov.org/Housing-and-Buildings/Land-Use/us3s-fp9q/data
  • Goodchild (2007) Michael F Goodchild. 2007. Citizens as sensors: the world of volunteered geography. GeoJournal 69, 4 (2007), 211–221.
  • Inc (2017) Google Inc. 2017. Google Places API. (2017). https://developers.google.com/places/web-service/search
  • Jiang et al. (2015) Shan Jiang, Ana Alves, Filipe Rodrigues, Joseph Ferreira, and Francisco C Pereira. 2015. Mining point-of-interest data from social networks for urban land use classification and disaggregation. Computers, Environment and Urban Systems 53 (2015), 36–46.
  • Leung and Newsam (2012) Daniel Leung and Shawn Newsam. 2012. Exploring Geotagged Images for Land-use Classification. In Proceedings of the ACM Multimedia 2012 Workshop on Geotagging and Its Applications in Multimedia (GeoMM ’12). 3–8.
  • Liu et al. (2017) Xiaoping Liu, Jialv He, Yao Yao, Jinbao Zhang, Haolin Liang, Huan Wang, and Ye Hong. 2017. Classifying urban land use by integrating remote sensing and social media data. International Journal of Geographical Information Science 31, 8 (2017), 1675–1696.
  • Manandhar et al. (2009) Ramita Manandhar, Inakwu O. A. Odeh, and Tiho Ancev. 2009. Improving the Accuracy of Land Use and Land Cover Classification of Landsat Data Using Post-Classification Enhancement. Remote Sensing 1, 3 (2009), 330–344.
  • Mao et al. (2016) Huina Mao, Gautam Thakur, and Budhendra Bhaduri. 2016. Exploiting Mobile Phone Data for Multi-category Land Use Classification in Africa. In Proceedings of the 2nd ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics (UrbanGIS ’16). Article 9, 6 pages.
  • Microsoft (2017) Microsoft. 2017. Bing Maps API. (2017). https://msdn.microsoft.com/en-us/library/gg585126.aspx
  • Saadat et al. (2011) Hossein Saadat, Jan Adamowski, Robert Bonnell, Forood Sharifi, Mohammad Namdar, and Sasan Ale-Ebrahim. 2011. Land use and land cover classification over a large area in Iran based on single date analysis of satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 66, 5 (2011), 608–619.
  • Toole et al. (2012) Jameson L. Toole, Michael Ulm, Marta C. González, and Dietmar Bauer. 2012. Inferring Land Use from Mobile Phone Activity. In Proceedings of the ACM SIGKDD International Workshop on Urban Computing (UrbComp ’12). 1–8.
  • Yao et al. (2017) Yao Yao, Xia Li, Xiaoping Liu, Penghua Liu, Zhaotang Liang, Jinbao Zhang, and Ke Mai. 2017. Sensing spatial distribution of urban land use by integrating points-of-interest and Google Word2Vec model. International Journal of Geographical Information Science 31, 4 (2017), 825–848.
  • Zheng and Zheng (2014) Shudan Zheng and Jianghua Zheng. 2014. Assessing the Completeness and Positional Accuracy of OpenStreetMap in China. In Thematic Cartography for the Society. Springer International Publishing, Cham, 171–189.
  • Zhu and Newsam (2015) Yi Zhu and Shawn Newsam. 2015. Land Use Classification Using Convolutional Neural Networks Applied to Ground-level Images. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL ’15). Article 61, 4 pages.
  • Zielstra et al. (2013) Dennis Zielstra, Hartwig H. Hochmair, and Pascal Neis. 2013. Assessing the Effect of Data Imports on the Completeness of OpenStreetMap - A United States Case Study. Transactions in GIS 17, 3 (2013), 315–334.