The rapid growth of geospatial data poses a great challenge to data discovery, access, and maintenance (Jiang et al., 2018). To increase data reusability and facilitate geospatial knowledge discovery, many geoportals have been established to provide integrated access to geospatial resources (Hu et al., 2015a). Examples include the DataOne Data Catalog (https://search.dataone.org/data), the U.S. Geological Survey Science Data Catalog (https://data.usgs.gov/datacatalog), NASA Earthdata Search (https://search.earthdata.nasa.gov/search), and ArcGIS Online.
The most important component of a geoportal is its search functionality, which is usually supported by geographic information retrieval (GIR) techniques. Generally speaking, information retrieval (IR) aims at finding relevant entries based on a user’s query. The entries can be documents, websites, services, maps, and so on, depending on the application scenarios. As a subfield of IR, geographic information retrieval (Jones and Purves, 2008) adds space (and time) as additional dimensions to the traditional information retrieval problems (Janowicz et al., 2011). In addition to traditional thematic similarity, spatial (and temporal) similarity is considered when the relevance score between a user’s query and an entry is calculated.
Despite the success of GIR in academia, in practice the core search functionality of most existing geoportals is still based on Apache Lucene or Elasticsearch (Jiang et al., 2018). These Lucene-based engines use a term frequency-inverse document frequency (TF-IDF) approach to compute the similarity between a user’s query and document entries, which is insufficient to fully capture a user’s search intention. For example, when a user searches for natural disaster in California, (s)he is probably interested in a document that describes the Kincade Fire, which burned in Sonoma County on Oct. 23rd, 2019, since wildfires are a type of natural disaster and Sonoma County is a subdivision of California. However, if this document contains neither the term “natural disaster” nor “California”, a Lucene-based model will assign a zero relevance score between this document and the query, resulting in low recall. This highlights the necessity of understanding the user’s search intention both semantically and spatially in a (G)IR system.
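The term-mismatch problem described above can be illustrated with a minimal sketch. The scoring function below is a simplified TF-IDF sum written for this illustration; it is not Lucene’s practical scoring function, and the three tiny documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, corpus):
    """Sum of sqrt(tf) * idf over query terms (simplified, not Lucene's exact formula)."""
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        tf = counts[term]                              # term frequency in this document
        df = sum(1 for d in corpus if term in d)       # number of documents with the term
        if tf == 0 or df == 0:
            continue                                   # no lexical match, no contribution
        idf = 1.0 + math.log(n_docs / df)
        score += math.sqrt(tf) * idf
    return score

# Hypothetical mini-corpus; the first document is relevant but shares no query term.
corpus = [
    "kincade fire burned in sonoma county".split(),
    "natural disaster preparedness in california".split(),
    "california population density map".split(),
]
query = ["natural", "disaster", "california"]
scores = [tf_idf_score(query, doc, corpus) for doc in corpus]
```

Under these assumptions the Kincade Fire document receives a score of exactly zero, while a less relevant document that merely contains “california” scores above it.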
According to Dominich (2008), IR can be formally defined as:
$$IR = m[\mathfrak{R}(O, (Q, \langle I, \mapsto \rangle))] \quad (1)$$
where $m$ is the degree of relevance; $\mathfrak{R}$ is the relevance relationship; $O$ is a set of (document) entries; $Q$ is the user’s query; and $I$ and $\mapsto$ are implicit and inferred information, respectively. The most challenging part of this equation is how to obtain the implicit and inferred information $\langle I, \mapsto \rangle$ from user queries. Query expansion techniques, which add terms and conditions to a user query with the goal of improving the query-object relevance score (Vechtomova, 2009), can be used to take the user’s search intention into account semantically.
Traditional query expansion focuses on semantically enriching a user’s query from a thematic perspective. In the context of geoportals (e.g., ArcGIS Online), we argue that a user’s query should be expanded (or semantically enriched) from two perspectives: thematic and geospatial. In the thematic aspect, a query can be expanded by adding thematically similar concepts or terms. For example, for the query natural disaster in California, highly related topics of “natural disaster” such as earthquake, wildfire, flood, and hurricane can be added to the original query. In a geoportal, extra attention should be paid to the geospatial aspect, where geospatially related terms can be added to the query. For the same query, we can consider adding the names of the subdivisions of California. Since this process relies on the place hierarchy, we call it platial query expansion. Moreover, the spatial scopes of the query and the entries can be used to compute the spatial similarity between them. After being expanded from these two perspectives, the new query is applied to the geoportal in the hope of improving the recall of the GIR system.
Note that the core idea of query expansion is to minimize the mismatch between a user query and candidate entries so that the recall of the IR system is improved. A similar idea can be applied when we calculate spatial similarities between a user’s query and entries. Most traditional spatial similarity measures are based on topological relations between the spatial scopes of the user’s query and an entry. For example, Jiang et al. (2018) defined the spatial similarity between a query and a document entry based on their geographic scopes as well as the intersection of those scopes (see Equation 2).
According to Equation 2, if the geographic footprints of the query and the entry do not intersect, the spatial similarity score is zero. This may lead to a loss of valuable spatial proximity information in many scenarios. To give a concrete example, if a user searches for Weather in Los Angeles, a map about Temperature in Oxnard should be considered more relevant than, say, a map about Temperature in Southern Africa. However, since neither the geographic scope of Oxnard nor that of Southern Africa intersects the footprint of Los Angeles, both maps receive a spatial similarity of zero according to Equation 2, which does not match our intuition.
In other words, it might be better to use a distance decay function here and thereby minimize the mismatch between the query and nearby entries such as the Oxnard map. Inspired by this observation, we use a Gaussian kernel distance decay function to compute the spatial similarity between the geographic footprints of the query and the documents. Using a distance decay function to optimize the query-document relevance is also related to work on query relaxation in the context of geographic question answering (Mai et al., 2019).
The research contributions of this work are as follows:
- We propose a semantic query expansion framework for geoportals which enriches a user’s query from both thematic and geospatial aspects.
- We develop a semantically-enriched search engine prototype for ArcGIS Online by implementing the proposed query expansion framework.
- We collect a benchmark dataset to evaluate the presented framework against a widely used baseline model, Lucene’s practical scoring function. The evaluation results show that our semantic query expansion framework outperforms the baseline by a significant margin.
The remainder of this work is structured as follows. Section 2 discusses related work on geographic information retrieval. Section 3 presents our query expansion framework and describes each of its components; in particular, Section 3.1 discusses the reproducibility of our work and provides guidelines on the data sets and software to facilitate future research along this line. Section 4 introduces the benchmark dataset we collected to evaluate our GIR framework and discusses the evaluation results. Finally, Section 5 concludes our work and discusses future research directions.
2 Related Work
The idea of query expansion is to reformulate a user’s query by adding semantically related concepts (Azad and Deepak, 2019) to minimize the query-object mismatch and increase the recall of an IR system. This typically comes at the expense of reduced precision. Generally speaking, query expansion techniques can be classified into two categories: global analysis and local analysis (Azad and Deepak, 2019). In global analysis, the expansion terms are selected based on manually built knowledge bases, knowledge graphs, or large corpora. Finding semantically related terms based on word embedding (Mikolov et al., 2013; Mai et al., 2018) or topic modeling (Hu et al., 2015b) is an example. Local analysis refers to query expansion methods that select expansion terms based on the documents retrieved for the user’s initial query. Example models include relevance feedback (Rocchio, 1971) and pseudo-relevance feedback (Buckley et al., 1995). In this work, we adopt the global analysis method and use word embedding to select terms that are semantically related to the query terms.
Many query expansion techniques are not directly applicable to geospatial terms. For example, it is more reasonable to select geospatially related terms based on place hierarchies (e.g., from a digital gazetteer) than by using word embedding models. This suggests a need to handle the geospatial aspect separately in a query expansion task. For instance, Huang et al. (2008) classified queries into two types, location sensitive and location non-sensitive, and then handled them with different query expansion techniques.
In the field of geographic information retrieval, a few works aim at ranking documents based on both textual and spatial relevance, such as the multi-dimensional scattered ranking method proposed by Van Kreveld et al. (2005). Our work follows a similar research direction but also adds platial similarity to the ranking algorithm.
In addition to query expansion, another line of work for building a semantically-enriched search engine for geoportals is to enrich the metadata. For example, Hu et al. (2015a) converted the metadata of ArcGIS Online items into Linked Data and then enriched the metadata to enable semantic search. Similar to our idea, Hu et al. (2015a) also considered semantic enrichment in two aspects: thematic and geospatial. However, converting data into another format for semantic enrichment requires additional processing steps, storage, and maintenance to keep both data sources in sync. In this work, we focus on enabling semantic search by using query expansion techniques in which the underlying data storage (e.g., Elasticsearch, Apache Lucene) remains unchanged.
In this section, we first describe the dataset and project setup in Section 3.1. Next, we describe our semantic query expansion framework in detail. The proposed framework is composed of two major components, a geospatial component and a thematic component, which focus on different aspects. Figure 1 shows the overall architecture of the proposed framework. We present each component below using the example query Chicago traffic.
3.1 Data and Software Availability
Developed by the Environmental Systems Research Institute (Esri), ArcGIS Online is one of the best-known web geoportals. It contains a collection of web maps, data layers, tools, services, and applications contributed by GIS users all over the world (Hu et al., 2015a). Elasticsearch (https://www.elastic.co/), a widely used search and analytics engine, is used to store the metadata of these ArcGIS Online items and to support the portal’s search functionality. The metadata of each ArcGIS Online item has different fields such as “id”, “title”, “snippet”, “description”, “type”, “location” (point), “coordinates” (the bounding box), and so on. The core search functionality of ArcGIS Online is based on Lucene’s query-document similarity function, which is computed from term frequency and inverse document frequency (TF-IDF) scoring such as Lucene’s practical scoring function (https://www.elastic.co/guide/en/elasticsearch/guide/2.x/practical-scoring-function.html), Okapi BM25, and so on. Therefore, Lucene’s practical scoring function is a natural baseline for our semantic query expansion framework.
In order to establish an evaluation dataset for our search engine prototype, we collect 53,404 items using the ArcGIS Online RESTful API, comprising 1) all items published by Esri or its related organizations before September 2017 and 2) all items published on ArcGIS Online between June and September in 2014 and 2017.
We use Elasticsearch to host all the retrieved ArcGIS Online items. The proposed semantic query expansion framework serves as a middle layer, as shown in Figure 1, to semantically enrich the current user query. The expanded query is sent to the established Elasticsearch index to retrieve relevant ArcGIS Online items. The motivation here is to enable semantic search functionality on top of a portal such as ArcGIS Online without changing the underlying layers, e.g., data storage. In order to evaluate the proposed semantic query expansion framework and compare it with the baseline, namely Lucene’s practical scoring function, we also conduct a human participant test to obtain query-document relevance scores through the Amazon Mechanical Turk sandbox (https://www.mturk.com/). A detailed description of this benchmark dataset can be found in Section 4.2. The data and source code are available at https://github.com/gengchenmai/arcgis-online-search-engine, including 1) the evaluation benchmark dataset and 2) the source code of our query expansion framework. The established database is hosted by Elasticsearch 5.4.0 (https://www.elastic.co/blog/elasticsearch-5-4-0-released) with a vector scoring plugin (https://github.com/MLnick/elasticsearch-vector-scoring) to enable word embedding computation.
3.2 Query Preprocessing: Place Name Recognition
Given a query such as Chicago traffic, we first need to split it into a geospatial aspect and a thematic aspect. A place name recognition service (e.g., DBpedia Spotlight, https://www.dbpedia-spotlight.org/) is used to recognize the toponyms appearing in the query (in this case the city of Chicago) and then link them to the corresponding entities (dbo:Chicago) in a knowledge graph such as Wikidata or DBpedia. The identified places are then handled by the geospatial query expansion component, and the rest of the query is sent to the thematic query expansion component.
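The splitting step can be sketched as follows. This is only an illustration: the small dictionary `TOY_GAZETTEER` and the function `split_query` are hypothetical stand-ins for a real entity linking service such as DBpedia Spotlight, and all entity identifiers other than dbo:Chicago are made up for the example.

```python
# Hypothetical mini-gazetteer standing in for a real place name recognition service.
TOY_GAZETTEER = {
    "chicago": "dbo:Chicago",
    "los angeles": "dbo:Los_Angeles",   # illustrative identifier
    "california": "dbo:California",     # illustrative identifier
}

def split_query(query, gazetteer=TOY_GAZETTEER):
    """Return (linked places, remaining thematic text) using longest-match lookup."""
    tokens = query.lower().split()
    places, thematic = [], []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest token span starting at position i first.
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in gazetteer:
                match = (candidate, gazetteer[candidate], j)
                break
        if match:
            places.append((match[0], match[1]))
            i = match[2]
        else:
            thematic.append(tokens[i])
            i += 1
    return places, " ".join(thematic)

places, thematic = split_query("Chicago traffic")
```

Here `places` would go to the geospatial component and `thematic` to the thematic component; a production system would rely on the linking service rather than exact string lookup.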
3.3 Geospatial Query Expansion Component
The geospatial query expansion component focuses on improving the platial and spatial similarity between a user’s query and a candidate ArcGIS Online item.
To facilitate the subsequent query expansion process, we first enrich the identified geographic entities with additional information such as geographic coordinates, place names, total area, and their GeoNames identifier (see Listing 1). We call this the GeoEnrichment step (see Figure 1).
The platial component focuses on finding similar geographic terms based on the place hierarchy. We use the GeoNames (https://www.geonames.org/) service to get the top subdivisions of the identified places. For example, we can add Belmont Cragin and Englewood as expanded geographic terms to the expanded version of the Chicago traffic query. The platial similarity between a query and an ArcGIS Online item is defined in Equation 3 in terms of the following quantities.
Each identified place in the query has a relative importance among all the identified places. Each place has a set of expanded geographic terms, and each expanded term has an importance with respect to its corresponding place. Each metadata field has a matching weight, since matching some fields such as “title” is much more important than matching other fields such as “description”. The last quantity is the number of matches of an expanded geographic term in the current field.
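One plausible reading of these definitions is a nested weighted sum over places, expanded terms, and metadata fields. The sketch below follows that reading as an assumption; the function name `platial_similarity`, the data layout, and all weights are hypothetical, and matches are counted with plain substring counting for brevity.

```python
def platial_similarity(places, item, field_weights):
    """Nested weighted sum of term matches (assumed form, for illustration only)."""
    score = 0.0
    for place in places:
        for term, term_weight in place["expanded_terms"].items():
            for field, field_weight in field_weights.items():
                # Crude match count: case-insensitive substring occurrences.
                matches = item.get(field, "").lower().count(term.lower())
                score += place["weight"] * term_weight * field_weight * matches
    return score

# Hypothetical weights for the Chicago traffic example.
places = [{
    "name": "Chicago",
    "weight": 1.0,
    "expanded_terms": {"Chicago": 0.6, "Englewood": 0.2, "Belmont Cragin": 0.2},
}]
field_weights = {"title": 0.6, "snippet": 0.3, "description": 0.1}
item = {
    "title": "Englewood traffic counts",
    "snippet": "Traffic volumes in Englewood, Chicago",
    "description": "",
}
score = platial_similarity(places, item, field_weights)
```

In this toy case the item matches “Englewood” in both title and snippet and “Chicago” in the snippet, so it scores above zero even though its title never mentions Chicago.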
The spatial component measures the spatial similarity between a query and an item. Frontiera et al. (2008) discussed different geometric approaches to assessing spatial similarity, most of which are computed based on the topological relationships between the geographic scopes of the query and the item. One example is the Jaccard similarity index (Jaccard, 1912). Some spatial similarity indices that are not based on topological relations also exist, such as the Hausdorff distance.
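To make the limitation of overlap-based measures concrete, the following sketch computes a Jaccard index on axis-aligned bounding boxes. It is an illustration only: the boxes are abstract planar rectangles given as (min_x, min_y, max_x, max_y), not real geographic footprints, and the helper names are hypothetical.

```python
def box_area(box):
    """Area of an axis-aligned box; degenerate or inverted boxes have area 0."""
    min_x, min_y, max_x, max_y = box
    return max(0.0, max_x - min_x) * max(0.0, max_y - min_y)

def box_jaccard(a, b):
    """Intersection area divided by union area of two boxes."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter_area = box_area(inter)
    union_area = box_area(a) + box_area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0

query_box = (0.0, 0.0, 2.0, 2.0)
overlapping_box = (1.0, 1.0, 3.0, 3.0)
nearby_box = (2.5, 0.0, 3.5, 1.0)        # close to the query box but disjoint
distant_box = (500.0, 500.0, 501.0, 501.0)
```

Both `nearby_box` and `distant_box` obtain a similarity of zero with respect to `query_box`, so the measure cannot tell a close neighbor from a far-away region.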
In this work, we use a distance decay approach with Gaussian kernels. Each identified place has a Gaussian kernel placed at the center of its bounding box, and the bandwidth of the kernel is determined by that bounding box. The intuition comes from Tobler’s First Law of Geography: the relatedness between a query and an item decreases as the distance between them increases. The ArcGIS Geocoding API is used to obtain the bounding boxes of the identified places. The spatial similarity is defined in Equation 4 in terms of the Gaussian score between each identified place and the item. We leave the impact of different spatial similarity measures on the performance of this semantic query expansion framework for future work.
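A minimal sketch of such a Gaussian distance decay score is given below. Several details are assumptions made for illustration: the bandwidth is taken as a fixed fraction (`bandwidth_factor`, hypothetical) of the bounding-box diagonal, distances are great-circle distances, and the coordinates are rough values chosen by hand rather than geocoder output.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def gaussian_score(place_bbox, item_point, bandwidth_factor=0.5):
    """Gaussian decay of the distance from the bbox center to the item location."""
    min_lon, min_lat, max_lon, max_lat = place_bbox
    center_lat = (min_lat + max_lat) / 2
    center_lon = (min_lon + max_lon) / 2
    # Assumption: bandwidth proportional to the bounding-box diagonal.
    sigma = bandwidth_factor * haversine_km(min_lat, min_lon, max_lat, max_lon)
    if sigma == 0:
        return 0.0
    d = haversine_km(center_lat, center_lon, item_point[0], item_point[1])
    return math.exp(-(d ** 2) / (2 * sigma ** 2))

# Rough, illustrative coordinates (lon/lat box for Los Angeles, lat/lon points for items).
la_bbox = (-118.67, 33.70, -118.16, 34.34)
oxnard = (34.20, -119.18)
southern_africa = (-25.0, 25.0)
near = gaussian_score(la_bbox, oxnard)
far = gaussian_score(la_bbox, southern_africa)
```

Unlike an intersection-based measure, this score stays positive for the nearby Oxnard location and is essentially zero for the location in Southern Africa, which matches the intuition in the Weather in Los Angeles example.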
3.4 Thematic Query Expansion Component
As the name indicates, thematic query expansion focuses on minimizing the query-item mismatch from a thematic, i.e., topic-based, point of view. To achieve this, we adopt two approaches: concept expansion and embedding-based document similarity. We will discuss each of them below.
Before performing thematic query expansion, text preprocessing steps such as tokenization, word lemmatization, and stop word removal are applied to extract thematic concepts or terms from the user’s query, such as natural and disaster in the natural disaster in California query and traffic in the Chicago traffic query.
Concept Expansion Component
The idea of concept expansion is to find terms that are thematically similar to the query terms and add them to the expanded query clause. This is a common way to do query expansion (Jiang et al., 2018; Hu et al., 2015b). Unlike previous work in GIR, which uses a semantic knowledge base (Jiang et al., 2018) or topic modeling (Hu et al., 2015b) to find thematically similar terms, we use a word embedding technique (Mikolov et al., 2013). A similar approach has been used in developing an academic search engine (Mai et al., 2018). Given the term traffic, the word embedding model finds thematically similar terms such as congestion, rail, train, roads, and so on.
Equation 5 shows the thematic similarity between a query and an item based on concept expansion. It involves each thematic term in the user’s query, such as traffic, together with its normalized weight among all thematic query terms. Each thematic term has a set of thematically similar terms obtained from a pretrained word embedding model such as GloVe (Pennington et al., 2014), and each similar term has a normalized weight with respect to the original term based on their cosine similarity. The last quantity is the number of matches of the expanded thematic term in the current field.
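The term selection and weighting step can be sketched as follows. The three-dimensional vectors in `TOY_EMBEDDINGS` are hypothetical and hand-picked, not taken from GloVe, and normalizing the top-k cosine similarities so that they sum to one is one possible weighting choice assumed here.

```python
import math

# Hypothetical 3-d embeddings for illustration; a real system would load pretrained vectors.
TOY_EMBEDDINGS = {
    "traffic": [0.9, 0.1, 0.0],
    "congestion": [0.8, 0.2, 0.1],
    "roads": [0.7, 0.3, 0.0],
    "train": [0.5, 0.5, 0.1],
    "library": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_term(term, embeddings, k=3):
    """Return the k most similar terms with weights normalized to sum to 1."""
    sims = {w: cosine(embeddings[term], vec) for w, vec in embeddings.items() if w != term}
    top = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(s for _, s in top)
    return {w: s / total for w, s in top}

expanded = expand_term("traffic", TOY_EMBEDDINGS)
```

With these toy vectors, `expanded` contains congestion, roads, and train in decreasing order of weight, while the unrelated term library is left out.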
Embedding-Based Document Similarity Component
Instead of explicitly matching the expanded thematic terms to ArcGIS Online items, the embedding-based document similarity compares a query and an item in the hidden word embedding space. Equation 6 shows how the similarity score is defined. The query embedding is computed by simply adding the word embeddings of the thematic terms in the query. The document embedding of an item is computed as the TF-IDF weighted word embedding of the terms in its title, snippet, and description.
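The sketch below illustrates this comparison under explicit assumptions: the embeddings are hypothetical toy vectors, the TF-IDF weights of the item terms are given by hand, and cosine similarity is assumed as the comparison function, which may differ from the exact form of Equation 6.

```python
import math
from collections import Counter

# Hypothetical 3-d embeddings for illustration only.
TOY_EMBEDDINGS = {
    "traffic": [0.9, 0.1, 0.0],
    "congestion": [0.8, 0.2, 0.1],
    "roads": [0.7, 0.3, 0.0],
    "library": [0.0, 0.1, 0.9],
}

def weighted_sum(weights, embeddings):
    """Sum the vectors of known terms, scaling each by its weight."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for term, weight in weights.items():
        vec = embeddings.get(term)
        if vec is None:
            continue  # skip out-of-vocabulary terms
        total = [t + weight * v for t, v in zip(total, vec)]
    return total

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_similarity(query_terms, doc_tfidf, embeddings):
    """Compare a summed query embedding with a TF-IDF weighted document embedding."""
    query_vec = weighted_sum(Counter(query_terms), embeddings)  # plain sum of word vectors
    doc_vec = weighted_sum(doc_tfidf, embeddings)               # TF-IDF weighted sum
    return cosine(query_vec, doc_vec)

traffic_item = {"congestion": 1.2, "roads": 0.8}   # hypothetical TF-IDF weights
library_item = {"library": 1.5}
sim_traffic = embedding_similarity(["traffic"], traffic_item, TOY_EMBEDDINGS)
sim_library = embedding_similarity(["traffic"], library_item, TOY_EMBEDDINGS)
```

The traffic-related item scores high even though it never contains the word “traffic”, which is the mismatch this component is meant to reduce.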
3.5 Expanded Query Construction
The overall similarity between a query and an ArcGIS Online item is a weighted sum of all four components, each with its own weight: the platial (place-based) component, the spatial component, the concept expansion component, and the embedding-based document similarity component.
In practice, each component can be written as a collection of function score query clauses in Elasticsearch. Figure 2 shows an example of the Elasticsearch query constructed by the proposed semantic query expansion framework for the given Chicago traffic query. Each component is highlighted. Executing this expanded query against the established Elasticsearch index gives us the final search result.
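The general shape of such a request can be sketched as a Python dictionary in the Elasticsearch query DSL. This is not the query shown in Figure 2: the helper `build_expanded_query`, the weights, the coordinates, and the assumption that the “location” field is mapped as a geo_point are all illustrative. Filter-plus-weight functions reward the presence of a term rather than its number of matches, and the embedding-based component is omitted because it depends on plugin-specific scripting.

```python
import json

def build_expanded_query(thematic_terms, place_terms, origin, scale_km, weights,
                         fields=("title", "snippet", "description")):
    """Assemble a function_score request body from expanded terms (illustrative)."""
    functions = []
    # Concept expansion: one weighted filter per expanded thematic term.
    for term, term_weight in thematic_terms.items():
        functions.append({
            "filter": {"multi_match": {"query": term, "fields": list(fields)}},
            "weight": weights["concept"] * term_weight,
        })
    # Platial expansion: one weighted filter per expanded geographic term.
    for term, term_weight in place_terms.items():
        functions.append({
            "filter": {"multi_match": {"query": term, "fields": list(fields)}},
            "weight": weights["platial"] * term_weight,
        })
    # Spatial component: Gaussian decay around the place center.
    functions.append({
        "gauss": {"location": {"origin": {"lat": origin[0], "lon": origin[1]},
                               "scale": f"{scale_km}km"}},
        "weight": weights["spatial"],
    })
    return {"query": {"function_score": {
        "query": {"match_all": {}},
        "functions": functions,
        "score_mode": "sum",      # add up the component scores
        "boost_mode": "replace",  # ignore the score of the inner query
    }}}

body = build_expanded_query(
    thematic_terms={"traffic": 0.5, "congestion": 0.3, "roads": 0.2},
    place_terms={"Chicago": 0.6, "Englewood": 0.2, "Belmont Cragin": 0.2},
    origin=(41.88, -87.63),       # rough center of Chicago, for illustration
    scale_km=30,
    weights={"concept": 0.4, "platial": 0.3, "spatial": 0.3},
)
payload = json.dumps(body)
```

Summing the function scores mirrors the weighted-sum combination described above, and the resulting `payload` could be posted to the search endpoint of an index holding the item metadata.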
4.1 Semantically-Enriched Search Engine
Based on the semantic query expansion framework presented in Section 3, we develop a semantically-enriched search engine prototype for ArcGIS Online on top of the established Elasticsearch index. Figure 3 is a screenshot of the developed system, in which the radio buttons Semantic Search and Lucene correspond to our GIR model based on semantic query expansion and to the baseline IR model based on Lucene’s practical scoring function, which we call the Lucene baseline in the following. The web interface is available at http://stko-testing.geog.ucsb.edu:3010/. A mobile application was also developed based on AppStudio for ArcGIS (see Figure 4).
A collection of user search logs is an ideal benchmark dataset to evaluate the presented framework as well as the Lucene baseline as Jiang et al. (2018) did. As the search logs are not available for the current project, we decide to build our own evaluation dataset. The benchmark dataset construction process can be summarized as follows:
- For each query, we get the top 10 search results from our semantic query expansion model as well as from the Lucene baseline.
- We create a survey form for each query and each model. Each survey form consists of one query and 10 randomly ordered ArcGIS Online items. Users are then asked to judge the relevance between the query and each item on an ordinal scale, with labels such as “Perfect” (4), “Good” (3), “Some Relevance” (2), “Fair” (1), and “Bad” (0). The numbers in parentheses are used as the corresponding relevance scores. An example survey form can be seen in Figure 5.
- To host these surveys, a crowd-facing Web interface is developed and deployed in the Amazon Mechanical Turk sandbox environment.
- Eight users from different departments of a US university completed these surveys.
- In total, we have 40 survey forms, 20 for each GIR model, completed by 8 different assessors. The average relevance score across these 8 assessors is treated as the relevance score between a query and an item in one form.
Discounted cumulative gain (DCG) is a typical evaluation metric for information retrieval systems. DCG is the weighted sum of the “gains” of presenting specific items, where the weight is a discount factor determined by the position at which an item is ranked. For IR systems, DCG at the top K ranks is defined in Equation 8, in which the gain is the relevance score between a query and an item and the discount factor depends on the current rank.
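A minimal sketch of a DCG@K computation is shown below. The logarithmic discount 1 / log2(rank + 1) is one common variant and is an assumption here; other discount functions are in use, so the `discount` parameter (hypothetical) is left configurable. The relevance values are made up for illustration.

```python
import math

def dcg_at_k(relevances, k, discount=lambda rank: 1.0 / math.log2(rank + 1)):
    """Sum of gain * discount over the top-k ranked items (ranks start at 1)."""
    return sum(rel * discount(rank)
               for rank, rel in enumerate(relevances[:k], start=1))

# Hypothetical averaged relevance scores of a ranked result list.
ranked_relevances = [4, 3, 0, 2, 1]
dcg3 = dcg_at_k(ranked_relevances, 3)
dcg5 = dcg_at_k(ranked_relevances, 5)
```

Because every term is non-negative, DCG can only grow as K increases, which is the pattern visible across the DCG@3, DCG@5, and DCG@10 columns of a result table.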
We choose DCG@3, DCG@5, and DCG@10 as the evaluation metrics and Table 1 shows the evaluation results of both our semantic search model and Lucene baseline on each query. Some interesting observations can be made based on Table 1:
- Comparing the average DCG scores, our semantic search model outperforms the Lucene baseline by a significant margin.
- In 17 out of 20 queries, the semantic search model outperforms the Lucene baseline.
- For two queries (Query 2 and Query 8), the semantic search model and the Lucene baseline yield relatively similar DCG scores.
- The only query on which our semantic search model performs clearly worse is Query 10, Crimes in Tennessee. After examining the top 10 search results of the two models, we find that:
  - All top 10 search results of the Lucene baseline are crime maps about other places such as New York or Miami, or worldwide crime reports. The Lucene baseline essentially retrieves these items based on thematic similarity.
  - Nine of the 10 search results of the semantic search model are about other topics in Tennessee such as public health, energy, and banking, while one item is about crimes in neighboring states. Of these 9 items, 7 do not contain any place names in their title, snippet, or description but have spatial footprints close to the center of Tennessee. This implies that the semantic search model finds these items mostly based on spatial similarity.
  - There is actually no item in the collection that correctly answers the query about crime in Tennessee.
However, based solely on these observations we cannot conclude that people pay more attention to thematic similarity than spatial similarity. That is because this bias may be caused by the design of the survey form in which thematic similarity is relatively easy to judge, while spatial similarity is rather difficult as users need to click the link and go to the web map to see the geographic scopes of an item.
These observations raise an interesting question: how should an appropriate survey form be designed for evaluating GIR systems, in contrast to more general IR systems?
| ID | Query | Lucene Baseline DCG@3 | Lucene Baseline DCG@5 | Lucene Baseline DCG@10 | Semantic Search DCG@3 | Semantic Search DCG@5 | Semantic Search DCG@10 |
|---|---|---|---|---|---|---|---|
| 1 | New York water | 1.35 | 1.91 | 4.07 | 5.90 | 8.20 | 11.78 |
| 3 | California population density | 3.72 | 5.18 | 7.26 | 6.97 | 9.38 | 12.88 |
| 4 | Vacation in Hawaii | 3.85 | 5.05 | 7.93 | 8.60 | 11.54 | 15.50 |
| 6 | Weather in Iowa | 3.30 | 5.44 | 7.40 | 5.51 | 7.97 | 11.67 |
| 8 | Libraries in Montana | 9.40 | 12.57 | 15.30 | 9.29 | 12.56 | 15.26 |
| 9 | Natural disasters in Utah | 3.18 | 5.45 | 8.30 | 7.22 | 8.82 | 10.85 |
| 10 | Crimes in Tennessee state | 5.03 | 7.54 | 11.90 | 1.74 | 1.97 | 2.92 |
| 12 | Agriculture in Michigan | 6.32 | 7.03 | 8.61 | 8.69 | 9.98 | 12.36 |
| 14 | Tourist attraction in LA | 1.57 | 2.03 | 3.43 | 6.43 | 8.18 | 11.35 |
| 15 | Hurricane in Louisiana | 4.18 | 5.64 | 9.33 | 7.11 | 9.22 | 13.20 |
| 16 | Universities in Boston | 2.14 | 2.68 | 4.30 | 5.66 | 7.23 | 9.10 |
| 17 | Hospitals in New York | 1.90 | 2.63 | 4.41 | 5.82 | 8.70 | 12.17 |
| 18 | Grocery store in Seattle | 6.12 | 8.17 | 11.28 | 10.40 | 13.93 | 16.99 |
| 19 | Highways in Los Angeles | 2.22 | 2.92 | 4.19 | 7.64 | 9.09 | 10.26 |
| 20 | Air pollution of New York | 6.88 | 8.37 | 9.76 | 7.04 | 9.55 | 12.71 |
In this work, we present a semantic query expansion framework for geographic information retrieval systems. It enriches a user’s query from both geospatial and thematic perspectives, and two components are developed for each perspective. Using ArcGIS Online as an example, we develop a semantically enriched search engine prototype that follows the proposed query expansion framework. We construct a benchmark dataset to evaluate the proposed GIR model against a widely used baseline model, Lucene’s practical scoring function. The results demonstrate that our semantic query expansion model significantly outperforms the Lucene baseline, thereby highlighting the effectiveness of the proposed approach.
As for future research, we want to improve the efficiency of the presented semantic query expansion framework. We also want to investigate other ways to measure spatial similarity such as Space2Vec (Mai et al., 2020). In addition, we are interested in evaluating the impact of different spatial similarity measures on the performance of GIR systems more generally. Moreover, we plan to investigate whether the added geospatial aspect of GIR affects the way we evaluate such systems.
This work was partially done while the first author was interning at Esri Inc. It is partially funded by Esri Inc. and the NSF award 1936677 C-Accel Pilot - Track A1 (Open Knowledge Network): Spatially-Explicit Models, Methods, And Services For Open Knowledge Networks. We thank four Ph.D. students from UC Santa Barbara for evaluation data annotations: Jingyi Xiao, Ning Zhang, Haoxin Zhou, and Yao Xuan.
- Azad and Deepak (2019). Query expansion techniques for information retrieval: a survey. Information Processing & Management 56 (5), pp. 1698–1735.
- Buckley et al. (1995). Automatic query expansion using SMART: TREC 3. NIST Special Publication SP, pp. 69–69.
- Carterette and Jones (2008). Evaluating search engines by modeling the relationship between relevance and clicks. In Advances in Neural Information Processing Systems, pp. 217–224.
- Dominich (2008). The modern algebra of information retrieval. Springer.
- Frontiera et al. (2008). A comparison of geometric approaches to assessing spatial similarity for GIR. International Journal of Geographical Information Science 22 (3), pp. 337–360.
- Hu et al. (2015a). Enabling semantic search and knowledge discovery for ArcGIS Online: a linked-data-driven approach. In AGILE 2015, pp. 107–124.
- Hu et al. (2015b). Metadata topic harmonization and semantic search for linked-data-driven geoportals: a case study using ArcGIS Online. Transactions in GIS 19 (3), pp. 398–416.
- Huang et al. (2008). Hierarchical location and topic based query expansion. In AAAI, pp. 1150–1155.
- Jaccard (1912). The distribution of the flora in the alpine zone. 1. New Phytologist 11 (2), pp. 37–50.
- Janowicz et al. (2011). The semantics of similarity in geographic information retrieval. Journal of Spatial Information Science 2011 (2), pp. 29–57.
- Järvelin and Kekäläinen (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20 (4), pp. 422–446.
- Jiang et al. (2018). Towards intelligent geospatial data discovery: a machine learning framework for search ranking. International Journal of Digital Earth 11 (9), pp. 956–971.
- Jones and Purves (2008). Geographical information retrieval. International Journal of Geographical Information Science, pp. 219–228.
- Mai et al. (2020). Multi-scale representation learning for spatial feature distributions using grid cells. In The Eighth International Conference on Learning Representations.
- Mai et al. (2018). Combining text embedding and knowledge graph embedding techniques for academic search engines. In Semdeep/NLIWoD@ISWC, pp. 77–88.
- Mai et al. (2019). Relaxing unanswerable geographic questions using a spatially explicit knowledge graph embedding model. In AGILE: The 22nd Annual International Conference on Geographic Information Science, pp. 21–39.
- Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Rocchio (1971). Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing, pp. 313–323.
- Van Kreveld et al. (2005). Multi-dimensional scattered ranking methods for geographic information retrieval. GeoInformatica 9 (1), pp. 61–84.
- Vechtomova (2009). Query expansion for information retrieval. Encyclopedia of Database Systems, pp. 2254–2257.