
Geo-referencing Place from Everyday Natural Language Descriptions

by   Hao Chen, et al.

Natural language place descriptions in everyday communication provide a rich source of spatial knowledge about places. An important step in utilizing such knowledge in information systems is geo-referencing all the places referred to in these descriptions. Current techniques for geo-referencing places from text documents rely on place name recognition and disambiguation; however, place descriptions often contain place references that are not known by gazetteers, or that are expressed in other, more flexible ways. Hence, the approach for geo-referencing presented in this paper starts from a place graph that contains the place references as well as spatial relationships extracted from place descriptions. Spatial relationships are used to constrain the locations of places and enable the subsequent best-matching process for geo-referencing. The novel geo-referencing process results in higher precision and recall compared to state-of-the-art toponym resolution approaches on several tested place description datasets.



1 Introduction

With the increasing volume of unstructured text documents being published online, as well as the growing need for place-related information in everyday life, the relation between places and text documents has recently attracted research attention, and the necessity of identifying and locating places from text documents has been emphasized by others [Jones et al. (2001), Schlieder et al. (2001), Hill (2006), Teitler et al. (2008)]. Place information extracted from text can be used to facilitate a wide range of applications such as geographic information retrieval [Silva et al. (2006), Purves et al. (2007), Jones and Purves (2008)], to smooth human-computer interaction [Raubal (2009), Winter et al. (2016), Davies et al. (2009)], and to build place information systems. The rapid development of text mining and natural language processing techniques makes it feasible to extract information from text documents, through information extraction techniques such as named entity recognition and relation extraction.

This research focuses on natural language place descriptions as data input. Natural language place descriptions occur in everyday verbal communication as a way of encoding and transmitting spatial knowledge about place between individuals [Vasardani et al. (2013b), Vasardani et al. (2013a)] as well as in written texts such as web documents, news articles, social media texts, trip guides, and tourism articles [Teitler et al. (2008), Kim et al. (2015)]. Such place descriptions provide a qualitative reference system for describing geographic locations, and consist essentially of references to places and the qualitative spatial relations between these places. Consider for example the following transcription of an emergency call:

“We need an ambulance. We are in the Cussonia Courtyard, on the campus. The courtyard is beside the clock-tower. The closest road is probably Monash Road. You can rush through the Old Quad.”

The information conveyed by such a place description is useful for mental sketching of a spatial environment, and can be used, for example, to provide navigational instructions or to inform the location of events.

Current techniques for geo-referencing places from text documents, i.e., linking place references to geographic locations or footprints, are based on toponym resolution [Leidner (2007)], which relies on identifying place names in external knowledge bases (typically a gazetteer) and performing disambiguation if there is a reference ambiguity [DeLozier et al. (2015), Lieberman and Samet (2012), Garbin and Mani (2005a), Gouvea et al. (2008), Li et al. (2006), Smart et al. (2010), Buscaldi and Rosso (2008a), Buscaldi and Rosso (2008b)]. However, places extracted from everyday place descriptions are challenging for this approach. First, everyday place descriptions are flexible in language, and often contain place references that cannot be found in a gazetteer. For example, these references can be synonyms of gazetteered names, such as vernaculars (e.g., ‘FedSquare’ instead of ‘Federation Square’), references of otherwise limited spread (e.g., ‘the place where we met yesterday’), or categories (e.g., ‘the train station’), possibly with additional qualifiers (e.g., ‘the central station’). Many of the vernacular descriptions refer to places with vague boundaries (e.g., ‘the BBQ area on the lawn’). Such places in place descriptions can be located only by the provided spatial relationships to other places. Second, places in everyday place descriptions are frequently of a spatial granularity at which environmental features are no longer gazetteered. Individual buildings, establishments in buildings (e.g., rooms), or features of local interest (e.g., ATMs) usually have higher ambiguity than larger geographic features and thus require different approaches to disambiguate and resolve [Buscaldi (2011)]. Current toponym resolution approaches, designed for larger geographic features such as populated places (e.g., cities or countries) or natural geographic features (e.g., rivers or mountains), can, for example, use heuristics based on the sizes of the features (e.g., population). Such an approach is not applicable to everyday features that are too numerous and too similar. In summary, little attention and effort has been spent on developing approaches for resolving place references that are fine-grained or cannot be found in a gazetteer, although both are common in everyday place descriptions.

This research aims at overcoming these limitations in order to geo-reference all places in everyday place descriptions. The paper presents a methodology that starts from a graph containing extracted places and their spatial relationships: a place graph [Vasardani et al. (2013a)]. The hypothesis of this research is that integrating the structure of the place graph into the geo-referencing process allows the approach to exceed state-of-the-art toponym resolution approaches in precision and recall. The approach presented in this paper has been implemented and tested successfully. The contributions of this paper can be summarized as follows:

  1. We propose a new toponym resolution approach based on a place graph with merged information extracted from any number of (everyday) place descriptions;

  2. We demonstrate how all places from a place graph can be geo-referenced even if some are expressed by place references that cannot be recognized by a gazetteer;

  3. We evaluate our approach with experiments on different datasets based on precision and recall, and compare the results to state-of-the-art toponym resolution approaches.

The remainder of the paper is structured as follows: In Section 2 a review of related work is given. Section 3 clarifies how place references and spatial relationships are modelled by a place graph, as well as the roles they play in the following geo-referencing approach. In Section 4, a multi-step geo-referencing approach is explained. Section 5 shows implementation and experiment results on several test datasets. In Section 6 a discussion is presented. Section 7 concludes.

2 Related work

People talk about space by referring to places [Winter et al. (2010)]. Bennett et al. define places as conceptual entities that enable cognitive structuring of the spatial aspects of reality [Bennett and Agarwal (2007)]. In GIScience and related geographic research fields, the notion of place is a central concept in human spatial cognition and communication [Tuan (1977), Goodchild (2011a)]. Place based research is an emerging research dimension in GIScience in order to smooth and simplify human-computer interaction through capturing, modeling and utilizing place-related information, and the importance of place based research has been widely acknowledged (e.g., [Golledge (1997), Goodchild (2007), Goodchild (2011b), Winter and Freksa (2012), Winter et al. (2016)]).

Identifying and locating place from unstructured text documents has recently attracted place based research attention. In the remaining parts of this section, relevant tools for geo-referencing place from text documents will be introduced.

2.1 Gazetteer

In order to locate place names on a map with precise coordinates, gazetteers are often used in conjunction with maps. A gazetteer is an important component in geo-referencing systems for both enterprise and academic purposes, and is commonly used for geographic information retrieval, navigation services and web-mapping applications. A gazetteer typically contains three core components: place names, feature types, and footprints [Hill (2000), Goodchild and Hill (2008)], and is often regarded as a geospatial dictionary of geographic names. A place name is what people usually use to search for a place, and is typically considered ‘the official name’. A place type is a category from a feature-type thesaurus for classifying places according to their semantics. A footprint represents the location of a place, typically by a single coordinate pair as an estimated center of an extended object, and sometimes by a polygon or a polyline instead. Some gazetteers, such as the Getty Thesaurus of Geographic Names (TGN) or GeoNames, also store alternative names, and provide detailed descriptions of places as well as positions of places in administrative or political hierarchies.

2.2 Toponym resolution

The goal of toponym resolution is to recognize place names in text documents and link them to geographic locations or footprints; the essential challenges are place name recognition and disambiguation [Leidner (2007)]. Disambiguation is the process of mapping each place name to its actual geographic location when there is more than one candidate location. For example, according to GeoNames, the toponym ‘Paris’ can refer to more than sixty different geographic locations around the world.

Disambiguation approaches can be classified into map-based (e.g., [Smith and Crane (2001), Zhang et al. (2012)]), knowledge-based (e.g., [Buscaldi and Rosso (2008a), Karimzadeh et al. (2013)]), and machine learning (e.g., [Smith and Mann (2003), Garbin and Mani (2005b)]). Various heuristics have also been suggested based on features such as population and whether a place name is a capital city name [Leidner (2007)]. The selection of the disambiguation approach is usually based on the task and data source available [Buscaldi (2011)]. Geotagging systems (typically systems that determine the geo-focus for the entire document for geographic information retrieval purposes) use toponym resolution techniques existing in the literature or customized ones (e.g., [Teitler et al. (2008), Lieberman et al. (2007)]).

Existing toponym resolution approaches are not suitable for the task of this research for three reasons. First, these approaches typically focus on gazetteered place names, while everyday place descriptions often contain place references that are not gazetteered. Second, place descriptions may contain vague places that can only be geo-referenced using spatial relationships to other places. Third, these existing approaches typically focus on places of spatial granularities that are larger than or equal to suburb- and city-level, which are easier to resolve than places of finer spatial granularities. Fine-grained places often require additional information and a different methodology to disambiguate and resolve.

2.3 Qualitative spatial relationships

Qualitative spatial relationships reflect the spatial cognitive capacity of people [Vasardani et al. (2013b)] and provide useful knowledge for understanding locative expressions. Qualitative spatial relationships have been extensively studied in the Artificial Intelligence community for qualitative spatial reasoning, including cardinal [Freksa (1992), Frank (1992), Liu et al. (2005)], topological [Egenhofer and Franzosa (1991), Randell et al. (1992)], relative direction [Schlieder (1995), Freksa (1992)], and qualitative distance [Frank (1992), Worboys (2001)]. Spatial relationships can be modeled using formal logic and applied to tasks such as robotic navigation. In English, such qualitative spatial relationships are often expressed by prepositions and can thus be identified and extracted from text documents.

Some studies use qualitative spatial relationships to derive uncertainty fields for locative expressions [Herskovits (1985)]. People use locative expressions to describe a vague location through spatial relationships to some known place. E.g., ‘10 east of Berkeley’ refers to some unspecified place at a 10 distance in a particular direction from a known place called Berkeley. Wieczorek et al. developed such a point-radius method using cardinal directions and (imprecise) metric distance relationships [Wieczorek et al. (2004)]. Associated uncertainties such as coordinate-, distance-, and direction-imprecision are calculated in order to derive an uncertainty field representation. The methodology was later modified by applying a probabilistic distribution model, as the possibilities of a place to be located at any location within the uncertainty field are not equal [Guo et al. (2008)]. Liu et al. go a step further by adding topological and qualitative distance relationships in the model [Liu et al. (2009)].

Other studies attempt to quantify qualitative spatial relationships. A study by Delboni et al. focuses on determining semantic equivalence for spatial relationships through quantification [Delboni et al. (2007)]. Fu et al. assign different search radii for near based on the semantic categories of the referred places given a spatial query [Fu et al. (2005)]. Hall et al. conduct a series of data-driven studies to quantify spatial relationships, including near and cardinal direction relationships, in terms of distance and orientation [Hall and Jones (2008b), Hall and Jones (2008a)]. The approach by Skoumas et al. [Skoumas et al. (2016)] is comparable to the one proposed by Hall et al., which uses grid-based representations for the derived probabilistic models for spatial relationships.

The above approaches focus on either deriving uncertainty probabilistic fields for spatial relationships, or on the quantification of spatial relationships. Yet how qualitative spatial relationships can be used in the task of geo-referencing, i.e., linking a place referred by spatial relationships to specific geographic locations or the footprint stored by a gazetteer, remains undiscussed. In addition, these previous studies typically assume some already geo-referenced relata, and in our case relata are not geo-referenced beforehand.

2.4 Place graph

A locative expression can be modelled by a locatum, the reference to a place that is to be located, a relatum, the reference to a place that is already located, and a spatial relationship between the two. Such representation, e.g., <building, near, south lawn>, has been called a spatial triplet [Vasardani et al. (2013a)]. Approaches for extracting triplets from place descriptions are available [Khan et al. (2013), Liu et al. (2014)].

A spatial property graph can then be constructed based on a set of triplets, and a complete construction process is given by Kim et al. [Kim et al. (2016a)]. Each triplet is stored as two nodes, one each for locatum and relatum, and an edge in between for the spatial relationship. Spatial relationships are standardized, e.g., ‘to the north of’ and ‘Northern’ are both normalized to an edge ‘north of’. Each edge in the graph is directed from locatum to relatum due to the asymmetry of spatial relationships, and there can be multiple edges for different spatial relationships between the same pair of nodes. Nodes are merged through a comprehensive similarity matching process considering string, linguistic and spatial relationship similarity [Kim et al. (2016b)]. If multiple place references are identified as referring to the same place by their similarities, they are stored as a single node, i.e., each node has a unique identifier and potentially multiple place references.
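The structure described above can be illustrated with a minimal Python sketch. It is not the construction process of Kim et al.; the class name `PlaceGraph`, the node identifiers `n1` and `n2`, the extra reference ‘the building’ and the small normalization table are assumptions made only for illustration, and node merging by similarity is not modelled.

```python
from collections import defaultdict

# Hypothetical normalization table; the two variants come from the example above.
NORMALIZED = {"to the north of": "north of", "northern": "north of"}


def normalize(relationship):
    """Map a surface expression to a standardized edge label."""
    key = relationship.strip().lower()
    return NORMALIZED.get(key, key)


class PlaceGraph:
    """Labelled directed multigraph of places and spatial relationships."""

    def __init__(self):
        self.references = defaultdict(set)  # node id -> place references
        self.edges = []                     # (locatum, relationship, relatum)

    def add_triplet(self, locatum, relationship, relatum):
        # Edges are directed from locatum to relatum; parallel edges are kept.
        self.edges.append((locatum, normalize(relationship), relatum))

    def outgoing(self, node):
        """Return (relationship, relatum) pairs for edges leaving `node`."""
        return [(rel, dst) for src, rel, dst in self.edges if src == node]


graph = PlaceGraph()
graph.references["n1"].update({"building", "the building"})  # merged references
graph.references["n2"].add("south lawn")
graph.add_triplet("n1", "near", "n2")
graph.add_triplet("n1", "Northern", "n2")
print(graph.outgoing("n1"))
```

Storing edges as a plain list keeps parallel edges between the same pair of nodes, which a dictionary keyed by node pair would silently overwrite.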

None of the places in such a place graph is geo-referenced yet. The approach in this research starts from a place graph as data input, instead of the raw place descriptions as text documents that are typically used by toponym resolution studies.

3 Place graph as a knowledge base

Before explaining the geo-referencing approach, this section clarifies how places, place references and spatial relationships are modeled by a place graph.

3.1 Place and place reference

Places are referred to in place descriptions by place references. People use a variety of ways to refer to places; examples were given above. Between places and place references are many-to-many relationships, i.e., a place may be referred to by one or more different place references, while the same place reference may be used to describe different places in different context (e.g., ‘the train station in Melbourne central’).

In this research, places and place references are further categorized by whether they are gazetteered or not. Regarding places, a gazetteered place is a place that is stored by a gazetteer, and a non-gazetteered place is a place that is not stored by a gazetteer. Regarding place references, a gazetteered reference is a place name stored in a gazetteer as the name of a gazetteered place, and a non-gazetteered reference is a place reference that is not known by a gazetteer. Examples for illustration will be given later.

Naturally, a place reference to a gazetteered place can be different to its gazetteered name. For example, two references ‘Flinders Street Railway Station’ (gazetteered) and ‘the train station’ (non-gazetteered) come from conversational contexts where they referred to the same, gazetteered place (Flinders Street Railway Station). In a different context, the reference ‘the train station’ may refer to another train station.

3.2 Place in a place graph

Figure 1 shows a sample place graph constructed from everyday place descriptions. It consists of six places represented by nodes (labeled a, b, c, d, e, f) as well as seven spatial relationships represented by labeled edges. A list of place references that have been found in various place descriptions for each node is shown in the solid-line rectangles. Each dashed-line rectangle shows the ground-truth gazetteered name(s) for these places (‘-’ for non-gazetteered places). The ground-truth names are only shown for demonstration purposes, and are not available from the input place graph.

Figure 1: A sample place graph with six nodes and seven edges

Such a place graph may be derived from multiple place descriptions, e.g., a place graph of Melbourne constructed from hundreds of place descriptions about Melbourne as Web documents. The construction methodology is given in Section 2.4. The geo-referencing approach below starts from such a place graph, while current toponym resolution processes typically focus on a single document at a time.

In the sample place graph shown in Figure 1, nodes a, b, c, d, e are gazetteered places, and node f is a non-gazetteered place. Each of those nodes may have several place references, and each of them may be a gazetteered reference. Thus, three situations can be distinguished for places in a place graph, which lead to a multi-step geo-referencing approach.

  1. A gazetteered place has at least one gazetteered reference (nodes a, c, d);

  2. A gazetteered place has no gazetteered references (nodes b, e); and

  3. A non-gazetteered place (node f).

3.3 Qualitative spatial relationships as constraints

Spatial relationships provide valuable knowledge for inferring the approximate location of a place by connecting it to other places, e.g., ‘the bake house inside Flinders Street Station’. The inferred approximate location can then be used for gazetteer matching and resolving reference ambiguity. In this research, the semantics of qualitative spatial relationships from four families are considered, as shown in Table 1. In order to be able to use these families studied in Artificial Intelligence, a mapping of the much richer, more flexible, and deeply contextualized natural language of spatial prepositions and verbs indicating relationships has to be applied to place graphs.


Spatial relationship family | Spatial relationships
Cardinal direction          | north of, south of, east of, west of, north east of, south east of, north west of, south west of
Qualitative distance        | near
Relative direction          | in front of, behind, left of, right of
Topological                 | inside, covered by, overlap, meet, disjoint, cover, contain, equal


Table 1: Spatial relationships considered in the approach below

A search space is defined for each relationship from the four families to represent the constrained location of a locatum that satisfies the spatial relationship to an already geo-referenced relatum. In the geo-referencing approach below, search spaces are used for filtering out gazetteer entries that do not satisfy those given spatial relationships. Thus, for each locatum, only a constrained number of gazetteer entries will be left for the later best-matching process.

3.3.1 Cardinal direction relationships

The search spaces for cardinal direction relationships are defined in this paper using Frank’s half-plane models [Frank (1992)] if the relatum is geo-referenced as a point, as shown in Figure 2 (a), (b) and (c). If the relatum is geo-referenced as a polygon, its centroid will be used to derive half-planes. An alternative model for polygon-based relata is minimal-bounding-box based (Figure 2 (d)). This paper applies the former (centroid-based) model since a cardinal direction relationship from a place description may be an internal cardinal relationship, e.g., ‘in the north (northern part) of the city’, while using the latter model will result in misinterpretation in such cases.

Figure 2: Search spaces for cardinal direction relationships
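As an illustration of the half-plane idea for a point (or centroid) relatum, the following Python sketch tests whether a candidate location falls in the search space of a cardinal direction relationship. It assumes planar coordinates with x as easting and y as northing, and it treats the four diagonal directions as the intersection of two half-planes; the exact models of Figure 2 are not reproduced here, and the function name is hypothetical.

```python
def in_cardinal_search_space(candidate, relatum, relationship):
    """Return True if `candidate` (x, y) satisfies `relationship` to `relatum`.

    Assumption: planar coordinates, x = easting, y = northing, and
    diagonal directions modelled as intersections of two half-planes.
    """
    dx = candidate[0] - relatum[0]  # positive: candidate lies east of relatum
    dy = candidate[1] - relatum[1]  # positive: candidate lies north of relatum
    tests = {
        "north of": dy > 0,
        "south of": dy < 0,
        "east of": dx > 0,
        "west of": dx < 0,
        "north east of": dx > 0 and dy > 0,
        "south east of": dx > 0 and dy < 0,
        "north west of": dx < 0 and dy > 0,
        "south west of": dx < 0 and dy < 0,
    }
    return tests[relationship]


relatum = (10.0, 10.0)
print(in_cardinal_search_space((12.0, 15.0), relatum, "north of"))
print(in_cardinal_search_space((12.0, 15.0), relatum, "south west of"))
```

For a polygon relatum, the same test would be applied to its centroid, following the centroid-based model chosen above.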

3.3.2 Topological relationships

If the relatum is geo-referenced as a polygon, the search spaces for topological relationships are defined as shown in Figure 3. If the relatum is geo-referenced as a point or polyline, no search spaces will be defined. As shown in the figure, only the search spaces for covered by, equal and inside are limited, while the search spaces for the other topological relationships cannot be constrained to limited areas directly. In natural language, people mostly express topological relationships as containment relationships (typically inside); therefore, in most cases the derived search spaces for topological relationships are constrained.

For all relationships other than covered by, equal and inside, the defined search spaces are not limited. However, once all gazetteer entries that are within a search space are obtained, they can further be filtered through computing their geometries (for polygons only) and the geometry of the relata to determine whether these entries satisfy the given topological relationship. The detailed process for filtering by geometrical computation will be explained in Section 4.2.2.

Figure 3: Search spaces for covered by, equal, inside (a), disjoint, meet (b), and others (c)

3.3.3 Qualitative distance relationship

The search space for the qualitative distance relationship near is defined in a comparable way to the probabilistic uncertainty model proposed by Liu et al. [Liu et al. (2009)], and a comparison is shown in Figure 4. The search space in this research is a buffered region based on the geometry of the relatum, and can be computed using existing buffering algorithms. The computation of nearness in order to deal with the probabilistic uncertainty of near will be given in Section 4.2.2.

Figure 4: Search space for near in this research (a), (b), (c) and by Liu et al. (d)

Defining a buffer distance is the generally accepted model for quantifying qualitative nearness in local-search applications as well as geographic information retrieval engines (e.g., [Fu et al. (2005)]). In this research, the buffer distance is defined considering the size of the involved objects (relata) as well as the scale of the spatial context, as shown in Eq. 1. Its terms are the buffer distance, a constant, and two coefficients that make the buffer distance positively correlated with the area of the relatum as well as the area of the spatial context. The computation of spatial contexts will be explained in Section 4.1.2. Different parameter values will be tuned in the implementation stage, and the associated results will be discussed.


Another qualitative spatial relationship, far, is not considered since it offers little help in constraining the location of a given locatum, as the search space would be unlimited. Furthermore, unlike some relationships that also have unlimited search spaces, e.g., overlap, far cannot be used for filtering out candidate gazetteer entries.

3.3.4 Relative direction relationships

Search spaces for relative direction relationships are defined in a way similar to both cardinal direction and qualitative distance relationships. Relative direction relationships are based on orientation reference frames used by people, and can be either deictic, intrinsic or extrinsic [Retz-Schmidt (1988)]. Assuming that the reference frame used is known, the search spaces could be defined as shown in Figure 5. The arrow in the figure shows the direction of ‘front’ used by the reference frame. Search spaces are defined by the intersection of half planes centered at the centroid of the relatum, and the search space of near, as relative direction relationships are mostly used for describing places that are near to each other.

Figure 5: Search spaces for relative directions given a reference frame

No existing natural language parser is able to infer the reference frame automatically from place descriptions unless explicitly given by the description provider. If information of the reference frame used is unavailable, the search spaces for relative direction relationships are defined the same as for near, as a fall-back approach.

3.3.5 Approximate location region

An approximate location region (ALR) is a derived region that represents the approximate location of a place based on all known spatial relationships to some already geo-referenced places, and is computed by the intersection of all search spaces for this place. For instance, as shown in Figure 6, place b from the sample graph has no gazetteered references; however three outgoing relationships are available, i.e., east of a, south of and near c. Assuming that a and c are already geo-referenced, the location of b can be constrained by the shaded region representing the ALR.

Figure 6: An example of deriving ALR for place b through intersection of search spaces

In the geo-referencing approach below, ALRs are used to limit the likely location of places as well as to limit the number of candidate gazetteer entries to be obtained for the later best-matching process. As a place graph gets populated with more place descriptions and spatial relationships, the generated ALRs are expected to become closer to the actual footprints and thus increase the overall geo-referencing accuracy.
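The intersection of search spaces can be sketched in Python by representing each search space as a membership test and the ALR as their conjunction. This is a simplified sketch under stated assumptions: relata are points in planar coordinates, near is a circular buffer with a hypothetical buffer distance of 3 units, and the coordinates and candidate names are invented for illustration. Only the three relationships of place b (east of a, south of c, near c) are taken from the example above.

```python
import math


def east_of(relatum):
    return lambda p: p[0] > relatum[0]


def south_of(relatum):
    return lambda p: p[1] < relatum[1]


def near(relatum, buffer_distance):
    # Assumption: a circular buffer around a point relatum.
    return lambda p: math.dist(p, relatum) <= buffer_distance


def in_alr(point, search_spaces):
    """A point is in the ALR if it lies in every search space."""
    return all(space(point) for space in search_spaces)


def filter_candidates(candidates, search_spaces):
    """Keep only candidate gazetteer entries located inside the ALR."""
    return [name for name, point in candidates if in_alr(point, search_spaces)]


a, c = (0.0, 0.0), (5.0, 5.0)  # hypothetical anchor locations
spaces_for_b = [east_of(a), south_of(c), near(c, 3.0)]
candidates = [("X", (4.0, 3.0)), ("Y", (6.0, 6.0)), ("Z", (1.0, 1.0))]
print(filter_candidates(candidates, spaces_for_b))
```

Candidate Y fails the south of c test and candidate Z lies outside the buffer around c, so only X remains for the later best-matching process.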

4 Approach for geo-referencing

The three situations described in Section 3.2 for all places in a place graph are addressed by a corresponding three-step approach. The first step is to identify, disambiguate and geo-reference places in the place graph that have at least one gazetteered reference (Situation 1). These places are then regarded as anchor places. For each of the places that do not have associated gazetteered references (Situation 2), a comprehensive best-matching process is conducted based on all stored spatial relationships to the previously geo-referenced anchor places. Gazetteer entries with the highest overall similarity scores will be selected for geo-referencing. Finally, places that cannot be matched during the second step, i.e., non-gazetteered places (Situation 3), will be geo-referenced using their derived ALRs.

The input for the following algorithms is a place graph, a labelled directed multi-graph represented by a set of places and a set of relationships between these places. Each place corresponds to a node in the graph and is associated with a number of place references. Each relationship is a tuple consisting of the starting node, the spatial relationship, and the ending node, e.g., (b, east of, a) for the sample graph.

4.1 Geo-referencing anchor places

4.1.1 Identification of anchor places

The algorithm for identifying anchor places is shown in Algorithm 1. The function getReferences takes a place as input and returns a list of all place references associated with this place. getGazetteerEntries retrieves all gazetteer entries whose names exactly match a place reference string, and returns them as a list (empty if the place reference is non-gazetteered).

Input: PlaceList
Output: AnchorPlacesAndEntries

AnchorPlacesAndEntries ← ∅
for p in PlaceList do    ▷ for each place
     Entries ← ∅
     for r in getReferences(p) do    ▷ for each place reference of this place
          Entries ← Entries ∪ getGazetteerEntries(r)
     end for
     if Entries ≠ ∅ then    ▷ if the place has at least one gazetteered reference
          AnchorPlacesAndEntries ← AnchorPlacesAndEntries ∪ {(p, Entries)}
     end if
end for
return AnchorPlacesAndEntries
Algorithm 1: Identifying anchor places and obtaining gazetteer entries

Essentially, Algorithm 1 looks up all place references in an input place graph using a gazetteer. If a place has at least one associated place reference that can be found in the gazetteer, it is regarded as an anchor place.
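A minimal Python sketch of this lookup is given below. The two helper functions are passed in as parameters and backed by toy dictionaries; the node identifiers `p1` and `p2`, the assignment of references to nodes, and the in-memory gazetteer are assumptions for illustration only, standing in for a real gazetteer service.

```python
def identify_anchor_places(place_list, get_references, get_gazetteer_entries):
    """Return {place: entries} for places with at least one gazetteered reference."""
    anchors = {}
    for place in place_list:
        entries = []
        for reference in get_references(place):
            # Exact string match against the gazetteer; empty if not gazetteered.
            entries.extend(get_gazetteer_entries(reference))
        if entries:
            anchors[place] = entries
    return anchors


# Toy data (hypothetical): two places and a two-entry gazetteer lookup.
references = {
    "p1": ["St Paul's Cathedral", "the cathedral"],
    "p2": ["the BBQ area on the lawn"],
}
gazetteer = {
    "St Paul's Cathedral": [
        "St Paul's Cathedral, Melbourne",
        "St Paul's Cathedral, London",
    ],
}

anchors = identify_anchor_places(
    ["p1", "p2"],
    lambda place: references[place],
    lambda reference: gazetteer.get(reference, []),
)
print(anchors)
```

Here `p1` becomes an anchor place with two ambiguous candidate entries, which is exactly the situation the subsequent disambiguation step has to resolve, while `p2` is left for the later steps.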

Taking the sample graph as an example, each of the nodes a, c and d has at least one gazetteered reference and will thus be identified as an anchor place by the algorithm. The next step is disambiguation, as some place references may have more than one entry in the gazetteer, e.g., ‘St Paul’s Cathedral, Melbourne; St Paul’s Cathedral, London; St Paul’s Cathedral, Bendigo;…’.

4.1.2 Disambiguation

Knowledge-based disambiguation approaches are not suitable for this research since place descriptions often contain fine-grained places, and the coverage provided by the commonly used external knowledge resources, e.g., Wikipedia and WordNet, is very limited for such places. Data-driven disambiguation approaches cannot be applied either, since it is difficult to obtain an annotated training corpus for such fine-grained places. In addition, some commonly used heuristics, such as those based on population or jurisdiction level, are not applicable.

Thus, a novel disambiguation approach is proposed: a density-based clustering approach that is superior to the previous (map-based) approaches, at least for the task at hand. For example, some studies (e.g., [Habib and van Keulen (2012), Buscaldi and Rosso (2008b)]) use overall minimal distance for disambiguation. If the data source contains places from multiple spatial foci that are far away from each other, e.g., places from two suburbs or cities, their clustering algorithms will generate one large cluster. In contrast, the algorithm proposed in this research will generate multiple small clusters covering the foci separately. A comparison is illustrated in Figure 7. Note that the dashed circular regions do not indicate the actual cluster boundaries. Stronger disambiguation, i.e., smaller clusters, is more useful for limiting the locations of other places in the following steps. Moncla et al. [Moncla et al. (2014)] use an existing clustering algorithm called DBSCAN [Ester et al. (1996)] for disambiguation. However, the algorithm requires manual input of two parameters, the neighborhood radius ε and the minimum number of points required to form a dense region, MinPts, which in the case of Moncla et al. are empirically adjusted based on the dataset. Requiring manually defined parameters makes DBSCAN less robust for place graphs, as place graphs can be of various spatial scopes and extents. In contrast, the proposed algorithm is parameter-free.

Figure 7: Comparison of clustering by (a) overall minimal distance, and (b) point density

The clustering algorithm in this research is inspired by Ripley’s $K$ function [Ripley (1976)], which was originally designed to assess the degree of departure of a point set from complete spatial randomness, ranging from spatial homogeneity to a clustered pattern. Spatial randomness is irrelevant in this research, yet detecting point density meets our interest. A distance interval function is developed, as shown in Eq. 2. It computes the overall point density within the region of a given distance interval for all points. Here the count term represents the number of points whose distance from a given point falls within the given distance interval. The denominator of the left side of the function is the area of the region, and the left side part is used for computing point density. A discretization parameter, set to 100 by default, is used for discretizing the function. The original $K$ function can be regarded as a cumulative version of the distance interval function.


The goal of the function is to detect clusters in the input point cloud that have significantly larger point densities, as well as to derive a cluster distance for deriving clusters. As shown in Figure 8, the function value increases sharply at the beginning, indicating at least one cluster. To determine the density threshold, the average density value $\mu$ and the standard deviation $\sigma$ of the function values for all discrete distance intervals are calculated. Then the $3\sigma$ rule is adopted and the density threshold is $\mu + 3\sigma$. The complete disambiguation process using the distance interval function is shown in Algorithm 2.

Figure 8: Deriving the cluster distance based on a distance interval function

Input: AnchorPlacesWithEntries
Output: DisambiguatedAnchorPlaces

PointCloud ← ∅   ▷ all obtained entries as input point cloud
for Entry in AnchorPlacesWithEntries.getAllEntries() do
     PointCloud ← Entry.getCoordinates()
end for
▷ calculate distance interval function
MaxDistance ← maximumPointWiseDistance(AllEntries)
for d in iterate(0, MaxDistance, interval) do   ▷ loop of (min, max, interval)
     Density[d] ← distance interval function (Eq. 2) evaluated at d
end for
Threshold ← average(Density) + 3 · standardDeviation(Density)
▷ determine cluster distance
ArgmaxDistance ← getArgmaxDistance(Density)
SatisfyingDistances ← ∅
for d in range(0, MaxDistance, interval) do
     if Density[d] < Threshold and d > ArgmaxDistance then
          SatisfyingDistances ← d
     end if
end for
ClusterDistance ← min(SatisfyingDistances)
▷ derive all clusters based on the cluster distance and disambiguation
RankedClusters ← rank(computeClusters(ClusterDistance))
for Cluster in RankedClusters do
     for AnchorPlace in AnchorPlacesWithEntries do   ▷ try disambiguation
          for Entry in AnchorPlace.getEntries() do
               if Entry in Cluster then
                   DisambiguatedAnchorPlaces ← (AnchorPlace, Entry)
               end if
          end for
     end for
end for
return DisambiguatedAnchorPlaces
Algorithm 2 Disambiguation using distance interval function

The first part of Algorithm 2 collects the centroid coordinates of all retrieved gazetteer entries for all anchor places as the input for the distance interval function. In the second and third parts of the algorithm, the input point cloud is analyzed using the distance interval function to derive the cluster distance. getArgmaxDistance returns the distance at which the distance interval function reaches its maximum. In the last part, all points within the cluster distance from each other are classified into one cluster (by function computeClusters). Clusters with a single point are discarded, and clusters cannot overlap. All clusters are ranked by the number of points they contain (in decreasing order). Finally, all anchor places are disambiguated by iterating over the ranked clusters. If an anchor place has no entry in the top-ranking cluster, for example, the second-ranking cluster is tested, and so on until an entry is found.
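The steps of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the density of an interval is taken to be the number of point pairs whose distance falls in that interval divided by the area of the corresponding ring, the population standard deviation is used for the threshold, clusters are formed as connected components of the "within cluster distance" relation, and planar coordinates with at least two distinct points are given. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def interval_densities(points, n_intervals=100):
    """Return (interval width, density per distance interval).

    Assumed form: pair count in the interval divided by the ring area.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    max_dist = max(dists)
    if max_dist == 0:
        raise ValueError("need at least two distinct points")
    width = max_dist / n_intervals
    counts = [0] * n_intervals
    for d in dists:
        counts[min(int(d / width), n_intervals - 1)] += 1
    densities = []
    for k, count in enumerate(counts):
        inner, outer = k * width, (k + 1) * width
        densities.append(count / (math.pi * (outer ** 2 - inner ** 2)))
    return width, densities


def cluster_distance(points, n_intervals=100, n_sigma=3):
    """First distance after the density peak where density drops below mean + 3 sigma."""
    width, dens = interval_densities(points, n_intervals)
    threshold = mean(dens) + n_sigma * pstdev(dens)
    peak = max(range(len(dens)), key=dens.__getitem__)
    for k in range(peak + 1, len(dens)):
        if dens[k] < threshold:
            return k * width
    return len(dens) * width


def compute_clusters(points, dist):
    """Group point indices reachable through steps of length <= dist; rank by size."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited if math.dist(points[i], points[j]) <= dist]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
            members.extend(near)
        if len(members) > 1:  # clusters with a single point are discarded
            clusters.append(sorted(members))
    return sorted(clusters, key=len, reverse=True)


def disambiguate(anchor_entries, ranked_clusters):
    """Map each anchor place to its entries in the highest-ranked cluster containing any."""
    result = {}
    for place, entry_ids in anchor_entries.items():
        for cluster in ranked_clusters:
            hits = [i for i in entry_ids if i in cluster]
            if hits:
                result[place] = hits
                break
    return result


# Two spatial foci far away from each other (toy coordinates).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (1000, 1000), (1001, 1000), (1000, 1001)]
clusters = compute_clusters(points, cluster_distance(points))
chosen = disambiguate({"cafe": [0, 5], "library": [6]}, clusters)
```

In this toy input the two foci end up in separate clusters, and the hypothetical place "cafe", which has one candidate entry in each focus, is resolved to the entry in the larger cluster. A place that keeps more than one entry in the same cluster would remain ambiguous.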

The disambiguation result for the sample input place graph is shown in Figure 9. Note that the dashed circular region is for illustration only and does not indicate the actual cluster boundary. Each cluster represents an approximate spatial scope where the original descriptions are geographically embedded, and the minimal bounding box of all the points within the cluster is regarded as the spatial context. The spatial context is used to decide the search space of near for all anchor places within it, as mentioned in Section 3.3.3. A spatial context can vary in size, e.g., several street blocks, or suburb-, city- or state-level.

Figure 9: Example of disambiguated anchor places (Google Maps, 2016)

The robustness of the clustering algorithm will be tested through case studies. The computationally intensive part of Algorithm 2 is calculating the function value for each distance interval, which includes calculating the point-wise distances between all entry coordinates and a binary search of the sorted distance list.

4.1.3 Further disambiguation

It is possible that a derived cluster contains more than one entry for an anchor place. For example, assume the cluster region shown in Figure 9 contains two entries of ‘Kentucky Fried Chicken’. In such a situation further disambiguation is needed. Remaining ambiguous anchor places will be temporarily excluded from the anchor place list, and will be geo-referenced together with the remaining places in the next step through deriving ALRs. An ambiguous anchor place can be detected if it is associated with multiple entries in the DisambiguatedAnchorPlaces list returned by Algorithm 2.

4.2 Geo-referencing places without gazetteered references through best-matching

The process of geo-referencing in this step is illustrated in Algorithm 3. The first part of the algorithm derives an ALR for each of the remaining places based on all the stored spatial relationships with the place as the starting node and an anchor place as the ending node. Candidate gazetteer entries, i.e., all gazetteer entries within the ALR regardless of name match, are then obtained. In the second part of the algorithm, a best-matching process is conducted based on reference similarity as well as spatial similarity. At the end of the matching process, each place will be geo-referenced with the candidate entry with the highest overall similarity score.

Input: DisambiguatedAnchorPlaces, RemainingPlaces, SpatialRelationships
Output: BestMatchedPlaces

for Place in RemainingPlaces do   ▷ derive an ALR and obtain gazetteer entries
     Relationships ← getRelationshipsToAnchorPlaces(Place)
     ALR ← deriveALR(Relationships)
     Entries ← getGazetteerEntriesWithinALR(ALR)   ▷ get entries by region
     ▷ best matching
     HighestSim ← 0
     for Entry in Entries do
         SpatialSim ← calculateSpatialSimilarity(Place, Entry)
         for Reference in getReferences(Place) do   ▷ consider all stored place references
              ReferenceSim ← calculateReferenceSimilarity(Reference, Entry)
              OverallSim ← calculateOverallSimilarity(ReferenceSim, SpatialSim)
              if OverallSim > HighestSim then
                  HighestSim ← OverallSim
                  BestEntry ← Entry
              end if
         end for
     end for
     BestMatchedPlaces ← (Place, BestEntry, HighestSim)
end for
return BestMatchedPlaces
Algorithm 3 Best-matching

Taking the sample graph as an example, nodes b and e have some spatial relationships to anchor places a and c, and the ALRs derived for nodes b and e are illustrated in Figure 10. All gazetteer entries within each of the derived ALRs (shaded regions shown in the figure) are obtained as candidate entries to be matched with. Note that node a is assumed to be geo-referenced by a polygon to allow a search space for inside.

Figure 10: Derived ALRs for node b, e (Google Maps, 2016)

4.2.1 Reference similarity

Reference similarity measures how well a candidate entry matches a place reference, both stringwise and semantically. String similarity measurement, e.g., based on edit-distance, does not consider word semantics, such as abbreviations (e.g., ‘bldg.’ and ‘building’) and words with similar meanings (e.g., ‘woods’ and ‘forest’, ‘department’ and ‘section’). Furthermore, word re-ordering should also be considered (e.g., ‘St Paul’s Cathedral’ and ‘Cathedral of St Paul’s’). Some gazetteers (e.g., OpenStreetMap) also store tagging information associated with each gazetteer entry, e.g., {name: Richard Berry; type: building; organization: University of Melbourne; faculty: science; department: Mathematics and Statistics}. Such tagging information is also useful for entry-matching if available. For instance, the Richard Berry building in the University of Melbourne is often referred to as ‘school of mathematics and statistics’ or simply ‘mathematics building’ instead of by its official name. Thus, the proposed method for reference similarity measurement is shown in Algorithm 4.

Input: PlaceReference, CandidateEntry   ▷ both as string
Input: SemanticSimilarityDictionary   ▷ abbreviations and word-wise semantic similarity
Output: ReferenceSimilarity

PlaceReferenceTokens ← tokenize(PlaceReference)
CandidateEntryTokens ← tokenize(CandidateEntry)
TagRecordTokens ← tokenize(CandidateEntry.getTagValues())   ▷ if available
for PToken in PlaceReferenceTokens do   ▷ find maximum match for each token
     HighestSim ← 0
     for CToken in CandidateEntryTokens do   ▷ compare reference and entry
         TokenSim ← SemanticSimilarityDictionary.getSim(PToken, CToken)
         if TokenSim > HighestSim then
              HighestSim ← TokenSim
         end if
     end for
     for TToken in TagRecordTokens do   ▷ compare reference and tag if available
         TokenSim ← SemanticSimilarityDictionary.getSim(PToken, TToken)
         if TokenSim > HighestSim then
              HighestSim ← TokenSim
         end if
     end for
     TokenSimList ← HighestSim
end for
ReferenceSimilarity ← average(TokenSimList)   ▷ average similarity of token pairs
return ReferenceSimilarity
Algorithm 4 function calculateReferenceSimilarity()

The function SemanticSimilarityDictionary.getSim returns the semantic similarity between two tokens, e.g., ‘department’ and ‘building’, and the function can be implemented using WordNet synsets as lexicons [Miller (1995)]. Some algorithms and implementations already exist (e.g., [Ballatore et al. (2013)]). Common abbreviations are considered as having similarity to the original words. If the similarity value cannot be found in the dictionary for two input tokens, a fall-back measurement is based on Damerau-Levenshtein edit distance similarity [Damerau (1964)].

For example, the reference similarity between ‘Fed Sq.’, as one of the place references for node b, and a gazetteer entry ‘Federation Square’ is calculated by first tokenizing both strings into lists of tokens, and then measuring the similarity of each pair of tokens. Finally, the average of the token-wise similarity scores is returned as the reference similarity, as shown in Figure 11.

Figure 11: Computing the reference similarity between ‘Fed Sq.’ and ‘Federation square’
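The token-wise matching of Algorithm 4 can be sketched as follows. This is only an illustration: the small dictionary stands in for a WordNet-based resource, its entries and values are hypothetical, the fall-back uses the optimal string alignment variant of the Damerau-Levenshtein distance normalized by the longer token length (an assumption), and the scores it produces are not those of Figure 11.

```python
def damerau_levenshtein(a, b):
    """Edit distance with adjacent transpositions (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


# Hypothetical toy dictionary; a real system would derive these from a lexicon.
SEMANTIC_SIM = {
    frozenset({"sq", "square"}): 1.0,
    frozenset({"bldg", "building"}): 1.0,
    frozenset({"woods", "forest"}): 0.8,
}


def tokenize(text):
    """Lower-case whitespace tokens with surrounding punctuation stripped."""
    tokens = (t.strip(".,'").lower() for t in text.split())
    return [t for t in tokens if t]


def token_sim(p, c):
    """Dictionary lookup first, normalized edit-distance similarity as fall-back."""
    if p == c:
        return 1.0
    key = frozenset({p, c})
    if key in SEMANTIC_SIM:
        return SEMANTIC_SIM[key]
    return 1.0 - damerau_levenshtein(p, c) / max(len(p), len(c))


def reference_similarity(place_reference, candidate_name, tag_values=()):
    """Average, over reference tokens, of the best match among entry and tag tokens."""
    candidate_tokens = tokenize(candidate_name)
    for value in tag_values:
        candidate_tokens += tokenize(value)
    scores = [
        max((token_sim(p, c) for c in candidate_tokens), default=0.0)
        for p in tokenize(place_reference)
    ]
    return sum(scores) / len(scores) if scores else 0.0


fed_sq = reference_similarity("Fed Sq.", "Federation Square")
maths = reference_similarity(
    "mathematics building",
    "Richard Berry",
    tag_values=["building", "Mathematics and Statistics"],
)
```

Because each reference token keeps only its best match, word re-ordering does not lower the score, and tag values let ‘mathematics building’ match the Richard Berry entry even though the name itself shares no token with the reference.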

4.2.2 Spatial similarity

All the obtained candidate entries within the derived ALR of a place are considered as satisfying all the stored spatial relationships to the neighboring anchor places with the same degree. Probabilistic uncertainties associated with each spatial relationship have not been discussed, i.e., all locations within an ALR are treated as equally likely. Spatial similarity is defined for measuring how well a gazetteer entry at a certain location satisfies the known spatial relationships, and features considered for measuring spatial similarity include orientation, distance, and topology. For example, if there are two squares as candidate entries that are obtained for the place reference ‘the large square’ with exactly the same reference similarity, these two entries can only be further ranked considering their closeness to the anchor place ‘St Paul’s Cathedral’ given the spatial relationship near in between.

Spatial similarity for a candidate gazetteer entry is computed as shown in Algorithm 5 by the average spatial similarity for all spatial relationships. Orientation and nearness similarities are calculated based on centroids of both the locatum and the relatum as points, while topological similarity is calculated based on polygons (if either the locatum or the relatum is geo-referenced not as a polygon, spatial similarity for this relationship will be ignored, i.e., continue the loop).

Input: Place, CandidateEntry   ▷ one of all candidate entries for Place
Output: SpatialSimilarity

for Relation, AnchorPlace in getRelationshipToAnchorPlaces(Place) do
     ▷ cardinal direction relationship
     if Relation is CardinalDirectionRelationship then
         Sim ← calculateOrientationSim(Relation)
     ▷ topological relationship
     else if Relation is TopologicalRelationship then
         Sim ← calculateTopologicalSim(Relation)
         if Sim = 0 then   ▷ filter out entries that do not satisfy
              return 0   ▷ assign zero spatial similarity to the entry
         end if
     ▷ qualitative distance relationship
     else if Relation is QualitativeDistanceRelationship then
         Sim ← calculateNearnessSim(Relation)
     ▷ relative direction relationship
     else if Relation is RelativeDirectionRelationship then
         NearnessSim ← calculateNearnessSim(Relation)
         OrientationSim ← calculateOrientationSim(Relation)
         Sim ← (NearnessSim + OrientationSim)/2
     end if
     Similarities ← Sim
end for
SpatialSimilarity ← average(Similarities)
return SpatialSimilarity
Algorithm 5 function calculateSpatialSimilarity()

Examples of spatial similarity calculation for three spatial relationships are shown in Figure 12 below. The shaded regions indicate search spaces. Nearness similarity is measured as one minus the ratio of the distance between the centroids of the locatum and the relatum to the buffer distance, and must be between 0.0 and 1.0. Orientation similarity is measured by the angle between the displacement vector from the relatum to the locatum and the direction vector pointing to the true direction (e.g., geographical north for north of). Thus an orientation similarity must also be between 0.0 (at the largest angle allowed by the search space, which differs between north, east, south, west and composite directions such as northeast) and 1.0 (when the two vectors point in the same direction). Topology similarity is measured by whether the geometries of the locatum and the relatum satisfy the given topological relationship, and can be either 0.0 (not satisfied) or 1.0 (satisfied). The topological relationship between two given polygons can be determined using existing algorithms and libraries.

Figure 12: Spatial similarity calculation for near (a), north of (b), and overlap (c)
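The nearness and orientation measures described above can be sketched as follows. This is a minimal illustration under stated assumptions: coordinates are planar with x pointing east and y pointing north, the orientation score decreases linearly with the angle, and the angle at which it reaches zero is passed in as the hypothetical parameter max_angle rather than fixed. The function names are illustrative only.

```python
import math

# Bearing of each direction in degrees, clockwise from north.
BEARINGS = {
    "north": 0.0, "northeast": 45.0, "east": 90.0, "southeast": 135.0,
    "south": 180.0, "southwest": 225.0, "west": 270.0, "northwest": 315.0,
}


def nearness_similarity(locatum, relatum, buffer_distance):
    """One minus centroid distance over buffer distance, clipped at 0."""
    return max(0.0, 1.0 - math.dist(locatum, relatum) / buffer_distance)


def orientation_similarity(locatum, relatum, direction, max_angle):
    """1.0 when the locatum lies exactly in the stated direction, 0.0 at max_angle."""
    dx, dy = locatum[0] - relatum[0], locatum[1] - relatum[1]
    if dx == 0 and dy == 0:
        return 0.0  # direction undefined for coincident centroids (assumption)
    bearing = math.degrees(math.atan2(dx, dy)) % 360.0
    diff = abs(bearing - BEARINGS[direction]) % 360.0
    angle = min(diff, 360.0 - diff)
    return max(0.0, 1.0 - angle / max_angle)


def relative_direction_similarity(nearness, orientation):
    """Mean of the two scores, as in the relative direction branch of Algorithm 5."""
    return (nearness + orientation) / 2


near = nearness_similarity((3, 4), (0, 0), buffer_distance=10)
north = orientation_similarity((10, 10), (0, 0), "north", max_angle=90.0)
```

With these toy inputs, a candidate halfway to the edge of the buffer scores 0.5 for near, and a candidate lying 45 degrees off north scores 0.5 for north of when max_angle is 90 degrees.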

4.2.3 Overall similarity scoring

The overall similarity is calculated by function calculateOverallSimilarity in Algorithm 3 based on Eq. 3. Different weights for reference similarity and spatial similarity will be tested and evaluated in the implementation stage. Table 2 shows an example of calculating the overall similarities for three candidate entries for node b. The candidate entry with the highest score (‘Federation Square’) is used for geo-referencing node b. Similarly, node e (‘Flinders Street Bake House’) can also be geo-referenced.



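| Place to be geo-referenced | Place reference | Candidate entry | Overall similarity |
|---|---|---|---|
| node b | Fed Sq. | Ian Potter Centre | 0.17 |
| node b | Fed Sq. | Federation Square | 0.78 |
| node b | Fed Sq. | Kirra Galleries | 0.20 |
| node b | the large square | Ian Potter Centre | 0.29 |
| node b | the large square | Federation Square | 0.63 |
| node b | the large square | Kirra Galleries | 0.30 |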


Table 2: Example of best-matching for node b based on computed overall similarities
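The scoring and selection step can be sketched as below. The weighted sum with weights adding up to one is an assumption made for illustration (it is consistent with the weight pairs listed in Table 6 but is not a transcription of Eq. 3), and the similarity values in the example are invented, not those behind Table 2.

```python
def overall_similarity(reference_sim, spatial_sim, reference_weight=0.7):
    """Assumed form: convex combination of reference and spatial similarity."""
    spatial_weight = 1.0 - reference_weight
    return reference_weight * reference_sim + spatial_weight * spatial_sim


def best_match(candidates, reference_weight=0.7):
    """Pick the (entry, score) pair with the highest overall similarity.

    candidates: iterable of (entry, reference_sim, spatial_sim) tuples,
    one per combination of stored place reference and candidate entry.
    """
    best_entry, best_score = None, -1.0
    for entry, reference_sim, spatial_sim in candidates:
        score = overall_similarity(reference_sim, spatial_sim, reference_weight)
        if score > best_score:
            best_entry, best_score = entry, score
    return best_entry, best_score


# Illustrative values only.
candidates = [
    ("Ian Potter Centre", 0.1, 0.4),
    ("Federation Square", 0.8, 0.6),
    ("Kirra Galleries", 0.2, 0.3),
]
entry, score = best_match(candidates)
```

Keeping the weight as a single parameter makes it straightforward to sweep the weight settings that are evaluated later in the case studies.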

4.3 Geo-referencing non-gazetteered places

A non-gazetteered place is geo-referenced using its derived ALR. Thus, node f from the sample graph can be geo-referenced as it is known that node f is in front of node e (‘Flinders Street Bake House’), and node e has already been geo-referenced by the last step through best-matching.

With such a representation, the location of the place can further be described using anchoring theory [Galton and Hood (2005)]. Thus, the place can be regarded as anchored to a location just by stating what is known with certainty and leaving the rest for further reasoning. Here the place can be described as anchored in its derived ALR.

For such a non-gazetteered place, if it has richly connected spatial relationships to other anchor places, the derived ALR can be close to its actual location. Therefore, an alternative is to use the centroid of an ALR to represent the approximate location of the place for geo-referencing purposes, and such a point-based representation can be more useful for applications such as geographic information retrieval. However, in this research we do not use the point-based representation as it is over-restricting. In comparison, using ALRs for geo-referencing non-gazetteered places preserves as much information as can be inferred without further generalization.

Currently there is no robust method to automatically distinguish between gazetteered places without gazetteered place references (Situation 2) and non-gazetteered places (Situation 3) other than manual annotation. Therefore, in order to separate them, an overall similarity threshold needs to be defined, i.e., all places geo-referenced by the best-matching process with overall similarities lower than the threshold are classified as non-gazetteered places. Different threshold values will be tested and evaluated in case studies.

5 Implementation and case studies

The geo-referencing approach explained has been implemented in a system written in Python. The Neo4j graph database is used for storing place graphs as well as for querying. Some functions within the geo-referencing process, such as geocoding by gazetteer, reverse-geocoding by region, Damerau-Levenshtein similarity calculation, WordNet-based semantic similarity measurement and geometry computation, are implemented using existing free Python packages.

Three place graphs are tested. The first graph is constructed by Kim et al. [Kim et al. (2016a)] using 44 descriptions (738 triplets in total) submitted by graduate students about the University of Melbourne campus. The other two place description datasets are harvested from web texts [Kim et al. (2015)], and are used to construct a place graph of Santa Fe, New Mexico (218 triplets) and a place graph of Melbourne, VIC (4173 triplets). Gazetteers used include OpenStreetMap, GeoNames and Google Maps.

Places from the three place graphs are manually annotated for evaluation purposes with the labels anchor place (places with at least one gazetteered place reference, Situation 1), gazetteered place (places without gazetteered place references that are nevertheless gazetteered, Situation 2), and non-gazetteered place (Situation 3). Proportions of these places from the three input place graphs are shown in Table 3.


| Place graph | Anchor place | Gazetteered place | Non-gazetteered place |
|---|---|---|---|
| Campus graph | 28.6% | 62.1% | 9.3% |
| Melbourne graph | 14.6% | 71.9% | 13.5% |
| Santa Fe graph | 19.7% | 58.7% | 21.6% |


Table 3: Proportions of places from the three categories from the input place graphs

5.1 Geo-referencing anchor places

Figure 13 shows the results of clustering as well as the disambiguated anchor places for the University of Melbourne campus graph (a), the Melbourne graph (b) and the Santa Fe graph (c). For each of the first two graphs, only one cluster is identified, and the cluster contains all the anchor places from the graph. For the Santa Fe graph, more than five clusters are identified, indicating multiple spatial foci. The top-ranking two clusters shown in Figure 13 (c) are within Santa Fe, New Mexico, US, and Grand Junction, CO, US respectively.

Figure 13: Clustering results and geo-referenced anchor places for the three place graphs

In order to measure how well these anchor places are geo-referenced, the standard evaluation metrics commonly used by toponym resolution studies are applied. Results are shown in Table 4. Precision is computed as the number of correctly geo-referenced anchor places divided by the total number of annotated anchor places. Recall values for anchor places are irrelevant since the same gazetteers are used for annotation as well as geo-referencing.


| Place graph | Campus Graph | Melbourne Graph | Santa Fe Graph |
|---|---|---|---|
| Precision | 93.4% | 87.5% | 91.8% |


Table 4: Precisions for geo-referencing anchor places

5.2 Geo-referencing through best-matching

For evaluating the best-matching process, all the places from the three place graphs that are annotated as gazetteered place are considered. The resulting precisions for the three graphs are shown in Figure 14. Each y-value represents the average precision for places matched with similarities greater than or equal to the x-value. For example, the y-value for the campus graph at 0.8 is 78.6%, indicating that the overall precision for all best-matched places with similarities greater than or equal to 0.8 is 78.6%.

Figure 14: Precisions of best-matching associated with matched similarity
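The y-values of such a curve can be computed as in the following sketch, which assumes each best-matched place is available as a pair of its matched similarity and a flag telling whether the match agrees with the manual annotation. The function name and the data are hypothetical.

```python
def precision_at_or_above(matches, threshold):
    """Precision over matches whose similarity is >= threshold.

    matches: list of (similarity, is_correct) pairs.
    Returns None when no match reaches the threshold.
    """
    kept = [is_correct for similarity, is_correct in matches if similarity >= threshold]
    return sum(kept) / len(kept) if kept else None


# Illustrative data only.
matches = [(0.95, True), (0.85, True), (0.82, False), (0.40, False), (0.30, True)]
curve = {x / 10: precision_at_or_above(matches, x / 10) for x in range(0, 11)}
```

Evaluating the function at a threshold of 0.0 gives the overall precision of the best-matching step, since every matched place is then included.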

Another evaluation metric used here is termed ALR precision, and is defined as the number of places whose derived ALRs cover their corresponding gazetteer entries, divided by the total number of places annotated as gazetteered place. A place cannot be correctly geo-referenced if its derived ALR does not cover the location of the gazetteer entry. Overall precisions and ALR precisions for the three graphs are shown in Table 5.


| Place graph | Precision | ALR precision |
|---|---|---|
| Campus Graph | 42.7% | 83.4% |
| Melbourne Graph | 31.1% | 81.3% |
| Santa Fe Graph | 23.1% | 68.3% |


Table 5: Precisions and ALR precisions for geo-referencing through best-matching

As discussed before, different buffer distances for near may lead to different geo-referencing results. Generally, a larger search space for near is more likely to cover the true gazetteer entries of places to be matched, i.e., increase ALR precision, but at the same time increase the likelihood of getting false positives, i.e., decrease precision. Different parameters in Eq. 1 are tested. However, no significant improvements in precisions and ALR precisions are observed. Reasons will be discussed later.

The best-matching algorithm is determined by both reference similarity and spatial similarity. Different weights are tested, and the overall precisions for the three place graphs are shown in Table 6. Previous results shown in Figure 14 and Table 5 are based on the weights with the highest overall precision. Note that assigning different weights does not affect ALR precision, only precision.


| ReferenceSim weight | SpatialSim weight | Precision |
|---|---|---|
| 1.0 | 0.0 | 26.8% |
| 0.7 | 0.3 | 28.2% |
| 0.5 | 0.5 | 14.8% |
| 0.3 | 0.7 | 8.1% |
| 0.0 | 1.0 | 2.2% |


Table 6: Precisions with different weights for best matching

5.3 Geo-referencing non-gazetteered places

Place references referring to non-gazetteered places have no corresponding gazetteer entries, thus precision is not useful here and only ALR precision is considered. Results for the three input place graphs for non-gazetteered places are shown in Table 7.


| Place graph | Campus Graph | Melbourne Graph | Santa Fe Graph |
|---|---|---|---|
| ALR Precision | 81.4% | 77.5% | 72.1% |


Table 7: ALR precisions for non-gazetteered places

Previous results are computed based on the manually annotated ground-truth labels. As already discussed in Section 4.3, a threshold is necessary in order to automatically classify non-gazetteered places and gazetteered places. The evaluation metric for classification is recall, defined as the proportion of places that are correctly classified for each of the two classes gazetteered place and non-gazetteered place. Applying different threshold values will affect the recall values for both classes.

For example, if the threshold is set to 0.9, then most places here will be classified as non-gazetteered place, and the recall value for gazetteered place will consequently be small. Different thresholds are tested, and the overall recall values for the two classes for the three input graphs are shown in Figure 15.

Figure 15: Recall trade-off between classes gazetteered place and non-gazetteered place with different threshold values
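The threshold-based classification and the per-class recall can be sketched as follows, assuming each place comes with the overall similarity of its best match and its manually annotated label. The label strings, function names and data are hypothetical.

```python
def classify(similarity, threshold):
    """Places matched below the threshold are treated as non-gazetteered."""
    return "gazetteered" if similarity >= threshold else "non-gazetteered"


def recall_per_class(places, threshold):
    """Recall of each annotated class for a given similarity threshold.

    places: list of (best_match_similarity, annotated_label) pairs.
    """
    totals, correct = {}, {}
    for similarity, label in places:
        totals[label] = totals.get(label, 0) + 1
        if classify(similarity, threshold) == label:
            correct[label] = correct.get(label, 0) + 1
    return {label: correct.get(label, 0) / n for label, n in totals.items()}


# Illustrative data only.
places = [
    (0.90, "gazetteered"), (0.75, "gazetteered"), (0.50, "gazetteered"),
    (0.30, "non-gazetteered"), (0.80, "non-gazetteered"),
]
recalls = recall_per_class(places, threshold=0.7)
```

Raising the threshold moves places from the gazetteered class to the non-gazetteered class, so the recall of one class can only improve at the expense of the other, which is the trade-off the figure illustrates.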

6 Discussion

The approach explained above is feasible for geo-referencing all places from everyday place descriptions, and its flexibility and applicability are demonstrated by case studies using place descriptions collected from different sources.

6.1 Comparison and evaluation of the presented methodology

The developed methodology is compared to the existing toponym resolution approaches and engines. However, the comparison is not straightforward since the objectives and tasks are quite different. Previous toponym resolution studies typically focus on gazetteered place references, referring to places of spatial granularities that are coarser (e.g., cities, countries, and geographic features). Such places are important for their objectives such as to determine the spatial foci of text documents for geographic information retrieval purposes, or to geo-reference places from historical document collections. For the task of this research, we are additionally interested in places of finer spatial granularities as well as places with non-gazetteered references. Such places are typically ignored in previous studies. Therefore, places from an input place graph are divided into three categories: anchor places, gazetteered places, and non-gazetteered places, to allow a fair comparison.

The task of geo-referencing anchor places is comparable to existing toponym resolution approaches as anchor places are gazetteered places with gazetteered references. As shown in Table 4, the precisions for this task for the three tested place graphs are approximately 90%. Existing toponym resolution engines, e.g., CLAVIN, STEWARD, or NewsStand, have quite low recall for those anchor places due to the differences in the target corpora and gazetteers used, and thus cannot be compared directly by testing on our datasets. Therefore, we choose to compare our results to the most similar map-based toponym resolution approaches [Habib and van Keulen (2012), Buscaldi and Rosso (2008b), Moncla et al. (2014)], which have already been discussed and compared to our approach in Section 4.1.2; the overall precisions for these approaches range from 45% to 94%. Therefore, it is reasonable to say the approach developed for geo-referencing anchor places is as good as (if not better than) other existing toponym resolution approaches for the task of this research in terms of precision.

As shown in Table 3, the proportion of anchor places in the three datasets is only approximately 10% to 30%, while the remaining places do not have gazetteered references. For gazetteered places without gazetteered references, the overall precision and ALR precision of the proposed approach are shown in Table 5, indicating that the approach is feasible for geo-referencing these places. The ALR precision values are generally much higher than the precision values, which means that most places are covered by their derived ALRs, even if not all of them are successfully geo-referenced by their corresponding gazetteer entries by the best-matching process. Figure 14 shows that the geo-referencing precision is generally higher if the similarity scored by the best-matching process is higher, i.e., places matched with higher overall similarities are more likely to be correctly geo-referenced. For places that are matched with similarity values equal to or greater than 0.9, the overall precisions are approximately 90%. Finally, ALR precision is also used for non-gazetteered places, and results are shown in Table 7. Precision is not used for non-gazetteered places since these places have no corresponding gazetteer entries. The resulting ALR precision values for non-gazetteered places are close to the ALR precisions for gazetteered places for the three input graphs (comparing Table 5 and Table 7). This is because ALRs for places from both categories (gazetteered place and non-gazetteered place) are derived using the same method. Being able to geo-reference places without gazetteered references results in higher recall (and overall higher precision) compared to the results of existing toponym resolution approaches.

6.2 Robustness, uncertainties, and parameter testing

The approach for disambiguating and geo-referencing anchor places is based on a novel density-based clustering algorithm, as explained in Section 4.1.2. Three input place graphs that are of various sizes, different spatial scopes, and are collected from different sources, have been tested. The results reveal that the approach is feasible to geo-reference places from different input graphs. The algorithm is able to identify multiple spatial foci even if they are far away from each other. Taking the result for the Santa Fe place graph as an example (see Figure 13 (c)), multiple clusters are identified as corresponding to spatial foci, and anchor places within these clusters are disambiguated and geo-referenced successfully. The Santa Fe descriptions are harvested from websites and possibly contain places that are not in Santa Fe.

Search spaces and spatial similarity are defined to determine the degree of satisfaction of candidate gazetteer entries given spatial relationships to some already geo-referenced anchor places. Search spaces are defined using existing models (e.g., half-plane models for cardinal direction relationships, and buffered regions for near). Features including distance, orientation and topology are considered for measuring spatial similarity. For example, given a spatial relationship near between a place to be matched and an anchor place, and two candidate gazetteer entries with the same name, the entry that is closer to the anchor place will be assigned a higher spatial similarity. The major uncertainty comes from defining the buffer distance for near, as discussed in Section 3.3.3, and the task remains an unsolved hard problem in relevant research fields such as geographic information retrieval. This is a trade-off problem, i.e., larger buffer distances tend to result in higher ALR precision but at the same time increase the number of false positives. The buffer distance for near in this research is defined considering both the size of the spatial context and the size of the referred relatum, as shown in Eq. 1, in order to capture at least some aspects of context. Different values of the parameters in Eq. 1 are used for tuning in the implementation stage; however, no significant improvement in precision is observed. A possible reason is that a large proportion of anchor places in the used gazetteers are geo-referenced as points, thus relatum sizes make no difference for these anchor places. Also, for two of the three tested graphs, only one spatial context is derived that contains all anchor places, thus the size of the spatial context makes no difference in such situations either.

The best-matching process, as described in Section 4.2, considers reference as well as spatial similarity. Reference similarity is measured by a comprehensive algorithm considering token-wise string and semantic similarities and common abbreviations. Spatial similarity is already discussed in the previous paragraph. Different weights for the two components in order to compute the overall-similarity have been tested for tuning purposes, as shown in Table 6. The result shows that assigning reference similarity with weights around 0.7 gives the highest precision. Assigning reference similarity with a weight of 1.0 still gives quite high precision. In contrast, assigning spatial similarity with a weight of 1.0 results in nearly zero precision. A likely reason is that the obtained gazetteer entries for each place to be matched have already been filtered by spatial relationships, and reference similarity is more effective for further ranking these entries than spatial similarity.

The last step is to identify and separate out non-gazetteered places. Different thresholds have been tested for classification, and the result is shown in Figure 15. The results reveal a trade-off pattern between the recall of gazetteered places and non-gazetteered places. Defining a threshold of around 0.7 results in the optimal situation, i.e., the overall highest recall values for both gazetteered and non-gazetteered places.

6.3 Failure analysis

Three main causes of incorrect geo-referencing have been identified. First, some derived ALRs do not cover the true locations of the corresponding places. This is most likely caused by inappropriate search spaces, e.g., a buffer distance for near that is too small. Incorrect ALRs affect both the best-matching precision (Table 5) for places annotated as gazetteered places (not including anchor places) and the ALR precision (Tables 5 and 7). Second, gazetteered places whose stored place references differ too much from the actual gazetteered names tend to have lower geo-referencing precision, since they are difficult for the best-matching process to match. Third, some place references identified as gazetteered place names are actually not gazetteered. For instance, the place reference ‘Gate 10’, referring to a non-gazetteered entrance of the University of Melbourne, is identified as an anchor place because there is a gazetteer entry with the same name referring to another place. Such cases are rare, and are not handled by the proposed approach since they cannot be identified without further considering the sentence-level context of the original place description.

7 Conclusion

Natural language place descriptions occur in everyday verbal communication as a way of conveying spatial information about places. An important step in utilizing the knowledge contained in such place descriptions is to identify and geo-reference all places referred to. Everyday place descriptions are flexible and vernacular, and often contain place references in the form of synonyms or place categories instead of the official place names stored in a gazetteer. Such place references are not known by a gazetteer, and thus cannot be geo-referenced using current toponym resolution approaches, which are typically based on gazetteered place name matching and disambiguation. In addition, place descriptions may contain places that have vague boundaries and can only be located through additional spatial relationships to other places, or places that are so fine-grained that such features are no longer gazetteered (e.g., rooms). Even if some fine-grained places are gazetteered, they are usually more ambiguous and require approaches other than standard toponym resolution heuristics (e.g., those based on population or jurisdictional containment relationships). Therefore, this research is motivated by the need for a novel approach that overcomes these limitations and is able to geo-reference all places from place descriptions.

This research starts from a place graph which stores place references and spatial relationships extracted from any number of place descriptions, instead of focusing on a single document at a time. A place graph is constructed from spatial triplets extracted from place descriptions, and place references that are synonyms are merged through a comprehensive similarity matching process. The complete methodology for place graph construction is provided in previous research [Kim et al. (2016a), Kim et al. (2016b)]. The merged place references allow some non-gazetteered place references to be linked to gazetteered names, and the stored spatial relationships provide a qualitative reference system for describing places, which can be used to constrain the locations of places.

The proposed geo-referencing approach consists of three main stages. In the first stage, places in an input place graph that have at least some gazetteered place names are identified, disambiguated and geo-referenced. A novel density-based clustering algorithm is developed for this purpose, which is superior to comparable clustering approaches for the task of this research for several reasons (see Section 4.1.2). These places are then labelled as anchor places, and are used in the following stages to help geo-reference the remaining places using spatial relationships. In the second stage, for each of the remaining places, all its stored spatial relationships to the already geo-referenced anchor places are extracted and used to derive an approximate location region that constrains its likely location. Then, a comprehensive best-matching process is conducted by comparing the stored place references with all gazetteer entries found within the derived approximate location region, considering string, semantic, and spatial similarity. In the last stage, places that are non-gazetteered are identified and geo-referenced using their derived approximate location regions.

The developed approach has been tested with several datasets collected from different sources that are of various types (e.g., collected by survey or harvested from websites), sizes (constructed from potentially thousands of triplets), and spatial scopes (e.g., of different spatial granularities or with multiple spatial foci). The implementation was tested with over 5000 triplets in total, and the results show that the approach is feasible and applicable. In order to make a fair comparison to existing toponym resolution approaches, which typically only consider places with gazetteered place names, the evaluation is divided into three parts. For geo-referencing anchor places (places with gazetteered names), the novel approach has approximately 90% precision for all the tested datasets (Table 4). However, the main contribution is that the developed approach is also able to geo-reference the remaining places (as shown in Table 3, such places make up approximately 70% to 85% of the input place graphs) that cannot be resolved using existing toponym resolution approaches, thus increasing overall precision. The results show that approximately 20% to 40% of such places can be correctly geo-referenced (Table 5), and about 60% to 80% of these places are within their approximate location regions derived from spatial relationships (Tables 5 and 7).

There are several adjustable parameters in the presented approach. Different parameter values have been tested in the implementation stage, and the associated uncertainties and influences have been discussed in the discussion section. For instance, defining the search space for near is still an open question in related research fields (see Section 3.3.3) and relies heavily on contextual information and user perceptions. Such knowledge currently cannot be automatically extracted and modelled from place descriptions. In future work, refined search spaces for near can be substituted to increase overall geo-referencing precision.

The applicability of the presented geo-referencing approach depends on the richness of both the input spatial relationships and the place references. A richly-populated place graph tends to result in higher geo-referencing precision. As explained in Section 3.2, an input place graph may be derived from multiple place descriptions, e.g., a place graph of Melbourne can be constructed from hundreds of place descriptions about places in Melbourne. As a place graph gets populated with more place descriptions and spatial relationships, the derived ALRs are expected to become more constrained and closer to the actual footprints, thus increasing the overall geo-referencing precision. Also, more non-gazetteered place references are likely to be merged with gazetteered place names, thus allowing easier geo-referencing.

This research presents a feasible approach to geo-reference all places referred to in everyday place descriptions. The outcome has potential benefits for various areas, including geographic information retrieval, which relies heavily on techniques that are able to automatically geo-reference places from text documents. Another application area is emergency services, whose systems quickly fail when facing vernacular place descriptions with non-gazetteered place references and qualitative spatial relationships. The standard geographic information sources available in such situations (such as national address files) are possibly not detailed enough for localization with regard to vernacular usage or granularity. Furthermore, the presented approach is able to enrich authoritative datasets, such as digital gazetteers and address databases, with people’s local geographic knowledge.


  • Ballatore et al. (2013) Andrea Ballatore, Michela Bertolotto, and David C. Wilson. 2013. Grounding Linked Open Data in WordNet: The Case of the OSM Semantic Network. Lecture Notes in Computer Science, Vol. 7820. Springer, 1–15.
  • Bennett and Agarwal (2007) Brandon Bennett and Pragya Agarwal. 2007. Semantic Categories Underlying the Meaning of ‘Place’. Lecture Notes in Computer Science, Vol. 4736. Springer, 78–95.
  • Buscaldi (2011) Davide Buscaldi. 2011. Approaches to disambiguating toponyms. SIGSPATIAL Special 3, 2 (2011), 16–19.
  • Buscaldi and Rosso (2008a) Davide Buscaldi and Paolo Rosso. 2008a. A conceptual density-based approach for the disambiguation of toponyms. International Journal of Geographical Information Science 22, 3 (2008), 301–313.
  • Buscaldi and Rosso (2008b) Davide Buscaldi and Paolo Rosso. 2008b. Map-based vs. knowledge-based toponym disambiguation. In GIR, Chris Jones and Ross Purves (Eds.). ACM, 19–22.
  • Damerau (1964) Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3 (1964), 171–176.
  • Davies et al. (2009) Clare Davies, Ian Holt, Jenny Green, Jenny Harding, and Lucy Diamond. 2009. User needs and implications for modelling vague named places. Spatial Cognition & Computation 9, 3 (2009), 174–194.
  • Delboni et al. (2007) Tiago M. Delboni, Karla A. V. Borges, Alberto H. F. Laender, and Clodoveu A. Davis. 2007. Semantic expansion of geographic web queries based on natural language positioning expressions. Transactions in GIS 11, 3 (2007), 377–397.
  • DeLozier et al. (2015) Grant DeLozier, Jason Baldridge, and Loretta London. 2015. Gazetteer-Independent Toponym Resolution Using Geographic Word Profiles. In Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI Press.
  • Egenhofer and Franzosa (1991) Max J. Egenhofer and Robert D. Franzosa. 1991. Point-set topological spatial relations. International Journal of Geographical Information Systems 5, 2 (1991), 161–174.
  • Ester et al. (1996) Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, 226–231.
  • Frank (1992) Andrew U Frank. 1992. Qualitative spatial reasoning about distances and directions in geographic space. Journal of Visual Languages & Computing 3, 4 (1992), 343–371.
  • Freksa (1992) Christian Freksa. 1992. Using Orientation Information for Qualitative Spatial Reasoning. Lecture Notes in Computer Science, Vol. 639. Springer, 162–178.
  • Fu et al. (2005) Gaihua Fu, Christopher B Jones, and Alia I Abdelmoty. 2005. Ontology-based spatial query expansion in information retrieval. In On the Move to Meaningful Internet Systems, Robert Meersman and Zahir Tari (Eds.). Springer, Berlin, Heidelberg, 1466–1482.
  • Galton and Hood (2005) Antony Galton and James Hood. 2005. Anchoring: a new approach to handling indeterminate location in GIS. In Spatial Information Theory (Lecture Notes in Computer Science), Anthony Cohn and David Mark (Eds.), Vol. 3693. Springer, 1–13.
  • Garbin and Mani (2005a) Eric Garbin and Inderjeet Mani. 2005a. Disambiguating Toponyms in News. In Human Language Technology Conference. Association for Computational Linguistics, 363–370.
  • Garbin and Mani (2005b) Eric Garbin and Inderjeet Mani. 2005b. Disambiguating toponyms in news. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 363–370.
  • Golledge (1997) Reginald G. Golledge. 1997. Spatial behavior: A geographic perspective. Guilford Press.
  • Goodchild (2007) Michael F. Goodchild. 2007. Citizens as sensors: The world of volunteered geography. GeoJournal 69, 4 (2007), 211–221.
  • Goodchild (2011a) Michael F. Goodchild. 2011a. Formalizing place in geographic information systems. In Communities, Neighborhoods, and Health. Springer, 21–33.
  • Goodchild (2011b) Michael F. Goodchild. 2011b. Formalizing Place in Geographical Information Systems. In Communities, Neighborhoods, and Health: Expanding the Boundaries of Place, L. M. Burton, S. P. Kemp, M.-C. Leung, S. A. Matthews, and D. T. Takeuchi (Eds.). Springer, New York, 21–35.
  • Goodchild and Hill (2008) Michael F. Goodchild and Linda L. Hill. 2008. Introduction to digital gazetteer research. International Journal of Geographical Information Science 22, 10 (2008), 1039–1044.
  • Gouvea et al. (2008) Cleber Gouvea, Luis Fernando Fortes Garcia, Evandro Brasil da Fonseca, and Igor Wendt. 2008. Discovering Location Indicators of Toponyms from News to Improve Gazetteer-Based Geo-Referencing. In Proceedings of the X Brazilian Symposium on GeoInformatics, Marcelo Tilio Monteiro Carvalho, Marcelo Gattass, and Marco Antonio Casanova (Eds.). Pontifícia Universidade Católica do Rio de Janeiro, 51–62.
  • Guo et al. (2008) Qinghua Guo, Y. Liu, and John Wieczorek. 2008. Georeferencing locality descriptions and computing associated uncertainty using a probabilistic approach. International Journal of Geographical Information Science 22, 10 (2008), 1067–1090.
  • Habib and van Keulen (2012) Mena B. Habib and Maurice van Keulen. 2012. Improving toponym disambiguation by iteratively enhancing certainty of extraction. In International Conference on Knowledge Discovery and Information Retrieval, Ana L. N. Fred and Joaquim Filipe (Eds.). SciTePress, Barcelona, Spain, 399–410.
  • Hall and Jones (2008a) Mark M. Hall and Christopher B. Jones. 2008a. Evaluating field crisping methods for representing spatial prepositions. In Proceedings of the 2nd International Workshop on Geographic Information (2008-11-24), Chris Jones and Ross Purves (Eds.). ACM, 9–10.
  • Hall and Jones (2008b) Mark M. Hall and Christopher B. Jones. 2008b. Quantifying spatial prepositions: an experimental study. In 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (2009-03-20), Walid G. Aref, Mohamed F. Mokbel, and Markus Schneider (Eds.). ACM, 62.
  • Herskovits (1985) Annette Herskovits. 1985. Semantics and Pragmatics of Locative Expressions. Cognitive Science 9, 3 (1985), 341–378.
  • Hill (2000) Linda L. Hill. 2000. Core Elements of Digital Gazetteers: Placenames, Categories, and Footprints. Lecture Notes in Computer Science, Vol. 1923. Springer, 280–290.
  • Hill (2006) Linda L. Hill. 2006. Georeferencing: the geographic associations of information. MIT Press, Cambridge, Mass.
  • Jones et al. (2001) Christopher B. Jones, Harith Alani, and Douglas Tudhope. 2001. Geographical Information Retrieval with Ontologies of Place. Lecture Notes in Computer Science, Vol. 2205. Springer, 322–335.
  • Jones and Purves (2008) Christopher B. Jones and Ross S. Purves. 2008. Geographical information retrieval. International Journal of Geographical Information Science 22, 3 (2008), 219–228.
  • Karimzadeh et al. (2013) Morteza Karimzadeh, Wenyi Huang, Siddhartha Banerjee, Jan Oliver Wallgrün, Frank Hardisty, Scott Pezanowski, Prasenjit Mitra, and Alan M. MacEachren. 2013. GeoTxt: A web API to leverage place references in text. In 7th Workshop on Geographic Information Retrieval, Chris Jones and Ross Purves (Eds.). ACM, 72–73.
  • Khan et al. (2013) Arbaz Khan, Maria Vasardani, and Stephan Winter. 2013. Extracting Spatial Information From Place Descriptions. In First ACM SIGSPATIAL International Workshop on Computational Models of Place, Simon Scheider, Benjamin Adams, Krzysztof Janowicz, Maria Vasardani, and Stephan Winter (Eds.). ACM, 62–69.
  • Kim et al. (2015) Junchul Kim, Maria Vasardani, and Stephan Winter. 2015. Harvesting large corpora for generating place graphs. In International Workshop on Cognitive Engineering for Spatial Information Processes (CESIP), Sven Bertel, Peter Kiefer, Alexander Klippel, Simon Scheider, and Tyler Thrash (Eds.), Vol. 12.
  • Kim et al. (2016a) Junchul Kim, Maria Vasardani, and Stephan Winter. 2016a. From descriptions to depictions: A dynamic sketch map drawing strategy. Spatial Cognition & Computation 16, 1 (2016), 29–53.
  • Kim et al. (2016b) Junchul Kim, Maria Vasardani, and Stephan Winter. 2016b. Similarity matching for integrating spatial information extracted from place descriptions. International Journal of Geographical Information Science 1 (2016), 1–25.
  • Leidner (2007) Jochen L. Leidner. 2007. Toponym resolution in text: annotation, evaluation and applications of spatial grounding. SIGIR Forum 41, 2 (2007), 124–126.
  • Li et al. (2006) Y. Li, Nicola Stokes, Alistair Moffat, and Lawrence Cavedon. 2006. Exploring Probabilistic Toponym Resolution for Geographic Information Retrieval. In SIGIR Workshop on Geographic Information Retrieval. ACM Press, 17–21.
  • Lieberman and Samet (2012) Michael D. Lieberman and Hanan Samet. 2012. Adaptive Context Features for Toponym Resolution in Streaming News. In 35th International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR’12. ACM Press, 731–740.
  • Lieberman et al. (2007) Michael D. Lieberman, Hanan Samet, Jagan Sankaranarayanan, and Jon Sperling. 2007. STEWARD: Architecture of a spatio-textual search engine. In 15th Annual ACM International Symposium on Advances in Geographic Information Systems, Hanan Samet, Cyrus Shahabi, and Markus Schneider (Eds.). ACM, Seattle, WA, 186–193.
  • Liu et al. (2014) Fei Liu, Maria Vasardani, and Timothy Baldwin. 2014. Automatic Identification of Locative Expressions from Social Media Text: A Comparative Analysis. In 4th International Workshop on Location and the Web, Dirk Ahlers, Erik Wilde, and Bruno Martins (Eds.). ACM, 9–16.
  • Liu et al. (2009) Yu Liu, Qing H. Guo, John Wieczorek, and Michael F. Goodchild. 2009. Positioning localities based on spatial assertions. International Journal of Geographical Information Science 23, 11 (2009), 1471–1501.
  • Liu et al. (2005) Yu Liu, Xiaoming Wang, Xin Jin, and Lun Wu. 2005. On internal cardinal direction relations. Lecture Notes in Computer Science, Vol. 3693. Springer, Berlin, Heidelberg, 283–299.
  • Miller (1995) George A Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.
  • Moncla et al. (2014) Ludovic Moncla, Walter Renteria-Agualimpia, Javier Nogueras-Iso, and Mauro Gaio. 2014. Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus. In 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Yan Huang, Markus Schneider, Michael Gertz, John Krumm, and Jagan Sankaranarayanan (Eds.). ACM, 183–192.
  • Purves et al. (2007) Ross S. Purves, Paul Clough, Christopher B. Jones, Avi Arampatzis, Benedicte Bucher, David Finch, Gaihua Fu, Hideo Joho, Awase Khirni Syed, Subodh Vaid, and others. 2007. The design and implementation of SPIRIT: a spatially aware search engine for information retrieval on the Internet. International Journal of Geographical Information Science 21, 7 (2007), 717–745.
  • Randell et al. (1992) David A. Randell, Zhan Cui, and Anthony G. Cohn. 1992. A Spatial Logic based on Regions and Connections. In 3rd International Conference on Knowledge Representation and Reasoning. Morgan Kaufmann, 165–176.
  • Raubal (2009) Martin Raubal. 2009. Cognitive engineering for geographic information science. Geography Compass 3, 3 (2009), 1087–1104.
  • Retz-Schmidt (1988) Gudula Retz-Schmidt. 1988. Various views on spatial prepositions. AI Magazine 9, 2 (1988), 95.
  • Ripley (1976) Brian D. Ripley. 1976. The Second-Order Analysis of Stationary Point Processes. Journal of Applied Probability 13, 2 (1976), 255–266.
  • Schlieder (1995) Christoph Schlieder. 1995. Reasoning about ordering. Lecture Notes in Computer Science, Vol. 988. Springer, 341–349.
  • Schlieder et al. (2001) Christoph Schlieder, Thomas J. Vögele, and Ubbo Visser. 2001. Qualitative Spatial Representation for Information Retrieval by Gazetteers. Lecture Notes in Computer Science, Vol. 2205. Springer, 336–351.
  • Silva et al. (2006) Mário J. Silva, Bruno Martins, Marcirio Chaves, Ana Paula Afonso, and Nuno Cardoso. 2006. Adding geographic scopes to web resources. Computers, Environment and Urban Systems 30, 4 (2006), 378–399.
  • Skoumas et al. (2016) Georgios Skoumas, Dieter Pfoser, Anastasios Kyrillidis, and Timos Sellis. 2016. Location Estimation Using Crowdsourced Spatial Relations. ACM Transactions on Spatial Algorithms and Systems 2, 2 (2016), 5.
  • Smart et al. (2010) Philip D. Smart, Christopher B. Jones, and Florian A. Twaroch. 2010. Multi-source Toponym Data Integration and Mediation for a Meta-Gazetteer Service. Lecture Notes in Computer Science, Vol. 6292. Springer, Berlin, Book section 17, 234–248.
  • Smith and Crane (2001) David A. Smith and Gregory Crane. 2001. Disambiguating geographic names in a historical digital library. Lecture Notes in Computer Science, Vol. 2163. Springer, 127–136.
  • Smith and Mann (2003) David A Smith and Gideon S Mann. 2003. Bootstrapping toponym classifiers. In HLT-NAACL 2003 Workshop on Analysis of Geographic References, Vol. 1. Association for Computational Linguistics, 45–49.
  • Teitler et al. (2008) Benjamin E. Teitler, Michael D. Lieberman, Daniele Panozzo, Jagan Sankaranarayanan, Hanan Samet, and Jon Sperling. 2008. NewsStand: a new view on news. In 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (2009-03-20), Walid G. Aref, Mohamed F. Mokbel, and Markus Schneider (Eds.). ACM, 18.
  • Tuan (1977) Yi-Fu Tuan. 1977. Space and Place : the perspective of experience. University of Minnesota Press, Minneapolis.
  • Vasardani et al. (2013a) Maria Vasardani, Sabine Timpf, Stephan Winter, and Martin Tomko. 2013a. From Descriptions to Depictions: A Conceptual Framework. Lecture Notes in Computer Science, Vol. 8116. Springer, 299–319.
  • Vasardani et al. (2013b) Maria Vasardani, Stephan Winter, and Kai-Florian Richter. 2013b. Locating place names from place descriptions. International Journal of Geographical Information Science 27, 12 (2013), 2509–2532.
  • Wieczorek et al. (2004) John Wieczorek, Qinghua Guo, and Robert Hijmans. 2004. The point-radius method for georeferencing locality descriptions and calculating associated uncertainty. International Journal of Geographical Information Science 18, 8 (2004), 745–767.
  • Winter et al. (2016) Stephan Winter, Timothy Baldwin, Jochen Renz, Martin Tomko, and Werner Kuhn. 2016. Place knowledge as a trans-disciplinary research challenge for Geographic Information Science. In UCGIS Symposium, Jeremy Mennis (Ed.).
  • Winter et al. (2010) Stephan Winter, Rohan Bennett, Marie Truelove, Abbas Rajabifard, Matt Duckham, Allison Kealy, and Joe Leach. 2010. Spatially enabling ‘Place’ information. In Spatially Enabling Society: Research, Emerging Trends, and Critical Assessment, Abbas Rajabifard (Ed.). GSDI Association.
  • Winter and Freksa (2012) Stephan Winter and Christian Freksa. 2012. Approaching the notion of place by contrast. Journal of Spatial Information Science 5, 1 (2012), 31–50.
  • Worboys (2001) Michael F. Worboys. 2001. Nearness relations in environmental space. International Journal of Geographical Information Science 15, 7 (2001), 633–651.
  • Zhang et al. (2012) Xiao Zhang, Baojun Qiu, Prasenjit Mitra, Sen Xu, Alexander Klippel, and Alan M. MacEachren. 2012. Disambiguating Road Names in Text Route Descriptions using Exact-All-Hop Shortest Path Algorithm. In ECAI (Frontiers in Artificial Intelligence and Applications), Luc De Raedt, Christian Bessière, Didier Dubois, Patrick Doherty, Paolo Frasconi, Fredrik Heintz, and Peter J. F. Lucas (Eds.), Vol. 242. IOS Press, 876–881.