
Disambiguating fine-grained place names from descriptions by clustering

08/17/2018
by   Hao Chen, et al.
The University of Melbourne

Everyday place descriptions often contain place names of fine-grained features, such as buildings or businesses, that are more difficult to disambiguate than names referring to larger places, for example cities or natural geographic features. Fine-grained places are often significantly more frequent and more similar to each other, and disambiguation heuristics developed for larger places, such as those based on population or containment relationships, are often not applicable in these cases. In this research, we address the disambiguation of fine-grained place names from everyday place descriptions. For this purpose, we evaluate the performance of different existing clustering-based approaches, since clustering approaches require no knowledge other than the locations of ambiguous place names. We consider not only approaches developed specifically for place name disambiguation, but also clustering algorithms developed for general data mining that could potentially be leveraged. We compare these methods with a novel algorithm, and show that it outperforms the other algorithms in terms of disambiguation precision and distance error over several tested datasets.


1 Introduction

Everyday place descriptions are a way of encoding and transmitting spatial knowledge about places between individuals (Vasardani et al, 2013a, b). In addition, the web provides a plethora of place descriptions such as news articles, social media texts, trip guides, and tourism articles (Kim et al, 2015). An example of a place description from the web is shown in Figure 1. To utilize the expressed place-related knowledge in information systems, the place names need to be identified and georeferenced (located). A typical approach deploys place name gazetteers: directories of known names and their locations. Since many place names are ambiguous – with multiple gazetteer entries – the approach also includes a disambiguation process. The whole process consists of two steps: place name recognition (from text) and place name disambiguation, and is often called toponym resolution (Leidner, 2008). This research focuses on the second challenge, i.e., place name disambiguation, with everyday place descriptions as the target document source.

Figure 1: An example of a short description about Federation Square, a landmark in Melbourne, with several place names being mentioned (Source: http://www.travelandleisure.com/travel-guide/melbourne/things-to-do/federation-square).

Everyday place descriptions often contain place names of fine-grained features (e.g., names of streets, buildings, and local points of interest). Most studies in the field of toponym resolution focus on larger geographic features such as populated places (e.g., cities or towns) or natural geographic features (e.g., rivers or mountains). For these features, disambiguation heuristics can leverage the size, population, or containment relationships of candidate places, possibly based on external knowledge bases (e.g., WordNet or Wikipedia). Such heuristics quickly fail when dealing with the fine-grained places in everyday place descriptions. Fine-grained places are often significantly more frequent and more similar to each other than those larger (natural or political) gazetteered places. Even disambiguation approaches based on machine-learning techniques are difficult to apply to fine-grained places, due to the lack of good-quality training data, as well as the challenge of locating previously-unseen place names.

In this research, we use map-based clustering approaches that have been developed for place name disambiguation. Map-based approaches should be relatively robust for fine-grained places as they only require knowledge of the locations of ambiguous candidate entries. However, it remains to be seen whether these algorithms are suitable for the task of this research. Some of them are defined for large geographic features and may not perform equally well on fine-grained places. Some algorithms are parameter-sensitive and require manual input, and thus substantial pre-knowledge of the data. Therefore, we also consider more generic clustering algorithms from fields such as statistics, pattern recognition, and machine learning. In particular, we compare existing clustering algorithms with a novel algorithm that is designed to be robust, parameter- and granularity-independent. We will show that the new algorithm, despite being parameter-independent, achieves state-of-the-art disambiguation precision and minimum distance error for several tested datasets.

The contributions of this paper are:

  1. a comparison of different clustering algorithms for disambiguating fine-grained place names extracted from everyday place descriptions;

  2. an in-depth analysis of algorithms from five categories (ad-hoc, density-based, hierarchical-based, partitioning relocation-based, and others) in terms of performance, the reasons behind it, and the relative suitability of each for the task; and

  3. a new clustering algorithm which out-performs the other tested algorithms for the collected datasets.

Accordingly, compared to existing algorithms, the advantages of the new algorithm are:

  1. it does not require manual input of parameter values and works well for data with different contexts, i.e., different sizes of spatial coverage, distances between places, and levels of granularity (parameter-independence);

  2. it achieves the highest average disambiguation precision and has overall minimal distance errors for the tested datasets, compared to other algorithms even with their best-performing parameter values. Note that these values are typically hard to determine without pre-knowledge of the data; and

  3. its performance is robust for descriptions with different contexts. Compared to other algorithms, it has low variation in both precision and distance error for different input data.

The remainder of the paper is structured as follows: in Section 2 a review of relevant clustering algorithms is given. Section 3 proposes a new algorithm. Section 4 explains the input dataset as well as the experiment. Section 5 presents the obtained results as well as the corresponding discussions. Section 6 concludes this paper.

2 Related work

In this section, related work on disambiguating place names from text, as well as relevant clustering algorithms, is introduced.

2.1 Place name disambiguation

Place name disambiguation, also known as toponym disambiguation, is the task of disambiguating a place name with multiple corresponding gazetteer entries. For example, GeoNames (http://www.geonames.org/) lists 14 populated places named ‘Melbourne’ world-wide. Various approaches have been proposed in past years, mainly in the context of Geographic Information Retrieval (GIR), in order to georeference place names in text or to geotag whole documents. Typically, place name disambiguation is done by considering context place names, i.e., other place names occurring in the same document (discourse), and computing the likelihood that each candidate gazetteer entry corresponds to the place name in question. The likelihood is computed as a score given some available knowledge of the context place names as well as the place name to be disambiguated, such as their locations or spatial containment relationships. For example, if ‘Melbourne’ and ‘Florida’ occur together in a document, then the place name ‘Melbourne’ is more likely to correspond to the gazetteer entry ‘Melbourne, Florida, United States’ than to ‘Melbourne, Victoria, Australia’. There are also more recent language modeling approaches based on machine-learning techniques that consider not only context place names but also other non-geographical words (e.g., Cheng et al, 2010; Roller et al, 2012; Wing and Baldridge, 2014). Many geotagging systems – systems that determine the geo-focus of an entire document for geographic information retrieval purposes (e.g., Teitler et al, 2008; Lieberman et al, 2007) – heavily rely on place name recognition and disambiguation.

Depending on the knowledge used, disambiguation approaches can generally be classified into map-, knowledge-, and machine learning-based (Buscaldi, 2011). Map-based approaches rely mainly on the locations of the gazetteer entries of place names from a document, and use heuristics such as minimum point-wise distance, minimum convex hull, or closeness to the centroid of all entry locations for disambiguation (e.g., Smith and Crane, 2001; Amitay et al, 2004). Previous studies that focus on disambiguating fine-grained places (e.g., Derungs et al, 2012; Moncla et al, 2014; Palacio et al, 2015) are largely based on map-based approaches as well. Knowledge-based methods leverage external knowledge of places, such as containment relationships, population, or prominence (e.g., Buscaldi and Rosso, 2008a; Adelfio and Samet, 2013). Machine learning-based approaches have the advantage of using non-geographical context words such as events, person names, or organization names to assist disambiguation, by creating models from training data that represent the likelihood of seeing each of these context words associated with particular places (Smith and Mann, 2003; Roberts et al, 2010). The selection of the disambiguation approach is usually task- and data source-dependent (Buscaldi, 2011), and it is also common for different approaches to be used in a hybrid manner.

2.2 Relevant clustering algorithms

Clustering is a division of data into meaningful groups of objects. A variety of algorithms exist; a review of clustering algorithms for data mining is given by Berkhin (2006). In this section, we introduce clustering algorithms from two categories: those that have been used for place name disambiguation before (including ad-hoc ones), and selected ones from the data mining community. These algorithms will be compared to the newly developed algorithm later in this paper. For the task of place name disambiguation, the input to these algorithms is the locations of all ambiguous candidate gazetteer entries of all place names from a document, in the form of a point cloud.

2.2.1 Clustering algorithms used for place name disambiguation

The Overall minimum distance heuristic aims at selecting gazetteer entries so that they are as geographically close to each other as possible. The closeness is typically measured either by the average location-wise distance or by the area of the convex hull of these locations. An illustration of the algorithm is given in Figure 2 (left): for each combination of ambiguous place name entries (one entry for each place name), create a cluster; then choose the minimal cluster, according to one of the measures, as the disambiguated locations. This algorithm has been used in (Leidner et al, 2003; Amitay et al, 2004; Habib et al, 2012) and generates only one cluster.
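A minimal sketch of this heuristic, assuming projected coordinates in meters (so Euclidean distances are meaningful); the function and variable names are illustrative, not the implementation of the cited papers:

from itertools import product, combinations
from math import hypot

def overall_minimum_distance(candidates):
    # candidates: dict mapping place name -> list of (x, y) candidate entry locations
    names = list(candidates)
    best_choice, best_score = None, float("inf")
    # enumerate every combination of one candidate entry per place name (one "cluster" each)
    for combo in product(*(candidates[n] for n in names)):
        pairs = list(combinations(combo, 2))
        if not pairs:
            continue
        # average pairwise distance as the closeness measure (convex hull area is an alternative)
        score = sum(hypot(a[0] - b[0], a[1] - b[1]) for a, b in pairs) / len(pairs)
        if score < best_score:
            best_choice, best_score = combo, score
    return {} if best_choice is None else dict(zip(names, best_choice))

Because the sketch enumerates every combination of candidate entries, its cost grows exponentially with the number of place names, which is the scalability drawback noted in Section 5.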

The centroid-based heuristic is explained in Figure 2 (right). The algorithm first computes the geographic focus (centroid) of all ambiguous entry locations and calculates the distance of each entry location to it. Then, two standard deviations of the calculated distances are used as a threshold to exclude entry locations that are too far away from the centroid. Next, the centroid is recalculated based on the remaining entry locations. Finally, for each place name, the entry that is closest to the centroid is selected for disambiguation. The algorithm is used in (Smith and Crane, 2001; Buscaldi and Rosso, 2008b) and also derives only one cluster.
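A sketch of the centroid-based heuristic under the same assumptions (projected coordinates; names illustrative); the literal two-standard-deviation cutoff below is one reading of the description above and may differ from the cited implementations:

import numpy as np

def centroid_heuristic(candidates):
    # candidates: dict mapping place name -> list of (x, y) candidate entry locations
    points = np.asarray([p for entries in candidates.values() for p in entries], dtype=float)
    centroid = points.mean(axis=0)
    dists = np.linalg.norm(points - centroid, axis=1)
    # exclude entry locations farther than two standard deviations of the distances, then recompute
    keep = dists <= 2 * dists.std()
    centroid = points[keep].mean(axis=0)
    result = {}
    for name, entries in candidates.items():
        entries = np.asarray(entries, dtype=float)
        d = np.linalg.norm(entries - centroid, axis=1)
        result[name] = tuple(entries[d.argmin()])  # entry closest to the refined centroid
    return result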

The Minimum distance to unambiguous referents heuristic consists of two steps. It first identifies unambiguous place names, i.e., place names with only one gazetteer entry, or place names that can be easily disambiguated based on some heuristics (e.g., when the method is used in conjunction with knowledge-based methods). Then, a scoring function is used for the disambiguation of the remaining ambiguous entries, for example based on the average minimum distance to the unambiguous entry locations, or a weighted average distance considering the number of occurrences in the document or the textual distance. The method appears in (Smith and Crane, 2001; Buscaldi and Magnini, 2010) and again generates one cluster.

Figure 2: Clustering by overall minimum distance (left), and clustering by closeness to the centroid of all locations (right). Each symbol (other than the yellow one) represents the location of an ambiguous gazetteer entry of a place name.

The DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) is a density-based method that relies on two parameters: the neighborhood distance threshold ε, and the minimum number of points to form a cluster, MinPts. There is no straightforward way to fit these parameters without pre-knowledge of the data. Moncla et al (2014) use DBSCAN for the purpose of place name disambiguation; the parameters in their case were empirically adjusted, since the authors had a good understanding of the spatial coverage of their input data (hiking itineraries). A heuristic for estimating the parameter values based on the k-dist graph (a line plot of the distances to the k-th nearest neighbor of each point) is proposed in the original DBSCAN paper (Ester et al, 1996). However, it is not trivial to detect the threshold, which requires a selection of the value k as well as knowledge of the percentage of noise within the data.
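For illustration, DBSCAN with an automatically estimated ε can be run on a candidate point cloud with scikit-learn; the 2σ-based estimate below mirrors the rule adopted for the k-dist comparison in Section 4.2 and is an assumption, since Ester et al (1996) instead suggest reading the ‘knee’ of the sorted k-dist plot:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def dbscan_with_kdist(points, k=4, min_pts=4):
    # points: (n, 2) array of projected candidate entry locations in meters
    points = np.asarray(points, dtype=float)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    dists, _ = nn.kneighbors(points)
    k_dist = np.sort(dists[:, -1])           # distance of each point to its k-th nearest neighbor
    eps = k_dist.mean() + 2 * k_dist.std()   # automatic threshold (illustrative)
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(points)
    return labels, eps                        # label -1 marks noise points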

2.2.2 General clustering algorithms for data mining

This section introduces clustering algorithms from four groups: density-based, hierarchical-based, partitioning relocation-based, and uncategorized ones.

Using DBSCAN requires a-priori knowledge of the input data to determine the parameters. Some data, such as the everyday descriptions in this research, have potentially varying conversational contexts, and thus varying distances between the places mentioned. The algorithm OPTICS (Ordering Points To Identify the Clustering Structure) (Ankerst et al, 1999) addresses the problem by building an augmented ordering of the data which is consistent with DBSCAN but covers a spectrum of all different ε values. The OUTCLUST algorithm exploits local density to find clusters that mostly deviate from the overall population (clustering by exceptions) (Angiulli, 2006), given k, the number of nearest neighbors for computing local densities, and f, a frequency threshold for detecting outliers.

Hierarchical clustering algorithms typically build cluster hierarchies and flexibly partition data at different granularity levels. The main disadvantage is the vagueness of when to terminate the iterative process of merging or dividing subclusters. CURE (Clustering Using REpresentatives) (Guha et al, 1998) samples an input dataset and uses an agglomeration process to produce the requested number of clusters. CHAMELEON (Karypis et al, 1999) leverages a dynamic modelling method for cluster aggregation based on a k-nearest-neighbor connectivity graph. HDBSCAN (Campello et al, 2013) extends DBSCAN by excluding border points from clusters and follows the definition of density levels.

Partitioning relocation clustering divides data into several subsets, and certain greedy heuristics are then used for iterative optimization. The KMeans algorithm (Hartigan and Wong, 1979) divides the data into k clusters using random initial samples and an iterative process that updates the cluster centroids until convergence. A Gaussian Mixture Model (GMM) (Celeux and Govaert, 1992) attempts to find a mixture of probability distributions that best models the input dataset, through methods such as the Expectation-Maximization (EM) algorithm. KMeans is often regarded as a special case of GMM.

There are other algorithms that do not belong to the previous three categories. The SNN (Shared Nearest Neighbours) algorithm (Ertöz et al, 2003) blends a density-based approach by first constructing a linkage matrix representing the similarity, e.g., distance, among shared nearest neighbors based on k-nearest neighbors (KNN); the remaining part of the algorithm is similar to DBSCAN. Spectral clustering relies on the eigenvalues of the similarity matrix (e.g., KNN) of the data and partitions the data into the required number of clusters. Compared to KMeans, spectral clustering cares about connectivity instead of compactness (e.g., geometrical proximity). Kohonen’s Self-Organizing Map (SOM) (Kohonen, 1998) is an artificial neural network-based clustering technique applying competitive learning on a grid of neurons. It is able to perform dimensionality reduction and map high-dimensional data to a (typically) two-dimensional representation.
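To make the comparison concrete, several of these general-purpose algorithms can be applied to the same candidate point cloud through scikit-learn; the sketch below is illustrative only, and the fixed number of clusters is an assumption (Section 5 discusses why it is hard to choose in advance):

import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.mixture import GaussianMixture

def cluster_candidates(points, n_clusters=10):
    # points: (n, 2) array of candidate entry locations
    points = np.asarray(points, dtype=float)
    return {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10).fit_predict(points),
        "gmm": GaussianMixture(n_components=n_clusters).fit_predict(points),
        "spectral": SpectralClustering(n_clusters=n_clusters,
                                       affinity="nearest_neighbors").fit_predict(points),
    }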

3 A new robust, parameter-independent algorithm

The task of this research is the following: given a place description with gazetteered place names extracted, each name has a set of (one or more) corresponding gazetteer entries that it can be matched to. In order to disambiguate each place name and link it to the entry that it actually refers to, clustering algorithms can be used either to minimize the geographic distances between the disambiguated entries according to some objective function (e.g., minimal average pairwise distance), or to derive high-density clusters that are likely to represent the geographic extents where the original descriptions are embedded. The input to such a clustering algorithm is a 2-dimensional point cloud with the locations of all ambiguous entries.

The task is to select clusters by these objectives rather than to classify the input data into several clusters. Such clusters are then used for disambiguation, since they are expected to capture the true entries that the place names actually refer to. Points not captured by these clusters are regarded as noise. Therefore, certain clustering algorithms seem more suitable for this task than others, e.g., DBSCAN over KMeans. Furthermore, algorithms that are not parameter-sensitive, or that require no parameters, are preferable, as place descriptions may have various spatial coverages, distances between places, and levels of granularity, and thus no pre-knowledge can be assumed. In this section, we propose a novel density-based clustering algorithm, DensityK. The algorithm is robust, parameter-independent, and consists of three steps.

3.1 Step one: computing point-wise distance matrix

In the first step, the algorithm computes all point-wise distances of the input point cloud; the time complexity is O(n²), where n is the number of input points. The computation can be roughly halved with a distance dictionary that avoids re-computing symmetric pairs, at the cost of O(n²) memory. The worst-case time complexity is equal to that of DBSCAN, when both are considered without any indexing mechanism for neighborhood queries. In practice, DBSCAN is expected to be faster, since it requires a predefined distance threshold and only considers point-wise distances at or below that value. With an index, e.g., an R-tree, the computation time can be reduced. O(n²) is also the worst-case time complexity of other algorithms that require computing neighborhood distances, e.g., OUTCLUST, SNN, and HDBSCAN. Still, a distance upper-bound value can be enforced for DensityK as an optional parameter to reduce processing time, with an indexing approach similar to DBSCAN.
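A sketch of this first step with SciPy, assuming projected coordinates in meters; the optional upper bound mirrors the optional parameter mentioned above:

import numpy as np
from scipy.spatial.distance import pdist

def pointwise_distances(points, upper_bound=None):
    # points: (n, 2) array of candidate entry locations; returns the sorted n(n-1)/2 distances
    d = np.sort(pdist(np.asarray(points, dtype=float)))
    if upper_bound is not None:
        d = d[d <= upper_bound]   # drop pairs beyond the optional distance upper bound
    return d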

3.2 Step two: deriving cluster distance

In the second step, DensityK analyzes the computed point-wise distances and derives a cluster distance automatically. The cluster distance is similar to the ε parameter in DBSCAN and will be used in the next step for generating clusters.

First, a DensityK function is computed from the point-wise distances obtained in the first step, as shown in Function 1. K(d) represents the average point density, over all points, within an annular search region defined by the distance interval (d − Δd, d]. The reason for applying an annular search region for computing point density, instead of a circular region (i.e., the interval [0, d]), is that we found the former leads to better clustering results. A comparison of the two search regions is given later in this section. In Function 1, the count term represents the number of points that are at a distance between d − Δd and d (including d) from point p. If there is no point within the search regions of any point for a distance interval (d − Δd, d], the algorithm skips to the next interval (d, d + Δd]. Thus, K(d) is always positive. The denominator of the function is the area of the annular region. Δd discretizes the function and is set to 100 (meters) in this research; the resulting cluster distance threshold is an integer multiple of Δd. We demonstrate below in this section that the clustering result is largely insensitive to the value of Δd.

K(d) = (1 / n) · Σ_{p ∈ P} |{ q ∈ P : d − Δd < dist(p, q) ≤ d }| / ( π (d² − (d − Δd)²) )    (1)

The approach is inspired by Ripley’s K function (Ripley, 1976), which was originally designed to assess the degree of departure of a point set from complete spatial randomness, ranging from spatial homogeneity to a clustered pattern. Ripley’s K function cannot be used to derive clusters or cluster distances, yet the idea of relating point density to a distance threshold meets our interest. The goal of this research is to derive a cluster distance threshold that leads to clusters with significantly large point densities. DensityK is a new algorithm with a different purpose than Ripley’s K function, but Ripley’s K function can be regarded as a cumulative version of the DensityK function. If the point-wise distances from the last step are sorted, the DensityK function can be computed in a single pass over them, with at most one comparison per distance across the different values of d.

The function reveals the values of d with significantly large point densities. Two illustrative examples are given in Figure 3 (a) and (b) with different input data. For each of the two sample functions, K(d) starts at a non-zero value for the first d of 100 (the value of Δd), which means there are some points within 100 m of other points in the input point cloud. As d grows, the value of K(d) continues to decrease. For different input data, it is also possible that K(d) starts from a low value and then increases until a maximum is reached, after which it starts to decrease again.

Figure 3: Two example DensityK functions from different input data with cluster distance highlighted (a, b), and comparisons of DensityK functions generated based on annular and circular search regions for the same data as in (a) and (b) respectively (c, d).

Next, the mean and standard deviation of all K(d) values (a finite set, since the function is discretized by Δd) are calculated. Then, the 2σ rule is applied: the minimum value of d whose density is no longer significantly large, i.e., the smallest d with K(d) not exceeding the mean by more than two standard deviations, is selected as the cluster distance. The derived cluster distances are also shown in Figure 3 (a), (b). Intuitively, the cluster distance is the value of d at the ‘valley’ of a DensityK function – a visually identifiable (at least roughly) x-value where the pace of decrease of K(d) changes dramatically, leading to values close to zero. It is found that the resulting cluster distances always sit somewhere at the ‘valley’ of the functions (in terms of K(d) values) for different input data, and the clusters derived afterwards match quite well the actual spatial contexts (the spatial extents where the descriptions are embedded).

A comparison of annular and circular search regions (the latter obtained by replacing the interval (d − Δd, d] by [0, d] in Function 1) is shown in Figure 3 (c) and (d), with the same input data as in (a) and (b), respectively. When tested on sample data, it is found that with annular regions the derived clusters are always more constrained (as the computed cluster distances are smaller) and closer to the actual spatial contexts than those derived with circular regions. Such more constrained clusters are preferred, as they are more likely to exclude ambiguous entries, and they are found to lead to higher disambiguation precision on the tested data as well. This is most likely because, with annular regions, the DensityK function is more sensitive to changes in local density. In comparison, circular regions result in smoother density functions and possibly much larger derived cluster distances.

The DensityK function is largely insensitive to the value of Δd. As shown in Figure 4, the DensityK function plots generated for the same input data with three different Δd values (100, 250, and 500) are similar, and the cluster distances derived are the same. Δd should be set to a constant, small number (e.g., the values in Figure 4) for all input data, purely for the purpose of discretization. Such a small number works well for various input data, even those with large cluster distances. Note that there is no single optimal cluster distance for disambiguation. For example, two different cluster distances may lead to the same disambiguation result for a given input, while a third cluster distance for the same input may increase or reduce the disambiguation precision, depending on the distances between the actual locations of the place names.

Figure 4: DensityK functions generated with three different intervals Δd for the same input point cloud: 100, 250, 500 (meters).

Algorithm 1 details the whole procedure of this step, with the sorted point-wise distances from the last step as input. The first part of the algorithm computes K(d) for the different values of d and stores tuples of (d, K(d)) in the list variable KFunction. Then, the cluster distance is derived from KFunction.

Input: PointWiseDistances: a sorted list of distance floats in meters
Output: ClusterDistance: a float in meters

1: KFunction ← an empty list of 2-element tuples
2: MaxDistance ← maxValue(PointWiseDistances)
3: NumDistances ← length(PointWiseDistances)
4: for d in iterate(Δd, MaxDistance, Δd) do    ▷ loop of (min, max, interval)
5:     PointCountInRadius ← 0
6:     for distance in PointWiseDistances do
7:         if d − Δd < distance ≤ d then
8:             PointCountInRadius += 1
9:         end if
10:     end for
11:     if PointCountInRadius > 0 then
12:         Area ← π(d² − (d − Δd)²)
13:         Density ← PointCountInRadius / (NumDistances × Area)
14:         KFunction.append((d, Density))    ▷ Function 1
15:     end if
16: end for
17:
18: Densities ← getDensities(KFunction)
19: (Mean, Std) ← getMeanAndStd(Densities)
20: DensityThreshold ← Mean + 2 × Std
21: ClusterDistance ← getCorrespondingDistance(KFunction, DensityThreshold)
22: return ClusterDistance
Algorithm 1: Computing the cluster distance threshold.
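The following Python sketch mirrors Algorithm 1 under the reconstruction of Function 1 given above; the normalization constant and the exact form of the 2σ selection are assumptions and may differ in detail from the authors’ implementation:

import numpy as np

def densityk_cluster_distance(sorted_dists, delta=100.0):
    # sorted_dists: sorted 1-D array of point-wise distances in meters; delta: the interval Δd
    sorted_dists = np.asarray(sorted_dists, dtype=float)
    num_dists = len(sorted_dists)
    ds, densities = [], []
    d = delta
    while d <= sorted_dists[-1] + delta:
        # distances falling in the annular interval (d - delta, d]
        count = np.count_nonzero((sorted_dists > d - delta) & (sorted_dists <= d))
        if count > 0:
            area = np.pi * (d ** 2 - (d - delta) ** 2)    # area of the annular region
            densities.append(count / (num_dists * area))  # Function 1, up to a constant factor
            ds.append(d)
        d += delta
    densities = np.asarray(densities)
    threshold = densities.mean() + 2 * densities.std()    # the 2σ rule
    for d_i, k_i in zip(ds, densities):
        if k_i <= threshold:                               # smallest d no longer significantly dense
            return d_i
    return ds[-1]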

3.3 Step three: deriving clusters and disambiguation

The procedure for deriving clusters is similar to DBSCAN: points that are within the cluster distance threshold of each other are merged into the same cluster. The last step is to assign each place name a location for disambiguation. To do so, the derived clusters are ranked by the number of points they contain, in descending order. Then, for each place name, the entry that first appears in one of the clusters according to this ranking is chosen; the first cluster in which an entry appears is called the top-cluster for this place name. For example, if an entry of a place name appears in the cluster with the largest number of points, that entry will be selected for disambiguation. If no corresponding entry of the place name is found in the first cluster, the next cluster is checked, until one entry is found. Thus, the worst-case time complexity of this step is proportional to the number of place names times the number of clusters derived. In practice, as most place names are expected to be located within the first cluster, the time complexity is close to linear in the number of place names. The reason to consider multiple derived clusters instead of only the first one is that the input place names may have multiple spatial foci, i.e., the locations of some of the named places may be relatively far away. In such cases, these isolated place names would be missed by the first cluster and thus could not be disambiguated correctly. The complete disambiguation procedure of this step is given in Algorithm 2.

Input: Clusters, PlaceNamesAndEntries as a list of 2-element tuples
Output: DisambiguatedPlaceNames

1: DisambiguatedPlaceNames ← an empty list
2: RankedClusters ← rankDescendent(Clusters)
3: for Place in getPlaces(PlaceNamesAndEntries) do
4:     for Cluster in RankedClusters do
5:         for Entry in getCorrespondingEntries(Place, PlaceNamesAndEntries) do
6:             if Entry in Cluster then
7:                 DisambiguatedPlaceNames.append((Place, Entry))
8:                 Goto 3
9:             end if
10:         end for
11:     end for
12: end for
13: return DisambiguatedPlaceNames
Algorithm 2: Disambiguation using the derived clusters.
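A Python sketch of Algorithm 2, with entries represented by hashable identifiers; the rule that a place name whose top-cluster contains several of its entries counts as a failure (Section 4.2) is included here for completeness:

def disambiguate(clusters, place_entries):
    # clusters: list of sets of entry identifiers; place_entries: dict place name -> candidate entry ids
    ranked = sorted(clusters, key=len, reverse=True)   # rank clusters by number of points, descending
    resolved = {}
    for name, entries in place_entries.items():
        for cluster in ranked:
            hits = [e for e in entries if e in cluster]
            if len(hits) == 1:
                resolved[name] = hits[0]               # entry found in the top-cluster for this name
                break
            if len(hits) > 1:
                break                                  # ambiguous within its top-cluster: a failure
    return resolved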

4 Experiment on comparison of the clustering algorithms

This section describes the input datasets, the preprocessing procedure, the gazetteers and parser used, and the final input to the algorithms to be tested. Then, the experiment settings in terms of algorithms and the values used for their parameters are introduced.

4.1 Dataset and preprocessing

Two sets of place descriptions are used in the experiment. The first one contains 42 descriptions submitted by graduate students about the University of Melbourne campus, which are relatively focused in spatial extent (Vasardani et al, 2013a). The second one was harvested from web texts (e.g., Wikipedia, tourist sites, and blogs) about places around and inside Melbourne, Australia (Kim et al, 2015). The two datasets cover more than 1000 distinct gazetteered places. Two example descriptions from the two datasets are shown below respectively, with gazetteered place names highlighted:

“… If you go into the Old Quad, you will reach a square courtyard and at the back of the courtyard. You can either turn left to go to the Arts Faculty Building, or turn right into the John Medley Building and Wilson Hall […] If you continue walk along the road on the right side where you’re facing Union House, you can see the Beaurepaire and Swimming Pool. There will also be a sport tracks and the University Oval behind it …”

“… St Margaret’s School is an independent, non-denominational day school with a co-educational primary school to Year 4 and for girls from Year 5 to Year 12. The school is located in Berwick, a suburb of Melbourne […] In 2006, St Margaret’s School Council announced its decision to establish a brother school for St Margaret’s. This school opened in 2009 named Berwick Grammar School, currently catering for boys in Years 5 to 12, with one year level being added each year … ”

Place name recognition is outside the scope of this research, and we used a previously-developed parser to extract place names from each of the descriptions (Liu, 2013). Then, three gazetteers were used in conjunction for retrieving (ambiguous) entries for the extracted names, aiming for completeness: the OpenStreetMap Nominatim geocoder (https://nominatim.openstreetmap.org/), the GoogleV3 geocoder (https://developers.google.com/maps/documentation/geocoding/intro), and GeoNames (http://www.geonames.org/). For example, the name St Margaret’s School has a total of 11 corresponding entries from the three gazetteers. The retrieved entries from the three sources were then synthesized, and duplicate entries referring to the same places were removed. The numbers of ambiguous gazetteer entries retrieved are shown in Figure 5, representing the ambiguity of these place names.
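As an illustration only (not the authors’ pipeline), candidate entries for a single name can be retrieved from one of these gazetteers with a library such as geopy; the snippet below queries Nominatim, and the user agent string and result limit are assumptions:

from geopy.geocoders import Nominatim

def candidate_entries(place_name, user_agent="place-disambiguation-demo"):
    # returns a list of (address, latitude, longitude) tuples; an empty list means the
    # name is a 'noise' place name not captured by this gazetteer
    geocoder = Nominatim(user_agent=user_agent)
    locations = geocoder.geocode(place_name, exactly_one=False, limit=20) or []
    return [(loc.address, loc.latitude, loc.longitude) for loc in locations]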

Figure 5: Numbers of ambiguous gazetteer entries of place names from the two datasets, campus (left) and Melbourne (right).

Next, the extracted place names were manually linked to their corresponding gazetteer entries to create the groundtruth data for evaluation. For each description document, the input to the algorithms tested in the experiment below is the locations of all ambiguous gazetteer entries of the place names extracted from the document, as a point cloud. An illustrative example based on a document from the campus dataset is provided in Figure 6. The ground truth locations of these place names (the locations of their corresponding gazetteer entries), which are inside or near the University of Melbourne campus, are highlighted in red in the bottom-right corner. For the algorithms tested below, a place name counts as successfully disambiguated if it is correctly linked to its corresponding gazetteer entry.

Figure 6: An example input point cloud of ambiguous gazetteer entry locations of a set of place names from the campus dataset, with ground truth locations highlighted in red color.

4.2 Experiment setup

A total of 16 algorithms are evaluated on the datasets: overall minimum distance (OMD), centroid, minimum distance to unambiguous referents (DTUR), DBSCAN, DBSCAN with automatically determined parameter (k-dist), OPTICS, OUTCLUST, CURE, CHAMELEON, HDBSCAN, KMeans, GMM, SNN, Spectral, SOM, and DensityK. For k-dist, the original authors did not give a straightforward way to determine the threshold; therefore, we use the 2σ rule in the same way as it is used in DensityK (Algorithm 1), to enable a fair comparison. For algorithms that have not been used for place name disambiguation before (i.e., from k-dist to SOM), Algorithm 2 is used on the generated clusters for disambiguation. In case a top-cluster of a place name contains more than one gazetteer entry of this place name, the place name cannot be disambiguated and the case is regarded as a failure. Different parameter values of the algorithms are tested, as shown in Table 1.

Parameter                     Notion            Values                        Algorithms
Distance threshold (meters)   ε                 200, 2000, 20000              DBSCAN
No. of nearest neighbors      k                 5, 10, 25                     OUTCLUST, SNN, CHAMELEON, Spectral
No. of clusters to derive     NumberOfClusters  3, 5, 10, 20                  OPTICS, CURE, KMeans, GMM, Spectral
Minimum points in cluster     MinPts            1, 5, 10                      DBSCAN, k-dist
Frequency threshold           f                 0.1, 0.2, 0.5                 OUTCLUST
Weighting coefficient         –                 0.1, 1, 10                    CHAMELEON
SOM dimension                 –                 (5, 5), (10, 10), (20, 20)    SOM
Table 1: Parameter configurations of the algorithms tested for place name disambiguation.

There are a number of algorithmic features that are important for the place name disambiguation task. The first is robustness: an algorithm should ideally work on different input datasets and have minimum variance in precision and distance error. The next feature is minimum parameter-dependency: a parameter-free algorithm, or an algorithm whose parameters can be determined automatically, is desirable. Again, this is because for place name disambiguation no pre-knowledge, such as the distances between places or the extent of the space, should be assumed for an input. Lastly, an algorithm should ideally also be parameter-insensitive, i.e., modifying parameter values should not lead to significantly different results. The degree to which each of these algorithms satisfies these features when used for fine-grained place name disambiguation will be discussed.

5 Clustering algorithm performance results

Table 2 presents the precision of each algorithm on the tested datasets; the precisions are based on the best-performing parameter configurations of the algorithms. DensityK achieves the highest precision, followed by DBSCAN. This is not surprising, as DensityK is designed to be more flexible in determining cluster distances than DBSCAN. In the remainder of this section, the clustering results of each algorithm are discussed individually and compared with each other. This comparison provides better insight into whether each of these algorithms is suitable for the task of this research, regarding both the feature requirements and performance.

Category Algorithm Precision
Ad-hoc OMD 76.7%
Centroid 57.2%
DTUR 69.3%
Density-based DBSCAN 81.5%
DBSCAN k-dist 75.4%
OPTICS 73.2%
OUTCLUST 70.6%
Hierarchical-based CURE 78.9%
CHAMELEON 58.3%
HDBSCAN 75.7%
Partitioning relocation-based KMeans 73.4%
GMM 80.8%
Others SNN 70.5%
Spectral 74.4%
SOM 73.1%
The new algorithm DensityK 83.1%
Table 2: Average precision of each algorithm with the best-performing parameters on the tested datasets.

The clustering results generated by algorithms used for place name disambiguation in the literature, i.e., overall minimum distance, centroid, minimum distance to unambiguous referents, and DBSCAN, are shown in Figure 7, ranked by the number of points contained. A major drawback of the overall minimum distance and the minimum distance to unambiguous referents methods is that they are sensitive to noise place names: place names whose actual location is not captured by the gazetteer. For example, the place name Union House refers to a building on the University of Melbourne campus. Its true location has no corresponding gazetteer entry, and the ambiguous gazetteer entries retrieved for this place name in the input point cloud are elsewhere around the world with the same name. Such cases are common for fine-grained place names, while prominent place names (e.g., natural or political) are less likely to be missing from a gazetteer. Another disadvantage of the overall minimum distance method is scalability, as its time cost is significantly larger (over ten times) than that of the other algorithms for most of the tested datasets, particularly for documents with a large number of place names and high ambiguity. The centroid-based method performs badly because the input point cloud is spread over the earth, and the centroid is somewhere in the middle, far from the actual focus of the groundtruth locations.

Figure 7: Clustering results generated by established clustering algorithms for place name disambiguation.

DBSCAN is robust against noise place names, as it can capture the spatial context (the highlighted red region shown in Figure 6) of the original description and neglect entries outside of it. For the example point cloud, when the parameter ε is set to 2000, the resulting disambiguation precision is higher than with the other values from Table 1. More groundtruth entries are missed by the cluster generated with a value of 200, and more ambiguous entries are included with a value of 20000. For the clusters generated by the k-dist method, the value of ε determined in this case is roughly 300, which deviates considerably from the most suitable value (somewhere between 1000 and 2000). Consequently, k-dist performs badly in this case.

Figure 8 shows clustering results generated by two other density-based clustering algorithms, OPTICS and OUTCLUST, for the example input data. OPTICS is designed to overcome the parameter-dependency of DBSCAN, and is thus expected to perform similarly to DBSCAN with its best-performing parameters. The results show that although OPTICS is more flexible in deriving clusters of various densities for the tested datasets, this is actually a disadvantage for the task of this research. OPTICS tends to aggregate points from the ground truth spatial context with other points that are relatively close to it, despite these marginal points having relatively larger local densities. In addition, the parameter NumberOfClusters of OPTICS is problematic to define. Nevertheless, it is found that setting the value to 10 generally leads to optimal results regardless of the input. OUTCLUST has the same drawback of merging nearby points into the spatial context, and its result is determined by both parameters k and f. These two parameters are more sensitive to the input data than NumberOfClusters of OPTICS, and there is no straightforward method to determine their values either. A large input value of k will result in few clusters, as more data points will be regarded as neighbors, and vice versa. Compared to OPTICS, OUTCLUST focuses more on relative density by considering nearest neighbors rather than absolute density; thus, boundary points that are relatively close to some clusters while isolated from others are more likely to be merged.

Figure 8: Clustering results generated by density-based clustering algorithms.

Clustering results by hierarchical clustering algorithms are shown in Figure 9. CURE requires the parameter NumberOfClusters, similar to OPTICS, and the clusters it derives are generally similar to those of OPTICS. CHAMELEON is more parameter-sensitive than CURE, and its resulting disambiguation precision is not as good as CURE’s, even with the best-performing parameters. As for HDBSCAN, although it does not require any mandatory input parameters, the resulting precision for some input data is only slightly worse than DensityK. However, HDBSCAN is not robust against different input data – it performs quite well for some data, but significantly worse for others. We discuss this in more detail later in this section.

Figure 9: Clustering results generated by hierarchical clustering algorithms.

Clustering results for the partitioning relocation-based algorithms are shown in Figure 10. The KMeans algorithm aims at minimizing within-cluster distances while dividing the data into k clusters. As a partition-based algorithm, it is not expected to perform well on fine-grained place name disambiguation, which is not a classification problem, and its resulting average precision is worse than that of HDBSCAN and CURE. For some input data, GMM performs well and achieves the same precision as DensityK, or as DBSCAN with the best-performing parameter values. Its performance is generally good (measured by average precision) and robust (e.g., compared to HDBSCAN, which is discussed later). In addition, for most input data, setting different values of NumberOfClusters, once larger than 10, makes little difference to the clustering, in contrast to algorithms such as KMeans or CURE. Still, there is no easy way to automatically determine the value of NumberOfClusters, and a single value does not always lead to the highest precision for different input data.

Figure 10: Clustering results generated by partitioning relocation clustering algorithms.

Figure 11 shows the results of the remaining three algorithms. SNN is highly sensitive to the parameter k, the number of nearest neighbors to consider, and different values often result in significantly different clustering results, as shown in the figure. A large k tends to result in only a few large, well-separated clusters, so that small local variations in density have little impact. Similar to OUTCLUST, there is no easy way to determine a suitable, meaningful number of nearest neighbors to consider. Spectral clustering also suffers from parameter sensitivity, for both k and NumberOfClusters. Its precision is almost always worse than that of algorithms such as DBSCAN, CURE, and GMM, even with the best-performing parameter values. The clusters generated by SOM are often similar in pattern to those derived by CURE or KMeans, but the average precision is much lower (even lower than Spectral clustering). One advantage of SOM is that the SOM dimension can easily be set to large numbers, which typically leads to higher precision than adopting small dimension values. Beyond a certain dimension, further increasing the values makes minimal difference to the resulting clusters as well as to precision.

Figure 11: Clustering results generated by other clustering algorithms.

The result of DensityK is shown in Figure 12. The clusters generated are similar to those of DBSCAN with ε set to 2000 for this particular input, as shown in Figure 7. Compared to the results generated by the other algorithms, shown in Figures 8, 9, 10 and 11, the first-ranking cluster (the purple circles) generated by DensityK is the most focused and the closest to the highlighted ground truth spatial context shown in Figure 6.

Figure 12: Clustering results generated by the DensityK algorithm.

From the tested algorithms, OPTICS, CURE, HDBSCAN, GMM, and DensityK seem to be the most suitable for place name disambiguation considering the feature requirements. They provide good disambiguation precision, and either do not require input parameters (HDBSCAN and DensityK), or have parameters that are easy to determine and work well on various input data (NumberOfClusters for OPTICS, CURE, and GMM). In comparison, parameters such as k or ε are more sensitive to the input, and cannot be determined easily each time a new input is given. Here we further evaluate the robustness of the five algorithms over different input data, in terms of variation in precision and average distance error, i.e., the average distance between the ground truth locations of place names and the entries selected by these algorithms. We randomly select documents from our dataset, and the results are shown in Figure 13. DensityK almost always has the highest precision, as well as low variation compared to the other algorithms, particularly HDBSCAN and OPTICS. In terms of distance errors, DensityK has the least variance as well as the overall minimum distance errors.
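For reference, the two evaluation measures can be computed with a few lines of Python; the matching rule below (comparing selected entry locations to ground truth locations in a projected metric system) is an assumption, since entry identifiers could equally be compared:

import numpy as np

def precision_and_distance_error(predicted, ground_truth):
    # predicted / ground_truth: dicts mapping place name -> (x, y) location in meters
    names = [n for n in ground_truth if n in predicted]
    correct = sum(predicted[n] == ground_truth[n] for n in names)
    errors = [np.hypot(predicted[n][0] - ground_truth[n][0],
                       predicted[n][1] - ground_truth[n][1]) for n in names]
    precision = correct / len(ground_truth) if ground_truth else 0.0
    mean_error = float(np.mean(errors)) if errors else 0.0
    return precision, mean_error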

Figure 13: Precision (left) and average distance error (right) by description document.

Figure 14 shows the clustering results for the places merged from the two datasets using DensityK, representing the spatial contexts of the two data sources where the descriptions are embedded, i.e., the University of Melbourne campus, and Melbourne.

Figure 14: Clusters derived for merged places from the campus dataset (top) and the Melbourne dataset (bottom) representing spatial contexts. The left hand side shows the input point clouds of all ambiguous entries.

6 Conclusions

Place descriptions in everyday communication provide a rich source of knowledge about places. In order to utilize such knowledge in information systems, an important step is to locate the places being referred to. The problem of locating place names from text sources is often called toponym resolution, which consists of two tasks: place name identification from text, and place name disambiguation. This research looks at the second task, and more specifically, disambiguating fine-grained place names extracted from place descriptions. We focus on clustering-based disambiguation approaches, as clustering approaches require minimum pre-knowledge of the place names to be disambiguated compared to knowledge- and machine learning-based approaches.

For this purpose, we first select clustering algorithms that have been used for place name disambiguation in the literature, or are from other communities (e.g., data mining) and are regarded as promising for this task. We evaluate and compare the performance of these algorithms based on two different datasets using precision and distance error. For algorithms that require parameters, different values of each parameter are tested in a grid-search manner. We then analyze the performance and associated causes for each algorithm, its parameter-dependency and parameter-sensitivity, robustness (in terms of variance of their performance over different input data), and discuss the suitability of each algorithm for fine-grained place name disambiguation based on these criteria.

Furthermore, a new clustering algorithm, DensityK, is presented. It is designed to overcome several identified limitations of the previous algorithms. It out-performs the other tested algorithms and achieves state-of-the-art disambiguation precision on the test datasets. The algorithm analyzes the local densities of an input point cloud, which consists of all ambiguous gazetteer entries corresponding to the place names extracted from an input document, and derives a distance threshold that determines clusters with significantly larger densities than the remaining points. Compared to the other algorithms, DensityK is parameter-independent and robust against input data with various spatial extents, densities, and granularities, which makes it the most desirable for the task of this research. This is reflected by consistently higher precision and overall minimum distance error compared to the other competitive algorithms. The worst-case time complexity of the algorithm is the same as that of DBSCAN (O(n²)), when both are considered without any indexing mechanism for neighborhood queries. This time complexity is better than that of algorithms such as overall minimum distance clustering.

The focus of this research is to provide recommendations for the selection of appropriate methods of clustering-based disambiguation for fine-grained place names from place descriptions. We have not yet considered further optimizing the developed algorithm, although Section 3.1 briefly explains how indexing and an optional distance upper bound can be used to reduce processing time. Optimization matters for applications such as processing streaming data for geographic information retrieval. Finally, a clustering algorithm for this purpose can be used in conjunction with knowledge- or machine learning-based approaches to enhance precision, which is beyond the scope of this research.

References

  • Adelfio and Samet (2013) Adelfio MD, Samet H (2013) Structured toponym resolution using combined hierarchical place categories. In: Proceedings of the 7th Workshop on Geographic Information Retrieval, pp 49–56
  • Amitay et al (2004) Amitay E, Har’El N, Sivan R, Soffer A (2004) Web-a-where: geotagging web content. In: Proceedings of SIGIR ’04 Conference on Research and Development in Information Retrieval, pp 273–280
  • Angiulli (2006) Angiulli F (2006) Clustering by exceptions. In: Proceedings of the National Conference on Artificial Intelligence, pp 312–317
  • Ankerst et al (1999) Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of the ACM SIGMOD Conference, Philadelphia, PA, pp 49–60
  • Berkhin (2006) Berkhin P (2006) A survey of clustering data mining techniques. In: Kogan J, Nicholas C TM (ed) Grouping Multidimensional Data, Springer, Berlin, Heidelberg, pp 25–71
  • Buscaldi (2011) Buscaldi D (2011) Approaches to disambiguating toponyms. SIGSPATIAL Special 3(2):16–19
  • Buscaldi and Magnini (2010) Buscaldi D, Magnini B (2010) Grounding toponyms in an Italian local news corpus. In: Proceedings of the 6th Workshop on Geographic Information Retrieval, pp 70–75
  • Buscaldi and Rosso (2008a) Buscaldi D, Rosso P (2008a) A conceptual density-based approach for the disambiguation of toponyms. International Journal of Geographical Information Science 22(3):301–313
  • Buscaldi and Rosso (2008b) Buscaldi D, Rosso P (2008b) Map-based vs. knowledge-based toponym disambiguation. In: Proceedings of the 2nd International Workshop on Geographic Information Retrieval, pp 19–22
  • Campello et al (2013) Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Pei J, Tseng VS, Cao L, Motoda H XG (ed) Advances in Knowledge Discovery and Data Mining, Springer, Berlin, Heidelberg, pp 160–172
  • Celeux and Govaert (1992) Celeux G, Govaert G (1992) A classification em algorithm for clustering and two stochastic versions. Computational statistics & Data analysis 14(3):315–332
  • Cheng et al (2010) Cheng Z, Caverlee J, Lee K (2010) You are where you tweet: a content-based approach to geo-locating twitter users. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp 759–768
  • Derungs et al (2012) Derungs C, Palacio D, Purves RS (2012) Resolving fine granularity toponyms: Evaluation of a disambiguation approach. In: Proceedings of the 7th International Conference on Geographic Information Science, pp 1–5
  • Ertöz et al (2003) Ertöz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the 2003 SIAM International Conference on Data Mining, pp 47–58
  • Ester et al (1996) Ester M, Kriegel HP, Sander J, Xu X, et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, pp 226–231
  • Guha et al (1998) Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: ACM Sigmod Record, vol 27, pp 73–84
  • Habib et al (2012) Habib MB, van Keulen M (2012) Improving toponym disambiguation by iteratively enhancing certainty of extraction. In: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012, Barcelona, Spain, pp 399–410
  • Hartigan and Wong (1979) Hartigan JA, Wong MA (1979) Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society 28(1):100–108
  • Karypis et al (1999) Karypis G, Han EH, Kumar V (1999) Chameleon: Hierarchical clustering using dynamic modeling. Computer 32(8):68–75
  • Kim et al (2015) Kim J, Vasardani M, Winter S (2015) Harvesting large corpora for generating place graphs. In: Workshop on Cognitive Engineering for Spatial Information Processes, COSIT 2015, pp 20–26
  • Kohonen (1998) Kohonen T (1998) The self-organizing map. Neurocomputing 21(1-3):1–6
  • Leidner (2008) Leidner JL (2008) Toponym resolution in text: Annotation, evaluation and applications of spatial grounding of place names. Universal-Publishers, Boca Raton
  • Leidner et al (2003) Leidner JL, Sinclair G, Webber B (2003) Grounding spatial named entities for information extraction and question answering. In: Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, pp 31–38
  • Lieberman et al (2007) Lieberman MD, Samet H, Sankaranarayanan J, Sperling J (2007) STEWARD: Architecture of a spatio-textual search engine. In: Samet H, Shahabi C, Schneider M (eds) Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic Information Systems, Seattle, WA, pp 186–193
  • Liu (2013) Liu F (2013) Automatic identification of locative expressions from informal text. PhD thesis, University of Melbourne, Australia
  • Moncla et al (2014) Moncla L, Renteria-Agualimpia W, Nogueras-Iso J, Gaio M (2014) Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus. In: Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 183–192
  • Palacio et al (2015) Palacio D, Derungs C, Purves R (2015) Development and evaluation of a geographic information retrieval system using fine grained toponyms. Journal of Spatial Information Science 2015(11):1–29
  • Ripley (1976) Ripley BD (1976) The Second-Order Analysis of Stationary Point Processes. Journal of Applied Probability 13(2):255–266
  • Roberts et al (2010) Roberts K, Bejan CA, Harabagiu SM (2010) Toponym Disambiguation Using Events. In: FLAIRS Conference, vol 10, p 1
  • Roller et al (2012) Roller S, Speriosu M, Rallapalli S, Wing B, Baldridge J (2012) Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp 1500–1510
  • Smith and Crane (2001) Smith DA, Crane G (2001) Disambiguating geographic names in a historical digital library. In: Research and Advanced Technology for Digital Libraries: Fifth European Conference (ECDL 2001), pp 127–136
  • Smith and Mann (2003) Smith DA, Mann GS (2003) Bootstrapping toponym classifiers. In: Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, Association for Computational Linguistics, pp 45–49
  • Teitler et al (2008) Teitler BE, Lieberman MD, Panozzo D, Sankaranarayanan J, Samet H, Sperling J (2008) NewsStand: a new view on news. In: Aref WG, Mokbel MF, Schneider M (eds) Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 144–153
  • Vasardani et al (2013a) Vasardani M, Timpf S, Winter S, Tomko M (2013a) From descriptions to depictions: A conceptual framework. In: Tenbrink T, Stell J, Galton A, Wood Z (eds) Spatial Information Theory: 11th International Conference, COSIT 2013, Springer, pp 299–319
  • Vasardani et al (2013b) Vasardani M, Winter S, Richter KF (2013b) Locating place names from place descriptions. International Journal of Geographical Information Science 27(12):2509–2532
  • Wing and Baldridge (2014) Wing B, Baldridge J (2014) Hierarchical Discriminative Classification for Text-Based Geolocation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 336–348