1 Introduction
Everyday place descriptions are a means of encoding and transmitting spatial knowledge about places between individuals (Vasardani et al, 2013a, b). The web also provides a plethora of place descriptions in the form of news articles, social media texts, trip guides, and tourism articles (Kim et al, 2015). An example of a place description from the web is shown in Figure 1. To utilize the expressed place-related knowledge in information systems, the place names need to be identified and georeferenced (or located). A typical approach deploys place name gazetteers: directories of known names and their locations. Since many place names are ambiguous – with multiple gazetteer entries – the approach also includes a disambiguation process. The whole process consists of two steps, place name recognition (from text) and place name disambiguation, and is often called toponym resolution (Leidner, 2008). This research focuses on the second challenge, i.e., place name disambiguation, with everyday place descriptions as the target document source.
Everyday place descriptions often contain place names of fine-grained features (e.g., names of streets, buildings, and local points of interest). Most studies in the field of toponym resolution focus on larger geographic features such as populated places (e.g., cities or towns) or natural geographic features (e.g., rivers or mountains). For these features, disambiguation heuristics can leverage the size, population, or containment relationships of candidate places, possibly based on external knowledge bases (e.g., WordNet or Wikipedia). Such heuristics quickly fail when dealing with the fine-grained places in everyday place descriptions. Fine-grained places are often significantly more frequent, and more similar to each other, than those larger (natural or political) gazetteered places. Even disambiguation approaches based on machine-learning techniques are difficult to apply to fine-grained places, due to the lack of good-quality training data as well as the challenge of locating previously unseen place names.
In this research we use map-based clustering approaches that have been developed for place name disambiguation. Map-based approaches should be relatively robust for fine-grained places, as they only require knowledge of the locations of ambiguous candidate entries. However, it remains to be seen whether these algorithms are suitable for the task of this research. Some of them are defined for large geographic features and may not perform equally well on fine-grained places. Some algorithms are parameter-sensitive and require manual input, and thus substantial pre-knowledge of the data. Therefore, we also consider more generic clustering algorithms from fields such as statistics, pattern recognition, and machine learning. In particular, we compare existing clustering algorithms with a novel algorithm that is designed to be robust as well as parameter- and granularity-independent. We will show that the new algorithm, despite being parameter-independent, achieves state-of-the-art disambiguation precision and minimum distance error for several tested datasets.
The contributions of this paper are:

a comparison of different clustering algorithms for disambiguating fine-grained place names extracted from everyday place descriptions;

an in-depth analysis of algorithms from five categories (ad-hoc, density-based, hierarchical-based, partitioning relocation-based, and others) in terms of performance, the reasons behind it, and the relative suitability of each for the task; and

a new clustering algorithm which outperforms the other tested algorithms for the collected datasets.
Accordingly, compared to existing algorithms, the advantages of the new algorithm are:

it does not require manual input of parameter values and works well for data with different contexts, i.e., different sizes of spatial coverage, distances between places, and levels of granularity (parameter-independence);

it achieves the highest average disambiguation precision and the lowest overall distance errors for the tested datasets, compared to the other algorithms even with their best-performing parameter values. Note that such values are typically hard to determine without pre-knowledge of the data; and

its performance is robust for descriptions with different contexts. Compared to other algorithms, it has low variation in both precision and distance error for different input data.
The remainder of the paper is structured as follows: Section 2 reviews relevant clustering algorithms. Section 3 proposes a new algorithm. Section 4 explains the input datasets as well as the experiment. Section 5 presents and discusses the obtained results. Section 6 concludes this paper.
2 Related work
In this section, related work on disambiguating place names in text is introduced, together with relevant clustering algorithms.
2.1 Place name disambiguation
Place name disambiguation, also known as toponym disambiguation, is the task of disambiguating a place name with multiple corresponding gazetteer entries. For example, GeoNames (http://www.geonames.org/) lists 14 populated places named ‘Melbourne’ worldwide. Various approaches have been proposed over the past years, mainly in the context of Geographic Information Retrieval (GIR), in order to georeference place names in text or to geotag whole documents. Typically, place name disambiguation is done by considering context place names, i.e., other place names occurring in the same document (discourse), and computing the likelihood of each candidate gazetteer entry corresponding to this place name. The likelihood is computed as a score given some available knowledge of the context place names as well as the place name to be disambiguated, such as their locations or spatial containment relationships. For example, if ‘Melbourne’ and ‘Florida’ occur together in a document, then the place name ‘Melbourne’ more likely corresponds to the gazetteer entry ‘Melbourne, Florida, United States’ than to ‘Melbourne, Victoria, Australia’. There are also more recent language modeling approaches based on machine-learning techniques that consider not only context place names, but other non-geographical words as well (e.g., Cheng et al, 2010; Roller et al, 2012; Wing and Baldridge, 2014). Many geotagging systems – systems that determine the geo-focus of an entire document for geographic information retrieval purposes (e.g., Teitler et al, 2008; Lieberman et al, 2007) – heavily rely on place name recognition and disambiguation.
Depending on the knowledge used, disambiguation approaches can generally be classified into map-, knowledge-, and machine learning-based (Buscaldi, 2011). Map-based approaches rely mainly on the locations of the gazetteer entries of place names from a document, and use heuristics such as minimum pairwise distance, minimum convex hull, or closeness to the centroid of all entry locations for disambiguation (e.g., Smith and Crane, 2001; Amitay et al, 2004). Previous studies that focus on disambiguating fine-grained places (e.g., Derungs et al, 2012; Moncla et al, 2014; Palacio et al, 2015) are largely based on map-based approaches as well. Knowledge-based methods leverage external knowledge of places such as containment relationships, population, or prominence (e.g., Buscaldi and Rosso, 2008a; Adelfio and Samet, 2013). Machine learning-based approaches have the advantage of using non-geographical context words such as events, person names, or organization names to assist disambiguation, by creating models from training data representing the likelihood of seeing each of these context words associated with a place (Smith and Mann, 2003; Roberts et al, 2010). The selection of the disambiguation approach is usually task- and data source-dependent (Buscaldi, 2011), and it is also common that different approaches are combined in hybrid manners.
2.2 Relevant clustering algorithms
Clustering is a division of data into meaningful groups of objects. A variety of algorithms exist; a review of clustering algorithms for data mining is given by Berkhin (2006). In this section, we introduce clustering algorithms from two categories: those that have been used for place name disambiguation before (including ad-hoc ones), and selected ones from the data mining community. These algorithms will be compared to the newly developed algorithm later in this paper. For the task of place name disambiguation, the input to these algorithms is the locations of all ambiguous candidate gazetteer entries of all place names from a document, in the form of a point cloud.
2.2.1 Clustering algorithms used for place name disambiguation
The overall minimum distance heuristic aims at selecting gazetteer entries so that they are as geographically close to each other as possible. The closeness is typically measured either by the average pairwise distance of the selected locations, or by the area of their convex hull. An illustration of the algorithm is given in Figure 2 (left): for each combination of ambiguous place name entries (one entry for each place name), create a cluster. Then choose the minimum cluster, according to one of the measurement methods, as representing the disambiguated locations. This algorithm has been used in (Leidner et al, 2003; Amitay et al, 2004; Habib et al, 2012) and generates only one cluster.
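As an illustration, the exhaustive search behind this heuristic can be sketched as follows. This is a minimal sketch, assuming planar coordinates and the average-pairwise-distance variant of the closeness measure; the function name and data are hypothetical:

```python
from itertools import combinations, product

def overall_min_distance(candidates):
    """Pick one entry per place name so that the average pairwise
    distance of the chosen locations is minimal.

    candidates: list of lists of (x, y) points, one inner list per
    place name. Exhaustive search over every combination, so the cost
    grows with the product of the ambiguity counts.
    """
    best, best_score = None, float("inf")
    for combo in product(*candidates):
        pairs = list(combinations(combo, 2))
        score = sum(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
                    for (ax, ay), (bx, by) in pairs) / len(pairs)
        if score < best_score:
            best, best_score = combo, score
    return best

# Two place names, each with two candidate entries; the close pair wins.
chosen = overall_min_distance([[(0, 0), (500, 500)], [(1, 1), (900, 900)]])
```

The combinatorial enumeration also makes the scalability problem of this heuristic visible: with m place names of k candidates each, k^m combinations are scored.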
The centroid-based heuristic is explained in Figure 2 (right). The algorithm first computes the geographic focus (centroid) of all ambiguous entry locations, and calculates the distance of each entry location to it. Then, two standard deviations of the calculated distances are used as a threshold to exclude entry locations that are too far away from the centroid. Next, the centroid is recalculated based on the remaining entry locations. Finally, for each place name, the entry closest to the centroid is selected for disambiguation. The algorithm is used in (Smith and Crane, 2001; Buscaldi and Rosso, 2008b) and also derives only one cluster.

The minimum distance to unambiguous referents heuristic consists of two steps. It first identifies unambiguous place names, i.e., place names with only one gazetteer entry, or place names that can be easily disambiguated by some heuristic (e.g., when the method is used in conjunction with knowledge-based methods). Then, a scoring function is used for the disambiguation of the remaining ambiguous entries, for example based on the average minimum distance to the unambiguous entry locations, or a weighted average distance considering frequency of occurrence in the document or textual distance. The method appears in (Smith and Crane, 2001; Buscaldi and Magnini, 2010) and again generates one cluster.
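The centroid-based heuristic above can be sketched as follows. This is a minimal sketch with hypothetical planar data; in particular, reading the cutoff as "distance to the centroid greater than two standard deviations of the distances" is our assumption:

```python
import statistics

def centroid_heuristic(candidates):
    """Centroid-based disambiguation sketch.

    candidates: list of lists of (x, y) points, one inner list per
    place name. Steps: centroid of all entries; drop entries farther
    than two standard deviations of the distances (assumed cutoff);
    recompute the centroid; pick, per place name, the closest entry.
    """
    def centroid(pts):
        return (sum(p[0] for p in pts) / len(pts),
                sum(p[1] for p in pts) / len(pts))

    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    all_pts = [p for group in candidates for p in group]
    c = centroid(all_pts)
    ds = [dist(p, c) for p in all_pts]
    cutoff = 2 * statistics.pstdev(ds)          # the 2-sigma exclusion
    kept = [p for p in all_pts if dist(p, c) <= cutoff]
    c = centroid(kept)                          # recomputed centroid
    return [min(group, key=lambda p: dist(p, c)) for group in candidates]

# Three place names; the far-flung candidates are filtered out.
picks = centroid_heuristic([[(0, 0), (9000, 9000)], [(2, 1)],
                            [(1, 2), (8000, 0)]])
```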
The DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) is a density-based method that relies on two parameters: the neighborhood distance threshold Eps, and the minimum number of points to form a cluster MinPts. There is no straightforward way to fit the parameters without pre-knowledge of the data. Moncla et al. use DBSCAN for the purpose of place name disambiguation (Moncla et al, 2014), and the parameters in their case were empirically adjusted, since the authors had a good understanding of the spatial coverages of the input data as hiking itineraries. In the original DBSCAN paper (Ester et al, 1996), a heuristic is proposed to estimate the parameter values based on a k-dist graph (a line plot representing the distance to the k-th nearest neighbor of each point). However, it is not trivial to detect the threshold, which requires a selection of the value k as well as knowledge of the percentage of noise within the data.
2.2.2 General clustering algorithms for data mining
This section introduces clustering algorithms from four groups: density-based, hierarchical-based, partitioning relocation-based, and uncategorized ones.
Using DBSCAN requires a-priori knowledge of the input data to determine the parameters. Some data, such as the everyday descriptions in this research, have potentially varying conversational contexts, and thus varying distances between the places mentioned. The OPTICS algorithm (Ordering Points To Identify the Clustering Structure) (Ankerst et al, 1999) addresses the problem by building an augmented ordering of the data which is consistent with DBSCAN, but covers a spectrum of all different Eps values. The OUTCLUST algorithm exploits local density to find clusters that deviate most from the overall population (clustering by exceptions) (Angiulli, 2006), given k, the number of nearest neighbors for computing local densities, as well as f, a frequency threshold for detecting outliers.
Hierarchical clustering algorithms typically build cluster hierarchies and flexibly partition data at different granularity levels. Their main disadvantage is the vagueness of when to terminate the iterative process of merging or dividing subclusters. CURE (Clustering Using REpresentatives) (Guha et al, 1998) samples an input dataset and uses an agglomeration process to produce the requested number of clusters. CHAMELEON (Karypis et al, 1999) leverages a dynamic modelling method for cluster aggregation based on a k-nearest neighbor connectivity graph. HDBSCAN (Campello et al, 2013) extends DBSCAN by excluding border points from the clusters and follows the definition of density levels.
Partitioning relocation clustering divides data into several subsets, and certain greedy heuristics are then used for iterative optimization. The KMeans algorithm (Hartigan and Wong, 1979) divides the data into k clusters through random initial samples and an iterative process that updates the centroids of the clusters until convergence. A Gaussian Mixture Model (GMM) (Celeux and Govaert, 1992) attempts to find a mixture of probability distributions that best models the input dataset, through methods such as the Expectation-Maximization (EM) algorithm. KMeans is often regarded as a special case of GMM.
There are other algorithms that do not belong to the previous three categories. The SNN (Shared Nearest Neighbours) algorithm (Ertöz et al, 2003) blends in a density-based approach by first constructing a linkage matrix representing the similarity, e.g., distance, among shared nearest neighbors based on k-nearest neighbors (KNN). The remaining part of the algorithm is similar to DBSCAN. Spectral clustering relies on the eigenvalues of the similarity matrix (e.g., based on KNN) of the data and partitions the data into the required number of clusters. Compared to KMeans, spectral clustering cares about connectivity instead of compactness (e.g., geometrical proximity). Kohonen's Self-Organizing Maps (SOM) (Kohonen, 1998) are an artificial neural network-based clustering technique applying competitive learning using a grid of neurons. SOM is able to perform dimensionality reduction and map high-dimensional data to a (typically) two-dimensional representation.
3 A new robust, parameter-independent algorithm
The task of this research is the following: given a place description with gazetteered place names extracted, each name has a set of (one or more) corresponding gazetteer entries that it can be matched to. In order to disambiguate each place name and link it to the entry that it actually refers to, clustering algorithms can be used either to minimize the geographic distances between the disambiguated entries according to some objective function (e.g., minimal average pairwise distance), or to derive high-density clusters that are likely to represent the geographic extents in which the original descriptions are embedded. The input to such a clustering algorithm is a 2-dimensional point cloud with the locations of all ambiguous entries.
The task is to select clusters by these objectives rather than to classify the input data into several clusters. Such clusters will then be used for disambiguation, since they are expected to capture the true entries that the place names actually refer to. Points not captured by these clusters will be regarded as noise. Therefore, certain clustering algorithms seem more suitable for this task than others, e.g., DBSCAN over KMeans. Furthermore, algorithms that are not parameter-sensitive, or that require no parameters, are preferable, as place descriptions may have various spatial coverages, distances between places, and levels of granularity, so no pre-knowledge can be assumed. In this section, we propose a novel density-based clustering algorithm, DensityK. The algorithm is robust, parameter-independent, and consists of three steps.
3.1 Step one: computing pointwise distance matrix
In the first step, the algorithm computes all pointwise distances of an input point cloud, with time complexity O(n²) (n being the number of input points). The computation can be halved with a distance dictionary that avoids recomputing symmetric pairs (but needs O(n²) memory). The worst-case time complexity is equal to that of DBSCAN, both without any indexing mechanism for neighborhood queries. In practice, DBSCAN is expected to be faster, since it requires a predefined distance threshold and only considers pointwise distances below or equal to that value. With an index, e.g., an R-Tree, the computation time can be reduced. O(n²) is also the worst-case time complexity of other algorithms that require computing neighborhood distances, e.g., OUTCLUST, SNN, and HDBSCAN. Still, a distance upper-bound value can be enforced for DensityK as an optional parameter to reduce processing time, with an indexing approach similar to DBSCAN.
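A minimal sketch of this step, with hypothetical planar points, where a dictionary keyed by index pairs plays the role of the distance dictionary mentioned above:

```python
def pairwise_distances(points):
    """Compute each pointwise distance exactly once and store it keyed
    by the (i, j) index pair with i < j, so the symmetric O(n^2) matrix
    is never recomputed (at the cost of O(n^2) memory)."""
    dists = {}
    for i in range(len(points)):
        xi, yi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            dists[(i, j)] = ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5
    return dists

d = pairwise_distances([(0, 0), (3, 4), (3, 0)])
# lookup helper: order the indices so either direction hits the cache
lookup = lambda i, j: d[(i, j)] if i < j else d[(j, i)]
```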
3.2 Step two: deriving cluster distance
In the second step, DensityK analyzes the computed pointwise distances, and derives a cluster distance automatically. The cluster distance is similar to the parameter Eps in DBSCAN, and will be used in the next step for generating clusters.
First, the DensityK function is computed from the pointwise distances of the first step, as shown in Function 1. K(d) represents the average point density within an annular search region defined by the distance interval (d − Δd, d], taken over all points. The reason to apply an annular search region for computing point density, instead of a circular region (i.e., the interval (0, d]), is that we found the former leads to better clustering results. A comparison of the two search regions is given later in this section. In Function 1, the expression N_i(d − Δd, d] represents the number of points that are at a distance between d − Δd and d (including d) from point p_i. If there is no point within the search regions of all points for a distance interval (d − Δd, d], the function skips to the next interval (d, d + Δd]. Thus, K(d) is always positive. The denominator in Function 1 is the area of the annular region. Δd discretizes the function and is set to 100 m in this research; the resulting cluster distance will be an integer multiple of Δd. We will demonstrate below in this section that the clustering result is largely insensitive to the value of Δd.
K(d) = (1/n) · Σ_{i=1..n} N_i(d − Δd, d] / (π d² − π (d − Δd)²)    (1)
The approach is inspired by Ripley's K function (Ripley, 1976), which was originally designed to assess the degree of departure of a point set from complete spatial randomness, ranging from spatial homogeneity to a clustered pattern. Ripley's K function can be used to derive neither clusters nor cluster distances, yet the idea of relating point density to a distance threshold meets our interest. DensityK is a new algorithm with a different purpose than Ripley's K function, but Ripley's K function can be regarded as a cumulative version of the DensityK function. If the pointwise distances from the last step are sorted, computing the DensityK function is linear in the number of distances, as each sorted distance is visited at most once across the different values of d.
The function is able to reveal values of d with significantly large point densities. Two illustrative examples are given in Figure 3 (a) and (b) with different input data. For each of the two sample functions, K(d) starts at a non-zero value for the first d = 100 m (the value of Δd), which means that some points lie within 100 m of other points in the input point cloud. As d grows, the value of K(d) continues to decrease. For different input data, it is also possible that K(d) starts from a low value and then increases until a maximum is reached, after which it starts to decrease again.
Next, the mean μ and standard deviation σ of all K(d) values (a finite set, since the function is discretized by Δd) are calculated. Then the 2σ rule is applied, and the minimum value of d whose K(d) lies within two standard deviations of the mean, i.e., the smallest d with K(d) ≤ μ + 2σ, is selected as the cluster distance. The derived cluster distances are also shown in Figure 3 (a), (b). Intuitively, the cluster distance is the value of d at the ‘valley’ of a DensityK function – a visually identifiable (at least roughly) x-value where the pace of decrease of K(d) changes dramatically, leading to values close to zero. It is found that the resulting cluster distances always sit somewhere at the ‘valley’ of the functions (in terms of d values) for different input data, and that the clusters derived afterwards match the actual spatial contexts (the spatial extents in which the descriptions are embedded) quite well.
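The second step can be sketched as follows. This is our reading of Function 1 and the 2σ rule; in particular, planar coordinates and the interpretation of the cluster distance as the smallest d whose K(d) is no longer a high outlier are assumptions of the sketch:

```python
import math

def density_k(points, delta_d=100.0):
    """Sketch of the DensityK function and the 2-sigma cluster-distance
    rule. Returns (k_function, cluster_distance), where k_function is a
    list of (d, K(d)) tuples for non-empty annuli."""
    n = len(points)
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(n) for j in range(i + 1, n)
    )
    k_function = []
    d, idx = delta_d, 0
    while idx < len(dists):
        # count pairwise distances inside the annulus (d - delta_d, d]
        count = 0
        while idx < len(dists) and dists[idx] <= d:
            count += 1
            idx += 1
        if count > 0:  # empty intervals are skipped, so K(d) > 0
            area = math.pi * d ** 2 - math.pi * (d - delta_d) ** 2
            # 2 * count: each unordered pair is seen from both endpoints
            k_function.append((d, 2 * count / (n * area)))
        d += delta_d
    values = [k for _, k in k_function]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    # cluster distance: smallest d whose K(d) is not a high outlier
    cluster_distance = min(d for d, k in k_function if k <= mean + 2 * std)
    return k_function, cluster_distance

# One dense cluster plus scattered distant points: K(d) spikes at the
# first interval, then drops to near zero after the 'valley'.
kf, dc = density_k([(0, 0), (10, 0), (0, 10), (10, 10), (5, 5), (20, 20),
                    (1000, 0), (2500, 0), (4300, 0), (6100, 0), (8200, 0)])
```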
A comparison of annular and circular (replacing all intervals (d − Δd, d] by (0, d] in Function 1) search regions is shown in Figure 3 (c) and (d), with the same input data as in (a) and (b) respectively. When tested on sample data, it is found that with annular regions the derived clusters are always more constrained (as the computed cluster distances are smaller) and closer to the actual spatial contexts than those derived with circular regions. Such more constrained clusters are preferred, as they are more likely to exclude ambiguous entries; it is found that they lead to higher disambiguation precision on the tested data as well. This phenomenon is most likely because with annular regions the DensityK function is more sensitive to changes of local density. In comparison, circular regions result in smoother density functions and possibly much larger derived cluster distances.
The DensityK function is largely insensitive to the value of Δd. As shown in Figure 4, the DensityK function plots generated for the same input data with three different Δd values (100, 250, and 500) are similar, and the cluster distances derived are the same. Δd should be set to a constant, small number (e.g., the values in Figure 4) for all input data, purely for the purpose of discretization. Such a small number works well for various input data, even those with large cluster distances. Note that there is no single optimal cluster distance for disambiguation. For example, two different cluster distances may lead to the same disambiguation result for a given input, whereas a third value for the same input may increase or reduce the disambiguation precision, depending on the distances between the actual locations of the place names.
Algorithm 1 explains the whole procedure of this step, with the sorted pointwise distances from the last step as input. The first part of the algorithm computes K(d) for the different values of d, and stores tuples of (d, K(d)) in the list variable KFunction. Then, the cluster distance is derived from KFunction.
3.3 Step three: deriving clusters and disambiguation
The procedure of deriving clusters is similar to DBSCAN: points that are within the cluster distance threshold of each other are merged into the same cluster. The last step is to assign each place name a location for disambiguation. To do so, the derived clusters are ranked by their number of contained points in descending order. Then, for each place name, the entry that first appears in one of the clusters according to the ranking is chosen, and the first cluster in which an entry appears is called the top-cluster for this place name. For example, if an entry of a place name appears in the cluster with the largest number of points, that entry will be selected for disambiguation. If no corresponding entry of the place name is found in the first cluster, then the next cluster is inspected, until an entry is found. Thus, the worst-case time complexity of this step is O(m · c), where m is the number of place names and c the number of clusters derived. In practice, as most place names are expected to be located within the first cluster, the time complexity is close to O(m). The reason that we consider multiple derived clusters instead of only the first cluster is that the input place names may have multiple spatial foci, i.e., the locations of some of the named places may be relatively far away from the rest. In such cases, these isolated place names would be missed by the first cluster and thus could not be disambiguated correctly. The complete disambiguation procedure of this step is given in Algorithm 2.
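The cluster ranking and top-cluster lookup of this step can be sketched as follows. This is a simplified sketch with hypothetical planar candidates; single-linkage merging via union-find stands in for the DBSCAN-style cluster derivation:

```python
def disambiguate(candidates, cluster_distance):
    """Sketch of step three: merge points within the cluster distance
    into clusters, rank clusters by size, and pick for each place name
    the entry found in the highest-ranked cluster containing one."""
    points = [p for group in candidates for p in group]
    # union-find over points closer than the cluster distance
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = ((points[i][0] - points[j][0]) ** 2 +
                    (points[i][1] - points[j][1]) ** 2) ** 0.5
            if dist <= cluster_distance:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(points[i])
    ranked = sorted(clusters.values(), key=len, reverse=True)
    result = []
    for group in candidates:
        chosen = None
        for cluster in ranked:              # walk clusters by size
            hits = [p for p in group if p in cluster]
            if hits:
                # more than one entry in the top-cluster: a failure case
                chosen = hits[0] if len(hits) == 1 else None
                break
        result.append(chosen)
    return result

# Three place names; the largest cluster resolves all of them.
picks = disambiguate([[(0, 0), (5000, 5000)], [(10, 10), (7000, 0)],
                      [(20, 0)]], 100.0)
```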
4 Experiment on comparison of the clustering algorithms
This section describes the input datasets, the preprocessing procedure, the gazetteers and parser used, and the final input to the algorithms to be tested. Then, the experiment settings in terms of algorithms and the values used for their parameters are introduced.
4.1 Dataset and preprocessing
Two sets of place descriptions are used in the experiment. The first one contains 42 descriptions submitted by graduate students about the University of Melbourne campus, which are relatively focused in spatial extent (Vasardani et al, 2013a). The second one was harvested from web texts (e.g., Wikipedia, tourist sites, and blogs) about places around and inside Melbourne, Australia (Kim et al, 2015). The two datasets cover more than 1000 distinct gazetteered places. Two example descriptions from the two datasets are shown below respectively, with gazetteered place names highlighted:
“… If you go into the Old Quad, you will reach a square courtyard and at the back of the courtyard. You can either turn left to go to the Arts Faculty Building, or turn right into the John Medley Building and Wilson Hall […] If you continue walk along the road on the right side where you’re facing Union House, you can see the Beaurepaire and Swimming Pool. There will also be a sport tracks and the University Oval behind it …”
“… St Margaret’s School is an independent, nondenominational day school with a coeducational primary school to Year 4 and for girls from Year 5 to Year 12. The school is located in Berwick, a suburb of Melbourne […] In 2006, St Margaret’s School Council announced its decision to establish a brother school for St Margaret’s. This school opened in 2009 named Berwick Grammar School, currently catering for boys in Years 5 to 12, with one year level being added each year … ”
Place name recognition is outside the scope of this research, and we used a previously developed parser to extract place names from each of the descriptions (Liu, 2013). Then, three gazetteers were used in conjunction for retrieving (ambiguous) entries for the extracted names, aiming for completeness: the OpenStreetMap Nominatim geocoder (https://nominatim.openstreetmap.org/), the GoogleV3 geocoder (https://developers.google.com/maps/documentation/geocoding/intro), and GeoNames (http://www.geonames.org/). For example, the name St Margaret’s School has a total of 11 corresponding entries from the three gazetteers. The retrieved entries from the three sources were then synthesized, and duplicate entries referring to the same places were removed. The numbers of ambiguous gazetteer entries retrieved are shown in Figure 5, representing the ambiguity of these place names.
Next, the extracted place names were manually linked to their corresponding gazetteer entries to create the ground-truth data for evaluation. For each description document, the input to the algorithms tested in the experiment below is the locations of all ambiguous gazetteer entries of the place names extracted from the document, as a point cloud. An illustrative example based on a document from the campus dataset is provided in Figure 6. The ground-truth locations of these place names (the locations of their corresponding gazetteer entries), which are inside or near the University of Melbourne campus, are highlighted in red in the bottom-right corner. For the algorithms tested below, a place name counts as successfully disambiguated if it is correctly linked to its corresponding gazetteer entry.
4.2 Experiment setup
A total of 16 algorithms are evaluated based on their performance on the datasets: overall minimum distance (OMD), centroid, minimum distance to unambiguous referents (DTUR), DBSCAN, DBSCAN with automatically determined parameter (k-dist), OPTICS, OUTCLUST, CURE, CHAMELEON, HDBSCAN, KMeans, GMM, SNN, Spectral, SOM, and DensityK. For k-dist, the authors did not give a straightforward way to determine a threshold. Therefore, we use the 2σ rule in the same way as it is used in DensityK (Algorithm 1), to enable a fair comparison. For algorithms that have not been used for place name disambiguation before (i.e., from k-dist to SOM), Algorithm 2 is applied to the generated clusters for disambiguation. In case the top-cluster of a place name contains more than one gazetteer entry of this place name, the place name cannot be disambiguated and the case is regarded as a failure. Different parameter values of the algorithms are tested, as shown in Table 1.
Parameter | Values | Algorithms
Distance threshold (meters) | 200, 2000, 20000 | DBSCAN
No. of nearest neighbors | 5, 10, 25 | OUTCLUST, SNN, CHAMELEON, Spectral
No. of clusters to derive | 3, 5, 10, 20 | OPTICS, CURE, KMeans, GMM, Spectral
Minimum points in cluster | 1, 5, 10 | DBSCAN, k-dist
Frequency threshold | 0.1, 0.2, 0.5 | OUTCLUST
Weighting coefficient | 0.1, 1, 10 | CHAMELEON
SOM dimension | (5, 5), (10, 10), (20, 20) | SOM
There is a number of algorithmic features that are important for the place name disambiguation task. The first is robustness: an algorithm should ideally work on different input datasets with minimum variance in precision and distance error. The next feature is minimum parameter-dependency: a parameter-free algorithm, or an algorithm whose parameters can be determined automatically, is desirable. Again, this is because for place name disambiguation no pre-knowledge, such as distances between places or the extent of the space, should be assumed for an input. Lastly, an algorithm should also ideally be parameter-insensitive, i.e., modifying parameter values should not lead to significantly different results. The degree to which each of the algorithms satisfies these features when used for fine-grained place name disambiguation will be discussed.
5 Clustering algorithm performance results
Table 2 presents the precision of each algorithm on the tested datasets, based on the best-performing parameter configuration of each algorithm. DensityK achieves the highest precision, followed by DBSCAN. This is not surprising, as DensityK is designed to be more flexible than DBSCAN in determining cluster distances. In the remainder of this section, the clustering results of the algorithms are discussed individually and compared with each other. This comparison provides better insight into whether each of these algorithms is suitable for the task of this research, regarding both the feature requirements and performance.
Category | Algorithm | Precision
Ad-hoc | OMD | 76.7%
Ad-hoc | Centroid | 57.2%
Ad-hoc | DTUR | 69.3%
Density-based | DBSCAN | 81.5%
Density-based | DBSCAN k-dist | 75.4%
Density-based | OPTICS | 73.2%
Density-based | OUTCLUST | 70.6%
Hierarchical-based | CURE | 78.9%
Hierarchical-based | CHAMELEON | 58.3%
Hierarchical-based | HDBSCAN | 75.7%
Partitioning relocation-based | KMeans | 73.4%
Partitioning relocation-based | GMM | 80.8%
Others | SNN | 70.5%
Others | Spectral | 74.4%
Others | SOM | 73.1%
The new algorithm | DensityK | 83.1%
The clustering results generated by the algorithms used for place name disambiguation in the literature, i.e., overall minimum distance, centroid, minimum distance to unambiguous referents, and DBSCAN, are shown in Figure 7, ranked by the number of points contained. A major drawback of the overall minimum distance and the minimum distance to unambiguous referents methods is that they are sensitive to noise place names: place names whose actual location is not captured by the gazetteer. For example, the place name Union House refers to a building on the University of Melbourne campus. Its true location has no corresponding gazetteer entry, and the ambiguous gazetteer entries retrieved for this place name in the input point cloud lie elsewhere around the world under the same name. Such cases are common for fine-grained place names, while prominent place names (e.g., natural or political) are less likely to be missing from a gazetteer. Another disadvantage of the overall minimum distance method is scalability, as its time cost is significantly larger (over ten times) than that of the other algorithms for most of the datasets tested, particularly for documents with a large number of place names and high ambiguity. The centroid-based method performs badly, as the input point cloud is spread over the earth, and the centroid lies somewhere in the middle, far from the actual focus of the ground-truth locations.
DBSCAN is robust against noise place names, as it can capture the spatial context (the highlighted red region shown in Figure 6) of the original description and neglect entries outside of it. For the example point cloud, when the parameter Eps is set to 2000, the resulting disambiguation precision is higher than with the other values selected from Table 1: more ground-truth entries are missed by the cluster generated with an Eps of 200, and more ambiguous entries are included with an Eps of 20000. For the clusters generated by the k-dist method, the Eps value determined in this case is roughly 300, which is significantly smaller than the most suitable value (somewhere between 1000 and 2000). Consequently, k-dist performs badly in this case.
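The Eps sensitivity described above can be reproduced with scikit-learn's DBSCAN on made-up projected coordinates (metres); the point values are our own illustration, not the paper's dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical candidate entries in metres (projected coordinates).
points = np.array([
    [0, 0], [500, 300], [900, 100], [1200, 600],   # dense "spatial context"
    [50_000, 40_000],                               # noise entry elsewhere
    [1_000_000, -800_000],                          # entry on another continent
])

for eps in (200, 2000, 20000):
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(points)
    # eps=200: the context points lie farther apart than 200 m, so everything
    # is labelled -1 (noise) and the context cluster is missed entirely.
    # eps=2000: the four context points form one cluster; the rest stay noise.
```

Too small an Eps fragments or misses the spatial context; too large an Eps starts absorbing ambiguous entries outside it, mirroring the 200/2000/20000 comparison in the text.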
Figure 8 shows the clustering results generated by two other density-based clustering algorithms, OPTICS and OUTCLUST, for the example input data. OPTICS is designed to overcome DBSCAN's parameter dependency, and is thus expected to perform similarly to DBSCAN with the best-performing parameters. The results show that although OPTICS is more flexible in deriving clusters of various densities on the tested datasets, this is actually a disadvantage for the task of this research: OPTICS tends to aggregate points from the ground-truth spatial context with other points that are relatively close to it, despite these marginal points having relatively larger local densities. In addition, the parameter NumberOfClusters of OPTICS is problematic to define. Nevertheless, it is found that setting its value to 10 generally leads to optimal results regardless of input. OUTCLUST has the same drawback of merging nearby points into the spatial context, and this behavior is governed by both of its parameters. These two parameters are more sensitive to the input data than NumberOfClusters of OPTICS, and there is no straightforward method to determine their values either. A large value for the number of neighbors will result in few clusters, as more data points are regarded as neighbors, and vice versa. Compared to OPTICS, OUTCLUST focuses more on relative density, considering nearest neighbors rather than absolute density; thus boundary points that are relatively close to some clusters while isolated from others are more likely to be merged.
Clustering results by the hierarchical clustering algorithms are shown in Figure 9. CURE requires a NumberOfClusters parameter, similar to OPTICS, and the clusters it derives are generally similar to those of OPTICS. CHAMELEON is more parameter-sensitive than CURE, and its disambiguation precision is not as good as CURE's even with the best-performing parameters. As for HDBSCAN, although it does not require any mandatory input parameters, its precision for some input data is only slightly worse than DensityK's. However, HDBSCAN is not robust against different input data: it performs quite well for some data, but significantly worse for others. We discuss this in more detail later in this section.
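Neither CURE nor CHAMELEON ships with scikit-learn, so the sketch below uses a generic agglomerative clusterer as a stand-in merely to show the NumberOfClusters-style parameter that hierarchical methods require; the coordinates are made up.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical candidate entries: two tight groups plus one isolated entry.
points = np.array([
    [0, 0], [1, 0], [0, 1],        # tight group A
    [100, 100], [101, 100],        # tight group B
    [500, -400],                   # isolated entry
])

# n_clusters must be chosen up front, just like NumberOfClusters for CURE;
# single linkage merges nearest groups first.
labels = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(points)
```

Unlike density-based methods, the hierarchy is cut at exactly the requested number of clusters, so a poor choice of this parameter directly distorts the recovered spatial context.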
Clustering results using the partitioning relocation-based algorithms are shown in Figure 10. The KMeans algorithm aims at minimizing intra-cluster distances while dividing the data into a preset number of clusters. As a partition-based algorithm, it is not expected to perform well on fine-grained place name disambiguation, which is not a classification problem, and its resulting average precision is indeed worse than that of HDBSCAN and CURE. For some input data, GMM performs well and achieves the same precision as DensityK, or as DBSCAN with the best-performing parameter values. Its performance is generally good (measured by average precision) and robust (e.g., compared to HDBSCAN, which is discussed later). In addition, for most input data, different values of the number of clusters, once larger than 10, make little difference to the clustering, in contrast to algorithms such as KMeans or CURE. Still, there is no easy way to determine this value automatically, and a single value does not always lead to the highest precision for different input data.
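A minimal GMM sketch, using scikit-learn's GaussianMixture on synthetic data of our own making (a tight "spatial context" around hypothetical Melbourne-like coordinates plus globally scattered entries); n_components plays the role of the number-of-clusters parameter discussed above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for an input point cloud (lon, lat): 40 points in a
# tight context region, 20 entries scattered across the globe.
context = rng.normal(loc=(144.96, -37.80), scale=0.01, size=(40, 2))
scatter = rng.uniform(low=(-180.0, -60.0), high=(180.0, 60.0), size=(20, 2))
points = np.vstack([context, scatter])

# Values of n_components beyond ~10 tend to change the result little for
# such data, consistent with the observation in the text.
gmm = GaussianMixture(n_components=10, random_state=0).fit(points)
labels = gmm.predict(points)
largest = np.bincount(labels).argmax()  # component holding the most points
```

The component holding the most points would typically correspond to the tight context region, which is then taken as the spatial context for disambiguation.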
Figure 11 shows the results using the remaining three algorithms. SNN is highly sensitive to the parameter k, the number of nearest neighbors to consider, and different values often result in significantly different clustering results, as shown in the figure. A large value tends to result in only a few large, well-separated clusters, on which small local variations in density have little impact. Similar to OUTCLUST, there is no easy way to determine a suitable, meaningful number of nearest neighbors to consider. Spectral clustering also suffers from parameter sensitivity, for both of its parameters. Its precision is almost always worse than that of algorithms such as DBSCAN, CURE, and GMM, even with the best-performing parameter values. The clusters generated by SOM are often similar in pattern to those derived by CURE or KMeans, but the average precision is much lower (even lower than spectral clustering). One advantage of SOM is that the SOM dimension can easily be set to large values, which typically leads to higher precision than adopting small values. Beyond a certain dimension, continually increasing the value makes minimal difference to the resulting clusters, as well as to precision.
The result of DensityK is shown in Figure 12. The clusters generated are similar to those of DBSCAN with Eps set to 2000 for this particular input, as shown in Figure 7. Compared to the results generated by the other algorithms, shown in Figures 8, 9, 10 and 11, the first-ranking cluster (the purple circles) generated by DensityK is the most focused and the most similar to the highlighted ground-truth spatial context shown in Figure 6.
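The published DensityK derives its density threshold from the data itself; the toy sketch below illustrates only the general idea of density thresholding. The planar distance, the `radius` and `factor` parameters, and the sample points are all our own assumptions, not part of the original algorithm.

```python
import math

def planar_dist(p, q):
    # Euclidean distance on projected coordinates; illustration only.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def density_threshold_clusters(points, radius, factor=1.2):
    """Toy density-thresholding sketch (NOT the published DensityK):
    1. local density = number of neighbours within `radius`;
    2. keep points whose density exceeds `factor` times the mean density;
    3. group kept points lying within `radius` of each other."""
    dens = [sum(planar_dist(p, q) <= radius for q in points) - 1 for p in points]
    cutoff = factor * (sum(dens) / len(dens))
    kept = [i for i, d in enumerate(dens) if d > cutoff]
    clusters, seen = [], set()
    for i in kept:
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:
            j = stack.pop()
            if j in seen:
                continue
            seen.add(j)
            comp.append(j)
            stack.extend(m for m in kept if m not in seen
                         and planar_dist(points[m], points[j]) <= radius)
        clusters.append(sorted(comp))
    return clusters

# Four dense context points and two isolated noise entries.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (100, 100), (200, -50)]
clusters = density_threshold_clusters(pts, radius=2.0)
```

Like the described DensityK, this keeps only regions whose local density stands out from the rest of the point cloud, so isolated noise entries never enter a cluster; the pairwise density computation also makes the O(n²) worst case plain.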
Of the tested algorithms, OPTICS, CURE, HDBSCAN, GMM, and DensityK seem most suitable for place name disambiguation considering the feature requirements. They provide good disambiguation precision, and either require no input parameters (HDBSCAN and DensityK) or have parameters that are easy to determine and work well on various input data (the number of clusters for OPTICS, CURE, and GMM). In comparison, parameters such as k or Eps are more sensitive to the input, and cannot be determined easily each time a new input is given. Here we further evaluate the robustness of the five algorithms over different input data, in terms of variation in precision and average distance error, i.e., the average distance between the ground-truth locations of place names and the entries selected by these algorithms. We randomly select documents from our dataset, and the results are shown in Figure 13. DensityK almost always has the highest precision, as well as low variation compared to the other algorithms, particularly HDBSCAN and OPTICS. In terms of distance error, DensityK has the least variance as well as the overall minimum distance error.
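The two evaluation measures can be sketched as follows. The `tolerance_km` cut-off for counting a selection as correct and the name-to-coordinate data layout are our assumptions for illustration; the paper's precision is defined over selected gazetteer entries.

```python
import math

def haversine_km(p, q):
    # Great-circle distance between two (lat, lon) pairs in kilometres.
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

def evaluate(selected, ground_truth, tolerance_km=1.0):
    """Precision = share of names whose selected entry lies within
    `tolerance_km` of the ground truth; also return the average
    distance error (hypothetical layout: name -> (lat, lon))."""
    errors = [haversine_km(selected[n], ground_truth[n]) for n in ground_truth]
    precision = sum(e <= tolerance_km for e in errors) / len(errors)
    return precision, sum(errors) / len(errors)

# Toy example: one correct selection, one badly wrong one (London for Melbourne).
selected = {"a": (-37.796, 144.961), "b": (51.5, -0.12)}
truth = {"a": (-37.797, 144.962), "b": (-37.80, 144.967)}
precision, avg_err = evaluate(selected, truth)
```

Reporting both measures matters: a single grossly misplaced selection barely moves precision but dominates the average distance error, which is why Figure 13 considers their variation separately.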
Figure 14 shows the clustering results for places merged from the two datasets using DensityK, representing the spatial contexts of the two data sources in which the descriptions are embedded, i.e., the University of Melbourne campus, and Melbourne.
6 Conclusions
Place descriptions in everyday communication provide a rich source of knowledge about places. In order to utilize such knowledge in information systems, an important step is to locate the places being referred to. The problem of locating place names from text sources is often called toponym resolution, and consists of two tasks: place name identification from text, and place name disambiguation. This research addresses the second task, and more specifically, disambiguating fine-grained place names extracted from place descriptions. We focus on clustering-based disambiguation approaches, as clustering approaches require minimal pre-knowledge of the place names to be disambiguated compared to knowledge-based and machine-learning-based approaches.
For this purpose, we first select clustering algorithms that have been used for place name disambiguation in the literature, or that come from other communities (e.g., data mining) and are regarded as promising for this task. We evaluate and compare the performance of these algorithms on two different datasets using precision and distance error. For algorithms that require parameters, different values of each parameter are tested in a grid-search manner. We then analyze the performance of each algorithm and its causes, its parameter dependency and parameter sensitivity, and its robustness (in terms of the variance of its performance over different input data), and discuss the suitability of each algorithm for fine-grained place name disambiguation based on these criteria.
Furthermore, a new clustering algorithm, DensityK, is presented. It is designed to overcome several identified limitations of the previous algorithms, and it outperforms the other tested algorithms, achieving state-of-the-art disambiguation precision on the test datasets. The algorithm analyzes the local densities of an input point cloud, which consists of all ambiguous gazetteer entries corresponding to the place names extracted from an input document, and derives a density threshold for determining clusters that have significantly larger densities than the remaining points. Compared to the other algorithms, DensityK is parameter-independent and robust against input data with various spatial extents, densities, and granularities, which makes it the most desirable for the task of this research. This is reflected by its consistently higher precision and overall minimum distance error compared to the other competitive algorithms. The worst-case time complexity of the algorithm is the same as that of DBSCAN (O(n²)), when both are considered without any indexing mechanism for neighborhood queries, and better than that of algorithms such as overall minimum distance clustering.
The focus of this research is to provide recommendations for the selection of appropriate methods of clustering-based disambiguation for fine-grained place names from place descriptions. We have not yet considered further optimizing the developed algorithm, although we explained briefly in Section 3.1 how indexing and optional parameters can be used to reduce processing time. Optimization is important for applications such as processing streaming data for goals such as geographic information retrieval. Finally, a clustering algorithm for this purpose can be used in conjunction with other knowledge-based or machine-learning-based approaches to enhance precision, which is beyond the scope of this research.
References
 Adelfio and Samet (2013) Adelfio MD, Samet H (2013) Structured toponym resolution using combined hierarchical place categories. In: Proceedings of the 7th Workshop on Geographic Information Retrieval, pp 49–56
 Amitay et al (2004) Amitay E, Har'El N, Sivan R, Soffer A (2004) Web-a-where: geotagging web content. In: Proceedings of the SIGIR '04 Conference on Research and Development in Information Retrieval, pp 273–280
 Angiulli (2006) Angiulli F (2006) Clustering by exceptions. In: Proceedings of the National Conference on Artificial Intelligence, pp 312–317
 Ankerst et al (1999) Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of the ACM SIGMOD Conference, Philadelphia, PA, pp 49–60
 Berkhin (2006) Berkhin P (2006) A survey of clustering data mining techniques. In: Kogan J, Nicholas C, Teboulle M (eds) Grouping Multidimensional Data, Springer, Berlin, Heidelberg, pp 25–71
 Buscaldi (2011) Buscaldi D (2011) Approaches to disambiguating toponyms. SIGSPATIAL Special 3(2):16–19
 Buscaldi and Magnini (2010) Buscaldi D, Magnini B (2010) Grounding toponyms in an Italian local news corpus. In: Proceedings of the 6th Workshop on Geographic Information Retrieval, pp 70–75
 Buscaldi and Rosso (2008a) Buscaldi D, Rosso P (2008a) A conceptual density-based approach for the disambiguation of toponyms. International Journal of Geographical Information Science 22(3):301–313
 Buscaldi and Rosso (2008b) Buscaldi D, Rosso P (2008b) Map-based vs. knowledge-based toponym disambiguation. In: Proceedings of the 2nd International Workshop on Geographic Information Retrieval, pp 19–22
 Campello et al (2013) Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Pei J, Tseng VS, Cao L, Motoda H, Xu G (eds) Advances in Knowledge Discovery and Data Mining, Springer, Berlin, Heidelberg, pp 160–172
 Celeux and Govaert (1992) Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis 14(3):315–332
 Cheng et al (2010) Cheng Z, Caverlee J, Lee K (2010) You are where you tweet: a content-based approach to geolocating Twitter users. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp 759–768
 Derungs et al (2012) Derungs C, Palacio D, Purves RS (2012) Resolving fine granularity toponyms: Evaluation of a disambiguation approach. In: Proceedings of the 7th International Conference on Geographic Information Science, pp 1–5
 Ertöz et al (2003) Ertöz L, Steinbach M, Kumar V (2003) Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the 2003 SIAM International Conference on Data Mining, pp 47–58
 Ester et al (1996) Ester M, Kriegel HP, Sander J, Xu X, et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, pp 226–231
 Guha et al (1998) Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: ACM Sigmod Record, vol 27, pp 73–84
 Habib et al (2012) Habib MB, van Keulen M (2012) Improving toponym disambiguation by iteratively enhancing certainty of extraction. In: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, KDIR 2012, Barcelona, Spain, pp 399–410
 Hartigan and Wong (1979) Hartigan JA, Wong MA (1979) Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28(1):100–108
 Karypis et al (1999) Karypis G, Han EH, Kumar V (1999) Chameleon: Hierarchical clustering using dynamic modeling. Computer 32(8):68–75
 Kim et al (2015) Kim J, Vasardani M, Winter S (2015) Harvesting large corpora for generating place graphs. In: Workshop on Cognitive Engineering for Spatial Information Processes, COSIT 2015, pp 20–26
 Kohonen (1998) Kohonen T (1998) The self-organizing map. Neurocomputing 21(1–3):1–6
 Leidner (2008) Leidner JL (2008) Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names. Universal-Publishers, Boca Raton, FL
 Leidner et al (2003) Leidner JL, Sinclair G, Webber B (2003) Grounding spatial named entities for information extraction and question answering. In: Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, pp 31–38
 Lieberman et al (2007) Lieberman MD, Samet H, Sankaranarayanan J, Sperling J (2007) STEWARD: Architecture of a spatiotextual search engine. In: Samet H, Shahabi C, Schneider M (eds) Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic Information Systems, Seattle, WA, pp 186–193
 Liu (2013) Liu F (2013) Automatic identification of locative expressions from informal text. PhD thesis, University of Melbourne, Australia
 Moncla et al (2014) Moncla L, Renteria-Agualimpia W, Nogueras-Iso J, Gaio M (2014) Geocoding for texts with fine-grain toponyms: an experiment on a geoparsed hiking descriptions corpus. In: Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 183–192
 Palacio et al (2015) Palacio D, Derungs C, Purves R (2015) Development and evaluation of a geographic information retrieval system using fine grained toponyms. Journal of Spatial Information Science 2015(11):1–29
 Ripley (1976) Ripley BD (1976) The SecondOrder Analysis of Stationary Point Processes. Journal of Applied Probability 13(2):255–266
 Roberts et al (2010) Roberts K, Bejan CA, Harabagiu SM (2010) Toponym Disambiguation Using Events. In: FLAIRS Conference, vol 10, p 1
 Roller et al (2012) Roller S, Speriosu M, Rallapalli S, Wing B, Baldridge J (2012) Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp 1500–1510
 Smith and Crane (2001) Smith DA, Crane G (2001) Disambiguating geographic names in a historical digital library. In: Research and Advanced Technology for Digital Libraries: Fifth European Conference (ECDL 2001), pp 127–136
 Smith and Mann (2003) Smith DA, Mann GS (2003) Bootstrapping toponym classifiers. In: Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, Association for Computational Linguistics, pp 45–49
 Teitler et al (2008) Teitler BE, Lieberman MD, Panozzo D, Sankaranarayanan J, Samet H, Sperling J (2008) NewsStand: a new view on news. In: Aref WG, Mokbel MF, Schneider M (eds) Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp 144–153
 Vasardani et al (2013a) Vasardani M, Timpf S, Winter S, Tomko M (2013a) From descriptions to depictions: A conceptual framework. In: Tenbrink T, Stell J, Galton A, Wood Z (eds) Spatial Information Theory: 11th International Conference, COSIT 2013, Springer, pp 299–319
 Vasardani et al (2013b) Vasardani M, Winter S, Richter KF (2013b) Locating place names from place descriptions. International Journal of Geographical Information Science 27(12):2509–2532
 Wing and Baldridge (2014) Wing B, Baldridge J (2014) Hierarchical discriminative classification for text-based geolocation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 336–348