A Pragmatic Guide to Geoparsing Evaluation

by Milan Gritta, et al.
University of Cambridge

Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, data and metrics used to establish state-of-the-art systems. Evaluation is further made inconsistent, even unrepresentative of real-world usage, by the lack of distinction between the different types of toponyms, which necessitates new guidelines, a consolidation of metrics and a detailed toponym taxonomy with implications for Named Entity Recognition (NER). To address these deficiencies, our manuscript introduces such a framework in three parts. Part 1) Task Definition: clarified via corpus linguistic analysis proposing a fine-grained Pragmatic Taxonomy of Toponyms with new guidelines. Part 2) Evaluation Data: shared via a dataset called GeoWebNews to provide test/train data to enable immediate use of our contributions. In addition to fine-grained Geotagging and Toponym Resolution (Geocoding), this dataset is also suitable for prototyping machine learning NLP models. Part 3) Metrics: discussed and reviewed for a rigorous evaluation with appropriate recommendations for NER/Geoparsing practitioners.




1 Introduction

Geoparsing aims to translate toponyms in free text into geographic coordinates. Toponyms are weakly defined as “place names”; however, we will clarify and extend this underspecified definition in Section 3. For the headline: “Springfield robber escapes from Waldo County Jail. Maine police have launched an investigation.”, the geoparsing pipeline is: (1) Toponym recognition and extraction [Springfield, Waldo County Jail, Maine], this step is called Geotagging; and (2) Disambiguating and linking each to geographic coordinates [(45.39, -68.13), (44.42, -69.01), (45.50, -69.24)], this step is called Toponym Resolution (also Geocoding). Geoparsing is an atomic constituent of many Geographic Information Retrieval (GIR), Extraction (GIE) and Analysis (GIA) tasks such as determining a document’s geographic scope steinberger2013introduction , Twitter-based disaster response de2018taggs and mapping avvenuti2018crismap , spatio-temporal analysis of tropical research literature palmblad2017spatiotemporal , business news analysis abdelkader2015brands , disease detection and monitoring allen2017global as well as analysis of historical events such as the Irish potato famine tateosian2017tracking . Geoparsing can be evaluated in a highly rigorous manner, enabling a robust comparison of state-of-the-art (SOTA) methods. This manuscript provides an end-to-end Pragmatic Guide to Geoparsing Evaluation for that purpose. End-to-end means to 1) critically review and extend the definition of a toponym, i.e. what is to be evaluated in geoparsing and why it is important/useful; 2) review, recommend and create high-quality open resources to expedite future research; and 3) outline, review and consolidate metrics for each stage of the geoparsing pipeline, i.e. how to robustly evaluate.
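The two pipeline stages described above can be illustrated with a minimal sketch. It assumes a toy in-memory gazetteer (`GAZETTEER`, a hypothetical name) holding only the three coordinates quoted for the headline example, and it uses exact string lookup instead of a trained tagger, so it performs no real disambiguation.

```python
from typing import Dict, List, Tuple

Coordinates = Tuple[float, float]

# Hypothetical toy gazetteer: the coordinates are those quoted for the
# headline example. A real system would query a database such as Geonames.
GAZETTEER: Dict[str, Coordinates] = {
    "Springfield": (45.39, -68.13),
    "Waldo County Jail": (44.42, -69.01),
    "Maine": (45.50, -69.24),
}


def geotag(text: str) -> List[Tuple[str, int]]:
    """Stage 1 (Geotagging): find toponym mentions and their start offsets.

    Exact string lookup stands in for a sequence tagger here.
    """
    found = []
    for name in GAZETTEER:
        start = text.find(name)
        if start != -1:
            found.append((name, start))
    # Return mentions in the order in which they appear in the text.
    return sorted(found, key=lambda pair: pair[1])


def resolve(mentions: List[Tuple[str, int]]) -> List[Tuple[str, Coordinates]]:
    """Stage 2 (Toponym Resolution): link each mention to coordinates.

    With one gazetteer entry per name there is nothing to disambiguate;
    a real resolver must choose between many candidate Springfields.
    """
    return [(name, GAZETTEER[name]) for name, _ in mentions]


headline = ("Springfield robber escapes from Waldo County Jail. "
            "Maine police have launched an investigation.")
for name, coords in resolve(geotag(headline)):
    print(name, coords)
```

Keeping the stages as separate functions mirrors the way the two steps are evaluated separately in the rest of the paper.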

Due to the essential NER component in geoparsing systems santos2015using ; delozier2015gazetteer ; karimzadeh2013geotxt ; gritta2017s ; jurgens2015geolocation , our investigation and proposals have a strong focus on NER’s location extraction capability. We demonstrate that off-the-shelf NER taggers are inadequate for location extraction due to the lack of ability to extract and classify the pragmatic types of toponyms (Table 1). In an attempt to assign coordinates to an example sentence, “A French bulldog bit an Australian tourist in a Spanish resort.”, virtually all current NER tools fail to differentiate between the literal and associative uses of these adjectival toponyms. (Throughout the paper, we use the term Literal to denote a toponym (or its context) that refers directly to the physical location and the term Associative for a toponym/concept (or its context) that is only associated with a place. The full exposition follows in Section 3.) A more detailed example (Table 2) and a survey of previous work (Section 2.1) show that the definition and handling of toponyms is inconsistent and unfit for advanced geographic NLP research. In fact, beyond “a place name” definition based on shallow syntactical and/or morphological rules, a deep pragmatic/contextual toponym definition in GIR and NER has not been defined in previous work. This underspecification results in erroneous and unrepresentative real-world extraction of place names incurring precision errors and recall errors. To that end, we propose a Pragmatic Taxonomy of Toponyms required for a rigorous geoparsing evaluation, which includes recommended datasets and metrics.

Why a Pragmatic Guide?

Pragmatics pustejovsky1991generative is the theory of a generative approach to word meaning, i.e. how context contributes to and changes the semantics of words and phrases. This is the first time, to our best knowledge, that the definition of fine-grained toponym types has been quantified in such detail using a representative sample of general topic, globally distributed news articles. We release the GeoWebNews dataset to challenge researchers to develop Machine Learning (ML) algorithms to evaluate classification/tagging performance based on deep pragmatics rather than shallow syntactic features. Section 2 gives a background on Geoparsing, NER, GIE and GIR. We present the new taxonomy in Section 3, describing and categorising toponym types. Section 4 introduces the GeoWebNews dataset, annotation and resources. We also evaluate geotagging and toponym resolution on the new dataset, comparing the performance of a trained sequence tagging model to Spacy NER and Google NLP. Finally, in Section 5, we conduct a comprehensive review of current evaluation methods and justify the recommended framework.

1.1 Summary of the most salient findings

Toponym semantics have been underspecified in NLP literature. Toponyms can refer to physical places or entities associated with a place as we propose in a new taxonomy. Their distribution in a sample of 200 news articles is 53% literal and 47% associative. Until now, this type of fine-grained toponym analysis was not possible. We provide a dataset annotated by (computational) linguists enabling immediate evaluation of our theoretical proposals. GeoWebNews.xml can be used to evaluate Geotagging, NER, Toponym Resolution and to develop ML models from limited training data. A total of 2,720 toponyms were annotated with Geonames. Data augmentation was added with an extra 3,460 annotations although effective implementation is challenging. We also found that popular NER taggers appear not to use context information/semantics, relying instead on the entity’s primary word sense (Table 2). We show that this issue can be addressed by training an effective geotagger from limited training data (F-Score=88.6), outperforming Google Cloud NLP (F-Score=83.2) and Spacy NLP/NER (F-Score=74.9). In addition, effective 2-class (Literal versus Associative) geotagging (F-Score=77.6) is also feasible. The best toponym resolution scores for GeoWebNews.xml were 95% Accuracy@161km, an AUC of 0.06 and a Mean Error of 188km. Lastly, we provide a critical review of available metrics and important nuances of evaluation such as database choice, system scope, data domain/distribution, statistical testing, etc. All recommended resources are available on GitHub.
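The three toponym resolution figures quoted above (Accuracy@161km, AUC and Mean Error) are all functions of the per-toponym error distance between predicted and gold coordinates. The sketch below shows one way to compute them. The function names are hypothetical, and the AUC normalisation (log error divided by the log of roughly half the Earth's circumference, 20,039 km) is an assumption, because the metric is only defined later in the paper.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0
MAX_ERROR_KM = 20039.0  # assumed maximum error: about half the Earth's circumference


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def accuracy_at_km(errors: List[float], threshold_km: float = 161.0) -> float:
    """Share of toponyms resolved within threshold_km (161 km is about 100 miles)."""
    return sum(e <= threshold_km for e in errors) / len(errors)


def mean_error_km(errors: List[float]) -> float:
    """Arithmetic mean of the error distances."""
    return sum(errors) / len(errors)


def auc(errors: List[float]) -> float:
    """Normalised area under the sorted log-error curve (lower is better).

    Trapezoidal rule with unit spacing; the normalisation is an assumption.
    """
    logs = sorted(math.log(e + 1.0) for e in errors)
    ceiling = math.log(MAX_ERROR_KM + 1.0)
    if len(logs) == 1:
        return logs[0] / ceiling
    area = sum((logs[i] + logs[i + 1]) / 2 for i in range(len(logs) - 1))
    return area / ((len(logs) - 1) * ceiling)


# Usage with made-up predictions, for illustration only.
gold = [(45.39, -68.13), (44.42, -69.01)]
predicted = [(45.39, -68.13), (39.80, -89.64)]
errors = [haversine_km(g, p) for g, p in zip(gold, predicted)]
print(accuracy_at_km(errors), mean_error_km(errors), auc(errors))
```

Taking the logarithm before integrating keeps a handful of very large errors from dominating the score, which is the usual motivation for reporting AUC alongside the mean.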


2 Background

Before we critically review how to rigorously evaluate geoparsing and introduce a new dataset, we first need to clarify what exactly is to be evaluated and why. We focus on the pragmatics of toponyms for improved geoparsing of events described in text. This requires differentiating literal from associative types as well as significantly increasing toponym recall by including locational entities ignored by current models. When a word spells like a place (shares its orthographic form), it does not necessarily mean it is a place or has equivalent meaning, e.g.: “Paris (a person) said that Parisian (associative toponym) artists don’t have to live in Paris (literal toponym).” and “Iceland (a UK supermarket) doesn’t sell Icelandic (associative toponym) food, it’s not even the country of Iceland (literal toponym).” To advance research in toponym extraction and other associated NLP tasks, we need to move away from the current practice of mostly ignoring the context of a toponym, relying on the entity’s dominant word sense and treating toponyms as semantically equivalent (see Table 2). The consequences of this are disagreements and incompatibilities in toponym classification leading to unrepresentative real-world performance. It is difficult to speculate whether the reason for this underspecification is the lack of available quality training data hence lower traction in the NLP community or the satisfaction with a simplified approach, however, we aim to encourage active research and development through our contributions.

2.1 Geographic datasets and the linguistics of toponyms

Previous work in annotation of geographic NLP datasets constitutes our primary source of enquiry into recent research practices, especially the lack of linguistic definition of toponym types. An early definition of an Extended Named Entity Hierarchy sekine2002extended was based on geographic feature types (https://nlp.cs.nyu.edu/ene/version7_1_0Beng.html), i.e. address, country, region, GPE, etc. Geoparsing and NER require a deeper contextual perspective based on how toponyms are used in practice by journalists, writers and social media users, something that a static database lookup cannot determine. CoNLL 2002 sang2002introduction and 2003 tjong2003introduction similarly offer no definition of a toponym beyond what is naively thought of as a location, i.e. a word spelled like a place with its primary word sense being a location. Schemes such as ACE doddington2004automatic bypass toponym type distinction, classifying entities such as governments via a simplification to a single tag GPE: A Geo-Political Entity. Modern NER parsers such as Spacy honnibal-johnson:2015:EMNLP use similar schemes weischedel2013ontonotes to collapse different taxonomic types into a single tag avoiding the need for deeper understanding of context. A simplified tag set (LOC, ORG, PER, MISC) based on Wikipedia nothman2013learning is used by NER taggers such as Illinois NER redman2016illinois and Stanford NLP manning-EtAl:2014:P14-5 , featured in Table 2. The table shows the popular NER taggers’ classification behaviour indicating weak and inconsistent usage of context.
The SpatialML mani2010spatialml scheme is focused only on spatial reasoning e.g. X location north of Y. Metonymy markert2002metonymy , which is a substitution of a related entity for a concept originally meant, was acknowledged but not annotated due to the lack of training of Amazon Mechanical Turk annotators. Facilities were always tagged in the SpatialML corpus regardless of the context in which they’re being used. The corpus is available at a cost of $500-$1,000. The Message Understanding Conferences (MUC) hirschman1998evolution have historically not tagged adjectival forms of locations such as “American exporters”; we assert that there is no difference between that and “U.S. exporters”, which would almost certainly be annotated. The Location Referring Expression corpus matsuda2015annotating has annotated toponyms including locational expressions such as parks, buildings, bus stops and facilities in 10,000 Japanese tweets. Systematic polysemy alonso2013annotation has been taken into account for facilities, but not extended to other toponyms.

The WoTR corpus delozier2016data of historical US documents also did not define toponyms. However, browsing the dataset, expressions such as “Widow Harrow’s house” and “British territory” were annotated. In Section 3, we shall claim this is beyond the scope of toponyms, i.e. “house” and “territory” should not be tagged. The authors do acknowledge, but do not annotate metonymy, demonyms and nested entities. Systematic polysemy in GIR such as metonymy should be differentiated during toponym extraction and classification, something acknowledged as a problem more than ten years ago leveling2008metonymy . Section 3 elaborates on the taxonomy of toponyms beyond just the metonymic cases. Geocorpora wallgrun2018geocorpora is a Twitter-based geoparsing corpus with around 6,000 toponyms with buildings and facilities annotated. The authors acknowledge that toponyms are frequently used in a metonymic manner, however, these cases have not been annotated after browsing the open dataset. Adjectival toponyms have also been left out. We show that these constitute around 13% of all toponyms thus should be included to boost recall.

The LGL corpus lieberman2010geotagging loosely defines toponyms as “spatial data specified using text”. Low emphasis is placed on geotagging, focusing instead on toponym resolution. Authors agree that NER is inadequate for GIR tasks. It is often the case that papers emphasise the geographic ambiguity of toponyms but not their semantic ambiguity. The CLUST dataset lieberman2011multifaceted , by the same author, describes toponyms simply as “textual references to geographic locations”. Homonyms are discussed as is the low recall and related issues of NER taggers, which makes them unsuitable for achieving high geotagging fidelity. Metonymy was not annotated; some adjectival toponyms have been tagged, though sparsely and inconsistently. There is no distinction between literal and associative toponyms. Demonyms were tagged but with no special annotation, hence treated as ordinary locations with no descriptive statistics offered. TR-News kamalloo2018coherent is a quality geoparsing corpus despite the paucity of annotation details or IAA figures in the paper. A brief analysis of the open dataset showed that embedded toponyms, facilities and adjectival toponyms were annotated, which substantially increases recall, although no special tags were used, so descriptive statistics could not be gathered. Homonyms, coercion, metonymy, demonyms and languages were not annotated, nor was the distinction between literal, mixed and associative toponyms. Even so, we still recommend it as a suitable resource for geoparsing in later sections.

PhD Theses

are themselves comprehensive collections of a large body of relevant research and therefore important sources of prior work. Despite this not being the convention in NLP publishing, we outline the prominent PhD theses from the past 10+ years to show that toponym types have not been organised into a pragmatic taxonomy and that evaluation metrics in geocoding are in need of review and consolidation. We also cite their methods and contributions as additional background for discussions throughout the paper. The earliest comprehensive research on toponym resolution originated in a 2007 PhD thesis leidner2008toponym . Toponyms were specified as “names of places as found in a text”. The work recognised the ambiguity of toponyms in different contexts and was often cited by later research papers though until now, these linguistic regularities have not been formally and methodically studied, counted, organised and released as high fidelity open resources. A geographic mining thesis da2008geographically defined toponyms as “geographic names” or “place names”. It mentions homonyms, which are handled with personal name exclusion lists rather than learned by contextual understanding. A Wikipedia GIR thesis overell2009geographic has no definition of toponyms and limits the analysis to nouns only. Another GIR thesis andogah2010geographically discusses the geographic hierarchy of toponyms as found in gazetteers, i.e. feature types instead of linguistic types. A toponym resolution thesis buscaldi2010toponym describes toponyms as “place names”, once again mentions metonymy without handling these cases citing lack of resources, which our work provides.

A Twitter geolocation thesis han2014improving provides no toponym taxonomy, nor does the Named Entity Linking thesis dos2013linking . A GIR thesis moncla2015automatic defines a toponym as a spatial named entity, i.e. a location somewhere in the world bearing a proper name, discusses syntactical rules and typography of toponyms but not their semantics. The authors recognise this as an issue in geoparsing but no solution is proposed. The GIA thesis ferres2017knowledge acknowledges but doesn’t handle cases of metonymy, homonymy and non-literalness while describing a toponym as “a geographical place name”. Recent Masters theses also follow the same pattern such as a toponym resolution thesis kolkman2015cross , which says a toponym is a “word of phrase that refers to a location”. While none of these definitions are incorrect, they are very much underspecified. Another Toponym Resolution thesis delozier2016data acknowledges relevant linguistic phenomena such as metonymy and demonyms, however, no resources, annotation or taxonomy is given; toponyms were established as “named geographic entities”. This background section presented a multitude of research contributions using, manipulating and referencing toponyms, however, without a deep dive into their pragmatics, i.e. what is a toponym from a linguistic point of view and the practical NLP implications of that. Without an agreement on the what, why and how of geoparsing, the evaluation of SOTA systems cannot be consistent and robust.

3 A Pragmatic Taxonomy of Toponyms

While the evaluation metrics, covered in Section 5, are relevant only to geoparsing, Sections 3 and 4 have implications for core NLP tasks such as NER. To begin with defining a Toponym Taxonomy, we start with a location. A location is any of the potentially infinite physical points on Earth identifiable by coordinates. With that in mind, a toponym is any named entity that labels a particular location. Toponyms are thus a subset of locations as most locations do not have proper names. Further to the definition and extending the work from the Background section, toponyms exhibit various degrees of literalness as their referents may not be physical locations but other entities as is the case with metonyms, languages, homonyms, demonyms, some embedded toponyms and associative modifiers, all covered shortly.

Structurally, toponyms occur in clauses, which are the smallest grammatical units expressing a full proposition. Within clauses, which serve as the context, toponyms are embedded in noun phrases (NP). A toponym can occur as the head of the NP, for example “Accident in Melbourne.” Toponyms also frequently modify NP heads. Modifiers can occur before or after the NP head such as in “President of Mongolia” versus “Mongolian President” and can have an adjectival form “European cities” or a noun form “Europe’s cities”.

In theory, though not always obvious in practice, the classification of the toponym type is driven by (1) the semantics of the NP; which is conditional on (2) the NP context of the surrounding clause; these may be combined in a hybrid approach dong2015hybrid . It is this interplay of semantics and context, seen in Table 1, that determines the type of the following toponyms (literals=bold, associative=italics): “The Singapore project is sponsored by Australia.” and “He has shown that in Europe and last year in Kentucky.” and “The soldier was operating in Manbij with Turkish troops when the bomb exploded.” As a result of our corpus linguistic analysis, we propose two top-level taxonomic types (a) literal: where something is happening or is physically located; and (b) associative: a concept that is associated with a toponym (Table 1). We also propose that for applied NLP, it is sufficient and feasible to distinguish between literal and associative toponyms. Before we introduce the toponym taxonomy, we discuss another group of toponyms that are typically ignored.

3.1 Non-Toponyms

There is a group of locational entities that are typically not classified as toponyms, denoted as Non-Toponyms in this paper. We shall assert, however, that these are in fact equivalent to “regular” toponyms. There are three types, described in the next subsections: a) Embedded Literal such as “The British Grand Prix” and “Louisiana Purchase”; b) Embedded Associative, for example: “Toronto Police” and “Brighton City Council”. Embedded toponyms are nested inside other entities, which is a well-explored NLP task in multiple languages marquez2007semeval ; byrne2007nested ; and c) Coercion, which is when a polysemous entity has its less dominant word sense coerced to the location class by the context. Failing to extract Non-Toponyms lowers real-world recall, missing out on valuable geographical data. In our diverse and broadly-sourced dataset, Non-Toponyms constituted 16% of all toponyms. In Figure 1, these types have a red border.

Figure 1: Pragmatic Taxonomy of Toponyms. Types with a red border are Non-Toponyms. Classification algorithm: If the context indicates a literal or is ambiguous/mixed, then the type is literal. If the context is associative, then (a) for non-modifiers the toponym is associative (b) for modifiers, if the head is mobile and/or abstract, then the toponym is associative, otherwise it is literal.
Toponym Type | NP Semantics Indicates | NP Context Indicates
--- | --- | ---
Literals | Noun Literal Type | Literal Type
Literal Modifiers | Noun/Adjectival Literal | Literal or Associative
Mixed | Noun/Adjectival Literal | Ambiguous or Mixed
Coercion | Non-Toponym | Literal Type
Embedded Literal | Non-Toponym | Literal Type
Embedded NonLit | Non-Toponym | Associative Type
Metonymy | Noun Literal Type | Associative Type
Languages | Adjectival Literal Type | Associative Type
Demonyms | Adjectival Literal Type | Associative Type
Non-Lit Modifiers | Noun/Adjectival Literal | Associative Type
Homonyms | Noun Literal Type | Associative Type

Table 1: The interplay between context and semantics determines the type. The top five are the literals, the bottom six are the associative types. Examples of each type can be found in Figure 1. NP head must be strongly indicative of a literal type, e.g.: “The British weather doesn’t seem to like us today.”
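The classification rule given in the Figure 1 caption can be written as a small decision function. The sketch below assumes that the three inputs (context type, whether the toponym is a modifier, and whether the NP head is mobile and/or abstract) have already been determined, by an annotator or an upstream model. The names `Context` and `classify_toponym` are hypothetical.

```python
from enum import Enum


class Context(Enum):
    LITERAL = "literal"
    MIXED = "mixed"  # ambiguous, or literal and associative at once
    ASSOCIATIVE = "associative"


def classify_toponym(context: Context,
                     is_modifier: bool,
                     head_is_mobile_or_abstract: bool = False) -> str:
    """Return 'literal' or 'associative' following the Figure 1 rule."""
    # A literal or ambiguous/mixed context always yields a literal toponym.
    if context in (Context.LITERAL, Context.MIXED):
        return "literal"
    # Associative context, toponym is not a modifier (e.g. metonymy).
    if not is_modifier:
        return "associative"
    # Associative context, toponym modifies an NP head: the head decides.
    return "associative" if head_is_mobile_or_abstract else "literal"


# "An Adelaide court sentenced a murderer": modifier of a static, concrete head.
print(classify_toponym(Context.ASSOCIATIVE, is_modifier=True,
                       head_is_mobile_or_abstract=False))
# "China exports slowed by 7 percent": modifier of a mobile head.
print(classify_toponym(Context.ASSOCIATIVE, is_modifier=True,
                       head_is_mobile_or_abstract=True))
```

The function only encodes the final decision; deciding what the context indicates and whether a head is mobile or abstract is the hard part and is what a trained model would need to learn.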

3.2 Literal Toponyms

These types refer to places where something is happening or is physically located. This subtle but important distinction from associative toponyms allows for higher quality geographic analysis. For instance, the phrase “Swedish people” (who could be anywhere) is not the same as “people in Sweden”, so we differentiate this group from the associative group (Table 1). Only the latter mention refers to Swedish “soil” and can/should be processed separately.

A Literal

is what is most commonly and too narrowly thought of as a location, e.g. “Harvests in Australia were very high.” and “South Africa is baking in 40C degree heat.” For these toponyms, the semantics and context both indicate it is a literal toponym, which refers directly to a physical location.


Coercion

refers to polysemous entities typically classified as Non-Toponyms, which in a literal context have their word sense coerced to (physical) location. More formally, coercion is “an observation of grammatical and semantic incongruity, in which a syntactic structure places requirements on the types of lexical items that may appear within it.” ziegeler2007word Examples include “The University of Sussex, Sir Isaac Newton (pub), High Court is our meeting place.” and “I’m walking to Chelsea F.C., Bell Labs, Burning Man.” Extracting these toponyms increases recall and allows for a very precise location as these toponyms tend to have a small geographic footprint.

Mixed Toponyms

typically occur in an ambiguous context, e.g. “United States is generating a lot of pollution.” or “Sudan is expecting a lot of rain.” They can also simultaneously activate a literal and an associative meaning, e.g. “The north African country of Libya announced the election date.”. These cases sit somewhere between literal and associative toponyms and were assigned the Mixed tag, however, we propose to include them in the literal group.

Embedded Literals

are Non-Toponyms nested within larger entities such as “Toronto Urban Festival”, “London Olympics” and “Monaco Grand Prix”, extracted using a ’greedy algorithm’ for high recall. They are semantically, though not syntactically, equivalent to Literal Modifiers. If we ignored the case, the meaning of the phrase would not change, e.g. “Toronto urban festival”.

Noun Modifiers

are toponyms that modify literal heads (Figure 2), e.g. “You will find the UK [lake, statue, valley, base, airport] there.” and “She was taken to the South Africa [hospital, border, police station]”. The context, however, needn’t always be literal, for instance “An Adelaide court sentenced a murderer to 25 years.” or “The Vietnam office hired 5 extra staff.” providing the head is literal. Noun modifiers can also be placed after the head, for instance “We have heard much about the stunning caves of Croatia.”

Figure 2: Example noun phrases ranging from Literal to Mixed to Associative. The further to the right, the more ’detached’ the NP referent becomes from its physical location. Literal heads tend to be concrete (elections, accidents) and static (buildings, natural features) while associative heads are more abstract (promises, partnerships) and mobile (animals, products). In any case, context is the main indicator of type and needs to be combined with NP semantics.

Adjectival Modifiers

exhibit much the same pattern as noun modifiers except for the adjectival form of the toponym e.g. “It’s freezing in the Russian tundra.”, “British ports have doubled exports.”, “American schools are asking for more funding.” Adjectival modifiers are frequently and incorrectly tagged as nationalities or religious/political groups (https://spacy.io/usage/ and http://corenlp.run/) or sometimes ignored altogether (http://services.gate.ac.uk/annie/ and IBM NLP Cloud in Table 2). Approximately 1 out of 10 adjectival modifiers is literal in our news corpus.

3.3 Associative Toponyms

Toponyms frequently refer to or are used to modify non-locational concepts, which are associated with a location rather than directly referring to their physical presence. This can occur by substituting a non-locational concept with a toponym (metonymy) or via a demonym, homonym or a language reference. Some of these look superficially like modifiers leading to NER errors.


Demonyms

roberts2011germans are derived from toponyms and denote the inhabitants of a country, region or city. These persons are associated with a location and have been on occasion, sparsely rather than exhaustively, annotated lieberman2010geotagging . Examples include “I think he’s Indian.”, which is equivalent to “I think he’s an Indian citizen/person.” or “An American and a Briton walk into a bar …”


Languages

can sometimes be confused for adjectival toponyms, e.g. “How do you say pragmatics in French, Spanish, English, Japanese, Chinese, Polish?” Occurrences of languages should not be interpreted as modifiers, another NER error stemming from a lack of context understanding. This is another case of a concept associated with a location that does not require coordinates.


Metonymy

is a figure of speech whereby a concept that was originally intended gets substituted with a related concept, for example “Madrid plays Kiev today.”, substituting sports teams with toponyms. Similarly, in “Mexico changed the law.”, the likely latent entity is the Mexican government. Metonymy was previously found to be a frequent phenomenon, around 15-20% of place mentions are metonymic markert2007semeval ; gritta2017vancouver ; leveling2008metonymy . In our dataset, it was 13.7%.

Noun Modifiers

are toponyms that modify associative noun phrase heads (Figure 2) in an associative context, for instance “China exports slowed by 7 percent.” or “Kenya’s athletes win double gold.” Noun modifiers also occur after the head as in “The President of Armenia visited the Embassy of the Republic of Armenia to the Vatican.” Note that the event did not take place in Armenia but in the Vatican, so treating “Armenia” as a literal toponym would identify the wrong event location.

Adjectival Modifiers

are sporadically covered by NER taggers (Table 2) or tagging schemes hirschman1998evolution . They are semantically identical to associative noun modifiers except for their adjectival form, e.g. “Spanish sausages sales top €2M.”, “We’re supporting the Catalan club.” and “British voters undecided ahead of the Brexit referendum.”

Toponym (Type) | Label | Tagger labels (in table order)
--- | --- | ---
Milan (Homonymy) | Assoc. | Literal, Literal, Literal, Literal, Organ., Literal
Lebanese (Language) | Assoc. | Literal, Demon., Demon., Misc.
Syrian (Demonym) | Assoc. | Literal, Demon., Demon., Misc.
UK origin (NounMod) | Assoc. | Assoc., Literal, Literal, Literal, Literal, Literal
K.of Jordan (PostMod) | Assoc. | Person, Literal, Literal, Literal, Organ., Person
Iraqi militia (AdjMod) | Assoc. | Assoc., Demon., Demon., Misc.
US Congress (Embed) | Assoc. | Organ., Organ., Organ., Organ., Organ., Organ.
Turkey (Metonymy) | Assoc. | Literal, Literal, Literal, Literal, Literal
city in Syria (Literal) | Literal | Literal, Literal, Literal, Literal, Literal, Literal
Iraqi border (AdjMod) | Literal | Literal, Demon., Demon., Misc.
Min.of Defense (Fac) | Literal | Organ., Organ., Organ., Organ., Organ., Organ.

Table 2: Popular NER taggers tested in June 2018 using official demo interfaces (incorrect labels underlined) on the sentence: “Milan, who was speaking Lebanese with a Syrian of UK origin as well as the King of Jordan, reports that the Iraqi militia and the US Congress confirmed that Turkey has shelled a city in Syria, right on the Iraqi border near the Ministry of Defense.” A distinction is made only between a location and not-a-location since an associative label is unavailable. The table shows only a moderate agreement between tagging schemes. Can be derived from the API with a simple rule.

Embedded Associative

toponyms are Non-Toponyms nested within larger entities such as “US Supreme Court”, “Sydney Lottery” and “Los Angeles Times”. They are semantically, though not syntactically, equivalent to Associative Modifiers. Ignoring case would not change the meaning of the phrase “Nigerian Army” versus “Nigerian army”. However, it will wrongly change the shallow classification from ORG to LOC for most NER taggers.


Homonyms

and more specifically homographs, are words with identical spelling but different meaning such as Iceland (a UK grocery chain). Their meaning is determined mainly by contextual evidence hearst1991noun ; gorfein2001activation as is the case with other types. Examples include: “Brooklyn sat next to Paris.” and “Madison, Chelsea, Clinton, Victoria, Jamison and Norbury submitted a Springer paper.”

4 GeoWebNews

As our second contribution, we introduce a new dataset to enable evaluation of fine-grained sequence tagging and classification of toponyms. This will facilitate an immediate implementation of the theory from the last section. The dataset comprises 200 articles from 200 globally distributed news sites. Articles were sourced via a collaboration with the European Union’s Joint Research Centre (https://ec.europa.eu/jrc/en), collected during 1st-8th April 2018 from the European Media Monitor steinberger2013introduction using a wide range of multilingual trigger words/topics (http://emm.newsbrief.eu/). We then randomly selected exactly one article from each domain (English language only) until we reached 200 news stories. We also share the BRAT stenetorp2012brat configuration files to expedite future data annotation using the new scheme. GeoWebNews can be used to evaluate the performance of NER (locations only) known as Geotagging, Geocoding/Toponym Resolution gritta2018melbourne ; to develop Machine Learning models for sequence tagging and classification, geographic information retrieval or perhaps used in a Semantic Evaluation marquez2007semeval task. GeoWebNews is a web-scraped corpus hence a few articles may contain duplicate paragraphs or some missing words from improperly parsed web links, which is typical of what might be encountered in practical applications.
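The selection step described above (exactly one randomly chosen article per news domain, until the target count is reached) could be sketched as follows. This is an illustration only, not the authors' collection script: the function name, the input mapping from domain to article identifiers and the fixed seed are all assumptions.

```python
import random
from typing import Dict, List


def sample_one_per_domain(articles_by_domain: Dict[str, List[str]],
                          target: int,
                          seed: int = 0) -> List[str]:
    """Pick one random article from each of `target` randomly ordered domains."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # Skip domains that contributed no articles.
    domains = sorted(d for d, items in articles_by_domain.items() if items)
    rng.shuffle(domains)
    return [rng.choice(articles_by_domain[d]) for d in domains[:target]]


# Usage with made-up identifiers.
pool = {
    "news-a.example": ["a1", "a2", "a3"],
    "news-b.example": ["b1"],
    "news-c.example": ["c1", "c2"],
    "news-d.example": [],
}
print(sample_one_per_domain(pool, target=2))
```

Sampling by domain first, rather than by article, keeps prolific outlets from dominating the corpus.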

Figure 3: Example of a GeoWebNews article. An asterisk indicates an attribute. (1) modifier_type [Adjective, Noun] and/or (2) non_locational [True, False].
Figure 4: A screenshot of the BRAT Annotation Tool configuration.

4.1 Annotation Procedure and Inter-Annotator Agreement (IAA)

The annotation of 200 news articles at this level of granularity is a laborious and time-consuming effort. However, annotation quality is paramount when proposing changes/extensions to existing schemes. Therefore, instead of using crowd-sourcing, annotation was performed by the first author and two linguists from Cambridge University’s Modern and Medieval Languages Faculty (https://www.mml.cam.ac.uk/). In order to expedite the verification process, we decided to make the annotations of the first author available to our linguists as ‘pre-annotation’.

Their task was then twofold: (1) Precision Check: verification of the first author’s annotations with appropriate edits; (2) Recall Check: identification of additional annotations that may have been missed. The F-Scores for the Geotagging IAA were computed using BratUtils (https://github.com/savkov/BratUtils), which implements the MUC-7 scoring scheme chinchor1998appendix . The Geotagging IAA F-Scores after adjudication were 97.2 and 96.4 for the first and second annotators respectively, computed on a 12.5% sample of 336 toponyms from 10 randomly chosen articles (out of a total of 2,720 toponyms across 200 articles). The IAA F-Scores for a simpler binary distinction (literal versus associative types) were 97.2 and 97.3.

4.2 Annotation of Coordinates

The Geocoding IAA with the first annotator on the same 12.5% sample of toponyms expressed as accuracy [correct or incorrect coordinates/location] was 99.7%. An additional challenge with this dataset is that some toponyms (8%) require either an extra source of knowledge such as Google Maps API, a self-compiled list of business/organisation names such as matsuda2015annotating or even human-like inference/reasoning to resolve correctly. These toponyms are facilities, buildings, street names, park names, festivals, universities and other venues. We have estimated the coordinates for these toponyms, which do not have an entry in Geonames, to the best of our ability using Google Maps API. Because of the logical annotation, these hard to resolve toponyms can be excluded from evaluation, which is what we did. We excluded 209 of these toponyms and a further 110 demonym, homonym and language types without coordinates, resolving the remaining 2,401. We did not annotate articles’ geographic focus as was done for Twitter eisenstein2010latent ; roller2012supervised and Wikipedia laere2014georeferencing .

Figure 5: An augmentation of a literal training example. An associative augmentation might be {The deal was agreed by} {the chief engineer}.

4.3 Training Data Augmentation

In order to augment the 2,720 toponyms and double, even triple the training data size, two additional lexical features (NP heads only) were annotated: Literal Expressions and Associative Expressions (Google Cloud NLP already tags common nouns in a similar manner). These annotations generate two separate components: (a) the NP context and (b) the NP head itself. In terms of distribution, we have literal (N=1,423) versus associative (N=2,037) contexts and literal (N=1,697) versus associative (N=1,763) heads, indicated by a binary non-locational attribute in Figure 3. These two interchangeable components give us multiple permutations from which to generate a larger training dataset (Fig. 5). The associative expressions are deliberately dominated by ORG-like types because this is the most frequent metonymic pair alonso2013annotation .
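The recombination of contexts and heads can be sketched as follows. This is only an illustration under stated assumptions: the example phrases (apart from the one in the Figure 5 caption), the `augment` helper and the rule that a context is only paired with a head of the same label are hypothetical, and the actual augmentation procedure used for GeoWebNews may differ.

```python
from itertools import product

# Hypothetical annotated components: (text, is_literal).
# Only the first context/head pair is taken from the Figure 5 caption.
contexts = [
    ("The deal was agreed by", False),     # associative context
    ("The storm made landfall in", True),  # literal context (invented)
]
heads = [
    ("the chief engineer", False),  # associative expression head
    ("the coastal town", True),     # literal expression head (invented)
    ("Paris", True),                # literal toponym head (invented)
]


def augment(contexts, heads):
    """Pair every context with every head that carries the same label.

    Assumption of this sketch: literal contexts combine only with literal
    heads, and associative contexts only with associative heads.
    """
    examples = []
    for (ctx, ctx_literal), (head, head_literal) in product(contexts, heads):
        if ctx_literal == head_literal:
            label = "literal" if ctx_literal else "associative"
            examples.append((f"{ctx} {head}.", label))
    return examples


if __name__ == "__main__":
    for sentence, label in augment(contexts, heads):
        print(label, "|", sentence)
```

With m contexts and n heads of the same label, this yields up to m × n synthetic instances, which is how a few thousand annotations could in principle be multiplied.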

4.4 Geotagging GeoWebNews

For toponym extraction, we tested the two best models from Table 2, Google Cloud NLP (https://cloud.google.com/natural-language/) and Spacy NLP (https://spacy.io/usage/linguistic-features), then trained our own NCRF++ model yang2018ncrf++ , which is an open-source Neural Sequence Labeling Toolkit (https://github.com/jiesutd/NCRFpp), evaluated using 5-Fold Cross-Validation (40 articles/files per fold, 4 train folds and 1 test fold). Embeddings were initialised with 300D vectors (Common Crawl 42B, https://nlp.stanford.edu/projects/glove/) from GloVe pennington2014glove in a simple form of transfer learning as training data is limited. The NCRF++ NER tagger was trained with default hyper-parameters and two additional features, the dependency head and word shape/morphology, both extracted with the Spacy NLP parser. For this NER model, we prioritised fast prototyping and deployment over meticulous feature/hyperparameter tuning and extraction, hence there is likely more performance to be found.

| NER Model | Precision | Recall | F-score |
|---|---|---|---|
| NCRF++ (with Literal & Assoc. labels) | 79.9 | 75.4 | 77.6 |
| Spacy NLP/NER | 82.4 | 68.6 | 74.9 |
| Google Cloud NLP | 91.0 | 76.6 | 83.2 |
| NCRF++ (with “Location” label only) | 90.0 | 87.2 | 88.6 |

Table 3: Geotagging F-Scores for GeoWebNews featuring the best performing models. The NCRF++ models’ scores were averaged over 5 folds (σ = 1.2-1.3).

There were significant differences in precision and recall between off-the-shelf and custom models. Spacy NER and Google NER achieved a precision of 82.4 and 91 respectively while achieving a lower recall of 68.6 and 76.6 respectively. The NCRF++ tagger exhibited a balanced classification behaviour (90 precision, 87.2 recall). It achieved the highest F-Score of 88.6 in spite of only a modest amount of training examples. It shows that open-source NER models are powerful and adaptable given carefully annotated training data.
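As a quick sanity check, the F-Scores in Table 3 can be recomputed from the reported precision and recall with the standard harmonic-mean formula. The helper name `f_score` is ours. Note that the NCRF++ rows are fold averages, so the F-Score of the averaged precision and recall need not exactly equal the averaged F-Score in general, although here the figures agree to one decimal place.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the balanced F1 used in Table 3."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as reported in Table 3.
table3 = {
    "NCRF++ (Literal & Assoc.)": (79.9, 75.4),
    "Spacy NLP/NER": (82.4, 68.6),
    "Google Cloud NLP": (91.0, 76.6),
    "NCRF++ (Location only)": (90.0, 87.2),
}

if __name__ == "__main__":
    for model, (p, r) in table3.items():
        print(f"{model}: F1 = {f_score(p, r):.1f}")
```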

Geotagging with two labels

was also evaluated with a custom NCRF++ model. The mean F-Score over 5 folds was 77.6 (σ = 1.7), which is higher than Spacy (74.9) and includes the “literalness” distinction, i.e. which toponyms refer to the physical location versus ones with an associative relationship. It demonstrates the feasibility of geotagging on two levels, extracting and treating toponyms separately in downstream tasks. For example, literal toponyms may be given a higher weighting for the purposes of geolocating an event. In order to incorporate this functionality into NER, training a custom sequence tagger is currently the best option for two-label toponym extraction.

| (1) No Aug. | (2) Partial Aug. | (3) Full Aug. | (4) Ensemble of (1, 2, 3) |
|---|---|---|---|
| 88.6 | 88.2 | 88.4 | 88.5 |

Table 4: F-Scores for NCRF++ models with 5-Fold Cross-Validation. No improvement was observed for the augmented or ensemble setups over baseline.

Table 4 shows three additional experiments (numbered 2, 3, 4) that we compared to the best NCRF++ model (1). We hypothesised that data augmentation, i.e. adding modified training instances, would lead to a boost in performance; however, this did not materialise. An ensemble of models (4) also did not beat the baseline NCRF++ model (1). Due to time constraints, we have not experimented extensively with more elaborate data augmentation and encourage further research into other implementations.

4.5 Geocoding GeoWebNews

For the evaluation of Toponym Resolution, we have excluded the following: (a) the most difficult to resolve toponyms such as street names, building names, festival venues and so on, which account for 8% of the total, have no entry in Geonames and often require a reference to additional resources; and (b) demonyms, languages and homonyms, accounting for 4% of toponyms, as these are not locations and hence do not have coordinates. The final count was 2,401 (88%) toponyms. We have used the CamCoder gritta2018melbourne default model/setup (https://github.com/milangritta/Geocoding-with-Map-Vector) for the resolution of toponyms and a strong population heuristic as a baseline. The scores are reported in Table 5.

Two evaluation setups have been considered: (1) Using Spacy NER for geotagging, then scoring the 1,547 true positives with a matching record in Geonames; and (2) Using Oracle NER to resolve all 2,401 toponyms, which have been normalised, i.e. provided with a proper location name that can be looked up in Geonames. We recommend setup (1) in the first instance as the most representative of actual usage. In practice, this means extracting fewer toponyms, up to 30-50% fewer depending on the dataset and NER. A small disadvantage is that we evaluate on a subsample, hence we cannot be absolutely sure how the system performs on the rest of the data. However, in our previous work, we found that a random 50%+ sample is representative of the full dataset. The advantage of this approach is that it is the most realistic way of evaluating geocoding. For more insights, we also include the latter “laboratory comparison” (i.e. setup 2) as the scores may differ. This can only be done if it is possible to perform geocoding separately. Its advantage is that all toponyms will be found in the knowledge base, so the disambiguation performance is assessed across the whole dataset. The disadvantage is that it can give the system an unfair edge and it is less representative of real-world needs.

| Setup/Description | Mean Err (km) | Acc@161km (%) | AUC | # of Toponyms |
|---|---|---|---|---|
| (1) Spacy NER + CamCoder | 188 | 95 | 0.06 | 1,547 |
| (1) Spacy NER + Population | 210 | 95 | 0.07 | 1,547 |
| (2) Oracle NER + CamCoder | 232 | 94 | 0.06 | 2,401 |
| (2) Oracle NER + Population | 250 | 94 | 0.07 | 2,401 |

Table 5: Geocoding scores for GeoWebNews.

The overall errors are very low, indicating low ambiguity, i.e. low geocoding difficulty of the dataset. Other datasets gritta2017s are more challenging with errors 2-5 times greater. The main observation is the difference between the number of toponyms resolved. When provided with a database name for each extracted toponym, it is possible to evaluate the whole dataset and get a sense of the pure disambiguation performance. However, in reality, geotagging is performed first, which reduces that number significantly. In addition, any toponyms that cannot be matched against a lexicon or database will also be discarded. As a geoparsing pipeline with Spacy NER + CamCoder, we can see that 94-95% of the 1,547 correctly recognised toponyms were resolved to within 161km.

The number of recognised toponyms could be increased with a “normalisation lexicon” that maps non-standard forms such as adjectives (“Asian”, “Russian”, “Congolese”) to their database names. Spacy NER provides a separate class for these toponyms: NORP, i.e. nationalities or religious or political groups. A lexicon/matcher could be built with a gazetteer-based statistical n-gram model that uses multiple knowledge bases, or alternatively with a rule-based system volz2007towards . For unknown toponyms, approximating the geographic representation from places that co-occur with it in other documents henrich2008determining may be an option. Finally, not all errors can be easily, or at all, evaluated in a conventional setup. Suppose an NER tagger has 80% precision. This means 20% of potential toponyms, i.e. false positives, will not be evaluated as they are not annotated in the dataset. However, when the system is operational, these toponyms will be used in downstream processing. In practice, this subset carries some unknown penalty that NLP practitioners hope is not too large. For downstream tasks, however, this is something that may be included in the error analysis, as we will demonstrate in our next publication.

5 Standard Evaluation Metrics

The previous two sections established what is to be evaluated and why, then introduced new resources with possible implications for NER. In this part, we focus on critically reviewing existing geoparsing metrics, i.e. how to best assess geoparsing models. In order to reliably determine the SOTA and estimate the practical usefulness of these models in downstream applications, we propose a holistic, consistent and rigorous evaluation framework.

Full Pipeline

Considering the task objective and available metrics, the recommended approach is to evaluate geoparsing as two separate components. Researchers and practitioners do not typically tackle both stages at once delozier2015gazetteer ; tobin2010evaluation ; karimzadeh2013geotxt ; wing2014hierarchical ; wing2011simple ; gritta2018melbourne . More importantly, it is difficult to diagnose errors and target improvements without this separation. The best practice is to evaluate geotagging first, then obtain geocoding metrics for the true positives, i.e. the subset of correctly identified toponyms. We recommend a minimum of 50% recall of annotated toponyms for a representative sample and a robust evaluation.
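The two-stage procedure can be sketched as below. This is a minimal illustration, not the evaluation code used for the reported results: the dictionary layout, the `split_pipeline_eval` name and the use of exact span matching to define a true positive are all assumptions of the sketch.

```python
def split_pipeline_eval(gold, predicted):
    """Evaluate geotagging, then hand only the true positives to geocoding.

    gold, predicted: dicts mapping a toponym span (doc_id, start, end)
    to a (lat, lon) pair. Exact span matching is an assumption here;
    partial-match schemes would need a different intersection step.
    """
    true_positives = gold.keys() & predicted.keys()
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    # (predicted coords, gold coords) pairs for the geocoding stage only.
    pairs = [(predicted[span], gold[span]) for span in sorted(true_positives)]
    return precision, recall, pairs


if __name__ == "__main__":
    # Invented toy data for illustration.
    gold = {
        ("d1", 0, 6): (51.5, -0.12),
        ("d1", 10, 15): (48.85, 2.35),
        ("d2", 3, 9): (40.7, -74.0),
    }
    predicted = {
        ("d1", 0, 6): (51.6, -0.10),
        ("d2", 3, 9): (40.7, -74.0),
        ("d2", 20, 25): (0.0, 0.0),  # false positive, never geocoding-scored
    }
    p, r, pairs = split_pipeline_eval(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} scored toponyms={len(pairs)}")
```

The recall value returned here is what would be checked against the recommended 50% minimum before the geocoding scores are taken as representative.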

5.1 Geotagging Metrics

There is a strong agreement on the appropriate geotagging evaluation metric so most attention will focus on toponym resolution. As a subtask of NER, geotagging is evaluated using the F-Score, which is also our recommended metric and an established standard for this stage of geoparsing lieberman2011multifaceted . Figures for precision and recall may also be reported as some applications may trade precision for recall or may deem precision/recall errors more costly.

5.2 Toponym Resolution Metrics

Several geocoding metrics have been used in previous work and can be divided into three groups depending on their output format. We assert that the most ‘fit for purpose’ output of a geoparser is a pair of coordinates, not a categorical value/label or a ranked list of toponyms. In practice, most end users would expect exactly one solution, making ranked lists user-unfriendly. Knowing that the correct solution is found somewhere in the top k results leaves users with an uncertain choice of final solution and can give unduly flattering results during evaluation santos2015using . Ranked lists may be acceptable if subjected to further human judgement/correction but not as the final output. With set-based metrics such as the F-Score, there are several issues: a) Database incompatibility; geoparsers built with different knowledge bases that cannot be aligned make fair benchmarking infeasible. This is also an issue with datasets labelled with proprietary database identifiers/keys. b) The all-or-nothing approach implies that every incorrect answer (e.g. error greater than 5-10km) is equally wrong. This is not the case: geocoding errors are continuous variables, not categorical variables, hence the F-Score is unsuitable for toponym resolution. c) Underspecification of recall/precision; to the best of our knowledge, the following evaluation question remains unanswered: is a correctly geotagged toponym with an error greater than X km a false positive or a false negative? This is important for accurate precision and recall figures. Set-based metrics and ranked lists are prototypical cases of trying to fit the wrong evaluation metric to a task. Finally, population has not consistently featured in geocoding evaluation but it is capable of beating many existing systems delozier2015gazetteer ; gritta2017s . Therefore, we recommend the usage of this strong baseline as a necessary component of evaluation. We now briefly discuss each metric group.

Coordinates-based (continuous)

metrics are the recommended group when the output of a geoparser is a pair of coordinates. An error is defined as the distance from predicted coordinates to gold coordinates. Average Error is a regularly used metric delozier2016data ; hulden2015kernel ; it is analogous to a sum function and thus informs of total error as well. Accuracy@Xkm is the percentage of toponyms resolved within X km of gold coordinates. grover2010use and tobin2010evaluation used accuracy within 5km, santos2015using ; dredze2013carmen used accuracy at 5, 50 and 250km, while related works on tweet geolocation speriosu2013text ; zheng2018survey ; han2014improving ; roller2012supervised use accuracy at 161km. We recommend the more lenient 161km as it covers errors stemming from database misalignment. Median Error is a simple metric to interpret wing2011simple ; speriosu2013text but is otherwise uninformative as the error distribution is non-normal, hence it is not recommended. The Area Under the Curve gritta2017s ; jurgens2015geolocation is another coordinate-based metric, which follows in a separate subsection.
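A minimal sketch of these coordinate-based metrics follows. The haversine great-circle formula and the mean Earth radius of 6,371 km are assumptions of this sketch (the text does not specify how distances were computed), and the function names are ours.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def great_circle_km(a, b):
    """Haversine distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # min() guards against rounding slightly above 1.0 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def mean_error(errors_km):
    """Average Error: arithmetic mean of the geocoding errors."""
    return sum(errors_km) / len(errors_km)


def accuracy_at(errors_km, threshold_km=161.0):
    """Accuracy@Xkm: percentage of toponyms resolved within the threshold."""
    within = sum(1 for e in errors_km if e <= threshold_km)
    return 100.0 * within / len(errors_km)


if __name__ == "__main__":
    # Invented (predicted, gold) coordinate pairs for illustration.
    pairs = [((51.6, -0.10), (51.5, -0.12)), ((45.64, -122.66), (49.25, -123.12))]
    errors = [great_circle_km(p, g) for p, g in pairs]
    print(f"Mean Error = {mean_error(errors):.1f} km")
    print(f"Accuracy@161km = {accuracy_at(errors):.1f}%")
```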


Set-based

metrics, and more specifically the F-Score, have been used alongside coordinates-based metrics leidner2008toponym ; andogah2010geographically to evaluate the performance of the full pipeline. A true positive was judged as a toponym that was both correctly geotagged and resolved to within a certain distance. This ranges from 5km andogah2010geographically ; lieberman2012adaptive to 10 miles kamalloo2018coherent ; lieberman2010geotagging to all of the previous thresholds kolkman2015cross including 100km and 161km. In cases where WordNet has been used as the ground truth buscaldi2010toponym , an F-Score might be appropriate given WordNet’s structure but it is not possible to make a comparison with a coordinates-based geoparser. Another problem with it is the all-or-nothing scoring. For example, Vancouver, Portland, Oregon is an acceptable output if Vancouver, BC, Canada was the expected answer. Similarly, the implicit suggestion that Vancouver, Portland is equally wrong as Vancouver, Australia is erroneous. Furthermore, using the F-Score exclusively for the full pipeline does not allow for evaluation of individual geoparsing components, making it more difficult to identify problems. Finally, the F-Score offers little insight beyond the existing geotagging and geocoding evaluation scores. As a result, it is not fit for toponym resolution.


Ranking-based

metrics such as Eccentricity, Cross-Entropy, Mean Reciprocal Rank, Mean Average Precision and other variants (Accuracy@k, Precision@k) have sometimes been used or suggested karimzadeh2016performance ; craswell2009mean . However, due to the aforementioned format of output, ranked geocoding results are not recommended. These metrics are appropriate for Geographic Information Retrieval.

Figure 6: (a) Original Errors; (b) Logged Errors. Computing the Area Under the Curve by integrating the Logged Errors. In Figure (b), AUC = 0.33. This means 33% of the maximum geocoding error (20,039 km) was committed. Large errors at the right tail of the distribution (a) matter less, hence they are deemphasized with the log function.

Area Under the Curve (AUC)

is a recent metric used for toponym resolution evaluation gritta2017s ; jurgens2015geolocation . It is not to be confused with other AUC variants, which include the AUC of ROC, the AUC for measuring blood plasma in Pharmacokinetics (the branch of pharmacology concerned with the movement of drugs within the body) or the AUC of the Precision/Recall curve. The calculation uses the standard calculus method to integrate the area under the curve of the geocoding errors, denoted $x$, using the Trapezoid Rule as an approximation of the definite integral:

$$\mathrm{AUC} = \frac{\int_{0}^{\dim(x)} \ln(x)\, dx}{\dim(x) \cdot \ln(20039)}$$

where $\dim(x)$ is the number of geocoding errors and 20,039 km is 1/2 of Earth’s circumference, i.e. Max Error. The original errors, highly skewed in Figure 6a, are scaled down using the natural logarithm, resulting in Figure 6b. The integral is then divided by the total area of the graph to compute the AUC. The logarithm decreases the effect of outliers that tend to distort the Mean Error. This allows for evaluation of smaller errors that would otherwise be suppressed by outliers.
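The AUC computation described above can be sketched in a few lines. Two details are assumptions of this sketch rather than statements from the text: 1 km is added to every error before taking the logarithm so that a perfect prediction (0 km) is defined, and the normalising area uses the number of trapezoid intervals (n - 1) instead of n, which is a negligible difference for large samples.

```python
import math

MAX_ERROR_KM = 20039  # half of Earth's circumference, as stated in the text


def auc(errors_km):
    """Area under the sorted, logged error curve, normalised to [0, 1].

    Assumptions of this sketch: log(e + 1) is used to avoid log(0), and
    the trapezoid rule is applied with unit spacing between sorted errors.
    """
    if len(errors_km) < 2:
        raise ValueError("need at least two errors to integrate")
    logged = sorted(math.log(e + 1) for e in errors_km)
    # Trapezoid rule with unit spacing: average of each adjacent pair.
    area = sum((logged[i] + logged[i + 1]) / 2 for i in range(len(logged) - 1))
    total_area = math.log(MAX_ERROR_KM) * (len(logged) - 1)
    return area / total_area


if __name__ == "__main__":
    # Invented errors in km for illustration; lower AUC is better.
    print(f"AUC = {auc([0.0, 2.5, 12.0, 90.0, 4300.0]):.3f}")
```

Because the errors are sorted and logged first, a single very large error moves the result far less than it would move the Mean Error.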

5.3 Recommended Metrics for Toponym Resolution

There is no single metric that covers every important aspect of geocoding. Therefore, based on the previous paragraphs, we recommend the following. The AUC is a comprehensive metric as it accounts for every error and is suitable for a rigorous comparison, but it takes some care to interpret. Accuracy@161km is a fast and intuitive way to report “correct” resolutions (error within 100 miles of gold coordinates) but it ignores the rest of the error distribution. Mean Error is a measure of average and total error but it hides the full distribution, treats all errors as equal and is prone to distortion by outliers. Using all three metrics therefore gives a holistic view of geocoding performance as they compensate for each other’s weaknesses while testing different aspects of toponym resolution. The winning model should perform well across all three metrics. As a final recommendation, an informative and intuitive way to assess the full pipeline would be to indicate how many toponyms were successfully extracted and geocoded/tested as in Table 5. Using the Accuracy@161km, we can then infer the percentage of correctly recognised and resolved toponyms to estimate the performance of the combined system.
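One possible reading of that final recommendation is sketched below with the Table 5 figures for setup (1). The choice of the 2,401 resolvable gold toponyms as the denominator and the `pipeline_accuracy` helper are our assumptions; other denominators (e.g. all 2,720 annotated toponyms) would give a different estimate.

```python
def pipeline_accuracy(n_scored, acc_at_161, n_gold):
    """Estimated share of gold toponyms both recognised and resolved.

    n_scored:   toponyms extracted by the tagger and passed to the geocoder
    acc_at_161: Accuracy@161km on those toponyms, as a fraction in [0, 1]
    n_gold:     gold toponyms that have coordinates (assumed denominator)
    """
    return n_scored * acc_at_161 / n_gold


if __name__ == "__main__":
    # Spacy NER + CamCoder, figures taken from Table 5.
    estimate = pipeline_accuracy(1547, 0.95, 2401)
    print(f"Recognised and resolved within 161km: {100 * estimate:.1f}%")
```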

5.4 Important Considerations for Evaluation

The Choice of Database

of geographic knowledge (a) used by geoparsing systems; and (b) used for labelling test/train datasets can cause challenges and must be noted in the evaluation report. In order to make a fair comparison between models and datasets, the toponym coordinates would ideally come from the same knowledge base. Incompatibilities between global gazetteers have been previously studied acheson2017quantitative , quantifying the key differences. However, the most popular open-source geoparsers and datasets use Geonames (https://www.geonames.org/export/), allowing for an “apples to apples” comparison (unless indicated otherwise).
In case it is required, we also propose a database alignment method for an empirically robust comparison of geoparsing models and datasets with incompatible coordinate data. The adaptation process involves a post-edit to the output coordinates. For each toponym, retrieve its nearest candidate by measuring the distance from the predicted coordinates, generated using a different knowledge base, to the Geonames toponym coordinates. Finally, output the Geonames Coordinates to allow for a reliable comparison.
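The alignment step can be sketched as a nearest-candidate lookup. The candidate list is assumed to have been retrieved from Geonames for the toponym's name by some lookup that is not shown; the function names, the example coordinates and the haversine distance with a 6,371 km Earth radius are assumptions of this sketch.

```python
import math


def great_circle_km(a, b, radius_km=6371.0):
    """Haversine distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))


def align_to_geonames(predicted, geonames_candidates):
    """Post-edit: replace predicted coordinates with the nearest candidate.

    predicted:           (lat, lon) produced with a different knowledge base
    geonames_candidates: (lat, lon) pairs of the Geonames entries sharing
                         the toponym's name (hypothetical lookup result)
    """
    if not geonames_candidates:
        return predicted  # nothing to align against; keep the original
    return min(geonames_candidates, key=lambda c: great_circle_km(predicted, c))


if __name__ == "__main__":
    # Approximate coordinates for two places called "Vancouver".
    candidates = [(49.2497, -123.1193), (45.6387, -122.6615)]
    print(align_to_geonames((49.28, -123.12), candidates))
```

After this post-edit, the error is measured from the returned Geonames coordinates, so small offsets between knowledge bases no longer count against the model.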

Resolution scope

needs to be noted when comparing geoparsers, although it is less likely to be an issue in practice. Different systems cover part of or the whole of the world surface, i.e. geoparsers with Local Coverage such as country-specific models matsuda2015annotating versus Global Coverage, which is the case with most geoparsers. It is not possible to fairly compare these two types of systems.

The train/dev/test data domains,

i.e. the homogeneity or heterogeneity of the evaluation datasets is a vital consideration. The nature and source of the evaluation datasets must be noted as performance will be higher on in-domain data, i.e. all partitions come from the same corpus. When training data comes from a different distribution than the test data, e.g. News Articles versus Wikipedia Pages, the model that can generalise to out-of-domain test data should be recognised as superior even if the scores are similar.

Statistical significance

tests need to be conducted when making a comparison between two geoparsers (unless a large performance gap makes this unnecessary). There are two options: 1) k-fold cross-validation followed by a t-test for both stages or 2) McNemar’s test for Geotagging and the Wilcoxon Signed-Rank Test for Geocoding. The k-fold cross-validation is only suitable when a model is to be trained from scratch on k-1 folds, k times. For evaluation of trained geoparsers, e.g. when it is infeasible to train several deep learning models, we recommend using the latter options, which have similar statistical power.
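For option 1, the per-fold comparison can be sketched as a paired t-test on the fold scores of two models. The text only says "a t-test", so the paired variant is our assumption, and the fold scores below are invented for illustration. Only the t statistic is computed here, since the Python standard library has no t distribution; the p-value can be read from a t table with k-1 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-fold scores (k - 1 degrees of freedom)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    # Sample standard deviation of the differences (n - 1 denominator).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


if __name__ == "__main__":
    # Hypothetical F-Scores of two taggers on the same 5 folds.
    model_a = [88.1, 89.0, 87.5, 90.2, 88.2]
    model_b = [77.0, 78.5, 76.1, 79.9, 76.5]
    print(f"t = {paired_t_statistic(model_a, model_b):.2f} (4 degrees of freedom)")
```

For 5 folds (4 degrees of freedom), the two-tailed critical values are 2.776 at p = 0.05 and 4.604 at p = 0.01.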

K-Fold Cross-Validation works by generating 5-10 folds that satisfy the i.i.d. requirement for a parametric test dror2018hitchhiker . This means folds should (a) not be randomised, so that content from one article is not spread across folds, to satisfy the independent requirement; (b) come from the same domain such as news text to satisfy the identically distributed requirement; (c) come from separate/disjoint files/articles, preferably from separate sources as well, to satisfy the independent requirement. GeoWebNews satisfies those requirements by design. The number of folds will depend on the size of the dataset, i.e. fewer folds for a smaller dataset and vice versa. Following that, we obtain scores for each fold, perform a t-test and report the p-value. There is a debate as to whether a p-value of 0.05 is rigorous enough; we think 0.01 would be preferred but in any case, the lower the more robust. Off-the-shelf geoparsers should be tested as follows:

For Geotagging, use McNemar’s test, a non-parametric statistical hypothesis test suitable for matched pairs produced by binary classification or sequence tagging algorithms. McNemar’s test compares the disagreement rate between two models using a contingency table of their outputs. It computes the probability of the two models ’making mistakes’ at the same rate, using the chi-squared distribution with one degree of freedom. If the probability of obtaining the computed statistic is less than 0.05, we reject the null hypothesis. For a more robust result, a lower threshold is preferred. The chi-squared approximation is poor for contingency table values less than 25; however, when using several of our recommended datasets, such small values are highly unlikely.
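McNemar's test needs only the two off-diagonal cells of the contingency table, i.e. the counts of items where exactly one model is correct. The sketch below uses the common continuity-corrected statistic and obtains the p-value of the chi-squared distribution with one degree of freedom via the complementary error function; the use of the continuity correction and the example counts are assumptions of this sketch.

```python
import math


def mcnemar(only_a_correct, only_b_correct, continuity=True):
    """McNemar's chi-squared statistic and p-value (1 degree of freedom).

    only_a_correct: items model A tagged correctly and model B did not
    only_b_correct: items model B tagged correctly and model A did not
    """
    b, c = only_a_correct, only_b_correct
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree
    gap = abs(b - c)
    if continuity:
        gap = max(gap - 1, 0)  # continuity correction
    statistic = gap ** 2 / (b + c)
    # For 1 degree of freedom: P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical disagreement counts between two geotaggers.
    stat, p = mcnemar(40, 20)
    print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```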

For Toponym Resolution, use a two-tailed Wilcoxon Signed-Rank Test wilcoxon1945individual for computational efficiency as the number of test samples across multiple datasets can be large (10,000+). Geocoding errors follow a power law distribution (see Fig. 6a) with many outliers among the largest errors hence the non-parametric test. This sampling-free test compares the matched samples of geocoding errors. The null hypothesis assumes that the ranked differences between models’ errors are centred around zero, i.e. model one is right approximately as much as model two. Finally, report the p-value and z-statistic.
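A self-contained sketch of the two-tailed Wilcoxon Signed-Rank Test on matched geocoding errors follows. It uses the normal approximation, which is reasonable for the large samples mentioned above; zero differences are dropped, tied absolute differences receive average ranks, and no tie correction is applied to the variance. These simplifications and the example errors are assumptions of this sketch, and an established statistics library would normally be preferred.

```python
import math


def wilcoxon_signed_rank(errors_a, errors_b):
    """Return (z, two-tailed p) for matched samples of geocoding errors."""
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired errors are identical")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    # Sum of ranks where model A has the larger error.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical errors (km) of two geocoders on the same ten toponyms.
    model_a = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    model_b = [9, 18, 27, 36, 45, 54, 63, 72, 81, 90]
    z, p = wilcoxon_signed_rank(model_a, model_b)
    print(f"z = {z:.3f}, p = {p:.4f}")
```

A positive z here means model A tends to have the larger errors, i.e. model B is the better geocoder on this sample.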

Unsuitable Datasets

Previous works in this research topic leidner2004towards ; andogah2010geographically ; santos2015using ; leidner2008toponym have evaluated with their own labelled data but those resources cannot be located. Of those that can, we briefly discuss the reasons for their unsuitability. AIDA hoffart2011robust is a geo-annotated CoNLL 2003 NER dataset; however, the proprietary CoNLL 2003 data is required to build it. Moreover, the CoNLL file format does not allow for original text reconstruction due to the missing whitespace. SpatialML mani2010spatialml ; mani2008spatialml datasets are primarily focused on spatial expressions in natural language documents and are not freely available ($500-$1,000 for a license, https://catalog.ldc.upenn.edu/LDC2011T02). Twitter datasets such as GeoCorpora wallgrun2018geocorpora experience a gradual decline in completeness as users delete their tweets and deactivate profiles. WoTR delozier2016creating and CLDW rayson2017deeply are suitable only for digital humanities due to their historical nature and localised coverage, which is problematic to resolve butler2017alts . CLUST lieberman2011multifaceted is a corpus of clustered streaming news of global events, similar to LGL. However, it contains only 223 toponym annotations. TUD-Loc2013 katz2013learn provides incomplete coverage, i.e. no adjectival or embedded toponyms; however, it may generate extra training data with some editing effort.

5.5 Recommended Datasets

We recommend evaluation with the following open-source datasets: (1) WikToR gritta2017s is a large collection of programmatically annotated Wikipedia articles and although quite artificial, to the best of our knowledge, it is the most difficult test for handling toponym ambiguity (Wikipedia coordinates). (2) Local Global Lexicon (LGL) lieberman2010geotagging is a global collection of local news articles (Geonames coordinates) and likely the most frequently cited geoparsing dataset. (3) GeoVirus gritta2018melbourne is a WikiNews-based geoparsing dataset centred around disease reporting (Wikipedia coordinates) with global coverage though without adjectival toponym coverage. (4) TR-NEWS kamalloo2018coherent is a new geoparsing news corpus of local and global articles (Geonames coordinates) with excellent toponym coverage and metadata. (5) Naturally, we also recommend GeoWebNews for a complete, fine-grained, expertly annotated and broadly sourced evaluation dataset.

5.6 Geoparsing Evaluation Steps

As we conclude this section, here is a summary of the recommended evaluation:

  1. Review (and report) important geoparsing considerations in Section 5.4.

  2. Use a proprietary or custom NER tagger to extract toponyms using the recommended dataset(s) as demonstrated in Section 4.4.

  3. Evaluate geotagging using F-Score as the recommended metric and report statistical significance with McNemar’s Test (Section 5.4).

  4. Evaluate toponym resolution using Accuracy@161, AUC and Mean Error as the recommended metrics, see Section 4.5 for an example.

  5. Optional: Evaluate geocoding in “laboratory setting” as per Section 4.5.

  6. Report the number of toponyms resolved and the statistical significance using the Wilcoxon Signed-Rank Test (Section 5.4).

5.7 Conclusion and Future Work

Wittgenstein’s Ruler

from Nassim N. Taleb’s book, Fooled by Randomness taleb2005fooled , deserves a mention as we close this paper with the main conclusions. It says: “Unless you have confidence in the ruler’s reliability, if you use a ruler to measure a table you may also be using the table to measure the ruler.” In the case of Geoparsing and NER, this translates into: “Unless you have confidence in the dataset’s reliability, if you use the dataset to evaluate the real world, you may also be using the real world to evaluate the dataset’s reliability.” If research is optimised towards, rewarded for and benchmarked on a possibly unreliable or unrepresentative dataset, does it matter how it performs in the real world of Pragmatics? The biases in machine learning reflect human biases, so unless we improve training data annotation and generation, machine learning algorithms will not reflect linguistic reality. We must pay close attention to the correctness, diversity, timeliness and particularly the representativeness of the dataset, which also influences the soundness of the task objective. Many models still tune their performance on CoNLL NER ’03 data yadav2018survey , for example at COLING 2018 yang2018design and ACL 2018 gregoric2018named , which comes from a single source (Reuters News) tjong2003introduction and is shallowly annotated (not open-source either). It is important to ask whether ’successfully’ training and evaluating models on data that does not closely mirror real-world linguistics is the goal to aim for.

In this manuscript, we introduced a detailed pragmatic taxonomy of toponyms as a way to increase NER/Geotagging recall and to differentiate literal uses (53%) of place names from associative uses (47%) in multi-source general news text. This helps clarify the task objective, quantifies type occurrences, informs of common NER mistakes and enables innovative handling of toponyms in downstream tasks. In order to expedite future research, address the lack of resources and contribute towards replicability and extendability goodman2016does ; cacho2018reproducible , we shared the annotation framework, recommended datasets and any tools/code required for fast and easy extension. The NCRF++ model trained with just over 2,000 examples showed that it can outperform SOTA taggers such as Spacy NER and Google NLP for location extraction. The NCRF++ model can also achieve an F-Score of 77.6 in a two-label setting (literal, associative), showing that fine-grained toponym extraction is feasible. Finally, we critically reviewed current practices in geoparsing evaluation and presented our best recommendations for a fair, holistic and intuitive performance assessment. Future work may focus on generalising our pragmatic evaluation to other NER classes as well as extending the coverage to languages other than English.


  • (1) Abdelkader, A., Hand, E., Samet, H.: Brands in newsstand: spatio-temporal browsing of business news. In: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, p. 97. ACM (2015)
  • (2) Acheson, E., De Sabbata, S., Purves, R.S.: A quantitative analysis of global gazetteers: Patterns of coverage for common feature types. Computers, Environment and Urban Systems 64, 309–320 (2017)
  • (3) Al-Olimat, H.S., Thirunarayan, K., Shalin, V., Sheth, A.: Location name extraction from targeted text streams using gazetteer-based statistical language models. arXiv preprint arXiv:1708.03105 (2017)
  • (4) Allen, T., Murray, K.A., Zambrana-Torrelio, C., Morse, S.S., Rondinini, C., Di Marco, M., Breit, N., Olival, K.J., Daszak, P.: Global hotspots and correlates of emerging zoonotic diseases. Nature communications 8(1), 1124 (2017)
  • (5) Alonso, H.M., Pedersen, B.S., Bel, N.: Annotation of regular polysemy and underspecification. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, pp. 725–730 (2013)
  • (6) Andogah, G.: Geographically constrained information retrieval. University Library Groningen (2010)
  • (7) Avvenuti, M., Cresci, S., Del Vigna, F., Fagni, T., Tesconi, M.: Crismap: a big data crisis mapping system based on damage detection and geoparsing. Information Systems Frontiers pp. 1–19 (2018)
  • (8) de Bruijn, J.A., de Moel, H., Jongman, B., Wagemaker, J., Aerts, J.C.: Taggs: Grouping tweets to improve global geoparsing for disaster response. Journal of Geovisualization and Spatial Analysis 2(1), 2 (2018)
  • (9) Buscaldi, D., et al.: Toponym disambiguation in information retrieval. Ph.D. thesis (2010)
  • (10) Butler, J.O., Donaldson, C.E., Taylor, J.E., Gregory, I.N.: Alts, abbreviations, and akas: historical onomastic variation and automated named entity recognition. Journal of Map & Geography Libraries 13(1), 58–81 (2017)
  • (11) Byrne, K.: Nested named entity recognition in historical archive text. In: Semantic Computing, 2007. ICSC 2007. International Conference on, pp. 589–596. IEEE (2007)
  • (12) Cacho, J.R.F., Taghva, K.: Reproducible research in document analysis and recognition. In: Information Technology-New Generations, pp. 389–395. Springer (2018)
  • (13) Chinchor, N.: Appendix b: Muc-7 test scores introduction. In: Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29-May 1, 1998 (1998)
  • (14) Craswell, N.: Mean reciprocal rank. In: Encyclopedia of Database Systems, pp. 1703–1703. Springer (2009)
  • (15) DeLozier, G., Baldridge, J., London, L.: Gazetteer-independent toponym resolution using geographic word profiles. In: AAAI, pp. 2382–2388 (2015)
  • (16) DeLozier, G., Wing, B., Baldridge, J., Nesbit, S.: Creating a novel geolocation corpus from historical texts. LAW X p. 188 (2016)
  • (17) DeLozier, G.H.: Data and methods for gazetteer independent toponym resolution. Ph.D. thesis (2016)
  • (18) Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation 10(7), 1895–1923 (1998)
  • (19) Doddington, G.R., Mitchell, A., Przybocki, M.A., Ramshaw, L.A., Strassel, S., Weischedel, R.M.: The automatic content extraction (ace) program-tasks, data, and evaluation. In: LREC, vol. 2, p. 1 (2004)
  • (20) Dong, L., Wei, F., Sun, H., Zhou, M., Xu, K.: A hybrid neural model for type classification of entity mentions. In: IJCAI, pp. 1243–1249 (2015)
  • (21) Dredze, M., Paul, M.J., Bergsma, S., Tran, H.: Carmen: A twitter geolocation system with applications to public health. In: AAAI workshop on expanding the boundaries of health informatics using AI (HIAI), vol. 23, p. 45 (2013)
  • (22) Dror, R., Baumer, G., Shlomov, S., Reichart, R.: The hitchhiker’s guide to testing statistical significance in natural language processing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1383–1392 (2018)
  • (23) Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1277–1287. Association for Computational Linguistics (2010)
  • (24) Ferrés Domènech, D.: Knowledge-based and data-driven approaches for geographical information access (2017)
  • (25) Goodman, S.N., Fanelli, D., Ioannidis, J.P.: What does research reproducibility mean? Science translational medicine 8(341), 341ps12–341ps12 (2016)
  • (26) Gorfein, D.S.: An activation-selection view of homograph disambiguation: A matter of emphasis. On the consequences of meaning selection: Perspectives on resolving lexical ambiguity pp. 157–173 (2001)
  • (27) da Graça Martins, B.E.: Geographically aware web text mining. Ph.D. thesis, Universidade de Lisboa (Portugal) (2008)
  • (28) Gregoric, A.Z., Bachrach, Y., Coope, S.: Named entity recognition with parallel recurrent neural networks. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, pp. 69–74 (2018)
  • (29) Gritta, M., Pilehvar, M.T., Collier, N.: Which melbourne? augmenting geocoding with maps. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1285–1296 (2018)
  • (30) Gritta, M., Pilehvar, M.T., Limsopatham, N., Collier, N.: Vancouver welcomes you! minimalist location metonymy resolution. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1248–1259 (2017)
  • (31) Gritta, M., Pilehvar, M.T., Limsopatham, N., Collier, N.: What’s missing in geographical parsing? (2017)
  • (32) Grover, C., Tobin, R., Byrne, K., Woollard, M., Reid, J., Dunn, S., Ball, J.: Use of the edinburgh geoparser for georeferencing digitized historical collections. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 368(1925), 3875–3889 (2010)
  • (33) Han, B.: Improving the utility of social media with natural language processing. Ph.D. thesis (2014)
  • (34) Hearst, M.: Noun homograph disambiguation using local context in large text corpora. Using Corpora pp. 185–188 (1991)
  • (35) Henrich, A., Lüdecke, V.: Determining geographic representations for arbitrary concepts at query time. In: Proceedings of the first international workshop on Location and the web, pp. 17–24. ACM (2008)
  • (36) Hirschman, L.: The evolution of evaluation: Lessons from the message understanding conferences. Computer Speech & Language 12(4), 281–305 (1998)
  • (37) Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust disambiguation of named entities in text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 782–792. Association for Computational Linguistics (2011)
  • (38) Honnibal, M., Johnson, M.: An improved non-monotonic transition system for dependency parsing. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1373–1378. Association for Computational Linguistics, Lisbon, Portugal (2015). URL https://aclweb.org/anthology/D/D15/D15-1162
  • (39) Hulden, M., Silfverberg, M., Francom, J.: Kernel density estimation for text-based geolocation. In: AAAI, pp. 145–150 (2015)
  • (40) Jurgens, D., Finethy, T., McCorriston, J., Xu, Y.T., Ruths, D.: Geolocation prediction in twitter using social networks: A critical analysis and review of current practice. ICWSM 15, 188–197 (2015)
  • (41) Kamalloo, E., Rafiei, D.: A coherent unsupervised model for toponym resolution. In: Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 1287–1296. International World Wide Web Conferences Steering Committee (2018)
  • (42) Karimzadeh, M.: Performance evaluation measures for toponym resolution. In: Proceedings of the 10th Workshop on Geographic Information Retrieval, p. 8. ACM (2016)
  • (43) Karimzadeh, M., Huang, W., Banerjee, S., Wallgrün, J.O., Hardisty, F., Pezanowski, S., Mitra, P., MacEachren, A.M.: Geotxt: a web api to leverage place references in text. In: Proceedings of the 7th workshop on geographic information retrieval, pp. 72–73. ACM (2013)
  • (44) Katz, P., Schill, A.: To learn or to rule: two approaches for extracting geographical information from unstructured text. Data Mining and Analytics 2013 (AusDM’13) 117 (2013)
  • (45) Kolkman, M.C.: Cross-domain textual geocoding: the influence of domain-specific training data. Master’s thesis, University of Twente (2015)
  • (46) Laere, O.V., Schockaert, S., Tanasescu, V., Dhoedt, B., Jones, C.B.: Georeferencing wikipedia documents using data from social media sources. ACM Transactions on Information Systems (TOIS) 32(3), 12 (2014)
  • (47) Leidner, J.L.: Towards a reference corpus for automatic toponym resolution evaluation. In: Workshop on Geographic Information Retrieval, Sheffield, UK (2004)
  • (48) Leidner, J.L.: Toponym resolution in text: Annotation, evaluation and applications of spatial grounding of place names. Universal-Publishers (2008)
  • (49) Leveling, J., Hartrumpf, S.: On metonymy recognition for geographic information retrieval. International Journal of Geographical Information Science 22(3), 289–299 (2008)
  • (50) Lieberman, M.D., Samet, H.: Multifaceted toponym recognition for streaming news. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 843–852. ACM (2011)
  • (51) Lieberman, M.D., Samet, H.: Adaptive context features for toponym resolution in streaming news. In: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 731–740. ACM (2012)
  • (52) Lieberman, M.D., Samet, H., Sankaranarayanan, J.: Geotagging with local lexicons to build indexes for textually-specified spatial data. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pp. 201–212. IEEE (2010)
  • (53) Mani, I., Doran, C., Harris, D., Hitzeman, J., Quimby, R., Richer, J., Wellner, B., Mardis, S., Clancy, S.: Spatialml: annotation scheme, resources, and evaluation. Language Resources and Evaluation 44(3), 263–280 (2010)
  • (54) Mani, I., Hitzeman, J., Richer, J., Harris, D., Quimby, R., Wellner, B.: Spatialml: Annotation scheme, corpora, and tools. In: LREC (2008)
  • (55) Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60 (2014). URL http://www.aclweb.org/anthology/P/P14/P14-5010
  • (56) Markert, K., Nissim, M.: Metonymy resolution as a classification task. In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pp. 204–213. Association for Computational Linguistics (2002)
  • (57) Markert, K., Nissim, M.: Semeval-2007 task 08: Metonymy resolution at semeval-2007. In: Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 36–41. Association for Computational Linguistics (2007)
  • (58) Màrquez, L., Villarejo, L., Martí, M.A., Taulé, M.: Semeval-2007 task 09: Multilevel semantic annotation of catalan and spanish. In: Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 42–47. Association for Computational Linguistics (2007)
  • (59) Matsuda, K., Sasaki, A., Okazaki, N., Inui, K.: Annotating geographical entities on microblog text. In: Proceedings of The 9th Linguistic Annotation Workshop, pp. 85–94 (2015)
  • (60) Moncla, L.: Automatic reconstruction of itineraries from descriptive texts. Ph.D. thesis, Université de Pau et des Pays de l’Adour; Universidad de Zaragoza (2015)
  • (61) Nothman, J., Ringland, N., Radford, W., Murphy, T., Curran, J.R.: Learning multilingual named entity recognition from wikipedia. Artificial Intelligence 194, 151–175 (2013)
  • (62) Overell, S.E.: Geographic information retrieval: Classification, disambiguation and modelling. Ph.D. thesis, Citeseer (2009)
  • (63) Palmblad, M., Torvik, V.I.: Spatiotemporal analysis of tropical disease research combining europe pmc and affiliation mapping web services. Tropical medicine and health 45(1), 33 (2017)
  • (64) Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543 (2014)
  • (65) Pustejovsky, J.: The generative lexicon. Computational linguistics 17(4), 409–441 (1991)
  • (66) Rayson, P., Reinhold, A., Butler, J., Donaldson, C., Gregory, I., Taylor, J.: A deeply annotated testbed for geographical text analysis: The corpus of lake district writing. In: Proceedings of the 1st ACM SIGSPATIAL Workshop on Geospatial Humanities, pp. 9–15. ACM (2017)
  • (67) Redman, T., Sammons, M.: Illinois named entity recognizer: Addendum to ratinov and roth’09 reporting improved results. Tech. rep., http://cogcomp.cs.illinois.edu/papers/neraddendum-2016.pdf (2016)
  • (68) Roberts, M.: Germans, queenslanders and londoners: The semantics of demonyms. In: ALS2011: Australian Linguistics Society Annual Conference: Conference proceedings (2011)
  • (69) Roller, S., Speriosu, M., Rallapalli, S., Wing, B., Baldridge, J.: Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1500–1510. Association for Computational Linguistics (2012)
  • (70) Tjong Kim Sang, E.F.: Introduction to the conll-2002 shared task: Language-independent named entity recognition. Tech. rep., cs/0209010 (2002)
  • (71) Santos, J., Anastácio, I., Martins, B.: Using machine learning methods for disambiguating place references in textual documents. GeoJournal 80(3), 375–392 (2015)
  • (72) dos Santos, J.T.L.: Linking entities to wikipedia documents. Ph.D. thesis, Instituto Superior Técnico, Lisboa (2013)
  • (73) Sekine, S., Sudo, K., Nobata, C.: Extended named entity hierarchy (2002)
  • (74) Speriosu, M., Baldridge, J.: Text-driven toponym resolution using indirect supervision. In: ACL (1), pp. 1466–1476 (2013)
  • (75) Steinberger, R., Pouliquen, B., Van der Goot, E.: An introduction to the europe media monitor family of applications. arXiv preprint arXiv:1309.5290 (2013)
  • (76) Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: Brat: a web-based tool for nlp-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102–107. Association for Computational Linguistics (2012)
  • (77) Taleb, N.: Fooled by randomness: The hidden role of chance in life and in the markets, vol. 1. Random House Incorporated (2005)
  • (78) Tateosian, L., Guenter, R., Yang, Y.P., Ristaino, J.: Tracking 19th century late blight from archival documents using text analytics and geoparsing. In: Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings, vol. 17, p. 17 (2017)
  • (79) Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the conll-2003 shared task: Language-independent named entity recognition. In: Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pp. 142–147. Association for Computational Linguistics (2003)
  • (80) Tobin, R., Grover, C., Byrne, K., Reid, J., Walsh, J.: Evaluation of georeferencing. In: proceedings of the 6th workshop on geographic information retrieval, p. 7. ACM (2010)
  • (81) Volz, R., Kleb, J., Mueller, W.: Towards ontology-based disambiguation of geographical identifiers. In: I3 (2007)
  • (82) Wallgrün, J.O., Karimzadeh, M., MacEachren, A.M., Pezanowski, S.: Geocorpora: building a corpus to test and train microblog geoparsers. International Journal of Geographical Information Science 32(1), 1–29 (2018)
  • (83) Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., Xue, N., Taylor, A., Kaufman, J., Franchini, M., et al.: Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA (2013)
  • (84) Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics bulletin 1(6), 80–83 (1945)
  • (85) Wing, B., Baldridge, J.: Hierarchical discriminative classification for text-based geolocation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 336–348 (2014)
  • (86) Wing, B.P., Baldridge, J.: Simple supervised document geolocation with geodesic grids. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 955–964. Association for Computational Linguistics (2011)
  • (87) Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2145–2158 (2018)
  • (88) Yang, J., Liang, S., Zhang, Y.: Design challenges and misconceptions in neural sequence labeling. arXiv preprint arXiv:1806.04470 (2018)
  • (89) Yang, J., Zhang, Y.: Ncrf++: An open-source neural sequence labeling toolkit. arXiv preprint arXiv:1806.05626 (2018)
  • (90) Zheng, X., Han, J., Sun, A.: A survey of location prediction on twitter. IEEE Transactions on Knowledge and Data Engineering (2018)
  • (91) Ziegeler, D.: A word of caution on coercion. Journal of Pragmatics 39(5), 990–1028 (2007)