Building a Massive Corpus for Named Entity Recognition using Free Open Data Sources

08/13/2019 ∙ by Daniel Specht Menezes, et al. ∙ puc-rio Toyota Technological Institute at Chicago 0

With the recent progress in machine learning, boosted by techniques such as deep learning, many tasks can be successfully solved once a large enough dataset is available for training. Nonetheless, human-annotated datasets are often expensive to produce, especially when labels are fine-grained, as is the case for Named Entity Recognition (NER), a task that operates with word-level labels. In this paper, we propose a method to automatically generate labeled datasets for NER from public data sources by exploiting links and structured data from DBpedia and Wikipedia. Due to the massive size of these data sources, the resulting dataset -- SESAME Available at -- is composed of millions of labeled sentences. We detail the method to generate the dataset, report relevant statistics, and design a baseline using a neural network, showing that our dataset helps build better NER predictors.




I Introduction

The vast amounts of data available from public sources such as Wikipedia can be readily used to pre-train machine learning models in an unsupervised fashion – for example, learning word embeddings [word2vec]. However, large labeled datasets are still often required to successfully train complex models such as deep neural networks, and collecting them remains an obstacle for many tasks.

In particular, a fundamental application in Natural Language Processing (NLP) is Named Entity Recognition (NER), which aims to delimit and categorize mentions of entities in text. Currently, deep neural networks achieve state-of-the-art results for NER, but require large amounts of annotated data for training.

Unfortunately, such datasets are a scarce resource whose construction is costly due to the required human-made, word-level annotations. In this work we propose a method to construct labeled datasets without human supervision for NER, using public data sources structured according to Semantic Web principles, namely, DBpedia and Wikipedia.

Our work can be described as constructing a massive, weakly-supervised dataset (i.e. a silver-standard corpus). Using such datasets to train predictors is typically denoted distant learning and is a popular approach to training large deep neural networks for tasks where manually-annotated data is scarce. Most similar to our approach are [wiki_ner_joel_1] and [ner_wiki_portugues], which automatically create datasets from Wikipedia. A major difference between our method and [ner_wiki_portugues] is that we use an auxiliary NER predictor to capture missing entities, yielding denser annotations.

Using our proposed method, we generate a new, massive dataset for Portuguese NER, called SESAME (Silver-Standard Named Entity Recognition dataset), and experimentally confirm that it aids the training of complex NER predictors.

The methodology to automatically generate our dataset is presented in Section III. Data preprocessing and linking, along with details on the generated dataset, are given in Section IV. A baseline using deep neural networks is presented in a later section.

II Data sources

We start by defining the features that public data sources must have in order to generate a NER dataset. As NER involves the delimitation and classification of named entities, we must find textual data for which we know which entities are being mentioned and their corresponding classes. Throughout this paper, we consider an entity class to be either person, organization, or location.

The first step in building a NER dataset from public sources is to identify whether a text is about an entity, so that we can decide whether to keep or ignore it. To extract information from relevant text, we link the information captured by the DBpedia [dbpedia] database to Wikipedia [wikipedia] – similar approaches were used in [dbpedia_wikipedia_ner]. The main characteristics of the selected data sources, DBpedia and Wikipedia, and the methodology used for their linkage are described next.

II-A Wikipedia

Wikipedia is an open, cooperative and multilingual encyclopedia that seeks to register in electronic format knowledge about subjects in diverse domains. The following features make Wikipedia a good data source for the purpose of building a NER dataset.

  • High volume of textual resources built by humans

  • Variety of domains addressed

  • Information boxes: resources that structure the information of articles homogeneously according to the subject

  • Internal links: links from one Wikipedia page to another, based on mentions

The last two points are key, as they capture human-built knowledge about how text is related to named entities. Their relevance is described in more detail below.

II-A1 Infobox

Wikipedia infoboxes [infocaixa] are fixed-format tables whose structure (key-value pairs) is dictated by the article’s type (e.g. person, movie, country) – an example is provided in Figure 1. They present structured information about the subject of the article, and promote structure reuse across articles of the same type. For example, in articles about people, infoboxes contain the date of birth, awards, children, and so on.

Through infoboxes, we have access to relevant human-annotated data: the article’s categories, along with terms that identify its subject (e.g. name, date of birth). In Figure 1, note that there are two fields that can be used to refer to the entity of the article: “Nickname” and “Birth Name”.

Fig. 1: Example of a Wikipedia infobox for a person entity. It consists of a key-value table whose keys depend on the type of the corresponding entity – for a person entity, common keys include name, birth date, and so on.

Infoboxes can be exploited to discover whether the article’s subject is an entity of interest – that is, a person, organization or location – along with its relevant details. However, infoboxes often contain inconsistencies that must be manually addressed, such as redundancies (e.g. different infoboxes for person and for human). A version of this extraction was carried out by the DBpedia project, which extracts this structure and identifies and repairs inconsistencies [dbpediainfobox].

II-A2 Interlinks

Interlinks are links between different articles in Wikipedia. According to the Wikipedia guidelines, only the first mention of an article must be linked. Figure 2 shows a link (in blue) to the article page of Alan Turing: subsequent mentions of Alan Turing in the same article must not be links.

Fig. 2: Example of an interlink (blue) in a Wikipedia article. Interlinks point to other articles within Wikipedia, and follow the guideline that only the first mention of the article should contain a link.

While infoboxes provide a way to discover relevant information about a Wikipedia article, analyzing an article’s interlinks gives us access to referenced entities which are not the page’s main subject. Hence, we can parse every article on Wikipedia while searching for interlinks that point to an entity article, greatly expanding the amount of textual data that can be added to the dataset.
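As an illustration of how interlinks can be located, the following is a minimal sketch that pulls link targets and anchor texts out of wikitext with a regular expression. The function name `extract_interlinks` and the rule of skipping any target containing a colon (to drop file and category links) are assumptions made for this example, not part of the described pipeline.

```python
import re

# Matches [[Target]] and [[Target|anchor text]]; a "#section" suffix is dropped.
INTERLINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def extract_interlinks(wikitext):
    """Return (target_title, anchor_text) pairs for each interlink found."""
    links = []
    for match in INTERLINK.finditer(wikitext):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        # Simplification (assumption): treat any target with a colon as a
        # namespaced link (File:, Category:, ...) and skip it.
        if ":" in target:
            continue
        links.append((target, anchor))
    return links


sample = "[[Alan Turing|Turing]] was born in [[London]]. [[File:Turing.jpg|thumb]]"
links = extract_interlinks(sample)
print(links)
```

Each returned target title can then be looked up against the set of known entity articles to decide whether the anchor text is an entity mention.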

II-B DBpedia

DBpedia extracts and structures information from Wikipedia into a database that is modeled based on semantic Web principles [semantic_web], applying the Resource Description Framework (RDF). Wikipedia’s structure was extracted and modelled as an ontology [dbpediaontology], which was only possible due to infoboxes.

The DBpedia ontology focused on the English language and the extracted relationships were projected for the other languages. In short, the ontology was extracted and preprocessed from Wikipedia in English and propagated to other languages using interlinguistic links. Articles whose ontology is only available in one language are ignored.

An advantage of DBpedia is that manual preprocessing was carried out by project members in order to find all the relevant connections, redundancies, and synonyms – quality improvements that, in general, require meticulous human intervention. In short, DBpedia allows us to extract a set of entities, each along with its class, the terms used to refer to it, and its corresponding Wikipedia article.

III Building a database

The next step consists of building a structured database with the relevant data from both Wikipedia and DBpedia.

III-A DBpedia data extraction

Data from DBpedia was collected through a public access service [dbpedia_endpoint]. We searched over the following entity classes: people, organizations, and locations, and extracted the following information about each entity:

  • The entity’s class (person, organization, location)

  • The ID of the page (Wiki ID)

  • The title of the page

  • The names of the entity. In this case the ontology varies according to the class; for example, place-type entities do not have the “surname” property
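The extraction above could be expressed as a query against a DBpedia endpoint. The sketch below only builds the query string; the mapping from the three classes to `dbo:Person`, `dbo:Organisation` and `dbo:Place`, the use of `rdfs:label` for the page title, `foaf:name` for the entity names, and the `"pt"` language filter are all assumptions chosen for illustration, and the properties actually queried by the authors are not stated in the text.

```python
# Hypothetical mapping from the paper's classes to DBpedia ontology classes.
CLASS_URIS = {
    "person": "dbo:Person",
    "organization": "dbo:Organisation",
    "location": "dbo:Place",
}


def build_query(entity_class, limit=1000, offset=0):
    """Build a paginated SPARQL query for one entity class (illustrative only)."""
    class_uri = CLASS_URIS[entity_class]
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?entity ?wikiId ?title ?name WHERE {{
  ?entity a {class_uri} ;
          dbo:wikiPageID ?wikiId ;
          rdfs:label ?title .
  OPTIONAL {{ ?entity foaf:name ?name }}
  FILTER (lang(?title) = "pt")
}}
LIMIT {limit} OFFSET {offset}
""".strip()


print(build_query("person", limit=10))
```

Paginating with `LIMIT` and `OFFSET` is one way to stay within the result-size caps that public endpoints typically enforce.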

III-B Wikipedia data extraction

We extracted data from the same version of Wikipedia that was used for DBpedia, October 2016, which is available as dumps in XML format. We extracted the following information about the articles:

  • Article title

  • Article ID (a unique identifier)

  • Text of the article (in wikitext format)
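A minimal sketch of reading these three fields from an XML dump is shown below. It streams the file with the standard library, so it assumes only that each article is a `<page>` element containing `<title>`, `<id>` and `<text>` children, as in the MediaWiki export format; the function names and the inline sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(xml_stream):
    """Yield (article_id, title, wikitext) for each <page> in a dump."""
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {"title": None, "id": None, "text": ""}
        for child in elem.iter():
            name = local_name(child.tag)
            # The first <id> in document order is the page ID; later ones
            # belong to the revision or contributor.
            if name == "id" and fields["id"] is None:
                fields["id"] = int(child.text)
            elif name == "title":
                fields["title"] = child.text
            elif name == "text":
                fields["text"] = child.text or ""
        yield fields["id"], fields["title"], fields["text"]
        elem.clear()  # release the parsed page; real dumps are very large


sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alan Turing</title>
    <ns>0</ns>
    <id>10</id>
    <revision>
      <id>99</id>
      <text>'''Alan Turing''' was born in [[London]].</text>
    </revision>
  </page>
</mediawiki>"""
articles = list(iter_articles(io.BytesIO(sample)))
print(articles)
```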

III-C Database modelling

Figure 3 shows the structure of the database as an entity-relationship diagram. Entities and articles were linked when either one of two linked articles corresponds to the entity, or the article itself is about a known entity.

Fig. 3: Diagram of the database representing the links between entities and articles.
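Since the diagram itself is not reproduced here, the following is only one plausible relational layout for such a database, written for SQLite. Every table and column name (`entity`, `entity_name`, `article`, `article_entity`, `relation`) is a hypothetical choice for this sketch and should not be read as the schema in Figure 3.

```python
import sqlite3

SCHEMA = """
CREATE TABLE entity (
    wiki_id      INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    entity_class TEXT NOT NULL
        CHECK (entity_class IN ('person', 'organization', 'location'))
);
CREATE TABLE entity_name (
    wiki_id INTEGER NOT NULL REFERENCES entity(wiki_id),
    name    TEXT NOT NULL
);
CREATE TABLE article (
    article_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    wikitext   TEXT NOT NULL
);
-- An article is linked to an entity either because it is about the entity
-- ('subject') or because it contains an interlink to it ('interlink').
CREATE TABLE article_entity (
    article_id INTEGER NOT NULL REFERENCES article(article_id),
    wiki_id    INTEGER NOT NULL REFERENCES entity(wiki_id),
    relation   TEXT NOT NULL CHECK (relation IN ('subject', 'interlink')),
    PRIMARY KEY (article_id, wiki_id, relation)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO entity VALUES (10, 'Alan Turing', 'person')")
conn.execute("INSERT INTO entity_name VALUES (10, 'Alan Turing')")
conn.execute("INSERT INTO article VALUES (10, 'Alan Turing', '...')")
conn.execute("INSERT INTO article VALUES (20, 'Enigma', '...')")
conn.executemany(
    "INSERT INTO article_entity VALUES (?, ?, ?)",
    [(10, 10, "subject"), (20, 10, "interlink")],
)
rows = conn.execute(
    "SELECT article_id, relation FROM article_entity "
    "WHERE wiki_id = 10 ORDER BY article_id"
).fetchall()
print(rows)
```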

IV Preprocessing

IV-A Wikitext preprocessing

We are only interested in the plain text of each Wikipedia article, but its Wikitext (language used to define the article page) might contain elements such as lists, tables, and images. We remove the following elements from each article’s Wikitext:

  • Lists (e.g. unbulleted list, flatlist, bulleted list)

  • Tables (e.g. infobox, table, categorytree)

  • Files (e.g. media, archive, audio, video)

  • Domain specific (e.g. chemistry, math)

  • Excerpts with irregular indentation (e.g. outdent)
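One way to drop such elements is to remove every `{{...}}` template whose name starts with a blocked prefix, taking nesting into account. The sketch below does this with plain string scanning; the prefix list is an illustrative subset of the elements named above, and the helper names are hypothetical rather than taken from the authors' code.

```python
# Illustrative subset of template-name prefixes to remove (assumption).
REMOVED_TEMPLATES = (
    "unbulleted list", "flatlist", "bulleted list",
    "infobox", "categorytree", "outdent",
)


def find_template_end(text, start):
    """Return the index of the '}}' closing the template opened at start, or -1."""
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            if depth == 0:
                return i
            i += 2
        else:
            i += 1
    return -1


def strip_templates(wikitext, names=REMOVED_TEMPLATES):
    """Remove templates whose (lowercased) name starts with one of `names`."""
    out, i = [], 0
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            end = find_template_end(wikitext, i)
            if end != -1:
                name = wikitext[i + 2:end].split("|", 1)[0].strip().lower()
                if name.startswith(names):
                    i = end + 2  # skip the whole template, nested parts included
                    continue
        out.append(wikitext[i])
        i += 1
    return "".join(out)


raw = "A {{Infobox person|birth={{birth date|1912|6|23}}}}B {{lang|pt|Rio}}"
print(strip_templates(raw))
```

Templates that are not on the list, such as `{{lang|pt|Rio}}` in the sample, are left untouched for a later conversion step to handle.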

IV-B Section Filtering

Wikipedia guidelines include sets of suggested sections, such as early life (for person entities), references, further reading, and so on. Some of these sections only list related resources and do not contain well-structured text; they can therefore be removed to reduce noise. In particular, we remove the following sections from each article: “references”, “see also”, “bibliography”, and “external links”.

After removing noisy elements, the Wikitext of each article is converted to raw text. This is achieved through the tool MWparser [mwparserfromhell].
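The section filtering described above can be sketched as follows, operating directly on wikitext headings of the form `== Title ==`. The section names are the English ones listed in the text; a Portuguese dump would need the corresponding localized titles, and dropping subsections together with their removed parent is an assumption of this example.

```python
import re

REMOVED_SECTIONS = {"references", "see also", "bibliography", "external links"}

# A heading line: the same run of 2-6 '=' on both sides of the title.
HEADING = re.compile(r"^(={2,6})[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)


def drop_sections(wikitext, removed=REMOVED_SECTIONS):
    """Remove the listed sections (and their subsections) from wikitext."""
    matches = list(HEADING.finditer(wikitext))
    # Text before the first heading (the lead) is always kept.
    kept = [wikitext[:matches[0].start()] if matches else wikitext]
    skip_level = None
    for idx, m in enumerate(matches):
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(wikitext)
        level = len(m.group(1))
        if skip_level is not None and level > skip_level:
            continue  # subsection of a removed section
        skip_level = None
        if m.group(2).lower() in removed:
            skip_level = level
            continue
        kept.append(wikitext[m.start():end])
    return "".join(kept)


page = (
    "Lead.\n== Early life ==\nBorn.\n== References ==\n<references/>\n"
    "=== Notes ===\nn\n== Career ==\nWork.\n"
)
print(drop_sections(page))
```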

IV-C Identifying entity mentions in text

The next step consists of detecting mentions of entities in the raw text. To do this, we tag character segments that exactly match one of the known names of an entity. For instance, we can tag two different entities in the following text:

Note that the word “Copacabana” can also correspond to a “Location” entity. However, some entity mentions in raw text might not be identified in case they are not present in DBpedia.
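A minimal sketch of exact-match tagging over character segments is given below. The sentence and the name dictionary are made up for illustration, and the tie-breaking rule (longer names claim their characters first, so that a shorter name such as “Copacabana” does not split a longer one) is an assumption of this example rather than a rule stated in the text.

```python
import re


def tag_mentions(text, names_to_class):
    """Return non-overlapping (start, end, class) spans of exact name matches."""
    spans = []
    taken = [False] * len(text)
    # Longer names first, so they win over shorter names they contain.
    for name in sorted(names_to_class, key=len, reverse=True):
        for m in re.finditer(re.escape(name), text):
            if not any(taken[m.start():m.end()]):
                spans.append((m.start(), m.end(), names_to_class[name]))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(spans)


# Hypothetical names and sentence, for illustration only.
names = {
    "Tom Jobim": "Person",
    "Copacabana Palace": "Organization",
    "Copacabana": "Location",
}
text = "Tom Jobim lived near Copacabana Palace."
spans = tag_mentions(text, names)
print([(text[s:e], c) for s, e, c in spans])
```

Because matching is done on characters rather than on words, a name can match inside a longer word; such cases have to be reconciled with word boundaries at tokenization time.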

IV-D Searching for other entities

To handle mentions of entities that are not present in DBpedia, we use an auxiliary NER system to detect such mentions. More specifically, we use the Polyglot [polyglot] system, a model trained on a dataset generated from Wikipedia.

Each mention’s tag also specifies whether the mention was detected using DBpedia or by Polyglot. The following convention was adopted for the tags:

  • Annotated (Anot) - Matched exactly with one of the entity’s names in DBpedia

  • Predicted (Pred) - Extracted by Polyglot

Therefore, in our previous example, we have:

A predicted entity will be discarded entirely if it conflicts with an annotated one, since we aim to maximize the entities tagged using human-constructed resources as knowledge base.
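The discard rule can be sketched as a simple overlap check between character spans. Spans are assumed here to be `(start, end, class)` tuples with an exclusive end offset; that representation, the function name, and the sample offsets are choices made for this example.

```python
def merge_spans(annotated, predicted):
    """Combine spans, dropping predicted spans that overlap an annotated one."""

    def overlaps(a, b):
        # Half-open intervals [start, end) overlap when each starts before
        # the other ends.
        return a[0] < b[1] and b[0] < a[1]

    merged = [(s, e, c, "Anot") for s, e, c in annotated]
    for span in predicted:
        if not any(overlaps(span, a) for a in annotated):
            merged.append((*span, "Pred"))
    return sorted(merged)


annotated = [(0, 9, "Person")]
predicted = [(4, 9, "Location"), (21, 31, "Location")]
print(merge_spans(annotated, predicted))
```

The first predicted span overlaps the annotated one and is dropped entirely, while the second is kept and marked as `Pred`.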

IV-E Tokenization of words and sentences

The supervised learning models explored in this paper require inputs split into words and sentences. This process, called tokenization, was carried out with the NLTK toolkit [nltk], in particular the “Punkt” tokenization tool, which implements a multilingual, unsupervised algorithm [punkt_algorithm].

First, we tokenize only the words corresponding to mentions of an entity. In order to explicitly mark the boundaries of each entity, we use the BIO format, where we add the suffix “B” (begin) to the first token of a mention and “I” (inside) to the tokens following it. This gives us:

Second, we tokenize the remaining text, as illustrated by the following example: denotes a word token, while corresponds to a sentence token.
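The BIO marking described above can be sketched on an already tokenized sentence. The exact tag strings (`Person-B`, `Person-I`) and the use of `O` for tokens outside any mention follow common BIO practice and are assumptions here, as are the sample tokens; mentions are given as token index ranges with an exclusive end.

```python
def bio_tags(tokens, mentions):
    """Assign a BIO tag to each token.

    mentions: (first_token_index, end_token_index_exclusive, entity_class).
    """
    tags = ["O"] * len(tokens)  # "O": token outside any mention (assumption)
    for start, end, cls in mentions:
        tags[start] = f"{cls}-B"  # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"{cls}-I"  # remaining tokens of the mention
    return tags


tokens = ["Tom", "Jobim", "lived", "near", "Copacabana", "."]
mentions = [(0, 2, "Person"), (4, 5, "Location")]
tags = bio_tags(tokens, mentions)
print(list(zip(tokens, tags)))
```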

However, conflicts might occur between known entity tokens and the delimitation of words and sentences. More specifically, tokens corresponding to an entity must consist only of entire words (instead of only a subset of the characters of a word), and must be contained in a single sentence. In particular, we are concerned with the following cases:

(1) Entities which are not contained in a single sentence:

In this case, and compose a mention of the entity which lies both in sentence and . Under these circumstances, we concatenate all sentences that contain the entity, yielding, for the previous example:

(2) Entities which consist of only subsets (some characters) of a word, for example: