The vast amounts of data available from public sources such as Wikipedia can be readily used to pre-train machine learning models in an unsupervised fashion – for example, learning word embeddings [word2vec]. However, large labeled datasets are still often required to successfully train complex models such as deep neural networks, and collecting them remains an obstacle for many tasks.
In particular, a fundamental application in Natural Language Processing (NLP) is Named Entity Recognition (NER), which aims to delimit and categorize mentions of entities in text. Deep neural networks currently achieve state-of-the-art results for NER, but require large amounts of annotated data for training.
Unfortunately, such datasets are a scarce resource whose construction is costly due to the required human-made, word-level annotations. In this work we propose a method to construct labeled NER datasets without human supervision, using public data sources structured according to Semantic Web principles, namely DBpedia and Wikipedia.
Our work can be described as constructing a massive, weakly-supervised dataset (i.e., a silver-standard corpus). Using such datasets to train predictors is typically denoted distant supervision and is a popular approach to training large deep neural networks for tasks where manually-annotated data is scarce. Most similar to our approach are [wiki_ner_joel_1] and [ner_wiki_portugues], which automatically create datasets from Wikipedia – a major difference between our method and [ner_wiki_portugues] is that we use an auxiliary NER predictor to capture missing entities, yielding denser annotations.
Using our proposed method, we generate a new, massive dataset for Portuguese NER, called SESAME (Silver-Standard Named Entity Recognition dataset), and experimentally confirm that it aids the training of complex NER predictors.
II Data sources
We start by defining the features a public data source must have to support generating a NER dataset. As NER involves the delimitation and classification of named entities, we need textual data for which we know which entities are mentioned and their corresponding classes. Throughout this paper, we consider an entity class to be either person, organization, or location.
The first step in building a NER dataset from public sources is to identify whether a text is about an entity, so that irrelevant text can be ignored. To extract information from relevant text, we link the information captured by the DBpedia [dbpedia] database to Wikipedia [wikipedia] – similar approaches were used in [dbpedia_wikipedia_ner]. The main characteristics of the selected data sources, DBpedia and Wikipedia, and the methodology used for their linkage are described next.
Wikipedia is an open, cooperative, and multilingual encyclopedia that seeks to record, in electronic format, knowledge about subjects in diverse domains. The following features make Wikipedia a good data source for the purpose of building a NER dataset.
High volume of textual resources built by humans
Variety of domains addressed
Information boxes: resources that structure the information of articles homogeneously according to the subject
Internal links: links from one Wikipedia page to another, based on mentions
The last two points are key, as they capture human-built knowledge about how the text relates to named entities. Their relevance is described in more detail ahead.
Wikipedia infoboxes [infocaixa] are fixed-format tables whose structure (key-value pairs) is dictated by the article’s type (e.g. person, movie, country) – an example is provided in Figure 1. They present structured information about the subject of the article, and promote structure reuse across articles of the same type. For example, in articles about people, infoboxes contain the date of birth, awards, children, and so on.
Through infoboxes, we have access to relevant human-annotated data: the article’s categories, along with terms that identify its subject, e.g. name and date of birth. In Figure 1, note that there are two fields that can be used to refer to the entity of the article: “Nickname” and “Birth Name”.
Infoboxes can be exploited to discover whether the article’s subject is an entity of interest – that is, a person, organization, or location – along with its relevant details. However, infoboxes often contain inconsistencies that must be manually addressed, such as redundancies, e.g. distinct infoboxes for person and for human. The DBpedia project performs a version of this extraction: it extracts the infobox structure and identifies and repairs such inconsistencies [dbpediainfobox].
Interlinks are links between different articles in Wikipedia. According to the Wikipedia guidelines, only the first mention of the linked article should be a link. Figure 2 shows a link (in blue) to the article page of Alan Turing: subsequent mentions of Alan Turing in the same article must not be links.
While infoboxes provide a way to discover relevant information about a Wikipedia article, analyzing an article’s interlinks gives us access to referenced entities which are not the page’s main subject. Hence, we can parse every article on Wikipedia while searching for interlinks that point to an entity article, greatly expanding the amount of textual data to be added to the dataset.
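The interlink scan described above can be sketched as follows. This is an illustrative simplification (function and variable names are ours): a wikitext interlink has the form [[Target]] or [[Target|display text]], and real articles also contain namespaced links (files, categories) that a full pipeline must filter out.

```python
import re

# Matches [[Target]] or [[Target|display text]]; nested brackets are excluded.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_interlinks(wikitext):
    """Return (target_article, displayed_text) pairs for every interlink."""
    links = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # When no display text is given, the target itself is displayed.
        display = (match.group(2) or target).strip()
        links.append((target, display))
    return links
```

Each extracted target can then be looked up against the set of known entity articles, so that sentences mentioning entities outside the article's own subject are also collected.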
DBpedia extracts and structures information from Wikipedia into a database modeled on Semantic Web principles [semantic_web], applying the Resource Description Framework (RDF). Wikipedia’s structure was extracted and modelled as an ontology [dbpediaontology], which was only possible due to infoboxes.
The DBpedia ontology is centered on the English language, and the extracted relationships were projected onto the other languages. In short, the ontology was extracted and preprocessed from the English Wikipedia and propagated to other languages through interlinguistic links. Articles whose ontology is available in only one language are ignored.
An advantage of DBpedia is that manual preprocessing was carried out by project members to find all the relevant connections, redundancies, and synonyms – quality improvements that, in general, require meticulous human intervention. In short, DBpedia allows us to extract a set of entities, each with its class, the terms used to refer to it, and its corresponding Wikipedia article.
III Building a database
The next step consists of building a structured database with the relevant data from both Wikipedia and DBpedia.
III-A DBpedia data extraction
Data from DBpedia was collected through a public service endpoint [dbpedia_endpoint]. We searched over the entity classes people, organizations, and locations, and extracted the following information about each entity:
The entity’s class (person, organization, location)
The ID of the page (Wiki ID)
The title of the page
The names of the entity. The relevant ontology properties vary according to the class; for example, place-type entities do not have the “surname” property
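A query of the kind issued to the public DBpedia endpoint can be sketched as below. This is illustrative only: the template and the helper function are ours, and the exact properties used in our extraction may differ, though `dbo:Person`, `rdfs:label`, and `foaf:isPrimaryTopicOf` are standard DBpedia ontology vocabulary.

```python
# Hypothetical SPARQL template: list entities of one ontology class together
# with their Portuguese labels and the Wikipedia page each one corresponds to.
QUERY_TEMPLATE = """
SELECT ?entity ?label ?wikiPage WHERE {{
  ?entity a dbo:{entity_class} ;
          rdfs:label ?label ;
          foaf:isPrimaryTopicOf ?wikiPage .
  FILTER (lang(?label) = "pt")
}}
"""

def build_entity_query(entity_class):
    """Build a query for one DBpedia class, e.g. Person, Organisation, Place."""
    return QUERY_TEMPLATE.format(entity_class=entity_class)
```

Running one such query per class yields, for every entity, the terms used to refer to it and a link back to its Wikipedia article.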
III-B Wikipedia data extraction
We extracted data from the same version of Wikipedia used for DBpedia, October 2016, which is available as dumps in XML format. We extracted the following information about each article:
Article ID (a unique identifier)
Text of the article (in wikitext format)
III-C Database modelling
Figure 3 shows the structure of the database as an entity-relationship diagram. Entities and articles were linked either when one of two linked articles corresponds to the entity, or when the article itself is about a known entity.
IV-A Wikitext preprocessing
We are only interested in the plain text of each Wikipedia article, but its Wikitext (the markup language used to define the article page) might contain elements such as lists, tables, and images. We remove the following elements from each article’s Wikitext:
Lists (e.g. unbulleted list, flatlist, bulleted list)
Tables (e.g. infobox, table, categorytree)
Files (e.g. media, archive, audio, video)
Domain specific (e.g. chemistry, math)
Excerpts with irregular indentation (e.g. outdent)
IV-B Section Filtering
Wikipedia guidelines include sets of suggested sections, such as early life (for person entities), references, further reading, and so on. Some of these sections merely list related resources and do not correspond to well-structured text; they can therefore be removed to reduce noise. In particular, we remove the following sections from each article: “references”, “see also”, “bibliography”, and “external links”.
After removing noisy elements, the Wikitext of each article is converted to raw text. This is achieved with the MWparser tool [mwparserfromhell].
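The two cleanup steps above can be sketched with plain regular expressions, as below. This is a simplified illustration (the actual pipeline relies on MWparser, and section names here are the English forms listed above): templates are the double-braced calls that carry infoboxes, lists, and files, and noisy sections are identified by their headings.

```python
import re

# Sections removed per the guidelines discussed above (illustrative names).
NOISY_SECTIONS = ("references", "see also", "bibliography", "external links")

def strip_templates(wikitext):
    """Remove {{...}} template calls (infoboxes, lists, files, etc.).
    Templates nest, so innermost pairs are deleted until none remain."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    return wikitext

def drop_noisy_sections(wikitext):
    """Drop a section when its '== Title ==' heading is in the noise list."""
    parts = re.split(r"^(==+[^=\n]+==+)\s*$", wikitext, flags=re.M)
    kept = [parts[0]]  # text before the first heading
    for heading, body in zip(parts[1::2], parts[2::2]):
        title = heading.strip("= ").strip().lower()
        if title not in NOISY_SECTIONS:
            kept.extend([heading, body])
    return "\n".join(kept)
```

The surviving Wikitext is then converted to raw text for the mention-tagging step.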
IV-C Identifying entity mentions in text
The next step consists of detecting mentions to entities in the raw text. To do this, we tag character segments that exactly match one of the known names of an entity. For instance, we can tag two different entities in the following text:
Note that the word “Copacabana” can also correspond to a “Location” entity. However, some entity mentions in the raw text might not be identified if they are not present in DBpedia.
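The exact-matching step can be sketched as follows. The function and its greedy longest-name-first policy are illustrative assumptions, not the paper's exact implementation; they show how a span such as “Rio de Janeiro” takes precedence over a shorter name contained within it.

```python
def tag_mentions(text, entity_names):
    """Tag character segments that exactly match a known entity name.

    entity_names maps a surface form to its class, e.g.
    {"Rio de Janeiro": "Location"}. Longer names are matched first so
    they are not shadowed by shorter substrings. Returns sorted
    (start, end, name, class) spans.
    """
    spans = []
    taken = [False] * len(text)  # characters already claimed by a match
    for name in sorted(entity_names, key=len, reverse=True):
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            if not any(taken[start:end]):
                spans.append((start, end, name, entity_names[name]))
                for i in range(start, end):
                    taken[i] = True
            start = text.find(name, end)
    return sorted(spans)
```

In a full pipeline the matcher would also enforce word boundaries, so that a name is not tagged inside a longer word.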
IV-D Searching for other entities
To capture mentioned entities which are not present in DBpedia, we use an auxiliary NER system to detect such mentions. More specifically, we use the Polyglot [polyglot] system, a model trained on a dataset generated from Wikipedia.
Each mention’s tag also specifies whether the mention was detected using DBpedia or by Polyglot. The following convention was adopted for the tags:
Annotated (Anot) - Matched exactly with one of the entity’s names in DBpedia
Predicted (Pred) - Extracted by Polyglot
Therefore, in our previous example, we have:
A predicted entity is discarded entirely if it conflicts with an annotated one, since we aim to maximize the number of entities tagged using human-constructed resources as the knowledge base.
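The merging rule above can be sketched as follows; span representation and function names are ours. Every DBpedia-matched span is kept, and a Polyglot span survives only if it overlaps no annotated span.

```python
def merge_mentions(annotated, predicted):
    """Merge DBpedia-matched spans with Polyglot-predicted spans.

    Spans are (start, end, class) tuples in character offsets. Annotated
    spans always win; a predicted span overlapping any annotated span is
    discarded entirely, per the rule described in the text.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    merged = [(s, e, c, "Anot") for (s, e, c) in annotated]
    for span in predicted:
        if not any(overlaps(span, a) for a in annotated):
            merged.append((span[0], span[1], span[2], "Pred"))
    return sorted(merged)
```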
IV-E Tokenization of words and sentences
The supervised learning models explored in this paper require inputs split into words and sentences. This process, called tokenization, was carried out with the NLTK toolkit [nltk], in particular the “Punkt” tokenization tool, which implements a multilingual, unsupervised algorithm [punkt_algorithm].
First, we tokenize only the words corresponding to mentions of an entity. In order to explicitly mark the boundaries of each entity, we use the BIO format, where we add the suffix “B” (begin) to the first token of a mention and “I” (inside) to the tokens following it. This gives us:
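The BIO conversion can be sketched as below. We write the labels in the common B-/I- prefix form for readability; the helper and its span representation (token indices rather than character offsets) are illustrative assumptions.

```python
def bio_tags(tokens, mention_spans):
    """Assign BIO labels to a token list.

    mention_spans maps (first_token_index, last_token_index) to a class.
    The first token of a mention is tagged B (begin) and the following
    tokens of the same mention are tagged I (inside); everything else is O.
    """
    tags = ["O"] * len(tokens)
    for (first, last), cls in mention_spans.items():
        tags[first] = "B-" + cls
        for i in range(first + 1, last + 1):
            tags[i] = "I-" + cls
    return tags
```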
Second, we tokenize the remaining text, as illustrated by the following example: denotes a word token, while corresponds to a sentence token.
However, conflicts might occur between known entity tokens and the delimitation of words and sentences. More specifically, tokens corresponding to an entity must consist only of entire words (instead of only a subset of the characters of a word), and must be contained in a single sentence. In particular, we are concerned with the following cases:
(1) Entities which are not contained in a single sentence:
In this case, and compose a mention of the entity which lies both in sentence and . Under these circumstances, we concatenate all sentences that contain the entity, yielding, for the previous example:
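The sentence-concatenation repair can be sketched as follows; the function and its inputs (sentences as token lists, plus the indices of the sentences the mention touches) are illustrative.

```python
def merge_entity_sentences(sentences, entity_sentence_ids):
    """Concatenate the sentences that a single entity mention spans.

    sentences: list of token lists produced by the sentence tokenizer.
    entity_sentence_ids: indices of the sentences containing part of the
    mention (assumed contiguous). The spanned sentences are merged into
    one, so every mention ends up contained in a single sentence.
    """
    first, last = min(entity_sentence_ids), max(entity_sentence_ids)
    merged = []
    for tokens in sentences[first:last + 1]:
        merged.extend(tokens)
    return sentences[:first] + [merged] + sentences[last + 1:]
```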
(2) Entities which consist of only subsets (some characters) of a word, for example: