As web technology continues to thrive, documents containing biographical information are continuously generated and published online in large numbers (Nasar et al., 2021). These online documents contain essential facts or events related to the lives of well-known and lesser-known individuals, which can be used to populate structured biographical databases (Wang et al., 2021; Smirnova and Cudré-Mauroux, 2018). These databases are capable of supporting many interesting studies in the humanities and related areas (Zhang et al., 2017), as we describe in Section 5. However, manually extracting information from a massive document collection is infeasible, given the amount of information available online. Therefore, NLP methods can be used to process these documents automatically.
Previous studies have used many NLP techniques including text classification (Palmero Aprosio and Tonelli, 2015; Hogue et al., 2014), named entity recognition (NER) (Jiang, 2012) and summarisation (Zhou et al., 2004) to perform biographical information extraction, which we describe thoroughly in Section 2. However, a major weakness of these studies is that their output cannot be used directly to populate a database. Instead, they need to be combined with other NLP techniques to extract the structured information required for databases. A different approach, which we employ in this study, is to frame biographical information extraction as a relation extraction (RE) task.
RE is the task of extracting semantic relationships between entities from a document, which can in turn be used to populate a database with relational facts contained in a piece of text. Consider the following two text pieces on two different people.
Text 1: William Shakespeare was born and raised in Warwickshire. At the age of 18, he married Anne Hathaway, with whom he had three children: Susanna Hall and twins Hamnet Shakespeare and Judith Quiney.
Text 2: Henry Baynton (23 September 1892 in Warwickshire – 2 January 1951 in London) was a British Shakespearean actor of the early 20th century.
For the texts shown above, the RE model can extract triples, which can be represented as edges in a knowledge graph, such as <William Shakespeare, Spouse, Anne Hathaway>. Table 1 shows some of the relationship triples that can be extracted from the above two text pieces. Combining such triples, a system can produce a knowledge graph of relational facts between persons, occupations, and locations in the text. A knowledge graph derived from the relationships in Table 1 is shown in Figure 1.
|William Shakespeare|Birth Place|Warwickshire|
|William Shakespeare|Spouse|Anne Hathaway|
|William Shakespeare|Child|Susanna Hall|
|William Shakespeare|Child|Hamnet Shakespeare|
|William Shakespeare|Child|Judith Quiney|
|Henry Baynton|Birth Place|Warwickshire|
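To illustrate how triples such as those in Table 1 can be combined into a graph, the following is a minimal sketch. The function names `build_graph` and `subjects_with` are hypothetical and chosen only for illustration; the graph is a plain adjacency map rather than any particular knowledge-graph library.

```python
from collections import defaultdict

# Triples copied from Table 1: (subject, relation, object).
TRIPLES = [
    ("William Shakespeare", "Birth Place", "Warwickshire"),
    ("William Shakespeare", "Spouse", "Anne Hathaway"),
    ("William Shakespeare", "Child", "Susanna Hall"),
    ("William Shakespeare", "Child", "Hamnet Shakespeare"),
    ("William Shakespeare", "Child", "Judith Quiney"),
    ("Henry Baynton", "Birth Place", "Warwickshire"),
]


def build_graph(triples):
    """Return an adjacency map: subject -> list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)


def subjects_with(triples, relation, obj):
    """Return every subject linked to `obj` by `relation`, in input order."""
    return [s for s, r, o in triples if r == relation and o == obj]


graph = build_graph(TRIPLES)
born_in_warwickshire = subjects_with(TRIPLES, "Birth Place", "Warwickshire")
```

Because both persons share the object Warwickshire, the two text pieces become connected through a single location node once their triples are merged.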
Knowledge graphs are commonly used by companies to provide information to end-users and understand relationships between various types of entities. Several machine learning models including recurrent neural networks (RNN) (Gormley et al., 2015; Xiao and Liu, 2016), convolutional neural networks (CNN) (Zeng et al., 2014; Shen and Huang, 2016), graph neural networks (GNN) (Baldini Soares et al., 2019; Xue et al., 2021) and transformers (Nayak and Ng, 2020; Joshi et al., 2020) have been proposed to automatically extract relationships from texts. These machine learning models use a supervised paradigm where the models require a dataset similar to Table 1 to train. Therefore, the NLP community has a growing interest in producing datasets capable of training machine learning models to perform RE. Several datasets in this area, such as NYT24 (Hoffmann et al., 2011) and TACRED (Zhang et al., 2017), have been released for this purpose. However, all of these datasets are manually annotated, which makes it difficult to expand RE to different genres and languages. In this paper, we propose a novel approach for producing RE datasets that is semi-supervised and can be expanded easily to other domains and languages. As far as we know, an approach such as this has not yet been proposed. We develop the first dataset of this kind and evaluate its usefulness. If the approach does prove to be useful, it will significantly reduce the burden of manual annotation, as well as the need for language and domain-specific expertise.
The main contributions of this paper are the following:
We introduce Biographical, the first and largest dataset for biographical RE built in a semi-supervised manner, with ten relationship categories. We also produce a manually annotated subset that can be used for evaluation. The dataset is available at https://plumaj.github.io/biographical/.
We evaluate four machine learning models to perform biographical RE, based on state-of-the-art transformer models such as BERT (Devlin et al., 2019).
We provide important resources to the community: the dataset, the code, and the pre-trained models are made available to everyone interested in working on biographical RE using the same methodology.
The rest of the paper is structured as follows. Section 2 presents an overview of related work. Section 3 describes the data compilation process involved in this study. In Section 4 we explain the experiments carried out, as well as an evaluation of the experiments. Finally, the paper outlines an intended future study and provides conclusions.
2. Related Work
Extracting biographical information from documents is a popular research area in the NLP community. Most of these studies use different NLP techniques on open and free resources such as Wikipedia.
Text classification is one of the first NLP techniques used to extract biographical information. Biadsy et al. (2008) used an unsupervised sentence classification framework to extract biographies from Wikipedia articles. In more recent work, Palmero Aprosio and Tonelli (2015) have trained various machine learning classifiers to detect biographical sections in Wikipedia texts using a supervised approach. In a different work, Hogue et al. (2014) use Wikipedia page traffic data to determine sentences of importance in Wikipedia articles.
Text summarisation is another popular NLP technique that has been used to extract biographical information. Biadsy et al. (2008) use Wikipedia articles together with the TDT4 news corpus (https://catalog.ldc.upenn.edu/LDC2005S11) to train an unsupervised multi-document summariser for biographical information. They used a support vector machine model and achieved state-of-the-art performance at the time on the DUC2004 dataset (Over and Yen, 2004). The approach is based on the one proposed by Zhou et al. (2004), who similarly used Wikipedia data to develop a system for summarisation using a Naive Bayes architecture. Chisholm et al. (2017) combine Wikipedia text and Wikidata information to generate one-sentence summaries from structured biographical information. First, the approach identifies potential biographical candidates from Wikidata, then learns to generate the short summaries by mapping structured information to the first sentence of the matching article in Wikipedia. This follows the mostly standardised pattern of the first sentence of a Wikipedia article containing most of the relevant information about a person.
However, none of these approaches can be used directly to create a knowledge graph. Therefore, more recent work in biographical information extraction has modelled the task as an RE problem. Several ML models have been developed to perform RE. Early approaches for RE were based on traditional machine learning models such as support vector machines (Liu et al., 2007) and decision trees (Singhal et al., 2016). But with the introduction of word embeddings and the success of neural network architectures in different areas, the NLP community has used a wide range of neural network architectures for the RE task. Zeng et al. (2014) have used a CNN architecture and a synonym dictionary to integrate semantic knowledge into the neural network. In a different approach, Zeng et al. (2014) use lexical features with the word embeddings (Turian et al., 2010) fed into a CNN to perform RE. RNNs have also been widely used in RE. Miwa and Bansal (2016) utilised a Tree Long Short-Term Memory (LSTM) network to perform RE. Zhou et al. (2016) used an attention-based bi-directional LSTM network on the SemEval-2010 relation classification task (Hendrickx et al., 2010) and show that it provides good results. The current state-of-the-art in RE, also used for this research, is based on neural transformers (Baldini Soares et al., 2019). These transformer models are trained using a language modelling task such as masked language modelling or next sentence prediction and are then used to perform RE as a downstream NLP task. Results on recent RE datasets show that transformers outperform the previous architectures based on RNNs and CNNs (Xue et al., 2019; Baldini Soares et al., 2019).
All the ML models for RE mentioned above follow a supervised paradigm where an annotated dataset is required to train the ML model. The most common datasets used for this are NYT24 (Hoffmann et al., 2011), NYT29 (Riedel et al., 2010) and TACRED (Zhang et al., 2017). All these datasets have been created using manual annotation. As we mentioned before, since the annotation process is expensive, these datasets are limited in size. For example, TACRED, the largest RE dataset, has only 106,264 instances. This may not be enough to train data-driven methods, especially those based on neural networks. Furthermore, the manual annotation process limits the expansion of RE research to different domains and languages. To address this problem, we propose a semi-supervised approach to create RE datasets, similar to that of Chisholm et al. (2017), which we describe in the next section.
3. Data Compilation
The data compilation process is divided into two steps. The first step involves the selection of our data sources, which are one of the most fundamental aspects of the approach (Section 3.1). Our approach requires a source of textual data and a source of structured information that is related to the textual data. The second step concerns the processing of the different data sources, as well as matching operations that allow for the automatic labelling process (Section 3.2). These steps lead to the final dataset consisting of sentences, marked entities and their respective relation.
3.1. Data Sources
Our semi-supervised approach combines data from three different sources: Wikipedia (https://www.wikipedia.org/), Wikidata (https://www.wikidata.org/wiki/Wikidata:Main_Page) and Pantheon (https://pantheon.world/) (Yu et al., 2016). Wikipedia serves as the main source of textual data, in the form of sentences taken from specific articles. Pantheon and Wikidata serve as our sources of structured information. We also use Pantheon to select our initial set of biographical articles from Wikipedia. We target specific biographical articles in Wikipedia that are confirmed by the Pantheon dataset. Next, we iterate over the sentences of each article and tag the named entities, including locations and dates, using spaCy (https://spacy.io/) and Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/). Finally, we augment the structured data from the Pantheon dataset with information from Wikidata. This expanded dataset is matched to sentences in Wikipedia, allowing us to label each sentence according to the type of relation. We discuss each of the data sources in more detail in the following sections.
Wikipedia is a free, online encyclopedia that contains a large amount of information about people, and as such, serves as the backbone of our approach. It is a vast resource of textual data, that is linked to a number of different projects that relay the contained information in a structured way. The next steps in our approach will focus on connecting the structured data with the textual data.
For processing Wikipedia textual data, we follow a previously established workflow (Plum et al., 2019) which has proved to be efficient. We work with Wikipedia database backup dumps, which are an exact copy of all Wikipedia articles of a given language at a specific point in time. We use the enwiki-20190420 dump, which corresponds to the content of English Wikipedia on 20 April 2019. Once downloaded, we extract articles corresponding to the entries in the Pantheon dataset, which is done via the Wikipedia IDs. Extracting the text can be a complex task in itself, since the structure of the XML file is not uniform and it includes certain XML parts that have to be expanded. Since the extraction of text from Wikipedia is not our main goal and could warrant a separate project, we use an existing tool for the extraction process. The wikiextractor package for Python (https://github.com/attardi/wikiextractor) converts articles to plain text. We observed some extraction problems, such as XML-tag artefacts, mismatched quotation marks, and incomplete or illegible sentences, which we remove at the processing stage with regular expressions.
In order to determine which articles in Wikipedia are biographical, i.e. containing information that pertains to a person, we use the Pantheon dataset (Yu et al., 2016). According to its creators, “Pantheon [is] focused on biographies with a presence in 15 different languages in Wikipedia” and consists of roughly 85,000 entries. While it was initially created mostly by hand, its later iterations have used a classifier to determine and extract further entries. One particular characteristic of this dataset is that each entry has to contain unambiguous links to the respective Wikipedia and Wikidata pages. This allows us to identify which articles from Wikipedia contain the relevant information. While this could be done just using Wikidata, Pantheon has been (at least partly) manually verified. Because Pantheon only includes persons whose articles are available in 15 different languages, this ensures that a person is somewhat well-known, in turn making a longer Wikipedia article more likely.
In addition, each entry includes basic information, which we match to sentences from the corresponding Wikipedia articles. This mainly includes information such as dates of birth and death, places of birth and death, and main occupation. The included information allows us to label the birthdate, deathdate, birthplace, deathplace and occupation relations, while also allowing us to confirm the name of a person. As these relations are only half of the relations we target, we use the included Wikidata ID to obtain the other half of the relations (introduced next).
Wikidata is described as a “free, collaborative, multilingual, secondary database” that “provides support for Wikipedia […]” (Vrandečić and Krötzsch, 2014). Wikidata ties in well with the two other sources of data that we use. Since it provides most of the information from a Wikipedia page (and often more) in a structured format, we use it to augment the Pantheon dataset. Since the Pantheon dataset provides distinct identifiers for Wikipedia and Wikidata, selecting the correct entity is a straightforward task. Using the corresponding entries, we add the educatedAt, ofParent, sibling and hasChild relations, as well as other. We use this last relation to categorise any relation that is not explicitly targeted here, and we make sure that the information matched is not part of any of the nine other relations.
3.2. Automatic Labelling
The next step in the approach is the automatic labelling of sentences. Once we have extracted the text of each Wikipedia article, we begin processing the texts, using spaCy NER to tag persons, locations, organisations, dates, as well as Stanford CoreNLP Entity information to tag occupations in each article. It should be noted that we run spaCy at runtime, but we carried out one full annotation run with Stanford CoreNLP on all articles, which we store and subsequently only access. This is because we found Stanford CoreNLP too slow for multiple runs.
Each sentence of an article is processed in order to determine whether it is about the main person of the article. To accomplish this, the script matches the name with the person tags in the sentence, and also allows some substring matches, such as first and last name excluding any other titles, or last name only. If a match is found, the sentence is regarded as containing some information about that person. This is ensured because the sentence is taken from that person’s article and it includes that person’s name.
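The name check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the study: the `TITLES` set is hypothetical (the text does not list which titles are excluded), and the person entities are assumed to arrive as plain strings from the NER step.

```python
# Hypothetical set of titles to ignore; the paper does not enumerate them.
TITLES = {"sir", "dame", "dr", "dr.", "prof", "prof.", "lord", "lady"}


def name_variants(full_name):
    """Return the accepted surface forms of a name.

    Accepted forms are the full name without titles, first plus last
    name, and last name only.
    """
    tokens = [t for t in full_name.split() if t.lower() not in TITLES]
    variants = {" ".join(tokens)}
    if len(tokens) >= 2:
        variants.add(f"{tokens[0]} {tokens[-1]}")
        variants.add(tokens[-1])
    return variants


def mentions_main_person(person_entities, full_name):
    """True if any tagged person entity equals an accepted name variant."""
    variants = {v.lower() for v in name_variants(full_name)}
    return any(ent.strip().lower() in variants for ent in person_entities)
```

For instance, a sentence whose person entities are "Shakespeare" and "Anne Hathaway" would be kept for the article on William Shakespeare, because the last name alone is an accepted variant.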
After a positive match is made within a sentence, we check the other tagged entities in the sentence (locations, organisations, dates and occupations) against the information provided by the Pantheon dataset and respective Wikidata entry. Each matched pair, for instance a name and a location, is then marked with <eN> (begin) and </eN> (end) tags, where N is either 1 or 2, depending on the position of the entity (i.e. first or last). This is followed by the respective relation tag. The following text box shows an example of this. We estimate that this approach could be extended to all relations where it would be possible to match the information in a sentence in this way.
Text 1: <e1>William Shakespeare</e1> was born and raised in <e2>Warwickshire</e2>.
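The marking step can be illustrated with a short sketch. It assumes that the markers are angle-bracket tags of the form <e1>…</e1> and <e2>…</e2>, and it locates entities by first exact string match rather than by NER spans, which is a simplification. The helper names `mark_pair` and `make_example` are hypothetical.

```python
def mark_pair(sentence, span_a, span_b):
    """Wrap two non-overlapping (start, end) character spans in e1/e2 tags.

    The span that starts first becomes e1, the other becomes e2.
    """
    first, second = sorted([span_a, span_b])
    if first[1] > second[0]:
        raise ValueError("entity spans overlap")
    (s1, e1), (s2, e2) = first, second
    return (
        sentence[:s1] + "<e1>" + sentence[s1:e1] + "</e1>"
        + sentence[e1:s2] + "<e2>" + sentence[s2:e2] + "</e2>"
        + sentence[e2:]
    )


def make_example(sentence, name, value, relation):
    """Find `name` and `value` by exact string match and label the pair."""
    a = sentence.find(name)
    b = sentence.find(value)
    if a < 0 or b < 0:
        return None  # one of the two entities is not in the sentence
    marked = mark_pair(sentence, (a, a + len(name)), (b, b + len(value)))
    return {"sentence": marked, "relation": relation}


example = make_example(
    "William Shakespeare was born and raised in Warwickshire.",
    "William Shakespeare",
    "Warwickshire",
    "birthplace",
)
```

Ordering the spans by start offset before tagging keeps e1 as the first entity in the sentence and e2 as the last, as described above.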
We hypothesise that this simple combination of named entity tagging and string matching works because of the controlled circumstances, which were mentioned at the beginning of this section. We only allow matches involving the person who is the main subject of an article, ensuring that statements made in sentences are most likely to be about this person. This may sound quite obvious at first. However, sentences taken from articles at random, matching random people, do not necessarily contain statements about that person. If the subject of the Wikipedia article is a certain person, most, if not all, statements made mentioning that person are likely to directly relate to that person.
Another control mechanism involves the structure of Wikipedia. Often we find a number of opening paragraphs containing the most important information about a person (or other entity). First mentions of certain facts are likely to be the main information, such as the first date mentioned usually being the date of birth, first mentioned locations being the places of death and/or birth, job titles usually the corresponding (and main) occupation of the person and so on. It should be mentioned, however, that this structure can cause problems, as will be elaborated on in section 4.2.
It is important to note that not every relation is always found for every entity. We therefore tried different processing approaches for the textual data, detailed in Section 4.1. A breakdown of the number of relations per set is presented in Section 4.2. Each relation also requires slightly different handling depending on the type of information. Tasks include date normalisation, partial matching for occupations, and exact location name matching. Exact details are presented in the following sections.
3.2.1. Date-based Relations
This set of relations includes birthdate (date of birth) and deathdate (date of death). In order to match these relations, the system checks for a DATE entity in the sentence, which is normalised to YYYY-MM-DD format. We use the dateparser package (https://dateparser.readthedocs.io/en/latest/) and use the date of processing as a relative date (for rare cases such as tomorrow or today). Furthermore, we use the first match for both relations, discarding subsequent matches. This mode of processing aligns with our restrictive approach, which assumes the most pertinent information to be mentioned towards the beginning of a Wikipedia article, rather than towards the end.
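The normalise-then-match logic can be sketched as below. To keep the example self-contained, the standard library's `datetime.strptime` stands in for dateparser, and the `FORMATS` tuple is an illustrative assumption covering only a few spellings; the function names are likewise hypothetical.

```python
from datetime import datetime

# Illustrative subset of date spellings; a real date parser handles far more.
FORMATS = ("%d %B %Y", "%B %d, %Y", "%Y-%m-%d")


def normalise_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def first_matching_date(date_entities, target):
    """Return the first DATE entity whose normalised form equals `target`.

    Later matches are ignored, mirroring the first-match policy.
    """
    for ent in date_entities:
        if normalise_date(ent) == target:
            return ent
    return None
```

Here `target` would be the structured birth or death date from Pantheon, already in YYYY-MM-DD form, so the comparison is a plain string equality.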
3.2.2. Name-based Relations
This set of relations includes ofParent, sibling and hasChild, as well as educatedAt (the place of education). For these name-based relations, the system checks a sentence for PER and ORG tags. It is ensured that only full matches are accepted, even though it may seem favourable to accept partial matches, at least for anything concerning persons. This is because with persons, it can be reasonable to allow just the first or last name to match. However, we found during the manual annotation process (Section 4.2) that too many false matches occurred, caused by different persons having the same name.
3.2.3. Entity Information Relations
Only the occupation relation is included in this group. Since spaCy’s NER capabilities do not include any tags such as title or job, we opted to use Stanford CoreNLP’s entity information processing to add this relation. We could have trained the spaCy model to include a new entity type for this step. In the end, we used CoreNLP, as we felt that training a new entity type could potentially introduce another layer of errors.
The system lookup for this relation functions in a similar way to the previous set of relations, only that instead the CoreNLP information is accessed for matching. As mentioned, we run the initial CoreNLP processing separately due to the increased run time. Again, we only allow the complete first match to be annotated. Potentially, this relation set could be extended by using further occupation information from Wikidata, which in most cases lists a number of different occupations for a person, rather than the one main occupation listed in Pantheon.
3.2.4. Other Relations
This class of relations, labelled as other in the dataset, is used for all other relations. It is essentially the zero class, which is assigned when all other lookups in a sentence have failed. The other label is then applied to an entity pair that does not appear to be part of any of the other nine relations matched. Since we obtain more sentences from this class than from all the other nine combined, we randomly select sentences so that the other class is equal in size to the remaining nine relations combined.
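The balancing step amounts to random downsampling of the other class. The following is a minimal sketch; the dictionary layout of an example, the fixed seed, and the toy data are assumptions made only for illustration.

```python
import random


def balance_other(examples, seed=0):
    """Downsample 'other' to the combined size of all remaining classes."""
    labelled = [ex for ex in examples if ex["relation"] != "other"]
    other = [ex for ex in examples if ex["relation"] == "other"]
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    if len(other) > len(labelled):
        other = rng.sample(other, len(labelled))
    return labelled + other


# Toy data: 5 sentences with a targeted relation, 20 labelled 'other'.
data = (
    [{"id": i, "relation": "birthplace"} for i in range(3)]
    + [{"id": i, "relation": "occupation"} for i in range(3, 5)]
    + [{"id": i, "relation": "other"} for i in range(5, 25)]
)
balanced = balance_other(data)
```

After balancing, the toy set holds five other sentences alongside the five sentences carrying a targeted relation.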
If in future more relations are added to the dataset, it would be vital to ensure that these Other labelled sentences do not contain the new relations, since they could conceivably be anything.
We carried out multiple experiments to estimate the quality and usefulness of this dataset. First, we examined the effects of different processing approaches for the article texts. Next, we manually annotated a small subset of sentences to pinpoint potential problems and to create a gold-standard set for evaluation purposes. After re-running the compilation process, taking into account certain observations and minor processing improvements after manual annotation, we trained a number of state-of-the-art ML models using the training datasets, and evaluated their performance using the gold set.
4.1. Labelling Approaches
For the process of automatically labelling each entity pair with a corresponding relation, we work at the document and sentence levels of a relevant Wikipedia article. At the document level we carry out all the NLP processing, such as NER, and then split the article into its sentences, to process each sentence. However, we wanted to assess the effect of two further approaches of processing the articles: First, we wanted to see how well co-reference resolution performs on the Wikipedia texts, and whether it would yield more annotated sentences (Section 4.1.1). Next, we looked into addressing sentence diversity, by implementing an approach that skips the first sentence of an article (Section 4.1.2).
4.1.1. Coref Set
We hypothesise that replacing co-referential entity mentions will allow the matching algorithm to find more matches overall. This would be due to the fact that more names would be matched because of the increased presence. Detecting more names could then potentially lead to more relation matches overall. For this, we used spaCy’s built-in co-reference resolution capabilities to automatically replace entity mentions with the most probable entity. The matching step is carried out using the text where all the entities have been replaced.
Table 2 shows the number of relations found across each of the sets we compiled: normal, coref, which is described here, and skip, which is described in the next section. The last line of the table shows the total number of relations found per set.
If we compare the overall counts of the relations of the normal and coref sets, we observe a small increase. However, looking at the counts of the different relation types, we see that it is not a simple increase across the board. In fact, we see fewer matches in certain cases. Upon further inspection, we found that this was mainly due to the automatic replacement process producing illegible sentences through incorrect replacements. The two main problems were entities that are scrambled and sentences being unintelligible because every single entity mention was replaced with one main entity, that was often also too long.
The main problems are demonstrated in the two examples below. In the first sentence, an entity has been replaced many times, including an opening bracket. Cases like these were observed frequently, and with more characters added. These cases introduced matching errors in the set. In the second example, we see a nested replacement, which similarly causes matching problems.
Replaced: Born in <e1>Évreux</e1>, Eure, a great fan of Paris Saint-Germain Paris Saint-Germain since <e2>Bernard Mendy</e2> ( childhood, Bernard Mendy ( achieved Bernard Mendy ( ambitions in 2000 when Bernard Mendy ( joined PSG from SM Caen.
Original: Born in <e1>Évreux</e1>, Eure, a great fan of Paris Saint-Germain since his childhood, he achieved his ambitions in 2000 when he joined PSG from SM Caen.
Replaced: The hundreds of volumes contained Queen Victoria’s Queen <e1>Victoria</e1>’s’s personal views of […]
Original: The hundreds of volumes contained Queen Victoria’s personal views of […]
4.1.2. Skip Set
The skip set was compiled to study the effects of leaving out the first sentence of an article from Wikipedia. One problem with using Wikipedia texts stems from the structure of the first sentence of an article, as seen in the following example.
William Shakespeare (bapt. 26 April 1564 – 23 April 1616) was an English playwright, poet and actor, widely regarded as the greatest writer in the English language and the world’s greatest dramatist.
We see that the date of birth (and death) occur within parentheses after the name, in addition to the fact that the sentence usually contains a large amount of summarised information. This type of sentence structure (and content) is not only extremely frequent, but also quite specific to Wikipedia, suggesting that unnatural behaviour could be learned by a machine learning model. This was observed by Chisholm et al. (2017) who exploited this for their benefit. However, for this approach, we wanted to achieve as many natural matches as we could. Therefore, we compiled a dataset that follows the previously described methodology, but skips the first sentence of each article. The hypothesis is that this forces more matches elsewhere in the article, where more natural sentences occur.
Table 2 shows the total and individual counts for each relation, as referred to previously. We see that overall, the skip set has far fewer matches than the other two sets, and it never has the highest count in any category, although in some categories its numbers are comparable to those of the normal set. Regardless, some of the generally larger categories, such as birthplace and birthdate, are significantly smaller than in the other two sets, which points towards the identification being successful, as this information is extremely common in the first sentence. It is not always certain that this information will appear later on in an article, which leads to a smaller number of matches.
4.2. Manual Annotation
We assessed the quality of our semi-supervised datasets by means of manual annotation before using them to train machine learning models. This was important in order to find areas where the approach fails to match data accurately, where processing methods do not work, and any other similar problems. In addition, we needed a gold standard test set for benchmarking our neural models.
As pointed out in previous sections, we extracted 100 sentences per relation across the three datasets, equalling 3000 sentences in total that we manually annotated and refer to as the gold set. The data was annotated by two persons, one native English speaker and one non-native but fluent English speaker, both postgraduate students. For each sentence, the task was to look at the relation assigned by our matching algorithm and add the correct relation if it had been labelled incorrectly. We used one of the nine indicative labels where appropriate, and the other label if a different relation was expressed. Our annotation guideline was that a human should understand by reading the sentence which relation is expressed, regardless of prior knowledge. This is demonstrated by the following examples.
The first example shows a sentence that clearly mentions the occupation E2 of the entity E1. The second example shows an implicit relation. Although it is not directly stated, the word orphaned in relation to entity E2 with the statement that E1 died, implies that E1 is the parent of E2. In the final example, the algorithm labels the sentence as expressing the parent relation between the two entities. Although this may indeed be the case, and the annotator may have prior knowledge of this, or it has been expressed in a different sentence, it is not clearly stated in this sentence.
Explicit: <e1>Renate Künast</e1> (born 15 December 1955) is a German <e2>politician</e2> of Bündnis 90/Die Grünen.
Implicit: A few months later <e1>Apollo Korzeniowski</e1> died, leaving <e2>Conrad</e2> orphaned at the age of eleven.
Unclear: Thus, <e1>Janaka</e1> tries to find the best husband for <e2>Sita</e2>.
The Cohen’s Kappa for the inter-annotator agreement is 0.908 which indicates a very high agreement between our annotators. The annotations allowed us to make a number of observations. First, we notice that two very similar relations work very differently. While birthplace works extremely well across sets, deathplace does not. Upon further examination, we found that the first mention of the place where someone died often was also the place where a person lived. In future, cases like these may warrant a different approach to processing by our algorithm, but for now we leave it unchanged. Second, we observed that many relations in the coref set were incoherent and probably incorrect, due to imprecise replacements by the coreference resolution algorithm.
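For readers unfamiliar with the agreement statistic, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences; the four toy labels are hypothetical and are not taken from the gold set.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations.
annotator_1 = ["birthplace", "birthplace", "other", "other"]
annotator_2 = ["birthplace", "other", "other", "other"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, giving a Kappa of 0.5.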
While Wikidata as a source does work quite well, categories can sometimes be ambiguous, such as the educatedAt and parent relations. Here, we observed that the Wikidata entries contained information at odds with our interpretation of the type of entry, such as educatedAt containing a university that is the place of work, or parent containing a person that the target entry is a parent of rather than a child of. Since this did not occur often in our manual evaluation, we did not implement a strategy to solve this problem.
Finally, we found a number of simple processing errors that we solved by improving our regular expressions for text cleaning. We also adjusted the matching procedure for the occupation relation, to avoid matches where the occupation mentioned belonged to a different entity. This leads to a slightly smaller number of relations overall, with a detailed overview shown in Table 3.
Overall, we have formed the following impressions for each set. The normal approach works well, while not offering a very diverse set of sentences. As alluded to earlier, it is clear that this approach matches mainly the standard Wikipedia first sentence, as described in previous sections. The coref set, while seemingly the largest set, must also include the most unusable sentences and bad examples. During the course of evaluating the sentences we found this set to be imprecise, not explicit and difficult to understand due to bad replacements. Finally, we found the skip set to be very mixed in terms of success. While for some relations it seems that none of the matching has returned usable results, other relations seem to have worked very well, offering in addition a wide variety of different sentences demonstrating the desired effects.
In order to determine the performance of the matching algorithm, we present the evaluation metrics for the gold set. For this, we compared the labels produced by the automatic matching algorithm to our manually produced labels. We removed 100 sentences from the gold set that contained processing errors caused by conversion to plain text, automatic replacement of coreferences and spaCy tagging errors. Since these would all have been annotated as “Other”, keeping them could have caused an imbalanced test set. Table 4 shows the results of the evaluation for each set. We observe that most of the matches found are correct, as indicated by high precision and recall scores. However, the problem with deathplace that we observed during the evaluation is confirmed here. In addition, recall drops significantly for the Other class, mainly because this class grew during annotation to include sentences that the matching algorithm had classified incorrectly.
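The comparison described above can be sketched as a per-class precision and recall computation over pairs of gold and automatically matched labels. This is a minimal illustration only: the function name and the toy labels are assumptions and do not reproduce our evaluation script or the figures in Table 4.

```python
def per_class_scores(gold, predicted):
    """Precision and recall per label, comparing predicted labels to gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        scores[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return scores


# Toy labels, invented for illustration only.
gold = ["birthplace", "deathplace", "Other", "Other", "birthplace", "Other"]
matched = ["birthplace", "deathplace", "deathplace", "Other", "birthplace", "birthplace"]
scores = per_class_scores(gold, matched)
```

In this toy example, two sentences whose gold label is Other were assigned a relation by the matcher, which lowers the recall of Other while also lowering the precision of the relations they were assigned to.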
4.3. Neural Models
The machine learning model we used to perform relation classification is based on transformers. Since their introduction, transformer models have shown excellent results in various NLP tasks (Devlin et al., 2019) such as text classification (Ranasinghe and Zampieri, 2020), NER (Jia et al., 2020) and question answering (Yang et al., 2019), as well as RE (Yamada et al., 2020; Joshi et al., 2020; Wu and He, 2019; Alt et al., 2019). In this research, we utilised the architecture introduced by Baldini Soares et al. (2019).
The input to the transformer models is the sentence with “[E1]” and “[E2]” markers marking the positions of their respective entities. The output hidden states of the transformer at the “[E1]” and “[E2]” token positions are then concatenated to form the final output representation of the relationship. Finally, a linear classifier is stacked on top of this output representation. The architecture is visualised in Figure 3.
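The flow of this architecture can be sketched without a deep learning library. The example below is an illustration under stated assumptions: a single marker is placed directly before each entity, the hidden states are small made-up vectors standing in for transformer outputs, and all function names and weights are hypothetical.

```python
def add_entity_markers(tokens, e1_start, e2_start):
    """Insert "[E1]" and "[E2]" before the tokens at the given indices."""
    marked = []
    for i, token in enumerate(tokens):
        if i == e1_start:
            marked.append("[E1]")
        if i == e2_start:
            marked.append("[E2]")
        marked.append(token)
    return marked


def relation_representation(marked_tokens, hidden_states):
    """Concatenate the hidden states found at the two marker positions."""
    i1 = marked_tokens.index("[E1]")
    i2 = marked_tokens.index("[E2]")
    return hidden_states[i1] + hidden_states[i2]  # list concatenation


def linear_classifier(features, weights, bias):
    """One logit per relation label: weights @ features + bias."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


tokens = ["William", "Shakespeare", "was", "born", "in", "Warwickshire"]
marked = add_entity_markers(tokens, e1_start=0, e2_start=5)

# Placeholder 2-dimensional hidden states, one per marked token (not model output).
hidden_states = [[float(i), float(i) + 0.5] for i in range(len(marked))]
features = relation_representation(marked, hidden_states)

# Hypothetical classifier weights for two relation labels.
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
logits = linear_classifier(features, weights, bias)
```

In a real model the hidden states would come from the transformer encoder and the classifier weights would be learned during fine-tuning.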
We fine-tune all the parameters of the transformer as well as the linear classifier jointly by maximising the log-probability of the correct label. For all the experiments we optimised parameters (with AdamW) using a learning rate of , a maximum sequence length of , and a batch size of samples. The models were trained using a GB RTX 3090 GPU over five epochs. As the pre-trained transformer model, we used the bert-base-uncased model available in HuggingFace (Wolf et al., 2020).
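Maximising the log-probability of the correct label is equivalent to minimising the average negative log-likelihood under a softmax over the classifier logits. The following minimal sketch shows this objective on made-up logits; the function names are hypothetical, and the optimiser step itself is not shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def negative_log_likelihood(batch_logits, batch_labels):
    """Average negative log-probability of the correct label over a batch."""
    total = 0.0
    for logits, label in zip(batch_logits, batch_labels):
        total -= log_softmax(logits)[label]
    return total / len(batch_labels)


# Made-up logits for two sentences and three relation labels.
batch_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
batch_labels = [0, 2]
loss = negative_log_likelihood(batch_logits, batch_labels)
```

The loss decreases as the logit of the correct label grows relative to the others, which is what fine-tuning the transformer and the linear classifier jointly aims to achieve.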
For training the BERT-based classifier, we used each of the three sets separately, as well as a combination of the three sets that we refer to as all, from which we removed any duplicates introduced by the combination. We did not focus on producing the best possible results, but rather on indicating whether the produced dataset is suitable for training a model. Table 5 shows the evaluation results of the models trained on the four different sets. While the results of the matching algorithm are largely echoed, we observe that some other relations, including hasChild, ofParent and sibling, score quite low in terms of recall. Compared with the others, these relations have quite low counts per set (see Table 3), which possibly explains the results.
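Building the all set can be sketched as an order-preserving union of the three sets. The sketch below assumes that a duplicate is an example with identical sentence, entity pair and relation label; this criterion, the function name and the toy examples are assumptions made for illustration.

```python
def combine_sets(*example_sets):
    """Merge example lists, keeping the first occurrence of each example."""
    seen = set()
    combined = []
    for examples in example_sets:
        for example in examples:
            # Each example is a hashable tuple:
            # (sentence, entity_1, entity_2, relation).
            if example not in seen:
                seen.add(example)
                combined.append(example)
    return combined


# Toy examples, invented for illustration only.
normal = [("A was born in B.", "A", "B", "birthplace")]
coref = [("A was born in B.", "A", "B", "birthplace"),
         ("A died in C.", "A", "C", "deathplace")]
skip = [("A studied at D.", "A", "D", "educatedAt")]

all_set = combine_sets(normal, coref, skip)
```

Here the sentence shared by the normal and coref sets is kept only once, so the combined set contains three examples rather than four.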
5. Proposed Application
The availability of compiled datasets for historical research is more important than ever. While NLP methods in domains such as biomedicine and news continue to expand greatly, smaller areas of research such as specific historical (biographical) research inherently lack these opportunities. Being able to compile datasets to train neural extraction models with relative ease, as described here, is crucial for future research. In this section, we illustrate this with the example of the Army List in the United Kingdom, a study that we plan to embark on in the coming months.
The Army List (25) is a biographical compendium of officers serving in the British Army. It was first published in 1840 and volumes were subsequently published annually, although this varied during wartime. Each volume lists the name and rank of every serving officer in the British Army, along with important biographical details including length of service, past roles, and current position held. The Army List is an essential starting point for any research about the careers of military officers in the period.
Despite its importance to historical research, the Army List can prove difficult to access. Copies are held by a handful of specialist archives in the United Kingdom and there has been no systematic attempt to digitise them or apply data processing to them. Each Army List contains a wealth of information that invites cross-referencing and comparison to learn more about professional and social links amongst the officer class. However, the sheer number of biographical entries, amounting to several thousand per volume, made this an impossible task for historians in the pre-digital age. Digital processing offers a solution to this problem and opens the possibility of being able to map connections in new and illuminating ways. For example, it would allow the identification of professional networks based on age, shared roles, unit associations, and overseas service. A dataset based upon it would be of enormous value to historians, and it would open exciting new avenues for research and would contribute to ongoing historiographical debates on the professional bonds of the officer class.
To enable the kind of research described above, there is clearly a need for datasets like Biographical so that systems can be trained to extract large amounts of information quickly and efficiently. The dataset we present here could be used directly, and new datasets could also be compiled with the method we describe. Both the dataset and the method therefore present significant opportunities for application, enabling research in under-resourced areas.
We have presented Biographical, a semi-supervised relation extraction dataset, and described its compilation process in detail. Furthermore, we carried out a number of experiments to understand the dataset better. These included different processing approaches, a manual annotation task and the training of different neural models. Not only have these experiments investigated different ways of optimising the compilation of the dataset for different goals, they have also validated the results in terms of machine learning.
In more general terms, this work marks an exciting first step at applying data processing to historical documentation. Archival digitisation in the United Kingdom and other countries remains hesitant and inconsistent, and there has been very little data processing of that which is available. The application of more computational resources to mine the data would be of immense value to historians and those working in related fields.
In the future, we would like to address a number of different aspects concerning this dataset. First, we will focus on optimising the compilation process to obtain even more precise results. Next, we would like to extend the number of relations, and demonstrate how simple this can be. As mentioned in the previous section, we also intend to test this approach on real-world texts in collaboration with historians.
- Improving relation extraction by pre-trained language representations. In Automated Knowledge Base Construction (AKBC).
- Matching the blanks: distributional similarity for relation learning. In Proceedings of ACL 2019, Florence, Italy, pp. 2895–2905.
- An unsupervised approach to biography production using Wikipedia. In Proceedings of ACL 2008: HLT, pp. 807–815.
- Learning to generate one-sentence biographies from Wikidata. CoRR abs/1702.06235.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL 2019: HLT, Minneapolis, Minnesota, pp. 4171–4186.
- Improved relation extraction with feature-rich compositional embedding models. In Proceedings of EMNLP 2015, Lisbon, Portugal, pp. 1774–1784.
- SemEval-2010 task 8: multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 33–38.
- Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL 2011: HLT, Portland, Oregon, USA, pp. 541–550.
- Unsupervised biographical event extraction using Wikipedia traffic. In Proceedings of the Australasian Language Technology Association Workshop 2014, pp. 41–49.
- Entity enhanced BERT pre-training for Chinese NER. In Proceedings of EMNLP 2020, Online, pp. 6384–6396.
- Information extraction from text. In Mining Text Data, C. C. Aggarwal and C. Zhai (Eds.), pp. 11–41.
- SpanBERT: improving pre-training by representing and predicting spans. Transactions of ACL 8, pp. 64–77.
- Exploiting rich syntactic information for relation extraction from biomedical articles. In Proceedings of NAACL 2007: HLT, NAACL-Short ’07, USA, pp. 97–100.
- End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of ACL 2016, Berlin, Germany, pp. 1105–1116.
- Named entity recognition and relation extraction: state-of-the-art. ACM Comput. Surv. 54 (1).
- Effective modeling of encoder-decoder architecture for joint entity and relation extraction. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 8528–8535.
- An introduction to DUC-2004. National Institute of Standards and Technology.
- Recognizing biographical sections in Wikipedia. In Proceedings of EMNLP 2015, Lisbon, Portugal, pp. 811–816.
- Large-scale Data Harvesting for Biographical Data. In Proceedings of (BD-2019).
- Multilingual offensive language identification with cross-lingual embeddings. In Proceedings of EMNLP 2020, Online, pp. 5838–5844.
- Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, J. L. Balcázar, F. Bonchi, A. Gionis, and M. Sebag (Eds.), Berlin, Heidelberg, pp. 148–163.
- Attention-based convolutional neural network for semantic relation extraction. In Proceedings of COLING 2016: Technical Papers, Osaka, Japan, pp. 2526–2536.
- Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature. J Am Med Inform Assoc 23 (4), pp. 766–772.
- Relation extraction using distant supervision: a survey. ACM Comput. Surv. 51 (5).
- (1913) The quarterly army list for the quarter ending April 1914. His Majesty’s Stationery Office, London.
- Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 384–394.
- Wikidata: a free collaborative knowledgebase. Communications of the ACM 57 (10), pp. 78–85.
- Relation extraction: a brief survey on deep neural network based methods. In 2021 The 4th International Conference on Software Engineering and Information Management, ICSIM 2021, New York, NY, USA, pp. 220–228.
- Transformers: state-of-the-art natural language processing. In Proceedings of EMNLP 2020: System Demonstrations, Online, pp. 38–45.
- Enriching pre-trained language model with entity information for relation classification. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, New York, NY, USA, pp. 2361–2364.
- Semantic relation classification via hierarchical recurrent neural network with attention. In Proceedings of COLING 2016: Technical Papers, Osaka, Japan, pp. 1254–1263.
- GDPNet: refining latent multi-view graph for relation extraction. Proceedings of the AAAI Conference on Artificial Intelligence 35 (16), pp. 14194–14202.
- Fine-tuning BERT for joint entity and relation extraction in Chinese medical text. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 892–897.
- LUKE: deep contextualized entity representations with entity-aware self-attention. In Proceedings of EMNLP 2020, Online, pp. 6442–6454.
- End-to-end open-domain question answering with BERTserini. In Proceedings of NAACL 2019, Minneapolis, Minnesota, pp. 72–77.
- Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific Data 3 (1), pp. 1–16.
- Relation classification via convolutional deep neural network. In Proceedings of COLING 2014: Technical Papers, Dublin, Ireland, pp. 2335–2344.
- Position-aware attention and supervised data improve slot filling. In Proceedings of EMNLP 2017, Copenhagen, Denmark, pp. 35–45.
- Multi-document biography summarization. In Proceedings of EMNLP 2004, Barcelona, Spain, pp. 434–441.
- Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of ACL 2016, Berlin, Germany, pp. 207–212.