Extracting, integrating, and matching instances of named entities that refer to the same real-world objects remains a major challenge in Natural Language Processing and the Semantic Web. Nevertheless, this task is necessary to achieve understanding and integrity of various resources. One of the most basic and common categories of named entities, appearing in almost every type of text, is people. Named entity recognition (NER) models provide annotations that are undifferentiated, as every mention is tagged with the general label person. To analyze text at a deeper level or to find shared factors between texts, we need to distinguish between the recognized person entities. The most desirable solution is a unique identifier for each person appearing in the considered dataset. This identifier can be the person’s full name. Ideally, each mention of a person in the dataset should be linked with a tag containing the full name of the corresponding person.
To the best of our knowledge, no available tool combines the recognition and identification of persons across domains. NER models are available, particularly for news and common-language texts. Search engines can look up particular names and match distinct names on each query. However, recognizing mentions of specific people in an arbitrary text and matching them with named entities using a single tool remains complicated.
The general workflow of the created tool – called protagonistTagger – is presented in Figure 1. It works automatically, taking as inputs a list of people’s full names and a text to be annotated. The process of person entity linkage implemented in protagonistTagger is divided into two main phases:
named entity recognition (NER) of mentions of people in a text.
named entity disambiguation (NED) providing links between the entities recognized in unstructured text and identifiers referring to real-world objects.
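The two phases listed above can be sketched as a small pipeline. This is only an illustration of the data flow, not the tool's implementation: the NER step is replaced by a regular expression over capitalised words (the tool uses a trained NER model), the NED step by exact token overlap, and all names (`Annotation`, `recognize_person_mentions`, `disambiguate`, `tag_text`) are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Annotation(NamedTuple):
    start: int                 # character offset where the mention starts
    end: int                   # character offset where the mention ends
    mention: str               # surface form found in the text
    full_name: Optional[str]   # linked identifier, or None if unresolved


def recognize_person_mentions(text: str) -> List[Tuple[int, int, str]]:
    """Phase 1 (NER) stand-in: runs of capitalised words.

    Assumption for illustration only; this is NOT a trained NER model.
    """
    pattern = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def disambiguate(mention: str, full_names: List[str]) -> Optional[str]:
    """Phase 2 (NED) stand-in: link only if exactly one full name shares a token."""
    tokens = set(mention.split())
    candidates = [name for name in full_names if tokens & set(name.split())]
    return candidates[0] if len(candidates) == 1 else None


def tag_text(text: str, full_names: List[str]) -> List[Annotation]:
    """Run both phases: find mentions, then link each one to a full name."""
    return [
        Annotation(start, end, mention, disambiguate(mention, full_names))
        for start, end, mention in recognize_person_mentions(text)
    ]
```

Given the names "Elizabeth Bennet", "Fitzwilliam Darcy", and "Jane Bennet", the mention "Darcy" is linked to a single full name, while a bare "Bennet" stays unresolved in this sketch because two candidates share that surname.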
2 ProtagonistTagger: Original English Literature Use Case
The protagonistTagger tool, benchmark datasets from the literary domain, and the annotated corpus are available at https://zenodo.org/record/4699418
The idea of creating a named entity linkage tool dedicated to novels arose from the complexity of texts in the literary domain and the weak performance of standard methods on such texts. The novel is a particular type of text in terms of writing style, the links between sentences, the plot’s complexity, the number of characters, etc.
The NER phase of the linkage process in the literary domain uses a pretrained standard NER model fine-tuned on literary data annotated semi-automatically with the general tag person. Fine-tuning was necessary because standard models, trained primarily on web data such as blogs, news, and comments, perform relatively poorly on literary texts [1, 4]. In the NER phase, we want to find as many potential person entities as possible to be matched in the NED phase, i.e., to achieve the highest possible recall. The NED phase aims at linking each mention of a protagonist in a given text with the proper tag (i.e., the full name of this protagonist), given a list of protagonists’ full names predefined for each novel. The matching method is based mainly on approximate text matching. The algorithm also incorporates a set of hand-crafted rules (lexico-semantic patterns) for distinguishing between different instances of the concept person, and it uses several external dictionaries. In this way, similarities between entities are considered on both the semantic and the syntactic level. The protagonistTagger handles diminutives of basic name forms and disambiguates people who share a surname by analyzing the personal title preceding the surname in the text.
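A minimal sketch of how such a matching step might combine the three ingredients described above: a diminutive dictionary, a title rule, and approximate string matching. The dictionaries, the gender-based title rule, the scoring scheme, and the 0.8 threshold are all assumptions made for illustration; the tool's actual rules and resources may differ. Similarity is computed with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

# Hypothetical resources, for illustration only.
DIMINUTIVES = {"Lizzy": "Elizabeth", "Tom": "Thomas"}
TITLE_GENDER = {"Mr.": "male", "Mrs.": "female", "Miss": "female"}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_mention(mention: str, protagonists: Dict[str, str],
                 threshold: float = 0.8) -> Optional[str]:
    """Link a mention to one full name, or return None if unsure.

    `protagonists` maps each full name to a gender label (an assumed
    resource that lets a personal title narrow down the candidates).
    """
    tokens = mention.split()
    title_gender = None
    if tokens and tokens[0] in TITLE_GENDER:
        title_gender = TITLE_GENDER[tokens[0]]
        tokens = tokens[1:]
    if not tokens:
        return None
    # Replace diminutives with the basic form of the name.
    tokens = [DIMINUTIVES.get(token, token) for token in tokens]

    scores = {}
    for full_name, gender in protagonists.items():
        if title_gender is not None and gender != title_gender:
            continue  # title rule: skip candidates the title excludes
        name_tokens = full_name.split()
        # Average, over mention tokens, of the best match in the full name.
        scores[full_name] = sum(
            max(similarity(token, part) for part in name_tokens)
            for token in tokens
        ) / len(tokens)

    if not scores:
        return None
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    # Refuse to link when the match is weak or ambiguous.
    if best < threshold or len(winners) != 1:
        return None
    return winners[0]
```

With the candidates "Elizabeth Bennet" (female) and "Thomas Bennet" (male), a bare "Bennet" is ambiguous and stays unlinked in this sketch, whereas "Mr. Bennet" and "Miss Bennet" each resolve to a single candidate through the title rule.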
The tool was used to prepare a corpus of 13 novels, comprising more than 50,000 sentences and more than 35,000 annotations of literary characters. The corpus was used for the analysis of sentiment and relationships in novels. The available annotations of literary characters made this analysis much more manageable and precise.
3 ProtagonistTagger: Internet News Use Case
The adapted tool and the new datasets are available at https://zenodo.org/record/5060232
To verify the usability of the created tool in a brand new domain, we investigated its performance on a dataset of internet news written in Polish. The protagonistTagger turned out to be universal and was easily adapted to the new data. The only required modification was changing the language model from English to Polish. In the NER phase, we skipped fine-tuning the standard NER model, because it performs relatively well on news compared with literary texts. Since external resources, such as dictionaries and syntactic rules, are language-dependent, they were disabled in this experiment. The NED phase therefore relied primarily on the approximate string matching part of the matching algorithm.
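The differences between the two setups described above can be summarised as a small configuration sketch. The class `TaggerConfig`, its fields, and the step names are hypothetical and only mirror the prose; they are not the tool's actual configuration interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaggerConfig:
    """Hypothetical settings mirroring the choices described in the text."""
    language: str            # language of the underlying NER model
    fine_tune_ner: bool      # fine-tune the standard NER model on domain data?
    use_dictionaries: bool   # language-dependent external dictionaries
    use_syntactic_rules: bool  # language-dependent hand-crafted rules


# English novels: fine-tuned NER plus all NED resources.
LITERARY_EN = TaggerConfig("en", fine_tune_ner=True,
                           use_dictionaries=True, use_syntactic_rules=True)
# Polish news: standard NER model, language-dependent resources disabled.
NEWS_PL = TaggerConfig("pl", fine_tune_ner=False,
                       use_dictionaries=False, use_syntactic_rules=False)


def enabled_ned_steps(config: TaggerConfig) -> List[str]:
    """List the NED steps that remain active under a given configuration."""
    steps = []
    if config.use_syntactic_rules:
        steps.append("syntactic rules")
    if config.use_dictionaries:
        steps.append("dictionaries")
    steps.append("approximate string matching")  # always available
    return steps
```

Under these assumptions, `enabled_ned_steps(NEWS_PL)` leaves only approximate string matching, which matches the reduced NED phase used for the Polish news experiment.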
4 Evaluation and Datasets
The testing sets for the literary domain contain sentences chosen randomly from 13 novels differing in style and genre: the large set draws on 10 novels and the small set on 3 distinct novels. Altogether, the testing sets used for protagonistTagger contain 1,300 sentences (100 sentences from each novel), annotated manually with the general tag person for NER testing (see Table 1) and with the full names of the mentioned people for testing NED and the entire tool.
| Testing set/NER model | Precision | Recall | F-measure | Mentions |
|---|---|---|---|---|
| Novels - Test_large_names | 0.88 | 0.87 | 0.87 | |
| Novels - Test_small_names | 0.83 | 0.83 | 0.83 | |
The protagonistTagger tool achieves high results on both testing sets, with precision and recall of at least 83% (see Table 2). Its performance on Test_small_names shows that it can be successfully applied to new, distinct novels to create a larger corpus of annotated texts. The diversity of the novels in the test sets is reflected in the tool’s precision, which varies from 79% to 96% across individual novels. Even though the performance is tested on varied texts, the precision of the annotations remains high, which supports the applicability of the proposed method in the literary domain.
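For reference, precision, recall, and F-measure of the kind reported above can be computed from aligned gold and predicted tags as in the sketch below. The evaluation protocol is an assumption made for illustration: each mention carries one gold tag and one predicted tag, `None` means "no tag", and a prediction counts as correct only when it equals the gold tag. The function name `precision_recall_f1` is hypothetical.

```python
from typing import List, Optional, Tuple


def precision_recall_f1(gold: List[Optional[str]],
                        predicted: List[Optional[str]]
                        ) -> Tuple[float, float, float]:
    """Compute precision, recall, and F-measure over aligned mention tags."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned")
    # Correct: a tag was predicted and it equals the gold tag.
    correct = sum(1 for g, p in zip(gold, predicted)
                  if p is not None and g == p)
    n_predicted = sum(1 for p in predicted if p is not None)
    n_gold = sum(1 for g in gold if g is not None)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with four gold tags, three predictions, and two of those predictions correct, precision is 2/3, recall is 1/2, and the F-measure is 4/7.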
The new dataset of internet news written in Polish contains around 1,000 sentences. It is annotated with 100 identifiers (full names of popular Polish individuals, e.g., politicians, actors, and researchers). The overall quality of the annotations is high (both recall and precision above 78%), even though we applied only a few simple rules and no additional resources (e.g., dictionaries) in the NED phase.
In this paper, we propose a method and a tool for person entity linkage in various domains – protagonistTagger. We also gathered datasets that capture the problem of matching individuals with their mentions in text. The method uses pretrained NER models and various techniques for NED. The initial tests were performed in the English literary domain. The most recent experiments demonstrate the adaptability and effectiveness of the proposed approach in another language and domain, namely Polish internet news. The only precondition for using the tool is access to predefined tags specifying the full names (persons’ identifiers) to be linked with the named entities appearing in a text. A promising direction for future applications is annotating texts from social media and using these annotations to investigate opinions and analyze sentiment.
[1] (2016) Evaluating and combining name entity recognition systems. In 6th Named Entity Workshop, pp. 21–27.
[2] Construction of machine-labeled data for improving named entity recognition by transfer learning. IEEE Access 8, pp. 59684–59693.
[3] (2020) Categorization of persons based on their mentions in Polish news texts. JAMRIS 14.
[4] (2019) A replicable comparison study of NER software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate. In 6th SNAMS, pp. 338–343.
[5] (2015) Mr. Bennet, his coachman, and the Archbishop walk into a bar but only one of them gets recognized: on the difficulty of detecting characters in literary texts. In EMNLP, pp. 769–774.