
ProtagonistTagger – a Tool for Entity Linkage of Persons in Texts from Various Languages and Domains

by Weronika Łajewska, et al.
Politechnika Warszawska

Named entity recognition (NER) and named entity disambiguation (NED) can add semantic context to the named entities recognized in texts. Named entity linkage, regardless of domain, provides links between entities mentioned in unstructured texts and individual instances of real-world objects. In this poster, we present a tool, protagonistTagger, for person NER and NED in texts. The tool was tested on texts extracted from classic English novels and Polish internet news. The tool's performance (both precision and recall) exceeds 78% on these datasets.



1 Introduction

Extracting, integrating, and matching instances of named entities that refer to the same real-world objects remains a major challenge in Natural Language Processing and the Semantic Web. Nevertheless, this task is necessary for understanding and integrating various resources. One of the most basic and standard categories of named entities, appearing in almost every type of text, refers to people. Named entity recognition (NER) models provide annotations that are undifferentiated, as they are all tagged with the general label person.

In order to analyze text at deeper levels or find shared factors between texts, we need to distinguish between the recognized person named entities. The most desirable solution is a unique identifier for each person appearing in the considered dataset; this identifier can be the full name of the recognized person. Ideally, each mention of a person in the dataset should be linked with a tag containing the full name of the corresponding person.

To the best of our knowledge, no available tools combine the recognition and identification of persons across domains. NER models are available, particularly for news and common-language texts. Search engines can search for particular names and match distinct names each time we query them. However, recognizing and matching mentions of specific people with named entities in an arbitrary text using a single tool remains difficult.

Figure 1: Simplified process employed in our tool – protagonistTagger – with examples.

The general workflow of the created tool – called protagonistTagger – is presented in Figure 1. It works automatically, taking as input a list of people's full names and a text to be annotated. The process of person entity linkage implemented in protagonistTagger is divided into two main phases:

  1. named entity recognition (NER) of mentions of people in a text;

  2. named entity disambiguation (NED), providing links between the entities recognized in unstructured text and identifiers referring to real-world objects.
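The two phases above can be sketched in a few lines of Python. This is a minimal illustration, not the tool itself: ner_mentions is a toy stand-in (a capitalized-word regex) for the fine-tuned NER model, and ned_link implements only the approximate-string-matching core of NED; the names and the cutoff threshold are illustrative assumptions.

```python
import difflib
import re

def ner_mentions(text):
    # Toy stand-in for the NER phase: runs of capitalized words.
    # The actual tool uses a (fine-tuned) pretrained NER model.
    return re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b", text)

def ned_link(mention, full_names, cutoff=0.6):
    # NED phase sketch: approximate string matching of the mention
    # against each full name and each of its parts (first name, surname).
    best, best_score = None, cutoff
    for name in full_names:
        for candidate in [name] + name.split():
            score = difflib.SequenceMatcher(
                None, mention.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = name, score
    return best

def annotate(text, full_names):
    # Link every recognized mention to its best-matching identifier.
    return {m: ned_link(m, full_names) for m in ner_mentions(text)}
```

For example, annotating "Darcy met Elizabeth at the ball." against the list ["Fitzwilliam Darcy", "Elizabeth Bennet"] links each surname or first-name mention to the corresponding full name.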

2 ProtagonistTagger: Original English Literature Use Case

(The protagonistTagger tool, benchmark datasets from the literary domain, and the annotated corpus are available online.)

The idea of creating a tool for named entity linkage dedicated to novels arose from the complexity of texts in the literary domain and the weak performance of standard methods on such complex texts [5]. The novel is a particular type of text in terms of writing style, the links between sentences, the plot's complexity, the number of characters, etc.

The NER phase of the linkage process in the literary domain uses a pretrained standard NER model fine-tuned with data from the literary domain, annotated with the general tag person in a semi-automatic way [2]. This was necessary due to the relatively low performance of standard models trained primarily on web data such as blogs, news, and comments [1, 4]. In the NER phase, we want to find as many potential person entities as possible to be matched in the NED phase (i.e., achieve the highest possible recall). The NED phase aims at linking each mention of a protagonist in a given text with a proper tag (i.e., the full name of this protagonist), given the list of protagonists' proper names predefined for each novel. The matching method is based mainly on approximate text matching. The algorithm also incorporates a set of hand-crafted rules (lexico-semantic patterns) for distinguishing between different instances of the concept person, and it uses several external dictionaries. This way, entities' similarities are considered on both the semantic and syntactic levels. The protagonistTagger addresses the problem of diminutives of basic forms of names and the disambiguation of people with the same surname (by analyzing the personal title preceding a surname in a text).
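The diminutive handling and title-based surname disambiguation can be illustrated as follows. The dictionary entries and the gender attribute below are toy examples standing in for the tool's external dictionaries and lexico-semantic patterns, not its actual resources.

```python
# Illustrative diminutive dictionary and title rules; the actual tool
# relies on larger external, language-dependent dictionaries.
DIMINUTIVES = {"Lizzy": "Elizabeth", "Kitty": "Catherine"}
TITLES = {"Mr.": "male", "Mrs.": "female", "Miss": "female"}

def disambiguate(mention, preceding_title, protagonists):
    # protagonists: list of (full_name, gender) pairs; the gender
    # attribute is an assumption made for this sketch.
    base = DIMINUTIVES.get(mention, mention)  # map diminutive to basic form
    candidates = [n for n, _ in protagonists if base in n]
    if len(candidates) > 1 and preceding_title in TITLES:
        # Use the personal title preceding the surname to split
        # people who share a surname.
        wanted = TITLES[preceding_title]
        candidates = [n for n, g in protagonists if base in n and g == wanted]
    return candidates[0] if candidates else None
```

With protagonists [("Mr. Bennet", "male"), ("Mrs. Bennet", "female")], the bare surname "Bennet" is resolved to the correct person by the title found before it in the text.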

The tool was used to prepare a corpus of 13 novels (altogether more than 50,000 sentences) and more than 35,000 annotations of literary characters. The corpus was used for the analysis of sentiment and relationships in novels. The performed analysis was much more manageable and precise, thanks to the available annotations of literary characters.

3 ProtagonistTagger: Internet News Use Case

(The adapted tool, along with the new datasets, is available online.)

In order to verify the usability of the created tool in a brand-new domain, we investigated its performance on a dataset of internet news written in Polish [3]. The protagonistTagger turned out to be universal and was easily adapted to the new data. The only required modification was changing the language model from English to Polish. In the NER phase, we skipped the step of fine-tuning the standard NER model, owing to the relatively high performance of the standard NER model on news compared with literary texts. Since external resources, such as dictionaries and syntactic rules, are language-dependent, they were disabled in this experiment. The NED phase was based primarily on the approximate string matching part of the matching algorithm.
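The adaptation described above amounts to a small configuration change. The sketch below is illustrative: only the two spaCy model names are taken from the paper; the remaining configuration keys are hypothetical names invented for this example.

```python
def build_pipeline_config(language):
    # spaCy model names as distributed by spaCy; the rest of the keys
    # are hypothetical, mirroring the adaptation described in the text.
    models = {"en": "en_core_web_sm", "pl": "pl_core_news_sm"}
    return {
        "ner_model": models[language],
        "fine_tune_ner": language == "en",     # skipped for Polish news
        "use_dictionaries": language == "en",  # language-dependent resources disabled
        "ned": "approximate_string_matching",
    }
```

Switching the tool to Polish news thus reduces to build_pipeline_config("pl"): the NER model changes, fine-tuning is skipped, and NED falls back to approximate string matching alone.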

4 Evaluation and Datasets

Testing sets for the literary domain contain sentences chosen randomly from 13 novels differing in style and genre (the large testing set draws on 10 novels; the small one, on 3 distinct novels). The testing sets used for protagonistTagger contain altogether 1,300 sentences (100 sentences from each novel), annotated manually with the general tag person for NER testing (see Table 1) and with the full names of the mentioned people for NED testing, as well as for testing the entire tool.

Testing set        NER model   Precision  Recall  F-measure  Mentions
Test_large_person  standard    0.84       0.80    0.82       1021
                   fine-tuned  0.77       0.99    0.87       1021
Test_small_person  standard    0.78       0.79    0.78       273
                   fine-tuned  0.69       0.95    0.80       273
Internet news      standard    0.84       0.94    0.89       324

Table 1: Metrics computed for the standard NER models (spaCy's en_core_web_sm and pl_core_news_sm) and the fine-tuned NER model, for annotations with the general label person.
Testing set                Precision  Recall  F-measure
Novels – Test_large_names  0.88       0.87    0.87
Novels – Test_small_names  0.83       0.83    0.83
Internet news              0.80       0.78    0.78

Table 2: Performance of the protagonistTagger on various datasets.
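The metrics in both tables can be computed from gold and predicted annotation sets in the standard way. A minimal sketch follows; representing each annotation as a (sentence_id, mention, full_name) triple is an assumption of this example, not the paper's data format.

```python
def precision_recall_f(gold, predicted):
    # gold / predicted: sets of (sentence_id, mention, full_name) triples.
    tp = len(gold & predicted)  # annotations that match exactly
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

For instance, if one of two predicted links carries the wrong full name, both precision and recall drop to 0.5.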

The protagonistTagger tool achieves strong results on all the tested novels – precision and recall above 83% (see Table 2). The tool's performance on Test_small_names shows that it can successfully be applied to new, distinct novels to create a larger corpus of annotated texts. The best evidence of the novels' diversity in the test sets is that the tool's precision varies from 79% to as much as 96% across different novels. Even though the performance is tested on various texts, the precision of the annotations remains high, proving the applicability of the proposed method in the literary domain.

The new dataset of internet news written in Polish contains around 1,000 sentences [3]. It is annotated with 100 identifiers (full names of popular Polish individuals, e.g., politicians, actors, and researchers). The overall quality of the annotations is high (both recall and precision above 78%), even though we applied only a few simple rules and no additional resources (e.g., dictionaries) in the NED phase.

5 Conclusions

In this paper, we propose a method and a tool for person entity linkage in various domains – protagonistTagger. We also gathered datasets that capture the problem of matching individuals with their mentions in text. The method uses pretrained NER models and various techniques for NED. The initial tests were performed in the English literary domain. The most recent experiments prove the adaptability and effectiveness of the proposed approach in another language and domain, that of internet news. The only precondition for using the tool is access to predefined tags defining the full names (persons' identifiers) to be linked with named entities appearing in a text. A fascinating field of future applications is annotating texts from social media and using these annotations to investigate human opinions and analyze sentiment.


  • [1] R. Jiang and R. E. Banchs (2016) Evaluating and combining name entity recognition systems. In Proceedings of the Sixth Named Entity Workshop, pp. 21–27.
  • [2] J. Kim, Y. Ko, and J. Seo (2020) Construction of machine-labeled data for improving named entity recognition by transfer learning. IEEE Access 8, pp. 59684–59693.
  • [3] M. Pachocki and A. Wróblewska (2020) Categorization of persons based on their mentions in Polish news texts. JAMRIS 14.
  • [4] X. Schmitt and S. Kubler (2019) A replicable comparison study of NER software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate. In Sixth SNAMS, pp. 338–343.
  • [5] H. Vala and D. Jurgens (2015) Mr. Bennet, his coachman, and the Archbishop walk into a bar but only one of them gets recognized: On the difficulty of detecting characters in literary texts. In EMNLP, pp. 769–774.