Razmecheno: Named Entity Recognition from Digital Archive of Diaries "Prozhito"

01/24/2022
by Timofey Atnashev, et al.

The vast majority of existing datasets for Named Entity Recognition (NER) are built primarily on news, research papers and Wikipedia, with a few exceptions created from historical and literary texts. Moreover, English is the main source of data for labelling. This paper aims to fill several of these gaps by creating a novel dataset, "Razmecheno", gathered from the Russian-language diary texts of the project "Prozhito". Our dataset is of interest for several lines of research: literary studies of diary texts, transfer learning from other domains, and low-resource or cross-lingual named entity recognition. Razmecheno comprises 1331 sentences and 14119 tokens, sampled from diaries written during Perestroika. The annotation schema consists of five commonly used entity tags: person, characteristics, location, organisation, and facility. The labelling is carried out on the crowdsourcing platform Yandex.Toloka in two stages. First, workers selected sentences that contain an entity of a particular type. Second, they marked up entity spans. As a result, 1113 entities were obtained. Empirical evaluation of Razmecheno is carried out with off-the-shelf NER tools and by fine-tuning pre-trained contextualized encoders. We release the annotated dataset for open access.
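To make the span mark-up stage concrete, here is a minimal sketch of how annotated entity spans could be turned into token-level labels for training or evaluating a NER model. It assumes a BIO encoding, which the abstract does not specify. The tag abbreviations (PER, CHAR, LOC, ORG, FAC), the function name `spans_to_bio`, and the example sentence are all hypothetical and are not taken from the released dataset.

```python
from typing import List, Tuple

# Hypothetical abbreviations for the five entity types named in the abstract:
# person, characteristics, location, organisation, facility.
TAGS = {"PER", "CHAR", "LOC", "ORG", "FAC"}

# A span is (start token index, end token index exclusive, tag).
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert entity spans into one BIO label per token (assumed encoding)."""
    labels = ["O"] * n_tokens
    for start, end, tag in sorted(spans):
        if tag not in TAGS:
            raise ValueError(f"unknown tag: {tag}")
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"span out of range: {(start, end)}")
        if any(label != "O" for label in labels[start:end]):
            raise ValueError(f"overlapping span: {(start, end)}")
        labels[start] = "B-" + tag          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + tag          # remaining tokens of the entity
    return labels


# Invented example sentence, not drawn from Razmecheno.
tokens = ["Anna", "Petrova", "visited", "Moscow", "yesterday"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
labels = spans_to_bio(len(tokens), spans)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

The sketch rejects overlapping spans for simplicity; whether the actual annotation allows nested or overlapping entities is not stated in the abstract.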


