Multilingual Event Extraction from Historical Newspaper Adverts

05/18/2023
by Nadav Borenstein, et al.

NLP methods can help historians analyze textual materials in greater volumes than is manually feasible. Developing such methods, however, poses substantial challenges. First, acquiring large annotated historical datasets is difficult, as only domain experts can reliably label them. Second, most off-the-shelf NLP models are trained on modern-language texts, rendering them significantly less effective when applied to historical corpora. This is particularly problematic for less well-studied tasks and for languages other than English. This paper addresses these challenges while focusing on the under-explored task of event extraction from a novel domain of historical texts. We introduce a new multilingual dataset in English, French, and Dutch composed of newspaper ads from the early modern colonial period reporting on enslaved people who liberated themselves from enslavement. We find that: 1) even with scarce annotated data, it is possible to achieve surprisingly good results by formulating the problem as an extractive QA task and leveraging existing datasets and models for modern languages; and 2) cross-lingual low-resource learning for historical languages is highly challenging, and machine translation of the historical datasets to the considered target languages is, in practice, often the best-performing solution.
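To make the extractive QA formulation concrete, the following is a minimal sketch of how one annotated advert might be turned into SQuAD-style question-answer examples, one per event attribute. The attribute names, the question wording, and the `to_qa_examples` helper are all hypothetical illustrations and are not taken from the paper or its dataset; the advert text used in the usage note is likewise invented.

```python
from typing import Dict, List

# Hypothetical attribute-to-question templates (assumed for illustration;
# the paper's actual attribute set and question wording may differ).
QUESTION_TEMPLATES: Dict[str, str] = {
    "name": "What is the name of the person who self-liberated?",
    "advertiser": "Who placed the advert?",
    "reward": "What reward is offered?",
}


def to_qa_examples(ad_text: str, annotations: Dict[str, str]) -> List[dict]:
    """Convert one annotated advert into SQuAD-style extractive QA examples.

    `annotations` maps an attribute name to the answer span as it appears
    verbatim in `ad_text`. Attributes that are missing, or whose span cannot
    be found in the text, become unanswerable examples (empty answer lists),
    following the SQuAD v2 convention.
    """
    examples = []
    for attribute, question in QUESTION_TEMPLATES.items():
        answer = annotations.get(attribute)
        # Locate the first occurrence of the answer span in the advert.
        start = ad_text.find(answer) if answer else -1
        if start == -1:
            answers = {"text": [], "answer_start": []}
        else:
            answers = {"text": [answer], "answer_start": [start]}
        examples.append(
            {
                "attribute": attribute,
                "question": question,
                "context": ad_text,
                "answers": answers,
            }
        )
    return examples
```

For instance, calling `to_qa_examples("Ran away from the subscriber, a man named Tom. Ten dollars reward.", {"name": "Tom", "reward": "Ten dollars"})` yields three examples: two with character-offset answers and one unanswerable example for the missing `advertiser` attribute. Examples in this shape can then be combined with existing modern-language QA datasets for fine-tuning, which is the general idea the abstract describes.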


research
01/16/2023

XNLI 2.0: Improving XNLI dataset and performance on Cross Lingual Understanding (XLU)

Natural Language Processing systems are heavily dependent on the availab...
research
02/19/2022

MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction

Acronym extraction is the task of identifying acronyms and their expande...
research
04/16/2021

MetaXL: Meta Representation Transformation for Low-resource Cross-lingual Learning

The combination of multilingual pre-trained representations and cross-li...
research
05/18/2023

NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification

Africa has over 2000 indigenous languages but they are under-represented...
research
11/27/2018

Cross-Lingual Approaches to Reference Resolution in Dialogue Systems

In the slot-filling paradigm, where a user can refer back to slots in th...
research
10/10/2022

HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response

Timely and effective response to humanitarian crises requires quick and ...
research
06/28/2022

Placing (Historical) Facts on a Timeline: A Classification cum Coref Resolution Approach

A timeline provides one of the most effective ways to visualize the impo...
