hmBERT: Historical Multilingual Language Models for Named Entity Recognition

05/31/2022
by   Stefan Schweter, et al.

Compared to standard Named Entity Recognition (NER), identifying persons, locations, and organizations in historical texts is a considerable challenge. To obtain machine-readable corpora, historical texts are usually scanned and then processed with optical character recognition (OCR), so the resulting corpora contain recognition errors. In addition, entities such as locations or organizations can change over time, which poses another challenge. Overall, historical texts have several peculiarities that differ greatly from modern texts, and large labeled corpora for training a neural tagger are hardly available for this domain. In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models. We circumvent the need for labeled data by using unlabeled data for pretraining a language model. We propose hmBERT, a historical multilingual BERT-based language model, and publicly release it in several sizes. Furthermore, we evaluate the capability of hmBERT on downstream NER as part of the HIPE-2022 shared task and provide detailed analysis and insights. For the Multilingual Classical Commentary coarse-grained NER challenge, our tagger HISTeria outperforms the other teams' models for two out of three languages.
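To make the task concrete, the sketch below shows one common way coarse-grained NER output is represented: token-level BIO tags collapsed into entity spans. It is a minimal illustration and not taken from the paper. The function name `bio_to_spans`, the example sentence, and the tag names `PER` and `LOC` are assumptions chosen for readability and may not match the label set used in the shared task.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse token-level BIO tags into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open entity;
            # the token itself is not added to any span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical example sentence with illustrative tags.
tokens = ["Goethe", "reiste", "nach", "Weimar", "."]
tags = ["B-PER", "O", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
```

In historical text, OCR errors can distort the tokens themselves (for example, a misrecognized character inside a name), which is one reason span-level decoding like this is only the final step after a tagger that is robust to such noise.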

research
05/26/2023

People and Places of Historical Europe: Bootstrapping Annotation Pipeline and a New Corpus of Named Entities in Late Medieval Texts

Although pre-trained named entity recognition (NER) models are highly ac...
research
06/21/2019

Multilingual Named Entity Recognition Using Pretrained Embeddings, Attention Mechanism and NCRF

In this paper we tackle multilingual named entity recognition task. We u...
research
10/28/2020

Character Entropy in Modern and Historical Texts: Comparison Metrics for an Undeciphered Manuscript

This paper outlines the creation of three corpora for multilingual compa...
research
01/12/2023

Adversarial Adaptation for French Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying and classifyin...
research
11/09/2016

Old Content and Modern Tools - Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

Named Entity Recognition (NER), search, classification and tagging of na...
research
08/27/2022

Domain-Specific NER via Retrieving Correlated Samples

Successful Machine Learning based Named Entity Recognition models could ...
research
08/16/2022

Temporal Concept Drift and Alignment: An empirical approach to comparing Knowledge Organization Systems over time

This research explores temporal concept drift and temporal alignment in ...
