Can BERT Dig It? – Named Entity Recognition for Information Retrieval in the Archaeology Domain

06/14/2021
by Alex Brandsen, et al.

The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection (~658 million words). In archaeological IR, domain-specific entities such as locations, time periods, and artefacts play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities. In this paper, we present ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts. We compare the model's quality and output on an NER task to those of a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and for combining the best BERT model with a domain thesaurus using Conditional Random Fields (CRF). We find that ArcheoBERTje significantly outperforms both the multilingual and the generic Dutch model, with a smaller standard deviation between runs, reaching an average F1 score of 0.735. The model also outperforms ensemble methods combining the three models. Combining ArcheoBERTje predictions with explicit domain knowledge from the thesaurus did not increase the F1 score. We quantitatively and qualitatively analyse the differences between the vocabulary and output of the BERT models on the full collection and provide insights into the effect of fine-tuning for specific domains. Our results indicate that for a highly specific text domain such as archaeology, further pre-training on domain-specific data increases the model's quality on NER by a much larger margin than shown for other domains in the literature, and that domain-specific pre-training makes the addition of domain knowledge from a thesaurus unnecessary.
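To make the evaluation terms in the abstract concrete, the following is a minimal sketch of how token-level predictions from several NER models might be combined and scored. It rests on assumptions that the abstract does not confirm: that the ensemble is a simple majority vote, that labels use a BIO scheme, and that F1 is computed over exactly matching entity spans. The label names (`ART`, `PERIOD`, `LOC`) and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token labels from several models by majority vote.

    predictions: list of label sequences (one per model, equal length).
    Ties go to the earliest model in the list (an arbitrary choice).
    """
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        combined.append(next(l for l in labels if counts[l] == best))
    return combined


def extract_spans(tags):
    """Return the set of (start, end, type) spans in a BIO sequence.

    `end` is exclusive. An I- tag that does not continue an open span
    of the same type is ignored.
    """
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold, pred):
    """Entity-level F1: a predicted span counts only on an exact match."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_spans)
    recall = true_pos / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for a seven-token sentence (not data from the paper).
gold = ["B-ART", "I-ART", "O", "O", "B-PERIOD", "O", "B-LOC"]
model_a = ["B-ART", "I-ART", "O", "O", "B-PERIOD", "O", "B-LOC"]
model_b = ["B-ART", "O", "O", "O", "B-PERIOD", "O", "O"]
model_c = ["O", "I-ART", "O", "O", "B-PERIOD", "O", "B-LOC"]

ensemble = majority_vote([model_a, model_b, model_c])
print(entity_f1(gold, model_b), entity_f1(gold, ensemble))
```

In this toy case the vote repairs the individual errors of `model_b` and `model_c`. The abstract reports the opposite outcome on the real data, where the single domain-specific model outperformed the ensembles.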

