Data Centric Domain Adaptation for Historical Text with OCR Errors

07/02/2021
by Luisa März, et al.

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical Dutch and French data. For the cross-domain case, we address two problems: domain shift, by integrating unsupervised in-domain data via contextualized string embeddings, and OCR errors, by injecting synthetic OCR errors into the source domain, a form of data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Both our cross-domain and our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
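The idea of injecting synthetic OCR errors into clean source-domain text can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not specify. It assumes a simple per-character substitution model, and the `CONFUSIONS` table, the `inject_ocr_noise` function, and the `error_rate` parameter are all hypothetical names chosen for illustration.

```python
import random

# Hypothetical confusion table: illustrative character pairs only,
# not taken from the paper or estimated from any corpus.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "m": ["rn"],
    "s": ["f"],
    "u": ["n"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=None):
    """Return a noisy copy of `tokens` with OCR-like character substitutions.

    Each character that has an entry in CONFUSIONS is replaced with
    probability `error_rate`. Tokens are never split, merged, or dropped,
    so token-level NER labels stay aligned with the noisy output.
    """
    rng = random.Random(seed)
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            options = CONFUSIONS.get(ch)
            if options and rng.random() < error_rate:
                chars.append(rng.choice(options))
            else:
                chars.append(ch)
        noisy.append("".join(chars))
    return noisy


if __name__ == "__main__":
    sentence = ["De", "burgemeester", "van", "Brussel", "sprak", "gisteren"]
    print(inject_ocr_noise(sentence, error_rate=0.3, seed=0))
```

Keeping the token count unchanged is a deliberate simplification: it lets existing token-level annotations be reused on the noisy text. Real OCR output also contains segmentation errors (split or merged words), which this sketch does not model.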


research
12/08/2020

CrossNER: Evaluating Cross-Domain Named Entity Recognition

Cross-domain named entity recognition (NER) models are able to cope with...
research
06/07/2022

Searching for Optimal Subword Tokenization in Cross-domain NER

Input distribution shift is one of the vital problems in unsupervised do...
research
04/15/2021

Cross-Domain Label-Adaptive Stance Detection

Stance detection concerns the classification of a writer's viewpoint tow...
research
10/15/2018

Neural Adaptation Layers for Cross-domain Named Entity Recognition

Recent research efforts have shown that neural architectures can be effe...
research
12/22/2020

Semi-Supervised Disentangled Framework for Transferable Named Entity Recognition

Named entity recognition (NER) for identifying proper nouns in unstructu...
research
12/08/2021

Burn After Reading: Online Adaptation for Cross-domain Streaming Data

In the context of online privacy, many methods propose complex privacy a...
research
05/05/2023

DualCross: Cross-Modality Cross-Domain Adaptation for Monocular BEV Perception

Closing the domain gap between training and deployment and incorporating...
