Data Centric Domain Adaptation for Historical Text with OCR Errors

07/02/2021
by Luisa März, et al.

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain, a form of data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Both our cross-domain and our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
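
To make the idea of injecting synthetic OCR errors into clean source-domain data concrete, here is a minimal sketch in Python. It is not the method from the paper: the confusion table `CONFUSIONS`, the function name `inject_ocr_noise`, and the `error_rate` parameter are all assumptions chosen for illustration, and the sketch only performs character substitutions so that each token keeps its position and its NER label stays aligned.

```python
import random

# Hypothetical confusion table (an assumption, not taken from the paper):
# each character maps to a few visually similar characters that an OCR
# engine might plausibly output instead.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "s": ["f"],
    "n": ["u", "m"],
    "o": ["0", "e"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=None):
    """Return a copy of `tokens` with simulated OCR substitution errors.

    Each character that has an entry in CONFUSIONS is replaced with
    probability `error_rate`. Only substitutions are applied, so the
    number of tokens (and therefore the label sequence) is unchanged.
    """
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        chars = []
        for ch in tok:
            if ch in CONFUSIONS and rng.random() < error_rate:
                chars.append(rng.choice(CONFUSIONS[ch]))
            else:
                chars.append(ch)
        noisy.append("".join(chars))
    return noisy


# Toy source-domain sentence with token-level NER labels.
tokens = ["Amsterdam", "is", "een", "stad"]
labels = ["B-LOC", "O", "O", "O"]

noisy_tokens = inject_ocr_noise(tokens, error_rate=0.2, seed=0)
for tok, noisy, label in zip(tokens, noisy_tokens, labels):
    print(f"{tok}\t{noisy}\t{label}")
```

Because the noise is substitution-only, the noisy tokens can be paired with the original labels and added to the training data. A more realistic simulation would presumably also model insertions, deletions, and token splits or merges, which would require re-aligning the labels.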

12/08/2020

CrossNER: Evaluating Cross-Domain Named Entity Recognition

Cross-domain named entity recognition (NER) models are able to cope with...
06/07/2022

Searching for Optimal Subword Tokenization in Cross-domain NER

Input distribution shift is one of the vital problems in unsupervised do...
08/24/2022

FactMix: Using a Few Labeled In-domain Examples to Generalize to Cross-domain Named Entity Recognition

Few-shot Named Entity Recognition (NER) is imperative for entity tagging...
10/15/2018

Neural Adaptation Layers for Cross-domain Named Entity Recognition

Recent research efforts have shown that neural architectures can be effe...
12/22/2020

Semi-Supervised Disentangled Framework for Transferable Named Entity Recognition

Named entity recognition (NER) for identifying proper nouns in unstructu...
04/15/2021

Cross-Domain Label-Adaptive Stance Detection

Stance detection concerns the classification of a writer's viewpoint tow...
12/08/2021

Burn After Reading: Online Adaptation for Cross-domain Streaming Data

In the context of online privacy, many methods propose complex privacy a...
