
Data Centric Domain Adaptation for Historical Text with OCR Errors

by   Luisa März, et al.

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we take a data-centric approach to domain adaptation: we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain. We propose a general approach to imitate OCR errors in arbitrary input data. Both our cross-domain and our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
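The idea of injecting synthetic OCR errors into clean source-domain text can be illustrated with a minimal sketch. This is not the authors' method: the confusion table, the `inject_ocr_noise` function, and the `error_rate` and `seed` parameters are all hypothetical, chosen only to show character-level noise that keeps the token count unchanged so that NER labels stay aligned.

```python
import random

# Hypothetical confusion table, NOT taken from the paper: each clean
# character maps to strings an OCR engine might plausibly produce instead.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "m": ["rn"],
    "s": ["f"],
    "u": ["n"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=0):
    """Return a noisy copy of `tokens` with the same number of tokens.

    Each character that has an entry in CONFUSIONS is replaced with
    probability `error_rate`. Replacements never contain whitespace,
    so token-level NER labels remain aligned with the output.
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            if ch in CONFUSIONS and rng.random() < error_rate:
                chars.append(rng.choice(CONFUSIONS[ch]))
            else:
                chars.append(ch)
        noisy.append("".join(chars))
    return noisy


if __name__ == "__main__":
    sentence = ["Le", "ministre", "est", "venu", "à", "Bruxelles"]
    print(inject_ocr_noise(sentence, error_rate=0.3))
```

In a realistic setting the confusion table and error rate would presumably be estimated from the target-domain OCR output rather than written by hand.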




CrossNER: Evaluating Cross-Domain Named Entity Recognition

Cross-domain named entity recognition (NER) models are able to cope with...

Searching for Optimal Subword Tokenization in Cross-domain NER

Input distribution shift is one of the vital problems in unsupervised do...

FactMix: Using a Few Labeled In-domain Examples to Generalize to Cross-domain Named Entity Recognition

Few-shot Named Entity Recognition (NER) is imperative for entity tagging...

Neural Adaptation Layers for Cross-domain Named Entity Recognition

Recent research efforts have shown that neural architectures can be effe...

Semi-Supervised Disentangled Framework for Transferable Named Entity Recognition

Named entity recognition (NER) for identifying proper nouns in unstructu...

Cross-Domain Label-Adaptive Stance Detection

Stance detection concerns the classification of a writer's viewpoint tow...

Burn After Reading: Online Adaptation for Cross-domain Streaming Data

In the context of online privacy, many methods propose complex privacy a...
