An Assessment of the Impact of OCR Noise on Language Models

01/26/2022
by Konstantin Todorov, et al.

Neural language models are the backbone of modern natural language processing applications, and they are increasingly applied to textual heritage collections that have undergone Optical Character Recognition (OCR). Nevertheless, our understanding of the impact OCR noise can have on language models is still limited. We assess the impact of OCR noise on a variety of language models, using data in Dutch, English, French and German. We find that OCR noise poses a significant obstacle to language modelling: language models diverge increasingly from their noiseless targets as OCR quality decreases. On small corpora, simpler models, including PPMI and Word2Vec, consistently outperform transformer-based models in this respect.


