From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

05/25/2020
by   Mika Hämäläinen, et al.

Many historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used during digitization. Correcting these errors manually is time-consuming, and many of the automatic approaches have relied on rules or supervised machine learning. We present a fully automatic, unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.
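
The abstract does not spell out how the parallel data is extracted, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that candidate misspellings for a correctly spelled word come from the nearest neighbours of that word in a word embedding model trained on the noisy corpus (here replaced by a hand-written toy dictionary, `toy_neighbours`), and that an edit-distance threshold (`max_distance`, a hypothetical parameter) separates likely OCR variants from semantically related but different words. The resulting pairs are written as space-separated characters, a common input format for character-based NMT toolkits.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def to_char_sequence(word: str) -> str:
    """Split a word into space-separated characters for a character-level model."""
    return " ".join(word)


def build_parallel_pairs(neighbours, max_distance=2):
    """Keep (noisy, correct) pairs whose edit distance is small but non-zero.

    `neighbours` maps a correctly spelled word to candidate words that are
    ASSUMED to come from an embedding model's nearest-neighbour query.
    """
    pairs = []
    for correct, candidates in neighbours.items():
        for candidate in candidates:
            distance = levenshtein(candidate, correct)
            if 0 < distance <= max_distance:
                pairs.append((to_char_sequence(candidate),
                              to_char_sequence(correct)))
    return pairs


# Toy stand-in for embedding neighbours (hypothetical data, not from the paper).
toy_neighbours = {
    "past": ["paft", "pafl", "history"],
    "future": ["fiiture", "tomorrow"],
}

pairs = build_parallel_pairs(toy_neighbours)
for source, target in pairs:
    print(f"{source}\t{target}")
```

In this toy run the semantically related neighbours ("history", "tomorrow") are dropped by the distance filter, while the OCR-like variants are kept as training pairs. In practice the threshold and the source of correct word forms would need to be chosen for the corpus at hand.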


