
Post-OCR Document Correction with Large Ensembles of Character Sequence Models

by Juan Ramirez-Orta et al.

In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies, supported by thorough experimentation, to accurately process strings much longer than the ones used to train the sequence model while remaining sample- and resource-efficient. The best-performing strategy splits the input document into character n-grams and combines their individual corrections into the final output using a voting scheme, which is equivalent to an ensemble of a large number of sequence models. We further investigate how to weight the contributions of each member of this ensemble. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve new state-of-the-art performance on five of them. Our code for post-OCR correction is shared at
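The voting scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a length-preserving correction function (`correct_fn`, a stand-in for the trained sequence model), slides a window one character at a time, and takes a per-position majority vote over all windows that cover each character.

```python
from collections import Counter

def correct_document(text, correct_fn, n=50):
    """Correct a long string by splitting it into overlapping character
    n-grams, correcting each window independently, and combining the
    per-position predictions with a majority vote (an implicit ensemble)."""
    if len(text) < n:
        # Input fits in a single window; correct it directly.
        return correct_fn(text)
    votes = [Counter() for _ in range(len(text))]
    for start in range(len(text) - n + 1):
        corrected = correct_fn(text[start:start + n])
        # Assumes the model's output is the same length as its input so
        # characters align by position; the paper handles alignment and
        # member weighting more carefully.
        for offset, ch in enumerate(corrected[:n]):
            votes[start + offset][ch] += 1
    return "".join(c.most_common(1)[0][0] for c in votes)
```

For example, with a toy "model" that repairs a common OCR confusion (`0` read for `o`), `correct_document("hell0 w0rld", lambda s: s.replace("0", "o"), n=4)` returns `"hello world"`: every window containing a `0` votes for `o` at that position.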

