OCR Post Correction for Endangered Language Texts

11/10/2020
by Shruti Rijhwani, et al.

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34
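The "recognition error rate" in the abstract is not defined in this excerpt. A common choice for OCR evaluation is the character error rate (CER): the edit distance between the OCR output and a reference transcription, divided by the reference length. The sketch below assumes that definition; the function names `edit_distance`, `char_error_rate`, and `relative_reduction` are illustrative and not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Assumed metric: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in error rate, e.g. first-pass OCR vs. post-corrected text."""
    if before <= 0:
        raise ValueError("baseline error rate must be positive")
    return (before - after) / before
```

Under this definition, a post-correction model is scored by computing the error rate of the first-pass OCR output and of the corrected output against the same reference, then reporting the relative reduction between the two.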


research
02/15/2018

Tools and resources for Romanian text-to-speech and speech-to-text applications

In this paper we introduce a set of resources and tools aimed at providi...
research
11/04/2021

Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Much of the existing linguistic data in many languages of the world is l...
research
09/06/2018

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in R...
research
05/15/2023

Beqi: Revitalize the Senegalese Wolof Language with a Robust Spelling Corrector

The progress of Natural Language Processing (NLP), although fast in rece...
research
01/23/2023

Noisy Parallel Data Alignment

An ongoing challenge in current natural language processing is how its m...
research
05/19/2023

XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

Data scarcity is a crucial issue for the development of highly multiling...
