User-Centric Evaluation of OCR Systems for Kwak'wala

02/26/2023
by Shruti Rijhwani, et al.

There has been recent interest in improving optical character recognition (OCR) for endangered languages, particularly because a large number of documents and books in these languages are not in machine-readable formats. The performance of OCR systems is typically evaluated using automatic metrics such as character and word error rates. While error rates are useful for comparing different models and systems, they do not measure whether and how the transcriptions produced by OCR tools are useful to downstream users. In this paper, we present a human-centric evaluation of OCR systems, focusing on the Kwak'wala language as a case study. With a user study, we show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents (a task that is often undertaken by endangered language community members and researchers) by over 50%. Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
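For readers unfamiliar with the automatic metrics mentioned above, the following is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly computed: the Levenshtein edit distance between the reference transcription and the OCR output, divided by the reference length. The function names (`edit_distance`, `cer`, `wer`) and the whitespace tokenization are assumptions made for illustration, not the paper's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate, assuming whitespace tokenization (an illustrative choice)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the OCR output is much longer than the reference. As the abstract notes, these numbers are useful for comparing systems but do not by themselves indicate how much transcription effort a user saves.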


