On consistency scores in text data with an implementation in R

01/13/2021
by Ke-Li Chiu, et al.

In this paper, we introduce a reproducible process that uses n-gram models to clean text extracted from PDFs. Our approach compares the originally extracted text with the text generated from, or expected by, these models, using earlier text as the stimulus. To guide this process, we introduce the notion of a consistency score: the proportion of text that is expected by the model. The score is used to monitor changes during the cleaning process and across different corpora. We illustrate our process on text from the book Jane Eyre and introduce both a Shiny application and an R package to make our process easier for others to adopt.
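
To make the idea of a consistency score concrete, here is a minimal Python sketch. It is not the authors' R package or Shiny application. It assumes a trigram model, a naive whitespace tokenizer, and a strict reading of "expected": a word counts only if it is the model's single most frequent continuation of the two preceding words. The function names (`tokenize`, `train_ngram_model`, `consistency_score`) are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_ngram_model(tokens, n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model


def consistency_score(tokens, model, n=3):
    """Proportion of words that match the model's most frequent continuation.

    Assumption: a word is "expected" only if it is the top-1 prediction for
    its preceding n-1 words. Words with an unseen context count as not
    expected. The first n-1 words have no full context and are skipped.
    """
    total = len(tokens) - (n - 1)
    if total <= 0:
        return 0.0
    expected = 0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        followers = model.get(context)
        if followers and followers.most_common(1)[0][0] == tokens[i]:
            expected += 1
    return expected / total


# Illustrative data: a clean reference sentence and a copy with OCR-style errors.
reference = tokenize("there was no possibility of taking a walk that day")
model = train_ngram_model(reference)

clean = tokenize("there was no possibility of taking a walk that day")
noisy = tokenize("there was no possibi1ity of taking a wa1k that day")

print(consistency_score(clean, model))  # 1.0
print(consistency_score(noisy, model))  # 0.25
```

In this toy example, each extraction error lowers the score more than once: the corrupted word itself is unexpected, and so are the words whose context now contains it. Tracking the score before and after each cleaning step is one way to monitor whether a change moved the text closer to what the model expects.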


