On consistency scores in text data with an implementation in R

by Ke-Li Chiu, et al.

In this paper, we introduce a reproducible process for cleaning text extracted from PDFs using n-gram models. Our approach compares the originally extracted text with the text that these models generate, or expect, when given the preceding text as a stimulus. To guide this process, we introduce the notion of a consistency score: the proportion of the text that is expected by the model. The score is used to monitor changes during the cleaning process and to compare different corpora. We illustrate our process on text from the book Jane Eyre and introduce both a Shiny application and an R package to make our process easier for others to adopt.
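To make the idea of a consistency score concrete, the following is a minimal Python sketch, not the authors' R package. It assumes whitespace tokenization, a bigram model (`n=2`), and that a word counts as "expected" when it is among the model's `top_k` most frequent continuations of the preceding context. The function names, the parameters `n` and `top_k`, and the toy sentences are all hypothetical choices made for illustration; the paper's own definitions may differ.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def train_ngram(tokens, n=2):
    """Map each (n-1)-token context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model


def consistency_score(tokens, model, n=2, top_k=5):
    """Proportion of tokens found among the model's top_k predictions.

    The first n-1 tokens have no full context and are skipped.
    """
    expected = 0
    total = 0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        # Unseen contexts yield no predictions, so the token is unexpected.
        followers = model.get(context, Counter())
        predictions = [word for word, _ in followers.most_common(top_k)]
        total += 1
        if tokens[i] in predictions:
            expected += 1
    return expected / total if total else 0.0


# Toy reference text used to fit the model (illustrative only).
reference = tokenize("the cat sat on the mat and the dog sat on the rug")
model = train_ngram(reference, n=2)

clean = tokenize("the cat sat on the mat")
noisy = tokenize("the cat sat 0n the mat")  # simulated extraction error: "on" -> "0n"

clean_score = consistency_score(clean, model, n=2)
noisy_score = consistency_score(noisy, model, n=2)
print(clean_score, noisy_score)
```

In this toy setup a single misread character lowers the score twice: once because "0n" is not an expected continuation of "sat", and once because the model has never seen "0n" as a context. Tracking the score before and after each cleaning step is one way to see whether a step moved the text closer to what the model expects.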

