Character Entropy in Modern and Historical Texts: Comparison Metrics for an Undeciphered Manuscript

by Luke Lindemann, et al.

This paper outlines the creation of three corpora for multilingual comparison and analysis of the Voynich manuscript: a corpus of Voynich texts partitioned by Currier language, scribal hand, and transcription system; a corpus of 294 language samples compiled from Wikipedia; and a corpus of eighteen transcribed historical texts in eight languages. These corpora will be used in subsequent work by the Voynich Working Group at Yale University. We demonstrate the utility of these corpora for studying characteristics of the Voynich script and language with an analysis of conditional character entropy in Voynichese. We discuss the interaction between character entropy and language, script size and type, glyph compositionality, scribal conventions and abbreviations, positional character variants, and bigram frequency. This analysis characterizes the interaction between script compositionality, character size, and predictability. We show that substantial manipulations of glyph composition are not sufficient to align conditional entropy levels with natural languages. The unusually predictable nature of the Voynichese script is not attributable to a particular script or transcription system, underlying language, or substitution cipher. Voynichese is distinct from every comparison text in our corpora because character placement is highly constrained within the word, and this may indicate the loss of phonemic distinctions from the underlying language.
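As a rough illustration of the metric named in the abstract, conditional character entropy can be estimated from bigram counts as H(next | previous) = -Σ p(a, b) log2 p(b | a). The sketch below is a minimal version under stated assumptions: the function name `conditional_entropy` is hypothetical, the text is treated as one flat character sequence (spaces included), and no smoothing or normalization is applied. The preprocessing actually used by the authors is not described in this abstract and may differ.

```python
import math
from collections import Counter


def conditional_entropy(text: str) -> float:
    """Estimate H(next char | previous char) in bits from raw bigram counts.

    Assumption: `text` is a flat character sequence; no smoothing is applied.
    """
    if len(text) < 2:
        return 0.0
    # Count adjacent character pairs and the first character of each pair.
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    total = sum(pairs.values())
    h = 0.0
    for (a, _b), n in pairs.items():
        p_pair = n / total        # joint probability p(a, b)
        p_cond = n / firsts[a]    # conditional probability p(b | a)
        h -= p_pair * math.log2(p_cond)
    return h


if __name__ == "__main__":
    # A fully predictable alternation has zero conditional entropy.
    print(conditional_entropy("abababab"))
    # A less constrained sequence scores higher.
    print(conditional_entropy("aabb"))
```

Lower values mean the next character is more predictable from the previous one, which is the sense in which the abstract describes Voynichese as unusually predictable.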


