Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

10/21/2020
by Jiaming Luo, et al.

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both challenges by building on rich linguistic constraints that reflect consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on two deciphered languages (Gothic, Ugaritic) and one undeciphered language (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure of language closeness that correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, consistent with the position favored by current scholarship.
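
To make the idea of "phonological geometry" concrete, the following is a minimal sketch, not the model from the paper. It assumes a tiny hand-picked feature inventory (`FEATURES`) and a hypothetical mapping `IPA_FEATURES` from a few IPA symbols to those features, and it uses fixed binary vectors where a trained model would learn feature embeddings. The point is only that sounds sharing more features end up closer together.

```python
import math

# Hypothetical, hand-picked feature inventory for illustration only;
# it is not the feature set used in the paper.
FEATURES = [
    "consonant", "vowel", "voiced",
    "bilabial", "alveolar", "plosive", "nasal",
    "open", "close",
]

# Hypothetical mapping from a few IPA symbols to their features.
IPA_FEATURES = {
    "p": {"consonant", "bilabial", "plosive"},
    "b": {"consonant", "bilabial", "plosive", "voiced"},
    "t": {"consonant", "alveolar", "plosive"},
    "d": {"consonant", "alveolar", "plosive", "voiced"},
    "m": {"consonant", "bilabial", "nasal", "voiced"},
    "a": {"vowel", "open", "voiced"},
    "i": {"vowel", "close", "voiced"},
}


def embed(char):
    """Return a fixed binary feature vector for an IPA symbol.

    A learned model would replace each 0/1 entry with a trainable
    feature embedding; binary indicators keep the sketch simple.
    """
    feats = IPA_FEATURES[char]
    return [1.0 if f in feats else 0.0 for f in FEATURES]


def distance(a, b):
    """Euclidean distance between two symbols in feature space."""
    return math.dist(embed(a), embed(b))


if __name__ == "__main__":
    # /p/ and /b/ differ only in voicing; /p/ and /a/ share no features.
    print("p-b:", distance("p", "b"))
    print("p-a:", distance("p", "a"))
```

Under these assumptions, a substitution such as /p/ to /b/ is a short move in the feature space, while /p/ to /a/ is a long one, which is the kind of prior a decipherment model could use to prefer plausible sound correspondences.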
