An efficient automated data analytics approach to large scale computational comparative linguistics

01/31/2020
by Gabija Mikulyte, et al.

This research project aimed to overcome the challenge of analysing human language relationships and to facilitate the grouping of languages and the formation of genealogical relationships between them by developing automated comparison techniques. The techniques were based on the phonetic representation of certain key words and concepts. Example word sets included numbers 1-10 (curated), a large database of numbers 1-10 and sheep counting numbers 1-10 (other sources), colours (curated) and basic words (curated). To enable comparison within the sets, an edit distance was calculated based on the Levenshtein distance metric. This metric between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. To explore which words exhibit more or less variation and which words are better preserved, and to examine how languages could be grouped based on linguistic distances within sets, several data analytics techniques were applied. These included density evaluation, hierarchical clustering, silhouette, mean, standard deviation and Bhattacharyya coefficient calculations. These techniques led to the development of a workflow which was later implemented by combining Unix shell scripts, a developed R package and SWI-Prolog. This proved to be computationally efficient and permitted the fast exploration of large language sets and their analysis.
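
As an illustration of the Levenshtein metric described in the abstract, the following is a minimal Python sketch. It is not the authors' implementation (the abstract states that the workflow combined Unix shell scripts, an R package and SWI-Prolog). The function names `levenshtein` and `pairwise_distances`, and the example word list, are assumptions introduced here for illustration only. The example words are plain spellings rather than the phonetic representations used in the project.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def pairwise_distances(words: dict) -> dict:
    """Edit distance for every unordered pair of languages.

    `words` maps a language name to its word for a single concept.
    """
    return {
        (lang_a, lang_b): levenshtein(words[lang_a], words[lang_b])
        for lang_a, lang_b in combinations(sorted(words), 2)
    }


# Hypothetical example data: ordinary spellings of the number three,
# chosen for illustration and not taken from the project's word sets.
example_words = {"English": "three", "German": "drei", "Spanish": "tres"}
distances = pairwise_distances(example_words)
```

A table of pairwise distances like `distances` is the kind of input that hierarchical clustering or silhouette calculations would then operate on. How the project built and normalised such tables is not stated in the abstract.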


