Jambu: A historical linguistic database for South Asian languages

06/05/2023
by Aryaman Arora, et al.

We introduce Jambu, a cognate database of South Asian languages that unifies dozens of previous sources in a structured and accessible format. The database includes 287k lemmata from 602 lects, grouped into 23k cognate sets. We outline the data wrangling needed to compile the dataset, and we train neural models for reflex prediction on the Indo-Aryan subset of the data. We hope that Jambu will be an invaluable resource for all historical linguists and Indologists, and we look forward to further improving and expanding the database.
