The Open corpus of the Veps and Karelian languages: overview and applications

by Tatyana Boyko, et al.

Corpus linguistics methods and tools have become a growing priority in the study of the Baltic-Finnic languages of the Republic of Karelia. Since 2016, linguists, mathematicians, and programmers at the Karelian Research Centre have been working with the Open Corpus of the Veps and Karelian Languages (VepKar), an extension of the Veps Corpus created in 2009. VepKar comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced search system that filters by text criteria (language, genre, etc.) and by numerous linguistic categories; lexical and grammatical search in texts was made possible by the word-form generator that we created earlier. So far, a corpus of 3000 texts has been compiled, the texts have been uploaded and marked up, a system for classifying texts by language, dialect, type, and genre has been introduced, and the word-form generator has been created. Future plans include a speech module for working with audio recordings and a syntactic tagging module that uses the output of morphological analysis. Owing to continuous functional improvements to the corpus manager and the ongoing enrichment of VepKar with new material and text markup, users can handle a wide range of scientific and applied tasks. In creating the universal national VepKar corpus, its developers and managers strive to preserve and present as fully as possible the state of the Veps and Karelian languages in the 19th-21st centuries.


Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpus

This article introduces the Wanca 2017 corpus of texts crawled from the ...

An investigation into language complexity of World-of-Warcraft game-external texts

We present a language complexity analysis of World of Warcraft (WoW) com...

Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List

This paper is an effort to complement the contributions made by research...

Quantifying French Document Complexity

Measuring a document's complexity level is an open challenge, particular...

DISCO PAL: Diachronic Spanish Sonnet Corpus with Psychological and Affective Labels

Nowadays, there are many applications of text mining over corpus from di...

Producing Corpora of Medieval and Premodern Occitan

At a time when the quantity of - more or less freely - available data is...
