Producing Corpora of Medieval and Premodern Occitan

04/26/2019
by Jean-Baptiste Camps, et al.

At a time when the quantity of more or less freely available data is increasing significantly, thanks to digital corpora, editions and libraries, the development of data mining tools and deep learning methods allows researchers to build a study corpus tailored to their research, to enrich their data and to exploit them.

Open optical character recognition (OCR) tools can be adapted to old prints, incunabula or even manuscripts, with usable results, allowing the rapid creation of textual corpora. Alternating training and correction phases makes it possible to improve the quality of the results by rapidly accumulating raw text data. These data can then be structured, for example in XML/TEI, and enriched.

The enrichment of the texts with graphic or linguistic annotations can also be automated. These processes, familiar to linguists and functional for modern languages, present difficulties for languages such as Medieval Occitan, due in part to the absence of sufficiently large lemmatized corpora. Suggestions for the creation of tools adapted to the considerable spelling variation of ancient languages will be presented, as well as experiments in the lemmatization of Medieval and Premodern Occitan.

These techniques open the way to many uses. The much-desired increase in the amount of available quality texts and data makes it possible to improve digital philology methods, provided everyone takes the trouble to make their data freely available online and reusable.

By presenting different technical solutions and some micro-analyses as examples, this paper aims to show part of what digital philology can offer to researchers in the Occitan domain, while recalling the ethical issues on which such practices are based.
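As a rough illustration of the structuring step mentioned in the abstract (raw OCR text turned into XML/TEI), the following Python sketch wraps a list of transcribed verse lines in a minimal TEI skeleton. It is not the authors' pipeline: the function name `lines_to_tei`, the choice of `<lg>`/`<l>` for verse, and the placeholder header statements are assumptions made for this example, and a real edition would need a much richer `teiHeader`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # official TEI namespace


def _el(parent, name, text=None, **attrs):
    """Create a namespaced TEI child element (helper for this sketch)."""
    child = ET.SubElement(parent, f"{{{TEI_NS}}}{name}", attrs)
    if text is not None:
        child.text = text
    return child


def lines_to_tei(title, lines):
    """Wrap raw transcribed verse lines in a minimal TEI document.

    Hypothetical helper: the header content is placeholder text only.
    """
    ET.register_namespace("", TEI_NS)  # serialize with a default namespace
    tei = ET.Element(f"{{{TEI_NS}}}TEI")

    # Minimal header: title, publication and source statements.
    header = _el(tei, "teiHeader")
    file_desc = _el(header, "fileDesc")
    _el(_el(file_desc, "titleStmt"), "title", title)
    _el(_el(file_desc, "publicationStmt"), "p", "Unpublished working draft.")
    _el(_el(file_desc, "sourceDesc"), "p", "Derived from uncorrected OCR output.")

    # Body: one line group, one numbered <l> per non-empty input line.
    body = _el(_el(tei, "text"), "body")
    line_group = _el(body, "lg")
    kept = (line.strip() for line in lines if line.strip())
    for number, line in enumerate(kept, start=1):
        _el(line_group, "l", line, n=str(number))

    return ET.tostring(tei, encoding="unicode")


if __name__ == "__main__":
    # Placeholder strings stand in for real OCR output.
    print(lines_to_tei("Sample text", ["first verse line", "second verse line"]))
```

Numbering each `<l>` at this stage keeps later annotation layers (for instance lemmas or part-of-speech tags) addressable by line, which is one common reason to structure the text before enriching it.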

research
04/12/2021

Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

Major advancement in the performance of machine translation models has b...
research
10/28/2020

Character Entropy in Modern and Historical Texts: Comparison Metrics for an Undeciphered Manuscript

This paper outlines the creation of three corpora for multilingual compa...
research
06/11/2020

Provenance for Linguistic Corpora Through Nanopublications

Research in Computational Linguistics is dependent on text corpora for t...
research
08/10/2016

An assessment of orthographic similarity measures for several African languages

Natural Language Interfaces and tools such as spellcheckers and Web sear...
research
09/23/2021

Corpus and Models for Lemmatisation and POS-tagging of Old French

Old French is a typical example of an under-resourced historic languages...
research
06/08/2022

The Open corpus of the Veps and Karelian languages: overview and applications

A growing priority in the study of Baltic-Finnic languages of the Republ...
research
12/11/2019

A Collaborative Ecosystem for Digital Coptic Studies

Scholarship on underresourced languages bring with them a variety of cha...
