LatinCy: Synthetic Trained Pipelines for Latin NLP

05/07/2023
by Patrick J. Burns, et al.

This paper introduces LatinCy, a set of trained general-purpose Latin-language "core" pipelines for use with the spaCy natural language processing framework. The models are trained on a large amount of available Latin data, including all five of the Latin Universal Dependencies treebanks, which have been preprocessed to be compatible with one another. The result is a set of general models for Latin with good performance on a number of natural language processing tasks (e.g., the top-performing model achieves 97.41% accuracy on POS tagging). The paper describes the model training, including the training data and parameterization, and presents the advantages to Latin-language researchers of having a spaCy model available for NLP work.
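As a concrete illustration of the workflow the abstract describes, the sketch below loads a LatinCy pipeline with spaCy and inspects the annotations it produces for a short Latin sentence. The package name la_core_web_lg is an assumption here (one of the LatinCy model names); substitute whichever LatinCy model is actually installed.

    # Minimal sketch: using a LatinCy pipeline through the standard spaCy API.
    # Assumes a LatinCy package (here "la_core_web_lg") has already been
    # installed, e.g. via pip from its distribution wheel.
    import spacy

    nlp = spacy.load("la_core_web_lg")
    doc = nlp("Gallia est omnis divisa in partes tres.")

    # Each token carries the annotations the trained pipeline assigns:
    # lemma, coarse POS tag, and UD morphological features.
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.morph)

Because the pipeline exposes these annotations through spaCy's ordinary Doc and Token objects, downstream spaCy tooling (matchers, displaCy, custom components) works on Latin text without Latin-specific adapters.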
