Standardizing linguistic data: method and tools for annotating (pre-orthographic) French

by Simon Gabay, et al.

With the development of large corpora covering various periods, it becomes crucial to standardise linguistic annotation (e.g. lemmas, POS tags, morphological annotation) to increase the interoperability of the data produced, despite diachronic variation. In the present paper, we describe both methodologically (by proposing annotation principles) and technically (by creating the required training data and the relevant models) the production of a linguistic tagger for (early) modern French (16th-18th c.), taking into account, as far as possible, existing standards for contemporary and, especially, medieval French.



