The Annotation Guideline of LST20 Corpus

08/12/2020
by Prachya Boonkwan, et al.

This report presents the annotation guideline for LST20, a large-scale corpus with multiple layers of linguistic annotation for Thai language processing. The guideline covers five layers of annotation: word segmentation, POS tagging, named entities, clause boundaries, and sentence boundaries. The dataset follows a CoNLL-2003-style format for ease of use. The corpus consists of 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences, and it is annotated with 16 distinct POS tags. All 3,745 documents are also annotated with 15 news genres. Given its size, the dataset is considered large enough for developing joint neural models for NLP. With this publicly available corpus, Thai has become a linguistically rich language for the first time.
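As a rough illustration of what a CoNLL-2003-style layout implies for users, the sketch below reads token-per-line text in which blank lines separate sentences. The column order (token, POS tag, named-entity tag, clause-boundary tag), the tab delimiter, the function name `read_conll`, and the placeholder tag values are all assumptions made for this example; they are not taken from the guideline, so check the released files before relying on them.

```python
from typing import Iterable, List, Tuple

# Assumed (hypothetical) column layout, not confirmed by the guideline:
# token, POS tag, named-entity tag, clause-boundary tag, separated by tabs.
Row = Tuple[str, ...]


def read_conll(lines: Iterable[str]) -> List[List[Row]]:
    """Group token-per-line rows into sentences separated by blank lines."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split("\t")))
    # Keep the last sentence if the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Placeholder rows; the tokens and tag values are invented for illustration.
sample = (
    "w1\tPOS_A\tNE_A\tCL_A\n"
    "w2\tPOS_B\tNE_B\tCL_B\n"
    "\n"
    "w3\tPOS_A\tNE_B\tCL_A\n"
)
parsed = read_conll(sample.splitlines())
```

Here `parsed` holds two sentences, the first with two rows and the second with one. Keeping each row as a plain tuple avoids committing to column names that the abstract does not specify.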

