The Annotation Guideline of LST20 Corpus

by Prachya Boonkwan, et al.

This report presents the annotation guideline for LST20, a large-scale corpus with multiple layers of linguistic annotation for Thai language processing. The guideline covers five layers of linguistic annotation: word segmentation, POS tagging, named entities, clause boundaries, and sentence boundaries. The dataset follows a CoNLL-2003-style format for ease of use. The corpus consists of 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences, and is annotated with 16 distinct POS tags. All 3,745 documents are also annotated with 15 news genres. Given its size, the dataset is considered large enough for developing joint neural models for NLP. With this publicly available corpus, Thai has become a linguistically rich language for the first time.
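As an illustration of what a CoNLL-2003-style layout implies for readers of the data, the following is a minimal sketch of a reader for such files. It assumes one token per line, tab-separated columns, and a blank line between sentences. The column order (word, POS tag, named-entity tag, clause-boundary tag) and the tag strings in the sample are hypothetical placeholders chosen for the example, not values confirmed by the abstract, and `read_conll` is an illustrative helper rather than part of any released tooling.

```python
from typing import Iterable, Iterator, List, Tuple

# One token is a tuple of column values, e.g. (word, pos, ne, clause).
Token = Tuple[str, ...]


def read_conll(lines: Iterable[str], sep: str = "\t") -> Iterator[List[Token]]:
    """Yield sentences from CoNLL-style lines.

    Assumption: one token per line, columns separated by `sep`,
    and a blank line marking the end of a sentence.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(tuple(line.split(sep)))
    # Flush the last sentence when the input lacks a trailing blank line.
    if sentence:
        yield sentence


# Hypothetical sample: the words and tag strings are placeholders.
sample = (
    "w1\tNN\tB_PER\tB_CLS\n"
    "w2\tVV\tO\tE_CLS\n"
    "\n"
    "w3\tNN\tO\tB_CLS\n"
)
sentences = list(read_conll(sample.splitlines()))
```

Each sentence comes back as a list of column tuples, so the individual annotation layers can be pulled out by column index once the actual column order has been checked against the corpus documentation.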




