Prague Dependency Treebank – Consolidated 1.0

06/05/2020
by Jan Hajič, et al.

We present a richly annotated and genre-diversified language resource, the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0). As has always been the case for the family of Prague Dependency Treebanks, its purpose is to serve both as training data for various types of NLP tasks and as a resource for linguistically oriented research. PDT-C 1.0 contains four different datasets of Czech, uniformly annotated using the standard PDT scheme (although not everything is annotated manually, as we describe in detail here). The texts come from different sources: daily newspaper articles, a Czech translation of the Wall Street Journal, transcribed dialogs, and a small amount of user-generated, short, often non-standard language segments typed into a web translator. Altogether, the treebank contains around 180,000 sentences with their morphological, surface syntactic, and deep syntactic annotation. The diversity of the texts and annotations should serve NLP applications well, and it makes the treebank an invaluable resource for linguistic research, including comparative studies of texts of different genres. The corpus is publicly and freely available.
