MULTEXT-East

03/31/2020
by Tomaž Erjavec, et al.

The MULTEXT-East language resources are a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The dataset includes EAGLES-based morphosyntactic specifications, morphosyntactic lexicons, and annotated multilingual corpora. The parallel corpus, the novel "1984" by George Orwell, is sentence-aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, following the Text Encoding Initiative Guidelines (TEI P5), and cover 16 languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The dataset is extensively documented and freely available for research purposes. This case study gives a history of the development of the MULTEXT-East resources, presents their encoding and components, discusses related work, and offers some conclusions.


