A Survey of Corpora for Germanic Low-Resource Languages and Dialects

04/19/2023
by   Verena Blaschke, et al.

Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available. We share a companion website of this overview at https://github.com/mainlp/germanic-lrl-corpora .

