Open Korean Corpora: A Practical Report

12/31/2020
by Won Ik Cho, et al.

Korean is often referred to as a low-resource language in the research community. While this claim is partially true, the perception also persists because the resources that do exist are inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction for how open-source datasets should be constructed and released for less-resourced languages to promote research.


