Developing a Fine-Grained Corpus for a Less-resourced Language: the case of Kurdish

09/25/2019
by   Roshna Omer Abdulrahman, et al.
0

Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC-the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in Sorani dialect. The corpus is normalized and categorized into 12 educational subjects containing 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/08/2017

The IIT Bombay English-Hindi Parallel Corpus

We present the IIT Bombay English-Hindi Parallel Corpus. The corpus is a...
research
05/19/2021

Essay-BR: a Brazilian Corpus of Essays

Automatic Essay Scoring (AES) is defined as the computer technology that...
research
12/19/2018

A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics

The use of Project Gutenberg (PG) as a text corpus has been extremely po...
research
03/22/2021

Monolingual and Parallel Corpora for Kangri Low Resource Language

In this paper we present the dataset of Himachali low resource endangere...
research
07/12/2023

A Study on the Appropriate size of the Mongolian general corpus

This study aims to determine the appropriate size of the Mongolian gener...
research
04/20/2022

Who Is Missing? Characterizing the Participation of Different Demographic Groups in a Korean Nationwide Daily Conversation Corpus

A conversation corpus is essential to build interactive AI applications....

Please sign up or login with your details

Forgot password? Click here to reset