Developing a Fine-Grained Corpus for a Less-resourced Language: the case of Kurdish

09/25/2019
by Roshna Omer Abdulrahman, et al.

Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC, the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in the Sorani dialect. The corpus is normalized and categorized into 12 educational subjects, and it contains 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.
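To make the token and type figures concrete, here is a minimal sketch of how such counts and the resulting type-token ratio could be computed. The whitespace tokenization and the function names `token_type_counts` and `type_token_ratio` are assumptions for illustration only; the abstract does not say how the corpus was tokenized.

```python
from collections import Counter


def token_type_counts(text):
    """Return (number of tokens, number of distinct types) for a text.

    Assumption: tokens are whitespace-separated strings. The abstract does
    not describe the tokenizer used for KTC, so real counts may differ.
    """
    tokens = text.split()
    types = Counter(tokens)
    return len(tokens), len(types)


def type_token_ratio(n_tokens, n_types):
    """Types divided by tokens; 0.0 for an empty text."""
    return n_types / n_tokens if n_tokens else 0.0


# Toy example: "the" occurs twice, so 5 tokens but only 4 types.
print(token_type_counts("the cat saw the dog"))

# Ratio implied by the figures reported in the abstract.
print(round(type_token_ratio(693_800, 110_297), 3))
```

With the reported figures, the ratio comes to about 0.159, that is, roughly one distinct word form for every six running tokens.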

