Monolingual and Parallel Corpora for Kangri Low Resource Language

03/22/2021
by Shweta Chauhan, et al.

In this paper we present a dataset for Kangri (ISO 639-3: xnr), a low-resource, endangered Himachali language listed by the United Nations Educational, Scientific and Cultural Organization (UNESCO). Compiling the Kangri corpus was challenging because digitized resources were not available. The corpus contains 1,81,552 (181,552) monolingual and 27,362 Hindi-Kangri parallel entries. We share pre-trained Kangri word embeddings. We also report Bilingual Evaluation Understudy (BLEU) and Metric for Evaluation of Translation with Explicit ORdering (METEOR) scores for Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) results on the corpus. The corpus is freely available for non-commercial use and research. To the best of our knowledge, this is the first corpus for a low-resource, endangered Himachali language. The resources are available at (https://github.com/chauhanshweta/Kangri_corpus).
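For readers unfamiliar with the BLEU metric named above, the following is a minimal sketch of corpus-level BLEU in plain Python. It is a generic illustration, not the authors' evaluation script: the function names `ngrams` and `corpus_bleu` are hypothetical, tokenization is assumed to be simple whitespace splitting, each hypothesis is assumed to have exactly one reference, and no smoothing is applied. The example sentences are toy English strings, not drawn from the Kangri corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        ref_len += len(ref_tokens)
        hyp_len += len(hyp_tokens)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref_tokens, n)
            hyp_counts = ngrams(hyp_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["the cat sat on the mat today"]
    hyps = ["the cat sat on the mat"]
    print(f"BLEU = {corpus_bleu(refs, hyps):.4f}")
```

Published scores are normally computed with an established toolkit and a fixed tokenization scheme, so values from this sketch are not directly comparable to those reported for the corpus.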


