
Monolingual and Parallel Corpora for Kangri Low Resource Language

by Shweta Chauhan, et al.

In this paper, we present a dataset for Kangri (ISO 639-3: xnr), a low-resource, endangered Himachali language listed by the United Nations Educational, Scientific and Cultural Organization (UNESCO). Compiling the Kangri corpus was challenging because digitized resources were not available. The corpus comprises 1,81,552 monolingual and 27,362 Hindi-Kangri parallel entries. We share pre-trained Kangri word embeddings. We also report Bilingual Evaluation Understudy (BLEU) and Metric for Evaluation of Translation with Explicit ORdering (METEOR) scores for Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) on the corpus. The corpus is freely available for research and non-commercial use. To the best of our knowledge, this is the first corpus for a low-resource, endangered Himachali language. The resources are available at (




The IIT Bombay English-Hindi Parallel Corpus

We present the IIT Bombay English-Hindi Parallel Corpus. The corpus is a...

Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

Major advancement in the performance of machine translation models has b...

L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources

We present L3Cube-MahaCorpus a Marathi monolingual data set scraped from...

Using English as Pivot to Extract Persian-Italian Parallel Sentences from Non-Parallel Corpora

The effectiveness of a statistical machine translation system (SMT) is v...

Neural machine translation, corpus and frugality

In machine translation field, in both academia and industry, there is a ...

Attention Link: An Efficient Attention-Based Low Resource Machine Translation Architecture

Transformers have achieved great success in machine translation, but tra...

Developing a Fine-Grained Corpus for a Less-resourced Language: the case of Kurdish

Kurdish is a less-resourced language consisting of different dialects wr...