SAHAAYAK 2023 – the Multi Domain Bilingual Parallel Corpus of Sanskrit to Hindi for Machine Translation

06/27/2023
by   Vishvajitsinh Bakrola, et al.

This data article presents SAHAAYAK 2023, a large bilingual parallel corpus for the low-resource language pair Sanskrit-Hindi. The corpus contains a total of 1.5M Sanskrit-Hindi sentence pairs. To make the corpus broadly usable and balanced, it incorporates data from multiple domains, including news, daily conversations, politics, history, sports, and ancient Indian literature. A multifaceted approach was adopted to build a sizable multi-domain corpus for a low-resource language such as Sanskrit. Development ranged from creating a small hand-crafted dataset to applying a wide range of mining, cleaning, and verification techniques. Mining followed a three-fold process: mining from machine-readable sources, mining from non-machine-readable sources, and collation from existing corpora. After mining, a dedicated pipeline for normalization, alignment, and corpus cleaning was developed and applied to make the corpus ready for use with machine translation algorithms.
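The abstract does not specify how the normalization and cleaning stages are implemented, so the following is only an illustrative sketch of what such a stage could look like for sentence pairs that are already aligned. Everything in it is an assumption made for illustration: the function names `normalize` and `clean_pairs`, the use of Unicode NFC normalization, the token-length ratio filter and its `max_ratio` threshold, and the toy sentence pairs. None of these is taken from the SAHAAYAK 2023 pipeline.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace.

    Hypothetical helper: the actual normalization rules used for the
    corpus are not described in the abstract.
    """
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs, max_ratio=3.0):
    """Filter aligned (sanskrit, hindi) pairs with simple heuristics.

    Drops pairs with an empty side, pairs whose token counts differ by
    more than `max_ratio` (an arbitrary illustrative threshold), and
    exact duplicates. Order of the surviving pairs is preserved.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # one side is empty after normalization
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths too different; likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


# Toy input, invented for this example only.
sample = [
    ("अहं  गच्छामि", "मैं जाता हूँ"),        # extra whitespace
    ("अहं गच्छामि", "मैं जाता हूँ"),         # duplicate after normalization
    ("", "कुछ"),                            # empty source side
    ("रामः", "यह एक बहुत लंबा वाक्य है"),     # implausible length ratio
]

for sanskrit, hindi in clean_pairs(sample):
    print(sanskrit, "|||", hindi)
```

Because Sanskrit is morphologically dense and uses compounding heavily, a single Sanskrit token can correspond to several Hindi tokens, so any real length-ratio threshold would need to be tuned on the data rather than fixed in advance.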


research
01/07/2018

MIZAN: A Large Persian-English Parallel Corpus

One of the most major and essential tasks in natural language processing...
research
10/04/2020

Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus

Machine translation has been a major motivation of development in natura...
research
04/12/2021

Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

Major advancement in the performance of machine translation models has b...
research
12/26/2019

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

Lectures translation is a case of spoken language translation and there ...
research
01/24/2019

Automatic Parallel Corpus Creation for Hindi-English News Translation Task

The parallel corpus for multilingual NLP tasks, deep learning applicatio...
research
03/15/2022

Better Quality Estimation for Low Resource Corpus Mining

Quality Estimation (QE) models have the potential to change how we evalu...
research
08/11/2020

On Learning Language-Invariant Representations for Universal Machine Translation

The goal of universal machine translation is to learn to translate betwe...
