Itihasa: A large-scale corpus for Sanskrit to English translation

06/06/2021
by   Rahul Aralikatte, et al.

This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics, The Ramayana and The Mahabharata. We first describe the motivation behind curating such a dataset, then present an empirical analysis that brings out its nuances. Finally, we benchmark standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, which underscores the complexity of the dataset.
