Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi

08/19/2023
by Maithili Sabane, et al.

Recent advances in deep learning have produced highly sophisticated systems with an unquenchable appetite for data. Building good deep learning models for low-resource languages, however, remains a challenging task. This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi. Although Hindi is the 3rd most spoken language worldwide, with 345 million speakers, and Marathi is the 11th most spoken, with 83.2 million speakers, both languages have limited resources for building efficient Question Answering systems. To tackle this data scarcity, we have developed a novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi. We release the largest Question Answering dataset available for these languages, with each dataset containing 28,000 samples. We evaluate the dataset on various architectures and release the best-performing models for both Hindi and Marathi to facilitate further research in these languages. Because it relies on similarity tools, our method has the potential to create datasets in other languages as well, thereby enhancing the understanding of natural language across varied linguistic contexts. Our fine-tuned models, code, and dataset will be made publicly available.
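The abstract does not say how the similarity tools are applied. One common difficulty when translating a SQuAD-style dataset is that the separately translated answer may no longer appear verbatim in the translated context, so its position has to be recovered approximately. The sketch below is an illustration under that assumption, not the authors' method: the function name `locate_answer`, the `slack` parameter, and the use of Python's standard `difflib` as the similarity measure are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def locate_answer(context: str, answer: str, slack: int = 1):
    """Return (char_start, span_text, score) for the span of `context`
    that is most similar to `answer`.

    Hypothetical helper: slides word windows over the context and scores
    each window against the answer with a character-level similarity ratio.
    """
    words = context.split()
    n = len(answer.split())

    # Record the character offset of every word so the result can be
    # stored as a SQuAD-style answer start index.
    offsets, pos = [], 0
    for w in words:
        pos = context.index(w, pos)
        offsets.append(pos)
        pos += len(w)

    best = (-1, "", 0.0)
    # Try windows slightly shorter and longer than the answer, since a
    # translation can change the number of words.
    for size in range(max(1, n - slack), n + slack + 1):
        for i in range(len(words) - size + 1):
            start = offsets[i]
            end = offsets[i + size - 1] + len(words[i + size - 1])
            span = context[start:end]
            score = SequenceMatcher(None, span, answer).ratio()
            if score > best[2]:
                best = (start, span, score)
    return best
```

In a pipeline of this kind, a span whose score falls below some chosen threshold would probably be discarded rather than kept as a noisy label. The threshold itself is a design choice that this sketch leaves open.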

research
12/18/2021

Cascading Adaptors to Leverage English Data to Improve Performance of Question Answering for Low-Resource Languages

Transformer based architectures have shown notable results on many down ...
research
05/04/2022

KenSwQuAD – A Question Answering Dataset for Swahili Low Resource Language

This research developed a Kencorpus Swahili Question Answering Dataset K...
research
07/02/2020

Project PIAF: Building a Native French Question-Answering Dataset

Motivated by the lack of data for non-English languages, in particular f...
research
06/07/2022

cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

Vision-and-language tasks are gaining popularity in the research communi...
research
09/11/2023

NeCo@ALQAC 2023: Legal Domain Knowledge Acquisition for Low-Resource Languages through Data Enrichment

In recent years, natural language processing has gained significant popu...
research
09/23/2021

ParaShoot: A Hebrew Question Answering Dataset

NLP research in Hebrew has largely focused on morphology and syntax, whe...
research
04/23/2020

Rapidly Bootstrapping a Question Answering Dataset for COVID-19

We present CovidQA, the beginnings of a question answering dataset speci...
