Vacaspati: A Diverse Corpus of Bangla Literature

07/11/2023
by Pramit Bhattacharyya, et al.

Bangla (or Bengali) is the fifth most spoken language globally; yet, state-of-the-art NLP in Bangla lags behind even for simple tasks such as lemmatization and POS tagging. This is partly due to the lack of a varied, high-quality corpus. To alleviate this need, we build Vacaspati, a diverse corpus of Bangla literature. The literary works are collected from various websites; only works that are publicly available without copyright violations or restrictions are included. We believe that published literature captures the features of a language much better than newspapers, blogs or social media posts, which tend to follow only a certain pattern and therefore miss out on language variety. Vacaspati is varied along multiple dimensions, including type of composition, topic, author, time and space. It contains more than 11 million sentences and 115 million words. From Vacaspati we also built a word embedding model, Vac-FT, using FastText, and trained an Electra model, Vac-BERT. Vac-BERT has far fewer parameters and requires only a fraction of the resources of other state-of-the-art transformer models, yet it performs as well as or better than them on various downstream tasks. Vac-FT likewise outperforms other FastText-based models on multiple downstream tasks. We also demonstrate the efficacy of Vacaspati as a corpus by showing that similar models built from other corpora are not as effective. The models are available at https://bangla.iitk.ac.in/.
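As a rough illustration of the embedding pipeline, the sketch below trains FastText embeddings on a sentence-per-line text file, which is how a model like Vac-FT could be produced; the file names and hyperparameters are illustrative assumptions, not the authors' reported settings.

```python
# Minimal sketch: unsupervised FastText embeddings from a plain-text corpus.
# "vacaspati.txt" (one sentence per line) and all hyperparameters below are
# assumed for illustration only.
import fasttext

model = fasttext.train_unsupervised(
    "vacaspati.txt",
    model="skipgram",  # subword-aware skip-gram suits morphologically rich Bangla
    dim=300,           # embedding dimensionality (assumed)
    minCount=5,        # drop words seen fewer than 5 times (assumed)
)

# Inspect a word vector and its nearest neighbours.
vector = model.get_word_vector("বাংলা")
print(model.get_nearest_neighbors("বাংলা"))

model.save_model("vac-ft.bin")  # hypothetical output file name
```

Similarly, an Electra checkpoint such as Vac-BERT could be loaded for a downstream task through the Hugging Face transformers API; the local path "vac-bert" below is a hypothetical placeholder for the files distributed at https://bangla.iitk.ac.in/.

```python
# Minimal sketch: setup for an Electra-style checkpoint on a two-class
# downstream task. "vac-bert" is a placeholder checkpoint path.
from transformers import ElectraTokenizerFast, ElectraForSequenceClassification

tokenizer = ElectraTokenizerFast.from_pretrained("vac-bert")
model = ElectraForSequenceClassification.from_pretrained("vac-bert", num_labels=2)

inputs = tokenizer("একটি বাংলা বাক্য", return_tensors="pt")
print(model(**inputs).logits)  # unnormalized class scores
```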


