L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources

02/02/2022
by Raviraj Joshi, et al.

We present L3Cube-MahaCorpus, a Marathi monolingual dataset scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We further present MahaBERT, MahaAlBERT, and MahaRoBerta, all BERT-based masked language models, and MahaFT, a set of fastText word embeddings, both trained on the full Marathi corpus of 752M tokens. We show the effectiveness of these resources on downstream classification and NER tasks. Marathi is a popular language in India but still lacks such resources. This work is a step forward in building open resources for the Marathi language. The data and models are available at https://github.com/l3cube-pune/MarathiNLP.


