AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

04/30/2020
by Anoop Kunchukuttan, et al.

We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.

research
09/26/2020

iNLTK: Natural Language Toolkit for Indic Languages

We present iNLTK, an open-source NLP library consisting of pre-trained l...
research
03/11/2019

ETNLP: A Toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings

In this paper, we introduce a comprehensive toolkit, ETNLP, which can ev...
research
10/05/2017

BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

We present BPEmb, a collection of pre-trained subword unit embeddings in...
research
02/02/2022

L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources

We present L3Cube-MahaCorpus a Marathi monolingual data set scraped from...
research
12/27/2021

"A Passage to India": Pre-trained Word Embeddings for Indian Languages

Dense word vectors or 'word embeddings' which encode semantic properties...
research
12/05/2019

Massive vs. Curated Word Embeddings for Low-Resourced Languages. The Case of Yorùbá and Twi

The success of several architectures to learn semantic representations f...
research
05/19/2021

Essay-BR: a Brazilian Corpus of Essays

Automatic Essay Scoring (AES) is defined as the computer technology that...
