L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi

11/21/2022
by   Ananya Joshi, et al.

Sentence representations from vanilla BERT models do not perform well on sentence similarity tasks. Sentence-BERT models trained specifically on STS or NLI datasets have been shown to provide state-of-the-art performance. However, building these models for low-resource languages is not straightforward due to the lack of such specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared via machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in producing high-performance sentence-similarity models for Hindi and Marathi. Vanilla BERT models trained with this simple strategy outperform the multilingual LaBSE, which relies on a far more complex training procedure. The models are evaluated on downstream text classification and similarity tasks; the evaluation on real text classification datasets shows that embeddings obtained from synthetic-data training generalize to real data as well, making this an effective training strategy for low-resource languages. We also provide a comparative analysis of sentence embeddings from FastText models, multilingual BERT models (mBERT, IndicBERT, XLM-RoBERTa, MuRIL), multilingual sentence embedding models (LASER, LaBSE), and monolingual BERT models based on L3Cube-MahaBERT and HindBERT. We release L3Cube-MahaSBERT and HindSBERT, state-of-the-art sentence-BERT models for Marathi and Hindi, respectively. Our work also serves as a guide to building sentence embedding models for low-resource languages.
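The two-stage recipe described in the abstract maps directly onto the sentence-transformers training API. Below is a minimal sketch, assuming the machine-translated NLI and STSb data are available as plain Python iterables (`translated_nli_rows` and `translated_stsb_rows` are hypothetical placeholders); the base checkpoint id, batch sizes, and epoch counts are illustrative and not necessarily the paper's exact configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Build a sentence-BERT model from a monolingual BERT checkpoint.
# "l3cube-pune/marathi-bert-v2" is assumed to be the MahaBERT id on
# Hugging Face; swap in the HindBERT id for the Hindi model.
word_embedding = models.Transformer("l3cube-pune/marathi-bert-v2", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# Stage 1: NLI pre-training with a softmax classification head over
# (premise, hypothesis) pairs labeled entailment / neutral / contradiction.
nli_examples = [
    InputExample(texts=[premise, hypothesis], label=label)  # label in {0, 1, 2}
    for premise, hypothesis, label in translated_nli_rows   # hypothetical iterable
]
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
nli_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1, warmup_steps=100)

# Stage 2: STSb fine-tuning with a cosine-similarity regression objective
# on sentence pairs scored 0-5 (normalized here to 0-1).
sts_examples = [
    InputExample(texts=[s1, s2], label=score / 5.0)
    for s1, s2, score in translated_stsb_rows               # hypothetical iterable
]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CosineSimilarityLoss(model=model)
model.fit(train_objectives=[(sts_loader, sts_loss)], epochs=4, warmup_steps=100)

model.save("maha-sbert")  # illustrative output path
```

The NLI stage teaches the encoder to separate entailing from contradicting pairs, while the STSb stage calibrates the embedding space so that cosine similarity tracks graded similarity scores; after training, `model.encode([...])` yields the sentence embeddings used for the downstream classification and similarity evaluations.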


