L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT

04/22/2023
by   Samruddhi Deode, et al.
0

The multilingual Sentence-BERT (SBERT) models map different languages to common representation space and are useful for cross-language similarity and mining tasks. We propose a simple yet effective approach to convert vanilla multilingual BERT models into multilingual sentence BERT models using synthetic corpus. We simply aggregate translated NLI or STS datasets of the low-resource target languages together and perform SBERT-like fine-tuning of the vanilla multilingual BERT model. We show that multilingual BERT models are inherent cross-lingual learners and this simple baseline fine-tuning approach without explicit cross-lingual training yields exceptional cross-lingual properties. We show the efficacy of our approach on 10 major Indic languages and also show the applicability of our approach to non-Indic languages German and French. Using this approach, we further present L3Cube-IndicSBERT, the first multilingual sentence representation model specifically for Indian languages Hindi, Marathi, Kannada, Telugu, Malayalam, Tamil, Gujarati, Odia, Bengali, and Punjabi. The IndicSBERT exhibits strong cross-lingual capabilities and performs significantly better than alternatives like LaBSE, LASER, and paraphrase-multilingual-mpnet-base-v2 on Indic cross-lingual and monolingual sentence similarity tasks. We also release monolingual SBERT models for each of the languages and show that IndicSBERT performs competitively with its monolingual counterparts. These models have been evaluated using embedding similarity scores and classification accuracy.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/10/2019

Can Monolingual Pretrained Models Help Cross-Lingual Classification?

Multilingual pretrained language models (such as multilingual BERT) have...
research
11/21/2022

L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi

Sentence representation from vanilla BERT models does not work well on s...
research
05/21/2021

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

Existing models of multilingual sentence embeddings require large parall...
research
02/10/2022

Slovene SuperGLUE Benchmark: Translation and Evaluation

We present a Slovene combined machine-human translated SuperGLUE benchma...
research
04/18/2018

Experiments with Universal CEFR Classification

The Common European Framework of Reference (CEFR) guidelines describe la...
research
04/08/2021

A Simple Geometric Method for Cross-Lingual Linguistic Transformations with Pre-trained Autoencoders

Powerful sentence encoders trained for multiple languages are on the ris...
research
04/30/2021

Paraphrastic Representations at Scale

We present a system that allows users to train their own state-of-the-ar...

Please sign up or login with your details

Forgot password? Click here to reset