Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity

06/01/2023
by Katharina Hämmerl, et al.

Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings, and typically display outlier dimensions. This seems to be true for both monolingual and multilingual models, although much less work has been done on the multilingual context. Why these outliers occur and how they affect the representations is still an active area of research. We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models. We focus on cross-lingual semantic similarity tasks, as these are natural tasks for evaluating multilingual representations. Specifically, we examine sentence representations. Sentence transformers which are fine-tuned on parallel resources (that are not always available) perform better on this task, and we show that their representations are more isotropic. However, we aim to improve multilingual representations in general. We investigate how much of the performance difference can be made up by only transforming the embedding space without fine-tuning, and visualise the resulting spaces. We test different operations: Removing individual outlier dimensions, cluster-based isotropy enhancement, and ZCA whitening. We publish our code for reproducibility.
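The abstract names three post-hoc transformations of the embedding space: removing individual outlier dimensions, cluster-based isotropy enhancement, and ZCA whitening. A minimal NumPy sketch of two of these, together with a common anisotropy estimate (mean pairwise cosine similarity of sentence vectors), is given below. All function and variable names are illustrative assumptions, not taken from the authors' code, and the synthetic data only simulates an outlier dimension.

```python
import numpy as np

def anisotropy(X):
    """Estimate anisotropy as the mean pairwise cosine similarity of rows
    (excluding self-similarity). Values near 0 indicate an isotropic space."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = X.shape[0]
    return (sims.sum() - n) / (n * (n - 1))

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten rows of X (n_sentences x dim): centre the data and rotate/scale
    so the empirical covariance becomes approximately the identity."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = np.cov(Xc, rowvar=False)
    U, S, _ = np.linalg.svd(cov)                    # cov is symmetric PSD
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # ZCA transform (stays in the original basis)
    return Xc @ W

def remove_outlier_dims(X, k=1):
    """Zero out the k dimensions with the largest mean absolute activation,
    a simple proxy for 'outlier dimensions'."""
    outliers = np.argsort(np.abs(X.mean(axis=0)))[-k:]
    Xr = X.copy()
    Xr[:, outliers] = 0.0
    return Xr

# Synthetic demo: a constant offset in one dimension inflates anisotropy.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))
emb[:, 5] += 4.0                   # simulated outlier dimension
white = zca_whiten(emb)            # anisotropy(white) is close to 0
```

After whitening, the mean pairwise cosine similarity drops to near zero, which is the sense in which such transforms "enhance isotropy"; whether that improves cross-lingual similarity scores is exactly what the paper evaluates.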


Related research
04/08/2021

Uppsala NLP at SemEval-2021 Task 2: Multilingual Language Models for Fine-tuning and Feature Extraction in Word-in-Context Disambiguation

We describe the Uppsala NLP submission to SemEval-2021 Task 2 on multili...
09/01/2021

Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast

In this paper, we propose to align sentence representations from differe...
05/30/2023

Stable Anisotropic Regularization

Given the success of Large Language Models (LLMs), there has been consid...
04/18/2022

Exploring Dimensionality Reduction Techniques in Multilingual Transformers

Both in scientific literature and in industry, semantic and context-awa...
09/10/2021

A Simple and Effective Method To Eliminate the Self Language Bias in Multilingual Representations

Language agnostic and semantic-language information isolation is an emer...
09/15/2019

Emu: Enhancing Multilingual Sentence Embeddings with Semantic Specialization

We present Emu, a system that semantically enhances multilingual sentenc...
09/16/2021

Locating Language-Specific Information in Contextualized Embeddings

Multilingual pretrained language models (MPLMs) exhibit multilinguality ...
