The Impact of Positional Encodings on Multilingual Compression

09/11/2021
by Vinit Ravishankar, et al.

To preserve word-order information in a non-autoregressive setting, transformer architectures inject positional knowledge, for instance by adding positional encodings to token embeddings. Several modifications of the sinusoidal positional encodings used in the original transformer architecture have been proposed; these include, for instance, separating position encodings from token embeddings, or directly modifying attention weights based on the distance between word pairs. We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them results in better multilingual language models. We then answer why this is: sinusoidal encodings were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps. The higher variance of multilingual training distributions requires greater compression, in which case compositionality becomes indispensable. Learned absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal embeddings in multilingual settings, but more complex positional encoding architectures lack the inductive bias to effectively learn compositionality and cross-lingual alignment. In other words, while sinusoidal positional encodings were originally designed for monolingual applications, they are particularly useful in multilingual language models.
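As a concrete illustration of the compositionality property the abstract refers to, below is a minimal NumPy sketch (not taken from the paper; function names and dimensions are illustrative assumptions). It implements the original sinusoidal positional encodings and numerically checks that, for any fixed offset k, PE(pos + k) is a fixed linear transformation of PE(pos), namely a block rotation R_k that does not depend on pos.

```python
import numpy as np

def sinusoidal_encoding(positions, d_model):
    """Sinusoidal encodings from the original transformer:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.asarray(positions, dtype=float)[:, None]        # (n, 1)
    i = np.arange(d_model // 2, dtype=float)[None, :]        # (1, d_model/2)
    angles = pos / (10000.0 ** (2 * i / d_model))            # (n, d_model/2)
    pe = np.empty((pos.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def offset_rotation(k, d_model):
    """Block-diagonal rotation R_k with PE(pos + k) = R_k @ PE(pos) for all pos.
    Each (sin, cos) pair is rotated by the angle k / 10000^(2i / d_model),
    which is what makes relative offsets expressible as linear projections."""
    theta = k / (10000.0 ** (2 * np.arange(d_model // 2, dtype=float) / d_model))
    R = np.zeros((d_model, d_model))
    R[0::2, 0::2] = np.diag(np.cos(theta))
    R[0::2, 1::2] = np.diag(np.sin(theta))
    R[1::2, 0::2] = np.diag(-np.sin(theta))
    R[1::2, 1::2] = np.diag(np.cos(theta))
    return R

# Numerical check of the linear-projection property.
d_model, k, n = 16, 3, 32
pe = sinusoidal_encoding(range(n), d_model)
R_k = offset_rotation(k, d_model)
assert np.allclose(pe[: n - k] @ R_k.T, pe[k:])   # PE(pos + k) == R_k @ PE(pos)
```

Per the abstract, learned absolute encodings (e.g., mBERT's) end up approximating this structure in multilingual settings, whereas more complex positional schemes lack the inductive bias to recover it.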


Related research

10/20/2020
Language Representation in Multilingual BERT and its applications to improve Cross-lingual Generalization
A token embedding in multilingual BERT (m-BERT) contains both language a...

08/27/2018
Improving Cross-Lingual Word Embeddings by Meeting in the Middle
Cross-lingual word embeddings are becoming increasingly important in mul...

07/19/2022
Multilingual Transformer Encoders: a Word-Level Task-Agnostic Evaluation
Some Transformer-based models can perform cross-lingual transfer learnin...

09/22/2022
MonoByte: A Pool of Monolingual Byte-level Language Models
The zero-shot cross-lingual ability of models pretrained on multilingual...

01/28/2023
Multilingual Sentence Transformer as A Multilingual Word Aligner
Multilingual pretrained language models (mPLMs) have shown their effecti...

05/22/2022
The Geometry of Multilingual Language Model Representations
We assess how multilingual language models maintain a shared multilingua...

06/08/2023
DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models
A few benchmarking datasets have been released to evaluate the factual k...
