Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity

03/10/2020
by Ivan Vulić et al.

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering datasets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, because concepts are aligned across languages, we provide a suite of 66 cross-lingual semantic similarity datasets. Owing to its size and language coverage, Multi-SimLex opens up new opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, M-BERT and XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised cross-lingual word embeddings. We also present a step-by-step dataset creation protocol for building consistent, Multi-SimLex-style resources for additional languages. We make these contributions available via a website: the public release of the Multi-SimLex datasets, their creation protocol, strong baseline results, and in-depth analyses that can help guide future developments in multilingual lexical semantics and representation learning. The website is intended to encourage community effort in extending Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
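The figure of 66 cross-lingual datasets is consistent with pairing each of the 12 languages with every other one, since C(12, 2) = 66. The sketch below checks that count and then illustrates one common way to score a representation model on a similarity benchmark: Spearman's rank correlation between model cosine similarities and human ratings. The abstract does not name the metric, so that choice is an assumption here. The language labels, the vectors, the word pairs, and the ratings are all made-up placeholders for illustration, not Multi-SimLex data.

```python
from itertools import combinations
from math import comb, sqrt

NUM_LANGUAGES = 12  # stated in the abstract

# Placeholder labels: the abstract names only some of the 12 languages.
languages = [f"lang_{i:02d}" for i in range(1, NUM_LANGUAGES + 1)]
language_pairs = list(combinations(languages, 2))
num_cross_lingual = comb(NUM_LANGUAGES, 2)  # unordered pairs of languages


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy embeddings and toy human ratings (illustrative only, not real data).
toy_vectors = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "bus": [0.7, 0.5],
    "banana": [0.0, 1.0],
}
toy_gold = [("car", "automobile", 5.8), ("car", "bus", 3.5), ("car", "banana", 0.2)]

model_scores = [cosine(toy_vectors[a], toy_vectors[b]) for a, b, _ in toy_gold]
human_scores = [rating for _, _, rating in toy_gold]
rho = spearman(model_scores, human_scores)

print(len(language_pairs), num_cross_lingual)
print(round(rho, 3))
```

A rank-based correlation fits this setting because it only asks whether the model orders the pairs the way annotators do, regardless of how the two score scales differ.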


