A Resource-Light Method for Cross-Lingual Semantic Textual Similarity

01/19/2018
by Goran Glavaš, et al.

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, rely on tools and resources (e.g., machine translation systems, syntactic parsers, or named entity recognizers) that do not exist for many languages (or language pairs). In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via the linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity that exploit bilingual embeddings and word alignments. Because it requires only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which a corpus large enough to learn monolingual word embeddings exists. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, and that it is stable across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex, resource-intensive state-of-the-art models for the respective tasks.
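The pipeline described in the abstract (project word vectors into the other language's space, align words by vector similarity, then score the text pair) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy 2-dimensional embeddings, the hand-picked projection matrix `M` (in practice it would be learned from the word translation pairs), the greedy one-to-one alignment, and the length-normalized score in `alignment_similarity`.

```python
import math

def project(matrix, vec):
    """Apply a linear map (list of rows) to a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def greedy_align(src_vecs, tgt_vecs):
    """Greedy one-to-one word alignment by descending cosine similarity.

    Hypothetical strategy chosen for this sketch only.
    """
    candidates = sorted(
        ((cosine(s, t), i, j)
         for i, s in enumerate(src_vecs)
         for j, t in enumerate(tgt_vecs)),
        reverse=True,
    )
    used_src, used_tgt, aligned = set(), set(), []
    for sim, i, j in candidates:
        if i not in used_src and j not in used_tgt:
            aligned.append((i, j, sim))
            used_src.add(i)
            used_tgt.add(j)
    return aligned

def alignment_similarity(src_vecs, tgt_vecs):
    """Sum of aligned-pair similarities, normalized by the average length.

    An illustrative score, not one of the measures evaluated in the paper.
    """
    aligned = greedy_align(src_vecs, tgt_vecs)
    if not aligned:
        return 0.0
    total = sum(sim for _, _, sim in aligned)
    return 2.0 * total / (len(src_vecs) + len(tgt_vecs))

# Toy monolingual embeddings (assumed values, two dimensions).
german = {"hund": [0.0, 1.0], "katze": [1.0, 0.0]}
english = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}

# Assumed projection matrix from the German space to the English space.
M = [[0.0, 1.0],
     [1.0, 0.0]]

src = [project(M, german[w]) for w in ["hund", "katze"]]
full_match = alignment_similarity(src, [english[w] for w in ["cat", "dog"]])
partial_match = alignment_similarity(src, [english["cat"]])
```

In this toy setup the fully matching pair scores higher than the pair in which one source word has no counterpart, which is the behavior an alignment-based measure is meant to capture.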

research
08/27/2018

Unsupervised Multilingual Word Embeddings

Multilingual Word Embeddings (MWEs) represent words from multiple langua...
research
05/13/2023

PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity

One of the components of natural language processing that has received a...
research
12/02/2020

A Computational Approach to Measuring the Semantic Divergence of Cognates

Meaning is the foundation stone of intercultural communication. Language...
research
04/18/2016

Clustering Comparable Corpora of Russian and Ukrainian Academic Texts: Word Embeddings and Semantic Fingerprints

We present our experience in applying distributional semantics (neural w...
research
10/05/2021

Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings

Semantic sentence embeddings are usually supervisedly built minimizing d...
research
07/11/2018

Linear Transformations for Cross-lingual Semantic Textual Similarity

Cross-lingual semantic textual similarity systems estimate the degree of...
research
01/30/2020

Lost in Embedding Space: Explaining Cross-Lingual Task Performance with Eigenvalue Divergence

Performance in cross-lingual NLP tasks is impacted by the (dis)similarit...
