Compass-aligned Distributional Embeddings for Studying Semantic Differences across Corpora

04/13/2020
by Federico Bianchi et al.

Word2vec is one of the most widely used algorithms for generating word embeddings because of its good mix of efficiency, quality of the generated representations, and cognitive grounding. However, word meaning is not static: it depends on the context in which words are used. Differences in word meaning that depend on time, location, topic, and other factors can be studied by analyzing embeddings generated from corpora that are representative of these factors. For example, language evolution can be studied using a collection of news articles published in different time periods. In this paper, we present a general framework to support cross-corpora language studies with word embeddings, in which embeddings generated from different corpora can be compared to find correspondences and differences in meaning across the corpora. CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora. In particular, we focus on providing solid evidence of the effectiveness, generality, and robustness of CADE. To this end, we conduct quantitative and qualitative experiments in different domains, from temporal word embeddings to language localization and topical analysis. The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available, while providing a general method that can be used in a variety of domains. Finally, our experiments shed light on the conditions under which the alignment is reliable; these depend substantially on the degree of cross-corpora vocabulary overlap.
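Once embeddings trained on different corpora share a common coordinate system, as the alignment produced by CADE provides, cross-corpora semantic differences can be measured directly, for instance as the cosine distance between the vectors a word receives in two corpora. The following is a minimal sketch of that comparison step in plain Python; the word "cloud" and its two three-dimensional vectors are hypothetical toy values, not output of the actual method, which operates on full word2vec models.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical aligned embeddings of the word "cloud" from two corpora
# (e.g. 1990s news vs. 2010s tech articles). Because both vectors live
# in the same aligned space, they are directly comparable.
cloud_1990s = [0.9, 0.1, 0.0]   # toy vector: near weather-related words
cloud_2010s = [0.2, 0.8, 0.3]   # toy vector: near computing-related words

# Cosine distance as a simple measure of cross-corpora semantic shift.
shift = 1 - cosine(cloud_1990s, cloud_2010s)
print(f"semantic shift of 'cloud': {shift:.3f}")
```

Note that this comparison is only meaningful after alignment: vectors from two independently trained word2vec models live in arbitrarily rotated spaces, which is exactly the problem CADE addresses.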

