The Dependence on Frequency of Word Embedding Similarity Measures

11/15/2022
by Francisco Valentini, et al.

Recent research has shown that static word embeddings can encode word frequency information. However, this phenomenon and its effects on downstream tasks have received little study. In the present work, we systematically study the association between frequency and semantic similarity in several static word embeddings. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled. This demonstrates that the patterns found are not due to real semantic associations present in the texts, but are an artifact produced by the word embeddings. Finally, we provide an example of how word frequency can strongly affect the measurement of gender bias with embedding-based metrics. In particular, we carry out a controlled experiment showing that biases can change sign or reverse their order when word frequencies are manipulated.
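
The shuffling control mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so the code below is an assumption: it permutes all tokens across a toy corpus while keeping sentence lengths, which preserves each word's frequency but destroys real co-occurrence structure. The helper names `shuffle_corpus`, `word_counts`, and `cosine` are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def shuffle_corpus(sentences, seed=0):
    """Permute all tokens across the corpus, keeping each sentence's length.

    Assumption: this is one plausible way to shuffle words; the paper's
    actual procedure may differ.
    """
    rng = random.Random(seed)
    tokens = [tok for sent in sentences for tok in sent]
    rng.shuffle(tokens)
    shuffled, start = [], 0
    for sent in sentences:
        shuffled.append(tokens[start:start + len(sent)])
        start += len(sent)
    return shuffled


def word_counts(sentences):
    """Count how often each word occurs in a tokenized corpus."""
    return Counter(tok for sent in sentences for tok in sent)


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy corpus, for illustration only.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran", "far"],
    ["a", "cat", "ran"],
]
shuffled = shuffle_corpus(corpus, seed=42)

# Word frequencies are unchanged by the shuffle.
print(word_counts(shuffled) == word_counts(corpus))
```

In an experiment along these lines, embeddings would be trained on both the original and the shuffled corpus, and `cosine` would then be compared across word pairs grouped by frequency. The training step is omitted here because it depends on the embedding library used.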

research
01/02/2023

The Undesirable Dependence on Frequency of Gender Bias Metrics Based on Word Embeddings

Numerous works use word embedding-based metrics to quantify societal bia...
research
12/14/2020

Model Choices Influence Attributive Word Associations: A Semi-supervised Analysis of Static Word Embeddings

Static word embeddings encode word associations, extensively utilized in...
research
01/20/2022

Regional Negative Bias in Word Embeddings Predicts Racial Animus–but only via Name Frequency

The word embedding association test (WEAT) is an important method for me...
research
10/05/2016

Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database

Word embeddings have been extensively studied in large text datasets. Ho...
research
06/07/2022

Gender Bias in Word Embeddings: A Comprehensive Analysis of Frequency, Syntax, and Semantics

The statistical regularities in language corpora encode well-known socia...
research
04/17/2021

Frequency-based Distortions in Contextualized Word Embeddings

How does word frequency in pre-training data affect the behavior of simi...
research
05/10/2022

Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words

Cosine similarity of contextual embeddings is used in many NLP tasks (e....
