Frequency-based Distortions in Contextualized Word Embeddings

04/17/2021
by Kaitlyn Zhou et al.

How does word frequency in pre-training data affect the behavior of similarity metrics in contextualized BERT embeddings? Are there systematic ways in which some word relationships are exaggerated or understated? In this work, we explore the geometric characteristics of contextualized word embeddings with two novel tools: (1) an identity probe that predicts the identity of a word from its embedding; (2) the minimal bounding sphere of a word's contextualized representations. Our results reveal that words of high and low frequency differ significantly in their representational geometry. These differences introduce distortions: compared to human judgments, point estimates of embedding similarity (e.g., cosine similarity) can over- or underestimate the semantic similarity of two words, depending on how frequent those words are in the training data. This has downstream societal implications: BERT-Base has more trouble differentiating between South American and African countries than between North American and European ones. We find that these distortions persist in BERT-Multilingual, suggesting that they cannot easily be fixed with additional training data, which can itself introduce new distortions.
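To make the quantities in the abstract concrete, here is a minimal sketch (not the authors' released code) that collects contextualized BERT vectors for a word across a few user-supplied sentences, approximates the radius of their minimal bounding sphere with Ritter's algorithm as a stand-in for an exact bounding-sphere computation, and compares two words with a cosine point estimate. The model name, example sentences, and helper functions are illustrative assumptions.

```python
# Sketch only: bert-base-uncased, the example sentences, and the helper names
# below are assumptions for illustration, not the paper's released tooling.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()


def contextual_embeddings(word, sentences):
    """Collect final-layer BERT vectors for `word` in each sentence (single-wordpiece words)."""
    vectors = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        vectors += [hidden[i].numpy() for i, t in enumerate(tokens) if t == word]
    return np.stack(vectors)


def bounding_sphere_radius(points):
    """Approximate minimal-bounding-sphere radius (Ritter's algorithm)."""
    p = points[0]
    q = points[np.argmax(np.linalg.norm(points - p, axis=1))]
    r = points[np.argmax(np.linalg.norm(points - q, axis=1))]
    center, radius = (q + r) / 2.0, np.linalg.norm(q - r) / 2.0
    for x in points:                                            # grow sphere to cover outliers
        d = np.linalg.norm(x - center)
        if d > radius:
            radius = (radius + d) / 2.0
            center = center + (x - center) * (d - radius) / d
    return radius


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


dog = contextual_embeddings("dog", ["the dog barked all night", "a dog slept by the door"])
cat = contextual_embeddings("cat", ["the cat purred softly", "a cat slept by the door"])
print("bounding-sphere radius for 'dog':", bounding_sphere_radius(dog))
print("cosine(dog, cat):", cosine(dog.mean(axis=0), cat.mean(axis=0)))
```

In the paper itself, such quantities are computed over many naturally occurring contexts and compared against human similarity judgments; the two-sentence example above only illustrates the mechanics.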

Related research

05/10/2022  Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
11/15/2022  The Dependence on Frequency of Word Embedding Similarity Measures
11/13/2019  What do you mean, BERT? Assessing BERT as a Distributional Semantics Model
04/11/2022  Word Embeddings Are Capable of Capturing Rhythmic Similarity of Words
09/05/2020  Bio-inspired Structure Identification in Language Embeddings
05/15/2023  Unsupervised Sentence Representation Learning with Frequency-induced Adversarial Tuning and Incomplete Sentence Filtering
01/20/2022  Regional Negative Bias in Word Embeddings Predicts Racial Animus–but only via Name Frequency
