Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words

05/10/2022
by Kaitlyn Zhou, et al.

Cosine similarity of contextual embeddings is used in many NLP tasks (e.g., QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words and provide a formal argument for the two-dimensional case.
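For reference, the measure under discussion can be sketched in a few lines. This is a minimal illustration of the standard cosine formula, not the authors' code. The function name `cosine_similarity` and the two-dimensional toy vectors are assumptions made here for illustration; they are not real BERT embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 2-D toy vectors standing in for two contextual embeddings.
emb_a = [3.0, 4.0]
emb_b = [4.0, 3.0]
sim = cosine_similarity(emb_a, emb_b)
```

Because both norms appear in the denominator, the score depends only on the angle between the two vectors and not on their lengths.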

