Frequency-aware Dimension Selection for Static Word Embedding by Mixed Product Distance

by Lingfeng Shen, et al.

Static word embeddings remain useful, particularly for context-unavailable tasks, where pre-trained language models often perform worse than static word embeddings. Although dimension is a key factor determining the quality of a static word embedding, automatic dimension selection is rarely discussed. In this paper, we investigate the impact of word frequency on dimension selection and empirically find that word frequency is so vital that it must be taken into account. Based on this finding, we propose a dimension selection method that uses a metric, the Mixed Product Distance (MPD), to choose a proper dimension for a word embedding algorithm without training any word embedding. By applying a post-processing function to oracle matrices, the MPD-based method de-emphasizes the impact of word frequency. Experiments on both context-unavailable and context-available tasks demonstrate that our MPD-based dimension selection achieves a better efficiency-performance trade-off than baseline methods.
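The abstract does not define the Mixed Product Distance itself, but the overall selection procedure it describes — scoring candidate dimensions against an oracle matrix without training any embedding — can be sketched as follows. Here `frobenius_gap` is a hypothetical stand-in metric (not the paper's MPD), and the oracle matrix is assumed to be something like a PPMI or co-occurrence matrix; the paper's frequency-de-emphasizing post-processing step is noted but not implemented.

```python
import numpy as np

def select_dimension(oracle, candidate_dims, metric):
    """Pick the embedding dimension that minimizes `metric` between the
    oracle matrix and its rank-d truncated SVD -- no embedding is trained.
    (In the paper, a post-processing function would first be applied to
    `oracle` to de-emphasize word frequency; that step is omitted here.)"""
    U, s, Vt = np.linalg.svd(oracle, full_matrices=False)
    scores = {}
    for d in candidate_dims:
        # The rank-d reconstruction stands in for a d-dimensional embedding.
        approx = (U[:, :d] * s[:d]) @ Vt[:d, :]
        scores[d] = metric(oracle, approx)
    return min(scores, key=scores.get), scores

def frobenius_gap(oracle, approx):
    # Placeholder metric, NOT the paper's MPD: plain reconstruction error.
    return np.linalg.norm(oracle - approx)

# Usage on a toy "oracle" matrix:
rng = np.random.default_rng(0)
M = rng.random((10, 10))
best, scores = select_dimension(M, [2, 4, 8], frobenius_gap)
```

With a pure reconstruction-error metric the largest candidate dimension always wins; the point of a metric like MPD is precisely to trade reconstruction fidelity against overfitting and frequency effects so that an intermediate dimension can be optimal.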




