Unsupervised Sentence Representation Learning with Frequency-induced Adversarial Tuning and Incomplete Sentence Filtering

05/15/2023
by Bing Wang, et al.

Pre-trained Language Models (PLMs) are nowadays the mainstay of Unsupervised Sentence Representation Learning (USRL). However, PLMs are sensitive to the frequency information of words in their pre-training corpora, which results in an anisotropic embedding space where the embeddings of high-frequency words cluster tightly while those of low-frequency words disperse sparsely. This anisotropy causes two problems, similarity bias and information bias, which lower the quality of sentence embeddings. To solve these problems, we fine-tune PLMs by leveraging word frequency information and propose a novel USRL framework, namely Sentence representation Learning with Frequency-induced Adversarial tuning and Incomplete sentence filtering (SLT-FAI). We compute word frequencies over the pre-training corpora of PLMs and assign each word a binary frequency label by thresholding. With these labels, (1) we introduce a similarity discriminator that distinguishes the embeddings of high-frequency and low-frequency words, and adversarially tune the PLM against it so as to achieve a uniform, frequency-invariant embedding space; and (2) we propose a novel incomplete sentence detection task, in which an information discriminator distinguishes the embeddings of original sentences from those of incomplete sentences produced by randomly masking several low-frequency words, so that the PLM learns to emphasize the more informative low-frequency words. SLT-FAI is a flexible, plug-and-play framework that can be integrated with existing USRL techniques. We evaluate SLT-FAI with various backbones on benchmark datasets, and the empirical results indicate that SLT-FAI is superior to existing USRL baselines. Our code is released at <https://github.com/wangbing1416/SLT-FAI>.
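To make the two auxiliary objectives in the abstract concrete, below is a minimal PyTorch sketch of (0) thresholded frequency labels, (1) adversarial tuning against a frequency discriminator via gradient reversal, and (2) building incomplete sentences by masking low-frequency words. All names here (FREQ_THRESHOLD, GradReverse, make_incomplete, the discriminator architecture) are hypothetical illustrations, not the authors' implementation; the actual code is in the linked repository.

```python
# Hypothetical sketch of SLT-FAI's auxiliary objectives; names and
# hyperparameters are assumptions, not the authors' released code.
import random
from collections import Counter

import torch
import torch.nn as nn

# --- (0) thresholded frequency labels from a (toy) pre-training corpus ---
corpus = ["the cat sat on the mat", "the dog barked at the cat"]
counts = Counter(w for sent in corpus for w in sent.split())
FREQ_THRESHOLD = 2  # assumed cutoff; the paper thresholds real corpus counts
freq_label = {w: int(c >= FREQ_THRESHOLD) for w, c in counts.items()}  # 1 = high-freq

# --- (1) adversarial tuning via gradient reversal ---
class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips gradients in the backward pass,
    so the encoder is trained to *fool* the frequency discriminator."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

# similarity discriminator over 768-d embeddings (assumed BERT-base size)
discriminator = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

def adversarial_loss(word_embeddings, labels):
    """The discriminator learns to predict high/low frequency; the reversed
    gradient pushes the encoder toward frequency-invariant embeddings."""
    logits = discriminator(GradReverse.apply(word_embeddings))
    return nn.functional.cross_entropy(logits, labels)

# --- (2) incomplete sentences by masking low-frequency words ---
def make_incomplete(sentence, mask_token="[MASK]", p=0.5):
    """Randomly mask low-frequency words; an information discriminator is
    then trained to tell original sentences from incomplete ones."""
    return " ".join(
        mask_token if freq_label.get(w, 0) == 0 and random.random() < p else w
        for w in sentence.split()
    )

emb = torch.randn(8, 768)        # stand-in for PLM word embeddings
lbl = torch.randint(0, 2, (8,))  # stand-in frequency labels
print(adversarial_loss(emb, lbl).item())
print(make_incomplete("the cat sat on the mat"))
```

In this sketch, gradient reversal is one standard way to realize the adversarial game in a single backward pass; the paper's framework could equally be trained with alternating discriminator/encoder updates.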


Related research

10/18/2022 · Sentiment-Aware Word and Sentence Level Pre-training for Sentiment Analysis
Most existing pre-trained language representation models (PLMs) are sub-...

10/08/2022 · InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings
Contrastive learning has been extensively studied in sentence embedding ...

05/10/2022 · Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
Cosine similarity of contextual embeddings is used in many NLP tasks (e....

01/19/2023 · JCSE: Contrastive Learning of Japanese Sentence Embeddings and Its Applications
Contrastive learning is widely used for sentence representation learning...

04/17/2021 · Frequency-based Distortions in Contextualized Word Embeddings
How does word frequency in pre-training data affect the behavior of simi...

04/28/2021 · MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories
Automated metaphor detection is a challenging task to identify metaphori...

09/30/2019 · A Critique of the Smooth Inverse Frequency Sentence Embeddings
We critically review the smooth inverse frequency sentence embedding met...
