Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity

08/29/2022
by   Taehee Kim, et al.
2

Semantically meaningful sentence embeddings are important for numerous tasks in natural language processing. To obtain such embeddings, recent studies explored the idea of utilizing synthetically generated data from pretrained language models (PLMs) as a training corpus. However, PLMs often generate sentences much different from the ones written by human. We hypothesize that treating all these synthetic examples equally for training deep neural networks can have an adverse effect on learning semantically meaningful embeddings. To analyze this, we first train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences. Based on this, we propose a novel approach that first trains the classifier to measure the importance of each sentence. The distilled information from the classifier is then used to train a reliable sentence embedding model. Through extensive evaluation on four real-world datasets, we demonstrate that our model trained on synthetic data generalizes well and outperforms the existing baselines. Our implementation is publicly available at https://github.com/ddehun/coling2022_reweighting_sts.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/20/2022

Judge a Sentence by Its Content to Generate Grammatical Errors

Data sparsity is a well-known problem for grammatical error correction (...
research
04/22/2022

MCSE: Multimodal Contrastive Learning of Sentence Embeddings

Learning semantically meaningful sentence embeddings is an open problem ...
research
11/19/2022

Entity-Assisted Language Models for Identifying Check-worthy Sentences

We propose a new uniform framework for text classification and ranking t...
research
05/24/2023

Contrastive Learning of Sentence Embeddings from Scratch

Contrastive learning has been the dominant approach to train state-of-th...
research
10/22/2018

BioSentVec: creating sentence embeddings for biomedical texts

Sentence embeddings have become an essential part of today's natural lan...
research
11/05/2020

Detecting Hallucinated Content in Conditional Neural Sequence Generation

Neural sequence models can generate highly fluent sentences but recent s...
research
09/13/2021

Show Me How To Revise: Improving Lexically Constrained Sentence Generation with XLNet

Lexically constrained sentence generation allows the incorporation of pr...

Please sign up or login with your details

Forgot password? Click here to reset