Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations

11/15/2017
by   John Wieting, et al.

We extend the work of Wieting et al. (2017), back-translating a large parallel corpus to produce more than 51 million English-English sentential paraphrase pairs in a dataset we call ParaNMT-50M. We find that this corpus covers many domains and styles of text and is rich in paraphrases with differing sentence structure, and we release it to the community. To show its utility, we use it to train paraphrastic sentence embeddings with only minor changes to the framework of Wieting et al. (2016b). The resulting embeddings outperform all supervised systems on every SemEval semantic textual similarity (STS) competition, and are a significant improvement over all other sentence embeddings in capturing paraphrastic similarity. We also show that our embeddings perform competitively on general sentence embedding tasks. We release the corpus, pretrained sentence embeddings, and code to generate them. We believe the corpus can be a valuable resource for automatic paraphrase generation and can provide a rich source of semantic information to improve downstream natural language understanding tasks.
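The framework of Wieting et al. (2016b) referenced above represents a sentence as the average of its word vectors and trains with a margin-based loss that pushes paraphrase pairs closer together than negative examples. A minimal sketch of that idea in NumPy, using toy random word vectors and illustrative function names of our own choosing (in the actual work, the vectors are trained on the ParaNMT-50M pairs):

```python
import numpy as np

# Toy word vectors; in practice these are learned from paraphrase pairs.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8) for w in
         "the cat sat on mat a feline rested rug dog ran".split()}

def embed(sentence):
    """Average the word vectors of a sentence (word-averaging model)."""
    vecs = [vocab[w] for w in sentence.split() if w in vocab]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def margin_loss(sent, paraphrase, negative, delta=0.4):
    """Hinge loss: the paraphrase pair should score at least
    `delta` higher than the negative example."""
    pos = cosine(embed(sent), embed(paraphrase))
    neg = cosine(embed(sent), embed(negative))
    return max(0.0, delta - pos + neg)

loss = margin_loss("the cat sat on the mat",
                   "a feline rested on a rug",
                   "the dog ran")
print(loss)
```

Minimizing this loss over millions of back-translated pairs is what pushes the averaged embeddings to reflect paraphrastic similarity; at test time, STS scores are simply cosine similarities between averaged vectors.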


