Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
We extend the work of Wieting et al. (2017), back-translating a large parallel corpus to produce a dataset of more than 51 million English-English sentential paraphrase pairs, which we call ParaNMT-50M. We find that this corpus covers many domains and styles of text and is rich in paraphrases with differing sentence structure, and we release it to the community. To show its utility, we use it to train paraphrastic sentence embeddings with only minor changes to the framework of Wieting et al. (2016b). The resulting embeddings outperform all supervised systems on every SemEval semantic textual similarity (STS) competition and represent a significant improvement in capturing paraphrastic similarity over all other sentence embeddings. We also show that our embeddings perform competitively on general sentence embedding tasks. We release the corpus, the pretrained sentence embeddings, and the code to generate them. We believe the corpus can be a valuable resource for automatic paraphrase generation and can provide a rich source of semantic information to improve downstream natural language understanding tasks.
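As a rough illustration of the back-translation step described above, the sketch below pairs the English side of an English-Czech parallel corpus with a machine translation of its Czech side back into English, yielding candidate sentential paraphrase pairs. The Hugging Face MarianMT checkpoint Helsinki-NLP/opus-mt-cs-en and the toy sentences are stand-ins chosen for the example; the actual NMT system, parallel data, and filtering used to build ParaNMT-50M differ.

# Sketch: build English-English paraphrase pairs by back-translating the
# Czech side of an English-Czech parallel corpus. The MarianMT checkpoint
# is a stand-in, not the NMT system used to build ParaNMT-50M.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-cs-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def back_translate(czech_sentences):
    """Translate Czech sentences into English with beam search."""
    batch = tokenizer(czech_sentences, return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch, num_beams=4, max_length=128)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Toy English-Czech parallel corpus: (reference English, Czech translation).
parallel = [
    ("So, what's all this about?", "Tak o co tady jde?"),
    ("He was very tired.", "Byl velmi unaveny."),
]

references = [en for en, _ in parallel]
back_translations = back_translate([cs for _, cs in parallel])

# Each (reference, back-translation) pair is a candidate sentential paraphrase.
for ref, para in zip(references, back_translations):
    print(f"{ref}\t{para}")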