A Continuously Growing Dataset of Sentential Paraphrases

08/01/2017
by   Wuwei Lan, et al.

A major challenge in paraphrase research is the lack of parallel corpora. In this paper, we present a new method to collect large-scale sentential paraphrases from Twitter by linking tweets through shared URLs. The main advantage of our method is its simplicity: it removes the classifier or human in the loop that previous work needed to select data before annotation and before applying paraphrase identification algorithms. We present the largest human-labeled paraphrase corpus to date, with 51,524 sentence pairs, and the first cross-domain benchmarking for automatic paraphrase identification. In addition, we show that more than 30,000 new sentential paraphrases can be easily and continuously captured every month at roughly 70% precision, and we demonstrate their utility for downstream NLP tasks through phrasal paraphrase extraction. We make our code and data freely available.
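
The idea of linking tweets through shared URLs can be illustrated with a minimal sketch. This is not the authors' released code: the function name `candidate_pairs`, the URL regex, and the sample tweets are all assumptions made for illustration, and the sketch only produces candidate pairs, which would still need labeling or a paraphrase identification model.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumption: a simple pattern is enough to spot URLs in the illustrative data.
URL_PATTERN = re.compile(r"https?://\S+")


def candidate_pairs(tweets):
    """Pair up tweets that contain the same URL (hypothetical helper)."""
    by_url = defaultdict(list)
    for text in tweets:
        urls = set(URL_PATTERN.findall(text))
        # Remove the URLs so that only the sentence itself is kept.
        sentence = URL_PATTERN.sub("", text).strip()
        for url in urls:
            # Skip empty sentences and exact duplicates under the same URL.
            if sentence and sentence not in by_url[url]:
                by_url[url].append(sentence)

    pairs = []
    for url, sentences in by_url.items():
        # Every pair of distinct sentences sharing a URL is a candidate.
        for first, second in combinations(sentences, 2):
            pairs.append((url, first, second))
    return pairs


if __name__ == "__main__":
    sample = [
        "Storm hits coast http://t.co/a",
        "Coast battered by storm http://t.co/a",
        "Unrelated news http://t.co/b",
    ]
    for url, first, second in candidate_pairs(sample):
        print(url, "|", first, "|", second)
```

Grouping by URL first keeps the pairing step restricted to tweets that plausibly describe the same content, which is the property the abstract relies on to avoid a separate data-selection step.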
