Paraphrase Detection: Human vs. Machine Content

03/24/2023
by Jonas Becker, et al.

The growing prominence of large language models, such as GPT-4 and ChatGPT, has led to increased concerns over academic integrity due to the potential for machine-generated content and paraphrasing. Although studies have explored the detection of human- and machine-paraphrased content, the comparison between these types of content remains underexplored. In this paper, we conduct a comprehensive analysis of various datasets commonly employed for paraphrase detection tasks and evaluate an array of detection methods. Our findings highlight the strengths and limitations of different detection methods in terms of performance on individual datasets, revealing a lack of suitable machine-generated datasets that can be aligned with human expectations. Our main finding is that human-authored paraphrases exceed machine-generated ones in terms of difficulty, diversity, and similarity, implying that automatically generated texts are not yet on par with human-level performance. Transformers emerged as the most effective method across datasets, with TF-IDF excelling on semantically diverse corpora. Additionally, we identify four datasets as the most diverse and challenging for paraphrase detection.
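
The abstract names TF-IDF as one of the evaluated detection methods but does not describe how it is applied. As a rough illustration only, the sketch below scores a text pair by the cosine similarity of their TF-IDF vectors and flags a paraphrase above a threshold. The function names, the smoothed IDF formula, and the 0.5 threshold are all assumptions made for this example, not the authors' setup.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document.

    Uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, chosen here for
    illustration so that terms seen in every document keep a nonzero weight.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(text_a, text_b, background=()):
    """TF-IDF cosine similarity of a pair; `background` texts only shape the IDF."""
    vectors = tfidf_vectors([text_a, text_b, *background])
    return cosine(vectors[0], vectors[1])


def is_paraphrase(text_a, text_b, background=(), threshold=0.5):
    """Flag the pair as a paraphrase if similarity reaches the (assumed) threshold."""
    return similarity(text_a, text_b, background) >= threshold
```

A lexical baseline like this depends on word overlap, so a paraphrase that swaps most of the vocabulary will score low. In practice the threshold would be tuned on labeled pairs rather than fixed in advance.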


