SynSciPass: detecting appropriate uses of scientific text generation

09/07/2022
by Domenic Rosati, et al.

Approaches to machine-generated text detection tend to focus on binary classification of human- versus machine-written text. In the scientific domain, where publishers might use these models to examine manuscripts under submission, misclassification has the potential to harm authors. Additionally, authors may use text generation models appropriately, for example through assistive technologies such as translation tools. In this setting, a binary classification scheme might flag appropriate uses of assistive text generation technology as simply machine generated, which is a cause for concern. In our work, we simulate this scenario by presenting a state-of-the-art detector trained on DAGPap22 with machine-translated passages from SciELO and find that the model performs no better than chance. Given this finding, we develop a framework for dataset development that provides a more nuanced approach to detecting machine-generated text by labeling the type of technology used, such as translation or paraphrase, resulting in the construction of SynSciPass. By training the same model that performed well on DAGPap22 on SynSciPass, we show that the model is not only more robust to domain shifts but also able to uncover the type of technology used to produce machine-generated text. Despite this, we conclude that current datasets are neither comprehensive nor realistic enough to understand how these models would perform in the wild, where manuscript submissions can come from many unknown or novel distributions; how they would perform on scientific full texts rather than short passages; and what might happen when there is a mix of appropriate and inappropriate uses of natural language generation.
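As a rough illustration of the difference between the two labeling schemes, the sketch below tags each passage with the type of technology used and shows how a binary scheme collapses those tags. This is not the SynSciPass schema: the label names (`human`, `translate`, `paraphrase`, `generate`), the `Passage` class, and the `to_binary` helper are all hypothetical, chosen only because the abstract mentions translation and paraphrase as example technology types.

```python
from dataclasses import dataclass

# Hypothetical label set; the abstract names only translation and paraphrase
# as examples, so the exact labels here are an assumption.
TECH_LABELS = ("human", "translate", "paraphrase", "generate")


@dataclass(frozen=True)
class Passage:
    """A text passage tagged with the technology assumed to have produced it."""
    text: str
    tech: str  # one of TECH_LABELS

    def __post_init__(self):
        # Reject labels outside the assumed label set.
        if self.tech not in TECH_LABELS:
            raise ValueError(f"unknown technology label: {self.tech!r}")


def to_binary(tech: str) -> str:
    """Collapse a technology label into the human/machine scheme."""
    return "human" if tech == "human" else "machine"


# Under the binary scheme, an assistive use (translation) and open-ended
# generation receive the same flag; the fine-grained label keeps them apart.
translated = Passage("A translated abstract.", "translate")
generated = Passage("A fully generated abstract.", "generate")
same_binary_flag = to_binary(translated.tech) == to_binary(generated.tech)
same_tech_label = translated.tech == generated.tech
```

Here `same_binary_flag` is true while `same_tech_label` is false, which is the distinction the abstract argues a detector should be able to make.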
