Unsupervised and Distributional Detection of Machine-Generated Text

11/04/2021
by   Matthias Gallé, et al.

The power of natural language generation models has provoked a flurry of interest in automatic methods for detecting whether a piece of text is human- or machine-authored. So far, the problem has been framed in a standard supervised way: a classifier is trained on annotated data to predict the origin of a given new document. In this paper, we frame the problem in an unsupervised and distributional way: we assume that we have access to a large collection of unannotated documents, a large fraction of which is machine-generated. We propose a method to detect those machine-generated documents by leveraging repeated higher-order n-grams, which we show appear more often in machine-generated text than in human-written text. That weak signal is the starting point of a self-training setting in which pseudo-labelled documents are used to train an ensemble of classifiers. Our experiments show that leveraging that signal allows us to rank suspicious documents accurately. Precision at 5000 is over 90% with top-k sampling strategies, and over 80% with nucleus sampling, for the largest model we used (GPT2-large). The drop in precision as model size increases is small, which could indicate that the results hold for other current and future large language models.
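
To make the weak signal described in the abstract concrete, here is a minimal Python sketch of one possible reading of it: score each document by how many of its n-grams also occur in other documents of the collection, rank by that score, and measure precision at k against known labels. The n-gram order (n=4), the whitespace tokenization, the scoring rule, and the function names are all assumptions made for illustration; they are not taken from the paper, and the self-training ensemble is not shown.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def rank_by_repeated_ngrams(documents, n=4):
    """Rank document indices by how many of their n-grams recur elsewhere.

    Assumption: an n-gram counts as "repeated" when it occurs in more
    than one document of the collection.
    """
    doc_ngrams = [ngrams(doc.lower().split(), n) for doc in documents]
    # Document frequency: in how many documents does each n-gram occur?
    doc_freq = Counter(g for grams in doc_ngrams for g in grams)
    scores = [sum(1 for g in grams if doc_freq[g] > 1) for grams in doc_ngrams]
    # Highest score first; sorted() is stable, so ties keep input order.
    ranking = sorted(range(len(documents)), key=lambda i: -scores[i])
    return ranking, scores


def precision_at_k(ranking, is_machine, k):
    """Fraction of the top-k ranked documents labelled machine-generated (k >= 1)."""
    top = ranking[:k]
    return sum(is_machine[i] for i in top) / len(top)


if __name__ == "__main__":
    # Toy collection (hypothetical): the first two documents share a 4-gram.
    docs = ["a b c d e", "x a b c d", "p q r s t", "u v w"]
    labels = [True, True, False, False]
    ranking, scores = rank_by_repeated_ngrams(docs, n=4)
    print(scores)
    print(precision_at_k(ranking, labels, k=2))
```

In a pipeline like the one the abstract outlines, the top of such a ranking would serve only as noisy pseudo-labels for training classifiers, not as a final decision.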


research
11/02/2019

Human and Automatic Detection of Generated Text

With the advent of generative models with a billion parameters or more, ...
research
01/31/2019

Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Polylingual Text Classification

Polylingual Text Classification (PLC) consists of automatically classify...
research
10/15/2020

Neural Deepfake Detection with Factual Structure of Text

Deepfake detection, the task of automatically discriminating machine-gen...
research
05/24/2023

Ghostbuster: Detecting Text Ghostwritten by Large Language Models

We introduce Ghostbuster, a state-of-the-art system for detecting AI-gen...
research
01/31/2019

Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and its Application to Cross-Lingual Text Classification

Cross-lingual Text Classification (CLC) consists of automatically classi...
research
07/26/2019

Supervised and unsupervised neural approaches to text readability

We present a set of novel neural supervised and unsupervised approaches ...
research
08/10/2020

Navigating Human Language Models with Synthetic Agents

Modern natural language models such as the GPT-2/GPT-3 contain tremendou...
