Separating the Human Touch from AI-Generated Text using Higher Criticism: An Information-Theoretic Approach

08/24/2023
by   Alon Kipnis, et al.

We propose a method to determine whether a given article was written entirely by a generative language model or whether it includes significant edits by a different author, possibly a human. The method applies many perplexity tests for the origin of individual sentences or other text atoms and combines these tests using Higher Criticism (HC). As a by-product, it identifies the parts suspected to have been edited. The method is motivated by the convergence of the log-perplexity to the cross-entropy rate and by a statistical model for edited text in which sentences are mostly generated by the language model, except perhaps for a few that might have originated via a different mechanism. We demonstrate the effectiveness of our method using real data and analyze the factors affecting its success. This analysis raises several interesting open challenges whose resolution may improve the method's effectiveness.
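
To make the step of combining many per-sentence tests concrete, here is a minimal sketch of a Higher Criticism computation over a list of p-values. It is an illustration under stated assumptions, not the authors' implementation: it assumes each sentence has already been given a p-value by some perplexity test against the language model (how those p-values are obtained is not shown), it uses the standard HC form that maximizes the standardized gap between i/n and the i-th smallest p-value, and the function names, the `gamma` search fraction, and the clipping constant are hypothetical choices made for this example.

```python
import math


def higher_criticism(pvalues, gamma=0.4):
    """Return (HC statistic, HC threshold p-value) for a list of p-values.

    Assumption: standard HC form, maximized over the smallest
    gamma-fraction of the sorted p-values.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    sorted_p = sorted(pvalues)
    max_i = max(1, int(gamma * n))
    best_hc, best_p = -math.inf, sorted_p[0]
    for i in range(1, max_i + 1):
        # Clip to avoid division by zero at p = 0 or p = 1.
        p = min(max(sorted_p[i - 1], 1e-12), 1 - 1e-12)
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1 - p))
        if z > best_hc:
            best_hc, best_p = z, sorted_p[i - 1]
    return best_hc, best_p


def suspected_atoms(pvalues, gamma=0.4):
    """Return (HC statistic, indices of p-values at or below the HC threshold)."""
    hc, threshold = higher_criticism(pvalues, gamma)
    flagged = [k for k, p in enumerate(pvalues) if p <= threshold]
    return hc, flagged


if __name__ == "__main__":
    # Hypothetical per-sentence p-values; the first two are unusually small.
    example = [0.001, 0.002, 0.5, 0.6, 0.7, 0.8, 0.9, 0.4, 0.3, 0.95]
    hc, flagged = suspected_atoms(example)
    print(f"HC = {hc:.2f}, suspected sentence indices = {flagged}")
```

A large HC value suggests that the article is not entirely model-generated, and the flagged indices point to the sentences most likely to have been edited. The cutoff for "large" would have to be calibrated, for example on documents known to be fully model-generated; that calibration is outside this sketch.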

