Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

08/17/2020
by   Dara Bahri, et al.

Large generative language models such as GPT-2 are well known for their ability to generate text and for their utility in supervised downstream tasks via fine-tuning. Our contribution is twofold. First, we demonstrate via human evaluation that classifiers trained to discriminate between human-written and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low-quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Second, curious to understand the prevalence and nature of low-quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
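
As a rough illustration of the idea in the abstract, the sketch below treats a detector's estimated probability that a page is machine-generated as an inverse quality score and ranks pages by it. Everything named here is an assumption made for illustration: `toy_machine_probability` is a hypothetical stand-in that simply measures token repetition, not the classifier used in the paper, and `rank_by_quality` is not taken from the authors' code.

```python
from collections import Counter
from typing import Callable, List, Tuple


def toy_machine_probability(text: str) -> float:
    """Hypothetical stand-in detector (NOT the paper's classifier).

    Returns the fraction of tokens that repeat an earlier token,
    a value in [0, 1]. A real setup would plug in a trained
    human-vs-machine text classifier here instead.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(tokens)


def rank_by_quality(
    pages: List[str],
    detector: Callable[[str], float] = toy_machine_probability,
) -> List[Tuple[float, str]]:
    """Score each page as 1 - P(machine-generated) and sort best first."""
    scored = [(1.0 - detector(page), page) for page in pages]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    pages = [
        "buy cheap buy cheap buy cheap",
        "the quick brown fox jumps",
    ]
    for score, page in rank_by_quality(pages):
        print(f"{score:.2f}  {page}")
```

The detector is passed in as a parameter so that the ranking step stays the same when the toy scorer is swapped for an actual classifier; no labeled quality data is needed at any point, which is the sense in which the predictor is unsupervised.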

research
06/03/2021

Fingerprinting Fine-tuned Language Models in the Wild

There are concerns that the ability of language models (LMs) to generate...
research
10/22/2022

Generative Prompt Tuning for Relation Classification

Using prompts to explore the knowledge contained within pre-trained lang...
research
12/24/2022

Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text

As text generated by large language models proliferates, it becomes vita...
research
04/29/2023

POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models

Through prompting, large-scale pre-trained models have become more expre...
research
08/26/2022

AutoQGS: Auto-Prompt for Low-Resource Knowledge-based Question Generation from SPARQL

This study investigates the task of knowledge-based question generation ...
research
05/23/2022

Looking for a Handsome Carpenter! Debiasing GPT-3 Job Advertisements

The growing capability and availability of generative language models ha...
research
12/12/2022

T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics

Modern embedding-based metrics for evaluation of generated text generall...
