Do Language Models Plagiarize?

03/15/2022
by JooYoung Lee et al.

Past literature has illustrated that language models do not fully understand the context and sensitivity of text and can sometimes memorize phrases or sentences present in their training sets. In this paper, we investigate whether they not only memorize but also plagiarize training samples when generating artificial texts. Our findings support that they, especially GPT-2, reuse particular pieces of text from the training corpus with or without obfuscation. We have four main results: 1) language models with more capacity plagiarize more; 2) fine-tuned language models demonstrate differing patterns of plagiarism based on characteristics of the auxiliary data; 3) sampling from truncated language modeling distributions tends to heighten the degree of plagiarism, as opposed to temperature sampling; and 4) plagiarism in language models can have serious privacy consequences. Overall, our work implies that future research on neural language models should take precautions to avoid models plagiarizing their training datasets.
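As a rough sketch of the two decoding strategies contrasted in result 3 (this is not the authors' code; the prompt and hyperparameter values are illustrative assumptions), the snippet below uses the Hugging Face transformers library to sample from GPT-2 with pure temperature sampling versus truncated top-k/top-p sampling:

    # Illustrative sketch, not the paper's experimental setup: contrasting
    # pure temperature sampling with truncated (top-k / top-p) sampling on GPT-2.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The history of natural language processing"  # arbitrary prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Pure temperature sampling: rescale the full next-token distribution.
    # top_k=0 and top_p=1.0 disable truncation, so no tokens are discarded.
    temp_out = model.generate(
        **inputs, do_sample=True, temperature=0.7,
        top_k=0, top_p=1.0, max_new_tokens=50,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Truncated sampling: keep only the top-k tokens / top-p nucleus, the
    # decoding regime the paper associates with a higher degree of plagiarism.
    trunc_out = model.generate(
        **inputs, do_sample=True, top_k=40, top_p=0.95, max_new_tokens=50,
        pad_token_id=tokenizer.eos_token_id,
    )

    print(tokenizer.decode(temp_out[0], skip_special_tokens=True))
    print(tokenizer.decode(trunc_out[0], skip_special_tokens=True))

The intuition, consistent with the abstract's claim, is that truncation discards the low-probability tail and concentrates sampling on the highest-probability continuations, which are exactly the ones most likely to reproduce memorized training text.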


Related research

05/07/2021 · Understanding by Understanding Not: Modeling Negation in Language Models
Negation is a core construction in natural language. Despite being very ...

12/24/2021 · Counterfactual Memorization in Neural Language Models
Modern neural language models widely used in tasks across NLP risk memor...

04/20/2023 · Analyzing FOMC Minutes: Accuracy and Constraints of Language Models
This research article analyzes the language used in the official stateme...

02/15/2022 · Quantifying Memorization Across Neural Language Models
Large language models (LMs) have been shown to memorize parts of their t...

05/11/2020 · Enabling Language Models to Fill in the Blanks
We present a simple approach for text infilling, the task of predicting ...

05/12/2022 · Sampling with Attribute-Related Information for Controlling Language Models
The dominant approaches for controlling language models are based on fin...

05/11/2023 · Autocorrelations Decay in Texts and Applicability Limits of Language Models
We show that the laws of autocorrelations decay in texts are closely rel...
