ProPILE: Probing Privacy Leakage in Large Language Models

07/04/2023
by   Siwon Kim, et al.
0

The rapid advancement and widespread use of large language models (LLMs) have raised significant concerns regarding the potential leakage of personally identifiable information (PII). These models are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data. This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage in LLM-based services. ProPILE lets data subjects formulate prompts based on their own PII to evaluate the level of privacy intrusion in LLMs. We demonstrate its application on the OPT-1.3B model trained on the publicly available Pile dataset. We show how hypothetical data subjects may assess the likelihood of their PII being included in the Pile dataset being revealed. ProPILE can also be leveraged by LLM service providers to effectively evaluate their own levels of PII leakage with more powerful prompts specifically tuned for their in-house models. This tool represents a pioneering step towards empowering the data subjects for their awareness and control over their own data on the web.

READ FULL TEXT

page 2

page 6

page 7

page 16

page 17

page 18

research
09/21/2023

Knowledge Sanitization of Large Language Models

We explore a knowledge sanitization approach to mitigate the privacy con...
research
12/31/2020

KART: Privacy Leakage Framework of Language Models Pre-trained with Clinical Records

Nowadays, mainstream natural language pro-cessing (NLP) is empowered by ...
research
07/04/2021

Survey: Leakage and Privacy at Inference Time

Leakage of data from publicly available Machine Learning (ML) models is ...
research
09/14/2023

Do Not Give Away My Secrets: Uncovering the Privacy Issue of Neural Code Completion Tools

Neural Code Completion Tools (NCCTs) have reshaped the field of software...
research
02/01/2023

Analyzing Leakage of Personally Identifiable Information in Language Models

Language Models (LMs) have been shown to leak information about training...
research
02/09/2023

Bag of Tricks for Training Data Extraction from Language Models

With the advance of language models, privacy protection is receiving mor...
research
12/13/2020

Leaking Sensitive Financial Accounting Data in Plain Sight using Deep Autoencoder Neural Networks

Nowadays, organizations collect vast quantities of sensitive information...

Please sign up or login with your details

Forgot password? Click here to reset