SubMix: Practical Private Prediction for Large-Scale Language Models

01/04/2022
by Antonio Ginart, et al.

Recent data-extraction attacks have shown that language models can memorize some training samples verbatim, a vulnerability that can compromise the privacy of the model's training data. In this work, we introduce SubMix, a practical protocol for private next-token prediction designed to prevent privacy violations by language models that were fine-tuned on a private corpus after pre-training on a public corpus. We show that SubMix limits the leakage of information that is unique to any individual user in the private corpus via a relaxation of group differentially private prediction. Importantly, SubMix admits a tight, data-dependent privacy accounting mechanism, which allows it to thwart existing data-extraction attacks while maintaining the utility of the language model. SubMix is the first protocol that maintains privacy even when publicly releasing tens of thousands of next-token predictions made by large transformer-based models such as GPT-2.


