Provably Confidential Language Modelling

05/04/2022
by Xuandong Zhao, et al.

Large language models have been shown to memorize private information, such as social security numbers, present in their training data. Given the sheer scale of the training corpus, it is challenging to screen and filter out such private data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method for training language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method provably prevents unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that models trained with CRT achieve almost the same perplexity while preserving strong confidentiality.
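The abstract names two ingredients: redacting confidential segments flagged by a screening policy, and randomizing part of the training process in the style of differential privacy. Below is a minimal, hypothetical sketch of how those two pieces could fit together, not the authors' implementation: flagged tokens are replaced with a <REDACTED> placeholder, and any batch that triggered the screener receives a DP-SGD-style clipped-and-noised update instead of a plain gradient step. The screening rule `looks_confidential`, the token ids, the model, and all hyperparameters are made-up placeholders for illustration.

```python
# Illustrative sketch of redaction + selectively randomized training.
# Everything here (screener, ids, hyperparameters) is a stand-in, not
# the CRT paper's actual algorithm or parameter choices.
import torch
import torch.nn as nn

VOCAB, REDACT_ID = 1000, 1           # toy vocabulary; id 1 = <REDACTED>

def looks_confidential(token_id: int) -> bool:
    """Hypothetical screening policy: flag a toy range of token ids
    as private (a stand-in for, e.g., an SSN detector)."""
    return 900 <= token_id < VOCAB

def redact(seq: torch.Tensor) -> tuple[torch.Tensor, bool]:
    """Replace flagged tokens with <REDACTED>; report whether any were found."""
    mask = torch.tensor([looks_confidential(t.item()) for t in seq])
    return torch.where(mask, torch.full_like(seq, REDACT_ID), seq), bool(mask.any())

class TinyLM(nn.Module):
    """A tiny LSTM language model, just enough to run the sketch."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.rnn = nn.LSTM(32, 64, batch_first=True)
        self.out = nn.Linear(64, VOCAB)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

def crt_style_step(model, batch, lr=0.1, clip=1.0, sigma=0.5):
    """One training step in the spirit of the abstract: a noisy, clipped
    update if the batch contained confidential tokens, plain SGD otherwise."""
    redacted, flags = zip(*(redact(s) for s in batch))
    x = torch.stack(redacted)
    loss_fn = nn.CrossEntropyLoss()
    if any(flags):
        # DP-SGD-style step: per-example gradients, clipped, summed, noised.
        grads = [torch.zeros_like(p) for p in model.parameters()]
        for seq in x:
            model.zero_grad()
            logits = model(seq[None, :-1])
            loss_fn(logits.reshape(-1, VOCAB), seq[None, 1:].reshape(-1)).backward()
            norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
            scale = min(1.0, (clip / (norm + 1e-12)).item())
            for g, p in zip(grads, model.parameters()):
                g.add_(p.grad, alpha=scale)
        with torch.no_grad():
            for g, p in zip(grads, model.parameters()):
                p -= lr * (g + sigma * clip * torch.randn_like(g)) / len(x)
    else:
        # Ordinary SGD step on a batch the screener judged safe.
        model.zero_grad()
        logits = model(x[:, :-1])
        loss_fn(logits.reshape(-1, VOCAB), x[:, 1:].reshape(-1)).backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad

model = TinyLM()
batch = torch.randint(0, VOCAB, (4, 16))  # toy batch of token sequences
crt_style_step(model, batch)
```

The split between the noisy step and the plain step mirrors the abstract's point that only parts of the training process need to be randomized; the formal confidentiality guarantee, and how redaction amplifies it, are developed in the full paper.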

Related research

02/11/2022
What Does it Mean for a Language Model to Preserve Privacy?
Natural language reflects our private lives and identities, making its p...

10/06/2022
Q-LSTM Language Model – Decentralized Quantum Multilingual Pre-Trained Language Model for Privacy Protection
Large-scale language models are trained on a massive amount of natural l...

08/30/2021
Selective Differential Privacy for Language Modeling
With the increasing adoption of language models in applications involvin...

05/02/2023
Mitigating Approximate Memorization in Language Models via Dissimilarity Learned Policy
Large Language models (LLMs) are trained on large amounts of data, which...

05/22/2022
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
Despite their wide adoption, the underlying training and memorization dy...

06/07/2023
Privately generating tabular data using language models
Privately generating synthetic data from a table is an important brick o...

10/31/2022
Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy
Studying data memorization in neural language models helps us understand...
