Privacy Implications of Retrieval-Based Language Models

05/24/2023
by Yangsibo Huang, et al.

Retrieval-based language models (LMs) have demonstrated improved interpretability, factuality, and adaptability compared to their parametric counterparts by incorporating retrieved text from external datastores. While it is well known that parametric models are prone to leaking private data, it remains unclear how the addition of a retrieval datastore impacts model privacy. In this work, we present the first study of privacy risks in retrieval-based LMs, particularly kNN-LMs. Our goal is to explore the optimal design and training procedure in domains where privacy is of concern, aiming to strike a balance between utility and privacy. Crucially, we find that kNN-LMs are more susceptible to leaking private information from their private datastore than parametric models. We further explore mitigations of privacy risks. When private information is targeted and readily detected in the text, we find that a simple sanitization step completely eliminates the risks, while decoupling the query and key encoders achieves an even better utility-privacy trade-off. Otherwise, we consider strategies of mixing public and private data in both the datastore and encoder training. While these methods offer modest improvements, they leave considerable room for future work. Together, our findings provide insights for practitioners to better understand and mitigate privacy risks in retrieval-based LMs. Our code is available at: https://github.com/Princeton-SysML/kNNLM_privacy .
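To make the leakage channel concrete, a kNN-LM predicts the next token by interpolating the parametric LM's distribution with a distribution built from nearest-neighbor lookups in a (context embedding, next token) datastore. Below is a minimal sketch of that interpolation; the function name, array shapes, and default hyperparameters (`k`, `lam`, `temp`) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def knn_lm_next_token(query, keys, values, p_lm, vocab_size, k=3, lam=0.25, temp=1.0):
    """Sketch of kNN-LM interpolation (hypothetical helper, not the paper's code).

    query:  (d,) context embedding for the current position
    keys:   (n, d) stored context embeddings, one per datastore token
    values: (n,) next-token ids paired with each key
    p_lm:   (vocab_size,) parametric LM distribution over the next token
    """
    # Squared L2 distance from the query to every stored key.
    dists = np.sum((keys - query) ** 2, axis=1)
    nn = np.argsort(dists)[:k]  # indices of the k nearest keys
    # Softmax over negative distances -> weights for the retrieved neighbors.
    logits = -dists[nn] / temp
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Aggregate neighbor weights by their paired next-token id.
    p_knn = np.zeros(vocab_size)
    for w, v in zip(weights, values[nn]):
        p_knn[v] += w
    # Final distribution: convex mix of retrieval and parametric scores.
    return lam * p_knn + (1.0 - lam) * p_lm
```

Note how a private string stored verbatim in the datastore can dominate `p_knn` whenever a query closely matches its stored context, which is the kind of direct leakage path a parametric model lacks.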

