I Am Not What I Write: Privacy Preserving Text Representation Learning

07/06/2019
by Ghazaleh Beigi, et al.

Online users generate tremendous amounts of textual data by participating in different activities, such as writing reviews and sharing tweets. This textual data provides opportunities for researchers and business partners to study and understand individuals. However, user-generated text can not only reveal the identity of its author but may also contain the individual's private information (e.g., age, location, gender); hence, "you are what you write," as the saying goes. Publishing such textual data thus compromises the privacy of the individuals who provided it, so data publishers need to protect people's privacy by anonymizing the data before publishing it. Designing effective anonymization techniques for text is challenging: the result should minimize the chance of re-identification and exclude users' sensitive information (high privacy) while retaining the semantic meaning of the data for given tasks (high utility). In this paper, we study this problem and propose DPText, a novel double privacy preserving text representation learning framework that learns a textual representation which (1) is differentially private, (2) does not contain private information, and (3) retains high utility for the given task. Evaluating on two natural language processing tasks, i.e., sentiment analysis and part-of-speech tagging, we show the effectiveness of this approach in preserving both privacy and utility.
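To make property (1) concrete: a standard way to obtain a differentially private vector representation is to add noise calibrated to a privacy budget epsilon. The sketch below is a minimal illustration of the classic Laplace mechanism applied to a fixed-dimensional text embedding; it is not the actual DPText mechanism (the paper's framework is adversarially trained and more involved), and the function name and parameters are hypothetical.

```python
import numpy as np

def privatize_embedding(embedding, epsilon, sensitivity=1.0, rng=None):
    """Illustrative Laplace mechanism for an embedding vector.

    Hypothetical sketch, not DPText's actual method: adds i.i.d.
    Laplace noise with scale b = sensitivity / epsilon to each
    coordinate, which yields epsilon-differential privacy when
    `sensitivity` bounds the L1 change of the embedding between
    neighboring inputs.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon  # smaller epsilon -> more noise
    noise = rng.laplace(loc=0.0, scale=scale, size=embedding.shape)
    return embedding + noise

# Example: privatize a 4-dimensional sentence embedding.
emb = np.array([0.2, -0.5, 1.1, 0.0])
private_emb = privatize_embedding(emb, epsilon=1.0)
```

Note the inherent privacy-utility trade-off the abstract refers to: a smaller epsilon gives stronger privacy but noisier representations, which is exactly the tension DPText aims to manage.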


