Generalised Differential Privacy for Text Document Processing

11/26/2018
by   Natasha Fernandes, et al.

We address the problem of how to "obfuscate" texts by removing stylistic clues that can identify authorship, whilst preserving (as much as possible) the content of the text. In this paper we combine ideas from "generalised differential privacy" and machine learning techniques for text processing to model privacy for text documents. We define a privacy mechanism that operates at the level of text documents represented as "bags-of-words"; these representations are typical in machine learning and contain sufficient information to carry out many kinds of classification tasks, including topic identification and authorship attribution (of the original documents). We show that our mechanism satisfies privacy with respect to a metric for semantic similarity, thereby providing a balance between utility, defined by the semantic content of texts, and the obfuscation of stylistic clues. We demonstrate our implementation on a "fan fiction" dataset, confirming that it is indeed possible to disguise writing style effectively whilst preserving enough information and variation for accurate content classification tasks.
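To make the idea of a bag-of-words mechanism that is private with respect to a semantic metric more concrete, the following is a minimal sketch and not the authors' implementation, whose details the abstract does not give. It assumes a common construction from the metric differential privacy literature: each word is mapped to an embedding vector, noise with density proportional to exp(-epsilon * ||z||) is added, and the noisy point is mapped back to the nearest vocabulary word. The tiny 2-D `EMBEDDINGS` table and the names `sample_noise`, `nearest_word` and `obfuscate_bag` are hypothetical and invented for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical 2-D word embeddings, invented purely for illustration.
# A real system would use learned embeddings with hundreds of dimensions.
EMBEDDINGS = {
    "wizard": (0.9, 0.8),
    "sorcerer": (1.0, 0.7),
    "wand": (0.7, 1.0),
    "ship": (-0.8, 0.5),
    "vessel": (-0.9, 0.6),
    "captain": (-0.7, 0.3),
}


def sample_noise(dim, epsilon, rng):
    """Draw a vector z with density proportional to exp(-epsilon * ||z||).

    The direction is uniform on the unit sphere (a normalised Gaussian
    vector) and the radius follows a Gamma(dim, 1/epsilon) distribution.
    """
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    radius = rng.gammavariate(dim, 1.0 / epsilon)
    return [radius * x / norm for x in direction]


def nearest_word(point):
    """Return the vocabulary word whose embedding is closest to point."""
    return min(EMBEDDINGS, key=lambda w: math.dist(EMBEDDINGS[w], point))


def obfuscate_bag(bag, epsilon, rng):
    """Perturb every word occurrence independently; bag size is preserved."""
    out = Counter()
    for word, count in bag.items():
        for _ in range(count):
            vec = EMBEDDINGS[word]
            noise = sample_noise(len(vec), epsilon, rng)
            noisy = [v + n for v, n in zip(vec, noise)]
            out[nearest_word(noisy)] += 1
    return out


rng = random.Random(0)
bag = Counter({"wizard": 2, "wand": 1, "ship": 1})
print(obfuscate_bag(bag, epsilon=10.0, rng=rng))
```

In this sketch a smaller `epsilon` means larger noise, so words are more likely to be replaced by others (stronger obfuscation, lower utility), while a very large `epsilon` leaves the bag essentially unchanged. Nearby words in the embedding space are the most likely substitutes, which is the sense in which the noise respects a semantic metric.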

research
10/20/2019

Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text

Guaranteeing a certain level of user privacy in an arbitrary piece of te...
research
05/02/2018

SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining

Text mining and information retrieval techniques have been developed to ...
research
05/03/2022

Universal Optimality and Robust Utility Bounds for Metric Differential Privacy

We study the privacy-utility trade-off in the context of metric differen...
research
06/02/2023

Guiding Text-to-Text Privatization by Syntax

Metric Differential Privacy is a generalization of differential privacy ...
research
02/15/2023

DP-BART for Privatized Text Rewriting under Local Differential Privacy

Privatized text rewriting with local differential privacy (LDP) is a rec...
research
05/27/2021

On Privacy and Confidentiality of Communications in Organizational Graphs

Machine learned models trained on organizational communication data, suc...
research
09/16/2022

De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks

Unstructured textual data are at the heart of health systems: liaison le...
