SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining

05/02/2018
by Benjamin Weggenmann, et al.

Text mining and information retrieval techniques have been developed to assist us with analyzing, organizing and retrieving documents with the help of computers. In many cases, it is desirable that the authors of such documents remain anonymous: search logs can reveal sensitive details about a user, critical articles or messages about a company or government might have severe or fatal consequences for the critic, and negative feedback in customer surveys might harm business relations if the respondents are identified. Simply removing personally identifying information from a document is, however, insufficient to protect the writer's identity: given some reference texts of suspect authors, so-called authorship attribution methods can reidentify the author from the text itself. One of the most prominent models for representing documents in common text mining and information retrieval tasks is the vector space model, where each document is represented as a vector, typically containing its term frequencies or related quantities. We therefore propose an automated text anonymization approach that produces synthetic term frequency vectors for the input documents, which can be used in lieu of the original vectors. We evaluate our method on an exemplary text classification task and demonstrate that it has only a low impact on classification accuracy. In contrast, we show that our method strongly affects authorship attribution techniques, whose accuracy declines much more sharply, to the point that they become infeasible. Unlike previous authorship obfuscation methods, our approach is the first that fulfills differential privacy and hence comes with a provable plausible deniability guarantee.
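
To make the vector space model mentioned in the abstract concrete, the following minimal sketch builds raw term frequency vectors for two toy documents. It illustrates only the document representation, not the proposed anonymization method; the function name `term_frequency_vectors`, the whitespace tokenization, and the sample sentences are assumptions made for this example.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Return (vocabulary, vectors): one raw term-count vector per document."""
    # Naive tokenization (an assumption for illustration): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The shared vocabulary fixes the meaning of each vector position.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical sample documents, e.g. customer survey feedback.
docs = [
    "the service was slow",
    "the staff was friendly and the food was good",
]
vocab, vecs = term_frequency_vectors(docs)
for doc, vec in zip(docs, vecs):
    print(doc, "->", vec)
```

Each position in a vector corresponds to one vocabulary term, so documents of different lengths map to vectors of the same dimension. It is vectors of this kind that the approach would replace with synthetic counterparts.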

research
11/26/2018

Generalised Differential Privacy for Text Document Processing

We address the problem of how to "obfuscate" texts by removing stylistic...
research
11/22/2022

Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

The task of determining the similarity of text documents has received co...
research
02/10/2021

Privacy-Preserving Graph Convolutional Networks for Text Classification

Graph convolutional networks (GCNs) are a powerful architecture for repr...
research
07/12/2023

Testing different Log Bases For Vector Model Weighting Technique

Information retrieval systems retrieve relevant documents based on a qu...
research
10/19/2022

Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective

Two interlocking research questions of growing interest and importance i...
research
11/06/2017

Authorship Analysis of Xenophon's Cyropaedia

In the past several decades, many authorship attribution studies have us...
research
09/30/2014

An agent-driven semantical identifier using radial basis neural networks and reinforcement learning

Due to the huge availability of documents in digital form, and the decep...
