rx-anon – A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm

05/18/2021
by Fabian Singhofer, et al.

Traditional approaches to data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joint, privacy-preserved version of the heterogeneous input data. We introduce the concept of redundant sensitive information to anonymize the heterogeneous data consistently. To control the influence of anonymization on unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. The parameter λ allows different weights to be assigned to the relational and textual attributes during the anonymization process. We evaluate our approach on two real-world datasets using a Normalized Certainty Penalty score, adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach can reduce information loss by using the tuning parameter to control the Mondrian partitioning while guaranteeing k-anonymity for relational attributes as well as for sensitive terms. As rx-anon is a framework approach, it can be reused and extended with other anonymization algorithms, privacy models, and textual similarity metrics.
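
To make two of the concepts named in the abstract concrete, the following is a minimal sketch, not the rx-anon implementation. It shows a generic k-anonymity check over a set of quasi-identifier attributes and the textbook Normalized Certainty Penalty for a single numerical attribute (range within a partition divided by the global range). The function names, the record layout, and the sample values are assumptions made for illustration; the adapted score for joint relational and textual data used in the paper is not reproduced here.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `records` occurs at least k times."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


def numeric_ncp(partition_values, all_values):
    """Textbook NCP for one numerical attribute:
    (range inside the partition) / (range over the whole dataset).
    0.0 means no generalization, 1.0 means the value is fully generalized."""
    global_range = max(all_values) - min(all_values)
    if global_range == 0:
        return 0.0  # attribute is constant, so nothing is lost
    return (max(partition_values) - min(partition_values)) / global_range


# Hypothetical, already generalized records: "age" is a relational attribute,
# "term" stands in for a sensitive term mapped from the text.
records = [
    {"age": "20-29", "term": "LOCATION"},
    {"age": "20-29", "term": "LOCATION"},
    {"age": "40-60", "term": "PERSON"},
    {"age": "40-60", "term": "PERSON"},
]

print(is_k_anonymous(records, ["age", "term"], k=2))
print(numeric_ncp([20, 29], [20, 29, 40, 60]))
```

In this sketch, a lower NCP indicates less information loss, which is the sense in which the abstract speaks of reducing information loss while still guaranteeing k-anonymity.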


