k-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers

02/09/2021
by Djordje Slijepcevic, et al.

The protection of private information is a crucial issue in data-driven research and business contexts. Typically, techniques such as anonymisation or (selective) deletion are applied to enable data sharing, for example in collaborative research endeavours. Among anonymisation criteria, k-anonymity is one of the most popular, with numerous scientific publications on different algorithms and metrics. Anonymisation techniques often require changing the data and thus necessarily affect the results of machine learning models trained on that data. In this work, we conduct a systematic comparison and detailed investigation into the effects of different k-anonymisation algorithms on the results of machine learning models. We investigate a set of popular k-anonymisation algorithms with different classifiers and evaluate them on different real-world datasets. Our systematic evaluation shows that with an increasingly strong k-anonymity constraint, the classification performance generally degrades, but to varying degrees and strongly depending on the dataset and anonymisation method. Furthermore, Mondrian can be considered the method with the most appealing properties for subsequent classification.
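To make the k-anonymity criterion and the two operations named in the title concrete, the following is a minimal sketch in Python. It is not one of the algorithms evaluated in the paper (it is not Mondrian, for instance); the function names, the quasi-identifier columns `age` and `zip`, and the toy records are all hypothetical and chosen only for illustration. A table is treated as k-anonymous when every combination of quasi-identifier values is shared by at least k records.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(n >= k for n in group_sizes(records, quasi_identifiers).values())

def generalise_age(record, width=10):
    """Generalisation: replace an exact age with a coarser band such as '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

def suppress_small_groups(records, quasi_identifiers, k):
    """Suppression: drop records whose group has fewer than k members."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] >= k
    ]

# Hypothetical toy data: 'age' and 'zip' are assumed quasi-identifiers,
# 'label' stands in for the classification target.
records = [
    {"age": 31, "zip": "1010", "label": 1},
    {"age": 34, "zip": "1010", "label": 0},
    {"age": 38, "zip": "1010", "label": 1},
    {"age": 52, "zip": "1020", "label": 0},
]
QI = ["age", "zip"]

generalised = [generalise_age(r) for r in records]
anonymised = suppress_small_groups(generalised, QI, k=2)
```

In this toy case the raw records are all unique on (`age`, `zip`). Generalising the age merges three of them into one group, and suppression then removes the remaining singleton. The result satisfies 2-anonymity, but at the cost of coarser features and one lost training record, which is the kind of information loss that can degrade a classifier trained on the anonymised data.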


research
08/15/2022

An Overview and Prospective Outlook on Robust Training and Certification of Machine Learning Models

In this discussion paper, we survey recent research surrounding robustne...
research
10/08/2020

Metrics and methods for a systematic comparison of fairness-aware machine learning algorithms

Understanding and removing bias from the decisions made by machine learn...
research
03/26/2023

Approaches to Improving the Accuracy of Machine Learning Models in Requirements Elicitation Techniques Selection

Selecting techniques is a crucial element of the business analysis appro...
research
05/11/2023

Energy cost and machine learning accuracy impact of k-anonymisation and synthetic data techniques

To address increasing societal concerns regarding privacy and climate, t...
research
03/29/2023

Poster: Link between Bias, Node Sensitivity and Long-Tail Distribution in trained DNNs

Owing to their remarkable learning (and relearning) capabilities, deep n...
research
05/16/2023

Comparison of classifiers in challenge scheme

In recent decades, challenges have become very popular in scientific res...
research
05/18/2020

An Overview of Privacy in Machine Learning

Over the past few years, providers such as Google, Microsoft, and Amazon...
