Assessing the risk of re-identification arising from an attack on anonymised data

03/31/2022
by Anna Antoniou, et al.

Objective: The use of routinely acquired medical data for research purposes requires the protection of patient confidentiality via data anonymisation. The objective of this work is to calculate the risk of re-identification arising from a malicious attack on an anonymised dataset. Methods: We first present an analytical means of estimating the probability of re-identification of a single patient in a k-anonymised dataset of Electronic Health Record (EHR) data. Second, we generalise this solution to obtain the probability of multiple patients being re-identified. We validate the estimates against Monte Carlo simulations on synthetic data to illustrate their accuracy. Results: The proposed analytical framework for risk estimation provides re-identification probabilities that agree with those obtained by simulation in a number of scenarios. Our work is limited by conservative assumptions, which inflate the re-identification probability. Discussion: Our estimates show that the re-identification probability increases with the proportion of the dataset maliciously obtained and that it has an inverse relationship with the equivalence class size. Our recursive approach extends the applicability domain to the general case of a multi-patient re-identification attack in an arbitrary k-anonymisation scheme. Conclusion: We prescribe a systematic way to parametrise the k-anonymisation process based on a pre-determined re-identification probability. We observe that the reduction in re-identification risk gained by increasing k may not be worth the accompanying loss of data granularity when the re-identification probability is benchmarked against the size of the portion of the dataset maliciously obtained by the adversary.
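
The qualitative relationship stated in the abstract (risk rising with the leaked proportion and falling with the equivalence class size) can be illustrated with a toy model. The sketch below is not the paper's analytical framework, which is not reproduced on this page. It assumes a deliberately simplified attack: the target's record falls inside the leaked portion with probability `leak_fraction`, and the adversary then guesses uniformly among the `k` indistinguishable records of the target's equivalence class. The function names and parameters are hypothetical and chosen only for illustration.

```python
import random


def analytic_risk(leak_fraction: float, k: int) -> float:
    """Toy closed form (an assumption, not the paper's formula):
    P(record leaked) * P(correct uniform guess among k records)."""
    return leak_fraction / k


def simulated_risk(leak_fraction: float, k: int,
                   trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same toy probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # The target's record is part of the leaked portion.
        leaked = rng.random() < leak_fraction
        # The adversary picks one of k look-alike records at random;
        # index 0 stands for the target's true record.
        if leaked and rng.randrange(k) == 0:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (2, 5, 10):
        for leak_fraction in (0.1, 0.5, 1.0):
            exact = analytic_risk(leak_fraction, k)
            estimate = simulated_risk(leak_fraction, k)
            print(f"k={k:2d} leak={leak_fraction:.1f} "
                  f"analytic={exact:.4f} simulated={estimate:.4f}")
```

Under these assumptions the simulated values should track the closed form closely, mirroring in miniature the analytic-versus-simulation comparison the abstract describes. A realistic model would also need to account for multiple targets and varying equivalence class sizes.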

research
08/20/2019

Risk-Efficient Bayesian Data Synthesis for Privacy Protection

High-utility and low-risks synthetic data facilitates microdata dissemin...
research
06/10/2016

De-identification of Patient Notes with Recurrent Neural Networks

Objective: Patient notes in electronic health records (EHRs) may contain...
research
10/30/2016

Feature-Augmented Neural Networks for Patient Note De-identification

Patient notes contain a wealth of information of potentially great inter...
research
05/09/2022

An Application of D-vine Regression for the Identification of Risky Flights in Runway Overrun

In aviation safety, runway overruns are of great importance because they...
research
06/01/2020

Identification Risk Evaluation of Continuous Synthesized Variables

We propose a general approach to evaluating identification risk of conti...
research
04/12/2023

Measuring Re-identification Risk

Compact user representations (such as embeddings) form the backbone of p...
research
01/21/2013

A formalization of re-identification in terms of compatible probabilities

Re-identification algorithms are used in data privacy to measure disclos...
