A Review of Anonymization for Healthcare Data

04/13/2021 · Iyiola E. Olatunji, et al. · L3S Research Center

Mining health data can lead to faster medical decisions, improvement in the quality of treatment, disease prevention, and reduced cost, and it drives innovative solutions within the healthcare sector. However, health data is highly sensitive and subject to regulations such as the General Data Protection Regulation (GDPR), which aims to ensure patient privacy. Anonymization, i.e., the removal of patient-identifiable information, is the most conventional approach and an important first step toward complying with such regulations and addressing privacy concerns. In this paper, we review existing anonymization techniques and their applicability to various types (relational and graph-based) of health data. We also provide an overview of possible attacks on anonymized data. We illustrate via a reconstruction attack that anonymization, though necessary, is not sufficient to protect patient privacy, and we discuss methods for defending against such attacks. Finally, we discuss tools that can be used to achieve anonymization.


1 Introduction

With the increasing adoption of healthcare information technology (HIT) by medical institutions, the generation and capture of healthcare-related data have been increasing rapidly in recent years. The application of artificial intelligence (AI) techniques already gives a glimpse of potential improvements, ranging from lung cancer nodule detection in CT scans to disease prediction and treatment [61, 80, 97]. The challenge, though, is that these AI models are usually data hungry and require large amounts of data for training. Healthcare data, on the other hand, contains highly sensitive patient information and cannot be easily shared. The reluctance to release data query/analysis tools built on healthcare data can be further justified by the fundamental law of information recovery [23], which states that when a data source is queried multiple times and returns overly accurate information for each query, the underlying data can be reconstructed partially or in full. Therefore, health data needs to be protected against such leakage to ensure patient privacy.

Privacy can be applied to health data at different levels. For instance, at the data collection phase, randomization in the form of noise is usually added. Federated learning, homomorphic encryption, and secure multi-party computation can be applied at the data distribution phase. In this work, we focus on anonymization, which is used to achieve privacy at the data publication phase. Regulations such as the GDPR [81] require data anonymization or the removal of personal or sensitive identification information before any knowledge extraction task or query is processed. We provide a comprehensive review of anonymization in healthcare data, focusing on three main aspects: (i) anonymization models and techniques, (ii) attacks and defenses proposed for anonymized data, and (iii) available tools for anonymizing data.

In the first part, we introduce basic concepts and discuss existing anonymization techniques and their applicability to various types of health data. In particular, we differentiate between two health data types: (i) relational and (ii) graph-based. Relational (same-site) patient data represents patient visits and medical diagnoses from a single hospital, whereas graph-based health data could include, for example, a transmission network or epidemiological graph in which the nodes are patients and the edges are the interactions between them.

1.1 Attacks on anonymization

Although anonymized, the data might still be subject to several attacks such as background knowledge attacks, linkage attacks, attribute disclosure attacks, and membership disclosure attacks. A classic example of a linkage attack is the analysis of the 1990 U.S. Census performed by Sweeney [95]. Sweeney [95] found combinations of pseudo- or quasi-identifiers (QIDs) that would distinctively identify a person in the US and later used the same QID set to identify the then governor of Massachusetts (William Weld), as well as his medical record, by combining information from a publicly available voter list with an anonymized medical dataset obtained from the group insurance commission.

In the second part of the review, we focus on different attacks under different adversarial settings. For a practical illustration, we devise a reconstruction attack on the anonymized MIMIC-III dataset [42].

1.2 Anonymization tools

Finally, we review several existing tools which can be utilized by practitioners and researchers for anonymizing health data. These tools allow users to perform non-interactive anonymization on data. They can be integrated into popular database management systems (DBMS) such as MySQL, PostgreSQL, and Oracle, and can be used to generate synthetic datasets.

1.3 Related works

We differentiate our work from the following key surveys on anonymization [36, 33, 25, 63]. Hamza et al. [36] only surveyed different attacks on privacy-preserving data publishing. Gkoulalas-Divanis et al. [33] provided a comprehensive review of several algorithms for protecting health data; however, their survey does not cover the tools or the practical applicability of these methods, and it is limited to relational data. Eze and Peyton [25] reviewed anonymization for health data, but their survey is limited to data sharing and does not consider the different adversarial settings under which attacks can succeed against different methods.

Health data is dynamic (constantly changing) and high dimensional (has many attributes), which differentiates it from other types of data. However, the anonymization techniques reviewed in [63] are generic to all types of data and not specific to health data; therefore, most of the techniques reviewed there are not directly applicable to health data. Lastly, none of the aforementioned works demonstrated the vulnerability of anonymized data to attacks on a real-world dataset.

1.4 Contributions

We summarize our contributions as follows:

  • We present a comprehensive review of different anonymization models and data transformation techniques that have been applied to relational and graph-structured health data.

  • We discuss different attacks on health data and demonstrate a practical reconstruction attack (code is provided at https://github.com/iyempissy/anonymization-reconstruction-attack)
    using the MIMIC-III dataset. We further review methods for protecting against such attacks.

  • We highlight existing practical tools that can be used to preserve and analyze an individual’s privacy from an adversary in healthcare settings.

We believe that our work will assist researchers and practitioners in choosing appropriate anonymization techniques based on a multitude of aspects such as data type, desired privacy level, information loss, and possible adversarial behavior.

1.5 Organization

The rest of the paper is organized as follows. We present anonymization models (privacy models) and techniques for satisfying such models in Section 2. We review methods for anonymizing different health data types in Section 3. In Section 4, we present attacks on anonymized data under different adversarial settings, and we demonstrate our reconstruction attack in Section 5. Finally, we present defense mechanisms in Section 6 and practical tools in Section 7.

2 Anonymization models and techniques

Anonymization refers to the complete removal of an individual’s identifiers as well as the generalization of any other data that can be used to establish links to the individual. Though regulations like the GDPR might require anonymization of data, clear guidelines do not exist. We remark that anonymization differs from de-identification, which refers to the removal or replacement of personal identifiers in the dataset such that the link between the individual and her data record can only be established by an authorized third party.

Several privacy models and techniques for achieving anonymization have been proposed in the literature. Overall, the three main goals of anonymization are to preserve: data utility (measured by the amount of loss caused by the anonymization technique, e.g., information loss), privacy (measured by the conformity of the data to the privacy model constraints), and data truthfulness (each anonymized record corresponds to a single record in the original table) [29]. We start by giving an overview of proposed anonymization models in Section 2.2, followed by techniques for operationalizing these models in Section 2.3.

2.1 Basic definitions and notations

The notion of privacy is often tied to the relational (tabular) data model [29], in which (finite) datasets are organized in tables (relations) that consist of columns (attributes) and rows (records). We start by defining anonymization models for tabular data and later extend the notion to graph-based data in Section 3.2.1. First, we need the following notations and basic definitions.

Adhering to [95], we denote a table as $T(A_1, \ldots, A_n)$, where the $A_i$ are the attributes. For a record $r \in T$ and any subset of attributes $Q \subseteq \{A_1, \ldots, A_n\}$, let $r[Q]$ denote the sequence of values of $r$ with respect to the attributes in $Q$.

In the context of privacy, attributes can be categorized as direct identifiers (DIDs), quasi-identifiers (QIDs), and sensitive attributes (SAs) (cf. Table 1). DIDs uniquely identify an individual, e.g., social security numbers and driver's licenses. QIDs on their own cannot identify an individual but, when combined, can re-identify the individual. SAs of an individual are to be kept private from potential adversaries, while non-sensitive attributes may be made public and are considered to be already known by adversaries. As QIDs form an important ingredient of various anonymization models, we formally define a QID as follows:

Definition 1.

Given a universe of individuals $U$ and a table $T(A_1, \ldots, A_n)$ containing a dataset pertaining to a set of individuals $P \subseteq U$, let $Q \subseteq \{A_1, \ldots, A_n\}$ be a set of non-sensitive attributes. Then $Q$ is a QID if there exists at least one individual $p \in P$ who can be re-identified from the combination of values $r[Q]$ of her record $r \in T$.
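To make the notion of a QID concrete, the following minimal Python sketch estimates how many records a candidate QID set singles out uniquely (and hence how many individuals could be re-identified). The pandas layout and column names are illustrative assumptions, not data from this paper.

```python
import pandas as pd

def reidentification_risk(df: pd.DataFrame, qid: list[str]) -> float:
    """Fraction of records whose QID value combination is unique in the table."""
    group_sizes = df.groupby(qid).size()
    return group_sizes[group_sizes == 1].sum() / len(df)

# Hypothetical example: gender, birth year and zip code as a candidate QID.
records = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "birth_year": [1980, 1980, 1975, 1990],
    "zip": ["30301", "30301", "30302", "30309"],
    "diagnosis": ["gastritis", "flu", "asthma", "flu"],  # sensitive attribute
})
print(reidentification_risk(records, ["gender", "birth_year", "zip"]))  # 0.5
```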

2.2 Anonymization models

We here define three basic anonymization models proposed in the literature, namely k-anonymity, ℓ-diversity, and t-closeness. We also summarize various properties, limitations, and several variants of these models in Table 2.

2.2.1 k-anonymity

k-anonymity requires that at least k individuals share the same QID values. Since QIDs contain fields that are likely to appear in other known datasets, k-anonymity ensures that each individual remains anonymous within their respective group (equivalence class).

Definition 2.

Let $Q$ be the QID of table $T$. Then $T$ satisfies $k$-anonymity if and only if, for each unique combination of record values with respect to $Q$, there are at least $k$ identical records.

For example, if k = 10, then each equivalence class should have at least 10 similar records. This guarantees that an attacker cannot single out the record of any one individual. However, when k is too high, utility depreciates. Moreover, the absence of sufficient heterogeneity in sensitive attributes limits the privacy offered by the k-anonymity model.
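A k-anonymity check is simple to express over tabular data. The following is a minimal sketch assuming a pandas DataFrame and a list of QID column names; it is an illustration, not the procedure of any particular anonymization tool.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, qid: list[str], k: int) -> bool:
    """A table is k-anonymous w.r.t. the QID if every equivalence class
    (records sharing the same QID values) contains at least k records."""
    return bool(df.groupby(qid).size().min() >= k)

def smallest_equivalence_class(df: pd.DataFrame, qid: list[str]) -> int:
    """Size of the smallest equivalence class, i.e. the largest k the table satisfies."""
    return int(df.groupby(qid).size().min())
```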

2.2.2 ℓ-diversity

ℓ-diversity overcomes the limitations of k-anonymity by considering diversity among SAs. ℓ-diversity ensures that there are at least ℓ distinct values of the SA in each equivalence class [62].

Suppose that for the QID value combination $q^*$ there exists an equivalence class. The set of records in table $T$ whose QID values all equal $q^*$ is called a $q^*$-block.

Definition 3.

A $q^*$-block is ℓ-diverse if it contains at least ℓ different values for the SA and the most frequent values have roughly the same frequency. A table is ℓ-diverse if every $q^*$-block is ℓ-diverse.

However, ℓ-diversity cannot prevent attribute disclosure attacks.

Example. Assume disease is the SA and that the table is 3-diverse (i.e., there are 3 distinct SA values in each equivalence class). For an equivalence class whose SA values are gastric ulcer, gastritis, and stomach cancer, an adversary can infer that an individual has a stomach-related problem because all three diseases in the equivalence class are stomach-related.
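The simplest ("distinct") form of ℓ-diversity can be checked per equivalence class as sketched below. The snippet assumes a pandas DataFrame with hypothetical column names; the additional frequency condition from Definition 3 would require a further check.

```python
import pandas as pd

def is_distinct_l_diverse(df: pd.DataFrame, qid: list[str], sa: str, l: int) -> bool:
    """Distinct l-diversity: every equivalence class (records sharing the same
    QID values) must contain at least l distinct values of the sensitive attribute."""
    return bool(df.groupby(qid)[sa].nunique().min() >= l)
```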

2.2.3 t-closeness

t-closeness ensures that the distance between the distribution of sensitive values in each equivalence class and their distribution in the overall table is no more than a threshold t [55]. Hence, a smaller value of t represents stronger privacy.

Definition 4.

Let $D_T$ be the distribution of an SA in table $T$ and $D_{q^*}$ be the distribution of this attribute in a $q^*$-block. The $q^*$-block is said to have $t$-closeness if the distance between $D_T$ and $D_{q^*}$ is at most $t$ for any SA. The table has $t$-closeness if this holds for all its $q^*$-blocks.

t-closeness overcomes the limitation of ℓ-diversity in the presence of skewed attribute distributions in which one sensitive value dominates. Ensuring t-closeness for the above example would imply that, for a table satisfying 3-diversity, an equivalence class does not contain only stomach-related problems but also other types of disease such as pneumonia. Common distance functions used to measure closeness include the Kullback-Leibler (KL) divergence and the Earth Mover’s Distance (EMD).
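A t-closeness check for a categorical SA can be sketched as follows (pandas layout and column names are assumptions). For a categorical attribute with a unit ground distance between values, the EMD reduces to the total variation distance computed here.

```python
import pandas as pd

def t_closeness_violations(df: pd.DataFrame, qid: list[str], sa: str, t: float) -> list:
    """Flag equivalence classes whose SA distribution is farther than t from the
    overall SA distribution (EMD with unit ground distance = total variation)."""
    overall = df[sa].value_counts(normalize=True)
    violations = []
    for key, block in df.groupby(qid):
        local = block[sa].value_counts(normalize=True)
        # total variation distance: half the L1 distance between the distributions
        distance = 0.5 * overall.subtract(local, fill_value=0).abs().sum()
        if distance > t:
            violations.append((key, round(float(distance), 3)))
    return violations
```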

2.3 Techniques for satisfying privacy models

Several techniques have been used to satisfy the various privacy models (k-anonymity, ℓ-diversity, t-closeness), including slicing, generalization, suppression and relocation, perturbation, bucketization, and microaggregation.

Slicing. Slicing partitions the data into groups both horizontally and vertically. It first performs vertical slicing by partitioning attributes into columns, where each column contains a subset of attributes, and then performs horizontal slicing by partitioning tuples into buckets, where each bucket contains a subset of tuples [57]. The goal of slicing is to ensure that highly correlated attributes are grouped together.

Generalization. Generalization replaces QID values (attributes that potentially identify an individual, e.g., age, zip code, gender, and date of birth) with less specific values that are consistent with the original data [50]. An example of generalization hierarchies for three attributes (gender, marital status, and religion) is shown in Figure 1.

Since generalization may lose considerable information, various methods have been proposed to reduce the information loss. Most generalization algorithms use global recoding or full domain generalization of attribute values. This implies that the same transformation is applied to each QID value. However, data utility can be higher when local recoding is applied to each equivalence class [108]. Anonymization achieved with generalization-only approaches inevitably distorts the records. Therefore, over-generalization negatively influences the analysis of the anonymized dataset.
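As an illustration of global recoding, the sketch below applies the same generalization rule to every value of an attribute. The column names and hierarchy levels are assumptions for illustration, not the hierarchies of Figure 1.

```python
import pandas as pd

def generalize(df: pd.DataFrame, level: int) -> pd.DataFrame:
    """Full-domain generalization: the same transformation is applied to every
    value of an attribute; higher levels are coarser (level 2 suppresses)."""
    out = df.copy()
    if level >= 1:
        out["zip"] = out["zip"].str[:3] + "**"                       # 30309 -> 303**
        decade = out["age"] // 10 * 10
        out["age"] = decade.astype(str) + "-" + (decade + 9).astype(str)  # 47 -> "40-49"
    if level >= 2:
        out["zip"] = "*****"                                          # suppression
        out["gender"] = "*"
    return out
```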

Suppression and relocation. To reduce over-generalization, suppression and relocation techniques are used [96, 74]. Suppression involves removing outliers, and relocation involves changing the QIDs of the outliers. Outliers are the main cause of over-generalization because they are distant from other records and, being few, cannot be grouped into equivalence classes of sufficient size. In these methods, the values of outliers are removed or changed.

Suppression can be performed at the record level or the cell level. Record-level suppression leads to excessive deletion of records for equivalence classes that do not meet the privacy constraints, while cell-level suppression only removes the QID values that make a tuple violate the privacy model constraint, thereby reducing the loss of information. Suppression combines well with machine learning models because suppressed values can be handled as missing values. However, in large datasets, full-column suppression can substantially hurt utility.

Perturbation. In lieu of suppression, perturbation techniques can be used to augment generalization. Perturbation replaces sensitive values and QIDs with fake masks of the original data [25]. Using perturbation to achieve anonymization leads to better utility because the degree of generalization is limited via the insertion of counterfeit records into equivalence classes [50].

Bucketization. Bucketization [3, 45] separates the sensitive attributes from the QIDs by randomly permuting or swapping the sensitive attribute values within each bucket. This achieves better utility than generalization. However, because bucketization publishes the QID values in their original form, an adversary can still identify an individual using the QIDs, and the technique therefore does not prevent membership disclosure. For better privacy, generalization and bucketization can be applied to the same dataset to jointly achieve k-anonymity and ℓ-diversity, respectively.

Microaggregation. Ensuring that the same dataset satisfies k-anonymity, ℓ-diversity, and t-closeness leads to considerable loss of information due to generalization, perturbation, and suppression, or may not even be achievable in practice. Thus, Gal et al. [31] proposed a microaggregation algorithm for numerical QIDs that creates k-anonymous equivalence classes and replaces sensitive attributes with masked values [44]. Microaggregation substitutes the values of groups of nearest records by their centroid. Finding the optimal grouping is known to be NP-hard [75, 19].
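The following is a simplified, MDAV-style sketch of microaggregation over numerical QIDs (an illustration of the general idea, not the algorithm of Gal et al. [31]): groups of at least k nearest records are replaced by their centroid.

```python
import numpy as np

def microaggregate(X: np.ndarray, k: int) -> np.ndarray:
    """Simplified MDAV-style microaggregation (assumes len(X) >= k): repeatedly
    take the record farthest from the centroid of the remaining data, group it
    with its k-1 nearest neighbours, and replace the group by its centroid."""
    X = X.astype(float)
    out = np.empty_like(X)
    remaining = np.arange(len(X))
    while len(remaining) >= 2 * k:
        centroid = X[remaining].mean(axis=0)
        far = remaining[np.argmax(np.linalg.norm(X[remaining] - centroid, axis=1))]
        dists = np.linalg.norm(X[remaining] - X[far], axis=1)
        group = remaining[np.argsort(dists)[:k]]      # far itself plus k-1 nearest
        out[group] = X[group].mean(axis=0)
        remaining = np.setdiff1d(remaining, group)
    if len(remaining):                                # last group has k..2k-1 records
        out[remaining] = X[remaining].mean(axis=0)
    return out
```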

Figure 1: Example of generalization hierarchies for gender, marital status, and religion. Religion is label-encoded, indicating Greek Orthodox, Protestant, others, Jewish, Methodist, Catholic, and Christian Scientist. Level refers to the height of the hierarchy tree, and * indicates suppression.
Attributes Description
Relational data attributes
Direct identifier (DID): Identifiers that uniquely identify an individual, e.g., social security number or driver's license.
Quasi-identifier (QID): QIDs cannot identify an individual on their own but can re-identify the individual when combined, e.g., gender and postcode.
Sensitive attribute (SA): SAs are not usually public data but are sensitive if associated with an individual, e.g., drug codes, diagnoses, and disease conditions.
Graph-based data attributes
Node: Represents an individual. Even when PII is removed from the nodes, the presence or absence of a target individual in the graph can be considered private.
Node properties: Properties such as the degree (number of connected neighbors) or attributes associated with a node, e.g., sex, can be considered private.
Node labels: The labels of nodes in a graph can be non-sensitive or sensitive. Sensitive labels are analogous to the sensitive attributes in relational data.
Links: The links or edges in a graph indicate that a relationship exists between the corresponding nodes and therefore need to be kept private.
Table 1: Attributes of different health data types.

3 Anonymization in healthcare: data types and techniques

3.1 Relational health data

The most conventional way of representing patients' health data is relational (tabular), where each row (record) in a database table represents a measurement, event, or other patient-related information. This data model typically represents single-site patient data, which includes patient visits and medical diagnoses from a single hospital or doctor's practice.

For example, when a patient p visits a doctor in the hospital, the diagnosis and details of p's visit form a record in the database. This forms the basis of electronic medical records (EMRs). For this type of health data, anonymization is achieved by ensuring that the records in each equivalence class are indistinguishable or similar.

3.1.1 Anonymization of relational health data

Anonymization based on generalization. Following the definitions in Section 2.2, the combination of QIDs such as age, race, gender, and zip code can be used by an adversary to re-identify an individual. Hence, QIDs are usually generalized to satisfy the constraints of the privacy models.

To reduce the information loss due to generalization, the k-anonymity problem can be viewed as a clustering or partitioning problem where the goal is to find sets of clusters (i.e., equivalence classes), each of which contains at least k records [13, 5, 20, 71, 73]. The records in each partition or cluster are then generalized so that they all share the same QID values and each cluster has at least k records. However, when there are many QIDs, the clustering technique still suffers from high information loss [4, 72] and fails to guarantee ℓ-diversity.

While most works ignore the differing potential of individual QIDs to reveal an individual's identity, Majeed et al. [64] focused on adaptive attribute generalization based on the different information content of QIDs. They proposed a solution for ℓ-diversity and t-closeness in imbalanced datasets by creating a flexible generalization hierarchy based on how vulnerable the QIDs are to revealing identity and on the diversity of SAs in each equivalence class. Their method enhances utility by controlling the over-generalization of QIDs that are less vulnerable to revealing identity in diverse classes.

For the same dataset to be k-anonymous, ℓ-diverse, and t-close, Gal et al. [31] proposed a microaggregation algorithm that utilizes numerical QIDs to create k-anonymous clusters and replaces sensitive attributes with masked values. Other methods used to jointly achieve k-anonymity, ℓ-diversity, and t-closeness on medical data include SHARE [33] and correlation-aware anonymization of high-dimensional data (CAHD) [32].

Anonymization for incremental medical data. Medical records are incremental in nature; therefore, methods for incremental anonymization of data streams have been proposed whose goal is to anonymize medical data as they grow [14, 28, 47, 82, 106, 67, 46, 78, 79]. Such methods are based on accumulation, aggregation, and clustering. In these methods, each new piece of data is a tuple consisting of QIDs and sensitive attributes. When new data arrive, they pass through an aggregation engine that assigns the data to a cluster based on information loss. These clusters are then checked against the privacy models. If a tuple does not satisfy the privacy model, e.g., ℓ-diversity, perturbation is applied to the tuple by randomly adding some generated attributes to the original data.

Utility-aware anonymization. Utility-aware anonymization was considered in [51, 108, 83]. These methods use attribute-level anonymization that retains the original values of QIDs based on an assigned utility value and applies local recoding to anonymize the data. Moreover, Poulis et al. [83] proposed a (k, k^m)-anonymity technique that has bounded utility constraints and still ensures the privacy of both demographic information and disease diagnosis codes.

As knowledge of the interrelationship between QIDs and SAs affects privacy protection [58, 21, 103, 98], a number of works focused on determining the achievable level of privacy without compromising utility. Besides using popular metrics such as information loss, precision, and discernibility to measure utility, several works used the accuracy of a trained classification model as the utility metric. Such approaches lean towards employing machine learning (ML) models or decision trees to measure the accuracy-anonymization/privacy trade-off [111, 52, 30, 54, 48].

Anonymizing multi-dimensional health data. Since health data may be multi-dimensional with several attributes, the knowledge assumed available to an attacker can be relaxed and bounded to a limited number of QID attributes according to the LKC-privacy model proposed by Mohammed et al. [68]. This implies that not all attributes need to be anonymized, but only combinations of attributes, in order to satisfy the privacy model (e.g., k-anonymity). However, such combinatorics suffers from the curse of dimensionality [4], and the anonymized dataset can still be vulnerable to inference attacks when viewed collectively [14]. Therefore, data publishers should anonymize and publish only a sample of the original dataset [24].

Anonymizing against background attack. Majeed [65] proposed an anonymization scheme that prevents identity disclosure even against adversaries with strong background knowledge. The method transforms data into fixed intervals and then replaces the original values with averages.

3.2 Graph-based health data

With the adoption of electronic health records (EHRs), which allow cross-sectional as well as longitudinal views of patient diagnoses across different hospitals (multi-site), several works have been geared towards representing EHR data in combination with other heterogeneous sources, such as clinical notes, diagnoses (ICD codes), and therapies and prescriptions (procedure and ATC codes), as graphs [8, 40, 90, 35, 18, 109]. Liu et al. [59] used graph analysis techniques to detect fraud, waste, and abuse in healthcare by representing a health dataset as a heterogeneous network consisting of patients, doctors, and pharmacies. Similarly, Branting et al. [10] used graphs derived from an open-source health dataset to estimate healthcare fraud. Recently, EHRs have also been represented as graph data to extract knowledge graphs [90, 35, 18, 109].

These graph-based health data provide rich knowledge about the interactions and the underlying properties of the data. For example, in a disease transmission network or epidemiological graph, the nodes are the patients and the edges represent the interactions between them. Therefore, both nodes and edges need to be protected as they contain sensitive information. Anonymization of graph-based health data can be achieved either by removing edges and node labels (naïve anonymization) or by adding new edges or nodes to modify the structure of the graph (structural anonymization).

3.2.1 Anonymization of graph-based health data

Most real-world data, including health data, can be represented as graphs where nodes denote entities or individuals and edges represent the interactions or relationships between them. To anonymize a graph, one could re-represent the data as a single table and apply anonymization as though it were relational data. However, graphs do not have the same semantics as relational data, where records are independent, and such a representation fails to provide the necessary anonymization. The graph variant of the k-anonymity model used for relational health data is k-degree anonymity.

Definition 5.

A graph $G$ is $k$-degree anonymous if for every node $v$ in $G$, there exist at least $k-1$ other nodes in the graph with the same degree as $v$.

Zhou and Pei [116] showed that even when an individual's privacy is preserved with conventional anonymization techniques, the individual can be re-identified in a graph due to neighborhood attacks. This is possible when the adversary has some knowledge about the neighbors of a target victim and the relationships among those neighbors. To ensure privacy, each node must share the same neighborhood structure with at least k−1 other nodes.

Anonymization based on graph manipulation. Using graph modification techniques such as edge addition or deletion, Liu and Terzi [60] achieved k-degree anonymity by constructing a k-anonymous degree sequence in which each node has the same degree as at least k−1 other nodes. The anonymous degree sequence is then used to construct the anonymized graph. Mortazavi and Erfani [70] proposed a (k, ℓ)-anonymity method that preserves the privacy of an individual or node in the graph such that even when an attacker knows at most ℓ neighbors of a node, she cannot identify that node within a group of fewer than k nodes. This is achieved by first adding edges to the graph to satisfy the privacy model; redundant edges are then removed to minimize the changes between the anonymized and the original graph. However, the changes may still be visible to an attacker. Similarly, [118], [17], [110], and [6] exploited graph structure and attribute information to protect against such structure-based and node attribute disclosure attacks (see Section 4.3) using k-degree anonymity, ℓ-diversity, and t-closeness.
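Checking k-degree anonymity amounts to checking that every degree value in the graph occurs at least k times. A minimal sketch with networkx, assuming an undirected graph:

```python
from collections import Counter
import networkx as nx

def k_degree_anonymity(G: nx.Graph) -> int:
    """Largest k for which G is k-degree anonymous, i.e. every node shares its
    degree with at least k-1 other nodes (Definition 5)."""
    degree_counts = Counter(dict(G.degree()).values())
    return min(degree_counts.values())

# Toy transmission network: node 0 is a hub with a unique degree, so k = 1.
G = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2)])
print(k_degree_anonymity(G))  # 1
```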

Clustering-based anonymization. Similar to relational data (Section 3.1.1), clustering-based anonymization techniques for graph data have been proposed [9, 38, 99]. These approaches first partition the data into clusters and construct graphs based on the data in each cluster. Graph modification techniques, such as removing or adding edges, are then applied to the constructed graph to satisfy the privacy model constraints. Zheleva and Getoor [115] showed that sensitive relationships can be inferred from anonymized graphs. As a defense, they removed all edges from the anonymized graph, collapsed the nodes of each cluster into a single node, and then randomly selected edges from the removed ones and assigned them as edges between clusters. In order to anonymize graphs while maintaining structural similarity, Foffano et al. [26] applied the Szemerédi regularity lemma to partition the nodes into clusters, ensuring that the intra-partition edges behave almost randomly. In this way, the inter-partition edges are left unaltered, making the result semantically similar to the original graph.

Although we suggest that several methods for anonymizing social network data may be applicable to graph-based health data, such claims have not been empirically verified. Interested readers are referred to [63] for details on anonymizing social network data.

k-anonymity. Protected attribute: QID. Guarantee: protection against LI/identity disclosure attacks and AD attacks. Limitation: privacy is not guaranteed when an adversary has strong background knowledge (BK attack) or when the SA values in an equivalence class are not diverse (homogeneity attack). Variants: (α, k)-anonymization [114], p-sensitive k-anonymity [15], p+-sensitive and (p, α)-sensitive k-anonymity [94], (k, k^m)-anonymity [83], the overlapped slicing method [12], DBTP-MDAV [105], and further k-anonymity variants [43, 104, 41, 11, 2, 113].

ℓ-diversity. Protected attribute: SA. Guarantee: protection against AD attacks and homogeneity attacks; offers protection of the SA. Limitation: it is difficult to create a feasible ℓ-diverse dataset when the data is highly imbalanced (e.g., when the SA distribution is uneven); does not protect against MD attacks. Variants: independent ℓ-diversity [117], SQ ℓ-diversity [89], and further ℓ-diversity variants [100, 105, 107, 76].

t-closeness. Protected attribute: SA. Guarantee: depending on the value of t, it may protect against MD attacks; well suited for numeric attributes and offers better protection than ℓ-diversity. Limitation: difficult to create a feasible t-close dataset when the data is highly imbalanced; a stricter (smaller) value of t may degrade data utility, and the model is complex in nature. Variants: (n, t)-closeness [56], shuffle-based t-closeness [85], multiple sensitive attribute-based t-closeness [102], SABRE [16], microaggregation-based t-closeness [92], and SQ t-closeness [89].

k-degree anonymity. Protected attributes: nodes and links. Guarantee: protection against NAD, NB, and ST attacks. Limitation: vulnerable to NE attacks; the edge additions and removals may adversely alter the structure of the original graph. Variants: k-anonymous graphs [116], k-degree anonymity [60], class-based anonymity [9], cluster-based anonymity [99], and the anonymity model of [38].

k-degree ℓ-diversity. Protected attributes: node properties and labels. Guarantee: protection against NE attacks; node labels likewise cannot be inferred. Limitation: the noise added to nodes to preserve privacy may lead to lower utility. Variants: k-automorphism [118], k-isomorphism [17], the GRAM algorithm [70], and the approaches of [110] and [26].

Table 2: Comparison of different privacy models, their guarantees, and their variants. The first three models concern relational data and the last two graph-based data.

4 Attacks on anonymized health data under different adversarial settings

In this section, we discuss the different types of adversaries and plausible attacks on graph-based and relational health data. We also discuss the vulnerability to these attacks under certain adversarial settings, as summarized in Table 3.

4.1 Types of adversaries

When considering anonymization, understanding the knowledge of the adversary provides better protection against re-identification and also plays a crucial role when performing analysis [39]. An adversary can either be semi-honest or malicious. A semi-honest (honest-but-curious) adversary follows the predefined protocol but is also interested in learning more from the received information than she is entitled to. A malicious adversary, however, deviates from the protocol and possibly colludes with other corrupted parties. We classify adversaries into classical, statistical, and adaptive adversaries.

Adversary Type Description Categorization Attacks
Classical Tries to discover SAs of an individual based on her knowledge of the QIDs. Semi-honest BK, NAD, LI
Statistical Exploits the differences between the statistical distribution of the original and the anonymized dataset to uncover perturbations applied to the data. Semi-honest AD, MD, NE
Adaptive Has the capability to reverse-engineer the anonymization algorithm based on her knowledge of it. Moreover, she can adapt her strategy as the attack progresses. Malicious AD, MD, NB, ST, NE
Table 3: Categorization of different types of adversaries and their attack vulnerability.

4.2 Attacks on anonymized relational data

Following the adversarial settings in Section 4.1, we categorize the attacks on relational health data into background knowledge attacks, linkage attacks, attribute disclosure attacks, and membership disclosure attacks.

Background knowledge (BK) attack. When an adversary knows some information or QIDs about the target individual, she can reconstruct the identifiable information of the individual. Such a reconstruction attack compromises the privacy of the target individual. We show an example of a reconstruction attack in Section 5 using the MIMIC-III dataset.

Linkage (LI) attack. The linkage attack is one of the classical attacks on relational data, where an adversary re-identifies an individual by linking a record in an anonymized dataset to the individual using QIDs combined from different sources. This requires some form of background knowledge.

Attribute disclosure (AD) attack. In an AD attack, the attacker aims to gain new information about the SA. The attacker can also exploit the properties of the QIDs to estimate the SA. Usually, attribute disclosure is a by-product of identity disclosure; nonetheless, it can occur independently of a linkage attack. For example, an adversary may be interested in inferring an SA that is common to all target individuals.

Membership disclosure (MD) attack. In an MD attack, the adversary aims to infer the presence or absence of an individual in a dataset. For example, knowing that an individual is present in a cancer dataset reveals that she has cancer, even if the specific type of cancer cannot be inferred. This may serve as leverage for launching an attribute disclosure attack.

4.3 Attacks on anonymized graph data

On relational data, a combination of QIDs can be used to identify an individual, as in the linkage attack. In graph-based data, however, several additional elements can be used to uniquely identify an individual: nodes, node properties or attributes, node labels, and links (see also Table 1). Since each of these can be considered private information about an individual, several attacks arise that differ from those on relational health data.

Node attribute disclosure (NAD) attack. This attack is similar to the linkage attack on relational data, where QIDs can be used to re-identify an individual. A node may be linked to an individual based on the set of attributes (such as age and gender) assigned to the node.

Neighborhood (NB) attack. Since nodes are connected in the graph, an adversary with background knowledge about some of the neighbors can uncover the identity of other connected nodes. For example, before the anonymized graph is released, an adversary can create a random subgraph and attach it to a target user. The attack is successful if the attacker can find this subgraph in the released anonymized graph; she can then discover other nodes by following the edges of the subgraph.

Structure-based (ST) attack. The ST attack is an umbrella term for several attacks in which the adversary utilizes unique structural characteristics of the graph. These include the degree attack (using the degree sequence of the graph to uniquely identify an individual), the subgraph attack (the attacker knows the subgraph around the target node), the hub-fingerprint attack (an adversary knowing the distance between a hub and the target node can invade the privacy of the target node), and the walk-based attack (searching over short walks in the graph).

Node existence (NE) attack. Similar to the membership disclosure attack on relational data, the adversary aims to discover the presence or absence of a node in the graph [77]. This attack is usually preceded by a structure-based attack.

5 Reconstruction attack

5.1 Motivation

Tang et al. [97] showed that demographic features such as insurance, marital status, gender, age, and race do not contribute to patient readmission prediction models. From a privacy perspective, this implies that demographics, which are a form of QIDs that can be used to re-identify an individual when combined with other publicly available data, can be safely removed without affecting the prediction result. Similarly, the remaining (non-private) data could then be released without violating patient privacy.

In this paper, we show that this approach to releasing data still poses a threat to privacy. Assume that an adversary has the demographic information of some data records. She can then train a classifier or ML model, using the released data (with the demographic data removed) and the subset of demographic information that she has, to predict the demographics of the entire dataset. That is, the demographic information of the remaining patients, for whom she has no auxiliary data, can be inferred. This implies that demographics do encode some amount of information; in the presence of the other attributes, they merely appear not to be useful because they are correlated with those attributes.

We argue that the assumption that the adversary has access to the demographic information of some of the patients is valid because, for example, some people post their diagnoses or symptoms online. For instance, Guntuku et al. [34] showed that social media can be used for symptom discovery and disease diagnosis. They mapped tweets to states by geolocating all tweets using a combination of location coordinates and user location descriptions. Similarly, demographics can be obtained from social media profiles, which, when mapped to diagnoses, can be used by an adversary to launch an attack. To summarize, we show that removing demographic information from the data does not suffice to ensure privacy.
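The attack idea can be sketched as follows. This is a hedged illustration rather than the exact code in the linked repository; the classifier choice, variable names, and data layout are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# released: the published records with demographics removed (clinical features only)
# aux:      auxiliary demographic values the adversary gathered for a subset of
#           patients (e.g. from social media), indexed by a shared patient index
def reconstruct_demographic(released: pd.DataFrame, aux: pd.Series,
                            target: str = "ethnicity") -> pd.Series:
    known = aux.dropna().index                       # patients the adversary already knows
    unknown = released.index.difference(known)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(released.loc[known], aux.loc[known])     # learn clinical -> demographic mapping
    return pd.Series(clf.predict(released.loc[unknown]), index=unknown, name=target)
```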

5.2 Dataset

The Medical Information Mart for Intensive Care (MIMIC-III) [42] dataset consists of hospital admission data, lab measurements, procedure event recordings, prescriptions, hospital length of stay, diagnostic codes, and microbiological data of 53,467 unique patients. We extracted demographics and clinical features for our readmission prediction experiment using the same subset of the dataset as in [80], while for the length of stay (LOS) prediction we used the same subset as in [97]. We used age, ethnicity, admission type, marital status, insurance, religion, gender, and language as the demographic information, and we grouped age into 5 equally spaced bins. The clinical features used for predicting the corresponding task include the potassium score, arterial blood pressure, albumin score, blood urea nitrogen (BUN) score, creatinine score, sodium score, bicarbonate score, heart rate, systolic blood pressure, temperature, respiratory rate, SpO2, glucose level, coagulation, physiology score (SAPS-II), PaO2/FiO2 score, SIRS, organ failure (SOFA), and acute physiology score (APS-III).

5.3 Methods

We used the ARX tool [84] to implement our privacy models. We first define multiple transformation rules for each attribute, based on generalization and suppression, that fulfill the risk threshold of the privacy model, as shown in Figure 1. This creates a hierarchy of possible solutions (the solution space), as shown in Figure 2. The algorithm then searches the solution space for the optimal solution (yellow node) according to the utility model. The utility model measures the loss of information. It is defined as:

$$IL(T) = \frac{1}{m} \sum_{j=1}^{m} IL(A_j),$$

where $IL(A_j)$ is the information loss for attribute $A_j$, defined as

$$IL(A_j) = \frac{1}{n} \sum_{i=1}^{n} loss(r_i[A_j]),$$

where $m$ is the number of attributes, $n$ is the number of records in the dataset, and $loss(r_i[A_j])$ is the information loss of the value of attribute $A_j$ in record $r_i$. This depends on the attribute type.

For categorical and numeric attributes generalized using a hierarchy:

$$loss(r_i[A_j]) = \frac{leaves(r_i[A_j]) - 1}{leaves(A_j) - 1}.$$

For attributes generalized to intervals:

$$loss(r_i[A_j]) = \frac{U_{ij} - L_{ij}}{\max(A_j) - \min(A_j)},$$

where $leaves(v)$ is the number of leaf nodes below the value $v$ in the hierarchy of attribute $A_j$, $r_i[A_j]$ is the value of attribute $A_j$ in record $r_i$, and $U_{ij}$ and $L_{ij}$ are the upper and lower bounds of the interval, respectively. If all values of a record are suppressed, then $loss = 1$. We select the solution with the highest utility.
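For illustration, one possible instantiation of this hierarchy-based loss metric is sketched below; the exact metric implemented in ARX may differ in details, and all function names are assumptions.

```python
def categorical_loss(value_leaves: int, total_leaves: int) -> float:
    """Loss of a generalized categorical value: 0 for an unmodified leaf value,
    1 for full suppression of the attribute."""
    return (value_leaves - 1) / (total_leaves - 1)

def interval_loss(lower: float, upper: float, attr_min: float, attr_max: float) -> float:
    """Loss of a value generalized to an interval, relative to the attribute range."""
    return (upper - lower) / (attr_max - attr_min)

def table_loss(per_record_losses: list[list[float]]) -> float:
    """Average the per-value losses over all n records and m attributes."""
    n = len(per_record_losses)
    m = len(per_record_losses[0])
    return sum(sum(row) for row in per_record_losses) / (n * m)
```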

We considered two privacy models (k-anonymity and ℓ-diversity) for all our experiments. As defined in Definition 4, t-closeness caters for the distribution of sensitive attributes, which does not affect the results of the machine learning (ML) models; therefore, we exclude this privacy model from our analysis. For the ML model, we used a 3-layer multilayer perceptron (MLP).

Figure 2: Solution space lattice for 100-anonymity when all demographics are anonymized. The highlighted yellow node shows the optimal solution. The numbers indicate the height of the hierarchy tree of each demographic attribute that satisfies the privacy model.

Table: Performance of using the remaining clinical features to predict demographics (reconstruction attack); the reconstruction accuracy is reported for age, ethnicity, marital status, insurance, religion, and gender.

5.4 Results and discussion

Figure 3: Performance of training the ML model for the patient readmission prediction task using only demographic data. Panels: (a) 24 hrs, (b) 48 hrs, (c) 72 hrs, (d) 7 days, and (e) 30 days readmission, and (f) rebounce.

5.4.1 Reconstruction attack: result

Tang et al. [97] observed that demographic information does not affect readmission risk prediction. We reverse the process by first using only the demographic features to predict the risk of readmission; specifically, we ignore the clinical features. As shown in Figure 3, the demographic features alone can still predict readmission with reasonable performance. This implies that there is some correlation between the clinical features and the demographic features. The goal of our reconstruction attack is therefore to reconstruct the non-sanitized demographic features from the anonymized features. Table 5.3 shows the accuracy of the ML model for our reconstruction attack. With high accuracy, demographics such as age, ethnicity, insurance, and gender can be reconstructed. However, marital status and religion are difficult to reconstruct because these attributes were over-generalized by the privacy models.

5.4.2 Effect of generalization and suppression

To quantify the effect of generalization and suppression on the privacy model, we used the length of stay (LOS) prediction task. We first anonymize the original data according to the privacy model and then run the ML model on the anonymized data. As shown in Figure 6, when all features are anonymized by mean generalization, the performance on the anonymized data does not differ significantly from that on the non-sanitized data. However, when the data records that violate the privacy model constraints are completely removed (suppression), the performance drops significantly. To mitigate this effect, we turn to the LKC-privacy model [68], which requires that not all attributes be anonymized but only combinations of a bounded number of attributes. However, since such combinatorics suffers from the curse of dimensionality, we adopted a feature selection method on the non-sanitized data to rank the features and determine the predominant ones (see the sketch below). We then anonymize the predominant features rather than all features. According to the feature ranking, the heart rate measurement is the top feature, while temperature and glucose are the lowest-ranked features. Figure 4 shows the effect of anonymizing temperature and glucose: the performance on the anonymized data does not differ much from that on the non-sanitized data across all privacy models, because these attributes are ranked low (bottom two). As shown in Figure 5, when the top feature is anonymized, the difference between the anonymized and non-sanitized data is significant. These observations show that prediction performance is largely unaffected when the attributes with the least predictive effect are anonymized, while anonymizing only the top features has a major effect on performance. Therefore, attribute-level privacy based on the LKC-privacy model is preferable to anonymizing all attributes, provided that the anonymized attributes correspond to the knowledge the adversary may have.
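The text does not spell out which feature-ranking method was used; a random-forest importance ranking, as sketched below, is one plausible choice. Column names and the regression target are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rank_features(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Rank clinical features by their importance for the prediction task,
    so that only the most predictive ones need to be anonymized."""
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# e.g. ranking = rank_features(clinical_features, length_of_stay)
# top = ranking.index[:3]      # anonymize only these columns
# bottom = ranking.index[-2:]  # low-ranked columns (e.g. temperature, glucose) stay untouched
```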

Figure 4: Performance of the ML model when temperature and glucose are the anonymized attributes on the LOS prediction task.

Figure 5: Performance of the ML model when only the heart rate measurement attribute is anonymized on the LOS prediction task.

Figure 6: The effect of mean generalization and suppression of data that do not meet the privacy model constraints, for k-anonymity with k = 1000.

6 Differential privacy as a defense mechanism

To defend against the attacks on anonymized data discussed in Sections 4.2 and 4.3, differential privacy (DP) was proposed [23]. DP is a mathematical definition of privacy aimed at preserving the privacy of an individual via the addition of noise. The main intuition of DP is the guarantee that the inclusion or exclusion of an individual's record has little effect on the output of the analysis. Consider two datasets D and D′ that differ in at most one data record (neighboring datasets). DP ensures that the outputs of performing an analysis on D and on D′ are nearly indistinguishable. This guarantees that no individual record can be inferred, and DP is thus robust to reconstruction attacks.

Noise addition methods to satisfy DP can be local or global. In local DP, noise is added to each data point in the dataset (either by a dataset curator once the dataset is formed or by the individuals themselves before making their data available to the curator) whereas in global DP, the noise necessary to protect the individual’s privacy is added at the output of the query of the dataset. Generally, global DP can lead to more accurate results when compared to local DP while keeping the same privacy level. However, when using global differential privacy, data owners need to trust the dataset curator to add the necessary noise to preserve their privacy.
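As a minimal example of global DP, the Laplace mechanism for a counting query can be sketched as follows; the epsilon value and the query are purely illustrative.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a counting query under epsilon-DP: a count has sensitivity 1
    (adding or removing one patient changes it by at most 1), so Laplace
    noise with scale 1/epsilon suffices."""
    rng = np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Answering q queries on the same data requires splitting the budget,
# e.g. epsilon/q per query, which is why the noise must grow with the
# number of queries answered (cf. the tracker attack discussed below).
```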

DP for answering queries. DP has been applied to privacy-preserving data mining [27, 93, 1, 88, 66, 86, 69], where the goal is to ensure the indistinguishability of results between any pair of neighboring datasets (differing by one record) by adding noise. An alternative to standard DP based on individual differential privacy was proposed in [93], where utility is preserved by introducing less noise into the query result. This is achieved by assuming that the data curator can use her knowledge of the actual dataset at the time of query response and downwardly adjust the distortion to the actual data. Sarathy and Muralidhar [88] evaluated the effect of adding Laplace noise to queries to ensure privacy. They showed that adding Laplace noise to queries (whose scale depends on the sensitivity of the query function and the privacy budget) can still be vulnerable to a tracker attack, in which an adversary issues multiple queries and uses the responses to uncover one or more observations. To avoid this vulnerability, the noise variance should increase as a function of the number of queries.

Platforms that ensure DP include privacy-integrated queries (PINQ) [66], which provides an SQL-like language to analysts while protecting privacy, and GUPT [69], which distributes the privacy budget among queries based on the desired level of privacy and utility.

DP is a natural fit for query processing rather than data publishing, since the goal of DP is to make the outputs on two neighboring datasets indistinguishable. However, applying DP to data publishing is desirable due to its strong guarantees.

DP for data publishing of health microdata. DP is mainly applied to data publishing by releasing aggregated results based on count queries, contingency tables, or histograms [37, 53]. For example, a contingency table can be created by combining demographics and LOS, where LOS is the sensitive attribute. To achieve DP, noise is added to the counts of the different combinations of demographic attributes and LOS. However, the more attributes there are, the more noise needs to be added to ensure DP, which leads to higher data distortion. Similarly, creating a contingency table for health data with many attributes is difficult in practice. Moreover, since ML models aim at learning the dependencies between several attributes to make predictions, releasing microdata is preferred over contingency tables. Soria-Comas et al. [91] proposed an intuitive approach to achieving DP based on k-anonymity: a microaggregation technique adds noise to the k-anonymous version of the dataset. Zhang et al. [112] proposed a differentially private method for releasing high-dimensional data by constructing a Bayesian network. The Bayesian network models the correlations among the attributes and approximates the distribution of the data using a set of low-dimensional marginals. Noise is then added to each marginal to ensure DP, and the noisy marginals and the Bayesian network are used to construct an approximation of the data distribution. Instead of releasing the dataset of the approximate distribution directly, they first sample records from the approximate distribution to construct a synthetic dataset and then release the synthetic data.
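A noisy contingency-table release of the kind described above can be sketched as follows. Column names and the epsilon value are assumptions; since each patient contributes to exactly one cell, per-cell Laplace noise with scale 1/ε suffices for ε-DP.

```python
import numpy as np
import pandas as pd

def dp_contingency_table(df: pd.DataFrame, attrs: list[str], epsilon: float) -> pd.Series:
    """Release a differentially private contingency table: count every combination
    of the given attributes and add Laplace noise (sensitivity 1 per cell)."""
    rng = np.random.default_rng()
    counts = df.groupby(attrs).size()
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))
    return noisy.clip(lower=0).round()   # post-processing does not weaken the DP guarantee

# e.g. dp_contingency_table(records, ["gender", "age_group", "los_bucket"], epsilon=1.0)
```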

A recent approach to releasing health microdata employs data perturbation techniques to reduce the amount of noise required to satisfy DP [49]. Specifically, generalization, suppression, and insertion of random data (noisy insertion) are first performed on the raw data. Generalization and suppression reduce the amount of noise needed because the number of attribute combinations is reduced. A utility score based on information loss is then assigned to each record, and records with low information loss are selected as candidates to be released. The noisy insertion ensures that the released data satisfies DP and is thus robust to reconstruction attacks.

7 Tools

ARX [84]. ARX is a non-interactive microdata anonymization tool that automates the anonymization process: protected records are created from the records of an input dataset. This is also known as one-time anonymization or "release and forget", where the publisher anonymizes the database and then publishes it, allowing third parties to access the anonymized data. ARX combines several transformation techniques such as sampling, aggregation, suppression, categorization, and generalization. It is open-source software that supports arbitrary combinations of privacy and utility models, making it a generic anonymization tool. Supported privacy models include k-anonymity, ℓ-diversity, t-closeness, δ-disclosure privacy, β-likeness, δ-presence, k-map, and DP.

PySyft [87]. PySyft is an extension built on top of popular deep learning frameworks such as PyTorch and TensorFlow for encrypted, privacy-preserving deep learning. It is a Python library for secure and private deep learning that supports federated learning, DP, and encrypted computation, including multi-party computation and homomorphic encryption. Although PySyft is still in its early years of development, the scalability and privacy protection it offers are very promising.

Anonimatron [7]. Anonimatron is an open-source data anonymization tool written in Java. It can generate surrogate data with the same properties as the original data. It can also generate fake email addresses, fake Roman names, and universally unique identifiers (UUIDs). Anonimatron supports popular DBMS including Oracle, PostgreSQL, and MySQL.

Synthea [101]. Synthea is an open-source synthetic medical data generation tool that models the medical history of patients for research. It produces realistic but not real medical data. Synthea uses the PADARSER framework [22] to generate a skeletal synthetic EHR. As output, Synthea provides Fast Healthcare Interoperability Resources (FHIR), a medical information standard for exchanging EHRs. Synthea generates realistic medical data for fictional patients living in the Commonwealth of Massachusetts; a complete lifetime EHR is generated for each patient from birth to death using publicly available sources such as US Census Bureau demographics and National Institutes of Health reports.

Other tools include the UTD Anonymization Toolbox, the Cornell Anonymization Toolkit, TIAMAT, Amnesia, SECRETA, sdcMicro, and μ-Argus. However, they usually support only a limited set of privacy models or focus on specific privacy and data transformation models.

8 Conclusion

In this paper, we provided a comprehensive review of anonymization models and techniques applicable to relational and graph-based healthcare data. We also studied possible attacks on anonymized data and empirically demonstrated a reconstruction attack on the MIMIC-III dataset. Finally, we discussed existing defense mechanisms and gave an overview of existing anonymization tools. We believe that our review, covering different perspectives on anonymization, will assist researchers and practitioners in selecting relevant anonymization techniques based on the data type, desired privacy level, information loss, and possible adversarial behavior.

Funding

This work is in part funded by the Lower Saxony Ministry of Science and Culture under grant number ZN3491 within the Lower Saxony "Vorab" of the Volkswagen Foundation and supported by the Center for Digital Innovations (ZDIN), and the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor (grant number 01DD20003).

References

  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. Cited by: §6.
  • [2] S. Agarwal and S. Sachdeva (2018) An enhanced method for privacy-preserving data publishing. In Innovations in Computational Intelligence, pp. 61–75. Cited by: Table 2.
  • [3] C. C. Aggarwal (2005) On k-anonymity and the curse of dimensionality. In VLDB, Vol. 5, pp. 901–909. Cited by: §2.3.
  • [4] C. C. Aggarwal (2008) Privacy and the dimensionality curse. In Privacy-Preserving Data Mining, pp. 433–460. Cited by: §3.1.1, §3.1.1.
  • [5] R. Agrawal, A. Evfimievski, and R. Srikant (2003) Information sharing across private databases. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pp. 86–97. Cited by: §3.1.1.
  • [6] A. Andreou, O. Goga, and P. Loiseau (2017) Identity vs. attribute disclosure risks for users with multiple social profiles. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pp. 163–170. Cited by: §3.2.1.
  • [7] Anonimatron (2015) GDPR compliant testing.. Realrolfje. External Links: Link Cited by: §7.
  • [8] P. S. Bearman, J. Moody, and K. Stovel (2004) Chains of affection: the structure of adolescent romantic and sexual networks. American journal of sociology 110 (1), pp. 44–91. Cited by: §3.2.
  • [9] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava (2009) Class-based graph anonymization for social network data. Proceedings of the VLDB Endowment 2 (1), pp. 766–777. Cited by: §3.2.1, Table 2.
  • [10] L. K. Branting, F. Reeder, J. Gold, and T. Champney (2016) Graph analytics for healthcare fraud risk estimation. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 845–851. Cited by: §3.2.
  • [11] E. K. Budiardjo, W. C. Wibowo, H. T. Achsan, et al. (2019) An approach for distributing sensitive values in k-anonymity. In 2019 International Workshop on Big Data and Information Security (IWBIS), pp. 109–114. Cited by: Table 2.
  • [12] E. K. Budiardjo, W. C. Wibowo, et al. (2019) Privacy preserving data publishing with multiple sensitive attributes based on overlapped slicing. Information 10 (12), pp. 362. Cited by: Table 2.
  • [13] J. Byun, A. Kamra, E. Bertino, and N. Li (2007) Efficient k-anonymization using clustering techniques. In International Conference on Database Systems for Advanced Applications, pp. 188–200. Cited by: §3.1.1.
  • [14] J. Byun, Y. Sohn, E. Bertino, and N. Li (2006) Secure anonymization for incremental datasets. In Workshop on secure data management, pp. 48–63. Cited by: §3.1.1, §3.1.1.
  • [15] A. Campan, T. M. Truta, and N. Cooper (2010) P-sensitive k-anonymity with generalization constraints.. Trans. Data Priv. 3 (2), pp. 65–89. Cited by: Table 2.
  • [16] J. Cao, P. Karras, P. Kalnis, and K. Tan (2011) SABRE: a sensitive attribute bucketization and redistribution framework for t-closeness. The VLDB Journal 20 (1), pp. 59–81. Cited by: Table 2.
  • [17] J. Cheng, A. W. Fu, and J. Liu (2010) K-isomorphism: privacy preserving network publication against structural attacks. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp. 459–470. Cited by: §3.2.1, Table 2.
  • [18] Q. Cong, Z. Feng, F. Li, L. Zhang, G. Rao, and C. Tao (2018) Constructing biomedical knowledge graph based on semmeddb and linked open data. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1628–1631. Cited by: §3.2.
  • [19] J. Domingo-Ferrer and V. Torra (2005) Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery 11 (2), pp. 195–212. Cited by: §2.3.
  • [20] X. Dong, A. Halevy, and J. Madhavan (2005) Reference reconciliation in complex information spaces. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 85–96. Cited by: §3.1.1.
  • [21] W. Du, Z. Teng, and Z. Zhu (2008) Privacy-maxent: integrating background knowledge in privacy quantification. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 459–472. Cited by: §3.1.1.
  • [22] K. Dube and T. Gallagher (2013) Approach and method for generating realistic synthetic electronic healthcare records for secondary use. In International Symposium on Foundations of Health Informatics Engineering and Systems, pp. 69–86. Cited by: §7.
  • [23] C. Dwork and A. Roth (2014-08) The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9 (3–4), pp. 211–407. External Links: ISSN 1551-305X, Link, Document Cited by: §1, §6.
  • [24] K. El Emam and F. K. Dankar (2008) Protecting privacy using k-anonymity. Journal of the American Medical Informatics Association 15 (5), pp. 627–637. Cited by: §3.1.1.
  • [25] B. Eze and L. Peyton (2015) Systematic literature review on the anonymization of high dimensional streaming datasets for health data sharing. Procedia Computer Science 63, pp. 348–355. Cited by: §1.3, §2.3.
  • [26] D. Foffano, L. Rossi, and A. Torsello (2019) You can’t see me: anonymizing graphs using the szemerédi regularity lemma. Frontiers in Big Data 2, pp. 7. Cited by: §3.2.1, Table 2.
  • [27] A. Friedman and A. Schuster (2010) Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 493–502. Cited by: §6.
  • [28] B. C. Fung, T. Trojer, P. C. Hung, L. Xiong, K. Al-Hussaeni, and R. Dssouli (2011) Service-oriented architecture for high-dimensional private data mashup. IEEE Transactions on Services Computing 5 (3), pp. 373–386. Cited by: §3.1.1.
  • [29] B. C. Fung, K. Wang, A. W. Fu, and S. Y. Philip (2010) Introduction to privacy-preserving data publishing: concepts and techniques. CRC Press. Cited by: §2.1, §2.
  • [30] B. C. Fung, K. Wang, and S. Y. Philip (2007) Anonymizing classification data for privacy preservation. IEEE transactions on knowledge and data engineering 19 (5), pp. 711–725. Cited by: §3.1.1.
  • [31] T. S. Gal, T. C. Tucker, A. Gangopadhyay, and Z. Chen (2014) A data recipient centered de-identification method to retain statistical attributes. Journal of biomedical informatics 50, pp. 32–45. Cited by: §2.3, §3.1.1.
  • [32] G. Ghinita, Y. Tao, and P. Kalnis (2008) On the anonymization of sparse high-dimensional data. In 2008 IEEE 24th International Conference on Data Engineering, pp. 715–724. Cited by: §3.1.1.
  • [33] A. Gkoulalas-Divanis, G. Loukides, and J. Sun (2014) Publishing data from electronic health records while preserving privacy: a survey of algorithms. Journal of biomedical informatics 50, pp. 4–19. Cited by: §1.3, §3.1.1.
  • [34] S. C. Guntuku, G. Sherman, D. C. Stokes, A. K. Agarwal, E. Seltzer, R. M. Merchant, and L. H. Ungar (2020) Tracking mental health and symptom mentions on twitter during covid-19. Journal of general internal medicine 35 (9), pp. 2798–2800. Cited by: §5.1.
  • [35] A. Gyrard, M. Gaur, S. Shekarpour, K. Thirunarayan, and A. Sheth (2018) Personalized health knowledge graph. In ISWC 2018 Contextualized Knowledge Graph Workshop, Cited by: §3.2.
  • [36] N. Hamza, H. A. Hefny, et al. (2013) Attacks on anonymization-based privacy-preserving: a survey for data mining and data publishing. Scientific Research Publishing. Cited by: §1.3.
  • [37] M. Hardt, K. Ligett, and F. McSherry (2010) A simple and practical algorithm for differentially private data release. arXiv preprint arXiv:1012.4763. Cited by: §6.
  • [38] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis (2008) Resisting structural re-identification in anonymized social networks. Proceedings of the VLDB Endowment 1 (1), pp. 102–114. Cited by: §3.2.1, Table 2.
  • [39] M. Jändel (2014) Decision support for releasing anonymised data. Computers & security 46, pp. 48–61. Cited by: §4.1.
  • [40] S. Ji, P. Mittal, and R. Beyah (2016) Graph data anonymization, de-anonymization attacks, and de-anonymizability quantification: a survey. IEEE Communications Surveys & Tutorials 19 (2), pp. 1305–1326. Cited by: §3.2.
  • [41] H. Jian-min, Y. Hui-qun, Y. Juan, and C. Ting-ting (2008) A complete (α, k)-anonymity model for sensitive values individuation preservation. In 2008 International Symposium on Electronic Commerce and Security, pp. 318–323. Cited by: Table 2.
  • [42] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-Wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific data 3 (1), pp. 1–9. Cited by: §1.1, §5.2.
  • [43] R. Khan, X. Tao, A. Anjum, T. Kanwal, A. Khan, C. Maple, et al. (2020) θ-Sensitive k-anonymity: an anonymization model for IoT based electronic health records. Electronics 9 (5), pp. 716. Cited by: Table 2.
  • [44] P. Kieseberg, B. Malle, P. Frühwirt, E. Weippl, and A. Holzinger (2016) A tamper-proof audit and control system for the doctor in the loop. Brain Informatics 3 (4), pp. 269–279. Cited by: §2.3.
  • [45] D. Kifer and J. Gehrke (2006) Injecting utility into anonymized datasets. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pp. 217–228. Cited by: §2.3.
  • [46] J. W. Kim, B. Jang, and H. Yoo (2018) Privacy-preserving aggregation of personal health data streams. PloS one 13 (11), pp. e0207639. Cited by: §3.1.1.
  • [47] S. Kim, M. K. Sung, and Y. D. Chung (2014) A framework to preserve the privacy of electronic health data streams. Journal of biomedical informatics 50, pp. 95–106. Cited by: §3.1.1.
  • [48] S. Kisilevich, L. Rokach, Y. Elovici, and B. Shapira (2009) Efficient multidimensional suppression for k-anonymity. IEEE Transactions on Knowledge and Data Engineering 22 (3), pp. 334–347. Cited by: §3.1.1.
  • [49] H. Lee and Y. D. Chung (2020) Differentially private release of medical microdata: an efficient and practical approach for preserving informative attribute values. BMC Medical Informatics and Decision Making 20 (1), pp. 1–15. Cited by: §6.
  • [50] H. Lee, S. Kim, J. W. Kim, and Y. D. Chung (2017) Utility-preserving anonymization for health data publishing. BMC medical informatics and decision making 17 (1), pp. 1–12. Cited by: §2.3, §2.3.
  • [51] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan (2006) Mondrian multidimensional k-anonymity. In 22nd International conference on data engineering (ICDE’06), pp. 25–25. Cited by: §3.1.1.
  • [52] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan (2006) Workload-aware anonymization. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 277–286. Cited by: §3.1.1.
  • [53] H. Li, L. Xiong, X. Jiang, and J. Liu (2015) Differentially private histogram publication for dynamic datasets: an adaptive sampling approach. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1001–1010. Cited by: §6.
  • [54] J. Li, J. Liu, M. Baig, and R. C. Wong (2011) Information based data anonymization for classification utility. Data & Knowledge Engineering 70 (12), pp. 1030–1045. Cited by: §3.1.1.
  • [55] N. Li, T. Li, and S. Venkatasubramanian (2007) T-closeness: privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering, pp. 106–115. Cited by: §2.2.3.
  • [56] N. Li, T. Li, and S. Venkatasubramanian (2009) Closeness: a new privacy measure for data publishing. IEEE Transactions on Knowledge and Data Engineering 22 (7), pp. 943–956. Cited by: Table 2.
  • [57] T. Li, N. Li, J. Zhang, and I. Molloy (2010) Slicing: a new approach for privacy preserving data publishing. IEEE transactions on knowledge and data engineering 24 (3), pp. 561–574. Cited by: §2.3.
  • [58] T. Li and N. Li (2008) Injector: mining background knowledge for data anonymization. In 2008 IEEE 24th International Conference on Data Engineering, pp. 446–455. Cited by: §3.1.1.
  • [59] J. Liu, E. Bier, A. Wilson, J. A. Guerra-Gomez, T. Honda, K. Sricharan, L. Gilpin, and D. Davies (2016) Graph analysis for detecting fraud, waste, and abuse in healthcare data. AI Magazine 37 (2), pp. 33–46. Cited by: §3.2.
  • [60] K. Liu and E. Terzi (2008) Towards identity anonymization on graphs. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 93–106. Cited by: §3.2.1, Table 2.
  • [61] L. Liu, Q. Dou, H. Chen, I. E. Olatunji, J. Qin, and P. Heng (2018) Mtmr-net: multi-task deep learning with margin ranking loss for lung nodule analysis. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 74–82. Cited by: §1.
  • [62] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam (2007) L-diversity: privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD) 1 (1), pp. 3–es. Cited by: §2.2.2.
  • [63] A. Majeed and S. Lee (2020) Anonymization techniques for privacy preserving data publishing: a comprehensive survey. IEEE Access. Cited by: §1.3, §1.3, §3.2.1.
  • [64] A. Majeed, F. Ullah, and S. Lee (2017) Vulnerability-and diversity-aware anonymization of personally identifiable information for improving user privacy and utility of publishing data. Sensors 17 (5), pp. 1059. Cited by: §3.1.1.
  • [65] A. Majeed (2019) Attribute-centric anonymization scheme for improving user privacy and utility of publishing e-health data. Journal of King Saud University-Computer and Information Sciences 31 (4), pp. 426–435. Cited by: §3.1.1.
  • [66] F. D. McSherry (2009) Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 19–30. Cited by: §6, §6.
  • [67] E. Mohammadian, M. Noferesti, and R. Jalili (2014) FAST: fast anonymization of big data streams. In Proceedings of the 2014 International Conference on Big Data Science and Computing, pp. 1–8. Cited by: §3.1.1.
  • [68] N. Mohammed, B. C. Fung, P. C. Hung, and C. Lee (2010) Centralized and distributed anonymization for high-dimensional healthcare data. ACM Transactions on Knowledge Discovery from Data (TKDD) 4 (4), pp. 1–33. Cited by: §3.1.1, §5.4.2.
  • [69] P. Mohan, A. Thakurta, E. Shi, D. Song, and D. Culler (2012) GUPT: privacy preserving data analysis made easy. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 349–360. Cited by: §6, §6.
  • [70] R. Mortazavi and S. Erfani (2020) GRAM: an efficient (k, l) graph anonymization method. Expert Systems with Applications 153, pp. 113454. Cited by: §3.2.1, Table 2.
  • [71] R. Mortazavi and S. Jalili (2014) Fast data-oriented microaggregation algorithm for large numerical datasets. Knowledge-Based Systems 67, pp. 195–205. Cited by: §3.1.1.
  • [72] A. Narayanan and V. Shmatikov (2008) Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pp. 111–125. Cited by: §3.1.1.
  • [73] M. E. Nergiz, C. Clifton, and A. E. Nergiz (2008) Multirelational k-anonymity. IEEE Transactions on Knowledge and Data Engineering 21 (8), pp. 1104–1117. Cited by: §3.1.1.
  • [74] M. E. Nergiz and M. Z. Gök (2014) Hybrid k-anonymity. Computers & security 44, pp. 51–63. Cited by: §2.3.
  • [75] A. Oganian and J. Domingo-Ferrer (2001) On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe 18 (4), pp. 345–353. Cited by: §2.3.
  • [76] K. Oishi, Y. Sei, Y. Tahara, and A. Ohsuga (2020) Semantic diversity: privacy considering distance between values of sensitive attribute. Computers & Security 94, pp. 101823. Cited by: Table 2.
  • [77] I. E. Olatunji, W. Nejdl, and M. Khosla (2021) Membership inference attack on graph neural networks. arXiv preprint arXiv:2101.06570. Cited by: §4.3.
  • [78] A. Otgonbayar, Z. Pervez, K. Dahal, and S. Eager (2018) K-varp: k-anonymity for varied data streams via partitioning. Information Sciences 467, pp. 238–255. Cited by: §3.1.1.
  • [79] A. Otgonbayar, Z. Pervez, and K. Dahal (2019) X-BAND: expiration band for anonymizing varied data streams. IEEE Internet of Things Journal 7 (2), pp. 1438–1450. Cited by: §3.1.1.
  • [80] A. Pakbin, P. Rafi, N. Hurley, W. Schulz, M. H. Krumholz, and J. B. Mortazavi (2018) Prediction of icu readmissions using data at patient discharge. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4932–4935. Cited by: §1, §5.2.
  • [81] E. Parliament and C. of the European Union (2016) General data protection regulation. Official Journal of the European Union. External Links: Link Cited by: §1.
  • [82] J. Pei, J. Xu, Z. Wang, W. Wang, and K. Wang (2007) Maintaining k-anonymity against incremental updates. In 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), pp. 5–5. Cited by: §3.1.1.
  • [83] G. Poulis, G. Loukides, S. Skiadopoulos, and A. Gkoulalas-Divanis (2017) Anonymizing datasets with demographics and diagnosis codes in the presence of utility constraints. Journal of Biomedical Informatics 65, pp. 76–96. External Links: ISSN 1532-0464, Document Cited by: §3.1.1, Table 2.
  • [84] F. Prasser, J. Eicher, H. Spengler, R. Bild, and K. A. Kuhn (2020) Flexible data anonymization using arx—current status and challenges ahead. Software: Practice and Experience 50 (7), pp. 1277–1304. Cited by: §5.3, §7.
  • [85] Y. Qu, J. Xu, and S. Yu (2017) Privacy preserving in big data sets through multiple shuffle. In Proceedings of the Australasian Computer Science Week Multiconference, pp. 1–8. Cited by: Table 2.
  • [86] I. Roy, S. T. Setty, A. Kilzer, V. Shmatikov, and E. Witchel (2010) Airavat: security and privacy for mapreduce.. In NSDI, Vol. 10, pp. 297–312. Cited by: §6.
  • [87] T. Ryffel, A. Trask, M. Dahl, B. Wagner, J. Mancuso, D. Rueckert, and J. Passerat-Palmbach (2018) A generic framework for privacy preserving deep learning. arXiv preprint arXiv:1811.04017. Cited by: §7.
  • [88] R. Sarathy and K. Muralidhar (2011) Evaluating laplace noise addition to satisfy differential privacy for numeric data.. Trans. Data Priv. 4 (1), pp. 1–17. Cited by: §6.
  • [89] Y. Sei, H. Okumura, T. Takenouchi, and A. Ohsuga (2017) Anonymization of sensitive quasi-identifiers for l-diversity and t-closeness. IEEE transactions on dependable and secure computing 16 (4), pp. 580–593. Cited by: Table 2.
  • [90] L. Shi, S. Li, X. Yang, J. Qi, G. Pan, and B. Zhou (2017) Semantic health knowledge graph: semantic integration of heterogeneous medical knowledge and services. BioMed research international 2017. Cited by: §3.2.
  • [91] J. Soria-Comas, J. Domingo-Ferrer, D. Sánchez, and S. Martínez (2014) Enhancing data utility in differential privacy via microaggregation-based k-anonymity. The VLDB Journal 23 (5), pp. 771–794. Cited by: §6.
  • [92] J. Soria-Comas, J. Domingo-Ferrer, D. Sanchez, and S. Martinez (2015) T-closeness through microaggregation: strict privacy with enhanced utility preservation. IEEE Transactions on Knowledge and Data Engineering 27 (11), pp. 3098–3110. Cited by: Table 2.
  • [93] J. Soria-Comas, J. Domingo-Ferrer, D. Sánchez, and D. Megías (2017) Individual differential privacy: a utility-preserving formulation of differential privacy guarantees. IEEE Transactions on Information Forensics and Security 12 (6), pp. 1418–1429. Cited by: §6.
  • [94] X. Sun, L. Sun, and H. Wang (2011) Extended k-anonymity models against sensitive attribute disclosure. Computer Communications 34 (4), pp. 526–535. Cited by: Table 2.
  • [95] L. Sweeney (2000) Simple demographics often identify people uniquely. Health (San Francisco) 671 (2000), pp. 1–34. Cited by: §1.1, §2.1.
  • [96] L. Sweeney (2002) Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (05), pp. 571–588. Cited by: §2.3.
  • [97] F. Tang, C. Xiao, F. Wang, and J. Zhou (2018) Predictive modeling in urgent care: a comparative study of machine learning approaches. Jamia Open 1 (1), pp. 87–98. Cited by: §1, §5.1, §5.2, §5.4.1.
  • [98] Y. Tao, X. Xiao, J. Li, and D. Zhang (2008) On anti-corruption privacy preserving publication. In 2008 IEEE 24th International Conference on Data Engineering, pp. 725–734. Cited by: §3.1.1.
  • [99] B. Thompson and D. Yao (2009) The union-split algorithm and cluster-based anonymization of social networks. In Proceedings of the 4th International Symposium on Information, Computer, and Communications Security, pp. 218–227. Cited by: §3.2.1, Table 2.
  • [100] H. Tian and W. Zhang (2011) Extending l-diversity to generalize sensitive data. Data & Knowledge Engineering 70 (1), pp. 101–126. Cited by: Table 2.
  • [101] J. Walonoski, M. Kramer, J. Nichols, A. Quina, C. Moesel, D. Hall, C. Duffett, K. Dube, T. Gallagher, and S. McLachlan (2017-08) Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association 25 (3), pp. 230–238. External Links: ISSN 1527-974X, Document, Link Cited by: §7.
  • [102] R. Wang, Y. Zhu, T. Chen, and C. Chang (2018) Privacy-preserving algorithms for multiple sensitive attributes satisfying t-closeness. Journal of Computer Science and Technology 33 (6), pp. 1231–1242. Cited by: Table 2.
  • [103] R. C. Wong, A. W. Fu, K. Wang, and J. Pei (2007) Minimality attack in privacy preserving data publishing. In Proceedings of the 33rd international conference on Very large data bases, pp. 543–554. Cited by: §3.1.1.
  • [104] R. C. Wong, J. Li, A. W. Fu, and K. Wang (2006) (α, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 754–759. Cited by: Table 2.
  • [105] X. Wu, Y. Wei, T. Jiang, Y. Wang, and S. Jiang (2019) A micro-aggregation algorithm based on density partition method for anonymizing biomedical data. Current Bioinformatics 14 (7), pp. 667–675. Cited by: Table 2.
  • [106] Y. Wu, Z. Sun, and X. Wang (2009) Privacy preserving k-anonymity for re-publication of incremental datasets. In 2009 WRI World Congress on Computer Science and Information Engineering, Vol. 4, pp. 53–60. Cited by: §3.1.1.
  • [107] Y. Xiao and H. Li (2020) Privacy preserving data publishing for multiple sensitive attributes based on security level. Information 11 (3), pp. 166. Cited by: Table 2.
  • [108] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W. Fu (2006) Utility-based anonymization using local recoding. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 785–790. Cited by: §2.3, §3.1.1.
  • [109] T. Yu, J. Li, Q. Yu, Y. Tian, X. Shun, L. Xu, L. Zhu, and H. Gao (2017) Knowledge graph for tcm health preservation: design, construction, and applications. Artificial intelligence in medicine 77, pp. 48–52. Cited by: §3.2.
  • [110] M. Yuan, L. Chen, and P. S. Yu (2010) Personalized privacy protection in social networks. Proceedings of the VLDB Endowment 4 (2), pp. 141–150. Cited by: §3.2.1, Table 2.
  • [111] A. Zaman, C. Obimbo, and R. A. Dara (2016) A novel differential privacy approach that enhances classification accuracy. In Proceedings of the Ninth International C* Conference on Computer Science & Software Engineering, pp. 79–84. Cited by: §3.1.1.
  • [112] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao (2017) Privbayes: private data release via bayesian networks. ACM Transactions on Database Systems (TODS) 42 (4), pp. 1–41. Cited by: §6.
  • [113] L. Zhang, J. Xuan, R. Si, and R. Wang (2017) An improved algorithm of individuation k-anonymity for multiple sensitive attributes. Wireless Personal Communications 95 (3), pp. 2003–2020. Cited by: Table 2.
  • [114] Q. Zhang, N. Koudas, D. Srivastava, and T. Yu (2007) Aggregate query answering on anonymized tables. In 2007 IEEE 23rd international conference on data engineering, pp. 116–125. Cited by: Table 2.
  • [115] E. Zheleva and L. Getoor (2007) Preserving the privacy of sensitive relationships in graph data. In International workshop on privacy, security, and trust in KDD, pp. 153–171. Cited by: §3.2.1.
  • [116] B. Zhou and J. Pei (2008) Preserving privacy in social networks against neighborhood attacks. In 2008 IEEE 24th International Conference on Data Engineering, pp. 506–515. Cited by: §3.2.1, Table 2.
  • [117] H. Zhu, S. Tian, and K. Lü (2015) Privacy-preserving data publication with features of independent l-diversity. The Computer Journal 58 (4), pp. 549–571. Cited by: Table 2.
  • [118] L. Zou, L. Chen, and M. T. Özsu (2009) K-automorphism: a general framework for privacy preserving network publication. Proceedings of the VLDB Endowment 2 (1), pp. 946–957. Cited by: §3.2.1, Table 2.