Real-world K-Anonymity Applications: the KGen approach and its evaluation in Fraudulent Transactions

04/01/2022
by Daniel De Pascale, et al.

K-Anonymity is a property for the measurement, management, and governance of data anonymization. Many implementations of k-anonymity have been described in the state of the art, but most of them cannot handle a large number of attributes in a "Big" dataset, i.e., a dataset drawn from Big Data. To address this significant shortcoming, we introduce and evaluate KGen, an approach to k-anonymity featuring Genetic Algorithms. KGen adopts such a meta-heuristic approach because it can find a pseudo-optimal solution in a reasonable time over a considerable load of input. KGen allows the data manager to guarantee a high anonymity level while preserving usability and preventing loss of information entropy over the data. Differently from other approaches that provide optimal global solutions catered for small datasets, KGen also works properly over Big datasets while still providing a good-enough solution. Evaluation results show how our approach works efficiently on a real-world dataset, provided by the Dutch Tax Authority, with 47 attributes (i.e., the columns of the dataset to be anonymized) and over 1.5K observations (i.e., the rows of that dataset), as well as on a dataset with 97 attributes and 3942 observations.


1 Introduction

The amount of data being produced and processed, both online and offline, is increasing exponentially, and so is the costly consumption of resources required to carry such processing to fruition. On the one hand, maintaining data anonymity is a must-have, especially in light of the severe sanctions connected to potential violations of the General Data Protection Regulation ArfeltBD19. On the other hand, many agencies want or need to exploit such data for commercial purposes or for public safety and security, implying that the data should be usable.

It is, hence, fundamental to provide stakeholders with fast and reliable techniques that guarantee the privacy and anonymity of the data and, at the same time, maintain the data's usefulness. This paper introduces and evaluates KGen, a metaheuristic-based approach to state-of-the-art privacy-preserving technologies.

The process starts with a dataset and, through an anonymization process, produces an anonymized dataset. At the core of KGen is the most widely known anonymization approach, k-anonymity samarati2001microdata. K-anonymity is defined as the condition wherefore, for each record in a dataset, there are at least k-1 other records indistinguishable from it.

The K-anonymity property is classified as an NP-Hard problem, as proved by Meyerson et al. meyerson2004complexity. Aggarwal aggarwal2005k shows the problems raised by any k-anonymity algorithm applied to large datasets. The information loss of a dataset also depends on its size: as the size of a dataset increases, so does its information loss, leading to a dataset that is useless despite its higher level of anonymization.

Though it is not possible to anonymize a large dataset without loss of information, with KGen we aim to provide a dataset anonymized according to the K-Anonymity property. In the scope of KGen, k-anonymity needs to be traded off against the usefulness of the data. While several algorithms address this problem by providing an optimal solution samarati2001microdata; sweeney1997guaranteeing; sweeney2002achieving; el2009globally; lefevre2005incognito, all known approaches merely work on a relatively small number of attributes with a reduced level of generalization for each attribute. As the number of attributes that need to be anonymized grows, so does the complexity of obtaining a usable dataset.

To account for the trade-off mentioned above, KGen features an approach based on Genetic Algorithms goldberg1988genetic that provides a pseudo-optimal solution in a time useful for practical usage (in the results of this work, the maximum time reached is 2 hours for the dataset with 15 attributes). We compared KGen with other approaches from the state of the art in order to validate its results.

The main goal of this work is to provide an approach useful in an industrial context. To this end, we defined the following research question:

Main RQ: Is the performance of the proposed approach useful for stakeholders?

To answer the main research question, we outlined three subsequent research questions:

  1. How does KGen perform when compared to state-of-the-art approaches? To address this RQ, we first compared our approach to existing ones by means of the execution time needed to generate the best-anonymized dataset.

  2. How accurate are KGen solutions compared to state-of-the-art approaches? To answer this question, we proposed an accuracy measure that quantifies how far the pseudo-optimal solution is from the optimal solution.

  3. What is the quality of KGen solutions? We measured the quality of a solution using generalization and suppression metrics defined in the state of the art and discussed in Sec. 2.3.

Moreover, to evaluate the applicability in a larger-scale scenario, we outlined a follow-up main research question:

Main RQ: To what extent can the case-specific evaluation generalise to much larger datasets?

Therefore, in order to evaluate KGen in an industrial context, the approach was applied to a real-world sample dataset on fraudulent transactions provided by the Dutch Tax Authority. The evaluation aims at accounting for KGen's real-life applicability. Moreover, we conducted a second experiment, using the "c2k_data_comma.csv" dataset cargo2000dataset, to prove the applicability of the approach on a large dataset. The experimentation was done using OLA el2009globally, a state-of-the-art approach for dataset k-anonymization, a brute-force approach, and a meta-heuristic random approach to evaluate the goodness of KGen. The experimentation reveals promising results and shows that KGen is capable of providing a good-enough solution in less than 5h:05m:40s (the worst case recorded, with the "c2k_data_comma.csv" dataset and 25 quasi-identifier attributes to anonymize). KGen was able to find results for up to 25 attributes to anonymize under the time limit of 15 hours, differently from other approaches that provided results for up to 7 attributes in much more time. Moreover, KGen preserves the quality of the data, a critical feature in order to keep the dataset qualitatively usable.

From a software and information systems engineering perspective, the concrete usage of our proposed method KGen is twofold: (a) privacy-aware data-intensive applications GuerrieroTRMBA17; GuerrieroTN18 could be designed using KGen as a middleware to automatically anonymize datasets before processing; (b) compliance officers can use KGen to experiment with processed and non-processed data to quantify the extent of privacy "damage" carried out by data processors.

The remaining part of the paper is organized as follows. Section 2 introduces the state of the art of the anonymization process and the main works related to anonymization. Sec. 3 introduces KGen, explaining all its components. Sec. 4 outlines the research design of the work: it describes the datasets used for the experimentation, the metrics used to evaluate the RQs illustrated above, and the algorithms used for the comparison study. The results of this work are shown in Sec. 5. Sec. 6 contains the discussion of the results obtained in Sec. 5. Sec. 7 discusses the threats to validity of KGen. Lastly, Section 8 summarizes the main contributions of KGen and sketches future research directions.

2 Background and related work

Figure 1: Example of lattice (Age-Postcode-Gender). Each node contains a possible level of generalization for each attribute and is connected to the other nodes that can be reached by increasing or decreasing a single level of generalization of the given node by one.

This section is organized in three main subsections: the first one describes the anonymization process to allow a better understanding of the purposes behind this work; the second subsection explains what a genetic algorithm is, laying the technical foundations behind the metaheuristic underlying KGen; the third, finally, showcases the known k-anonymity implementations in the state of the art to which KGen can be compared.

2.1 Anonymization

The anonymization process starts from a given dataset and generates an anonymous dataset. A dataset is composed of multiple observations with several different attributes. From a privacy perspective, there are two different kinds of attributes in any dataset samarati2001microdata:

  • Identifiers. An Identifier attribute can uniquely identify a row in the dataset. In the anonymization process, these attributes are suppressed (this process is explained in more depth in the next section).

  • Quasi Identifiers. Quasi-identifiers are the attributes that can be superimposed with external information to reveal an individual's identity dalenius1986finding. Examples of common quasi-identifiers are el2009evaluating; el2006evaluating; el2007pan; canadian2005cihr: dates (such as birth, death, admission, discharge, visit, and specimen collection), locations (such as postal codes, hospital names, and regions), race, ethnicity, languages spoken, aboriginal status, and gender.

During the anonymization process, the data is changed by removing or suppressing all identifiers samarati2001microdata. This is essential to prevent reverting to the original dataset, which would nullify the anonymization process. Stemming from this assumption, the only data that needs to be (partially) anonymized, while simultaneously ensuring the highest amount of information usability possible, are the quasi-identifiers.

Therefore, the central part of the anonymization process revolves around two main factors: (1) the anonymization of the quasi-identifier attributes and (2) finding the optimal trade-off between anonymity and usefulness, i.e., making it hard to uniquely identify rows in the dataset by removing information while keeping as much of the data intact as possible. In turn, the usability of the dataset can be measured using loss-of-information metrics el2009globally. The metrics used to evaluate the goodness of a possible k-anonymous dataset are explained below.

2.2 K-Anonymity

Name Age Gender Postcode Crime
Alice 24 F 80015 Assault
Max 28 M 80019 Kidnapping
Laurel 42 F 85073 Homicide
Frank 49 M 85071 Rape
Table 1: Original dataset. The attribute Name is an Identifier, while Age, Gender and Postcode are Quasi-Identifiers.
Name Age Gender Postcode Crime
***** 20 - 30 P 8001* Assault
***** 20 - 30 P 8001* Kidnapping
***** 40 - 50 P 8507* Homicide
***** 40 - 50 P 8507* Rape
Table 2: K-anonymized dataset. Considering the QIs, each row is indistinguishable from at least one other row, so the dataset is k-anonymized with k = 2.

To guarantee anonymity, KGen harnesses the concept of k-anonymity samarati2001microdata. A dataset is called k-anonymous if every single row is indistinguishable from at least k-1 other rows in the dataset.

Definition: Let T be a table and QI be the set of all quasi-identifiers of that table. T is said to be k-anonymous if, for each row of T, there are at least k-1 other rows equal to that row on QI (for a total of k indistinguishable rows).

Table 2 shows an example of anonymization of the dataset in Table 1. The quasi-identifiers have been generalized in order to guarantee anonymity. By applying different levels of generalization to the quasi-identifier attributes, it is possible to guarantee anonymization while retaining a certain degree of usability of the same dataset. Table 2, for example, shows a k-anonymous dataset with a level of k = 2.
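To make the definition concrete, the fragment below sketches one way to check a table for k-anonymity by grouping rows on their quasi-identifier values. It is a minimal illustration of the property (class and method names are ours), not the check used inside KGen, which relies on the support-map technique of Sec. 3.4.1.

    import java.util.*;

    public class KAnonymityCheck {
        /**
         * Returns true if every combination of quasi-identifier values
         * appears in at least k rows of the table.
         *
         * @param rows  the table, one String[] per row
         * @param qiIdx column indexes of the quasi-identifier attributes
         * @param k     the required anonymity level
         */
        static boolean isKAnonymous(List<String[]> rows, int[] qiIdx, int k) {
            Map<String, Integer> groupSizes = new HashMap<>();
            for (String[] row : rows) {
                StringBuilder key = new StringBuilder();
                for (int i : qiIdx) {
                    key.append(row[i]).append('|');   // build the QI tuple key
                }
                groupSizes.merge(key.toString(), 1, Integer::sum);
            }
            // every equivalence class must contain at least k rows
            return groupSizes.values().stream().allMatch(size -> size >= k);
        }

        public static void main(String[] args) {
            // the anonymized dataset of Table 2 (Age, Gender, Postcode are QIs)
            List<String[]> rows = List.of(
                new String[]{"*****", "20-30", "P", "8001*", "Assault"},
                new String[]{"*****", "20-30", "P", "8001*", "Kidnapping"},
                new String[]{"*****", "40-50", "P", "8507*", "Homicide"},
                new String[]{"*****", "40-50", "P", "8507*", "Rape"});
            System.out.println(isKAnonymous(rows, new int[]{1, 2, 3}, 2)); // true
        }
    }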

2.3 K-Anonymity operators

(a) Generalization: POSTCODE
(b) Generalization: AGE
Figure 2: Generalization hierarchy of two quasi-identifier attributes.

As mentioned before, the anonymization process revolves around the anonymization of attributes. The state of the art offers several approaches, mainly revolving around four different anonymization techniques, namely generalization, suppression, anatomization and perturbation samarati2001microdata; fung2010privacy.

  • Generalization. Given an attribute, its level of anonymity can be represented as a hierarchy (Fig. 2). The higher the level of generalization of an attribute, the more the dataset is generalized, ensuring a high level of anonymization and a correspondingly low level of usability.

  • Suppression. If a dataset is not k-anonymized only because a single row prevents the k-anonymity condition from being satisfied, it is possible to suppress that single row to obtain a k-anonymized dataset.

  • Anatomization. Unlike generalization and suppression, the anatomization operator does not work on QIs and sensitive data directly, but on the relationship between them. The operator splits the QIs and the sensitive data into two different tables. To preserve the relationship between the two groups, each table has a common attribute, groupID; all rows in the same group have the same groupID fung2010privacy.

  • Perturbation. Perturbation replaces the original values with synthetic data. The new records generated do not correspond to real-world records. In this way, it is not possible for an attacker to recover sensitive data starting from the published data.

KGen uses only the generalization and suppression operators because the state-of-the-art approach chosen for the comparison study in this work uses only these two operators.

Generalization works on all the values of a single attribute: no rows are removed, but the entire dataset is modified. Conversely, suppression works at a local level, removing entire rows while leaving the remaining data unchanged samarati2001microdata.

In both cases, it is always possible to compute the generalization hierarchy of all the attributes, represented as a lattice (i.e., a repeating arrangement of points, see Fig. 1) el2009globally. A node of the lattice represents a possible anonymized dataset, containing the level of generalization of each quasi-identifier attribute. The lattice shown in Fig. 1 is the representation of all possible configurations of the dataset in Tab. 1. The minimum node of the lattice represents the dataset with all quasi-identifier attributes not anonymized (node 000 of Fig. 1); the maximum node, instead, represents a completely anonymized dataset, because it contains the maximum level of generalization of each quasi-identifier attribute (node 341 of Fig. 1). Each arrow represents a possible generalization step through the lattice. Thus, the height of the lattice equals the number of steps necessary to reach the maximum node from the minimum node, increasing the level of generalization of one quasi-identifier attribute at a time. Climbing up the lattice yields a higher level of anonymization of the dataset but a lower utility (this concept is explained in Sec. 2.4).

Every path from the minimum node to the maximum node is called a strategy path. For example, in Fig. 1 the path [000, 001, 011, 021, 031, 041, 141, 241, 341] is a strategy path.

All strategy paths share the same starting node (the minimum node of the lattice) and final node (the maximum node of the lattice). As explained before, since the maximum node represents a completely anonymized dataset, every strategy path contains at least one k-anonymized node. In the lattice, every node could represent a k-anonymized dataset and, among these, only one represents the optimal global solution. The goal of k-anonymity is to find it in a reasonable time.
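To illustrate how such a lattice can be navigated programmatically, the sketch below represents a node as an array of generalization levels and enumerates its direct successors (the nodes reachable by raising a single attribute's level by one). The class is illustrative and not part of the KGen prototype.

    import java.util.*;

    /** A lattice node: one generalization level per quasi-identifier. */
    class LatticeNode {
        final int[] levels;          // e.g. {0, 0, 0} for the minimum node of Fig. 1
        LatticeNode(int[] levels) { this.levels = levels.clone(); }

        /** Nodes reachable by increasing exactly one attribute's level by one. */
        List<LatticeNode> successors(int[] maxLevels) {
            List<LatticeNode> next = new ArrayList<>();
            for (int i = 0; i < levels.length; i++) {
                if (levels[i] < maxLevels[i]) {
                    int[] copy = levels.clone();
                    copy[i]++;
                    next.add(new LatticeNode(copy));
                }
            }
            return next;
        }

        /** Height of the node: the sum of its generalization levels. */
        int height() { return Arrays.stream(levels).sum(); }

        @Override public String toString() { return Arrays.toString(levels); }
    }

For the lattice of Fig. 1 (maximum levels {3, 4, 1}), the successors of the minimum node [0, 0, 0] are [1, 0, 0], [0, 1, 0] and [0, 0, 1].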

2.4 Measuring Loss of information

Using generalization and suppression, all datasets in the lattice are possible solutions. KGen prefers one dataset over another by selecting the one whose information remains most useful after generalization. A dataset with more generalization or more suppression has less information and, hence, lower usability. KGen measures the usability of an input dataset using information-loss metrics. The most significant metrics for information loss are outlined below; subsequently, the one adopted by KGen is selected and illustrated.

One metric for the level of information loss was proposed by Samarati samarati2001microdata. The idea is to take the k-anonymous node with the minimum height in the lattice. So, for example, if in the lattice shown in Fig. 1 nodes 100 and 001 are both k-anonymized, under this metric they have the same level of information loss because they have the same height in the lattice. However, the lattice height is not a helpful metric, since it does not consider each attribute's maximum level of generalization. In the previous example, the first node has only the first attribute generalized, at level 1 out of its hierarchy height, whereas the second node has the last attribute completely anonymized; yet, under this first metric, they have the same level of information loss. Sweeney in sweeney2002achieving and sweeney2001computational also takes into consideration the level of generalization of each attribute. The aim is to evaluate, for each attribute, its level of generalization, called "precision", using this formula:

prec_i = log_i / Hlog_i                                      (1)

where log_i is the actual level of generalization of the i-th quasi-identifier, Hlog_i is the height of the generalization hierarchy of the i-th quasi-identifier, and N is the total number of quasi-identifier attributes in the dataset. Hence, the level of generalization of a single node is given by the average of all the precision values calculated:

Prec(node) = (1/N) * Σ_{i=1..N} (log_i / Hlog_i)             (2)

For example, the node [1, 0, 0], representing the attributes Age/Postcode/Gender with generalization hierarchy heights of 3, 4 and 1 respectively, has a precision level of (1/3 + 0/4 + 0/1) / 3 = 0.11. Instead, the node [0, 0, 1] has a precision level of (0/3 + 0/4 + 1/1) / 3 = 0.33. With this metric, both the node position in the lattice and the level of generalization of each attribute are taken into account. KGen uses this information metric to find the dataset with the most information and the highest anonymization concurrently.
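The metric itself is simple to compute; the following fragment is a minimal sketch of Formula 2, assuming a node is given as an array of generalization levels together with the hierarchy heights of the quasi-identifiers.

    public class Precision {
        /** Average over all quasi-identifiers of level / hierarchy height (Formula 2). */
        static double precision(int[] levels, int[] hierarchyHeights) {
            double sum = 0.0;
            for (int i = 0; i < levels.length; i++) {
                sum += (double) levels[i] / hierarchyHeights[i];
            }
            return sum / levels.length;
        }

        public static void main(String[] args) {
            int[] heights = {3, 4, 1};                                  // Age, Postcode, Gender (Fig. 1)
            System.out.println(precision(new int[]{1, 0, 0}, heights)); // ≈ 0.111
            System.out.println(precision(new int[]{0, 0, 1}, heights)); // ≈ 0.333
        }
    }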

2.5 K-Anonymity Complexity

Different works prove that optimal k-anonymization is an NP-Hard problem. Meyerson et al. meyerson2004complexity provide a proof of the complexity classification of the problem, showing that not only k-anonymization itself is NP-Hard, but also k-anonymization with suppression of different attributes is NP-Hard.

Aggarwal aggarwal2005k shows that the k-anonymity complexity is highly dependent on the size of the problem and that it is impossible to apply the k-anonymity property to a dataset with many quasi-identifier attributes while keeping an acceptable level of information loss.

Sun et al. sun2008complexity introduce two variants of the k-anonymization problem, the Restricted k-anonymity problem and the Restricted k-anonymity problem on attributes. They proved that both of them are NP-Hard for k ≥ 3 but, on the positive side, they developed a polynomial solution for the k-anonymization problem with k = 2.

2.6 Genetic Algorithms: An Overview

Genetic algorithms are simulations of natural selection used to solve optimization problems dansimon2013ga such as the one addressed by KGen. Their workings and architecture reflect the natural process of reproduction, proliferation, and selection. More specifically, starting from an initial population, the algorithm selects the best individuals, using a function that measures the goodness of an individual, and from them produces new individuals. Then, the old and the new populations are re-evaluated to see which individuals survive to the next generation. This process goes on until a stop condition is satisfied. In order to better explain this process, it is essential to describe the main components of a genetic algorithm:

Solution encoding: a good solution representation plays a key role in a genetic algorithm because all future evaluations are applied to the solutions. So, if a solution is easy to evaluate, the complexity of the entire algorithm is low. A solution typically consists of an array of values. As a first step, a random population is generated; then the algorithm tries to improve its solutions in order to find the best one.

Fitness function: in implementing a genetic algorithm, a key role is played by the complexity of the fitness function. A fitness function is a representation of the objective to achieve; if it has low complexity, then the entire algorithm has a lower complexity. The choice of the proper fitness function should be made together with the choice of the solution encoding, because they are highly correlated: the fitness function is directly applied to the solution, so if they are incompatible, the evaluation process becomes more complicated.

Genetic operators: genetic operators are functions that automatically allow the generation of new chromosomes, starting from the previous population. There are three different types of operators: selection, used to find the best chromosomes in the population; crossover, a "mating process" applied to two chromosomes to generate two new chromosomes; and mutation, used to mutate a single chromosome to prevent the genetic algorithm from converging to a locally optimal solution dansimon2013ga.

2.7 Related work

There are many works on k-anonymization and its practical implementation. Samarati et al. samarati2001microdata provide a k-minimal generalization algorithm that applies a binary search to find all k-anonymous nodes, selecting the nodes reachable in the fewest steps as solutions. If there is more than one such node, the algorithm selects one randomly or using other criteria, such as information loss. However, the node with the lowest distance vector is not guaranteed to be the optimal solution, because there could be other nodes with a higher distance value but a lower level of information loss. For this reason, the algorithm does not provide the optimal global solution.

Similarly, the Datafly algorithm adopts a heuristic based on the attributes sweeney1997guaranteeing; sweeney2002achieving. The attribute with the most distinct values is selected as the next attribute to generalize. The process continues, generalizing the attributes with distinct values that do not satisfy k-anonymity, until the k-anonymity criterion is satisfied. This approach does not guarantee the minimal k-anonymous solution; however, the solution found is always k-anonymous.

LeFevre et al.'s Incognito exploits a bottom-up approach with a breadth-first strategy to navigate the lattice and find all k-minimal distance vectors lefevre2005incognito. After detecting all vectors, the algorithm calculates their information loss to select the solution with the least information loss as the optimal solution. In this way, the algorithm can find a global optimum.

The Optimal Lattice Anonymization (OLA) algorithm is an improvement of the Incognito and Datafly algorithms el2009globally. The whole anonymization process, as shown in Fig. 1, may be represented as a lattice, and the goal of the OLA algorithm is to find the optimal node in the lattice, which must be k-anonymous and have minimum loss of information. The approach embraces a binary search algorithm for each strategy path. When the optimal node in a strategy path is reached, the algorithm commences analyzing the next strategy path, and so on. In the end, the algorithm holds a list with the k-minimal node of each strategy path; at this point, only the node with the minimum information loss is chosen. Thus, OLA, like Incognito, can provide a globally optimal solution.

Bayardo et al. bayardo2005data present a new approach to exploring the space of possible combinations, developing data-management strategies to reduce reliance on expensive operations. They can find an optimal solution under two representative cost measures and a wide range of k. Moreover, they can provide good anonymizations where the input data or input parameters preclude finding an optimal solution in a reasonable time.

Iyengar shows an example of a genetic algorithm applied to the k-anonymity problem iyengar2002transforming. It seems to generate good results, as can be seen from the experimentation done in their work. Nevertheless, they considered only a dataset with eight quasi-identifier attributes, lacking larger-scale experimentation.

Among all of these k-anonymization algorithms, only OLA and Bayardo's algorithm proved that their results are better than the others (Datafly, Samarati's algorithm) el2009globally; bayardo2005data. For this work, we carried out a comparison only with OLA because, differently from Bayardo's approach, several implementations of it are available. Furthermore, we did not compare against Iyengar's GA because neither pseudo-code nor a repository of their work is available.

3 Scalable K-Anonymization: KGen Explained

Figure 3: KGEN pipeline. It is divided into three steps (separated by dotted vertical lines): input, processing, and output; the KGEN-GA architecture is described in the processing step.

This section describes KGen from a technical perspective, elaborating (1) the general KGen architecture; (2) the KGen lattice preprocessing; (3) solution encoding; (4) solution fitness; (5) genetic operators.

3.1 KGen Architecture

An overview of the KGen architecture is shown in Fig. 3. The processing of data starts with an input phase in which KGen receives a dataset to anonymize along with configuration parameters such as: (a) the generalization strategy to be adopted; (b) the attributes' information type, that is, whether they are Identifiers or Quasi-Identifiers. As explained by Samarati et al. samarati2001microdata, there are different generalization strategies, assuming the existence of different domains, including generalized values and a mapping between each domain and its generalizations. Thus, for example, the postcode can be generalized by dropping, from the right, the least significant digit (as shown in Fig. 2(a)).

The subsequent processing phase is the core of the KGen approach; an overview of this phase is provided in Algorithm 1. The first step of the KGen processing phase is the preprocessing of the lattice for size reduction. The next step is the iteration of the KGen Genetic Algorithm (GA) implementation: in the KGen-GA step, KGen tries to converge to the optimal solution following the GA meta-heuristic approach recapped in Section 2. The output of the processing phase is the dataset k-anonymized using the best solution provided by KGen.

Input: Dataset
       Output: Dataset anonymized

1:procedure KGEN Algorithm
2:      lattice ← LatticePreprocessing(dataset)                    ▷ Sec. 3.2
3:      population ← RandomInitialPopulation(lattice)              ▷ Sec. 3.3
4:      evaluateFitness(population)                                ▷ Sec. 3.4
5:      while stop condition not reached do
6:            offspring ← ∅
7:            for each selected pair of parents in population do   ▷ Selection, Sec. 3.5
8:                  children ← crossover(parents)                  ▷ Sec. 3.5
9:                  children ← mutation(children)                  ▷ standard or horizontal
10:                 evaluateFitness(children)
11:                 offspring ← offspring ∪ children
12:            population ← survivors(population ∪ offspring)
13:            updatePenalties(population)
14:      best ← fittest k-anonymous solution in population
15:return anonymize(dataset, best)
Algorithm 1 KGEN Algorithm (high-level sketch: the loop iterates the GA components described in Secs. 3.2-3.5)

3.2 Lattice Preprocessing

The lattice reduction is the first step of the KGen execution. It is based on the lattice pruning technique used in lefevre2005incognito. This step aims at removing the complexity given by the generation of the full lattice at the expense of introducing an acceptable permutation computational cost. It reduces the lattice size and, thus, the complexity of the k-anonymization algorithm. Fig. 1 shows an example of a non-reduced lattice: the minimum node is 0, 0, 0 and the maximum node is 3, 4, 1. The reduction technique is recapped in Table 3: KGen slices the dataset into N vectors, one per quasi-identifier (Tab. 3(a)), and validates the k-anonymity property iteratively on each vector thus obtained, until a new minimum level of generalization is found (Tab. 3(b)). The idea is that if at least one quasi-identifier attribute is not k-anonymized, then the entire dataset cannot be k-anonymized either; hence, spending computational effort on nodes containing non-anonymized quasi-identifiers is meaningless. Although this approach poses limitations when anonymizing by suppression, such limitations are addressed in the Threats to Validity section, see Sec. 7.

Age Postcode Gender
24 80015 F
28 80019 M
42 85073 F
49 85071 M
(a) Original dataset not anonymized. The attributes Age, Gender and Postcode are Quasi-Identifiers.
Age
24
28
42
49
Postcode
80015
80019
85073
85071
Gender
F
M
F
M
(a) First step of the reduction process. The original dataset is split into n datasets, where n is the number of quasi-identifiers in the original dataset. Each of these new datasets contains only one of the quasi-identifiers.
Age
20 - 29
20 - 29
40 - 49
40 - 49
(b) Age: LOG 2.
Postcode
8001*
8001*
8507*
8507*
(c) PC: LOG 1.
Gender
F
M
F
M
(d) Gender: LOG 0
(b) Second step of the reduction process. Each of the datasets generated previously has been anonymized up to the minimum level of anonymization. The levels of generalization of these datasets represent the new minimum level of generalization of the lattice.
Table 3: Example of the entire lattice reduction process.
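As an illustration of the per-attribute reduction, the sketch below computes, for a single quasi-identifier column, the lowest generalization level at which every generalized value occurs at least k times. The generalization function is passed in as a parameter, and all names are illustrative rather than taken from the KGen prototype.

    import java.util.*;

    public class LatticePreprocessing {
        /**
         * For a single quasi-identifier column, finds the lowest generalization
         * level at which every (generalized) value occurs at least k times.
         *
         * @param column     the raw attribute values, one per row
         * @param generalize generalization function: (value, level) -> generalized value
         * @param maxLevel   height of the attribute's generalization hierarchy
         * @param k          required anonymity level
         */
        static int minimumLevel(List<String> column,
                                java.util.function.BiFunction<String, Integer, String> generalize,
                                int maxLevel, int k) {
            for (int level = 0; level <= maxLevel; level++) {
                Map<String, Integer> counts = new HashMap<>();
                for (String value : column) {
                    counts.merge(generalize.apply(value, level), 1, Integer::sum);
                }
                boolean anonymous = counts.values().stream().allMatch(c -> c >= k);
                if (anonymous) return level;   // new minimum level for this attribute
            }
            return maxLevel;                   // fully generalized as a fallback
        }
    }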

3.3 Solution Encoding

Figure 4: Solution encoding of the lattice node [2, 3, 0].

A genetic algorithm aims to find the best pseudo-optimal solution in a reasonable time. In this case, a solution is the representation of a node in the lattice (see Fig. 1), i.e., of its levels of generalization. In KGen, a solution is represented as an array of numbers, where the i-th position of the array contains the value of the i-th attribute in the lattice node. Fig. 4 shows the solution encoding of the lattice node Age/Postcode/Gender [2, 3, 0]: the levels of generalization of Age, Postcode and Gender are put in positions 0, 1 and 2, respectively. As in any genetic algorithm, the initial population is initialized randomly.

3.4 Fitness Functions

Every genetic algorithm needs to define its fitness function. This function allows evaluating, at each iteration, all generated solutions. As discussed in Section 2, there are two metrics for the evaluation of a single node, namely (a) k-anonymity and (b) loss of information. In KGen, the loss of information is the only metric used to evaluate the fitness of a solution; for every fit solution, k-anonymity is then evaluated to see whether the solution is feasible or not. Thus, the goal of the KGen fitness function is to find the node with the lowest loss of information that, at the same time, satisfies the k-anonymity property.

Value LOG Rows
24 0 [1]
28 0 [2]
42 0 [3]
49 0 [4]
20-24 1 [1]
25-29 1 [2]
40-44 1 [3]
45-49 1 [4]
20-29 2 [1, 2]
40-49 2 [3, 4]
0-49 3 [1, 2, 3, 4]
0-99 4 [1, 2, 3, 4]
(a) Age support map.
Value LOG Rows
80015 0 [1]
80019 0 [2]
85073 0 [3]
85071 0 [4]
8001* 1 [1, 2]
8507* 1 [3, 4]
800** 2 [1, 2]
850** 2 [3, 4]
80*** 3 [1, 2]
85*** 3 [3, 4]
8**** 4 [1, 2, 3, 4]
***** 5 [1, 2, 3, 4]
(b) Postcode support map
Table 4: Example of support map for the quasi-identifiers AGE and Postcode.

3.4.1 Implementing K-Anonymity in KGen

We implemented KGen using the improved algorithm for k-anonymity presented by Zhang et al. zhang2012improved. They propose a technique for improving the k-anonymity implementation by providing a new structure for the generalization hierarchy, namely a support map. A support map is a structure in which each distinguishable value is associated with its level of generalization and with all the rows that contain an equal value. Tab. 4 shows an example of support maps applied to two quasi-identifier attributes, Age and Postcode. With the support map technique, each attribute has a related support map containing all values of that attribute, including all their generalized versions, and, for each value, its level of generalization and all the rows that contain that value. In Tab. 4(a), the value 24 has a level of generalization 0 and is included only in the first row; its generalization 20-29, instead, has a level of generalization 2 and can be found in rows 1 and 2. In this way, to check whether a dataset is k-anonymized, the algorithm intersects the row sets of the values at the given levels of generalization and verifies that no resulting group has fewer than k rows. In Tab. 4, for example, intersecting LOG 2 of Age and LOG 1 of Postcode yields two groups of rows: the first one contains rows 1 and 2, with values 20-29 and 8001*; the second one contains rows 3 and 4, with values 40-49 and 8507*.
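A minimal sketch of this check follows, assuming the support maps are precomputed as, for each attribute and generalization level, a mapping from generalized values to the sets of row indexes that carry them (as in Tab. 4); the names are illustrative and not taken from Zhang et al.'s or KGen's code.

    import java.util.*;

    public class SupportMapCheck {
        /**
         * supportMaps.get(a).get(log) maps each generalized value of attribute a
         * at generalization level log to the set of row indexes containing it.
         * node[a] is the generalization level chosen for attribute a.
         */
        static boolean isKAnonymous(List<List<Map<String, Set<Integer>>>> supportMaps,
                                    int[] node, int k) {
            // start from the row groups induced by the first attribute at its level
            Collection<Set<Integer>> groups = supportMaps.get(0).get(node[0]).values();
            for (int a = 1; a < node.length; a++) {
                Collection<Set<Integer>> valueRows = supportMaps.get(a).get(node[a]).values();
                List<Set<Integer>> refined = new ArrayList<>();
                for (Set<Integer> group : groups) {
                    for (Set<Integer> rows : valueRows) {
                        Set<Integer> intersection = new HashSet<>(group);
                        intersection.retainAll(rows);        // rows sharing both values
                        if (!intersection.isEmpty()) refined.add(intersection);
                    }
                }
                groups = refined;
            }
            // every resulting equivalence class must contain at least k rows
            return groups.stream().allMatch(g -> g.size() >= k);
        }
    }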

3.4.2 Implementing Loss of Information in KGen

As discussed in Section 2.4, KGen implements the precision criterion as its information-loss metric. Each possible solution is evaluated with the precision Formula 1. The goal of KGen's genetic algorithm is to minimize the precision of a solution, i.e., to find the best k-anonymized solution with the least precision.

3.5 Genetic Operators

LOG | Penalty | Weighted LOG | Probability
0.85 | 0 | (1 - 0) * 0.85 = 0.85 | 0.51
0.44 | 3 | (1 - 0.3) * 0.44 = 0.308 | 0.19
0.55 | 1 | (1 - 0.1) * 0.55 = 0.495 | 0.30

Figure 5: Selection process, based on the LOG metric as fitness function. Based on their LOG value and their penalty, the selection operator assigns a probability to each solution. The pie chart shows the probability of choosing a single solution.

For the implementation of the KGen-GA approach, the following operators are provided.

Selection. For the selection operator, KGen uses the Tournament Selection operator blickle1995mathematical with a penalty. Tournament Selection is used to select the fittest candidates for the current generation. The operator assigns a probability to each solution based on two criteria: the fitness value and the penalty of the solution. The fitness value, in our case, is the loss-of-information metric. The penalty, instead, works as follows: when a new solution is generated, its penalty value is 0; each time the solution survives to the next generation, its penalty increases, lowering the weight applied to its fitness, until the maximum penalty is reached and the selection probability drops to 0. The concept is that the longer a solution survives, the lower its probability of being chosen. Therefore, the penalty is used as a weight for solution optimality. An example of this process is shown in Fig. 5 (in the figure, the data regarding the level of generalization (LOG) and the penalty are chosen randomly, just to explain the process behind the KGen selection operator). The probability of selection is calculated using this formula:

P(s_i) = ((1 - 0.1 * penalty_i) * LOG_i) / Σ_j ((1 - 0.1 * penalty_j) * LOG_j)        (3)
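The sketch below shows how such weighted probabilities could be computed; the 0.1 scaling of the penalty is inferred from the example of Fig. 5 and should be read as an assumption rather than the prototype's exact constant.

    public class SelectionProbability {
        /**
         * Selection probabilities for the tournament selection with penalty:
         * each solution's fitness (LOG) is down-weighted by its penalty and the
         * weights are normalized into a probability distribution (Formula 3).
         */
        static double[] selectionProbabilities(double[] log, int[] penalty) {
            double[] p = new double[log.length];
            double total = 0.0;
            for (int i = 0; i < log.length; i++) {
                p[i] = (1.0 - 0.1 * penalty[i]) * log[i];  // assumed: 0.1 weight per penalty unit
                total += p[i];
            }
            for (int i = 0; i < p.length; i++) p[i] /= total;
            return p;
        }

        public static void main(String[] args) {
            // the example of Fig. 5: probabilities ≈ {0.51, 0.19, 0.30}
            double[] probs = selectionProbabilities(new double[]{0.85, 0.44, 0.55},
                                                    new int[]{0, 3, 1});
            System.out.println(java.util.Arrays.toString(probs));
        }
    }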
(a) Offspring generation.
(b) Crossover lattice, an example with only one k-anonymized parent.
Figure 6: Example of the crossover operator with only one of the parents k-anonymized. In this case, all nodes with dashed lines represent a possible final offspring of the crossover.

Crossover. KGen provides its own crossover implementation, based on the double-point crossover defined in mirjalili2019genetic. Fig. 6(a) shows the first step: (i) the PARENTS are selected with the selection operator; (ii) on top of them, the crossover generates two new chromosomes, one with the highest values extracted from the PARENTS and one with the lowest values extracted from the PARENTS.
Subsequently, three possible scenarios manifest:

  • Case 1. Both parents are k-anonymized. In this case, the maximum node is k-anonymized because, by definition of strategy path, all nodes after a k-anonymized node are also k-anonymized. If the minimum node is also k-anonymized, it is added to the final offspring; otherwise, the algorithm adds a random node between the minimum node and the first parent node and another random node between the minimum node and the second parent node;

  • Case 2. Neither parent is k-anonymized. In this case, the minimum node is not k-anonymized, and the final offspring is the maximum node;

  • Case 3. Only one of the parents is k-anonymized. The minimum node is not k-anonymized, and the maximum node is k-anonymized. In this case, the final offspring is a random node between the minimum node and the k-anonymized parent.

An example of case 3 is shown in Fig. 6: Fig. 6(a) shows the generation of the minimum and maximum nodes, while Fig. 6(b) shows the crossover lattice that contains all the possible crossover offspring. In this case, only the nodes with dashed lines are considered, since they represent the random solutions discussed previously.

Mutation. In this case, KGen uses two different Mutation techniques:

  • Standard mutation: a classic mutation operator, inherited from the approach in goldberg1988genetic. It changes a single value of the chromosome and allows replacing a possible solution with another one from the same strategy path. This operator is needed to guarantee the principle of exploitation mkaouer2014model, since it allows a solution to move up or down its strategy path;

  • Horizontal mutation: this operator allows the genetic algorithm to replace a solution with a solution from a different strategy path, thereby guaranteeing the exploration criterion. In order to change the strategy path, it is necessary to change more than one value of the solution and, to avoid ending up in the same strategy path, the chosen values are alternately increased and decreased, picking a value between the actual level and the maximum value (when increasing) or the minimum value (when decreasing). An example of horizontal mutation is shown below, and a code sketch of the operator follows the list:

    Example
    Minimum solution: 0 0 0 0 0
    Actual solution: 2 2 2 2 2
    Maximum solution: 4 4 4 4 4
    Percentage of values to mutate: 50%. In this case it means that 2 values need to be mutated.
    Random indexes chosen: 2, 3
    Algorithm: the value at index 2 takes a random value between its current value and its maximum (so, from 2 to 4). The value at index 3 takes, instead, a value between 2 and 0, its minimum.
    Possible mutated solution: 2 2 3 0 2
    This procedure of alternately increasing and decreasing keeps going until all the chosen indexes have been mutated.
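The following is a minimal sketch of the horizontal mutation just described, assuming the solution, the minimum node and the maximum node are arrays of generalization levels; starting the alternation with an increase is an assumption made for illustration.

    import java.util.*;

    public class HorizontalMutation {
        /**
         * Moves a solution to a different strategy path by mutating a fraction of
         * its generalization levels, alternately increasing and decreasing them.
         */
        static int[] mutate(int[] solution, int[] minNode, int[] maxNode,
                            double fraction, Random rnd) {
            int[] mutated = solution.clone();
            int toMutate = Math.max(1, (int) Math.round(solution.length * fraction));

            // choose the indexes to mutate at random, without repetition
            List<Integer> indexes = new ArrayList<>();
            for (int i = 0; i < solution.length; i++) indexes.add(i);
            Collections.shuffle(indexes, rnd);

            boolean increase = true;   // assumed: the alternation starts with an increase
            for (int j = 0; j < toMutate; j++) {
                int idx = indexes.get(j);
                if (increase) {
                    // random value between the current level and the attribute's maximum
                    mutated[idx] = solution[idx] + rnd.nextInt(maxNode[idx] - solution[idx] + 1);
                } else {
                    // random value between the attribute's minimum and the current level
                    mutated[idx] = minNode[idx] + rnd.nextInt(solution[idx] - minNode[idx] + 1);
                }
                increase = !increase;
            }
            return mutated;
        }
    }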

4 Research Design

The main goal of this work is to provide stakeholders with an approach that can be used in a real-case scenario. To that end, we proposed KGen, a meta-heuristic approach based on a Genetic Algorithm, to build an infrastructure capable of anonymizing a dataset in a real-case scenario. First, this means that the dataset specification cannot be known a priori, so the approach should scale with the dataset provided. Secondly, we evaluated the proposed algorithm with an experimentation using a large dataset, to validate the approach in a significant case context.

4.1 Dataset

To answer the first main research question, we built an experimentation on top of the dataset provided by the Financial Forensics (F) Taskforce West-Brabant-Zeeland. The task force needed a middleware capable of enabling forensic analysis without putting at risk the privacy of data owners and without any human intervention over the data; furthermore, this needed to be done in computational times consistent with the quantity of data available as opposed to the qualities of that data. The task force has many instances of data constrained around a reasonable set of 50+ features. Therefore, the key requirement was striking a balance between the computational complexity of the algorithms involved and their anonymization reliability. In the scope of our experimentation, we were provided with an experimental dataset that was completely spoofed at the source, namely disguised as a communication from an unknown source but still reflecting the original structure and properties. The dataset in question contained 47 attributes and 1599 observations involving four different attribute types: Dates, Numbers, Strings, and Places. The generalization techniques used to generalize them are shown in Tab. 5.

To validate KGen with a large dataset, we conducted a second experimentation using the "c2k_data_comma.csv" dataset cargo2000dataset, which is commonly considered big data (in terms of attributes, or columns of the dataset) for anonymization research, with its 97 attributes and 3942 observations. The attributes analyzed are all numeric, so the only generalization strategy applicable is range generalization samarati2001microdata: the more the range of possible values widens, the more a number is generalized (e.g., 23 can be generalized into the range 20-25).

Generalization techniques
NUMBER: Range generalization (3 -> 0-5)
STRING: Star generalization (NL805 -> NL80*)
DATE: Date generalization (01/01/1970 -> 01/1970 -> 1970)
PLACE: Place generalization (Den Bosch -> Noord Brabant)
Table 5: Generalization strategies applied to the K-Anonymity problem.
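As an illustration of the range generalization applied to the numeric attributes, the helper below widens the range as the level grows; the doubling of the range width per level is an assumption made for this sketch, not the scheme used in the experimentation.

    public class RangeGeneralization {
        /** Generalizes a numeric value into a range whose width grows with the level. */
        static String generalize(int value, int level) {
            if (level == 0) return Integer.toString(value);  // no generalization
            int width = 5 * (1 << (level - 1));              // assumed widths: 5, 10, 20, ...
            int lower = (value / width) * width;
            return lower + "-" + (lower + width);
        }

        public static void main(String[] args) {
            System.out.println(generalize(23, 1));  // "20-25"
            System.out.println(generalize(23, 2));  // "20-30"
        }
    }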

4.2 Metrics

To answer the research questions outlined in Sec. 1, we defined the evaluation metrics below.

RQ1 compares the performance of the approaches using the execution time of the anonymization algorithm with respect to the complexity of the input dataset, as defined in related work el2009globally. The k-anonymity property is an NP-Hard problem meyerson2004complexity: when the number of quasi-identifier attributes increases, the number of nodes in the lattice increases and, consequently, so does the execution time needed to analyze them. Hence, the execution time is a reliable indicator for comparing approaches.

To answer RQ2, we proposed a measure of accuracy, expressed as the distance between the optimal solution and the pseudo-optimal solution. Each solution is part of a strategy path, and there is an optimal solution for each strategy path. Following this principle, the worst solution is the last node of the strategy path, with an accuracy value equal to 0; the optimal node, instead, has an accuracy value equal to 1. More in general, the accuracy of a solution is computed as follows:

accuracy(x) = (H(x_max) - H(x)) / (H(x_max) - H(x_opt))        (4)

where H(x) is the height of a solution x, x_max is the last node of the strategy path, and x_opt is the optimal node of that path. The general accuracy, instead, is the weighted arithmetic mean of the accuracy values of all solutions, formally:

Accuracy = (Σ_i w_i * accuracy(x_i)) / (Σ_i w_i)        (5)

We chose the weighted arithmetic mean because of the value problem WOOD20061326: in our case, accuracy can be 0, and it is not possible to use the harmonic or geometric mean with values less than or equal to 0. The limitation of this metric is that the optimal solution must always be known in order to measure the accuracy level. So, the only way to determine the accuracy level is to compare an approach with another one that provides optimal solutions.

RQ3 measures the quality of a proposed solution. The quality is strongly related to the anonymization and usability of a dataset. As previously stated, the metrics used to evaluate these two aspects are the level of generalization and the percentage of suppression of a solution: the former measures the level of generalization of a solution, while the latter is used as an indicator of the level of suppression applied to the dataset. All solutions provided by an approach are k-anonymized; therefore, the lower the level of generalization and the level of suppression of a solution, the better its quality. Since there can be more than one solution, the final level of generalization reported is the minimum among all solutions' levels of generalization, and the level of suppression is taken from the solution found.

4.3 Evaluated Algorithms

In the scope of our evaluation, we selected, alongside KGen, k-anonymization algorithms from the state of the art that use generalization and suppression techniques, as well as an exhaustive algorithm featuring a brute-force approach by enumeration Ullmann1976. The selected algorithms are listed below:

  • Exhaustive Approach. This algorithm applies the k-anonymity property assessment as well as the generalization and suppression metrics to all nodes of the input lattice. After the analysis of the entire lattice, it is possible to find the minimum k-anonymized node. This approach provides the optimal solution;

  • OLA Approach. As explained in the Related Work section (see Sec. 2.7), the OLA algorithm is an optimization of the k-anonymization algorithm. This algorithm also converges to the optimal solution;

  • KGen Approach. KGen is the approach that we want to test in this work, designed to cope with big datasets;

  • Random-Search Approach. This algorithm is included as a validation baseline for KGen. The comparison with this algorithm is motivated by the fact that genetic algorithms introduce a certain degree of randomness in solution generation; hence, by comparing KGen to a random algorithm, we aim at establishing whether KGen's behavior is close to that of a random approach or not.

The remaining approaches from the state of the art discussed in Sec. 2.7 were already compared with the OLA approach in previous works el2009globally. For this reason, they have not been used in this evaluation study.

5 Results

For the comparative analysis, the experimentation was run on an Intel i7-7700HQ CPU at 2.8 GHz with 16 GB of DDR4 RAM, on Windows 10 64-bit. The maximum threshold allowed for the suppression technique, as required by the stakeholder, is 0.5%. The computational time limit was set to 15 hours. The other metaheuristic parameters related to KGen and the Random-Search approach can be seen in Tab. 6.

KGEN Random
maxEvaluations 5000 5000
populationSize 100 5000
crossoverRate 0.9 -
mutationRate 0.2 -
horizontalMutationRate 0.4 -
Table 6: Metaheuristic parameters setup.

5.1 RQ1: KGen Performance

(a) Execution time c2k dataset.
(b) Execution time F dataset.
Figure 7: Execution time evaluation results over the considered datasets.

Fig. 7 plots the execution times on a logarithmic scale. The exact approach can give results for a maximum of 6 QIDs for the c2k dataset and 10 QIDs for the F dataset, while its computation halts or crashes as the number of QIDs increases. Conversely, KGen and random-search provide results up to 25 QIDs for c2k and 15 QIDs for F.

5.2 RQ2: KGen Accuracy

(a) Accuracy on c2k dataset.
(b) Accuracy on F dataset.
Figure 8: Accuracy evaluation results over the considered datasets.
(a) Solution quality, Level of generalization (LOG) of c2k dataset.
(b) Solution quality, Level of generalization (LOG) of F dataset.
Figure 9: Level Of Generalization on the dataset anonymized.
(a) Level of suppression (LOS) of c2k dataset.
(b) Level of suppression (LOS) of F dataset.
Figure 10: Level Of Suppression on the dataset anonymized.

Fig. 8 outlines the results for accuracy. Given that the exact approaches cannot provide the optimal solution for a number of quasi-identifiers higher than 7 for the c2k dataset and 8 for the real dataset, the accuracy graph shows the accuracy level only up to 7 or 8 quasi-identifiers.

As the left-hand side of the figure shows, most approaches, including KGen, offer accurate results, with the apparent exception of the random approach, which, by definition, is bound to be non-accurate. On the right-hand side, the results on the real dataset show that the accuracy decreases from the seventh quasi-identifier onward.

5.3 RQ3: KGen Solution Quality

Fig. 9 and Fig. 10 show the level of generalization and suppression of all the approaches compared. In the scope of the plots, to evaluate the goodness of the non-exact approaches (i.e., KGen and random), it is sufficient to evaluate how low their curves are.

Regarding the level of generalization, the KGen result, except for the F dataset with 7/8 quasi-identifiers, is always equal to or lower than that of the other approaches. Even when the other approaches cannot provide a solution, KGen provides better results than the Random approach.

The suppression criterion, instead, presents a different behaviour depending on the dataset used. With the F dataset, the behaviour of KGen seems to be the same as that of the exact approaches, and the suppression value seems to decrease for numbers of quasi-identifiers higher than 9. The c2k dataset, instead, presents curves with unstable behaviour for all the approaches considered, making it more challenging to analyze. Nonetheless, the behaviour of KGen matches that of the exact approaches; considering that the exact approaches provide the best results, this means that KGen provides results as good as theirs.

6 Discussion

As expected from our results on RQ1, the state-explosion problem clarke08 does not allow obtaining the exact solution in a reasonable time in all cases. More specifically, with more than 6 quasi-identifier attributes for the c2k dataset and 10 quasi-identifier attributes for the F dataset, it is unfeasible to run the exact approaches. Differently, with the usage of metaheuristics, we can provide solutions up to 25 quasi-identifier attributes and opportunistically continue if granted the appropriate computational means. Clearly, from that point onwards, it becomes difficult also for metaheuristic approaches to provide a solution. One factor strongly related to the increase in execution time of metaheuristics is the maximum number of evaluations set in the metaheuristic configuration (e.g., see Tab. 6), since the number of nodes evaluated is directly related to this configuration. Consequently, to decrease the execution time, operators and data-processing agents can fine-tune the maxEvaluations parameter of KGen (or even of the random approach) opportunistically and as needed. Another important aspect is that the slope of the execution-time curve of the random approach is lower than that of KGen as the number of QIDs increases. This is because a single evaluation run in KGen analyzes more than one node, given that the crossover operator continuously generates new nodes. This limitation can be the object of future study by researchers and practitioners interested in addressing its impact.

Moreover, concerning RQ2, the accuracy level shows that KGen provides solutions nearly identical to those of the optimal approaches. It means that KGen can (a) converge, using its genetic operators, to the optimal solution on small instances and (b) stay very close to the optimum as the size of the instances increases. Conversely, the random approach initially provides a good level of accuracy due to the number of evaluations relative to the size of the problem: for example, if a lattice contains 300 nodes, with 5000 evaluations (the setting of the random approach described in Tab. 6) the random approach analyzes all nodes in the lattice, providing a high level of accuracy. However, the opposite becomes true as the number of lattice nodes grows exponentially.

Focusing on RQ3, we can observe that the level of generalization and suppression of KGen is very close to that of the optimal approaches. This is a good indicator of the power of our research solution. Most notably, our approach (just as the random one) can provide solutions for higher numbers of quasi-identifier attributes, a setting where most optimal approaches fail. Unlike the random approach, however, KGen provides excellent results in terms of generalization level once the suppression applied is also considered: if the random approach seems to have better results on large instances when considering only the generalization level, this is due to the high level of suppression applied by the random approach itself. Looking at both metrics, we can easily see that KGen has the best results.

From the results of the three sub-research questions, we can conclude that KGen performs well in real-case contexts. Moreover, given the dataset provided by the Taskforce West-Brabant-Zeeland, we can anonymize their dataset with a good level of anonymization, obtaining the same results as the exact approaches.

Unlike heuristic approaches, meta-heuristic approaches can also perform well in contexts where the dataset size, in terms of the number of quasi-identifiers, is larger. Hence, a stakeholder can use KGen in large-scale scenarios. Nonetheless, to ensure applicability in a general context, the approach needs to be validated with more datasets.

Lastly, besides the anonymized dataset, KGen also provides metadata regarding the information loss of each dataset attribute. Hence, the final user can estimate the extent of the damage by means of the information loss of each attribute.

7 Limitations and Threats to Validity

Age Postcode Gender
24 80015 F
28 80019 M
42 85073 F
(a) Example of a dataset not k-anonymized.
Age
20 - 29
20 - 29
40 - 49
(b) Age: LOG 2.
Postcode
8001*
8001*
8507*
(c) PC: LOG 1.
Gender
F
M
F
(d) Gender: LOG 0
(a) Datasets k-anonymized by applying the suppression criterion (with a maximum level of suppression of 35% of the entire dataset). The final dataset contains only the first row.
Table 7: Lattice reduction process with suppression criteria.

This section outlines the major limitation we perceive in our work, which reflects one of the optimizations that KGen features in its processing and algorithms. As outlined in Section 3.2, KGen features a lattice size-reduction technique that limits the approach's applicability in specific cases. Nevertheless, the technique is essential, since it allows working on a smaller search space than the original one, whose size could be intractable without major software-defined infrastructure requirements. However, the described technique introduces a vulnerability when the suppression technique is also used during the anonymization process. Preprocessing without suppression ensures that all lattice nodes below the new minimum node found in the process are not k-anonymized; with suppression active, instead, this does not hold. Let us consider the example in Tab. 7. If we apply the suppression criterion (with a maximum level of suppression set, by default, to 35%) on each single-attribute dataset so that all of them become k-anonymized, we have the suppression of the last row in the first and second datasets and the suppression of the second row in the last dataset (Tab. 7(b) - 7(c) - 7(d)). At this point, removing the second and third rows from the dataset, the remaining dataset is composed of only the first row, with a final level of generalization of 2, 1, 0. Nevertheless, this dataset would be k-anonymized even without any generalization (0, 0, 0). By applying the suppression criterion, it is therefore possible to have a k-anonymized node with a level of generalization lower than the minimum level of generalization provided by the preprocessing. We are aware of this limitation and plan to address it in future developments and iterations of this work.

8 Conclusion and future work

With the quickly increasing amount of digital data, there is a growing need for fast and scalable data-processing support capable of offering anonymization guarantees. In this paper, we introduced KGen, a scalable approach to data-intensive k-anonymization featuring genetic algorithms.

The KGen approach focuses on balancing two critical and opposing data-quality attributes functional to data processing, namely data privacy versus usefulness of the data. As aforementioned, KGen exploits genetic algorithms that allow organically increasing the level of privacy of the data while safeguarding that the data remains usable as evidence, e.g., in terms of financial evidence and audit trails that are part of governmental data-intensive processing.

KGen is a practical, scalable, data-intensive approach that can effectively anonymize datasets embracing the well-accepted k-anonymization measure. The approach is supported by a prototype coded in Java and tested through various experiments using benchmarks and real-life industrial datasets.

The initial results look very promising. We have shown empirically that KGen performs as well as other optimization approaches on the level-of-generalization metric, while KGen, in contrast to the other approaches, can deal with a large number of quasi-identifiers and, thus, with Big datasets.

Future work will focus on building a more robust and user-friendly interface on top of the current prototype and on more personalized privacy measures. Besides, we intend to work on a dynamic version of KGen, D-KGen, that can deal with streaming data that dynamically adds/removes/alters the dataset on-the-fly and just-in-time, breaking the "closed-world assumption" underpinning most of the existing approaches.

References