Nowadays, applications interact with a plethora of potentially sensitive information from multiple sources. As an example, modern applications regularly combine data from different domains such as healthcare and IoT. While such rich sources of data are extremely valuable for analysts, researchers, marketers and other professionals, data privacy technologies and practices face several key challenges to keep pace.
There are two major obstacles when it comes to hosting and sharing sensitive data. The first is that the public cloud solutions are not trusted with sensitive data (e.g. health, financial, or critical infrastructure data) and thus organisations have to invest in private or hybrid clouds as the hosting and processing environment. This adds complexity for the design and implementation and often comes with additional cost due to security and customisation. Homomorphic encryption provides an answer to these kinds of obstacles, by encrypting the data at their source while allowing operations on them and thus lifting the trust barrier from the hosting solution.
The second is data privacy. Data privacy technologies are applied in two major use cases. The first use case concerns data sharing, where data need to be sufficiently anonymized before being shared with researchers and analysts. The second use case concerns security. By anonymizing data at rest, the risk of breaches is minimised since sensitive information is protected.
Recent regulation, such as the EU General Data Protection Regulation (GDPR), also proposes anonymization as a way to process data safely when consent is not an option, or when organisations want to use the data for an indefinite period of time for purposes beyond those for which they were originally obtained.
In this paper, we present how to apply different data privacy approaches to homomorphically encrypted data. Specifically, we present how we can achieve uniqueness discovery, data masking, differential privacy and k-anonymity over encrypted data, requiring zero knowledge about the original values. Uniqueness discovery allows the user to find which attributes or combinations of attributes (quasi-identifiers) appear with a lower frequency than a predefined threshold. This leads to the selection of attributes that need to be protected via a combination of data masking, differential privacy and k-anonymity approaches. We explore how we can securely apply all these techniques without leaking information about the original data, such as the domain cardinality or diameter.
The rest of the paper is organised as follows. Section II describes the motivating scenarios and use cases behind this work. Section III provides background information about the basic principles of data privacy and homomorphic computations and further outlines the data publishing workflow and describes the building blocks to achieving data privacy. In Section IV we present the related work. In Section V we present the secure protocols while in Section VI and Section VII we discuss their security guarantees and performance respectively. Finally, we conclude in Section VIII.
The first major question that arises is why data encryption alone is not enough. In general, to protect the privacy of sensitive data, only encrypted data is outsourced to the third-party cloud providers and there exist well-established systems which allow secure query processing directly over encrypted data. On top of that, a more important question is, why someone should perform masking and anonymization over encrypted data in the cloud environment. In a typical scenario, the data is anonymized at the source and then uploaded to the cloud or shared with a third party. However, there are many reasons to perform the anonymization part on the cloud after data encryption.
The answer to these two questions rests on two major observations about GDPR compliance. First, data encryption alone does not meet the high compliance standards, since all data encryption schemes are reversible. Second, anonymized data is not considered personal information and benefits from relaxed standards under GDPR (see https://ibm.biz/Bd2yUK). Hence, non-reversible masking and anonymization need to be applied over encrypted data. Furthermore, the GDPR mandates that the data controller needs to demonstrate that the state-of-the-art strategy is applied when it comes to pseudonymization/anonymization approaches. By relying on the cloud to deploy the state-of-the-art approaches, the operational and compliance model for the data owners becomes significantly easier.
Apart from the compliance regulations, there are several other factors that motivate anonymization over encrypted data. First, the objective of the data use may change over the course of time. Initially, it may not be desirable to share the data but after some time it might be required. This is common with enterprise data where confidentiality must be kept for several years before the data can be exchanged. Second, the users might want to selectively share data and thus apply anonymization to only the selected portion. In both cases, by pre-anonymizing the data we will not be able to reach the desired outcome. Furthermore, in the distributed IoT scenario, individual sensors may not have sufficient storage and processing capability and the ever-increasing volume of data makes it hard to anonymize at the source. As an operating model, it is far easier to homomorphically encrypt the data at source before outsourcing to the cloud and then anonymize on-demand at the cloud when it comes to sharing and collaborating.
Specific masking operations, k-anonymity and differential privacy fall into the non-reversible anonymization category and thus are the focus of this paper.
Our approach guarantees two major properties: a) the data owner does not send the data as-is, so the data trust cannot see the original data, and b) non-reversible anonymization can be applied to the data on demand, in order to meet compliance standards or to enable selective data sharing.
Based on the data privacy terminology, attributes in a dataset are classified as direct or quasi-identifiers. Direct identifiers are uniquely identifying and are always removed or masked before the data release. Quasi-identifiers are sets of attributes that can uniquely identify one or very few individuals. For example, for the dataset in Table I, if we observe the gender attribute in isolation, it is not uniquely identifying (roughly 50% of a dataset would be either male or female); the same applies for a ZIP code (several thousands of people might live in the same ZIP code). However, if we look at attributes in combination then we can isolate very few individuals. As an example, the combination of ZIP code plus gender plus birth date can be uniquely identifying (in the case of the US this combination can uniquely identify 87% of the population). Based on the k-anonymity approach, quasi-identifiers are generalized and clustered to such a degree that an individual is indistinguishable from at least $k - 1$ other individuals.
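To make the quasi-identifier notion concrete, the following minimal sketch checks k-anonymity on plaintext records by counting how often each combination of quasi-identifier values occurs. The function names and the toy records are hypothetical and are not taken from Table I; the sketch only illustrates the definition, not the secure protocol.

```python
from collections import Counter


def equivalence_class_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    sizes = equivalence_class_sizes(rows, quasi_identifiers)
    return all(n >= k for n in sizes.values())


# Hypothetical toy records, for illustration only.
rows = [
    {"zip": "10001", "gender": "F", "birth_year": 1980},
    {"zip": "10001", "gender": "F", "birth_year": 1980},
    {"zip": "10001", "gender": "M", "birth_year": 1975},
    {"zip": "10002", "gender": "M", "birth_year": 1975},
]
```

In this toy data, gender alone satisfies 2-anonymity, while the combination of ZIP code and gender does not, which mirrors the observation that attributes become identifying in combination.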
III-B System Entities
A Database-as-a-Service (DBaaS) architecture consists of the following entities:
Data Owner (DO): A company or an individual, such as a bank, that has a proprietary right to a sensitive database. The Data Owner wants to securely outsource its data storage and future computations over the data to a Cloud Service Provider.
Cloud Service Provider (CSP): A third party that provides storage and computation capability as a service to its clients. For our scenario, the Cloud Service Provider is a system running any present-day state-of-the-art database engine. Examples include Amazon Redshift and Microsoft Azure SQL Database.
In particular, we work in the two-party federated cloud setting, with two non-colluding public cloud servers. This model was introduced in Twin Clouds  and was subsequently used in related problems [14, 32]. Federated clouds are an example of Interclouds , a collection of global stand-alone clouds. Intercloud allows better load balancing and allocation of resources. A detailed survey of the taxonomy of intercloud architectures is presented in .
III-C Trust Assumption
We assume that the CSP is honest-but-curious (or semi-honest), i.e. it is honest and executes the protocol correctly, but is also interested in the plaintext of the encrypted data stored at its site, either because it is curious or because it has been compromised. In this paper, we will show that the honest-but-curious adversary will not be able to learn anything about the plaintext of the encrypted database, even though it can observe the computations and can take memory dumps. Further, we also prevent the leakage of any data clustering information available in the intermediate steps of the secure k-anonymization, masking or differential privacy algorithms.
III-D Homomorphic Computation
Homomorphic encryption schemes support direct computation of functions over encrypted data without needing to decrypt it first. To this end the seminal work of Gentry presents a fully homomorphic encryption (FHE) scheme, which is capable of evaluating any arbitrary dynamically chosen function over an encrypted database without needing the secret key. However, since computation over fully homomorphically encrypted data is still many orders of magnitude slower than plaintext execution, the practical deployment of these schemes for real workloads is limited.
Another line of research built partial and somewhat homomorphic encryption schemes. Specifically, partial homomorphic encryption (PHE) schemes support evaluation of a chosen atomic function over encrypted data (like addition or multiplication). For example, the Paillier cryptosystem supports addition over encrypted data without needing a secret key and ensures strong security guarantees. On the other hand, somewhat homomorphic encryption (SHE) schemes support the computation of low-degree polynomials over encrypted data. For example, the BGN cryptosystem supports evaluation of any polynomial of degree two over encrypted data, while in the LFHE cryptosystem, polynomials of degree $d$ can be evaluated, but it bases security on weaker assumptions of learning with error (LWE) or ring-LWE (RLWE) problems.
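As an illustration of additive homomorphism, the sketch below implements a toy version of the Paillier cryptosystem in plain Python. It assumes tiny hard-coded primes and the common simplification $g = n + 1$, so it is insecure and only meant to show that multiplying two ciphertexts yields an encryption of the sum of the plaintexts. All function names are hypothetical, and this is not the scheme used in the protocols of this paper.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                  # common simplification of the generator
    mu = pow(lam, -1, n)       # modular inverse; valid because g = n + 1
    return (n, g), (lam, mu)


def encrypt(pub, m, rng=random):
    """Probabilistic encryption: c = g^m * r^n mod n^2 for a random r."""
    n, g = pub
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pub, priv, c):
    """m = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) // n."""
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add(pub, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    n, _ = pub
    return (c1 * c2) % (n * n)
```

For example, decrypting `add(pub, encrypt(pub, 20), encrypt(pub, 22))` with the matching private key recovers 42, even though neither plaintext was available when the ciphertexts were combined.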
In general, an SHE scheme consists of five basic algorithms: a) key generation $\mathrm{KeyGen}$, which takes as input the security parameter $\lambda$ and outputs the secret key $sk$, the public key $pk$ and the public parameters; b) encryption $E$, which takes the message $m$ and computes the corresponding ciphertext $c = E(m)$ using the public key $pk$; c) decryption $D$, which takes the ciphertext $c$, decrypts it using the secret key $sk$ and outputs the corresponding plaintext message $m = D(c)$; d) addition $\mathrm{Add}$, which takes two ciphertexts $c_1$ and $c_2$ and adds them homomorphically such that the output is $c = E(m_1 + m_2)$, where $m_1$ and $m_2$ are the plaintexts corresponding to the input ciphertexts $c_1$ and $c_2$ respectively; and finally e) multiplication $\mathrm{Mult}$, which homomorphically multiplies two ciphertexts $c_1$ and $c_2$ such that the output is $c = E(m_1 \cdot m_2)$.
In this paper, we use the LFHE encryption scheme to compute squared Euclidean distance over encrypted databases, which is a key building block of our secure anonymization protocol. The details of LFHE cryptosystem can be found at .
III-E Differential Privacy
Differential privacy is a more recent development in the field of privacy-preserving data publishing and data mining. Achieved by adding randomness to the data, differential privacy renders individuals’ data and data mining outputs statistically indistinguishable, thereby protecting individuals’ privacy . Differential privacy has been shown to provide strong guarantees against auxiliary information attacks [31, 17], and in recent years has been adopted by large corporations when collecting/publishing sensitive data [11, 3, 16].
A mechanism $M$ is said to be $\epsilon$-differentially private if adding or removing a single data item in a database only affects the probability of any outcome within a small multiplicative factor. Formally, a randomized mechanism $M$ is $\epsilon$-differentially private if, for all data sets $D$ and $D'$ differing on at most one element, and all $S \subseteq \mathrm{Range}(M)$,
$$\Pr[M(D) \in S] \le e^{\epsilon} \, \Pr[M(D') \in S].$$
There are a number of mechanisms available to achieve local differential privacy, covering many different types of data. When working with continuous numerical data, differential privacy is commonly achieved using the Laplace mechanism: adding noise drawn from a suitably scaled Laplace distribution yields an output that satisfies differential privacy. The geometric mechanism is a discrete variant of the Laplace mechanism, used when dealing with integer-valued data. For binary-valued data, differential privacy can be achieved by flipping values at random. The probability of flipping is equal to $\frac{1}{1 + e^{\epsilon}}$. In some cases, e.g. for categorical data, the addition of noise to the data does not make sense. In this context, the exponential mechanism provides a means to achieve differential privacy. Developed by McSherry and Talwar, the exponential mechanism selects an output at random, weighted by a utility function which is specified by the data controller.
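The binary mechanism can be sketched on plaintext bits as follows. The sketch assumes the standard randomized-response flip probability $1/(1 + e^{\epsilon})$; the function names are hypothetical. With this choice, the ratio between the probabilities of reporting the same output from the two possible inputs is exactly $e^{\epsilon}$, which is the bound required by the definition of differential privacy.

```python
import math
import random


def flip_probability(epsilon):
    """Probability of flipping a bit (assumed: standard randomized response)."""
    return 1.0 / (1.0 + math.exp(epsilon))


def binary_mechanism(bit, epsilon, rng=random):
    """Return the input bit, flipped with probability 1 / (1 + e^epsilon)."""
    if rng.random() < flip_probability(epsilon):
        return 1 - bit
    return bit


def output_ratio(epsilon):
    """Pr[output = 1 | input = 1] / Pr[output = 1 | input = 0]."""
    p = flip_probability(epsilon)
    return (1.0 - p) / p
```

A smaller $\epsilon$ pushes the flip probability towards one half, so the output reveals less about the input bit.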
III-F Our Contributions
In this work we build a secure privacy-preserving data publishing workflow over encrypted datasets. The workflow consists of five major components. Figure 1 illustrates the steps of the workflow. The workflow expects as input the encrypted data as well as encrypted meta-data, such as the dictionaries to be used for masking or encrypted parameters for differential privacy. In Section V we present the details of our secure protocols.
Secure Privacy Vulnerability Identification. This step detects direct identifiers and combinations of attributes-values (quasi-identifiers)  that lead to high re-identification risk. The detection is based on the attribute values. In Section V-A we illustrate how direct and quasi-identifiers can be obtained from an encrypted database.
Secure Data Masking. This component protects the direct identifiers detected by the privacy vulnerability identification component. The values of direct identifiers can be replaced with fictionalized values or redacted; the action taken is based on a pre-defined configuration.
Secure k-anonymity and differential privacy. This component protects the quasi-identifiers by applying algorithms with strong security guarantees, such as differential privacy and k-anonymity. Here data are generalized and/or suppressed and/or perturbed so that the re-identification risk becomes smaller than a pre-specified threshold.
Risk assessment. This component assesses the risk associated with the dataset. It is an additional step during exploratory phases in which expert assessors and policymakers are still evaluating what additional privacy constraints to apply, in addition to what is required by the current legislation.
Utility loss estimation. This component allows the estimation of the loss in utility caused by the de-identification/anonymization process.
IV Related Work
L. Sweeney et al. introduced the concept of k-anonymity and how it can protect against re-identification attacks by creating indistinguishable records. Khaled El Emam et al. proposed a way to achieve globally optimal k-anonymity. LeFevre et al. proposed Mondrian as an approach to achieve a good balance between generalization and information loss for multidimensional datasets. These works, along with numerous others that present optimal solutions to achieve k-anonymity, try to prevent re-identification attacks through generalization. All these approaches work on unmodified data and they do not include the notion of anonymity over encrypted datasets.
Achieving k-anonymity using clustering is not a new concept. Bertino et al. proposed an efficient k-anonymization algorithm called k-member, which is useful in identifying the generalization required to apply k-anonymity to a given dataset. Loukides and Shao propose novel clustering criteria that treat utility and privacy on equal terms and propose sampling-based techniques to optimally set up their parameters. Aggarwal et al. use a personalized clustering algorithm in order to provide a level of anonymity to the individuals recorded in the dataset. All of the proposed algorithms require direct access to the data and do not operate over encrypted data.
Jiang and Clifton propose a secure distributed framework for achieving k-anonymity. Their paper describes a method to locally anonymize a dataset so that the joined dataset will be k-anonymous. A two-party secure distributed framework is developed which can be adapted to design a secure protocol to compute k-anonymous data from two vertically partitioned sources. This framework does not apply to encrypted data shared in a hybrid cloud infrastructure. Jiang and Atzori propose a privacy-preserving strategy to mine k-anonymous frequent item sets between two or more parties. The proposed algorithm operates on encrypted data to extract insights. The original data are not modified and they are still not compliant with any privacy model after the application of the proposed algorithm.
The combination of differential privacy and homomorphic encryption has been considered previously. In prior work, differentially private encryption schemes were considered as a way to prevent leakage of information. The authors proposed the Encrypt+DP concept, which imposes differential privacy on the decryption process, rendering it a stochastic process that may not always be correct. They also propose DP-then-Encrypt, whereby noise satisfying differential privacy is first added to the data before being encrypted. Both of these schemes are different from the one presented in this paper, as we achieve differential privacy on encrypted data, without having to see the plaintext and without having to decrypt the ciphertext.
The work most closely related to our paper is the approach proposed by Liu et al. In that paper, a method for performing k-means over homomorphically encrypted data is presented. The paper uses a specific encryption scheme. The main difference is that their approach does not extend to k-anonymity and that the execution scenario described in their approach assumes that the clustering algorithm is performed in a single VM. Furthermore, our work extends to vulnerability identification, masking and differential privacy.
PRIvacy Masking and Anonymization (PRIMA)  provides several features for the strategy design and enforcement of data privacy in production grade systems. PRIMA aims to guide decision makers through the data de-identification process while minimizing required input. PRIMA operates on a different trust model, where the data are anonymized before reaching or in the cloud environment, and has no ability to work on encrypted data.
V Secure Protocols
As described in Section III-B, all the protocols proposed in the paper are considered in the two-party honest-but-curious cloud setting, with two parties $P_1$ and $P_2$. The Data Owner (DO) has a plaintext database table $T$ consisting of $N$ data points $t_1, \ldots, t_N$. Each data point is a $d$-dimensional value, i.e. $t_i = (t_{i,1}, \ldots, t_{i,d})$. Furthermore, let the domain of the plaintext space be $\mathcal{M}$ and the domain of the ciphertext space be $\mathcal{C}$. The DO calls the KeyGen function of the SHE algorithm to get the public key and secret key pair (pk, sk). Next, the DO encrypts the plaintext database $T$ using pk to generate an encrypted database $T'$ such that $t'_{i,j} = E(t_{i,j})$. Note that, before encryption, all decimal values are converted to the nearest integer. Further, the categorical values are first divided into different hierarchy levels from general to specific, and each separate path in the hierarchy is assigned values from far-apart ranges as shown in Figure 2. These assigned values are then considered as representatives for categorical data. The values of the hierarchy are also encrypted on the DO side. This specific assignment technique will help us to securely identify the common ancestor as shown in Section V-D4.
Then, the DO shares pk, $T'$ and the identification threshold $k$ with Party $P_1$, and sk with Party $P_2$.
V-A Privacy Vulnerability Identification
The privacy vulnerability identification process explores the combinatorial space of data attributes and aims to identify direct and quasi-identifiers, i.e. value sets that appear fewer times than a pre-defined identification threshold $k$. The process starts by inspecting single attributes and tries to find values that appear fewer than $k$ times. All attributes detected to have values appearing fewer than $k$ times are reported as direct identifiers. In our example, each name value appears only once, thus the name attribute is a direct identifier. The process then inspects pairs of attributes that are not direct identifiers, then proceeds to combinations of three attributes and so on.
Naïve exploration of the entire combinatorial space is infeasible for a large number of attributes, since for $n$ attributes $2^n - 1$ combinations need to be checked (see Figure 3). Pruning techniques are employed to avoid the exploration of the full space. Pruning can be applied in the following two scenarios. First, if an attribute, or a set of attributes, is a quasi-identifier then all the combinations of attributes including this attribute, or set of attributes, are also quasi-identifiers. Second, if a set of attributes $A$ is not a quasi-identifier then no subset of $A$ is a quasi-identifier. This leads to a dramatic reduction of the number of combinations of attributes that need to be checked, and thus to a significant improvement in the execution time of the protocol. As an example, consider the scenario shown in Figure 4, where the impact of pruning is depicted in terms of the reduction of the search space. Refer to the cited work for further discussion of the impact of pruning in the identification of privacy vulnerabilities.
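The effect of the first pruning rule can be sketched on plaintext data as a level-wise search that skips every combination containing an already discovered identifier. The function names and the counting of checked combinations are hypothetical additions for illustration; the secure protocol performs the frequency test over encrypted values instead.

```python
from collections import Counter
from itertools import combinations


def is_identifying(rows, attrs, k):
    """True if some value combination of attrs appears fewer than k times."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return min(counts.values()) < k


def minimal_identifiers(rows, attributes, k):
    """Level-wise search; supersets of known identifiers are pruned."""
    found = []    # minimal identifying attribute sets discovered so far
    checked = 0   # number of combinations actually tested
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            if any(set(q) <= set(combo) for q in found):
                continue  # contains a known identifier, so it is one too
            checked += 1
            if is_identifying(rows, combo, k):
                found.append(combo)
    return found, checked


# Hypothetical toy records, for illustration only.
rows = [
    {"name": "A", "gender": "F", "zip": 1, "year": 80},
    {"name": "B", "gender": "F", "zip": 1, "year": 81},
    {"name": "C", "gender": "M", "zip": 2, "year": 80},
    {"name": "D", "gender": "M", "zip": 2, "year": 81},
]
```

On this toy table with four attributes, only 7 of the 15 possible combinations are tested for $k = 2$, because every combination containing the name attribute, or containing an identifying pair, is skipped.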
We use Algorithm 1 to identify direct identifiers. Since our encryption function is non-deterministic, a direct comparison of encrypted attribute values will not be helpful. So, for each encrypted value in the attribute, $P_1$ computes its difference from the remaining $N - 1$ values, where $N$ is the number of tuples in the dataset (the difference will be zero if there is a value match within the attribute). Then $P_1$ multiplies these differences by random values and sends the resulting matrix to party $P_2$, as shown in Algorithm 1. Next, for each attribute, $P_2$ counts the number of zeros for every encrypted value; if there is an encrypted value for which the count of zeros is less than $k$, then this attribute is a direct identifier. $P_2$ tracks the direct identifiers by setting to $1$ the corresponding index in an indicator vector, as also shown in Algorithm 1. Then, $P_2$ returns the vector to Party $P_1$.
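The idea behind the blinded equality test can be simulated on plaintext integers: a difference of zero survives multiplication by a random non-zero factor, while every other difference is hidden behind the blinding. The sketch below is an assumption-laden illustration with hypothetical function names; in the actual protocol the differences and products are computed homomorphically by the first party, and the second party sees them only after decryption.

```python
import random


def blinded_differences(column, rng=random):
    """First party's step, simulated on plaintext: r * (v_i - v_j), i != j."""
    n = len(column)
    return [
        [rng.randint(1, 2**16) * (column[i] - column[j])
         for j in range(n) if j != i]
        for i in range(n)
    ]


def is_direct_identifier(blinded, k):
    """Second party's step: a zero marks a match; flag the attribute if
    some value has fewer than k zeros among its blinded differences."""
    return any(row.count(0) < k for row in blinded)
```

For a column whose values all coincide, every row of blinded differences is all zeros and the attribute is not flagged; for a column of distinct values, no zero appears and the attribute is reported as a direct identifier.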
Similarly, we utilize the matrix computed in Algorithm 1 along with the pruning mechanism described earlier to find quasi-identifiers.
V-B Data Masking
Data masking is applied when there is need to replace the original values with fictionalized ones. If we operate on non-encrypted data, then multiple options are available: format-preserving and semantic-preserving masking, compound masking as well as some generic masking providers, like nullification, hashing, randomization, truncation and numeric value shifting. Format-preserving masking dictates that the masked value will have the same format as the original one. Semantic-preserving masking ensures that parts of the original value that contain auxiliary information need to be maintained.
Since we operate on encrypted data, not all options are available. Semantic-preserving and format-preserving masking cannot be applied, since they require access to the original value, unless the data owner encrypts only the unique parts of the value. This requires additional metadata so that the cloud environment knows how to handle each value (e.g. offsets and lengths of encrypted portions of the value). In this paper, we apply the following masking operations:
Masking of dictionary-based entities. Entities like names, organizations, cities, countries and many more rely on dictionaries to perform format-preserving masking. For example, if we want to replace a name with another one, then we pick a random name from its dictionary. We can apply the same operation over encrypted data. The user uploads a fully encrypted dictionary for the attribute. Then we select a random value from the encrypted dictionary and replace the value. However, the encrypted version of the dictionary needs to be immune to inference attacks. For specific attributes, an attacker can infer the attribute type and values based on cardinality attacks. As an example, a dictionary of two entries could potentially be a gender dictionary. To alleviate this problem, we can append copies of its values to the dictionary. Since the encryption is non-deterministic, the copies yield distinct ciphertexts, so the cardinality of the dictionary can be increased arbitrarily.
Numerical masking operations: We can mask numerical values by using the following mechanisms:
Add a constant shift amount, for example adding value 10 to all values
Noise addition. Given a percentage $p$, we can mask a value $v$ by replacing it with a random value in the range $v(1 - p/100)$ to $v(1 + p/100)$
Randomization. Replace a value with a randomly generated number.
Redaction / fixed replacement: This is a special case where we create dictionaries with encryption of empty string or fixed values.
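The numerical masking operations listed above can be sketched on plaintext values as follows. The function names are hypothetical, and the noise range of plus or minus the given percentage of the value is an assumption about the intended range. Over encrypted data, the constant shift would correspond to one homomorphic addition of an encrypted constant.

```python
import random


def shift(value, amount=10):
    """Constant shift: add a fixed amount to the value."""
    return value + amount


def add_noise(value, percentage, rng=random):
    """Noise addition: uniform draw within +/- percentage % of the value
    (assumed interpretation of the range)."""
    delta = abs(value) * percentage / 100.0
    return rng.uniform(value - delta, value + delta)


def randomize(low, high, rng=random):
    """Randomization: replace the value with a random integer in [low, high]."""
    return rng.randint(low, high)
```

For instance, `shift(5)` gives 15, and `add_noise(100, 10)` always falls between 90 and 110.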
V-C Differential Privacy
In this section we demonstrate achieving differential privacy on encrypted data with select mechanisms. To implement the differential privacy mechanisms on numerical data given in Section III-E, some information on the data is required, such as the diameter for the Laplace mechanism, and the binary values for the binary mechanism. This information must somehow be provided to the CSP for the mechanisms to be implemented. As we will show later, it is sufficient for this information to be available in encrypted form. Making such information about the data publicly available may reveal unwanted information and lead to inference attacks (e.g. attribute type, extreme values, etc.), and is therefore not desirable.
Before the data is encrypted, the DO selects lower and upper bounds $a$ and $b$ that are independent of the data. This may be performed by examination of the attribute in question (e.g. a person's age), or by other means, but the bounds must not be a function of the data (i.e. the range of the data). In the case of binary-valued data, $a$ and $b$ will simply be the two binary values. Non-informative bounding ensures no additional privacy leakage, allowing the entire privacy budget to be spent on the differential privacy mechanism itself. These bounds must then be encrypted and stored securely alongside the dataset in question. For the remainder of this subsection, we will refer to the encrypted values $E(a)$ and $E(b)$.
Below, we detail how we can use $\mathrm{Add}$ and $\mathrm{Mult}$ (Section III-D) to render the encrypted values differentially private, without having to decrypt the original values. This process is then applied independently to each value of interest.
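Laplace mechanism: To achieve $\epsilon$-differential privacy for a value $x$ with lower and upper bounds $a$ and $b$, the required scale factor is $(b - a)/\epsilon$. In determining the noise to add to the data, we sample $\eta \sim \mathrm{Lap}\left(0, \frac{b - a}{\epsilon}\right)$, and add this to the encrypted value. In generating $\eta$, we draw a value $u$ at random from a uniform distribution on $\left(-\frac{1}{2}, \frac{1}{2}\right)$, and use the inverse of the cumulative probability distribution of $\mathrm{Lap}\left(0, \frac{b - a}{\epsilon}\right)$ to find
$$\eta = -\frac{b - a}{\epsilon} \, \mathrm{sgn}(u) \, \ln(1 - 2|u|),$$
where $\mathrm{sgn}$ is the signum function, defined by
$$\mathrm{sgn}(u) = \begin{cases} -1 & \text{if } u < 0, \\ 0 & \text{if } u = 0, \\ 1 & \text{if } u > 0. \end{cases}$$
We cannot calculate the plaintext $\eta$, since $b - a$ can only be calculated in encrypted form. We can, however, calculate its ciphertext $E(\eta)$ as:
$$E(\eta) = \mathrm{Mult}(E(w), E(b - a)),$$
where $w$ is given by:
$$w = -\frac{1}{\epsilon} \, \mathrm{sgn}(u) \, \ln(1 - 2|u|).$$
The resultant value that is stored is therefore
$$\mathrm{Add}(E(x), E(\eta)) = E(x + \eta).$$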
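The split of the Laplace noise into a data-independent factor and the diameter can be checked on plaintext numbers. In the sketch below, `noise_factor` plays the role of the factor that can be computed in the clear from the uniform draw and $\epsilon$, while `diameter` stands for the difference between the upper and lower bound, which is only available in encrypted form in the protocol. The names are hypothetical, and how the real-valued factor would be encoded for an integer-based encryption scheme (for example by fixed-point scaling) is left open here.

```python
import math
import random


def sgn(u):
    """Signum function: -1, 0 or 1."""
    return (u > 0) - (u < 0)


def draw_u(rng=random):
    """Uniform draw from the open interval (-1/2, 1/2)."""
    while True:
        u = rng.random() - 0.5
        if abs(u) < 0.5:
            return u


def noise_factor(u, epsilon):
    """Data-independent part: -sgn(u) * ln(1 - 2|u|) / epsilon."""
    return -sgn(u) * math.log(1.0 - 2.0 * abs(u)) / epsilon


def laplace_noise(u, epsilon, diameter):
    """Inverse-CDF sample of Laplace(0, diameter / epsilon) for a given u."""
    return diameter * noise_factor(u, epsilon)
```

Because the noise is a plain product of the two parts, a single homomorphic multiplication followed by one homomorphic addition suffices to perturb each encrypted value.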
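Binary mechanism: If the original data is binary, with $a$ and $b$ being the two binary values, the binary mechanism can be used. This time we draw $u$ at random from the unit interval $[0, 1]$. If $u \le \frac{e^{\epsilon}}{1 + e^{\epsilon}}$, then the value $x$ remains unchanged. However, if $u > \frac{e^{\epsilon}}{1 + e^{\epsilon}}$, then we flip $x$ by setting $x' = a + b - x$. Again, this can be done without knowing the value of $x$, and by only knowing the ciphertexts $E(x)$, $E(a)$ and $E(b)$. We can implement this using $\mathrm{Mult}$ to get $E(-x)$, and then using $\mathrm{Add}$ as before.
In the case of the value being flipped, the value that is stored is
$$\mathrm{Add}(\mathrm{Add}(E(a), E(b)), E(-x)) = E(a + b - x).$$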
In this paper, we implement anonymization algorithms that support the k-anonymity privacy guarantee as formally defined in the literature. Given the identification threshold $k$, achieving k-anonymity over encrypted data is a three-step process. First, we securely partition the data into clusters. In this paper, we specifically apply the k-means clustering algorithm over encrypted data and use the Squared Euclidean Distance (SED) metric to calculate the proximity of values in their respective feature space. Second, to ensure that each cluster has at least $k$ members, we apply data suppression and re-assignment techniques as presented in Sections V-D2 and V-D3, respectively. Finally, we securely anonymize the original data values to a representative one. For the numerical attributes, we replace them with the cluster centroid. For each categorical attribute, we replace the values with their common ancestor based on the respective generalization hierarchy.
Algorithm 2 outlines the procedure to compute k-anonymized data for a database table having $d$ attributes. The algorithm takes as input, at $P_1$, the encrypted table $T'$, the identification threshold $k$, the number of iterations of the clustering algorithm rounds and the suppression threshold th. Further, $P_2$ has the secret key $sk$. In the end, the algorithm outputs the corresponding k-anonymized database table.
In the following sections, we will describe different steps of Algorithm 2.
V-D1 Data Clustering
To produce a k-anonymized database, $P_1$ can find at most $c = \lfloor N/k \rfloor$ clusters, each having at least $k$ members, where $N$ is the total number of tuples in the table $T'$. In Step 2 we randomly select $c$ tuples as the initial cluster centres (other initial cluster centre selection methods can be used), and in Steps 4 – 9 the squared Euclidean distance is computed for all the tuples from all the cluster centres using the homomorphic properties of the SHE encryption scheme, and the results are stored in a distance matrix.
Next, in Step 11, the new cluster assignments for all the tuples are identified by calling the function ComputeMinIndex described in Algorithm 3. This algorithm takes as input a vector of $c$ encrypted distances, where $c$ is the number of clusters, and returns an encrypted vector of size $c$ holding the encryption of $1$ at the index of the nearest cluster centre and the encryption of $0$ at all other positions. In Steps 1 – 2, $P_1$ selects a monotonically increasing polynomial $f$, such that $f(x) < f(y)$ iff $x < y$, and homomorphically evaluates it over all the values in the input vector. Next, in Step 3, $P_1$ selects a pseudo-random permutation (PRP) $\pi$ and permutes the masked vector. Finally, it sends the permuted vector to $P_2$. Then, in Step 5, $P_2$ decrypts the vector and in Steps 6 – 12 identifies the index of its minimum element. In Steps 13 – 14, $P_2$ initializes a vector of size $c$ with $0$ and then sets the value at the index identified above to $1$. Then, in Step 15, $P_2$ encrypts this vector and sends it to $P_1$. Note that the vector contains the encryption of $1$ at exactly one position, corresponding to the nearest cluster centre, but since the input was permuted by $P_1$, $P_2$ does not learn the correct cluster assignment. Next, $P_1$ applies the inverse permutation $\pi^{-1}$ in Step 17. Note that $P_1$ has received the cluster assignment for tuple $t_i$ but does not learn the cluster to which this tuple is assigned, since all the entries in the vector are encrypted using non-deterministic encryption. Similarly, $P_1$ receives the encrypted cluster assignment for every tuple in the encrypted table.
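The message flow of ComputeMinIndex can be simulated on plaintext integers to see why masking and permuting do not change the result. The sketch below assumes a linear masking polynomial with a positive leading coefficient as a toy choice of monotonically increasing polynomial; the function name and parameter ranges are hypothetical, and encryption and decryption are omitted entirely.

```python
import random


def compute_min_index(distances, rng=random):
    """Plaintext simulation: returns a one-hot vector marking the minimum."""
    c = len(distances)

    # First party: order-preserving mask f(x) = a * x + b with a > 0.
    a, b = rng.randint(1, 1000), rng.randint(0, 1000)
    masked = [a * d + b for d in distances]

    # First party: random permutation, position i holds original index perm[i].
    perm = list(range(c))
    rng.shuffle(perm)
    permuted = [masked[perm[i]] for i in range(c)]

    # Second party: sees only masked, permuted values and marks the minimum.
    j = min(range(c), key=permuted.__getitem__)
    one_hot = [1 if i == j else 0 for i in range(c)]

    # First party: undo the permutation.
    result = [0] * c
    for i in range(c):
        result[perm[i]] = one_hot[i]
    return result
```

Whatever mask and permutation are drawn, `compute_min_index([9, 4, 25])` marks the second position, since the mask preserves the order and the permutation is inverted at the end.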
Next, in Step 13 of Algorithm 2, $P_1$ calls the function RecomputeClusterCentres to recompute the cluster centre representatives. Algorithm 4 provides the details of this function. It takes as input the encrypted cluster assignments and returns the new cluster centres. In Steps 3 – 6, Algorithm 4 first computes the encrypted cluster count and sum for every cluster. Note that in Step 5, tuple $t_i$ is added to the sum of cluster $j$ if and only if it belongs to cluster $j$, since the corresponding assignment entry is $E(1)$ if tuple $t_i$ belongs to cluster $j$, and $E(0)$ otherwise. Next, in Step 7, $P_1$ selects a random value $r_j$ and multiplies it with the cluster count for cluster $j$. This step produces a one-time pad blinding of the cluster count. Similarly, in Step 8 the corresponding cluster sum is blinded with a random value $r'_j$. All the above operations are performed using the homomorphic properties of the SHE encryption scheme. Now, $P_1$ sends the blinded cluster count and sum to $P_2$. In Steps 13 – 15, $P_2$ decrypts them, divides each cluster sum by the corresponding count and re-encrypts the results. $P_2$ then sends the encrypted divisions to $P_1$. In Step 19, $P_1$ multiplies the cluster divisions with the ratio $r_j / r'_j$ of the random values selected in Steps 7 and 8. This step removes the randomness and yields the updated cluster centres encrypted under the public key $pk$. The clustering process is repeated for rounds iterations in order for the cluster centres to converge.
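The blinding arithmetic of the centre update can be verified with exact rational numbers. In the sketch below, one random factor blinds the cluster count and another blinds the cluster sum; the party that divides sees only the blinded quantities, and multiplying the quotient by the ratio of the two random factors recovers the true mean. The function name and the range of the random factors are hypothetical, and encryption is omitted.

```python
from fractions import Fraction
import random


def recompute_centre(cluster_sum, cluster_count, rng=random):
    """Plaintext simulation of blinded division for one cluster coordinate."""
    # First party: independent random blinding factors.
    r_count = rng.randint(1, 2**16)
    r_sum = rng.randint(1, 2**16)
    blinded_count = r_count * cluster_count
    blinded_sum = r_sum * cluster_sum

    # Second party: divides the blinded sum by the blinded count.
    blinded_quotient = Fraction(blinded_sum, blinded_count)

    # First party: removes the blinding with the ratio r_count / r_sum.
    return blinded_quotient * Fraction(r_count, r_sum)
```

For a cluster with coordinate sum 30 and 4 members, the function returns 15/2 regardless of the random factors drawn.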
V-D2 Cluster Suppression
The above clustering process does not provide a guarantee on the number of members in each cluster. In order to achieve k-anonymity, we make sure that each cluster has at least k members. To achieve this, we apply a post-processing phase on the output clusters. In Step 16, Algorithm 2 calls the function Non-kClusters to identify the clusters having fewer than k members. The details of this function are described in Algorithm 5. It takes as input the encrypted cluster assignment and the encrypted cluster counts, and returns a vector with one entry per cluster indicating the clusters that need further processing. In Step 1, P1 selects a monotonically increasing polynomial and homomorphically evaluates it over the encrypted cluster counts. Next, in Step 3, a PRP is selected and used to permute the order of the masked counts. Then, in Step 4, the identification factor is encrypted and masked with the polynomial. Next, both the permuted masked counts and the masked identification factor are sent to P2. Party P2 initializes an indicator vector with one entry per cluster. Then, in Steps 7 – 13, it marks the entry of every cluster whose count is less than the identification factor, and finally returns the vector. Party P1 then applies the inverse permutation and retrieves the indicator vector in the original cluster order.
Now, if a cluster has at least k members, it is left unmodified. If, however, the cluster contains fewer than k members, we follow one of two strategies. First, we check if we can suppress the cluster and remove its points from the final anonymized output. For suppression, a threshold is required to specify the maximum percentage of the total points we are allowed to remove. As an example, if the suppression threshold is 10% and the input dataset contains 200 tuples, we are only allowed to remove up to 20 tuples. If suppression is not allowed or we have reached the suppression threshold, then we apply cluster re-assignment techniques, in which the nearest clusters are identified and merged with the non-k clusters. The cluster re-assignment strategies are described in Section V-D3.
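The suppression budget can be made concrete with a short plaintext sketch. The text does not specify the order in which undersized clusters are suppressed, so the smallest-first order used here is an assumption, as are the function name `plan_suppression` and the use of plaintext cluster sizes (in the protocol the counts are only available in encrypted or masked form).

```python
def plan_suppression(cluster_sizes, k, threshold_percent, total_tuples):
    """Split undersized clusters into suppressed and to-be-reassigned ones."""
    # Maximum number of tuples that may be removed, e.g. 10% of 200 = 20.
    budget = total_tuples * threshold_percent // 100
    suppressed, reassign = [], []
    # Assumption of this sketch: consider the smallest clusters first.
    for cluster_id, size in sorted(cluster_sizes.items(), key=lambda kv: kv[1]):
        if size >= k:
            continue  # cluster already satisfies k-anonymity
        if size <= budget:
            suppressed.append(cluster_id)
            budget -= size
        else:
            reassign.append(cluster_id)
    return suppressed, reassign
```

With 200 tuples, a 10% threshold and k = 10, clusters of sizes 3 and 9 fit into the budget of 20 tuples, while a further cluster of size 9 exceeds the remaining budget and has to be re-assigned.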
V-D3 Cluster Re-assignment
Once the suppression threshold is reached, the remaining non-k clusters are re-assigned to their nearest clusters. Merging two clusters is easily done using the encrypted cluster assignment vectors. For example, to merge cluster i into cluster j, we add the i-th component of every encrypted cluster assignment vector to its j-th component. Note that merging a cluster with its nearest one does not by itself guarantee k-anonymity. For example, assume we want to achieve k-anonymity for some k greater than 2 and a cluster has one member. If its nearest cluster also has a single member, merging them will not result in a cluster with a minimum of k elements. Thus, we need to apply the process iteratively until all created clusters have at least k data points. The nearest cluster can be identified using one of the following strategies:
Cluster to Cluster – The nearest cluster is computed based on the squared Euclidean distance of the target non-k cluster centroid from the centroids of the rest of the clusters, as shown in Figure (a).
Point to Cluster – The nearest cluster is computed based on the squared Euclidean distance of the data points in the target non-k cluster from the centroids of the rest of the clusters, as shown in Figure (b).
Point to Point – The nearest cluster is computed based on the squared Euclidean distance of the data points in the target non-k cluster from the data points in the rest of the clusters, as shown in Figure (c).
After the cluster re-assignment step, all the identified clusters have a minimum of k data points.
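The merge operation and the "Cluster to Cluster" strategy can be sketched on plaintext data as follows. The function names are hypothetical, and clearing the source component after the merge is an assumption of this sketch (the text only mentions the addition). In the protocol the additions are performed homomorphically on encrypted one-hot vectors.

```python
def merge_clusters(assignments, src, dst):
    """Merge cluster `src` into `dst` on one-hot assignment vectors."""
    merged = []
    for vec in assignments:
        vec = list(vec)
        vec[dst] += vec[src]  # add the src component to the dst component
        vec[src] = 0          # assumption: clear the merged-away cluster
        merged.append(vec)
    return merged


def nearest_cluster(centroids, target):
    """Cluster to Cluster: nearest centroid by squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    others = [c for c in centroids if c != target]
    return min(others, key=lambda c: sqdist(centroids[c], centroids[target]))
```

Because a merged cluster may still have fewer than k members, both functions would be called repeatedly until no undersized cluster remains.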
V-D4 Data Anonymization
For numerical attributes, we replace the values with the cluster centroid. For categorical attributes, we replace the values with their common ancestor in the respective generalization hierarchy. In order to avoid inference attacks based on the hierarchy structure and the number of nodes per level, we can employ an approach similar to the one used for data masking dictionaries: we can randomly insert dummy nodes at each level. One approach to calculate the common ancestor for all values is the following. We first calculate the common ancestor of the first and second values. Then we calculate the common ancestor of that result and the third value, and so forth. If at some point one of the calculated common ancestors is the root of the hierarchy, the calculations stop. This approach requires at most O(N) checks, where N is the total number of values, and the cost of each check depends on the height of the generalization hierarchy.
We will now describe how we calculate the common ancestor of two values. We will use the hierarchy of Figure 2 as an example. Let us consider the root of the hierarchy to be level 2. We begin from the fact that all the encrypted values in the data belong to the leaves of the hierarchy (level 0). Given two encrypted values from the data, v1 and v2, we find the nearest node from level 1 for each value using a secure kNN approach [14, 32] that returns a single nearest neighbour. Let us call the nearest nodes n1 and n2. We subtract n1 and n2 and forward the difference to Party P2. If the difference is zero, then both values map to the same node and thus we have found the common ancestor. If the difference is non-zero, we follow the same process for n1 and n2, finding their nearest nodes from the next level, and so on. As an optimization, whenever we want to calculate the common ancestor of two values, we start directly at the highest level reached in the previous steps. The entire process stops if for any given pair we reach the root level.
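The level-by-level search for a common ancestor can be illustrated on a plaintext hierarchy. This is a sketch under several assumptions: the hierarchy is given as a hypothetical child-to-parent dictionary, plain equality replaces the encrypted difference test, and dictionary lookups replace the secure kNN step. The node labels in the usage example are invented and do not come from Figure 2.

```python
def depth(node, parent):
    """Number of hops from `node` up to the root."""
    hops = 0
    while node in parent:
        node = parent[node]
        hops += 1
    return hops


def common_ancestor(a, b, parent):
    """Climb both nodes one level at a time until they coincide."""
    da, db = depth(a, parent), depth(b, parent)
    while da > db:  # lift the deeper node to the level already reached
        a, da = parent[a], da - 1
    while db > da:
        b, db = parent[b], db - 1
    while a != b:   # stands in for the zero-difference test
        a, b = parent[a], parent[b]
    return a


def common_ancestor_all(values, parent, root):
    """Fold the pairwise computation over all values, stopping at the root."""
    ancestor = values[0]
    for value in values[1:]:
        if ancestor == root:
            break
        ancestor = common_ancestor(ancestor, value, parent)
    return ancestor
```

As a usage example, with a hypothetical hierarchy `{"Italy": "Europe", "France": "Europe", "Japan": "Asia", "Europe": "*", "Asia": "*"}`, the values Italy and France generalize to Europe, while adding Japan pushes the result to the root.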
V-E Risk and utility assessment
In this section, we sketch out how various risk and utility assessment algorithms can be implemented on top of encrypted data.
Inference-based risk metrics, such as the ones described in [10, 26, 41], rely solely on the size of the equivalence classes and on additional external information, such as the population size (required) and a bias estimation (optional). Thus, it is only required to group the data by equivalence class and count the size of each group.
Simple information loss metrics, such as Average Equivalence Class Size (AECS) and discernibility, also rely on the equivalence class size to provide a result. Categorical precision relies on the level of generalization applied to each value. This information is computed when we perform the data anonymization step (see Section V-D4). Similarly, the generalized loss metric requires the number of leaves for each anonymized value. However, metrics like non-uniform entropy and global certainty penalty require either frequency calculations or knowledge of the data diameter, which can only be acquired with access to the original values.
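To illustrate why equivalence class sizes suffice for the simpler metrics, the following plaintext sketch computes AECS and discernibility from a list of class sizes. The definitions used are the commonly cited ones and are assumptions here: AECS as the mean class size without normalization by k, and discernibility as the sum of squared class sizes with no suppressed records. The function names are hypothetical.

```python
def average_equivalence_class_size(class_sizes):
    """AECS: total number of records divided by the number of classes."""
    return sum(class_sizes) / len(class_sizes)


def discernibility(class_sizes):
    """Each record is penalized by the size of its equivalence class."""
    return sum(size * size for size in class_sizes)
```

Both functions only need the group counts, which matches the observation that grouping by equivalence class and counting is enough for this family of metrics.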
VI Security Guarantees
As described earlier, both Party P1 and Party P2 are considered in the Honest-but-Curious security model, where both parties correctly execute the protocol but may try to learn the plaintext values from their view of the encrypted data processing. We also assume that Party P1 and Party P2 do not collude. Further, Party P2 is additionally trusted with the secret key of the SHE encryption scheme. We want to emphasize that this cloud model is not new and has been used in related problem domains [14, 32].
Given the above assumptions, informally we will show that the views of Party P1 and Party P2 do not reveal any useful information about the plaintext database during the execution of the secure k-Anonymization protocol. We will formally prove this statement using Leakage Profile Analysis.
VI-A Leakage profile at Party P1
Below we enumerate the leakage to Party P1:
Direct Identifier : In Algorithm 1, for each encrypted value in the attribute, Party P1 computes its difference from the remaining values and then multiplies each difference with a different random value. Both these operations are performed using the homomorphic properties of the SHE scheme. Hence any leakage in this step would break the security guarantee of the underlying SHE encryption scheme.
ComputeMinIndex : In Algorithm 3, for each entry in the input vector, Party P1 evaluates a randomly chosen polynomial using the homomorphic properties of the SHE scheme. Since the polynomial evaluation is done over encrypted data, the security guarantee of the SHE scheme ensures that there is no leakage to Party P1. Next, P1 chooses a pseudo-random permutation to hide the physical order of the elements in the vector. This step further breaks any correlation between the physical order of the entries in different vectors.
RecomputeClusterCentres : In Algorithm 4, Party P1 computes the number of data points in every cluster and the corresponding cluster sum. To compute the count, P1 applies the addition operation over the encrypted cluster assignment vectors; to compute the cluster sum, P1 first multiplies the encrypted assignment vector with the encrypted data points and then adds the encrypted values. All the operations in this algorithm are performed over encrypted data using the properties of the SHE scheme, hence the security guarantee of the SHE scheme ensures that there is no possible leakage to Party P1.
Non-kClusters : In Algorithm 5, Party P1 evaluates a randomly chosen polynomial over the encrypted cluster count values and the encrypted identification factor, using the homomorphic properties of the SHE scheme. Hence, any leakage in these steps would break the security guarantee of the SHE scheme.
Suppress and Reassign Clusters : In this step, Party P1 only replaces some encrypted cluster count values with random values or adds two encrypted vectors, hence there is no extra leakage.
The above leakage profile for Party P1 leads to the following security guarantee:
Security Guarantee for Party P1 : The secure k-Anonymization protocol leaks no information to Party P1 except that it learns whether an attribute is a direct identifier and the number of points in the non-k clusters. In particular, Party P1 does not gain any knowledge about the encrypted data points, the difference between two data points, the cluster to which a data point is assigned or the cluster centre representatives.
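The randomized-difference idea behind the direct identifier check can be sketched on plaintext integers. This is not Algorithm 1 itself: the function names and the range of the random multipliers are assumptions, and the real protocol performs the subtraction and multiplication homomorphically, with only the final zero test carried out after decryption.

```python
import random


def blinded_differences(values, rng):
    """Evaluating party: pairwise differences, each scaled by a fresh random value."""
    n = len(values)
    return [
        rng.randint(1, 10**9) * (values[i] - values[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def contains_equal_pair(blinded):
    """Decrypting party: a zero survives the blinding and reveals an equal pair."""
    return any(d == 0 for d in blinded)
```

A non-zero difference is turned into an unrelated-looking number by its random multiplier, whereas a zero stays zero, so the only thing the zero test reveals is whether two values of the attribute are equal.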
VI-B Leakage profile at Party P2
Below we enumerate the leakage to Party P2:
Direct Identifier : In Algorithm 1, Party P2 decrypts the encrypted matrix using the secret key. But since Party P1 has multiplied each entry of the matrix with a different random value before sending it to Party P2, the decrypted matrix effectively contains random values. Hence the only leakage in this step is that Party P2 learns if two values in the attribute are equal, since the corresponding decrypted difference will be 0, but nothing is revealed about the original data points or the difference between two unequal values.
ComputeMinIndex : In Algorithm 3, for every data point, Party P2 receives a vector with one entry per cluster centre. The entries in this vector are the output of the encrypted polynomial evaluation poly(x) over the distances of the data point from the cluster centres. Further, the order of the elements in the vector is permuted using a secure pseudo-random permutation, hence the exact identity of the cluster centre associated with any given value is hidden from Party P2.
Party P2 decrypts the entries in the vector and, since the polynomial poly(x) is order preserving, Party P2 can sort the decrypted values and identify the index of the nearest cluster centre.
In Appendix A, we prove that recovering the plaintext distances from the decrypted polynomial outputs is computationally infeasible for Party P2. The only possible leakage to Party P2 in this round is the presence of points in the database that are equidistant from two or more cluster centres. This is leaked by the presence of identical values among the decrypted outputs. However, since the order of the values is randomly permuted by Party P1, Party P2 cannot map these values back to the original index of either the data point or the corresponding cluster centres in the database.
Fig. 6: Execution time for varying number of data points (each having 2 dimensions)
Fig. 7: Execution time for varying number of dimensions with 1800 data points
RecomputeClusterCentres : In this phase, Party P2 gains access (by virtue of decryption) to the following plaintext (but randomized) quantities:
Sum of the data points nearest to each cluster centre
Number of data points nearest to each cluster centre
We note that since these quantities are multiplicatively randomized by Party P1, their actual values are effectively hidden from Party P2. It is also worth noting that the randomization used is different for each cluster, implying that Party P2 cannot hope to leverage any sharing or re-use of randomization across different cluster centres to gain additional information about the sum or number of data points for any given cluster centre.
Non-kClusters : In Algorithm 5, Party P2 gets access to the plaintext (but masked) values of the anonymization factor and of the number of data points in each cluster. But since Party P1 evaluates a random polynomial poly(x) over their encrypted values before sending them, Party P2 does not learn the actual anonymization factor or the number of data points in each cluster. A proof similar to the one given for ComputeMinIndex above can be presented here.
Suppress and Reassign Clusters : In this step, Party P2 receives a permuted vector holding the encrypted counts of the elements in the non-k clusters, padded with some fake values. Hence, after decrypting this vector, Party P2 cannot identify the number of elements in the non-k clusters: we have picked a secure pseudo-random permutation, which is computationally difficult to invert, so the exact identity of the cluster centre associated with any cluster count is hidden from Party P2.
The above leakage profile for Party P2 leads to the following security guarantee:
Security Guarantee for Party P2 : The secure k-Anonymization protocol leaks no information to Party P2 except whether an attribute is a direct identifier. In particular, Party P2 does not gain any knowledge about the encrypted data points, the cluster to which a data point is assigned or the cluster centre representatives.
In this section, we empirically evaluate the performance of our protocols. The experimental setup consists of three machines, representing the Data Owner, Party P1 and Party P2. The configuration of the machines representing Party P1 and Party P2 is: 4 core 2.8 GHz processors, 64 GB RAM running Ubuntu 16.04 LTS; the configuration of the machine representing the Data Owner is: 4 core 2.8 GHz processors, 8 GB RAM running Ubuntu 16.04 LTS. We use the HElib library to encrypt the data using LFHE. Specifically, for HElib we set (i) , a large prime between and , (ii) the maximum depth to and (iii) the security parameter to .
The two parameters affecting the performance of our protocols are the number of data points and the number of dimensions in the data. To study the independent effect of each of these parameters on our protocols, we use simulated data. We generated two datasets, one with a varying number of data points (results shown in Figure 6) and one with a varying number of dimensions (results shown in Figure 7). The data were generated using a uniform distribution. We repeated each experiment multiple times with a newly generated dataset. The average time across these experiments is reported here.
LFHE allows SIMD operations by packing multiple plaintext data values into a single structure and then encrypting them together into a single ciphertext. We utilize this feature of LFHE extensively. We encrypt each dimension of the data points independently. For each dimension, we pack data from multiple data points into a single structure and then encrypt this structure to get a single ciphertext. This ciphertext is then outsourced to Party P1. The time taken to encrypt the plaintext data is shown in Figure 6(a) and Figure 7(a). These figures show that the data encryption time scales linearly with the number of data points and the number of dimensions.
The second major step in our protocols is to check whether a particular combination of dimensions constitutes a privacy vulnerability (i.e., is a quasi-identifier). The actual number of combinations that need to be tested is data-dependent. To remove this data dependence from the performance evaluation, we report the average time taken to identify a quasi-identifier. The results are shown in Figure 6(b) and Figure 7(b). From the figures, it is clear that the time taken to identify a quasi-identifier scales linearly with the number of data points and is independent of the number of dimensions. This is because the most computationally heavy step is the decryption of distances at Party P2, which is independent of the number of dimensions.
Once a quasi-identifier is identified, the next step is to cluster the data in the quasi-identifiers. Furthermore, after clustering, we use the “Cluster to Cluster” re-assignment strategy to eliminate non-k clusters. Both of these operations are highly dependent on the data and the choice of initial cluster centres. To remove this data dependence from the performance evaluation, we report the average time taken for each iteration of clustering and the time taken to re-assign a single cluster. Figure 6(c) and Figure 7(c) show that both of the above operations scale linearly with the number of data points. The number of dimensions has a negligible effect on the cluster re-assignment (again, the most expensive step is the decryption of inter-cluster distances at Party P2, which is independent of the number of dimensions).
The above performance evaluation shows that our protocols scale linearly with the number of data points as well as the number of dimensions in the dataset.
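The per-dimension packing layout described above can be sketched without any cryptography. The function name `pack_by_dimension`, the slot count and the zero padding of the last chunk are assumptions of this sketch; in a real deployment each inner list would be encoded and encrypted into one ciphertext by the homomorphic encryption library.

```python
def pack_by_dimension(points, slots):
    """Group the values of each dimension into chunks of `slots` entries."""
    packed = []
    for d in range(len(points[0])):
        column = [point[d] for point in points]
        chunks = []
        for start in range(0, len(column), slots):
            chunk = column[start:start + slots]
            chunk += [0] * (slots - len(chunk))  # assumption: pad with zeros
            chunks.append(chunk)
        packed.append(chunks)
    return packed
```

The number of chunks is the number of dimensions times the number of points divided by the slot count (rounded up), which is consistent with an encryption time that grows linearly in both parameters.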
This paper presents a set of secure algorithms for applying anonymization over homomorphically encrypted databases. It does not focus on a single anonymization approach but touches on the various components that are required for end-to-end privacy. It demonstrates how to achieve uniqueness discovery, data masking, differential privacy and k-anonymity over encrypted data without leaking information about the original values. The feasibility of this solution is shown by an empirical evaluation. This work is the first to perform several techniques, such as vulnerability assessment, differential privacy and k-anonymity, over encrypted datasets, which means there is room for improvement and future work, especially on the performance and optimization side.
- [1] (2010-07) Achieving anonymity via clustering. ACM Trans. Algorithms 6 (3), pp. 49:1–49:19.
- [2] (2018) PRIMA: an end-to-end framework for privacy at scale. In ICDE.
- [3] (2016-06) Apple previews iOS 10, the biggest iOS release ever. Note: http://www.apple.com/newsroom/2016/06/apple-previews-ios-10-biggest-ios-release-ever.html [Accessed: 2016-07-26].
- [4] (2005) Data privacy through optimal k-anonymization. In ICDE.
- [5] (2005) Evaluating 2-DNF formulas on ciphertexts. In Theory of Cryptography Conference, pp. 325–341.
- [6] (2014) (Leveled) fully homomorphic encryption without bootstrapping. ACM Transactions on Computation Theory (TOCT) 6 (3), pp. 13.
- [7] (2017) A differentially private encryption scheme. In Information Security, P. Q. Nguyen and J. Zhou (Eds.), Cham, pp. 309–326.
- [8] (2011) Twin clouds: an architecture for secure cloud computing. In Workshop on Cryptography and Security in Clouds (WCSC 2011), Vol. 1217889.
- [9] (2007) Efficient k-anonymization using clustering techniques. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications, DASFAA’07, Berlin, Heidelberg, pp. 188–200.
- [10] (1998) Estimation of identification disclosure risk in microdata. Journal of Official Statistics 14 (1), pp. 79.
- [11] (2012) Differential privacy as a response to the reidentification threat: the Facebook advertiser case study. North Carolina Law Review 90 (5).
- [12] (2006) Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pp. 265–284.
- [13] (2006) Differential privacy. In Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II, pp. 1–12.
- [14] (2014-03) Secure k-nearest neighbor query over encrypted data in outsourced environments. In 2014 IEEE 30th International Conference on Data Engineering, pp. 664–675.
- [15] (2009) A globally optimal k-anonymity method for the de-identification of health data. JAMIA 16 (5), pp. 670–682.
- [16] (2014) RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS ’14, New York, NY, USA, pp. 1054–1067.
- [17] (2008) Composition attacks and auxiliary information in data privacy. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 265–273.
- [18] (2009) Fully homomorphic encryption using ideal lattices. In STOC, Vol. 9, pp. 169–178.
- [19] (2007) Fast data anonymization with low information loss. In VLDB.
- [20] (2012) Universally utility-maximizing privacy mechanisms. SIAM Journal on Computing 41 (6), pp. 1673–1693.
- [21] (2009) K-anonymization with minimal loss of information. IEEE TKDE 21 (2).
- [22] (2016) FPVI: A Scalable Method for Discovering Privacy Vulnerabilities in Microdata. In Proceedings of the Second IEEE ISC2.
- [23] (2014) Inter-cloud architectures and application brokering: taxonomy and survey. Software: Practice and Experience 44 (3), pp. 369–390.
- [24] (2018-09) HElib. Note: https://github.com/shaih/HElib
- [25] (2017-11) Optimal differentially private mechanisms for randomised response. IEEE Transactions on Information Forensics and Security 12 (11), pp. 2726–2735.
- [26] (2001) Applying Pitman’s sampling formula to microdata disclosure risk assessment. Journal of Official Statistics 17 (4), pp. 499.
- [27] (2002) Transforming data to satisfy privacy constraints. In KDD.
- [28] (2006-12) Secure distributed k-anonymous pattern mining. In Sixth International Conference on Data Mining (ICDM’06), pp. 319–329.
- [29] (2006-11) A secure distributed framework for achieving k-anonymity. The VLDB Journal 15 (4), pp. 316–333.
- [30] (2014) Extremal mechanisms for local differential privacy. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2879–2887.
- [31] (2008) A note on differential privacy: defining resistance to arbitrary side information. CoRR abs/0803.3946.
- [32] (2018) Efficient secure k-nearest neighbours over encrypted data. In Proceedings of the 21th International Conference on Extending Database Technology, EDBT 2018, pp. 564–575.
- [33] (2006) Mondrian multidimensional k-anonymity. In ICDE.
- [34] (2014) Privacy of outsourced k-means clustering. In Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, ASIA CCS ’14, New York, NY, USA, pp. 123–134.
- [35] (2016-07) Statistical properties of sanitized results from differentially private Laplace mechanisms with noninformative bounding. ArXiv e-prints 1607.08554 [stat.ME].
- [36] (2008-03) An efficient clustering algorithm for k-anonymisation. J. Comput. Sci. Technol. 23 (2), pp. 188–202.
- [37] (2007) Mechanism design via differential privacy. In Foundations of Computer Science, 2007. FOCS’07. 48th Annual IEEE Symposium on, pp. 94–103.
- [38] (1999) Public-key cryptosystems based on composite degree residuosity classes. In Eurocrypt, Vol. 99, pp. 223–238.
- [39] (2002-10) Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10 (5).
- [40] (2002-10) K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10 (5).
- [41] (1991) Estimation of the percent of unique population elements on a microdata file using the sample. In Statistical Research Division Report Number: Census/SRD/RR-91/08.
Appendix A Leakage from ordered equations
We now examine the possibility of any leakage to Party P2 from the resulting system of ordered equations. Let be the ordered set of plaintext distances, and be the ordered set of polynomial outputs obtained by Party P2 upon decryption. As mentioned earlier, the polynomial is of the form for some random . Party P2 can formulate the following system of equations for :
where only the left-hand side of each equation is known to Party P2. Without loss of generality, we may assume that Party P2 can guess with high probability the degree of the polynomial chosen by Party P1, as well as the range of values (say ) that each plaintext distance can take. This is a particularly relevant assumption in the context of real-world datasets, where the adversary may possess some a priori knowledge of the range of Euclidean distances between the data points. In addition, since homomorphic polynomial evaluation in the encrypted domain is a costly operation, the degree can only take a small range of values, which Party P2 can also accurately guess in a small number of trials. However, we prove that even if Party P2 has full knowledge of the aforementioned parameters, it cannot recover the original data points within a feasible amount of computation time. Observe that the system of equations has exactly unknown variables from Party P2’s point of view, while the number of equations is only . Hence, Party P2 must correctly guess the smallest distances to recover the polynomial coefficients. The average number of possible values that these distances can take is , which is approximately the same as for . In other words, the probability that Party P2 successfully recovers the polynomial coefficients, and subsequently the plaintext distances, is approximately , which is close to negligible. For example, for and , the probability that Party P2