Privacy in Data Service Composition

01/03/2020 · Mahmoud Barhamgi, et al. · Universidad Politécnica de Madrid, University Claude Bernard Lyon 1, National Chung Hsing University

In modern information systems, different features of information about the same individual are often collected and managed by autonomous data collection services that may have different privacy policies. Answering many end-users' legitimate queries requires the integration of data from multiple such services. However, data integration is often hindered by the lack of a trusted entity, often called a mediator, with which the services can share their data and delegate the enforcement of their privacy policies. In this paper, we propose a flexible privacy-preserving data integration approach for answering data integration queries without the need for a trusted mediator. In our approach, services are allowed to enforce their privacy policies locally. The mediator is considered to be untrusted, and only has access to encrypted information that allows it to link data subjects across the different services. A new privacy requirement, dubbed k-Protection, limits privacy leaks so that services cannot infer information about the data held by each other. End-users, in turn, have access to privacy-sanitized data only. We evaluated our approach using an example and a real dataset from the healthcare application domain. The results are promising from both the privacy preservation and the performance perspectives.


1 Introduction

Data integration is the problem of bridging together a collection of data sources so that they can be queried as if they were parts of a single database. Despite the intensive research devoted to this problem over the last few decades [1, 31], developing a data integration system remains a challenging task. Data privacy, where the privacy of data subjects in one data source should be protected vis-à-vis other sources and the integration system as a whole, is among the key challenges involved in building a data integration system [15, 4].

Most existing multi-source data integration solutions are built as data warehouses, where data is periodically collected from individual data sources and stored in a central warehouse. Privacy is often approached by signing privacy agreements between the data sources and the warehouse that specify who can access the data and for what purposes. However, such agreements do not guarantee to individual data sources that their data will not be misused by the warehouse or any other stakeholder involved in the data integration system.

In this paper, we explore an alternative multi-source data integration approach that gives data providers control over their data while answering data integration queries and reduces the ability of the integration system to misuse data. We illustrate the research challenges addressed in this paper through a real-world example from the healthcare application domain.

Fig. 1: (a) The data services of the running example; (b) A data integration plan; (c) Sample of the data accessed by the services.

1.1 Motivating Example

Data integration has important applications in the healthcare domain, such as building the patient medical record and detecting the side effects of medications. Assume, for example, a data integration system with access to the data services S1, ..., S5 in Figure 1-a (it is a common practice in the healthcare domain to provide service-oriented access to heterogeneous data sources [13]; this class of services is known as data sharing services [15] or simply data services [11]), which in turn have access to the sample tables in Figure 1-c. Assume we need to investigate the psychological side effects of a specific ingredient of HIV medicines on female patients. Our sample data services can be composed as in Figure 1-b to achieve this objective. Specifically, S1 is invoked with the desired city to retrieve the identifiers (e.g., the social security numbers ssn) of HIV patients. Then S2, which is provided by psychiatric hospitals, is invoked with the obtained ssn numbers to retrieve the psychiatric disorders for which those patients have received treatment, if any. Then, for each HIV patient who has developed a psychiatric disorder, S3 and S4 are invoked in parallel to retrieve their age and sex, and their HIV medications, respectively. Subsequently, S5 is invoked to retrieve the quantity of the studied ingredient in each retrieved HIV medication. Finally, the outputs of S3 and S5 are joined on ssn.

The execution of the data integration plan (also called the service composition plan; we use the two terms interchangeably throughout the paper) in Figure 1-b involves a challenging dilemma. If the individual services were to apply their privacy policies locally and privacy-sanitize their output data, the plan could not be executed: the social security number ssn, which the integration plan uses to link the different information features of the same patient, would not be disclosed by any of the services, as it is personally identifiable information. On the other hand, if services share their output data with the integration system without any protection, then they may infer information about the data held by each other. For example, the provider of S2 may infer that the patients it shares with S1 (refer to table T2 in Figure 1-c) are also AIDS patients if S2 were given access to the integration plan (a reasonable assumption, since data providers may require to be informed of how their data is exploited, by which entities and for what purposes; this is often defined in agreed-upon privacy policies). It needs only to observe the inputs with which it is queried. The providers of S3 and S4 may likewise infer that those same patients have psychiatric disorders and AIDS. Moreover, the integration system (i.e., the entity responsible for coordinating the execution of the integration plan), as well as the end-user (i.e., the stakeholder studying the side effects of AIDS medications), will learn the diseases, medications and ages of all patients transmitted by the services.

A naive solution to avoid information leakage among the involved services would be to have each service ship a copy of its underlying data sources to a centralized data integration system, which could store the received copies and answer queries locally. However, such a solution has several drawbacks that make it impractical for real-world applications. In application domains where the data is dynamic, i.e., updated frequently on the data provider side, results computed from the copies in the centralized system may not reflect the latest updates on the providers' sides; this matters in critical domains such as healthcare, where an incomplete or erroneous query answer might have dramatic consequences for a patient's life. Moreover, such a centralized system may quickly become a target of choice for attackers, due to the data concentration effect of collecting copies from multiple data sources. Furthermore, current privacy regulations may prevent data providers from shipping the data they collect to a third party, which makes the solution unworkable in practice.

In this paper, we propose a practical solution that allows the execution of multi-source data integration plans without leaking information about data subjects (the term data subject refers to the person whose data is collected and managed by data services, e.g., patients) to the involved stakeholders. That is, data services cannot infer information about the data held by each other. Our solution can be used with trusted and semi-trusted (i.e., semi-honest) mediators. In the case of semi-trusted mediators, it enables the involved data services to locally enforce their security and privacy policies (those that relate to end-users), as the mediator cannot be trusted to enforce such policies, while still allowing the mediator to join data subjects across the different services without leaking data from one service to another or to the mediator itself. In the case of trusted mediators, services can delegate the enforcement of their security and privacy policies to the mediator; the solution then only needs to prevent data leakage among services.

1.2 Existing Solutions

Existing solutions for privacy-preserving data integration can be classified into three areas: privacy-preserving data publishing, secure multi-party query computation, and trusted-mediator-based data integration. We discuss below their limitations based on our running example.

Privacy-preserving data publishing: Privacy models such as k-Anonymity [34] and its variations l-Diversity [26] and t-Closeness [24] compute an anonymized view of a private table that can be shared with data consumers without the risk of disclosing the identity of specific data subjects (e.g., patients) in that view. Some recent works [27, 18] have attempted to apply the k-Anonymity concept (and its variations) in distributed settings, where the anonymized view is computed over multiple private tables managed by autonomous data providers.

However, while k-Anonymity and its variations provide good privacy guarantees when the private table is owned by a single data provider, their implementations in distributed settings compromise the privacy of data subjects by leaking private information to the data providers while the latter compute the anonymized view. For example, the works [27, 18] assume that data providers, while computing the anonymized view, can know the list of data subjects they have in common (data providers in these works exchange the identifiers of data subjects in clear while they cooperatively compute the view), but that none of them should learn the specific attribute values held by the others beyond what is included in the computed view. In our example, this means that the provider of S2 can learn that some of its patients are also managed by the provider of S1, which is an HIV center, thus inferring that these patients have AIDS and violating their privacy. Furthermore, such solutions are not convenient for data integration scenarios where the data on the providers' side is dynamic.

Secure Multi-party Computation (SMC): SMC protocols allow two or more parties to evaluate a function over their private data while revealing only the answer and nothing else about each party's data. Initial solutions in this research area involved substantial computation costs that made them impractical [30]. Several solutions with improved efficiency have emerged over the last few years [28, 35]. However, the efficiency improvements come at a high price: most of the new solutions, such as [28], require expensive (parallel) computation architectures that only big corporations can afford. In addition, a recent evaluation of the new solutions' efficiency [22] suggests that they can only be applied to non-critical applications, where end-users can accept delayed answers.

Trusted-mediator-based data integration: Solutions in this category, such as [36, 9], rely on a centralized entity that is trusted by all data providers to compute data integration queries. For example, Yau et al. [36] present a privacy-preserving data repository that collects, from each data provider, only the data necessary to compute the answer to a query. The collected data is hashed in the repository in such a way as to prevent its reuse for computing other queries. However, while such solutions can be used when the integrated data is dynamic, they leak privacy-sensitive information to data providers about the data held by each other.

1.3 Proposed Approach and Contributions

In this paper, we propose a practical multi-source data integration approach that preserves the privacy of data subjects against the different stakeholders involved in answering a data integration query, including data services, the mediator and end-users. A distinguishing feature of our approach is that it reduces data leakage among services to a practical amount that prevents services from inferring sensitive information about data subjects. The mediator has access only to encrypted information with which it joins data subjects across the involved services. The proposed approach can be used with both trusted and semi-trusted mediators. Our solution can be exploited in numerous applications where independent data sources must preserve the confidentiality of their data, including healthcare [13], eGovernment [1], industrial collaboration scenarios [20], emergency management [14, 25], data management in smart environments [12, 8], personal data markets [29], multi-source data analytics [23, 37], etc.

The contributions of this paper are summarized as follows:

  • We define the main privacy requirements for the service-oriented class of data integration. These requirements consider the different actors involved in a composition of services answering a query, including the constituent services (i.e., data providers), the mediator (i.e., the composition system) and the end-users (i.e., data consumers).

  • We introduce a new privacy requirement, dubbed k-Protection, to ensure that there are no data leaks among services (i.e., data providers) during query computation. In a nutshell, when the output of a service Si is used by a composition as input to another service Sj, the k-Protection requirement protects the output of Si by preventing Sj from distinguishing the exact input value among k possible input values, where k is set by Si. This reduces the inference capability of Sj (and of the other services in the composition) about Si's data below a practical threshold set by Si itself.

  • We propose an approach to evaluate multi-source queries over autonomous Web data services while respecting the k-Protection requirement. We validate our approach in the healthcare application domain by conducting a set of experiments on a real medical dataset of 1,150,000 records. The results show that our solution provides practical privacy protection with an acceptable performance overhead.

The remainder of the paper is organized as follows. In Section 2, we define our privacy requirements with respect to services, the mediator, and the end-user. In Section 3, we present our approach for query evaluation. In Section 4, we evaluate the performance of our approach using a real medical dataset and discuss its applicability to real life application domains. In Section 5, we compare our approach with related research works and conclude in Section 6.

2 Privacy Requirements in Service Composition

In this section, we identify and discuss the data privacy requirements with respect to the different actors involved in a composition of services, including services, end-users and the mediator (or the composition system).

2.1 Service Composition

In this paper, we assume that data sources are exposed to the data sharing environment through Web APIs, i.e., Web services, to provide a standardized interface to data (the class of Web services that access and query data sources is known as Data Services or Data Sharing Services [11, 15, 36], and is motivated by the flexibility and interoperability that service-oriented architectures bring to data integration).

End-users' queries are resolved by service composition as follows (Figure 2). Given a query and a set of available data services, the integration system compares the received query with the descriptions of the available services to select the relevant ones. The integration system then rewrites the query in terms of calls to the selected services. The rewriting, also called a composition, is then executed in a way that preserves data privacy relative to the different actors involved.

In this paper we will focus on the privacy issues raised by the execution of a composition. Readers interested in service selection and query reformulation in terms of services are referred to our previous work [6, 5] or to similar works [33].

In the following, we formalize the notion of a composition execution plan and define the requirements that should be satisfied in a privacy-preserving execution of a composition plan with respect to the different actors involved: the mediator, the end-users and the data sharing services.

Fig. 2: Privacy-Preserving Service Composition

Definition 1 (Composition Execution Plan): We adopt the definition given in [33]. A composition execution plan is a directed acyclic graph in which there is a node for each of the data services Si involved in answering the query, and a directed edge from Si to Sj if there is a precedence constraint between Si and Sj. We say that a service Si must precede Sj if one of its outputs is an input of Sj.

The composition execution plan of our example is shown in Figure 1-b. Note that some nodes are preceded only by end-users' input, e.g., S1. The query results can then be computed by joining the outputs of the services that are leaves in the plan (e.g., S3 and S5) [33].

2.2 Data Privacy Requirements

We focus on data privacy in this paper, i.e., the privacy of the data subjects whose data is processed in a composition plan (e.g., patients in the running example), as opposed to the privacy of the end-users who receive the final result, which has been adequately addressed in the literature.

We say that the execution of a composition plan is privacy-preserving if it satisfies the following requirements with respect to the different actors involved including participating services, end-users and the mediator:

2.2.1 Requirements with respect to services

The execution of a composition should not leak information to its constituent services about the data held by each other.

Let K(Si, t) denote the knowledge that a service Si holds a given tuple t, where t concerns a data subject s (e.g., a patient, a product, etc.). Let also L(Si→Sj) denote the knowledge leaked from Si to Sj that Si holds t.

When a composition is executed without any privacy protection, the confidence of Sj's provider in L(Si→Sj) is 1. For example, when S2 in the running example is invoked with a precise ssn value, its provider learns that both S1 and S2 hold a tuple for the corresponding patient, i.e., with confidence 1 (thus inferring that the patient has HIV and mental disorders).

A service Si can control the leakage of its data by keeping this confidence below an accepted threshold (fixed by Si). We define below a mechanism to control data leakage among services.

Definition 2 (k-Protection): Given a composition C = {S1, ..., Sn} and a vector (k1, ..., kn), where ki is a positive integer determining the protection threshold (i.e., ki ≥ 1) that the service Si must provide for its output tuples against the other services in C, then for each edge (Sp, Sj) in C, the confidence of Sj that Sp holds a given tuple must be at most 1/kp, where kp is the protection threshold of Sp, which is, in turn, a (direct or indirect) parent of Sj in C. Note that Sj has at least one parent in C.

The k-Protection mechanism is inspired by the concepts of k-anonymity [34] and Private Information Retrieval (PIR) [32]. Intuitively, it ensures that when a service Sj is invoked with an input data value v originating from Si, Sj must not be able to distinguish its exact input value from k other possible input values (where k is set by Si). In other words, instead of invoking Sj with the precise input value v, Sj is invoked with a generalized value of v that matches a range of values R containing at least k other possible values. This way, the certainty (i.e., the confidence) of Sj that v is held by Si is less than 1/k. One possible way to implement this mechanism is to compute R such that it contains at least k values for which Sj has an output. Note that k-Protection is similar to PIR in that when a service is queried (i.e., invoked), it does not know exactly which query it is executing on its own dataset.
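To make the mechanism concrete, the following sketch shows the two halves of a k-protected invocation: the invoked service only answers range queries that cover at least k of its stored values, and the mediator discards the extraneous tuples afterwards, since it knows the original input. This is an illustrative sketch, not the paper's implementation; all names and types are hypothetical.

    import java.util.*;

    class KProtectionSketch {
        // Service side: answer a range invocation only if the range covers at
        // least k stored identifier values (hypothetical violation handling).
        static Collection<long[]> invoke(NavigableMap<Long, long[]> dataset,
                                         long lo, long hi, int k) {
            NavigableMap<Long, long[]> hits = dataset.subMap(lo, true, hi, true);
            if (hits.size() < k)
                throw new IllegalArgumentException("range violates k-Protection");
            return hits.values();
        }

        // Mediator side: keep only the tuples matching the original encrypted
        // value v; the mediator knows v, so the false positives are dropped.
        static List<long[]> filter(Collection<long[]> tuples, long v) {
            List<long[]> kept = new ArrayList<>();
            for (long[] t : tuples)
                if (t[0] == v) kept.add(t);    // t[0] holds the (encrypted) id
            return kept;
        }
    }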

Example: We continue with our running example. Given the data accessed by our sample services in Figure 1-c, examples of the privacy breaches that would occur if these services were invoked without applying the k-Protection mechanism include: S2 would know that some of its patients have AIDS; S3 and S4 would know that these same patients, in addition to having AIDS, suffer from severe psychiatric disorders; etc. Now assume that k = 3. The k-Protection mechanism ensures that S2 must not be able to distinguish each of its input values from at least 3 other values for which it has matching tuples in its table T2. Figure 3 shows how k-Protection is enforced on the edge (S1, S2): an input value is generalized into a range of values R that contains at least three other values for which S2 has matching tuples. After the invocation of S2, the extraneous tuples are filtered out.

Fig. 3: Applying the k-Protection mechanism on the edge (S1, S2) (k = 3)

Preserving data privacy against the different data providers involved in a query has not been adequately addressed in the literature. For example, works on distributed privacy-preserving data publishing such as [27, 18] have focused on preserving data privacy against final end-users while allowing the involved data providers to know the list of data subjects they have in common. The k-Protection mechanism complements those works by making them immune to information leakage among data providers.

2.2.2 Requirements with respect to end-users

End-users are the entities that issue a query and receive the final result of executing the composition answering that query. Depending on the application, end-users may or may not be trusted by data subjects. For example, caregivers (e.g., primary care doctors, nurses, etc.) may be trusted to access (all or part of) the medical information of their patients for treatment purposes, whereas researchers who study the health conditions of a population may only be trusted to access anonymized data that cannot be linked to any specific patient.

End-users should only be allowed to access the information they are entitled to. This can be ensured either by applying the privacy-aware access control policies of the individual services in the case of trusted end-users, as in [7], or by anonymizing the results returned by the composition to prevent the re-identification of data subjects.

2.2.3 Requirements with respect to the mediator

The mediator, i.e., the entity that executes the composition plan, is another important actor in a composition of services. It is not necessarily managed by the final data recipient. It is responsible for carrying out the intermediary data operations in the composition plan (e.g., joining the outputs of different services, tuple selection, etc.).

Mediators are trusted if managed by a trusted entity and untrusted otherwise. An untrusted mediator should not have access to any Personally Identifiable Information (PII). For example, the mediator should not be able to identify any of the patients whose private data circulates throughout the composition plan.

This implies that the data produced by a service should be protected by the service itself before being released to the mediator (and used for the invocation of other services). This can be ensured either by encrypting the released data (in the case of trusted end-users) or by anonymizing it (in the case of untrusted end-users).

Different applications may require different combinations of these requirements, i.e., not all requirements must be respected at the same time. In applications such as healthcare, often only the requirements related to services and end-users are relevant: data integration is usually carried out by a trusted authority (e.g., a government agency) to, for example, discover new medical knowledge. In that case, it is important to protect data privacy against individual data providers (i.e., services) and end-users (e.g., researchers), whereas the mediator itself (i.e., the authority) is trusted. In application domains such as cybersecurity and the fight against terrorism, the requirement related to services is often the most relevant. For example, consider a scenario whose objective is to proactively identify potential airplane terror attacks before they happen by identifying risky passengers (e.g., passengers with a criminal history and suspicious behaviors) on the passenger lists of airline companies. The data providers in such a case could be airline companies, police and intelligence services, banks, etc.; the final end-user is a police inspector and the mediator is a governmental agency. Here, the end-user and the mediator can be trusted, whereas data providers should not learn that they have a certain person in common before a thorough investigation by the inspector (e.g., an airline company should not know that one of its passengers has a criminal history before the case is fully investigated).

3 A Privacy-Preserving Query Evaluation Approach

In this section, we start by presenting our different assumptions and some key concepts. Then, we present an approach to evaluate multi-source queries while respecting the requirements discussed above.

3.1 Context and Assumptions

In this work, we make the following assumptions. We consider a distributed environment with a heterogeneous distribution of data, meaning that different services manage different features of information about the same set of data subjects. In contrast, in a homogeneous data distribution, different services manage the same features of information about different data subjects. The latter case is easier to deal with, since there is no real integration to be done; we therefore only consider the former.

We consider an honest-but-curious environment, where the stakeholders involved in the execution of a data integration plan follow the given protocol but may try to analyze the data exchanged during its execution. This setting is also known as the semi-honest model in the literature [17].

We assume that services can provide statistical information about the datasets they access, such as the service selectivity [33]. The selectivity of a data service S relative to a range of input values R, denoted Se(S, R), is the number of output values returned when S is queried with R. For instance, considering the sample data in Figure 1-c, the selectivity of S2 relative to the range covering its whole table T2 is the number of distinct identifier values in T2.
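As a small illustration of this statistic, the following sketch computes Se(S, R) over a sorted set of (encrypted) identifier values; the names are ours, not the paper's.

    import java.util.*;

    class Selectivity {
        // Se(S, R): the number of stored identifier values falling in R = [lo, hi].
        static int se(NavigableSet<Long> ids, long lo, long hi) {
            return ids.subSet(lo, true, hi, true).size();
        }
    }

For example, if a service stores the identifiers {3, 8, 15, 21}, then se(ids, 5, 20) returns 2.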

3.2 An Overview of the Proposed Approach

In our approach we assume that services, the mediator and end-users are independent entities. Our approach ensures the privacy requirements relative to those entities as follows.

In our approach, the data services involved in the integration plan can locally apply their security and privacy policies on non-identifier attributes. On the other hand, they are all required to encrypt the identifiers used by the mediator to join data with an Order Preserving Encryption Scheme (OPES) [2]. An OPES encrypts numeric data values while preserving the order relation between them; with an OPES, equality and order comparisons can be applied to encrypted data without decrypting the operands. The mediator thus only gets access to encrypted values of identifier attributes and, for non-identifier attributes, to values that are already protected by the services (e.g., by applying the desired anonymization techniques). This satisfies the privacy requirements relative to the mediator. Once the integration plan has been executed, the mediator removes the encrypted identifier attributes from the result, so the final recipient only receives anonymized data without any individually identifiable information. This satisfies the privacy requirements relative to end-users.
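The following sketch illustrates why an OPES suffices for the mediator's job: any strictly increasing mapping preserves =, < and > on ciphertexts, so an equi-join on encrypted identifiers needs no decryption. The toy enc function below is a stand-in for a real keyed OPES such as [2] or those surveyed in [10]; it is for illustration only and offers no security.

    import java.util.*;

    class OpesJoinSketch {
        // Stand-in for an order-preserving encryption: strictly increasing,
        // hence order and equality survive encryption. NOT secure; a real
        // deployment would use a keyed OPES.
        static long enc(long ssn) { return 31 * ssn + 7; }

        // Mediator-side equi-join on encrypted identifiers: tuples coming from
        // two services are matched without ever decrypting the join keys.
        static List<String> join(SortedMap<Long, String> svc1Out,
                                 SortedMap<Long, String> svc2Out) {
            List<String> joined = new ArrayList<>();
            for (Map.Entry<Long, String> e : svc1Out.entrySet()) {
                String match = svc2Out.get(e.getKey()); // equality on ciphertexts
                if (match != null) joined.add(e.getValue() + " | " + match);
            }
            return joined;
        }
    }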

To satisfy the privacy requirements relative to services, the mediator applies our k-Protection mechanism by generalizing the encrypted identifier values before using them to invoke the services. If a data service Sj is to be invoked with a value v (originating from a parent service Sp), and the certainty of Sj that v is held by Sp must be less than 1/k, then v is generalized through our protocol (presented in the following) to match at least k tuples in Sj's dataset. The value generalization is carried out by the mediator in collaboration with the service to be invoked.

Privacy requirements relative to end-users and the mediator are conventional requirements and have been extensively studied in the literature. For example, services can use any of the anonymization algorithms that implement k-Anonymity and its variations [34, 26, 24] to locally anonymize their data. Privacy requirements with regard to services are, to the best of our knowledge, new, i.e., information leakage among data providers has not been properly addressed in the literature. Therefore, in the following sections we focus on ensuring k-Protection by presenting a practical protocol for generalizing encrypted identifier values.

The anonymization applied by the individual data services (and its utility) is out of the scope of this paper. It is worth mentioning, however, that the anonymization can be realized by the services either in isolation, by directly applying one of the algorithms in [34, 26, 24], or cooperatively, by extending one of the algorithms in [27, 18] with the k-Protection mechanism presented in the next section.

Fig. 4: The communication rounds between the mediator and the invoked service to compute the generalized value.

3.3 Generalization of Encrypted Identifier Values

The objective of the generalization protocol can be formulated as follows: given an encrypted identifier value e(v) with which a service Sj should be invoked, allow the mediator to compute a generalized value R of e(v) such that R matches at least k possible values held by Sj, where k is the maximum protection factor required by the parents of Sj in the composition. (We assume that data services provide different operations, i.e., functions, allowing their underlying datasets to be queried by precise or generalized values, e.g., intervals.) Note that the mediator cannot decrypt e(v) to generalize it alone; rather, it needs to collaborate with the service to be invoked to carry out the generalization.

We first describe two naive approaches to generalize the identifier values and analyze their limitations. Then, we build on our analysis to define a hybrid approach that addresses the identified limitations.

3.3.1 Domain-based identifier generalization

A naive approach to compute R is to exploit the domain of the identifier attribute (i.e., Dom). The idea is to use Dom as the starting value of R and then gradually increase its precision by removing parts of R until no part can be removed without violating the k-Protection requirement.

For this purpose, the mediator determines, for each data service Sj, the protection factor k that must be respected: k = MAX(kp), where the Sp are the parents of Sj in the composition. Then, for each input tuple t, the mediator determines the minimum range of values R = [a, b] that should be used to invoke Sj instead of e(v). To this end, the mediator queries the selectivity of Sj with respect to a wide range of identifier values (we use ]-∞, +∞[ to denote the range covering all the tuples managed by Sj) along with a value m occurring in the middle of the domain Dom. If the returned selectivity is greater than k, the mediator compares e(v) to m to determine the half of R covering e(v). This step is repeated with the obtained new interval until no interval with a selectivity greater than k can be obtained. Then, Sj is invoked with the obtained interval. After the invocation, the mediator retains only the outputs related to e(v), i.e., the false positives are removed by the mediator, as it knows the original encrypted input value e(v).

Example: Assume that the service S2 is to be invoked with the encrypted value e(v) and k = 2. Figure 4 shows the messages exchanged between S2 and the mediator. For simplicity, the example assumes that |Dom| = 1024. First, the mediator queries the selectivity of S2 relative to a wide range of values (denoted by ]-∞, +∞[). S2 replies that it holds 13 values and returns the value m occurring in the middle of Dom. The mediator compares e(v) with m and determines the new range bounded by m. This step is repeated until the computed range cannot be divided further while respecting the value of k.
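The mediator's side of this naive protocol is essentially a binary search over the domain, sketched below under our own naming conventions (Probe, Service and generalize are illustrative, not the paper's Algorithm 1). Note that each probe ships back the encrypted midpoint of a fixed domain position, and that a shrink step is abandoned when it would leave fewer than k matching values; both behaviours are exactly the leaks analyzed next.

    import java.util.*;

    class DomainGeneralization {
        // What the probed service returns: the selectivity of the probed range
        // and the encrypted value sitting at the middle of the domain interval.
        record Probe(int selectivity, long mid) {}

        interface Service { Probe probe(long lo, long hi); }

        // Mediator: starting from the whole domain [lo, hi], repeatedly keep the
        // half covering the encrypted input ev while it still matches >= k values.
        static long[] generalize(Service s, long ev, long lo, long hi, int k) {
            while (true) {
                Probe p = s.probe(lo, hi);                 // leaks a fixed domain point
                long nLo = (ev <= p.mid()) ? lo : p.mid() + 1;
                long nHi = (ev <= p.mid()) ? p.mid() : hi;
                if (nLo == lo && nHi == hi)
                    return new long[]{lo, hi};             // cannot shrink further
                if (s.probe(nLo, nHi).selectivity() < k)   // this probe itself leaks
                    return new long[]{lo, hi};             // shrinking breaks k-Protection
                lo = nLo; hi = nHi;
            }
        }
    }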

Privacy Analysis: From a privacy perspective the protocol has the following limitations:

  • First, the mediator learns precise information about Dom: it receives the encrypted values of fixed data points in Dom (i.e., the successive midpoints). If the mediator knows the domain Dom, it can map these encrypted values to their original values (since it knows their positions in Dom), or at least establish lower and upper bounds for each encrypted value.

  • Second, it may leak additional knowledge to services by violating the k-Protection requirement. For example, if the selectivity of the probed range in the previous example were 1 (instead of 2), the mediator would fall back to the previous range; however, the service would still know that the mediator is interested in the probed range, as it has tested its selectivity. One might think that checking that the selectivity is large enough (e.g., at least 2k) before splitting a range would avoid this problem. The problem may nevertheless persist, since some values in the new range may not be assigned to any individual: even if the selectivity condition holds, all values in the range except e(v) may be virgin values, in which case the service will infer that the input is e(v).

In addition, the protocol is not optimal. Many of the rounds in Figure 4 (each of which involves a non-negligible communication cost) do not reduce the selectivity: the first two ranges have the same selectivity, as do the fourth through seventh ranges.

Fig. 5: Generalizing the value e(v) before invoking S2.

3.3.2 Dataset-based identifier generalization

A second naive approach to generalizing e(v) is to use the ordered dataset accessed by the service to be invoked, denoted D(Sj). In this approach, the mediator queries the selectivity of Sj with respect to a wide range of identifier values (we use ]-∞, +∞[ to denote the range covering D(Sj)) along with the value m occurring in the middle of the ordered dataset. If the returned selectivity is greater than k, the mediator compares e(v) to m to determine the half covering e(v). This step is repeated with the obtained new interval until no interval with a selectivity greater than k can be obtained, and Sj is then invoked. After the invocation, the mediator retains only the outputs related to e(v), i.e., the false positives are removed by the mediator, as it knows the original encrypted input value.

Example: We continue with our running example to show how the k-Protection requirement is ensured on the edge (S1, S2). Assume the parents of S2 require a protection factor k = 3. The invocation of S1 returns tuples for three encrypted patient identifiers. Instead of invoking S2 directly with one of these values, e(v), the mediator generalizes it as follows (refer to Figure 5). The mediator requests the selectivity of S2 over a range covering all its possibly managed values (i.e., R = ]-∞, +∞[); S2 acknowledges that it has 13 distinct values and that the value m occurs in the middle of these ordered values. The mediator compares e(v) to m and determines the new interval bounded by m. It then requests the selectivity of the new interval along with its new middle value; the returned selectivity is 7. Comparing e(v) to the new middle value yields an interval whose selectivity is 4. The algorithm stops here, since dividing this interval further would make the selectivity drop below k.
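A compact sketch of the dataset-based variant follows (again with illustrative names). The only changes from the domain-based sketch are that the service reports the median of the values it actually stores in the probed range, and that a range is split only while its selectivity is at least 2k, so both halves are guaranteed to keep at least k values; with k = 3 this stops at the selectivity-4 interval, as in the example above.

    import java.util.*;

    class DatasetGeneralization {
        // The service reports the probed range's selectivity and the median of
        // the values it actually stores in that range (not a domain midpoint).
        record Probe(int selectivity, long median) {}

        interface Service { Probe probe(long lo, long hi); }

        static long[] generalize(Service s, long ev, long lo, long hi, int k) {
            Probe p = s.probe(lo, hi);
            while (p.selectivity() >= 2 * k) {  // both halves will keep >= k values
                if (ev <= p.median()) hi = p.median();
                else                  lo = p.median() + 1;
                p = s.probe(lo, hi);
            }
            return new long[]{lo, hi};
        }
    }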

Fig. 6: (a) a dataset partitioned into buckets; (b) data elements ordered by their time-stamps; (c) the computation of candidate ranges inside buckets.

Privacy Analysis: This protocol does not suffer from the limitations discussed above. Specifically, it does not release encrypted values of data points with known positions in Dom, and it does not test the selectivities of ranges that may contain fewer than k values, as it verifies that the selectivity of a range is at least 2k before splitting it into two equal-cardinality ranges. The major limitation of this approach is that the boundaries of the computed range depend on the dataset currently held by Sj (i.e., on D(Sj)), and may change if new tuples are inserted into D(Sj) or existing tuples are deleted, as the following example shows.

Example: Let D(S) = {10, 20, 30, 40}, k = 2 and e(v) = 20, and assume the computed range is R = [10, 20]. If the new tuples 12 and 14 are inserted, i.e., D(S) = {10, 12, 14, 20, 30, 40}, then the protocol computes R' = [20, 40], and the provider of S can infer that e(v) is 20, as [10, 20] ∩ [20, 40] = {20}.

3.3.3 Hybrid protocol for identifier generalization

Based on the limitations discussed above, a good data generalization scheme should satisfy the following criteria. First, it should guarantee that the generalized value (i.e., the computed range R) remains the same every time the composition plan is executed; in other words, the generalization scheme should be deterministic. Moreover, if the dataset held by a service changes (because of insertions or deletions), the newly computed range should be a subset or superset of previously computed ranges for the same value, intersecting them in at least k values. Second, it should not leak additional information that could help the mediator map the encrypted values to their real ones.

Before describing our proposed scheme for generalizing encrypted data while avoiding the discussed limitations, we first discuss two requirements that data services should satisfy to participate in the scheme. First, services must timestamp the data they access. This requirement can be implemented simply by timestamping new data insertions: when joining the data integration system, if the dataset accessed by a service is not yet timestamped, the service timestamps it with the current time, and then timestamps new insertions as they occur. Second, the dataset accessed by a data service Sj, denoted D(Sj), should keep track of the values of identifier attributes: when a tuple inside the dataset is deleted, the dataset keeps the value of its identifier attribute and its timestamp. These two requirements are realistic and easy to implement, as discussed in previous works on data integration such as [16].

We now present our data generalization scheme which satisfies the criteria discussed above by combining the two previously discussed data generalization approaches while avoiding their limitations. Our scheme proceeds along the following steps.

  1. The first step is carried out offline, when data services join the data integration system. Every service partitions the domain Dom into buckets. Services are free to choose the partitioning criteria and the number of buckets. For example, one service may choose to divide Dom into buckets of 50 stored values each, while another may divide it into buckets of equal absolute length.

  2. The mediator executes the dataset-based protocol described above and narrows down the computed range R as long as its selectivity is at least α·k, where α is an integer value ≥ 1 selected by the mediator. As we will see later, α guarantees that there are at least α candidate ranges satisfying the k-Protection requirement within the computed R. We explain the effect of α later in our discussion.

  3. The service determines the bucket (or set of buckets) covering R and divides them into ranges that respect the k-Protection requirement. Data elements in each bucket are organized into candidate ranges as follows: (i) they are ordered by their timestamps, with data elements having the same timestamp sharing the same order (which is the case for the initial set of data elements present when the service joined the system); (ii) an initial set of candidate ranges is formed using the data elements with the highest order; (iii) subsequent data points are inserted one after another into the computed ranges, and when the selectivity of a candidate range reaches 2k it is split into two candidate ranges (which then evolve independently).

  4. The computed ranges that intersect with the initial range R (computed in the second step) are sent to the mediator, which selects the candidate range covering e(v) and uses it to invoke the service.

Illustrative Example: Assume that the ordered dataset accessed by a data service is the one shown in Figure 6 (part a); for simplicity, the figure shows the content of only one bucket. Each tuple in the dataset is timestamped. Assume also that the service is to be invoked with e(v), that k = 2 and α = 5, and that after running the dataset-based protocol (in the second step) the computed range satisfying the condition Se ≥ α·k is R.

In the third step, the mediator asks the service to compute the candidate ranges that cover R. To this end, the service finds the buckets that cover R and computes their candidate ranges. As can be noticed from the dataset in Figure 6 (part a), R is covered by the bucket shown.

Figure 6 (parts b and c) shows how the ranges are computed. First, the service sorts the tuples inside the bucket by their timestamps (part b). Then it considers the tuples one by one and divides the bucket into ranges with at least k values. The four tuples with the earliest timestamp are considered first, resulting in two initial ranges. Then the two tuples with the next timestamp are considered; the selectivity of the range receiving them reaches 4 (i.e., 2k), so it is split into two sub-ranges. Similarly, when all the remaining tuples have been considered, the result is 9 ranges (shown in Figure 6, part c).

The range R computed in the second step intersects with seven of these ranges, r1 through r7. Since r1 intersects R only partially, it is merged with r2; the same applies to r7, which is merged with r6. The five resulting candidate ranges are then sent back to the mediator, which selects the final range covering e(v), with which the service is invoked.
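A sketch of the service-side range computation (Step 3, in the spirit of the paper's Algorithm 2 below) is given next. Tuples of one bucket are replayed in timestamp order; each identifier joins the candidate range it falls into, and a range that accumulates 2k values is split at its median into two ranges that then evolve independently. Because insertion order is fixed by the timestamps, the resulting ranges are deterministic. All data structures and names here are our own illustration, not the paper's pseudo-code.

    import java.util.*;

    class CandidateRanges {
        record Tuple(long id, long ts) {}

        static List<TreeSet<Long>> ranges(List<Tuple> bucket, int k) {
            // Replay insertions deterministically: by timestamp, ties by id.
            // (Assumes a mutable list.)
            bucket.sort(Comparator.comparingLong(Tuple::ts)
                                  .thenComparingLong(Tuple::id));
            List<TreeSet<Long>> out = new ArrayList<>();
            out.add(new TreeSet<>());                   // a single initial range
            for (Tuple t : bucket) {
                TreeSet<Long> r = rangeOf(out, t.id());
                r.add(t.id());
                if (r.size() >= 2 * k) {                // split at the median id
                    TreeSet<Long> upper = new TreeSet<>();
                    int half = r.size() / 2;
                    while (upper.size() < half) upper.add(r.pollLast());
                    out.add(upper);
                }
            }
            return out;                                 // deterministic candidate ranges
        }

        // The candidate range whose value span the id falls into: the non-empty
        // range with the greatest minimum <= id, else the leftmost range.
        static TreeSet<Long> rangeOf(List<TreeSet<Long>> ranges, long id) {
            TreeSet<Long> best = null, leftmost = null;
            for (TreeSet<Long> r : ranges) {
                if (r.isEmpty()) continue;
                if (leftmost == null || r.first() < leftmost.first()) leftmost = r;
                if (r.first() <= id && (best == null || r.first() > best.first())) best = r;
            }
            if (best != null) return best;
            return (leftmost != null) ? leftmost : ranges.get(0);
        }
    }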

Our proposed protocol is implemented by two algorithms: one run by the mediator to compute the initial range R and carry out the service invocation (Algorithm 1), and one run by the services to compute the candidate ranges (Algorithm 2); both are self-descriptive.

Note that the ranges computed inside each bucket are deterministic, i.e., the protocol always yields the same ranges whenever the composition plan is executed, as they are computed using the data insertion order in the dataset (which is deterministic). The addition of new data elements to the dataset does not violate the k-Protection requirement, as it only results in splitting a range into a set of ranges that intersect with it in at least k elements. For example, if the plan is executed at an instant where a range R1 is computed for e(v), and again at a later instant where the computed range is R2, then R1 and R2 intersect in at least k values, so the k-Protection requirement still holds.

3.3.4 The effect of α on the privacy guarantee

We intuitively explain the effect of α through an example. Consider the dataset {v1, v2, v3, v4, v5, v6, v7}, and assume that e(v) = v4 and k = 2.

When α = 1, the range computed in the second step of our protocol can take two different values (because of data changes), each with probability 1/2. When the same query is replayed, the probability of one of these two ranges occurring given that the other has already occurred is 1/2 × 1/2 = 1/4.

When α = 2, the range computed in the second step can take four different values, each with probability 1/4. When the same query is replayed, the probability of one specific range occurring given that an intersecting range has already occurred is 1/4 × 1/4 = 1/16. Similarly, when α = 5, the probability of such a privacy breach becomes 1/100, and when α = 10 this probability becomes 1/400. In conclusion, practical values of α (e.g., ≥ 5) cut the probability of a privacy breach down to an accepted threshold. The value of α can be computed as α = 1/(k·√Pt), where Pt is the threshold of accepted probability.
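Restated compactly, the pattern behind these numbers is the following (our reconstruction from the examples above, with P_t denoting the accepted breach threshold):

    % The initial range R covers at least \alpha k values, so across dataset
    % changes it can take \alpha k equally likely variants; a replay breach
    % requires two specific variants to occur.
    P_{\mathrm{breach}} \;=\; \frac{1}{\alpha k}\cdot\frac{1}{\alpha k}
                        \;=\; \frac{1}{(\alpha k)^{2}},
    \qquad\Longrightarrow\qquad
    \alpha \;=\; \frac{1}{k\sqrt{P_t}}.

For α = 5 and k = 2, this gives P_breach = 1/100, matching the figures above.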

4 Evaluation

Fig. 7: The Evaluation Architecture

In this section, we present an evaluation study of our proposed approach, report on its performance and privacy-preserving strength, and discuss some of the key applications where it can be used.

4.1 Evaluation Setup

To evaluate our approach, we used a real dataset provided to us by the European project PAIRSE [9]. The dataset was created by merging data from seven real databases of three French hospitals (specialized in psychiatry and cardiology). It consists of one big table T with roughly 1,150,000 records (for approximately 850,000 patients), whose schema includes the patient identifier ssn along with demographic attributes (age, sex, city) and the treated disease. The values in the provided table were replaced by synthetic numeric values by the original healthcare facilities. The same patient may have several rows in the table, corresponding to the different diseases for which he or she has been treated.

We constructed three tables out of T: T1, T2 and T3. T1 contained only heart patients (510,000 patients in total). T2 contained the patients who have been treated for a mental illness (403,000 patients in total). For each of the tables T1, T2 and T3, we constructed eight datasets of various sizes by randomly selecting patients from the tables; the constructed datasets have the sizes 50K, 100K, 150K, 200K, 250K, 300K, 350K and 400K. For each dataset, we developed a data service to access it. These 24 services have three signatures, S1: city → ssn (heart patients), S2: ssn → mental illness, and S3: ssn → (age, sex), and were deployed on independent servers (with independent resources).

In our experiments we used the following composition: S1 → S2 → S3. S1 is invoked with a given city name; then, for each obtained patient, S2 is invoked to verify whether the patient has been treated for some mental illness; S3 is then invoked with only those patients who have been treated for a mental illness, to retrieve their age and sex.

We implemented Algorithm 1 in Java and integrated it into the data integration system of [9]. Figure 7 shows the modified system, which consists of two main modules: the service composition engine and the composition execution engine. The first module rewrites end-users' queries into compositions of services; the second executes the compositions. As the figure shows, we modified the composition execution engine to accommodate Algorithm 1. All services are deployed on GlassFish 3.0 servers and the datasets are stored in MySQL servers. Algorithm 2 is implemented in Java on the server side and is accessed as a simple operation of each developed data service (i.e., each service provides an operation for querying its underlying dataset and a set of operations for participating in the proposed protocol, e.g., for computing the candidate ranges, computing the selectivity of a range, etc.).

We conducted our experiments on machines with a 3.2 GHz Intel processor and 8 GB of RAM, running Windows 7.

Fig. 8: (a) The performance of the proposed approach as the size of the datasets and the value of K increase; (b) The effect of the bucket size on performance; (c) The performance before and after applying the optimizations.

4.2 Performance Evaluation

Assuming n is the number of services in a composition, m is the average number of times a service is invoked in a composition, and d is the average size of the datasets accessed by the services, the complexity of evaluating a query plan is of the order O(n · m · log d), since each invocation requires a number of generalization rounds that is logarithmic in the dataset size.

We conducted a set of experiments to evaluate the performance of our privacy-preserving query evaluation approach using real databases. Specifically, we evaluated the effects of (i) the protection factor K = k·α and (ii) the size of the accessed datasets on the composition execution time.

We measured the composition execution time of our eight compositions; the compositions are identical, except that the services in each composition access datasets of one of the aforementioned sizes. We set the bucket size to 10,000 records for all services. We chose three high values for the protection factor K: K = 25 (k = 5, α = 5), K = 50 (k = 10, α = 5) and K = 100 (k = 20, α = 5). The corresponding probabilities of a privacy breach are 1/25² = 0.0016, 1/50² = 0.0004 and 1/100² = 0.0001.

The composition execution time was computed as follows. T1 contained patients from 47 cities (with an average of 10,850 patients per city). Each composition was therefore executed 100 times with each of these 47 cities as input, and we computed the average execution time for each composition (i.e., for each of the considered dataset sizes). Figure 8 (a) shows the obtained results, together with the time required to execute the compositions without enforcing the k-Protection requirement.

The results show that, for all the considered dataset sizes, the overhead involved in enforcing the k-Protection requirement does not exceed two orders of magnitude of the time required to execute the composition without protection. We view this as reasonable compared to the cost incurred by private information retrieval and secure multi-party computation protocols, which exceed the cost of the original unprotected query by hundreds of times. The reader is referred to [22] for a discussion of the practicality of these approaches in real-life applications.

The results also show that increasing the protection factor K = k·α introduces only a minimal additional overhead. This means that services can choose large values for k (thus providing better privacy protection for their outputs) and the mediator can choose large values for α (thus a very small probability of a privacy breach) without degrading the overall performance.

Figure 8 (b) shows the effect of the bucket size on the overall query evaluation time. Naturally, reducing the bucket size also reduces the time required to compute the candidate ranges, and thus improves the overall performance. Note that services are not required to use a large bucket size; it can be chosen based on the accepted probability of a privacy breach. For example, with a bucket size of 2,000, the probability of a privacy breach would be of the order 1/2000² = 0.00000025, which is a practical value. Additionally, the cost incurred in computing the candidate ranges inside a bucket can be offset altogether by computing these ranges offline, as we show in the next subsection.

4.3 Discussion

As mentioned earlier, the value of k is selected by each individual service to limit the inference capability of the other services in the composition/query about the data it holds. The parameter α is an integer value (≥ 1) selected by the mediator to reduce the probability of a privacy breach (due to replay attacks) to a specific threshold. The higher the factor K = k·α, the better the privacy protection; however, as the results show, the composition execution time also increases with that factor.

The values of k and α can be tuned to strike a balance between privacy protection and performance in the considered application domain. For example, the experiments show that if k = 10 and α = 5 (K = 50), the probability of inferring that a service holds a specific tuple is 1/K² = 1/2500 = 0.0004 (i.e., the same query needs to be replayed at least 2,500 times for that privacy breach to happen), and the composition execution time remains less than twice the time needed without any protection. In practice, the values of k and α are selected such that K² is always higher than the number of times the same query can be executed by the same end-user (the system can limit the number of times a given query can be executed by the same end-user within a specific time window).

4.4 Performance Optimization

The performance of our approach can be improved further by considering the following optimizations:

Optimization 1: Reuse of pre-computed selectivities and ranges. At composition execution time, the same service is likely to be invoked multiple times with different input values. The ranges and selectivities computed in previous invocations (within the same composition execution) can be reused, even partially, instead of being recomputed each time the same service is invoked. Table I shows, for example, the numbers of patients returned by S1 when invoked with the cities "Lyon 2", "Lyon 5" and "Lyon 8", as well as the average number of selectivities computed when S2 is invoked. Without this optimization, the number of computed selectivities would be log2(dataset size) ≈ 18.6. The table also shows the numbers of reused candidate ranges.

City    | # Patients (returned by S1) | Avg. # computed selectivities | # Reused ranges
Lyon 2  | 13,738                      | 7.3                           | 257
Lyon 5  | 11,923                      | 8.2                           | 306
Lyon 8  | 9,736                       | 8.7                           | 119
TABLE I: Experimental Results
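A sketch of what this memoization can look like on the mediator side is shown below; the cache keys, types and method names are ours, purely illustrative.

    import java.util.*;
    import java.util.function.IntSupplier;

    class RangeCache {
        // Selectivities already obtained from a service, keyed by (service, range).
        private final Map<String, Integer> selectivities = new HashMap<>();
        // k-Protection ranges already computed per service.
        private final Map<String, List<long[]>> ranges = new HashMap<>();

        // Probe the service only if this exact range was never probed before.
        int selectivity(String service, long lo, long hi, IntSupplier probe) {
            String key = service + "[" + lo + "," + hi + "]";
            return selectivities.computeIfAbsent(key, x -> probe.getAsInt());
        }

        // Reuse a previously computed range that already covers the encrypted value.
        Optional<long[]> reusableRange(String service, long ev) {
            for (long[] r : ranges.getOrDefault(service, List.of()))
                if (r[0] <= ev && ev <= r[1]) return Optional.of(r);
            return Optional.empty();
        }

        void remember(String service, long[] range) {
            ranges.computeIfAbsent(service, s -> new ArrayList<>()).add(range);
        }
    }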

Optimization 2: Use of user preferences. In real-life scenarios, many data subjects (e.g., patients, citizens, etc.) may accept to release some of their private data to some recipients for some legitimate data uses (e.g., medical research, law enforcement, etc.) without any protection. For example, the dataset T was provided to us along with a patient preferences table specifying some of the entities to which each patient accepts to disclose his or her data; nearly 19% of the patients accepted to share their medical data (including their identifiers) with medical institutions for improving the healthcare system. Our second optimization exploits the preferences of data subjects to lift the privacy protection when a data subject consents to the disclosure. For instance, in our running query, the k-Protection requirement can be lifted when S2 and S3 are invoked with input values originating from patients who accepted to disclose their data to medical institutions (such as those providing S2 and S3).

Note that one might think that invoking a service with a precise input value v1 (of an individual who accepted to release her identifier) could invalidate the k-Protection of another input value v2 (of another individual who did not), if v1 happens to be one of the possible values matching the generalized value computed for v2 (as the service could then eliminate one of the possible values matching that range). However, this will not happen in the first place: the output tuples corresponding to v1 are retrieved when the service is invoked with the generalized value, so there is no need to invoke it again with the precise value v1. That is, lifting the k-Protection requirement for the patients who agreed to release their identifiers has no impact on the privacy protection provided to those who did not, as long as the services are invoked with the input values of patients who did not agree to release their data before those of the patients who did.

Optimization 3: Offline computation of k-Protection ranges. Services incur a significant computation cost when dividing the buckets relevant to an invocation into ranges respecting the k-Protection requirement (i.e., Step 3 of the proposed hybrid protocol). However, this step can be carried out offline in an incremental fashion, as new tuples are inserted.

We conducted a set of experiments to evaluate the performance improvement resulting from the optimizations presented above. Figure 8 (c) shows the results obtained when (i) none, (ii) only one, and (iii) all of the optimizations are applied. The results show that applying these optimizations substantially improves performance: the query execution time drops below two orders of magnitude of the query execution time when no protection is applied, regardless of the considered dataset size.

4.5 Privacy Analysis

Our protocol is immune to replay attacks. Even if the same composition is executed several times with the same inputs, data service providers cannot increase their confidence sufficiently to infer with certainty the identity of a data subject with which they are queried. For example, as explained when we analyzed the effect of α on the privacy guarantee, if k is set to 20 and α to 5, the probability of a privacy breach is 1/(20·5)² = 1/10,000; and when k = 200 and α = 5, that probability becomes 1/1,000,000. That is, the same composition would need to be executed a million times in order to obtain two candidate ranges whose intersection is the identifier of a targeted data subject.

Some would argue that the order-preserving encryption scheme (OPES) of Agrawal et al. [2] is insecure and can be broken by the mediator. However, for us that scheme is only a means, not an end. We can use any of the recent OPESs reviewed in [10], or even a partial implementation of homomorphic encryption [19]. Our protocol only requires the mediator to be able to carry out equality and order comparisons (i.e., =, <, >), which can be realized with any of the aforementioned schemes.

One limitation of the approach is that, when the domain of the identifiers is small and the mediator knows all of the values in that domain, the mediator can, after a certain number of queries (i.e., compositions), build a mapping table between the encrypted values and the real ones. However, this can be overcome by encrypting the identifiers with a different key every time a query is executed. Moreover, the proposed generalization protocol itself can be applied to both real and encrypted values. In some application domains the mediator may be trusted; service providers then do not need to encrypt the identifiers released to the mediator, whereas the mediator would still need to generalize the identifiers (with our protocol) when it queries individual services.
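A minimal sketch of the per-query re-keying idea follows; the key-derivation scheme is our illustrative assumption, not a prescribed part of the protocol.

```python
# Sketch: derive a fresh encryption key per query so the mediator cannot
# accumulate a stable mapping between ciphertexts and identifiers.
# Assumption: services share a long-term secret and agree on a query id.

import hashlib
import hmac

def per_query_key(shared_secret: bytes, query_id: str) -> bytes:
    # Deterministic per-query key: every service derives the same key for
    # the same query, so equi-joins at the mediator still work within a
    # query, but ciphertexts are unlinkable across different queries.
    return hmac.new(shared_secret, query_id.encode(), hashlib.sha256).digest()

key_q1 = per_query_key(b"long-term-secret", "query-001")
key_q2 = per_query_key(b"long-term-secret", "query-002")
assert key_q1 != key_q2   # different queries -> different keys -> no stable mapping
```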

5 Related Work

Our solution for privacy preservation relates closely to research in the areas of mediator-based data integration, privacy-preserving data publishing, and secure multi-party computation. In this section, we review some of the most prominent works in those areas and compare them to our approach.

Several research works have addressed the privacy issue in the context of data service composition. For example, Yau et al. [36] proposed a repository that answers end-user queries by integrating data across autonomous data sharing services. Privacy in that work is addressed by (i) allowing the repository to collect only the data necessary to compute the result of a query (instead of retrieving the whole datasets behind the services), and by (ii) hashing the identifiers (used to link data subjects) exchanged between each pair of interconnected services in the composition graph. Unfortunately, this solution does not resolve the privacy breaches addressed in our work. In fact, the repository still has access to intermediate and integrated data, as the hashing is carried out by the repository itself. In addition, services can learn information about the data held by each other, as they are invoked with precise data values. Benslimane et al. [9] proposed a privacy-preserving access control model to preserve data privacy against data consumers: queries are rewritten by a mediator to include the applicable privacy constraints before they are resolved by service composition. Ammar and Bertino [3] proposed to take the context of data consumers into account when privacy policies are applied to their queries before the queries are resolved by service composition. Unfortunately, these solutions, while providing good privacy protection against data consumers, do not provide any protection against untrusted mediators and data services.

Our data generalization protocol for service invocation with imprecise values (e.g., ranges) is reminiscent of the works on the Private Information Retrieval (PIR) problem [35, 32]. The objective of these works is to execute private queries on a remote server without letting the server learn anything about the executed queries or their results. While current PIR protocols can provide strong privacy guarantees against untrusted services, they involve prohibitive computation costs (mostly due to their cryptographic nature) that make them impractical for real applications [32]. In contrast, our model for service invocation takes a practical stance on the performance/information-leakage trade-off; i.e., services are allowed to learn a controlled amount of knowledge in return for a substantial performance improvement.

Several research works have addressed the problem of distributed data integration [18, 21]. Fung et al. [18] proposed a data mashup system that anonymizes data from several data sources to provide data consumers with datasets satisfying the k-anonymity property. While that work was geared towards data consumers, our focus is on data providers and the integration system itself (i.e., the mashup server). We assumed in our work that data providers can freely apply their privacy policies (e.g., data anonymization techniques) on their sides; however, our solution can be extended with that of [18] to carry out the anonymization on the integration system side. Jurczyk and Xiong [21] proposed an algorithm to securely integrate horizontally partitioned data from multiple data sources. However, that work does not address vertically partitioned data, which is the setting closer to ours.

6 Conclusion and Future Work

In this paper, we proposed a privacy-preserving approach to evaluate data integration queries over autonomous data services. Our approach allows the services involved in a query to apply their privacy policies locally. The data integration system (i.e., the mediator) is given only encrypted information, enough to link data subjects across the different services; the services, in turn, cannot learn information about the data held by each other. We evaluated our approach in the healthcare application domain, and the results showed that our solution can be applied to cost-effectively integrate voluminous datasets. We intend to extend our approach to further improve the performance. An interesting direction to explore is the possibility of invoking services with chunks of input data tuples [33]; the data value generalization algorithm would then need to be extended to generalize chunks instead of single data items.

References

  • [1] D. Abadi, R. Agrawal, A. Ailamaki, M. J. Carey, S. Chaudhuri, J. Dean, A. Doan, J. Gehrke, L. M. Haas, A. Y. Halevy, H. V. Jagadish, D. Kossmann, S. Madden, S. Mehrotra, T. Milo, V. Markl, C. Olston, B. C. Ooi, C. Ré, D. Suciu, M. Stonebraker, T. Walter, and J. Widom (2016) The Beckman report on database research. Commun. ACM 59 (2), pp. 92–99.
  • [2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu (2004) Order-preserving encryption for numeric data. In SIGMOD, pp. 63–74.
  • [3] N. Ammar and E. Bertino (2014) Dynamic privacy policy management in services-based interactions. In DEXA 2014, pp. 248–262.
  • [4] M. Barhamgi, A. K. Bandara, Y. Yu, K. Belhajjame, and B. Nuseibeh (2016) Protecting privacy in the cloud: current practices, future directions. IEEE Computer 49 (2), pp. 68–72.
  • [5] M. Barhamgi, D. Benslimane, Y. Amghar, N. Cuppens-Boulahia, and F. Cuppens (2013) PrivComp: a privacy-aware data service composition system. In Joint 2013 EDBT/ICDT Conferences, EDBT '13 Proceedings, Genoa, Italy, March 18-22, 2013, pp. 757–760.
  • [6] M. Barhamgi, D. Benslimane, and B. Medjahed (2010) A query rewriting approach for web service composition. IEEE Transactions on Services Computing 3 (3), pp. 206–222.
  • [7] M. Barhamgi, D. Benslimane, S. Oulmakhzoune, N. Cuppens-Boulahia, and F. Cuppens (2013) Secure and privacy-preserving execution model for data services. In CAiSE, pp. 35–50.
  • [8] M. Barhamgi, C. Perera, C. Ghedira, and D. Benslimane (2018) User-centric privacy engineering for the internet of things. IEEE Cloud Computing 5 (5), pp. 47–57.
  • [9] D. Benslimane, M. Barhamgi, F. Cuppens, F. Morvan, B. Defude, E. Nageba, F. Paulus, S. Morucci, M. Mrissa, N. Cuppens-Boulahia, C. Ghedira, R. Mokadem, S. Oulmakhzoune, and J. Fayn (2013) PAIRSE: a privacy-preserving service-oriented data integration system. SIGMOD Record 42 (3), pp. 42–47.
  • [10] A. Boldyreva, N. Chenette, and A. O'Neill (2012) Order-preserving encryption revisited: improved security analysis and alternative solutions. IACR Cryptology ePrint Archive 2012, pp. 625.
  • [11] M. J. Carey, N. Onose, and M. Petropoulos (2012) Data services. Commun. ACM 55 (6), pp. 86–97.
  • [12] C. D'Orazio, K. R. Choo, and L. T. Yang (2017) Data exfiltration from internet of things devices: iOS devices as case studies. IEEE Internet of Things Journal 4 (2), pp. 524–535.
  • [13] A. Dogac (2012) Interoperability in eHealth systems. PVLDB 5 (12), pp. 2026–2027.
  • [14] M. Dong, H. Li, K. Ota, L. T. Yang, and H. Zhu (2014) Multicloud-based evacuation services for emergency management. IEEE Cloud Computing 1 (4), pp. 50–59.
  • [15] S. Dustdar, R. Pichler, and H. L. Truong (2012) Quality-aware service-oriented data integration: requirements, state of the art and open challenges. SIGMOD Record 41 (1), pp. 11–19.
  • [16] H. Elmeleegy, M. Ouzzani, A. K. Elmagarmid, and A. M. Abusalah (2010) Preserving privacy and fairness in peer-to-peer data integration. In ACM SIGMOD 2010, pp. 759–770.
  • [17] F. Emekçi, D. Agrawal, and A. E. Abbadi (2006) Privacy preserving query processing using third parties. In ICDE, pp. 27.
  • [18] B. C. M. Fung, T. Trojer, and L. Xiong (2012) Service-oriented architecture for high-dimensional private data mashup. IEEE Transactions on Services Computing 5 (3), pp. 73–86.
  • [19] C. Gentry, A. Sahai, and B. Waters (2013) Homomorphic encryption from learning with errors: conceptually-simpler, asymptotically-faster, attribute-based. In CRYPTO, pp. 75–92.
  • [20] D. Georgakopoulos, P. P. Jayaraman, M. Fazia, M. Villari, and R. Ranjan (2016) Internet of things and edge cloud computing roadmap for manufacturing. IEEE Cloud Computing 3 (4), pp. 66–73.
  • [21] P. Jurczyk and L. Xiong (2009) Distributed anonymization: achieving privacy for data subjects. In DBSec, pp. 11–22.
  • [22] L. Kamm (2015) Privacy-preserving statistical analysis using secure multi-party computation. Ph.D. Thesis.
  • [23] A. Khoshkbarforoushha, A. Khosravian, and R. Ranjan (2017) Elasticity management of streaming data analytics flows on clouds. Journal of Computer and System Sciences 89, pp. 24–40.
  • [24] N. Li, T. Li, and S. Venkatasubramanian (2007) t-closeness: privacy beyond k-anonymity and l-diversity. In ICDE, pp. 106–115.
  • [25] M. Liu, X. Zhang, C. Yang, S. Pang, D. Puthal, and K. Ren (2017) Privacy-preserving detection of statically mutually exclusive roles constraints violation in interoperable role-based access control. In 2017 IEEE Trustcom/BigDataSE/ICESS, Sydney, Australia, August 1-4, 2017, pp. 502–509.
  • [26] A. Machanavajjhala, D. Kifer, and J. Gehrke (2007) l-diversity: privacy beyond k-anonymity. TKDD 1 (1).
  • [27] N. Mohammed, X. Jiang, R. Chen, B. C. M. Fung, and L. Ohno-Machado (2013) Privacy-preserving heterogeneous health data sharing. JAMIA 20 (3), pp. 462–469.
  • [28] K. Nayak, X. S. Wang, S. Ioannidis, U. Weinsberg, N. Taft, and E. Shi (2015) GraphSC: parallel secure computation made easy. In 2015 IEEE Symposium on Security and Privacy, SP 2015, pp. 377–394.
  • [29] C. Perera, M. Barhamgi, S. De, T. Baarslag, M. Vecchio, and K. R. Choo (2018) Designing the sensing as a service ecosystem for the internet of things. IEEE Internet of Things Magazine 1 (2), pp. 18–23.
  • [30] M. Prabhakaran and A. Sahai (Eds.) (2013) Secure multi-party computation. Cryptology and Security Series, Vol. 10, IOS Press. ISBN 978-1-61499-168-7.
  • [31] R. Ranjan, O. F. Rana, S. Nepal, M. Yousif, P. James, Z. Wen, S. L. Barr, P. Watson, P. P. Jayaraman, D. Georgakopoulos, M. Villari, M. Fazio, S. K. Garg, R. Buyya, L. Wang, A. Y. Zomaya, and S. Dustdar (2018) The next grand challenges: integrating the internet of things and data science. IEEE Cloud Computing 5 (3), pp. 12–26.
  • [32] R. Sion and B. Carbunar (2007) On the practicality of private information retrieval. In NDSS.
  • [33] U. Srivastava, K. Munagala, J. Widom, and R. Motwani (2006) Query optimization over web services. In VLDB, pp. 355–366.
  • [34] L. Sweeney (2002) k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (05), pp. 557–570.
  • [35] S. Wang, D. Agrawal, and A. El Abbadi (2014) Towards practical private processing of database queries over public data. Distributed and Parallel Databases 32 (1), pp. 65–89.
  • [36] S. S. Yau and Y. Yin (2008) A privacy preserving repository for data integration across data sharing services. IEEE Transactions on Services Computing 1 (3), pp. 130–140.
  • [37] C. Zhu, H. Wang, X. Liu, L. Shu, L. T. Yang, and V. C. M. Leung (2016) A novel sensory data processing framework to integrate sensor networks with mobile cloud. IEEE Systems Journal 10 (3), pp. 1125–1136.