Privacy-Preserving Identification of Target Patients from Outsourced Patient Data

With the increasing affordability and availability of patient data, hospitals tend to outsource their data to cloud service providers (CSPs) for storage and analytics. However, data privacy concerns significantly limit the data owners' choices. In this work, we propose the first solution, to the best of our knowledge, that allows a CSP to efficiently identify target patients (e.g., as a pre-processing step for a genome-wide association study, GWAS) over multi-tenant encrypted phenotype data (owned by multiple hospitals or data owners). We first propose an encryption mechanism for phenotype data, where each data owner encrypts its data with a unique secret key. Moreover, the ciphertext supports privacy-preserving search and, consequently, enables the selection of the target group of patients (e.g., case and control groups). In addition, we provide a per-query authorization mechanism for a client to access and operate on the data stored at the CSP. Based on the identified patients, the proposed scheme can either (i) directly conduct GWAS (i.e., computation of statistics about genomic variants) at the CSP or (ii) provide the identified groups to the client, who can then directly query the corresponding data owners and conduct GWAS using existing distributed solutions. We implement the proposed scheme and run experiments over a real-life genomic dataset to show its effectiveness. The results show that the proposed solution can efficiently identify the case/control groups in a privacy-preserving way.


1. Introduction

With the advent of precision medicine, healthcare providers are empowered with the ability to select treatments based on a genetic understanding of the patient’s disease. For instance, for cancer patients, a potential treatment may be a combination of surgery, chemotherapy, radiation, and immunotherapy, depending on the type of cancer and its stage. Precision medicine can help decide on specific personalized treatment plans, since certain drugs prove more effective for specific genetic profiles.

However, to pave the way to such precision medicine, research over large volumes of patient data is required. This leads to the popularity of large-scale patient data sharing, such as Patient-Centered Clinical Research Network (PCORNet) (Selby et al., 2012) in the US, TranSMART (Athey et al., 2013) in the EU, and the Global Alliance for Genomics and Health (GA4GH) (for Genomics and Health, 2016). These data-sharing systems are required to comply with the increasingly stringent privacy regulations (e.g., HIPAA (of Health and Services, 2021) or GDPR (Parlament, 2021)).

Both centralized and decentralized approaches have been explored in this context. In the centralized schemes (Kim and Lauter, 2015; Sadat et al., 2017), all data owners encrypt their data and outsource them to a common repository, where collective computation becomes possible. In the decentralized schemes (Froelicher et al., 2021; Raisaro et al., 2018), all data owners store their data locally and run a distributed protocol to conduct the computation over all databases. Although both approaches make it possible to compute over combined data, they assume that the client (who sends a query for the computation) already knows which database owners to query for the analysis. However, in real life, there is a pre-computation step to identify which databases are used in the query processing. For instance, if a physician wants to identify patients similar to a given one based on the symptoms and then conduct some statistical tests on the identified patients, they should first contact all database owners to identify the useful databases. Similarly, to conduct a genome-wide association study (GWAS), a client first needs to identify which database owners have genomes belonging to the desired case and control group specifications.

In this work, we aim to fill this gap. We propose efficient, privacy-preserving algorithms that conduct this pre-computation (i.e., identification of target patients) in a centralized setting, at a CSP, over data from multiple database owners while also providing inherent access control. After the pre-computation step, we support either (i) computation of statistics over the identified records at the CSP (e.g., for simple statistical operations, such as calculation of minor allele frequencies and chi-square values) or (ii) utilization of other distributed solutions (e.g., (Froelicher et al., 2021; Raisaro et al., 2018)) between the data owners of the identified records (for more complex operations).

In the following, we discuss the main challenges of conducting large-scale privacy-preserving identification of target patients over multi-tenant patient data at the CSP.

Encryption of data using unique cryptographic keys. To protect data from a potentially curious CSP, each data owner (e.g., hospital) should encrypt its dataset before outsourcing. Moreover, each hospital should use its unique secret key for encryption. By doing so, hospitals can maintain data privacy even if some of the hospitals are corrupted. On the other hand, when each hospital encrypts its dataset with its own unique key, computation over a combined dataset becomes challenging.

Efficient and privacy-preserving search over encrypted data across hospitals. All the patient data should be encrypted against a potentially curious CSP. However, the ciphertext should support a secure search operation to facilitate the selection of target patients (e.g., case and control groups for the association study) in a privacy-preserving and efficient way. Moreover, to avoid issuing a separate query for each hospital’s dataset, searchability of the encrypted data should be supported across hospitals.

Fine-grained access control of data. Since the data is stored at the CSP, the data owner loses the direct control of the data. In order to guarantee that the data is accessed properly, the data owner needs to enforce an access policy over the outsourced data.

In this work, we propose a privacy-preserving framework for the identification of target patients over outsourced patient data. To the best of our knowledge, this is the first framework that tackles all the aforementioned challenges. In the proposed scheme, each hospital encrypts its own dataset with its unique key and outsources both the storage and the processing of the search operation over encrypted data. We mainly focus on identifying the target patients based on their encrypted phenotype data (which is typically the case in GWAS). For privacy of phenotype data, we encrypt them with a novel pairing-based encryption algorithm. The proposed encryption algorithm also provides the ability to efficiently search over encrypted phenotype data from several hospitals. This enables efficient identification of the target groups of patients (e.g., case and control groups) across hospitals. Furthermore, to make the phenotype data properly accessible to legitimate/authorized clients, we introduce a fine-grained authorization scheme. For a single authorization request, each phenotype is considered as a unit of authorization. We implement and evaluate the proposed scheme using a real-life genomic dataset. Our results show that the proposed scheme requires less than 2 seconds to identify the case and control groups (each of size 50) over a dataset with 1052 patients.

The rest of the paper is organized as follows. In the next section, we discuss the state of the art. In Section 3, we introduce the primitives we used in this paper. In Section 4, we present the models, including system model, data model, query model, and security model. Then, in Section 5 we describe the proposed scheme in detail. In Section 6, we provide the privacy analysis of the proposed solution. In Section 7, we evaluate the performance. Finally, we conclude the paper in Section 9.

2. Related Work

With the increasing volume of patient data, there are two lines of research on processing the data. The first approach is the centralized solution, which requires a large amount of data to be stored in a single repository. To meet the storage and computation requirements, the CSP is the most popular central repository. However, concerns about data privacy in cloud storage arise due to malicious attacks from inside and outside of the CSPs (Goud, 2020). To resolve the concern, Kim and Lauter (Kim and Lauter, 2015) applied the BGV scheme (Gentry et al., 2012) and YASHE scheme (Bos et al., 2013) to encrypt the patient data and conduct secure evaluation of distribution over the encrypted data. Sadat et al. (Sadat et al., 2017) proposed a hybrid system called SAFETY, which combines the Paillier encryption scheme and Intel Software Guard Extensions (Intel SGX) to improve the efficiency of the chi-square test. The second approach is the decentralized solution, where each participant stores its data locally. To enable interoperation over the patient data of multiple data providers, Kamm et al. (Kamm et al., 2013) proposed to secret-share the sensitive data among several parties and compute GWAS over the distributed data. Similarly, Bogdanov et al. (Bogdanov et al., 2014) adopted secret-sharing based techniques to implement a privacy-preserving framework for statistical analysis on federated datasets. Raisaro et al. (Raisaro et al., 2018) combined homomorphic encryption and obfuscation techniques to achieve privacy-preserving medical data sharing among many clinical sites. Froelicher et al. (Froelicher et al., 2021) proposed a multiparty homomorphic encryption algorithm. Based on the algorithm, each party can store its data locally and run analysis algorithms over all the participants’ data without privacy violation.

In our previous work (Zhu et al., 2019), we proposed a scheme to find similar patients based on genomic makeup. That scheme assumes that all the data owners share a secret key, which they use to build an index over their data to support similar patient search. In this paper, we attempt to remove not only the assumption, but also the index.

Contribution. Compared with the existing work, our main contributions are as follows:

  1. Our scheme allows multiple hospitals to outsource their data to the CSP, each encrypting its data with its own key, while computation can still be properly conducted over all the data.

  2. Previous work assumes that the case and control groups for the association study are already known. This cannot meet the requirement of dynamic selection of case/control groups. In our paper, we propose a scheme that supports dynamic selection of the case and control groups and can be easily integrated into the existing solutions for association studies.

  3. The authorization mechanism in the proposed scheme is designed based on per query request and it supports fine-grained authorization.

3. Background

In this section, we first briefly introduce the background of genomics. Then, we introduce symmetric bilinear groups applied in the proposed pairing-based encryption algorithm.

3.1. Genomics Background

The most common mutation in the human population is called a single nucleotide polymorphism (SNP). It is the variation in a single nucleotide at a particular position of the genome (Risch, 2000). There are about 5 million SNPs observed per individual, and sensitive information about individuals (such as disease predispositions) is typically inferred by analyzing the SNPs. Two kinds of nucleotides (or alleles) are observed for each SNP: (i) the major allele is the one that is observed with a high frequency and (ii) the minor allele is the one that is observed with a low frequency. The frequency of the minor allele in a given population is denoted as the minor allele frequency (MAF). Each SNP includes two nucleotides, one inherited from the father and the other one from the mother. For simplicity, we represent the value of a SNP as the number of its minor alleles, and hence the value is in $\{0, 1, 2\}$. A SNP is represented by an (ID, value) pair, where the ID is taken from a large standardized set of strings and the value is in $\{0, 1, 2\}$. In the following sections, if we mention a SNP (or SNPs) without mentioning the ID or value, we mean both parts.
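As an illustration of this encoding, the following minimal sketch computes the MAF of one SNP from the per-individual values. The function name and the sample population are assumptions made for this example and do not come from the proposed scheme.

```python
from typing import Sequence


def minor_allele_frequency(snp_values: Sequence[int]) -> float:
    """Return the MAF of one SNP across a population.

    Each entry is the number of minor alleles (0, 1, or 2) carried by one
    individual, so every individual contributes two alleles in total.
    """
    if not snp_values:
        raise ValueError("at least one individual is required")
    if any(v not in (0, 1, 2) for v in snp_values):
        raise ValueError("SNP values must be 0, 1, or 2")
    return sum(snp_values) / (2 * len(snp_values))


# Hypothetical values of a single SNP for five individuals.
population = [0, 1, 2, 0, 1]
maf = minor_allele_frequency(population)  # 4 minor alleles out of 10
```

Dividing by twice the population size reflects that each individual carries two alleles per SNP.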

3.2. Symmetric Bilinear Groups

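Let $\mathbb{G}$ be an additive group of prime order $p$, let $g_1$ and $g_2$ be generators of $\mathbb{G}$, and let $a$ be a random element of $\mathbb{Z}_p$. Let $e: \mathbb{G} \times \mathbb{G} \rightarrow \mathbb{G}_T$ be a function that maps two elements of $\mathbb{G}$ to an element of the target group $\mathbb{G}_T$, which also has prime order $p$. The tuple $(p, \mathbb{G}, \mathbb{G}_T, e)$ is a symmetric bilinear group if the following properties hold:
(a) the group operations in $\mathbb{G}$ and $\mathbb{G}_T$ are efficiently computable.
(b) the mapping $e$ from $\mathbb{G} \times \mathbb{G}$ to $\mathbb{G}_T$ is efficiently computable.
(c) the mapping is non-degenerate: $e(g_1, g_2) \neq 1$.
(d) the mapping is bilinear: for all $a, b \in \mathbb{Z}_p$, $e(a g_1, b g_2) = e(g_1, g_2)^{ab}$. Based on the symmetric bilinear group, we design an encryption algorithm to encrypt phenotypes of patients.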
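To make the bilinearity and non-degeneracy properties concrete, the sketch below checks them on a deliberately tiny toy map. All parameters (`P`, `Q`, `H`) and function names are assumptions chosen for illustration; the source group is modeled as the integers modulo `P` under addition, where discrete logarithms are trivial, so this construction offers no security and is not the pairing used in the proposed scheme.

```python
# Toy parameters (assumptions, far too small and structured to be secure).
P = 11   # prime order of the source group Z_P (additive)
Q = 23   # prime modulus with P dividing Q - 1
H = 2    # element of order P in Z_Q^*, generating the target group

G = 1    # generator of the additive source group Z_P


def scalar_mul(a: int, x: int) -> int:
    """Additive group operation: a times the element x in Z_P."""
    return (a * x) % P


def pairing(x: int, y: int) -> int:
    """Toy symmetric map e(x, y) = H^(x*y) mod Q from Z_P x Z_P to <H>."""
    return pow(H, (x * y) % P, Q)


# Non-degeneracy: the pairing of the generator with itself is not the identity.
assert pairing(G, G) != 1

# Bilinearity: e(a*G, b*G) == e(G, G)^(a*b) for all a, b in Z_P.
for a in range(P):
    for b in range(P):
        assert pairing(scalar_mul(a, G), scalar_mul(b, G)) == pow(pairing(G, G), a * b, Q)
```

Practical pairing-based schemes instantiate the map over elliptic curve groups, where computing the pairing is efficient but recovering exponents is assumed to be hard.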

4. Models

In this section, we describe the system, data, query, and security models for our proposed scheme. We present the frequently used notation in Table 1.

the security parameter of the proposed scheme
the number of hospitals
the number of phenotype attributes for all the hospitals
the number of SNP identities for all the hospitals
the number of patients in hospital
the size of case and control groups
the global ordered set of phenotype attributes,
P.name the attribute of a phenotype, P.name
P.val the value of a phenotype, P.val ∈ {0, 1}
SNP.ID the identity of a SNP
SNP.val the value of a SNP
the pseudonym of patient in hospital
the vector of SNP values for patient in hospital
set of all patient records in hospital
a record of phenotype of patient in hospital , including the pseudonym and all phenotypes of patient
a record of SNP data of patient in hospital , including pseudonym and set of pairs of SNP.ID and SNP.val
the encrypted SNP dataset of hospital
the encrypted phenotype dataset of hospital
the symmetric key of hospital , applied to encrypt/decrypt phenotype data
the private key of hospital , only revealed to a legitimate client
the master key of hospital
the query generated by client c
the transform key that hospital generates for authorizing client c to access the phenotype
a hash function
a bilinear mapping
Table 1. Frequently used notation

4.1. System Model

As shown in Figure 1, the system consists of three types of entities: clients, hospitals, and a cloud service provider (CSP). The hospital collects biological samples from patients and sequences them with the patients’ consent. In parallel, the hospital records various phenotypes of patients (e.g., height, eye color, and blood type). Instead of storing all genotype and phenotype data locally, the hospital encrypts the data and outsources them to the CSP. The client (e.g., a medical researcher) queries the CSP for different association studies on datasets of several hospitals. Before sending a query to the CSP, the client needs authorization from the involved hospital(s). If the client gets such an authorization, the corresponding computation is allowed to be conducted over the datasets of the authorizing hospitals. Upon receiving a query from a client, the CSP first identifies the target groups of patients (e.g., case and control groups) by running the query over the encrypted phenotype data, and then executes the other algorithms (e.g., GWAS) over the encrypted genotype data of individuals in the identified groups.

Figure 1. Proposed system model. Hospitals encrypt their data and outsource them to the cloud service provider (CSP). A client sends an authorization request to a hospital for accessing its data. If the request is approved, the client gets authorization and is able to send a request to the CSP to access the hospital’s data.

4.2. Data Model

In the proposed scheme, we make an assumption to ensure that the data model is uniform across hospitals. We assume there exists a common set of terms used to describe all the phenotypes across hospitals (referred to as “phenotype attributes”). That is, all the hospitals use the same terms to describe the same phenotypes.

For efficiency, we represent the value of a phenotype attribute in a binary format (as 0 or 1). A value of 1 means that the patient matches the phenotype attribute, while 0 denotes the opposite. For instance, Table 2 illustrates a partial taxonomy of phenotypes that includes different height ranges in centimeters, as well as the presence of breast cancer.

The value of each SNP is set to represent its number of minor alleles (0, 1, or 2) (see the details in Section 3.1). Based on these settings, the phenotype and SNP datasets of a hospital are represented as in Tables 2 and 3, respectively. In the following sections, when we mention phenotype, it means both phenotype attribute and value, and when we mention SNP, it means both SNP identity and value.

PseudonymP.name height breast cancer
[100,120)
0 1 0 0
0 0 0 1
Table 2. Phenotype dataset of a hospital in terms of height and breast cancer. A patient record includes a list of phenotype attribute values, each of which is either 0 or 1.
PseudonymSNP identity
1 0
2 1
Table 3. SNP dataset of a hospital. A patient record has a list of SNP values, each of which is either 0, 1, or 2.
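The binary phenotype representation described in this section can be sketched as follows. Only the height range [100,120) appears in Table 2; the remaining ranges, the attribute labels, and the function name are hypothetical and serve only to make the example runnable.

```python
from typing import List, Tuple

# Hypothetical global ordered attribute set shared by all hospitals.
HEIGHT_RANGES: List[Tuple[int, int]] = [
    (100, 120), (120, 140), (140, 160), (160, 180), (180, 200),
]
ATTRIBUTES: List[str] = [f"height[{lo},{hi})" for lo, hi in HEIGHT_RANGES]
ATTRIBUTES.append("breast cancer")


def encode_patient(height_cm: float, has_breast_cancer: bool) -> List[int]:
    """Return one bit per attribute: 1 if the patient matches it, else 0."""
    bits = [1 if lo <= height_cm < hi else 0 for lo, hi in HEIGHT_RANGES]
    bits.append(1 if has_breast_cancer else 0)
    return bits


# A patient who is 131 cm tall and has no breast cancer diagnosis.
record = dict(zip(ATTRIBUTES, encode_patient(131.0, False)))
```

Because every hospital uses the same ordered attribute list, the position of a bit identifies the attribute it refers to.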

4.3. Query Model

The query mechanism is designed to find target groups of patients for specified phenotypes. In the proposed scheme, the query includes two parameters: (i) a list of phenotypes of interest and (ii) a parameter to set the size of matching groups (e.g., case and control groups). As an example, we use case and control groups to illustrate the query mechanism. For simplicity, we assume the sizes of the case and control groups to be equal, but this can be customized to support different sizes.

Prior to outsourcing the data to the CSP, a hospital first encrypts its phenotype dataset by using the proposed pairing-based encryption algorithm that supports efficient search over encrypted phenotypes (as discussed in Section 5). A query is generated by the client with input phenotype attributes and sent to the CSP. The proposed scheme allows a client to form the case and control groups based on multiple phenotype criteria (e.g., an association study for a particular type of cancer can be conducted only on males within a specific age range). The CSP, using the properties of the proposed pairing-based encryption, can check whether an encrypted patient’s record (in a hospital’s dataset) contains the phenotype attributes inside the query. If all the phenotype attributes in the query are included in a patient’s record, the record is added to the case group. In contrast, if a patient’s record does not include any of the phenotype attributes in the query, the record is added to the control group. Note that the query is encrypted, so no phenotype information is disclosed to the CSP, and the CSP cannot learn any information from the search process. Thus, the CSP identifies the individuals in the case and control groups in a privacy-preserving way. The case and control groups may contain patients from multiple hospitals. Specifically, given a query from a client and transform keys (detailed in Section 5.6) from hospitals that authorize the client to access their data, the CSP can search multiple hospitals’ phenotype datasets (encrypted by using different secret keys). Thus, the search result is not limited to one hospital’s dataset.
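The matching rule above (all queried attributes present gives a case, none present gives a control) can be sketched on plaintext data as follows. This only illustrates the selection logic; in the proposed scheme the CSP evaluates it over ciphertexts. The function name, pseudonyms, and attribute labels are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def select_groups(
    records: Dict[str, Set[str]], query: Set[str], group_size: int
) -> Tuple[List[str], List[str]]:
    """Split pseudonyms into case and control groups of at most group_size.

    records maps a pseudonym to the set of attributes whose value is 1.
    """
    if not query:
        raise ValueError("the query must contain at least one attribute")
    case: List[str] = []
    control: List[str] = []
    for pseudonym, attributes in records.items():
        matched = query & attributes
        if matched == query:
            # Every queried attribute is present: candidate for the case group.
            if len(case) < group_size:
                case.append(pseudonym)
        elif not matched:
            # No queried attribute is present: candidate for the control group.
            if len(control) < group_size:
                control.append(pseudonym)
        # Records matching only part of the query belong to neither group.
    return case, control


# Hypothetical records pooled from two hospitals.
records = {
    "h1-p1": {"male", "age[40,60)", "cancer"},
    "h1-p2": {"female"},
    "h2-p1": {"male", "age[40,60)"},
    "h2-p2": {"male", "age[40,60)", "cancer"},
    "h2-p3": {"female", "age[20,40)"},
}
case, control = select_groups(records, {"male", "age[40,60)"}, group_size=2)
```

Records that match only some of the queried attributes are left out of both groups, which follows from the two conditions stated in the text.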

4.4. Threat Model

Our threat model is consistent with previous work (Lu et al., 2015; Schneider and Tkachenko, 2019). The client, hospital(s), and the CSP are assumed to be semi-honest, that is, they honestly follow the protocol while trying to learn extra information during the protocol. A hospital may try to use its own data and knowledge to infer another hospital’s data either via collusion with the CSP or by exploiting the patient records that are common to different hospitals. The CSP may analyze the stored ciphertext and observe the encrypted queries. Based on this information, the CSP may try to extract sensitive information. Also, a client may try to infer patients’ data without having proper access authorization. Specifically, we consider the following attacks:

Ciphertext analysis: Since all the data are outsourced to the CSP, the CSP can run different algorithms to analyze ciphertext and try to extract meaningful information.

Query analysis: Since all the queries are sent to the CSP, the CSP may analyze the received queries and their frequency. Consequently, the CSP may try to infer the query pattern and content.

Operation analysis: The CSP conducts search computation and is able to obtain all the transcripts of the operation. Based on this information, the CSP may try to infer the content of query and stored ciphertext.

Unauthorized access: Since all the hospitals’ data is outsourced to the CSP, a client may try to access a hospital’s data without authorization from that hospital.

Collusion between hospitals: To infer a target hospital’s data, several hospitals may collude with each other and combine their knowledge.

Collusion between hospitals and the CSP: If some hospitals and the CSP agree on a common interest in learning another hospital’s (the victim’s) data, all the colluding parties combine their knowledge and try to infer the sought information. For instance, if the search (at the CSP) includes two hospitals and one of these hospitals colludes with the CSP, the CSP can learn which patients of the other (victim) hospital have the considered phenotype as a result of the search operation. However, to provide the common search functionality, this attack is unavoidable, and hence we do not consider it in this work.

We thoroughly analyze all these attacks and robustness of the proposed scheme against them in Section 6.

5. Proposed Scheme

In this section, we first give an overview of the proposed scheme, and then describe its details.

5.1. Overview

In the proposed scheme, hospitals independently encrypt and outsource their phenotype datasets to a CSP, and the CSP conducts search operations over the outsourced federated data (from multiple hospitals) to identify the target groups of patients. To process phenotype data in an efficient and privacy-preserving way, we propose a novel encryption mechanism to encrypt the phenotype data. The proposed encryption scheme allows different hospitals to encrypt their phenotype data independently with their own secret keys. Moreover, the encryption algorithm supports identification of case and control groups efficiently, without information leakage. Furthermore, the identification process requires less communication than secure multi-party computation-based approaches (Schneider and Tkachenko, 2019) and less computation than homomorphic encryption-based approaches (Akavia et al., 2019). In general, the execution of privacy-preserving identification of the target groups of patients can be divided into seven phases: data preprocessing, initialization, key generation, data encryption, client authorization, query generation, and identification of case and control groups. We now present a high-level overview of these seven phases, while the rest of this section provides in-depth details.

Data Preprocessing. Phenotype and genotype data need to be properly encoded in the required data format so that further processing can be conducted.

Initialization. In this phase, the Setup algorithm is run to initialize the parameters and functions.

Key generation. In this phase, the KeyGen function is executed to generate secret keys for hospitals.

Data encryption. Hospitals iteratively call SinglePhenotypeEncrypt to encrypt their phenotype data.

Client authorization. Before requesting access to a hospital’s dataset (from the CSP), a client obtains an authorization from the hospital.

Query generation. To run a query over outsourced data with specified phenotypes, the client generates the query by iteratively running the query generation algorithm SinglePhenotypeQueryGen on the target phenotypes.

Identification of the case and control groups. Given a query, the CSP identifies whether a patient record is in the case group or the control group by running the Search algorithm.

5.2. Data Preprocessing

In this section, we present the procedures we use to preprocess the phenotype data, so that the processed data matches the data format requirements.

5.2.1. Preprocessing Phenotype Data

The representations of phenotype attributes should be uniform for all the hospitals, and hence we assume that all the hospitals share a common ordered set of phenotype attributes, denoted as . We also assume that the value corresponding to a phenotype attribute is binary: 1 represents a patient having the phenotype attribute, while 0 means the opposite. In Table 2, we illustrate the processed phenotype data for height and breast cancer. As seen in the table, in the dataset () of hospital , a patient record can be represented as , where , , and .

5.3. System Initialization

In this section, we present the procedures during the initialization so that all algorithms use the same initial parameters.

5.3.1. Initialization for Phenotype Data Encryption

With the input of the security parameter , the system first generates a symmetric bilinear mapping , where the multiplicative cyclic group is generated by generator and has the prime order (). Then, a cryptographic hash function is selected. These procedures are detailed in the Setup algorithm shown in Algorithm 1.

1:
2:,
3:set
4:choose
5:return ,
Algorithm 1 Setup

5.4. Key Generation

In this section, we present the procedures to generate the required keys in the system.

5.4.1. Key Generation for Phenotype Data Encryption

Hospital randomly selects a master key , and sets its private key as . In addition, hospital selects a secret key from . This is detailed in the KeyGen procedure presented in Algorithm 2.

1:,
2:, ,
3:
4:
5:
6:return , ,
Algorithm 2 KeyGen

5.4.2. Key Generation for client

The system randomly selects a number and generates a key for each client. Once a client joins the system, it is assigned and .

5.5. Data Encryption

In this section, we describe the proposed phenotype data encryption algorithm, which supports efficient privacy-preserving search and update.

5.5.1. Phenotype Data Encryption

The set of phenotype attributes is denoted by , as described before. Phenotype data of a patient in hospital is represented as , (, ), , (, . For each phenotype of the patient (phenotype name-value pair, i.e., ), the hospital first selects a random number from , where is a large prime that is larger than . The pair of phenotype attribute and value is first hashed and then encrypted into a group value . Afterwards, a symmetric encryption algorithm SE (e.g., AES) is invoked with the input of a secret key and the pair of phenotype attribute and value . Finally, the algorithm outputs the ciphertext consisting of a random group element , an encoded group element , and a symmetrically encrypted ciphertext .

Algorithm 3 shows how hospital encrypts a single phenotype of patient (i.e., ). To encrypt all patients’ phenotype data in a hospital, we iteratively encrypt each patient’s phenotypes. In detail, for each patient in hospital , the hospital first reads its phenotype data , and then invokes SinglePhenotypeEncrypt to encrypt each phenotype of the patient. Once all the phenotypes of the patient are encrypted, the result, denoted as , is stored in the dataset with the patient pseudonym . This process is repeated for each patient, as detailed in Algorithm 4.

1:(, ), ,
2:
3:
4:
5:
6:
7:
8:return
Algorithm 3 SinglePhenotypeEncrypt
1:, ,
2:
3:initialize a dictionary
4:for all  do
5:     
6:     for all   do
7:         
8:               
9:     
10:     
11:return
Algorithm 4 PhenotypeEnc
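The iteration described for PhenotypeEnc (a dictionary keyed by pseudonym, filled by encrypting each phenotype of each patient) can be sketched as below. The per-phenotype routine here is only a placeholder built from a random nonce and a keyed digest so that the loop can run; it is an assumption of this example and is not the pairing-based SinglePhenotypeEncrypt of the proposed scheme. All names and types are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Callable, Dict, List, Tuple

Phenotype = Tuple[str, int]        # (P.name, P.val)
Ciphertext = Tuple[bytes, bytes]   # placeholder ciphertext shape


def placeholder_encrypt(key: bytes, phenotype: Phenotype) -> Ciphertext:
    """Stand-in for SinglePhenotypeEncrypt, NOT the pairing-based algorithm.

    It draws fresh randomness per call and binds it to the phenotype with a
    keyed digest, which is enough to exercise the outer loop.
    """
    name, value = phenotype
    nonce = secrets.token_bytes(16)
    tag = hmac.new(key, nonce + f"{name}={value}".encode(), hashlib.sha256).digest()
    return nonce, tag


def phenotype_enc(
    dataset: Dict[str, List[Phenotype]],
    key: bytes,
    encrypt: Callable[[bytes, Phenotype], Ciphertext] = placeholder_encrypt,
) -> Dict[str, List[Ciphertext]]:
    """Encrypt every phenotype of every patient, keyed by pseudonym."""
    encrypted: Dict[str, List[Ciphertext]] = {}
    for pseudonym, phenotypes in dataset.items():
        encrypted[pseudonym] = [encrypt(key, p) for p in phenotypes]
    return encrypted


# Hypothetical dataset of one hospital with two patients.
dataset = {
    "p1": [("height[100,120)", 0), ("breast cancer", 1)],
    "p2": [("height[100,120)", 0), ("breast cancer", 0)],
}
encrypted_dataset = phenotype_enc(dataset, secrets.token_bytes(32))
```

Passing the per-phenotype routine as a parameter keeps the loop independent of the concrete encryption algorithm.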

5.6. Client Authorization

In the proposed scheme, a client needs to get authorization from a hospital before it can access (operate on) the hospital’s data. Meanwhile, the hospital should be able to authorize the client at a fine granularity, and the CSP should not learn which data the hospital authorizes the client to access. In this section, we first present a simple client authorization mechanism, which requires a client to generate a separate query for each hospital that authorizes the client to access its data. In addition, this mechanism allows a client to access a hospital’s data without any limitations once it is authorized. Then, to achieve flexible authorization and let the client generate a single query that can be used to operate on all authorized hospitals’ datasets, we present an improved mechanism. The improved mechanism supports per-query fine-grained authorization and allows a single query to access multiple hospitals’ data.

5.6.1. Simple Authorization

In the simple authorization mechanism, if a client gets authorization from a hospital, the client can access all the data of the corresponding hospital for all future queries. To present this authorization mechanism, we assume that there exists a client with private key who wants to access (operate on) data from hospital . The client first sends an authorization request to hospital with its private key . If the hospital approves the request, the hospital signs the private key and sends it back to the client. The client recovers . Upon obtaining , the client can generate a legitimate query. The details of this authorization mechanism are also shown in Algorithm 5.

1:, ,
2:
3:Client: send to hospital
4:Hospital : compute and send to the client
5:Client: compute
Algorithm 5 SimpleAuthorization

5.6.2. Improved Authorization

In the simple authorization mechanism, a client can access a hospital’s data indefinitely once the hospital authorizes the client. Also, the hospital is unable to control the granularity of the authorization. In other words, the hospital authorizes either all of its data to a client or none of it. Furthermore, the simple authorization mechanism results in a query that cannot be executed across multiple hospitals’ data (i.e., the client needs to generate separate queries for each hospital’s dataset).

In order to overcome these concerns, we propose an improved authorization mechanism. The new authorization mechanism supports fine-grained authorization on a per-query basis. Moreover, using the transform key function, a single query sent to the CSP can be executed across hospitals. As a result, the complexity of a client’s query generation is reduced from linear in the number of accessible hospitals to constant. The steps for the improved authorization mechanism are as follows. A client first generates an authorization request and sends it to a hospital. If the hospital approves the request, the hospital generates a transform key for the client (per query) and sends it to the CSP, such that the CSP can apply the transform key to transform the ciphertext into the format that supports the client’s query. After the transformation, the query can be executed by the CSP over the phenotype data of each hospital that authorizes the client. The construction of the transform key needs to consider the privacy of the transform key, the transformation process, and the query. With these privacy requirements in mind, we set the transform key at the granularity of phenotypes and compute it as , where denotes the target hospital, represents the associated data from a client, and represents the phenotype attribute in phenotype attribute set . The details of the transform key construction are as follows.

  1. A client randomly selects   () and sends it to a hospital with and () to request corresponding data access.

  2. Upon receiving the query request, the hospital decides which phenotype attributes can be accessed and generates () for each approved phenotype attribute. Specifically, if the hospital approves the requested , the corresponding is properly constructed based on and . Otherwise, a random group element is selected, such that the access structure can be protected from the CSP.

  3. For each phenotype attribute, a transform key ()) is generated. The hospital sends all of them to the CSP and replies to the client with a success message.

Given the transform key, the CSP can transform the ciphertext into the format that supports corresponding client’s query. Using this technique, the CSP can process all the ciphertext before a query is launched. The details of the transformation are discussed in Section 5.8.
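The three steps above can be sketched as follows. This is only a toy illustration under stated assumptions: it uses a small prime-order subgroup of the integers modulo a safe prime instead of a pairing group, and the way the hospital combines the client's random value with its own secret (a product in the exponent) is hypothetical, as are the names `client_request` and `hospital_transform_keys`. What it is meant to show is the access-structure hiding: denied attributes still receive a key, but a random one.

```python
import secrets

# Toy group (assumption): order-Q subgroup of Z_P^* with P = 2Q + 1.
# The scheme in the paper works in a pairing group instead.
P = 2039   # safe prime, 2039 = 2 * 1019 + 1
Q = 1019   # prime order of the subgroup
G = 4      # 4 = 2^2 is a quadratic residue, so it generates the order-Q subgroup


def client_request(attributes):
    """Step 1: the client picks one fresh random value per requested attribute."""
    return {attr: secrets.randbelow(Q - 1) + 1 for attr in attributes}


def hospital_transform_keys(request, approved, hospital_secret):
    """Steps 2 and 3: one transform key per requested attribute.

    Approved attributes get a key bound to the client's value and the
    hospital's secret (hypothetical combination). Denied attributes get a
    random group element, so the CSP cannot tell which were approved.
    """
    keys = {}
    for attr, r in request.items():
        if attr in approved:
            exponent = (r * hospital_secret) % Q
        else:
            exponent = secrets.randbelow(Q - 1) + 1
        keys[attr] = pow(G, exponent, P)
    return keys


request = client_request(["breast cancer", "lung cancer", "asthma"])
keys = hospital_transform_keys(request, approved={"breast cancer"}, hospital_secret=7)
```

Every requested attribute ends up with exactly one key, so the number of keys sent to the CSP reveals nothing about how many attributes were actually approved.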

5.7. Query Generation

The query is generated based on client’s input that specifies phenotype data of interest. Then, the query is executed over encrypted phenotype data. Algorithm 6 shows query generation based on the input of a single phenotype. The client first applies the hash function on the input phenotype attribute and value, obtaining () as the output. Then, the client computes a group element , where is selected during the generation of authorization request. is included in the query. Afterwards, the client computes the other part of the query: . To support multiple phenotypes in a query, the client can iteratively invoke the SinglePhenotypeQueryGen algorithm. In detail, given a set of phenotypes, for each phenotype inside , the client calls SinglePhenotypeQueryGen to encrypt it. The result () is stored in a vector , which has the same dimension and the same order of phenotype attributes as . For the phenotypes whose phenotype attributes are in but not in , the corresponding value is set to in vector . is a set and it includes the vector . Finally, in addition to , the size of case and control groups is also included in before is sent to the CSP.

Algorithm 7 shows how to extend the input from a single phenotype to multiple phenotypes. Here, for each phenotype inside the input phenotype data set , the client invokes Algorithm 6. Once all the input data are processed, the algorithm outputs the final query . Note that all the positions whose phenotype attributes are not included in the input phenotype data are filled with zeros. For example, a client can input a set of pairs of phenotype attribute and value as {(breast cancer, 1), (lung cancer, 0), (blue eye color, 1)} to generate a query. Applying the above algorithms, each target phenotype attribute corresponds to one component in the query, and every remaining component, whose attribute is not included in the input phenotype attribute set, is set to zero. The final query also includes the size of case and control groups.

1:, (),
2:
3:
4:
5:return
Algorithm 6 SinglePhenotypeQueryGen
1:,
2:
3:
4:for all  do
5:     
6:     
7:fill with in locations of phenotype attributes in
8:
9:return
Algorithm 7 QueryGen
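The layout of the query vector can be illustrated with a minimal sketch. It covers only the positional structure described above (same dimension and order as the attribute set, zeros for attributes that are not queried, group size attached). The attribute list, the function names, and the use of a bare SHA-256 digest as a stand-in for the masked group elements produced by SinglePhenotypeQueryGen are all assumptions made for illustration.

```python
import hashlib

# Hypothetical ordered phenotype attribute set shared by hospitals and clients.
ATTRIBUTES = ["breast cancer", "lung cancer", "blue eye color", "asthma"]


def encode_phenotype(attribute, value):
    """Stand-in for SinglePhenotypeQueryGen: hashes attribute and value only.

    The real algorithm additionally masks the hash and lifts it into a group
    element; that part is omitted here.
    """
    return hashlib.sha256(f"{attribute}={value}".encode()).hexdigest()


def query_gen(phenotypes, group_size):
    """Place each encoded phenotype at its attribute's position, zeros elsewhere."""
    wanted = dict(phenotypes)
    vector = [
        encode_phenotype(attr, wanted[attr]) if attr in wanted else 0
        for attr in ATTRIBUTES
    ]
    return {"vector": vector, "group_size": group_size}


query = query_gen(
    [("breast cancer", 1), ("lung cancer", 0), ("blue eye color", 1)],
    group_size=100,
)
```

With the example input from the text, the first three positions carry an encoded component and the position for the unqueried attribute stays zero.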

5.8. Identification of Case and Control groups

Without loss of generality, we present the identification process at the CSP over a single hospital ’s dataset. Multiple hospitals’ datasets are processed separately and in parallel. We describe the identification process in two steps. First, we show the search operation over a single phenotype. Then, we extend the algorithm to multiple phenotypes. The details of search over a single phenotype are shown in Algorithm 8. Specifically, given the ciphertext of phenotype of patient in hospital , we first extract the components , , from , denoted as , , and . Based on the input of the query from client querying for phenotype , we extract from , denoted as and . With the three pairs (, ), (, ), and ( ), we compute the bilinear mapping over each pair and multiply all the results one by one. If the computed result is , the current phenotype matches the query, otherwise, it does not match. The correctness of the algorithm is shown in Eq. 1.

1:, ,
2: or
3:
4:
5:
6:
7:if  then
8:     return 1
9:else
10:     return 0
Algorithm 8 SinglePhenotypeSearch
(1)

To support a query that contains multiple phenotypes, we extend Algorithm 8 to Algorithm 9. In detail, we first initialize two lists and . Then, for each patient with pseudonym in hospital ’s encrypted phenotype dataset , the corresponding encrypted phenotype data is read. Given the encrypted phenotype data , query , and transform key ( , , ), the per-hospital components of the given data are extracted and passed as the input to the SinglePhenotypeSearch algorithm. For example, both the first element of phenotype data query and transform key are extracted, and then input into the SinglePhenotypeSearch algorithm. If a patient contains all the phenotype data inside a query, the patient’s pseudonym is added to the case group . If a patient does not match query phenotypes, its pseudonym is added to the control group . This process is executed until the size of both case and control groups reaches or until all the data is processed.

1:, , (,, )
2: and
3:initialize two lists and
4:set the number of non-zero elements in as
5:
6:for all  do
7:     
8:     for all  do
9:          score score + SinglePhenotypeSearch(,, )      
10:     if score = and  then
11:          .append()
12:     else if score = and  then
13:          .append()      
14:     if  and  then
15:          break      
16:return and
Algorithm 9 Search
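The control flow of the search can be sketched in plaintext as follows. The `match` callback stands in for SinglePhenotypeSearch, which in the scheme is evaluated with bilinear maps over ciphertexts. Two details are assumptions based on the prose description: a patient joins the control group when the score is zero, and unqueried positions are marked with `None` (the scheme uses zeros, but plaintext values can themselves be 0 in this toy setting).

```python
def search(records, query, group_size, match):
    """Collect case and control pseudonyms, stopping once both groups are full.

    records: dict mapping pseudonym -> list of per-attribute values
    query:   list aligned with the attribute order; None means "not queried"
    match:   callable returning 1 on a match and 0 otherwise
    """
    case, control = [], []
    target = sum(1 for q in query if q is not None)  # number of queried attributes
    for pseudonym, phenotypes in records.items():
        score = sum(
            match(p, q) for p, q in zip(phenotypes, query) if q is not None
        )
        if score == target and len(case) < group_size:
            case.append(pseudonym)       # all queried phenotypes match
        elif score == 0 and len(control) < group_size:
            control.append(pseudonym)    # none of the queried phenotypes match
        if len(case) >= group_size and len(control) >= group_size:
            break                        # early termination
    return case, control


records = {
    "p1": [1, 0, 1],
    "p2": [0, 1, 0],
    "p3": [1, 0, 0],
    "p4": [1, 0, 1],
    "p5": [0, 1, 0],
}
case, control = search(records, [1, 0, None], 2, lambda p, q: int(p == q))
```

In this example `p4` also matches the query but is not added, because the case group has already reached the requested size.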

5.9. Computation over Identified Target Patients

Here, we briefly discuss how our scheme can compute GWAS (or other statistics) over the identified patients in centralized and decentralized settings. In a centralized setting, the patient genomic data is also stored at the CSP, and the genomic data of each patient can be encrypted using a multi-key fully homomorphic encryption mechanism (e.g., (Chen et al., 2019)) for storage at the CSP. Then, after identifying the case and control groups (as in Section 5.8), the CSP can directly execute the GWAS algorithm over the identified patient data using the homomorphic properties of the multi-key fully homomorphic encryption mechanism. In a decentralized setting, the patient genomic data is not stored at the CSP. In this case, the CSP can return the identified patients to the client. Based on the identified patients, the client can send requests to all the involved hospitals to get access to their data for computing GWAS in a distributed way (Raisaro et al., 2018; Froelicher et al., 2021).

5.10. Managing Dynamic Phenotype Data

Here, we show how the proposed scheme supports efficient update of patient phenotype data stored at the CSP. Assume the hospital wants to update phenotype of patient . Given the patient pseudonym and the phenotype (), the phenotype encryption algorithm (in Section 5.5) is called to encrypt the phenotype. Once the encryption is completed, the vector of the update query is constructed by inserting the ciphertext at the position where is located in and setting the remaining values to . Additionally, the command “update” is added to the query. The update query is sent to the CSP. Upon receiving the update query, the CSP first identifies the entry of the patient , then identifies the location corresponding to the non-zero element inside the vector of the update query, and finally replaces the old cipher with the new cipher from the update query. To insert a new phenotype or to delete an existing phenotype from a patient record, the proposed update algorithm can also be used by directly appending or removing a record from stored ciphertext.
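The update flow can be sketched as below. Ciphertexts are represented by opaque strings, and the dictionary layout of the update query and of the CSP's store is a hypothetical choice for illustration. Only the mechanism from the text is shown: a vector that is zero everywhere except at the updated position, and a CSP that overwrites exactly that position.

```python
def build_update_query(pseudonym, position, new_ciphertext, num_attributes):
    """Hospital side: ciphertext at the attribute's position, zeros elsewhere."""
    vector = [0] * num_attributes
    vector[position] = new_ciphertext
    return {"command": "update", "pseudonym": pseudonym, "vector": vector}


def apply_update(store, update):
    """CSP side: find the patient entry and replace the non-zero position(s)."""
    if update["command"] != "update":
        raise ValueError("unsupported command")
    record = store[update["pseudonym"]]
    for index, ciphertext in enumerate(update["vector"]):
        if ciphertext != 0:
            record[index] = ciphertext


# Hypothetical encrypted store held by the CSP: pseudonym -> list of ciphertexts.
store = {"p1": ["c0", "c1", "c2"]}
update = build_update_query("p1", position=1, new_ciphertext="c1-new", num_attributes=3)
apply_update(store, update)
```

The CSP never needs to decrypt anything to apply the update; it only compares vector entries against zero.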

6. Privacy Analysis

In this section, we analyze the privacy of phenotype data. We first provide a high-level discussion of how the proposed scheme achieves robustness in the presence of the attacks presented in Section 4.4. Then, we present a formal privacy definition and proof.

6.1. Privacy Against Ciphertext Analysis

Phenotype data is encrypted and stored at the CSP. This allows the CSP to analyze the stored ciphertext. The encrypted phenotype data can be split into two parts (as in Section 5.5.1). The first part of the encrypted phenotype data is constructed in two steps. Hospital first computes the hash of the phenotype attribute and value. Then, hospital randomly selects a number to mask the above hash result and raises a group element to the power of the masked result. The privacy of this part relies on the hardness of the discrete logarithm problem, the one-wayness of the hash function, and the randomness of the selected number. The second part of the encrypted phenotype data results from directly encrypting the phenotype attribute and value. Hospital uses its secret key and invokes a symmetric encryption algorithm (e.g., AES) to encrypt the concatenation of phenotype attribute and value. The privacy of the second part relies on the robustness of the utilized symmetric encryption algorithm (e.g., AES). From the above description, we can conclude that both parts of the encrypted phenotype are semantically secure. Thus, the CSP is unable to learn significant information from the ciphertext analysis.
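The role of the random mask in the first part can be illustrated with a toy sketch. It is not the construction from Section 5.5.1: the group is a small subgroup of the integers modulo a safe prime rather than a pairing group, the mask is applied as a product in the exponent (an assumption), the function names are hypothetical, and the symmetric (AES) part is left out entirely. The sketch only shows why fresh randomness keeps two encryptions of the same phenotype from looking identical to the CSP.

```python
import hashlib
import secrets

# Toy group (assumption): order-Q subgroup of Z_P^* generated by G.
P, Q, G = 2039, 1019, 4


def hash_to_exponent(attribute, value):
    """Hash the attribute and value to a non-zero exponent modulo Q."""
    digest = hashlib.sha256(f"{attribute}={value}".encode()).digest()
    return int.from_bytes(digest, "big") % (Q - 1) + 1


def encrypt_searchable_part(attribute, value, r=None):
    """Return (G^r, G^(r*h)) where h is the hash and r a fresh random mask."""
    if r is None:
        r = secrets.randbelow(Q - 1) + 1
    h = hash_to_exponent(attribute, value)
    return pow(G, r, P), pow(G, (r * h) % Q, P)


# Same phenotype, two different masks: the ciphertexts differ.
first = encrypt_searchable_part("lung cancer", 1, r=5)
second = encrypt_searchable_part("lung cancer", 1, r=7)
```

Without the mask, equal phenotypes would always map to equal group elements, and the CSP could group patients by ciphertext equality alone.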

6.2. Privacy Against Query Analysis

The input of the query includes a set of phenotypes. Each phenotype includes a pair of phenotype attribute and value, which is first hashed and the hash result is lifted as a power of a group element (as in Section 5.7). Additionally, a random mask is selected to hide this result, which enables the encrypted query to be semantically secure. Due to the semantic security of the query, the CSP is unable to learn the query content from the query analysis.

6.3. Privacy Against Operation Analysis

Since the ciphertext of phenotype data is stored at the CSP, the CSP is responsible for conducting search over the ciphertext. For the search operation, the CSP applies the query to search over the ciphertext of phenotype data. If a phenotype is included in the query, the corresponding search result is 1; otherwise, it is 0 (see Algorithm 8). Through the execution of the search process, the CSP can learn the number of matching phenotypes of each patient. However, for each matching phenotype, its value can be either 0 or 1, and the CSP cannot distinguish between the two. Thus, the CSP cannot learn any information about the content of the query or the ciphertext from the search operation.

6.4. Robustness Against Unauthorized Access

Access control is designed through the transform key (as discussed in Section 5.6). A client selects random numbers () and sends them to the target hospital. The hospital generates the transform key () by lifting these numbers into the power of a group element. Due to the randomness of and the hardness of the discrete logarithm problem, the CSP is unable to extract any information from the transform key. Analyzing the search operation, the output of using incorrect transform key is , which does not disclose any information to the CSP, as described in Section 6.3. Therefore, the proposed authorization scheme is robust against the unauthorized access.

6.5. Robustness Against Colluding Hospitals

Each hospital independently encrypts its phenotype data and genotype data with its own secret key. Even if several hospitals collude with each other, they cannot get any advantage to infer another hospital’s data.

6.6. Robustness against Collusion between a Hospital and CSP

If one or more hospitals collude with the CSP, the CSP cannot obtain any advantage in inferring the remaining hospitals’ data since each hospital’s data is encrypted independently. However, as we clarified in Section 4.4, we do not consider the following case. If the search (at the CSP) includes two hospitals and one of these hospitals colludes with the CSP, the CSP can learn which patients of the other (victim) hospital have the considered phenotype as a result of the search operation (but not the genomic data of the identified patients). This attack is unavoidable when providing the common search functionality, and hence we do not consider it in this work.

6.7. Privacy Analysis

In the following, we provide a formal privacy analysis of the proposed scheme. Following previous work (Chen et al., 2019), the allowed leakage includes (i) the size pattern and (ii) the access pattern. The size pattern discloses the size of the ciphertext, while the access pattern reveals the access frequency of matching patient records. The allowed leakage is not considered a violation of our privacy goal. The privacy of the proposed scheme is based on the hardness of the discrete logarithm problem, the randomness of the selected random mask, and the robustness of the applied symmetric encryption (e.g., AES).

The privacy of the proposed scheme can be divided into two elements: phenotype data privacy, and query privacy. The privacy of phenotype data can be further divided into two parts. One part is encrypted by using symmetric encryption algorithm (the third element in a ciphertext, detailed in Section 5.5.1) and the other part is not (the first and second elements in a ciphertext, detailed in Section 5.5.1). Due to the robustness of symmetric encryption algorithm (e.g., AES), the part with the symmetric encryption is also semantically secure. The other part is first protected by a hash function and then masked with a random value. Both the hash result and random mask are put into the power of a group element. Due to the random mask and the hardness of discrete logarithm problem, the ciphertext is semantically secure in the presence of chosen plaintext attack. The query privacy is achieved similarly, relying on the random mask and discrete logarithm problem.

Formally, we define the privacy experiments as follows. Let be the scheme, the advantage of the adversary is defined as , where and are defined below. In the following, we detail the game.

Init: The adversary selects two datasets and of the same size and sends them to the challenger.

Setup: With the input of security parameter , the challenger runs Setup to initialize the parameters. Then, the challenger calls KeyGen to generate the keys.

Phase 1: is allowed to ask the following request:

Phenotype data encryption request:

is allowed to send a dataset with phenotype data to ask for encryption. The challenger calls Encrypt algorithm to encrypt the dataset and sends the result back to .

Challenge: The challenger randomly selects from , encrypts the dataset , and sends it to the adversary .

Phase 2: is allowed to send requests as in Phase 1. Additionally, is allowed to send a query request. The challenger only authorizes a query containing the phenotype attributes, where two datasets have the same value. Then, it generates a transform key for , where the mask () in transform key is randomly selected by the challenger (see details in Section 5.6 ).

Guess: outputs as the guess for .

We say the scheme is privacy-preserving if the advantage of the adversary is negligible, i.e., , where is a negligible function in .
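The shape of this game can be made concrete with a small simulation harness. It is a sketch under several assumptions: the encryption is a toy one-time-pad stand-in rather than the proposed scheme, the Phase 1 and Phase 2 oracle requests are omitted, the advantage is computed with the usual convention |Pr[guess is correct] − 1/2|, and all function names are hypothetical.

```python
import random


def encrypt(dataset, rng):
    """Toy randomized encryption (assumption): XOR with a fresh one-time key."""
    key = bytes(rng.randrange(256) for _ in dataset)
    return bytes(d ^ k for d, k in zip(dataset, key))


def play_game(adversary, d0, d1, rng):
    """One run of the game: Init, Challenge, and Guess (no oracle phases)."""
    assert len(d0) == len(d1)              # Init: datasets of the same size
    b = rng.randrange(2)                   # Challenge: challenger picks a secret bit
    challenge = encrypt((d0, d1)[b], rng)  # ... and encrypts the chosen dataset
    return adversary(challenge) == b       # Guess: did the adversary recover b?


def estimate_advantage(adversary, d0, d1, trials=2000, seed=0):
    """Empirical estimate of |Pr[adversary wins] - 1/2| over repeated games."""
    rng = random.Random(seed)
    wins = sum(play_game(adversary, d0, d1, rng) for _ in range(trials))
    return abs(wins / trials - 0.5)


def parity_adversary(ciphertext):
    """A naive adversary that guesses from the lowest bit of the first byte."""
    return ciphertext[0] & 1


advantage = estimate_advantage(parity_adversary, b"case", b"ctrl")
```

Because the toy ciphertext is independent of the chosen dataset, the estimated advantage stays close to zero, which is the behavior the definition requires of a privacy-preserving scheme.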

In the privacy game defined above, the adversary can only learn information in Phase 1 and Phase 2. Phase 2 differs from Phase 1 in that the adversary holds the challenge ciphertext and is allowed to issue a query request. Thus, we only need to prove that what the adversary can learn from the encryption requests and the query request is negligible, as follows.

Phenotype data encryption request:

submits the dataset of phenotype data to ask for encryption from hospital .

If the PhenotypeEnc algorithm is semantically secure, is unable to distinguish ciphertext from a random string. The ciphertext of each pair of phenotype attribute and value includes three components. The first component is randomly selected from , which does not reveal any information. The second component is . Due to the hardness of discrete logarithm problem, is unable to extract the power of a group element. Similarly, is unable to distinguish the second component from a random element in . The third component is encrypted by using a symmetric encryption algorithm (e.g., AES), which is semantically secure.

Therefore, the ciphertext obtained through Encrypt algorithm is semantically secure.

Query request:

First, the transform key is indistinguishable from a random element of group . Second, for each pair of phenotype attribute and value, the query is , , where is an element from . Based on the hardness of the discrete logarithm problem and the randomness of , the query is indistinguishable from two random elements from . Third, given the query and transform key, the adversary is able to run the search operation over the ciphertext of phenotype data. Furthermore, the adversary is also able to run analysis algorithms on the ciphertext of genotype data. However, due to the constraint on issuing client authorization, the two datasets must yield the same search result. Thus, the adversary cannot learn any significant information by executing the search operation.

Based on above analysis, we can conclude is negligible and the proposed scheme is privacy-preserving.

7. Evaluation

In this section, we evaluate the performance of the proposed scheme. We run the experiments on a commodity machine with CPU and 16GB RAM. The proposed phenotype encryption algorithm is implemented in Python using Charm (Akinyele et al., 2013), while the SNP encryption algorithm is implemented in C++ using HEAAN (13). Each experiment is run 10 times, and we report the average results.

7.1. Data Model

We used the rsnps tool (22) to obtain all the raw patient files from the publicly available OpenSNP dataset (18). Then, we converted the raw patient files into the VCF format using an open-source software package named personal-genome-analysis (Hammerbacher, 27-09-2018). Eventually, we ended up with 1052 valid VCF files. For the phenotype data, we also used the OpenSNP dataset. In total, we collected 1052 records and extracted 1052 phenotype attributes.

7.2. Experimental Results

In this section, we first show the efficiency and scalability of the phenotype and genotype data encryption algorithms. After that, we present the scalability and efficiency of client authorization and query generation.

7.2.1. Phenotype Data Encryption

We adopt symmetric pairing group SS512 to construct the bilinear mapping and AES to implement the symmetric encryption. Accordingly, the time required to encrypt the phenotype data can be divided into two constituents. The first constituent is due to the pairing group operation. The second constituent is due to using AES to encrypt phenotype attribute and value. The secret key of AES is set to bits. Table 4 shows the performance of the phenotype data encryption for different number of phenotypes. We observed that with the linear increase in the number of patients, the time cost of phenotype encryption for both AES and non-AES (pairing based encryption) constituents increases linearly. Similarly, when the number of phenotypes increases, the required encryption time also increases linearly.

| # patients | # phenotypes | Non-AES (s) | AES (s) |
|---|---|---|---|
| 1052 | 1052 | 8137.5 | 5 |
| 1052 | 526 | 3954.8 | 2.7 |
| 1052 | 263 | 2068.9 | 1.38 |
| 526 | 1052 | 3948 | 2.7 |
| 263 | 1052 | 2042.1 | 1.38 |

Table 4. Time cost of phenotype data encryption for different numbers of patients and phenotypes
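As a quick check of the linear-scaling observation, the non-AES times in Table 4 can be divided by the number of encrypted entries (patients times phenotypes). The short sketch below does this arithmetic; the variable names are only illustrative.

```python
# (patients, phenotypes, non-AES seconds) taken from Table 4.
rows = [
    (1052, 1052, 8137.5),
    (1052, 526, 3954.8),
    (1052, 263, 2068.9),
    (526, 1052, 3948.0),
    (263, 1052, 2042.1),
]

# Pairing-based encryption cost per (patient, phenotype) entry, in milliseconds.
per_entry_ms = [1000 * seconds / (patients * phenotypes)
                for patients, phenotypes, seconds in rows]

for (patients, phenotypes, _), cost in zip(rows, per_entry_ms):
    print(f"{patients:>5} x {phenotypes:<5} -> {cost:.2f} ms per entry")
```

The per-entry cost stays within a narrow band of roughly 7.1 to 7.5 ms across all five configurations, which is what linear scaling in both dimensions predicts.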
| # requested phenotypes | time (s) |
|---|---|
| 1052 | 4.47 |
| 526 | 2.28 |
| 263 | 1.14 |

Table 5. Time cost of authorization request generation for different numbers of requested phenotypes.

7.2.2. Client Authorization

The process of client authorization can be divided into two phases. In the first phase, the client generates an authorization request while in the second phase, the hospital generates the transform key. Table 5 shows the efficiency of authorization request generation for different number of requested phenotypes. We observe that the required time of authorization request generation increases linearly upon increasing the number of requested phenotype attributes. Table 6 shows the performance of the transform key generation for different number of phenotype attributes. Here, we observe that the time required for transform key generation increases linearly as a function of the number of phenotype attributes.

| # authorized phenotypes | time (s) |
|---|---|
| 1052 | 5.87 |
| 526 | 3.2 |
| 263 | 1.62 |

Table 6. Time cost of transform key generation for different numbers of authorized phenotypes
| # input phenotypes | time (s) |
|---|---|
| 1052 | 4.8 |
| 526 | 2.39 |
| 263 | 1.19 |

Table 7. Time cost of query generation for different numbers of input phenotypes

7.2.3. Query Generation

The performance of the query generation algorithm is affected by the number of input phenotypes. The experimental results are shown in Table 7. From the table, we observe that with the linear increase in the number of input phenotypes, the time required for the query generation also increases linearly.

7.2.4. Phenotype Data Search

The search process involves a number of patient records to be processed to construct the case and control groups. To access each hospital’s data, the query needs to be transformed using the transform key. In addition, the desired size of the case and control groups can stop the search process early, once the required number of individuals has been identified in both groups. Table 8 shows the effect of the number of hospitals, the number of input phenotypes, and the size of the case and control groups on the efficiency. We observed that if the number of patients is fixed, the number of hospitals has almost no effect on the efficiency, while the performance is sensitive to the size of the case and control groups and the number of input phenotypes. When the size of the case/control groups is set to 10 and 50, the search time is reduced to 0.29 s and 1.78 s, respectively. The reason is that once the required number of patients is identified for the case and control groups, the search process is terminated. We also evaluate the performance for different numbers of input phenotypes. The observed search times are 32.2 s, 166.7 s, and 327.1 s for 10, 50, and 100 phenotypes, respectively. From these results, we can say that the search time increases linearly as the number of input phenotypes grows.

| # hospitals | # queried phenotypes | size of case/control groups | time (s) |
|---|---|---|---|
| 1 | 10 | 100 | 32.2 |
| 10 | 10 | 100 | 33.1 |
| 100 | 10 | 100 | 34.7 |
| 1 | 10 | 10 | 0.29 |
| 1 | 10 | 50 | 1.78 |
| 1 | 50 | 100 | 166.7 |
| 1 | 100 | 100 | 327.1 |

Table 8. Time cost of phenotype data search for the proposed algorithm with a total of 1052 patients and each patient having 1052 phenotypes

To show the efficiency of the proposed algorithm for phenotype data search, we also designed a fully homomorphic encryption (FHE)-based version of it for comparison. The alternative FHE-based approach includes four steps. First, all the phenotype data is encrypted using CKKS (Cheon et al., 2017). Second, the client sends a query to the CSP in order to determine the case/control groups. Third, the CSP sends the computed result to the different hospitals. Fourth, the hospitals decrypt the result in parallel and send the result (identified case/control groups) to the client. Without considering the time cost of communication, Table 9 shows the time required to complete the search operation using this FHE-based scheme for different numbers of hospitals, queried phenotypes, and case/control group sizes. Comparing Table 8 with Table 9, we observe that the proposed algorithm is roughly 20 times faster than the FHE-based algorithm when the number of queried phenotypes is at most 10. Notably, when the size of the case/control groups is at most 50, the proposed algorithm is more than 300 times faster. Furthermore, in Table 10 we show the comparison between the FHE-based solution and the proposed algorithm for different numbers of patients. From the table, we observe that (i) the proposed scheme consistently exhibits a similar performance advantage over the FHE-based scheme and (ii) the time cost of both algorithms increases linearly with the number of patients.

| # hospitals | # queried phenotypes | size of case/control groups | decryption time per hospital (s) | total time (s) |
|---|---|---|---|---|
| 1 | 10 | 100 | 6.2 | 684.6 |
| 10 | 10 | 100 | 5.9 | 683.9 |
| 100 | 10 | 100 | 0.6 | 678.6 |
| 1 | 10 | 10 | 1.5 | 678.5 |
| 1 | 10 | 50 | 3.1 | 681.1 |
| 1 | 50 | 100 | 6.3 | 684.7 |
| 1 | 100 | 100 | 6.5 | 684.9 |

Table 9. Time cost of phenotype data search for the CKKS algorithm with a total of 1052 patients and each patient having 1052 phenotypes
| algorithm | # hospitals | # queried phenotypes | # patients | total time (s) |
|---|---|---|---|---|
| CKKS algorithm | 1 | 10 | 1052 | 747.5 |
| CKKS algorithm | 1 | 10 | 526 | 355.9 |
| Proposed algorithm | 1 | 10 | 1052 | 35.9 |
| Proposed algorithm | 1 | 10 | 526 | 17.4 |

Table 10. Time cost of phenotype data search for the CKKS and proposed algorithms for different numbers of patients (each patient having 1052 phenotypes). The size of the case/control groups is not limited as an input parameter.
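The speedups implied by Tables 8, 9, and 10 can be recomputed directly from the reported times. The sketch below does so; the dictionary keys (number of hospitals, number of queried phenotypes, group size) and variable names are only an illustrative way of organizing the table rows.

```python
# Search times in seconds, keyed by (hospitals, queried phenotypes, group size).
proposed = {  # Table 8
    (1, 10, 100): 32.2, (10, 10, 100): 33.1, (100, 10, 100): 34.7,
    (1, 10, 10): 0.29, (1, 10, 50): 1.78,
    (1, 50, 100): 166.7, (1, 100, 100): 327.1,
}
ckks_total = {  # Table 9, total time
    (1, 10, 100): 684.6, (10, 10, 100): 683.9, (100, 10, 100): 678.6,
    (1, 10, 10): 678.5, (1, 10, 50): 681.1,
    (1, 50, 100): 684.7, (1, 100, 100): 684.9,
}

# Speedup of the proposed algorithm over the CKKS-based one, per configuration.
speedup = {key: ckks_total[key] / proposed[key] for key in proposed}

# Table 10: no limit on group size, keyed by number of patients.
speedup_by_patients = {1052: 747.5 / 35.9, 526: 355.9 / 17.4}

for key, factor in speedup.items():
    print(key, f"{factor:.1f}x")
```

The ratios show that the advantage is largest when early termination applies (small group sizes) and shrinks as more phenotypes are queried, since the CKKS total time is nearly constant while the proposed search time grows with the number of queried phenotypes.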

8. Discussion

The proposed pairing-based encryption scheme for phenotype data (in Section 5.8) is not limited to efficient identification of case and control groups. The scheme can be extended to support additional functions, such as similar patient search, target record retrieval, and statistical analysis of phenotype data. In the following, we discuss some of these potential extensions.

Similar patient search. In the proposed scheme, a client is allowed to input several phenotypes of interest into a query and send it to the CSP. Upon receiving the query, the CSP searches the stored datasets of several hospitals. Based on the number of matching phenotypes, the CSP can order the search results and return similar patient records. Similar functionality can also be provided at the genome level by encrypting genome data with the proposed pairing-based encryption.
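A minimal sketch of this ranking step is given below. It assumes the CSP already holds, for each pseudonym, the number of matching phenotypes obtained from the search; the function name, the tie-breaking by pseudonym, and the exclusion of patients with no match are illustrative choices, not part of the proposed scheme.

```python
def rank_similar(scores, top_k):
    """Return up to top_k pseudonyms ordered by number of matching phenotypes.

    scores: dict mapping pseudonym -> match count computed by the CSP.
    Ties are broken by pseudonym so that the ordering is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [pseudonym for pseudonym, score in ranked[:top_k] if score > 0]


similar = rank_similar({"p1": 2, "p2": 0, "p3": 3, "p4": 2}, top_k=3)
```

Since the CSP only sees match counts and pseudonyms, the ranking itself does not require access to any plaintext phenotype.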

Phenotype data retrieval. In the proposed scheme, we show how to use the Search algorithm (in Section 5.8) to find target pseudonyms based on the query. One can extend this function to let a client retrieve phenotype data of interest. For instance, a client (e.g., a physician) may be interested in the phenotypes of patients who have lung cancer. Then, the client can generate a query and send it to the CSP to retrieve the phenotype data of patients that are diagnosed with lung cancer. Once the client receives the phenotype data from the CSP, the SinglePhenotypeDecrypt algorithm (Algorithm 10) is called to decrypt the phenotype data.

1:,
2:,
3:
4:
5:return
Algorithm 10 SinglePhenotypeDecrypt

9. Conclusion

In this paper, we have designed a privacy-preserving framework for identification of a target group of patients across multi-tenant data. To achieve this, we have proposed a novel phenotype encryption algorithm. To support search and computation over multi-tenant data by a cloud service provider (CSP), we have introduced a transform key to enable the CSP to transform a single query and execute it over different hospitals’ datasets without privacy violation. To manage the authorization of clients, we have proposed a per-query based authorization mechanism supporting selective phenotype data authorization. Via simulations on real genomic data, we have shown the practicality and efficiency of the proposed scheme. We believe that the proposed scheme will further facilitate the use of genomic data in clinical settings and pave the way for personalized medicine. In future work, we will focus on improving the search efficiency of genomic data and batch queries.

References

  • A. Akavia, C. Gentry, S. Halevi, and M. Leibovich (2019) Setup-free secure search on encrypted data: faster and post-processing free. Proceedings on Privacy Enhancing Technologies 2019 (3), pp. 87–107. Cited by: §5.1.
  • J. A. Akinyele, C. Garman, I. Miers, M. W. Pagano, M. Rushanan, M. Green, and A. D. Rubin (2013) Charm: a framework for rapidly prototyping cryptosystems. Journal of Cryptographic Engineering 3 (2), pp. 111–128. External Links: ISSN 2190-8508, Document, Link Cited by: §7.
  • B. Athey, M. Braxenthaler, M. Haas, and G. Y (2013) TranSMART: an open source and community-driven informatics and data sharing platform for clinical and translational research. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science 6-8. Cited by: §1.
  • D. Bogdanov, L. Kamm, S. Laur, P. Pruulmann-Vengerfeldt, R. Talviste, and J. Willemson (2014) Privacy-preserving statistical data analysis on federated databases. In Annual Privacy Forum, pp. 30–55. Cited by: §2.
  • J. W. Bos, K. Lauter, J. Loftus, and M. Naehrig (2013) Improved security for a ring-based fully homomorphic encryption scheme. In IMA International Conference on Cryptography and Coding, pp. 45–64. Cited by: §2.
  • H. Chen, W. Dai, M. Kim, and Y. Song (2019) Efficient multi-key homomorphic encryption with packed ciphertexts with application to oblivious neural network inference. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 395–412. Cited by: §5.9, §6.7.
  • J. H. Cheon, A. Kim, M. Kim, and Y. Song (2017) Homomorphic encryption for arithmetic of approximate numbers. In International Conference on the Theory and Application of Cryptology and Information Security, pp. 409–437. Cited by: §7.2.4.
  • The Global Alliance for Genomics and Health (2016) A federated ecosystem for sharing genomic, clinical data. Science 352 (6291), pp. 1278–1280. External Links: Document, ISSN 0036-8075, Link, https://science.sciencemag.org/content/352/6291/1278.full.pdf Cited by: §1.
  • D. Froelicher, J. R. Troncoso-Pastoriza, J. L. Raisaro, M. Cuendet, J. S. Sousa, J. Fellay, and J. Hubaux (2021) Truly privacy-preserving federated analytics for precision medicine with multiparty homomorphic encryption. bioRxiv. Cited by: §1, §1, §2, §5.9.
  • C. Gentry, S. Halevi, and N. P. Smart (2012) Homomorphic evaluation of the aes circuit. In Annual Cryptology Conference, pp. 850–867. Cited by: §2.
  • N. Goud (2020) Top 5 cloud security related data breaches!. External Links: Link Cited by: §2.
  • J. Hammerbacher (27-09-2018) Personal-genome-analysis. External Links: Link Cited by: §7.1.
  • [13] (02-10-2019) Heaan. External Links: Link Cited by: §7.
  • L. Kamm, D. Bogdanov, S. Laur, and J. Vilo (2013) A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29 (7), pp. 886–893. Cited by: §2.
  • M. Kim and K. Lauter (2015) Private genome analysis through homomorphic encryption. In BMC medical informatics and decision making, Vol. 15, pp. S3. Cited by: §1, §2.
  • W. Lu, Y. Yamada, and J. Sakuma (2015) Privacy-preserving genome-wide association studies on cloud environment using fully homomorphic encryption. In BMC medical informatics and decision making, Vol. 15, pp. S1. Cited by: §4.4.
  • U.S. Department of Health and Human Services (2021) The Health Insurance Portability and Accountability Act (HIPAA). External Links: Link Cited by: §1.
  • [18] (14-10-2018) Opensnp. External Links: Link Cited by: §7.1.
  • European Parliament (2021) The EU General Data Protection Regulation (GDPR). External Links: Link Cited by: §1.
  • J. L. Raisaro, J. R. Troncoso-Pastoriza, M. Misbach, J. S. Sousa, S. Pradervand, E. Missiaglia, O. Michielin, B. Ford, and J. Hubaux (2018) MedCo: enabling secure and privacy-preserving exploration of distributed clinical and genomic data. IEEE/ACM transactions on computational biology and bioinformatics 16 (4), pp. 1328–1341. Cited by: §1, §1, §2, §5.9.
  • N. J. Risch (2000) Searching for genetic determinants in the new millennium. Nature 405 (6788), pp. 847. Cited by: §3.1.
  • [22] (02-12-2018) Rsnps. External Links: Link Cited by: §7.1.
  • M. N. Sadat, M. M. A. Aziz, N. Mohammed, F. Chen, S. Wang, and X. Jiang (2017) Safety: secure gwas in federated environment through a hybrid solution with intel sgx and homomorphic encryption. arXiv preprint arXiv:1703.02577. Cited by: §1, §2.
  • T. Schneider and O. Tkachenko (2019) EPISODE: efficient privacy-preserving similar sequence queries on outsourced genomic databases. Cited by: §4.4, §5.1.
  • J. V. Selby, A. C. Beal, and L. Frank (2012) The patient-centered outcomes research institute (pcori) national priorities for research and initial research agenda. Jama 307 (15), pp. 1583–1584. Cited by: §1.
  • X. Zhu, E. Ayday, and R. Vitenberg (2019) A privacy-preserving framework for outsourcing location-based services to the cloud. IEEE Transactions on Dependable and Secure Computing. Cited by: §2.