1 Introduction
The Edit distance algorithm measures how similar two DNA sequences are. This similarity can then be used to find similar cancer patients across organizations, which helps in deciding on an appropriate treatment. Due to privacy constraints, however, organizations are not willing to reveal their clients’ private information.
Problem Definition: In this paper, we propose an efficient privacy-preserving protocol to find the patients in a database who are most similar to a query on a panel of genes, where similarity is measured by the Edit distance between the query sequence and the sequences in the database. Formally, the server and client want to privately compute the distance between one sequence and $N$ sequences. The client’s input is a sequence $x$, while the server holds sequences $y_1, \dots, y_N$. They jointly calculate the Edit distance between $x$ and each $y_i$. At the end of the protocol, the $y_i$ whose distance to $x$ is less than a threshold $t$ are returned to the client.
This problem is an instance of secure two-party computation, in which two parties jointly compute a function on their private inputs without disclosing their data to each other, except for the final output. Recently, secure computation has attracted attention in computational biology and bioinformatics as a way to preserve the privacy of biological data [8, 15, 5, 6, 13, 10]. Early approaches were based on pure (additive) Homomorphic Encryption (HE), e.g., [12]. Later work showed that protocols using generic secure computation techniques such as Yao’s garbled circuits and GMW circuits outperform HE. These protocols are based on either a combination of HE and circuit-based approaches [5, 6, 13] or pure circuit-based techniques [7, 14, 17].
On the other hand, a recent optimization of Oblivious Transfer (OT), known as OT extension, is recognised as the most efficient technology in two-party settings [4, 16, 11]. For example, Demmler et al. showed that an OT-based solution for privacy-preserving identification outperforms HE and circuit-based techniques [11]. To the best of our knowledge, we are the first to propose a privacy-preserving Edit distance protocol using OT extension.
However, communication is a bottleneck in existing OT-based protocols. Schneider showed that an increase in the bit length of the transferred data and/or in the number of required OTs leads to a significant increase in communication bandwidth [18]. To address this communication overhead, Kolesnikov et al. improved OT extension for transferring short messages such as binary messages [16].
As the comparison of two values is the main building block of the Edit distance algorithm, we reduce the problem of secure Edit distance to secure comparison and propose an Efficient Secure Comparison protocol based on Oblivious Transfer (ESCOT), in which the data transferred between the computing parties is represented in binary. We also provide a security and accuracy analysis of ESCOT. We implemented the ESCOT algorithm in Java, and the source code will be available to reproduce the results by the time the paper is published.
2 Background
In this section, we provide background on the Edit distance algorithm and Oblivious Transfer (OT).
2.1 Edit Distance
Definition: The Edit distance between two strings, or sequences of characters, $x$ and $y$ is the minimum number of insertions, deletions and substitutions required to transform $x$ into $y$.
In classic distance measures like Euclidean distance, the sequences are required to have equal length. Therefore, for measuring the similarity between DNA sequences with different lengths, more complicated distance measures like Edit distance are needed.
In our scenario, the client owns a sequence $x$ of $|x|$ characters and the server has a sequence $y$ of $|y|$ characters, where the characters belong to a finite alphabet set. $x$ can be transformed into $y$ by applying the Edit distance algorithm. The Edit distance is the minimum aggregate cost necessary to perform this transformation.
The basic Edit distance algorithm is Wagner-Fischer [20], which compares two sequences in a brute-force manner; its complexity is $O(|x| \cdot |y|)$, or $O(|x|^2)$ when the two sequences have equal length. This quadratic complexity is inefficient in the cryptography domain. Therefore, we need more sophisticated algorithms with lower complexity to support privacy-preserving genome analysis.
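As an illustration of the quadratic dynamic programme described above, the following is a minimal plaintext sketch with unit costs for insertion, deletion and substitution. The function name `edit_distance` is ours and the code involves no cryptography; it only shows where the character comparison sits in the recurrence.

```python
def edit_distance(x: str, y: str) -> int:
    """Wagner-Fischer style dynamic programme, O(len(x) * len(y)) time."""
    # prev[j] holds the distance between the empty prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # transforming x[:i] into the empty string costs i deletions
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1  # the single character comparison
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]
```

Every cell of the table triggers exactly one equality test between a character of `x` and a character of `y`, which is why the number of cells directly drives the cost of any secure variant.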
Ukkonen’s algorithm [19] improves on the Wagner-Fischer algorithm by limiting the number of operations, provided that the Edit distance is less than a given threshold $t$. It runs in $O(t \cdot \min(|x|, |y|))$ time, which improves the complexity significantly when the sequences are long.
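The thresholding idea can be sketched as a banded dynamic programme that fills only the cells within $t$ of the main diagonal and stops as soon as a whole row exceeds the threshold. This is a simplified cutoff variant in the spirit of [19], not a transcription of Ukkonen’s algorithm; the name `bounded_edit_distance` and the convention of returning `None` above the threshold are our assumptions.

```python
from typing import Optional


def bounded_edit_distance(x: str, y: str, t: int) -> Optional[int]:
    """Return the Edit distance if it is at most t, otherwise None."""
    if abs(len(x) - len(y)) > t:
        return None  # the length difference alone already exceeds t
    inf = t + 1  # any value above t is treated as "too far"
    prev = [j if j <= t else inf for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [inf] * (len(y) + 1)
        if i <= t:
            curr[0] = i
        # Only cells with |i - j| <= t can hold a value of at most t.
        for j in range(max(1, i - t), min(len(y), i + t) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost, inf)
        if min(curr) > t:
            return None  # every path through this row is already too costly
        prev = curr
    return prev[-1] if prev[-1] <= t else None
```

Each row touches at most $2t + 1$ cells, so the number of character comparisons grows with the threshold rather than with the product of the two lengths.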
2.2 Oblivious Transfer (OT)
In Oblivious Transfer (OT), two parties, known as the sender and the receiver, participate in the protocol. In 1-out-of-2 OT, the sender has two private messages ($m_0$, $m_1$) and the receiver has a selection bit $c$. At the end of the protocol, the receiver learns only $m_c$ and learns no information about $m_{1-c}$, and the sender learns nothing about $c$. In its generalized form, i.e., 1-out-of-n OT, the sender has $n$ messages while the receiver has a selection value $c$ to obtain $m_c$.
Early OT-based protocols rely on expensive public-key operations, while recent improvements of OT, called OT extension [4, 16], allow many OTs to be executed using only symmetric operations together with a small, constant number ($\kappa$) of public-key operations, the base OTs.
To perform base OTs, we can use either homomorphic encryption or the Diffie-Hellman key exchange protocol [9]. In our implementation, we use the latter, as follows:
The sender selects a random number $a$ and sends $A = g^a$ to the receiver ($g$ is the group generator). The receiver picks a random number $b$ and calculates $B = g^b$ (if $c = 0$) or $B = A g^b$ (if $c = 1$). The receiver then sends $B$ to the sender. The sender calculates $k_0 = H(B^a)$ and $k_1 = H((B/A)^a)$, such that $k_0$ and $k_1$ act as the secret keys in a symmetric encryption $E$. Then, the sender encrypts its messages, $e_0 = E_{k_0}(m_0)$ and $e_1 = E_{k_1}(m_1)$, and sends the ciphertexts to the receiver. The receiver calculates $k_c = H(A^b)$ and decrypts the desired message with its key as $m_c = D_{k_c}(e_c)$. $H$ stands for a secure hash function.
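The message flow of this Diffie-Hellman based base OT, following the protocol of [9], can be walked through in a toy simulation. Everything concrete here is an assumption made for illustration only: the modulus `P` and base `G` are far too small for real security, SHA-256 stands in for $H$, XOR with the derived key stands in for the symmetric encryption $E$, and both roles run inside one function `base_ot` rather than over a network.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus (assumption, not a secure group)
G = 3           # toy base element (assumption)


def H(value: int) -> bytes:
    """Hash a group element to a 32-byte key (SHA-256 as a stand-in for H)."""
    return hashlib.sha256(value.to_bytes(16, "big")).digest()


def xor(key: bytes, msg: bytes) -> bytes:
    """One-time-pad style stand-in for E and D; msg must be at most 32 bytes."""
    return bytes(k ^ m for k, m in zip(key, msg))


def base_ot(m0: bytes, m1: bytes, c: int) -> bytes:
    """Simulate one 1-out-of-2 OT and return what the receiver decrypts."""
    # Sender: pick a and publish A = g^a.
    a = secrets.randbelow(P - 2) + 1
    A = pow(G, a, P)
    # Receiver: pick b and send B = g^b (c = 0) or B = A * g^b (c = 1).
    b = secrets.randbelow(P - 2) + 1
    B = pow(G, b, P) if c == 0 else (A * pow(G, b, P)) % P
    # Sender: derive both keys and encrypt both messages.
    k0 = H(pow(B, a, P))
    k1 = H(pow(B * pow(A, -1, P) % P, a, P))  # (B / A)^a
    e0, e1 = xor(k0, m0), xor(k1, m1)
    # Receiver: derive the single key it can compute and decrypt e_c.
    kc = H(pow(A, b, P))
    return xor(kc, (e0, e1)[c])
```

For $c = 0$ the receiver’s key $H(A^b)$ equals $H(B^a)$, and for $c = 1$ it equals $H((B/A)^a)$, so exactly one ciphertext decrypts correctly. The modular inverse via `pow(A, -1, P)` requires Python 3.8 or later.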
The security of base OT protocols depends directly on the security of the underlying homomorphic encryption or Diffie-Hellman protocol. In particular, the Diffie-Hellman key exchange protocol is secure due to the hardness of computing discrete logarithms. The results obtained from executing the base OTs are then used to perform many OTs efficiently using lightweight symmetric operations.
The OT extension protocol proposed in [16] is a recent optimization of OT extension, especially for short messages, that supports 1-out-of-n OT in addition to 1-out-of-2 OT. We adopt this algorithm in our secure comparison protocol.
3 The Proposed Approach
In this section, we first propose our secure comparison protocol. We then introduce our algorithm for privacy-preserving Edit distance.
3.1 ESCOT Protocol
Unlike classic distance measures like Euclidean distance, which are composed of the four basic mathematical operations ($+, -, \times, \div$), Edit distance is based on boolean comparison. It checks whether two specific characters from two separate sequences are equal (Algorithm 3.2, Line 18).
We propose a novel protocol for secure comparison based on OT, called ESCOT. For example, let $x$ and $y$ be the client’s and the server’s sequences, respectively, over an alphabet set of size $n$. If we encode the characters as numbers, each character takes one of $n$ code values. The goal of secure comparison is to check whether $x_i$ and $y_j$ are equal, where $x_i$ and $y_j$ are characters of $x$ and $y$.
In the ESCOT protocol, the client acts as the receiver and the server acts as the sender in OT. The sender generates $n$ OT messages for each character $y_j$ in its sequence, one message $m_k$ per code value $k$, as follows:
$$m_k = \begin{cases} 1 & \text{if } k = y_j \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
On the other side, the receiver uses the value of $x_i$ as its selection value. Since the number of OT messages is $n$, execution of 1-out-of-n OT is required. The logic of this protocol is that if $x_i$ and $y_j$ are the same, then 1 is transferred; otherwise, 0 is transferred. Since the length of the OT messages is one bit, the 1-out-of-n protocol proposed in [16] is well suited to our problem, as it is the most efficient protocol known today for short messages. The ESCOT protocol to compare $x_i$ and $y_j$ is described in Algorithm 1.
Correctness Analysis: The message corresponding to the selection value is transferred ($m_{x_i}$). Intuitively, if $x_i = y_j$ then the condition is satisfied and the message is 1. If $x_i \neq y_j$ then the transferred message is 0.
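The message construction and the correctness argument can be checked with a small plaintext model. This is a sketch under stated assumptions, not the authors’ Algorithm 1: characters are assumed to be encoded as integers $0, \dots, n-1$ (the `CODE` table is hypothetical), and `ideal_ot_1_of_n` simply models what 1-out-of-n OT delivers to the receiver, without any of the cryptography of [16].

```python
from typing import List

# Hypothetical encoding of a four-letter alphabet as codes 0..3 (assumption).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def escot_messages(y_char: int, n: int) -> List[int]:
    """Sender side: one 1-bit message per code value, 1 only at the sender's own code."""
    return [1 if k == y_char else 0 for k in range(n)]


def ideal_ot_1_of_n(messages: List[int], choice: int) -> int:
    """Model of 1-out-of-n OT: the receiver obtains only messages[choice].

    A real protocol would also hide `choice` from the sender and the other
    messages from the receiver; this stand-in only reproduces the output.
    """
    return messages[choice]


def escot_equal(x_char: int, y_char: int, n: int) -> int:
    """Receiver side: select with its own code and read the equality bit."""
    return ideal_ot_1_of_n(escot_messages(y_char, n), x_char)
```

For example, `escot_equal(CODE["G"], CODE["G"], 4)` selects the one message that is set, while any other selection value lands on a zero.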
Security Analysis: The security of ESCOT depends on the security of the underlying OT protocol. The receiver only receives the message corresponding to its selection value and gains no information about the other messages. The sender does not learn anything about the selection value. In addition, in the Edit distance algorithm we consider in this paper, only sequences with enough similarity (based on the threshold $t$) are processed to the end; if the Edit distance exceeds the threshold, the execution stops. This way, we can minimize the information leakage.
Communication Analysis: The communication bandwidth of each comparison, in bits, is determined by the security parameter $\kappa$. The Edit distance algorithm requires many such comparisons, so the total bandwidth consumed by our algorithm is the per-comparison cost multiplied by the number of comparisons.
3.2 Private Edit Distance based on ESCOT
In Algorithm 2, we combine Ukkonen’s Edit distance algorithm [19] with our ESCOT protocol to address privacy preserving Edit distance.
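To make the combination concrete, the sketch below runs a thresholded, banded Edit distance in which every character comparison goes through a callback that stands in for one ESCOT execution, and counts how many such executions a database query would need. It is not the authors’ Algorithm 2: the names `thresholded_distance` and `private_query` are hypothetical, the callback compares plaintext characters, and sequences are kept when their distance is at most the threshold.

```python
from typing import Callable, List, Optional, Tuple


def thresholded_distance(m: int, n: int, t: int,
                         equal: Callable[[int, int], int]) -> Optional[int]:
    """Banded Edit distance over lengths m and n; equal(i, j) returns 1 or 0."""
    if abs(m - n) > t:
        return None
    inf = t + 1
    prev = [j if j <= t else inf for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [inf] * (n + 1)
        if i <= t:
            curr[0] = i
        for j in range(max(1, i - t), min(n, i + t) + 1):
            cost = 1 - equal(i - 1, j - 1)  # one secure comparison per cell
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost, inf)
        if min(curr) > t:
            return None  # stop early once the threshold is exceeded
        prev = curr
    return prev[n] if prev[n] <= t else None


def private_query(x: str, database: List[str],
                  t: int) -> Tuple[List[Tuple[int, int]], int]:
    """Return (index, distance) for sequences within t, plus the comparison count."""
    calls = 0

    def make_equal(y: str) -> Callable[[int, int], int]:
        def equal(i: int, j: int) -> int:
            nonlocal calls
            calls += 1
            return 1 if x[i] == y[j] else 0  # placeholder for one ESCOT run
        return equal

    hits = []
    for idx, y in enumerate(database):
        d = thresholded_distance(len(x), len(y), t, make_equal(y))
        if d is not None:
            hits.append((idx, d))
    return hits, calls
```

In this model the party running the dynamic programme sees only the equality bits, and the comparison counter shows how the threshold bounds the number of OT executions per sequence.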
4 Experiments
To calculate the Edit distance between two sequences, the ESCOT protocol is executed a number of times that depends on the distance threshold $t$ and on the maximum length of the input sequences.
Threat Model: Our threat model is semi-honest, which means that both sender and receiver follow the protocol specification accurately and do not alter their messages in order to obtain private information from the other party. The only information they learn is the result of the comparison.
Security Parameters: We evaluate our approach with different public-key security parameters for the base OTs. The symmetric security parameter, which determines the number of base OTs, is set to 80 or 128.
Dataset: We evaluate our proposed approach using a genome database released by the “iDASH Security and Privacy Workshop 2016” [1]. Briefly, the server holds a database of 50 different sequences and the client holds one sequence to be evaluated against all the sequences in the database. The average length of the sequences is 3500 characters, drawn from the four-letter DNA alphabet $\{A, C, G, T\}$. Therefore, the value of $n$ in 1-out-of-n OT is 4.
Setup: We evaluate our approach with respect to execution time and communication bandwidth. The goal of the performance evaluation is to show the feasibility of the ESCOT protocol in real-world genome matching scenarios. We implement the framework in Java, with the client and server communicating through sockets. We run the framework over both LAN and WAN networks. For the LAN setting, we use the VMs provided by the “iDASH Security and Privacy Workshop 2016”, so that we can provide a fair comparison with the state-of-the-art work [3], which ran its experiments on the same VMs. For WAN, we use an intercontinental cloud setting and perform the experiments on two free-tier Amazon instances with a 64-bit Intel Xeon dual-core CPU at 2.8 GHz and 3.75 GB RAM. The client and server are located in Oregon and Tokyo, respectively. Evaluation over WAN gives us a better approximation of real-world scenarios. All reported results are averaged over 10 execution rounds.
Results: We set the Edit distance threshold to 60, 80 and 100. The goal is to return the sequences whose distance to the client sequence is less than or equal to the threshold. As expected, increasing the threshold increases the complexity. Experimental results for different symmetric and public-key security parameters are shown in Figure 1. Figures 1(a) and 1(b) show the running time in seconds on the LAN and WAN networks, respectively, while Figure 1(c) shows the bandwidth consumption in KB. Execution time varies from 8 to 38 seconds on LAN and from 45 to 75 seconds on WAN. The execution time on WAN is higher due to network latency. Bandwidth depends directly on the symmetric security parameter, i.e., the number of base OTs; as shown in Figure 1(c), the public-key security parameter does not affect the communication. The bandwidth varies from 18 to 40 MB.
Analysis: The private Edit distance protocols proposed in [21] and [3] execute in 516 and 23 seconds, respectively, on the same dataset and the same VMs with the baseline security parameters. In contrast, our proposed protocol runs in only 8 seconds with the same configuration. The other advantage of our approach over [3] is that the ESCOT protocol calculates the exact Edit distance, while the other work approximates it.
4.1 Related Work
Shantanu and Boufounos proposed an approach to calculate Edit distance using HE [2]. They reduced the problem to a privacy-preserving minimum-finding protocol that must be executed a number of times determined by the lengths of the two input sequences. Huang et al. proposed a protocol to calculate Edit distance based on garbled circuits [14].
5 Conclusion
In this paper, we proposed an efficient solution for privacy-preserving Edit distance that, for the first time, uses OT extension without losing accuracy. To do this, we proposed the ESCOT protocol for boolean comparison based on 1-out-of-n OT, inspired by recent advances in secure two-party computation and Oblivious Transfer. We evaluated our approach on a genome dataset released by iDASH 2016 [1]. The experimental results confirm the efficiency of our approach over state-of-the-art efforts for privacy-preserving Edit distance.
References
 [1] In GENOPRI WORKSHOP. http://www.humangenomeprivacy.org/2016/competitiontasks.html, 2016.
 [2] C. Aguilar-Melchor et al. Recent advances in homomorphic encryption. Signal Processing Magazine, 2013.
 [3] M. M. Al Aziz et al. Secure approximation of edit distance on genomic data. BMC Medical Genomics, 2017.
 [4] G. Asharov et al. More efficient oblivious transfer and extensions for faster secure computation. In CCS. ACM, 2013.
 [5] M. Barni et al. Privacy-preserving fingercode authentication. In Multimedia and security. ACM, 2010.
 [6] M. Blanton and P. Gasti. Secure and efficient protocols for iris and fingerprint identification. In ESORICS. Springer, 2011.
 [7] J. Bringer et al. Faster secure computation for biometric identification using filtering. In ICB. IEEE, 2012.
 [8] G. S. Cetin et al. Private queries on encrypted genomic data. BMC Medical Genomics, 2017.
 [9] T. Chou and C. Orlandi. The simplest protocol for oblivious transfer. In Cryptology and Information Security in Latin America. Springer, 2015.
 [10] H. Chun et al. Outsourceable two-party privacy-preserving biometric authentication. In ICCS. ACM, 2014.
 [11] D. Demmler et al. ABY - a framework for efficient mixed-protocol secure two-party computation. In NDSS, 2015.
 [12] Z. Erkin et al. Privacy-preserving face recognition. In PETS. Springer, 2009.
 [13] D. Evans et al. Efficient privacy-preserving biometric identification. In NDSS, 2011.
 [14] Y. Huang et al. Faster secure two-party computation using garbled circuits. In USENIX, 2011.
 [15] M. Kim et al. Secure searching of biomarkers through hybrid homomorphic encryption scheme. BMC Medical Genomics, 2017.
 [16] V. Kolesnikov and R. Kumaresan. Improved OT extension for transferring short secrets. In CRYPTO. Springer, 2013.
 [17] Y. Luo et al. An efficient protocol for private iris-code matching by means of garbled circuits. In Image Processing. IEEE, 2012.
 [18] T. Schneider. ABY - a framework for efficient mixed-protocol secure two-party computation. In Securing Computation Workshop. EC SPRIDE, 2015.
 [19] E. Ukkonen. Algorithms for approximate string matching. Information and control, 1985.
 [20] R. A. Wagner. On the complexity of the extended string-to-string correction problem. In Theory of computing. ACM, 1975.
 [21] X. S. Wang et al. Efficient genome-wide, privacy-preserving similar patient query based on private edit distance. In CCS. ACM, 2015.