The Edit distance algorithm measures how similar two DNA sequences are. This similarity can then be used to find similar cancer patients across organizations, which helps in deciding the appropriate treatment. Due to privacy constraints, however, organizations are not willing to reveal their clients’ private information.
Problem Definition: In this paper, we propose an efficient privacy-preserving protocol to find the most similar patients in a database on a panel of genes, where similarity is measured by the Edit distance between a query sequence and the sequences in the database. Formally, the server and the client want to privately compute the distance between one sequence and many. The client’s input is a single query sequence, while the server holds a database of sequences. They jointly calculate the Edit distance between the query and each database sequence. At the end of the protocol, the database sequences whose distance to the query is less than a given threshold are returned to the client.
This problem is an instance of secure two-party computation, in which two parties jointly compute a function on their private inputs without disclosing anything to each other except the final output. Recently, secure computation has attracted attention in computational biology and bioinformatics as a way to preserve the privacy of biological data [8, 15, 5, 6, 13, 10]. Early approaches were based on pure (additive) Homomorphic Encryption (HE). Later work showed that protocols using generic secure computation techniques such as Yao’s garbled circuits and GMW circuits outperform HE. These protocols are based either on a combination of HE and circuit-based approaches [5, 6, 13] or on pure circuit-based techniques [7, 14, 17].
In addition, the recent optimization of Oblivious Transfer (OT), known as OT extension, is recognised as the most efficient technology in two-party settings [4, 16, 11]. For example, Demmler et al. showed that an OT-based solution for privacy-preserving identification outperforms HE and circuit-based techniques [11]. To the best of our knowledge, we are the first to propose a privacy-preserving Edit distance protocol using OT extension.
However, communication is a bottleneck in existing OT-based protocols. Schneider showed that an increase in the bit-length of the transferred data and/or in the number of required OTs leads to a significant increase in communication bandwidth [18]. Kolesnikov et al. improved OT extension for transferring short messages, such as binary messages, to address this communication overhead [16].
Since comparing two values is the main building block of the Edit distance algorithm, we reduce the problem of secure Edit distance to secure comparison and propose an Efficient Secure Comparison protocol based on Oblivious Transfer (ESCOT), in which the data transferred between the computing parties is represented in binary. We also provide a security and accuracy analysis of ESCOT. We implemented the ESCOT algorithm in Java, and the source code will be made available to reproduce the results by the time the paper is published.
In this section, we provide background on the Edit distance algorithm and Oblivious Transfer (OT).
2.1 Edit Distance
Definition: The Edit distance between two strings (sequences of characters) is the minimum number of insertions, deletions and substitutions required to transform one string into the other.
Classic distance measures such as Euclidean distance require the sequences to have equal length. Therefore, measuring the similarity between DNA sequences of different lengths requires more elaborate distance measures such as Edit distance.
In our scenario, the client owns one sequence of characters and the server owns another, possibly of a different length, with all characters drawn from a finite alphabet. The client’s sequence can be transformed into the server’s sequence by applying a series of edit operations. The Edit distance is the minimum aggregate cost necessary to perform this transformation.
The basic Edit distance algorithm is Wagner-Fischer [20], which compares two sequences exhaustively by filling a full dynamic-programming table; its complexity is O(mn) for sequences of lengths m and n, or O(n^2) when both sequences have length n. This quadratic complexity is inefficient in the cryptographic setting. Therefore, we need more sophisticated algorithms with lower complexity to make privacy-preserving genome analysis practical.
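As a plain, non-private illustration of the quadratic baseline described above, the following Python sketch computes the Edit distance with the standard dynamic-programming recurrence. The function name and the unit cost for every operation are assumptions made for this example; it is not the paper's Java implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Edit distance via the classic dynamic-programming table."""
    m, n = len(a), len(b)
    # prev[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            # This character comparison is the step that a private
            # protocol has to perform securely.
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[n]
```

Every one of the m·n table cells is evaluated, which is the source of the quadratic cost.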
Ukkonen’s algorithm [19] improves on the Wagner-Fischer algorithm by limiting the number of operations, provided that the Edit distance is less than a given threshold t. It runs in O(t · min(m, n)) time, where m and n are the sequence lengths, which improves the complexity significantly when the sequences are long.
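The thresholding idea can be sketched as a banded dynamic program: only cells within t of the main diagonal are evaluated, and the computation stops as soon as every cell in a row exceeds t. This is a simplified cutoff variant written for illustration, not necessarily the exact formulation in Ukkonen's paper; the function name and the convention of returning None when the distance exceeds t are assumptions.

```python
def bounded_edit_distance(a: str, b: str, t: int):
    """Return the Edit distance if it is at most t, otherwise None."""
    m, n = len(a), len(b)
    if abs(m - n) > t:
        return None  # the length difference alone already exceeds t
    INF = t + 1      # any value above t is clamped to t + 1
    width = 2 * t + 1
    # Band index k stores cell (i, j) with j = i - t + k.
    prev = [INF] * width
    for j in range(min(n, t) + 1):
        prev[j + t] = j  # row 0: distance from the empty prefix
    for i in range(1, m + 1):
        curr = [INF] * width
        for k in range(width):
            j = i - t + k
            if j < 0 or j > n:
                continue  # outside the table
            if j == 0:
                curr[k] = i  # column 0; reachable only while i <= t
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = prev[k] + cost                    # cell (i-1, j-1)
            if k + 1 < width:
                best = min(best, prev[k + 1] + 1)    # cell (i-1, j)
            if k > 0:
                best = min(best, curr[k - 1] + 1)    # cell (i, j-1)
            curr[k] = min(best, INF)
        if min(curr) > t:
            return None  # early termination: threshold exceeded
        prev = curr
    d = prev[n - m + t]
    return d if d <= t else None
```

Each row touches at most 2t + 1 cells, so the number of character comparisons grows with t times the sequence length rather than with the product of both lengths.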
2.2 Oblivious Transfer (OT)
In Oblivious Transfer (OT), two parties, known as the sender and the receiver, participate in the protocol. In 1-out-of-2 OT, the sender has two private messages and the receiver has a selection bit. At the end of the protocol, the receiver learns only the message indexed by its selection bit and nothing about the other message, and the sender learns nothing about the selection bit. In the generalized form, 1-out-of-n OT, the sender has n messages and the receiver has a selection value that determines the single message it obtains.
Early OT protocols rely on expensive public-key operations, whereas recent improvements, called OT extension [4, 16], allow many OTs to be executed using only symmetric operations plus a small, constant number of public-key operations, the base-OTs.
To perform the base-OTs, we can use either homomorphic encryption or the Diffie-Hellman key exchange protocol. In our implementation, we use the latter, as follows:
The sender selects a random number a and sends A = g^a to the receiver (g is the group generator). The receiver picks a random number b and calculates B = g^b (if c = 0) or B = A · g^b (if c = 1), where c is its selection bit. It then sends B to the sender. The sender calculates k_0 = H(B^a) and k_1 = H((B/A)^a), which act as the secret keys of a symmetric encryption scheme E. It then encrypts its messages x_0 and x_1 as e_0 = E(k_0, x_0) and e_1 = E(k_1, x_1) and sends the ciphertexts to the receiver. Finally, the receiver calculates k_c = H(A^b) and decrypts the desired message with this key as x_c = D(k_c, e_c). H stands for a secure hash function.
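The message flow of such a Diffie-Hellman-style base OT can be illustrated with the toy Python sketch below, which simulates both parties inside one function. Everything concrete here is an assumption chosen for illustration: the small prime modulus P, the base element G, SHA-256 as the hash, and XOR with the derived key as the symmetric cipher. The parameters are far too weak for real use, and the sketch does not reproduce the paper's Java code.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus (assumption; not secure)
G = 3           # toy base element (assumption)


def _key(element: int) -> bytes:
    """Derive a 32-byte symmetric key from a group element."""
    return hashlib.sha256(element.to_bytes(16, "big")).digest()


def _xor(key: bytes, data: bytes) -> bytes:
    """One-time-pad style encryption for messages of at most 32 bytes."""
    assert len(data) <= len(key)
    return bytes(k ^ d for k, d in zip(key, data))


def base_ot(x0: bytes, x1: bytes, c: int) -> bytes:
    """Simulate one 1-out-of-2 OT; returns the message chosen by bit c."""
    # Sender: random exponent a, sends A = G^a.
    a = secrets.randbelow(P - 2) + 1
    A = pow(G, a, P)
    # Receiver: random exponent b, sends B depending on the choice bit.
    b = secrets.randbelow(P - 2) + 1
    B = pow(G, b, P) if c == 0 else (A * pow(G, b, P)) % P
    # Sender: derives both keys and encrypts both messages.
    k0 = _key(pow(B, a, P))
    k1 = _key(pow((B * pow(A, -1, P)) % P, a, P))
    e0, e1 = _xor(k0, x0), _xor(k1, x1)
    # Receiver: can derive only the key matching its choice bit.
    kc = _key(pow(A, b, P))
    return _xor(kc, e1 if c else e0)
```

For c = 0 the receiver's key equals k0 because B^a = G^(ab) = A^b; for c = 1 it equals k1 because (B/A)^a = G^(ab) = A^b. The modular inverse via `pow(A, -1, P)` requires Python 3.8 or later.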
The security of the base-OT protocols depends directly on the security of the underlying homomorphic encryption or Diffie-Hellman protocol. In particular, the Diffie-Hellman key exchange protocol relies on the hardness of computing discrete logarithms. The results obtained from executing the base-OTs are then used to perform many OTs efficiently using lightweight symmetric operations.
The OT extension protocol proposed in [16] is a recent optimization of OT extension, especially for short messages, that supports 1-out-of-n OT in addition to 1-out-of-2 OT. We adopt this algorithm in our secure comparison protocol.
3 The Proposed Approach
In this section, we first propose our secure comparison protocol. We then introduce our algorithm for privacy-preserving Edit distance.
3.1 ESCOT Protocol
Unlike classic distance measures such as Euclidean distance, which are composed of the four basic arithmetic operations (addition, subtraction, multiplication and division), Edit distance is based on boolean comparison: it checks whether two specific characters from two separate sequences are equal (Algorithm 3.2, Line 18).
We propose a novel protocol for secure comparison based on OT, called ESCOT. Suppose the client and the server each hold a sequence over an alphabet of size n. If we encode the characters as numbers, the code values range over n distinct integers. The goal of secure comparison is to check whether a given character of the client’s sequence and a given character of the server’s sequence are equal.
In the ESCOT protocol, the client acts as the receiver and the server acts as the sender in the OT. For each character in its sequence, the sender generates n one-bit OT messages, one per possible code value: the message at the index equal to the code of the sender’s character is 1, and every other message is 0.
On the other side, the receiver uses the code of its own character as its selection value. Since the number of OT messages is n, a 1-out-of-n OT is executed. The logic of the protocol is that if the two characters are the same, then 1 is transferred; otherwise, 0 is transferred. Since each OT message is one bit long, the 1-out-of-n protocol proposed in [16] fits our problem well, as it is the most efficient protocol known today for short messages. The ESCOT protocol for comparing the two characters is described in Algorithm 1.
Correctness Analysis: The receiver obtains the message indexed by its selection value. Intuitively, if the two characters are equal, the equality condition is satisfied and the transferred message is 1. If they differ, the transferred message is 0.
Security Analysis: The security of ESCOT depends on the security of the underlying OT protocol. The receiver receives only the message corresponding to its selection value and gains no information about the other messages. The sender learns nothing about the selection value. In addition, in the Edit distance algorithm considered in this paper, only sequences that are sufficiently similar (according to the threshold) are processed to the end; if the Edit distance exceeds the threshold, the execution stops. In this way, we minimize the information leakage.
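The comparison logic can be illustrated in Python with an idealized OT that simply returns the selected message. The names `ALPHABET`, `encode`, `sender_messages`, `ideal_ot` and `escot_equal`, as well as the encoding of the DNA characters as 0 to 3, are assumptions for this sketch; a real execution would replace `ideal_ot` with a cryptographic 1-out-of-n OT so that neither party sees the other's input.

```python
ALPHABET = "ACGT"  # assumed encoding: A=0, C=1, G=2, T=3


def encode(ch: str) -> int:
    """Map a character to its numeric code."""
    return ALPHABET.index(ch)


def sender_messages(server_char: str) -> list:
    """One one-bit message per code value; 1 only at the server's own code."""
    code = encode(server_char)
    return [1 if i == code else 0 for i in range(len(ALPHABET))]


def ideal_ot(messages: list, selection: int) -> int:
    """Idealized 1-out-of-n OT: the receiver gets exactly one message."""
    return messages[selection]


def escot_equal(client_char: str, server_char: str) -> int:
    """Return 1 if the two characters are equal and 0 otherwise."""
    return ideal_ot(sender_messages(server_char), encode(client_char))
```

For example, `sender_messages("G")` is `[0, 0, 1, 0]`, so a client holding `G` selects index 2 and receives 1, while a client holding any other character receives 0.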
Communication Analysis: For each comparison, the communication bandwidth takes bits where is security parameter. Edit distance algorithm requires number of comparisons so the consumption of the bandwidth is bits in our algorithm.
3.2 Private Edit Distance based on ESCOT:
In Algorithm 2, we combine Ukkonen’s Edit distance algorithm [19] with our ESCOT protocol to obtain a privacy-preserving Edit distance protocol.
To calculate Edit distance between two sequences, ESCOT protocol executes times, where is the distance threshold and is the maximum length of the input sequences.
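To make the combination concrete, the Python sketch below runs a thresholded, banded Edit distance in which every character comparison goes through a stand-in for one ESCOT execution, and counts how many such executions occur. It is an illustrative simulation under stated assumptions, not the paper's Algorithm 2: the OT is idealized, both sequences live in one process, and the names `banded_distance` and `private_edit_distance` are hypothetical.

```python
ALPHABET = "ACGT"  # assumed DNA alphabet, so n = 4 in 1-out-of-n OT


def banded_distance(m: int, n: int, t: int, equal):
    """Thresholded Edit distance; equal(i, j) answers each comparison."""
    if abs(m - n) > t:
        return None
    INF = t + 1
    prev = {j: j for j in range(min(n, t) + 1)}  # row 0 inside the band
    for i in range(1, m + 1):
        curr = {}
        if i <= t:
            curr[0] = i
        for j in range(max(1, i - t), min(n, i + t) + 1):
            cost = 0 if equal(i - 1, j - 1) else 1
            curr[j] = min(prev.get(j - 1, INF) + cost,
                          prev.get(j, INF) + 1,
                          curr.get(j - 1, INF) + 1,
                          INF)
        if min(curr.values()) > t:
            return None  # stop early once the threshold is exceeded
        prev = curr
    d = prev.get(n, INF)
    return d if d <= t else None


def private_edit_distance(client_seq: str, server_seq: str, t: int):
    """Return (distance or None, number of simulated ESCOT executions)."""
    calls = 0

    def equal(i: int, j: int) -> int:
        nonlocal calls
        calls += 1
        # Stand-in for one ESCOT run: the server prepares n one-bit
        # messages and the client selects one with its character code.
        server_code = ALPHABET.index(server_seq[j])
        messages = [1 if code == server_code else 0
                    for code in range(len(ALPHABET))]
        return messages[ALPHABET.index(client_seq[i])]

    return banded_distance(len(client_seq), len(server_seq), t, equal), calls
```

Because only cells inside the band are visited, the number of simulated ESCOT executions is bounded by (2t + 1) times the length of the client's sequence, and dissimilar sequences are abandoned before the last row.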
Threat Model: Our threat model is semi-honest, which means that both the sender and the receiver follow the protocol specification faithfully and do not alter their messages in order to obtain private information from the other party. The only information they learn is the result of the comparison.
Security Parameters: We evaluate our approach with different public-key security parameters for the base-OTs. The symmetric security parameter, which determines the number of base-OTs, is set to 80 or 128.
Dataset: We evaluate our proposed approach using a genome database released by the “iDASH Security and Privacy Workshop 2016” [1]. Briefly, the server holds a database of 50 different sequences and the client holds one sequence to be evaluated against all the sequences in the database. The sequences are on average 3500 characters long and are drawn from the four-letter DNA alphabet. Therefore, the value of n in 1-out-of-n OT is 4.
Setup: We evaluate our approach with respect to execution time and communication bandwidth. The goal of the performance evaluation is to show the feasibility of the ESCOT protocol in real-world genome matching scenarios. We implemented the framework in Java; the client and the server communicate through sockets. We run the framework over both LAN and WAN networks. For the LAN setting, we use the virtual machines (VMs) provided by the “iDASH Security and Privacy Workshop 2016”, so that we can provide a fair comparison with state-of-the-art work that ran its experiments on the same VMs. For the WAN setting, we use an intercontinental cloud setting and perform the experiments on two free-tier Amazon instances, each with a 64-bit dual-core Intel Xeon CPU at 2.8 GHz and 3.75 GB of RAM. The client and the server are located in Oregon and Tokyo respectively. Evaluation over WAN gives a better approximation of real-world scenarios. All reported results are averages over 10 execution rounds.
Results: We set the Edit distance threshold to 60, 80 and 100. The goal is to return the sequences whose distance to the client’s sequence is less than or equal to the threshold. Increasing the threshold increases the complexity. Experimental results for different public-key and symmetric security parameters are shown in Figure 1. Figures (a) and (b) show the running time in seconds on the LAN and WAN networks respectively, while Figure (c) shows the bandwidth consumption in KB. Execution time varies from 8 to 38 seconds on LAN and from 45 to 75 seconds on WAN. The execution time on WAN is higher due to network latency. Bandwidth depends directly on the symmetric security parameter, i.e., the number of base-OTs; as shown in Figure (c), the public-key security parameter does not affect the communication. The bandwidth varies from 18 to 40 MB.
Analysis: The private Edit distance protocols proposed in  and  run in 516 and 23 seconds respectively on the same dataset and on the same VMs with the baseline security parameters, whereas our proposed protocol runs in only 8 seconds with the same configuration. The other advantage of our approach over  is that the ESCOT protocol calculates the exact Edit distance, while the other work approximates it.
4.1 Related Work
Shantanu and Boufounos proposed an approach to calculate Edit distance using HE . They reduced the problem to privacy-preserving minimum finding protocol that should be executed times ( and are the length of the input sequences). Huang et al. proposed a protocol to calculate Edit distance based on Garbled circuits .
In this paper, we proposed an efficient solution for privacy-preserving Edit distance that, for the first time, uses OT extension, without losing accuracy. To do this, we proposed the ESCOT protocol for boolean comparison, based on 1-out-of-n OT and inspired by recent advances in secure two-party computation and Oblivious Transfer. We evaluated our approach on a genome dataset released by iDASH 2016 [1]. The experimental results confirm the efficiency of our approach compared with state-of-the-art efforts for privacy-preserving Edit distance.
- [1] In GENOPRI WORKSHOP. http://www.humangenomeprivacy.org/2016/competition-tasks.html, 2016.
- [2] C. Aguilar-Melchor et al. Recent advances in homomorphic encryption. Signal Processing Magazine, 2013.
- [3] M. M. Al Aziz et al. Secure approximation of edit distance on genomic data. BMC Medical Genomics, 2017.
- [4] G. Asharov et al. More efficient oblivious transfer and extensions for faster secure computation. In CCS. ACM, 2013.
- [5] M. Barni et al. Privacy-preserving fingercode authentication. In Multimedia and security. ACM, 2010.
- [6] M. Blanton and P. Gasti. Secure and efficient protocols for iris and fingerprint identification. In ESORICS. Springer, 2011.
- [7] J. Bringer et al. Faster secure computation for biometric identification using filtering. In ICB. IEEE, 2012.
- [8] G. S. Cetin et al. Private queries on encrypted genomic data. BMC Medical Genomics, 2017.
- [9] T. Chou and C. Orlandi. The simplest protocol for oblivious transfer. In Cryptology and Information Security in Latin America. Springer, 2015.
- [10] H. Chun et al. Outsourceable two-party privacy-preserving biometric authentication. In ICCS. ACM, 2014.
- [11] D. Demmler et al. ABY - a framework for efficient mixed-protocol secure two-party computation. In NDSS, 2015.
- [12] Z. Erkin et al. Privacy-preserving face recognition. In PETS. Springer, 2009.
- [13] D. Evans et al. Efficient privacy-preserving biometric identification. In NDSS, 2011.
- [14] Y. Huang et al. Faster secure two-party computation using garbled circuits. In USENIX, 2011.
- [15] M. Kim et al. Secure searching of biomarkers through hybrid homomorphic encryption scheme. BMC Medical Genomics, 2017.
- [16] V. Kolesnikov and R. Kumaresan. Improved OT extension for transferring short secrets. In CRYPTO. Springer, 2013.
- [17] Y. Luo et al. An efficient protocol for private iris-code matching by means of garbled circuits. In Image Processing. IEEE, 2012.
- [18] T. Schneider. ABY - a framework for efficient mixed-protocol secure two-party computation. In Securing Computation Workshop. EC SPRIDE, 2015.
- [19] E. Ukkonen. Algorithms for approximate string matching. Information and Control, 1985.
- [20] R. A. Wagner. On the complexity of the extended string-to-string correction problem. In Theory of computing. ACM, 1975.
- [21] X. S. Wang et al. Efficient genome-wide, privacy-preserving similar patient query based on private edit distance. In CCS. ACM, 2015.