Since the beginning of 2020, the emergence of COVID-19 has caused a worldwide pandemic. Many governments and companies are developing various measures and technologies to prevent the spread of the virus (wang2020response; qin2020dysregulation; salathe2020covid; cho2020contact). At present, contact tracing is expected to be a powerful countermeasure for controlling the spread of infection. The effectiveness of contact tracing has already been shown by several previous studies (ferretti2020quantifying; Reichert2020PrivacyPreservingCT; tang2020privacy; brack2020decentralized). However, conducting effective contact tracing often requires collecting citizens’ personal information, such as locations (privatedata) or telephone numbers (thorneloe2020scoping), which raises ethical issues and serious privacy violations (jensen2009location). Therefore, acceptable private contact tracing (PCT) is urgently needed.
Recently, Bluetooth-based private contact tracing has been intensively studied (troncoso2020decentralized; trieu2020epione; rivest2020pact; becker2019tracking; gvili2020security). Decentralized privacy-preserving proximity tracing (DP3T) (troncoso2020decentralized), an open protocol for PCT using Bluetooth Low Energy beacons, is already used in applications developed around the world. To strongly protect users’ privacy, it uses only the contact (proximity) history detected by the Bluetooth Low Energy beacons. The base mechanism of DP3T is that the application uses the smartphone’s Bluetooth signal to broadcast a random ID that does not include sensitive information such as the user’s identity or location, and nearby smartphones receive and store these IDs for a limited time. Users who are later found to be infected with the coronavirus send a report to the server that includes the random IDs they have generated. Meanwhile, every user routinely checks whether the random IDs received from devices they have encountered in the past have been uploaded to the server. There are also similar methods that adopt a decentralized architecture, such as Epione (trieu2020epione), the PACT protocol (rivest2020pact), CEN (becker2019tracking; cengithub), and the Google and Apple specifications (gvili2020security).
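The matching idea behind such decentralized protocols can be sketched in a few lines. This is only an illustration of the mechanism described above, not the DP3T specification: the `Device` class and its methods are hypothetical, and the sketch draws IDs uniformly at random instead of using the key schedule that real deployments rely on.

```python
import secrets

class Device:
    """Hypothetical handset: broadcasts random IDs and remembers the ones it hears."""

    def __init__(self):
        self.sent_ids = []      # random IDs this device has broadcast
        self.heard_ids = set()  # random IDs received from nearby devices

    def broadcast(self):
        rid = secrets.token_bytes(16)  # carries no identity or location
        self.sent_ids.append(rid)
        return rid

    def receive(self, rid):
        self.heard_ids.add(rid)

    def at_risk(self, published_ids):
        # Matching is done locally; the server never sees heard_ids.
        return not self.heard_ids.isdisjoint(published_ids)

alice, bob, carol = Device(), Device(), Device()
bob.receive(alice.broadcast())   # Alice and Bob were in proximity
carol.receive(bob.broadcast())   # Bob and Carol were in proximity

# Alice is diagnosed and reports the IDs she generated.
server_published = set(alice.sent_ids)
```

In this sketch only Bob matches a published ID. Carol, who met Bob but never Alice, does not, which is the direct-contact-only behavior of this family of protocols.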
However, Bluetooth-based private contact tracing has several limitations in terms of functionality and flexibility. First, Bluetooth-based PCT detects only direct contact (i.e., human-human contact) and cannot detect indirect contact (i.e., human-object contact, such as using the same elevator shortly after a COVID-19 patient used it). The Centers for Disease Control and Prevention (CDC) in the US has stated that a person may contract COVID-19 by touching a surface or object that has the virus on it and then touching their own mouth, nose, or eyes (cdcreport), even without direct contact with a COVID-19 patient. Second, Bluetooth-based PCT lacks flexibility in determining the rule of “risky contact”. Essentially, the rule is hard-wired into the Bluetooth device, since a risky contact is implicitly defined as two devices being within each other’s signal range. In practice, whether a contact is risky varies with the environment and the nature of the virus. The rules of risky contact for COVID-19 have been updated as the understanding of the virus has improved (kgwreport). For example, at the beginning of the pandemic, professionals believed that transmission took place only through direct human-human contact; more recently, it has been argued that airborne transmission should also be taken into account (tjtreport). Moreover, recent reviews have pointed out the limitations of current PCT applications (nanni2020give; 10.1145/3431843.3431845), namely the inability to detect infections that do not involve direct contact and the limitations of radio signals for contact detection (ferretti2020quantifying; abbas2020covid).
In this work, we propose secure and efficient trajectory-based private contact tracing that enables both direct and indirect contact tracing. By comparing the trajectory data of a user with those of infected patients, we can check whether the user visited the “infected locations” within a certain time period. The rule of risky contact can be flexibly defined according to the condition of a location and the nature of the virus. We identify the following four requirements for trajectory-based private contact tracing.
Efficiency: The query throughput that can be handled by the central server is crucial.
Security: A client’s trajectory data must be protected from the server and any other clients. In addition, nothing about the server-side data should be disclosed to the client except the query result.
Flexibility: The rule of risky contact should be flexibly changeable.
Accuracy: The server must return the true result because the result is so sensitive and can significantly affect the users.
As shown in Figure 1, we assume that the health agency (e.g., the government or an official healthcare institute) registers the trajectory data of confirmed COVID-19 patients (these data are encrypted or released with the consent of the patients) on a server that is untrusted by clients (i.e., queriers). The server receives queries and encrypted personal trajectories from clients and returns a Boolean value indicating whether there is a risky contact, by privately computing the intersection between the server and client trajectories.
Although the problem of trajectory-based PCT is similar to the well-studied problem of private set intersection (PSI), existing PSI approaches cannot satisfy all four of the above requirements. PSI allows two (or more) parties to collaboratively compute the intersection of their private sets while disclosing nothing about the private data to the other party except the existence of an intersection or the intersection result. Existing PSI techniques are mostly based on cryptographic primitives. State-of-the-art cryptography-based PSI approaches, such as those based on oblivious transfer (rindal2017malicious) or homomorphic encryption (chen2017fast), are limited in efficiency and still suffer from performance problems (narayanan2011location; Reichert2020PrivacyPreservingCT) under medium or large workloads. This is mainly due to their heavy use of time-consuming cryptographic primitives. Recently, approaches based on secure hardware (such as Intel SGX or ARM TrustZone) have received increasing attention. Secure hardware makes it possible to create a Trusted Execution Environment (TEE) (sabt2015trusted; costan2016intel), which is used to speed up secure computation on an untrusted party. Tamrakar et al. (tamrakar2017circle) proposed the first efficient TEE-based PSI. Although it is efficient, it does not satisfy our accuracy requirement because its probabilistic data structures introduce a nonzero false positive rate, and it does not consider flexibility. We summarize the comparison with our work in Table 1, which also includes (Reichert2020PrivacyPreservingCT), an MPC-based private contact tracing system using trajectories. In the table, functionality means the capability to detect indirect contact.
Our contributions in this paper are threefold. First, we formulate the problem of trajectory-based PCT. We show that our problem is a generalization of the well-studied private proximity testing (narayanan2011location) and private set intersection (PSI). Our formulation is parametrized in both time and space and can be used in general settings. We name this formulation Spatiotemporal Private Set Intersection. Second, we propose PCT-TEE, a TEE-based system architecture with efficient algorithms for trajectory-based PCT. In addition to satisfying the abovementioned requirements, a challenge in designing the TEE-based algorithm is the limited size of the secure memory (i.e., the enclave) on secure hardware. We solve these problems by designing a novel trajectory data encoding method, TrajectoryHash, and combining it with a finite state automaton. We show that this is a generalization of our previous encoding (kato2020secure). TrajectoryHash and the finite state automaton enable algorithmic flexibility, more efficient compression, and deterministic and fast search for high-speed PSI on a TEE. Third, we implement the proposed system on Intel SGX and release the prototype code as open source on GitHub (https://github.com/ylab-public/PCT). Our experiments on real-world datasets show that the proposed system is efficient and effective in practical scenarios. Specifically, the proposed encoding and data structure compress the actual trajectory data to one-sixth of the size of a hash table with the same performance, and as a result, the total runtime is substantially reduced. Moreover, we show that our system implemented on a single machine equipped with SGX can handle thousands of queries on tens of millions of trajectory records in a few seconds.
Outline. In Section 2, we describe the features of Intel SGX that are related to our architecture. In Section 3, we compare the performance of TEE-based PSI and conventional cryptography-based PSI. In Section 4, we describe the problem statement and formulate the PCT problem. In Section 5, we give an overview of our architecture, and in Section 6, we describe the algorithm and trajectory-based data compression. In Section 7, we show how our system achieves the requirements. In Section 8, we present the experimental results and evaluation. In Section 9, we discuss related work, including recent PCT applications, and our position. Finally, we provide the conclusions in Section 10.
2. Trusted Execution Environment
Before explaining our system, we introduce the secure hardware used in this paper. We focus on Intel SGX, which is a representative implementation of a TEE; the proposed architecture and algorithms can be applied to other types of secure hardware. Intel SGX (costan2016intel) is an extended instruction set of Intel x86 processors that enables the creation of an isolated trusted execution environment, called the enclave. In addition to powerful server machines, SGX is installed on some PCs. SGX is also available on some public cloud platforms, such as Azure Confidential Computing, Alibaba Cloud, and IBM Cloud. We give a brief overview of SGX in the following paragraphs.
The enclave resides in a protected memory region, called the Enclave Page Cache (EPC), in which all programs and data are unencrypted and processed quickly, while outside the CPU package they are transparently encrypted by a memory encryption engine using a secret key that only the processor hardware can access. In other words, SGX adopts a model that considers the CPU package as the trust boundary and everything outside it as untrusted. Access to this trusted space from any untrusted software, including the OS and hypervisor, is prohibited by the CPU, which protects the confidentiality and integrity of the program and data inside the enclave. Consequently, programs using SGX must use two types of calls, ECALL and OCALL, to invoke functions across the trust boundary under strict control. These calls often require many clock cycles (tian2018switchless), and so does uploading data to the enclave. This observation is important for improving the performance of our system.
Memory size limitation. A challenge in designing algorithms for Intel SGX is the size constraint of the EPC. The maximum size of the EPC is limited to 128 MB, including 32 MB of metadata for secure management (or 256 MB, including 64 MB of metadata, in recent high-end Intel processors (hiendsgx)). This limit may be gradually relaxed, but it will remain a problem because of the hardware and performance costs of securing memory. When memory is allocated beyond this limit, SGX on Linux allows paging with special encryption. However, many studies have shown that performance is then greatly degraded by severe overhead (gueron2016memory; gjerdrum2017performance; taassori2018vault), which stems from the requirement to preserve confidentiality and integrity even outside the enclave. Therefore, it is necessary to design an efficient algorithm that works within the SGX memory limit. This is still an important problem; for example, Kockan et al. (kockan2020sketching) show a method to overcome the severe memory limitation of TEEs for genomic data analysis.
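As a rough illustration of what this constraint means for a data set, the following sketch estimates how many pieces a server-side data set would have to be split into so that each piece fits in the usable EPC. The 128 MB and 32 MB figures come from the text above; the record count and the 16-byte record size are hypothetical, and the estimate is optimistic because it ignores the memory needed for enclave code and query data.

```python
import math

EPC_TOTAL_MB = 128     # EPC size stated in the text
EPC_METADATA_MB = 32   # reserved for secure management, per the text
USABLE_BYTES = (EPC_TOTAL_MB - EPC_METADATA_MB) * 1024 * 1024

def chunks_needed(num_records, bytes_per_record, usable_bytes=USABLE_BYTES):
    """Number of chunks so that each chunk fits in the usable enclave memory."""
    total = num_records * bytes_per_record
    return max(1, math.ceil(total / usable_bytes))

# Hypothetical workload: 20 million records of 16 bytes each (320 MB in total).
n_chunks = chunks_needed(20_000_000, 16)
```

Halving the bytes per record halves the total size and roughly halves the number of chunks, which is why a compact encoding matters in this setting.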
Attestation. SGX supports two types of attestation, local and remote, which allow the correct initial state and genuineness of the trusted environment of the enclave to be verified from outside. In this paper, we focus on remote attestation (RA) (costan2016intel). We can request RA from the enclave and receive a report with measurements (e.g., MRENCLAVE and MRSIGNER) based on a hash chain over the initial enclave state and other environment information. These measurements identify the programs, the complete memory layout, and even the builder’s key information, and they cannot be tampered with. Intel Enhanced Privacy ID signs the measurements, and the Intel Attestation Service verifies the correctness of the signature in the role of a trusted third party. In addition to verifying the SGX environment, a secure key exchange between the enclave and the remote client is performed within the RA protocol. After the protocol, we can therefore communicate with the remote enclave over a secure channel using a fast encryption scheme such as AES-GCM, and we can safely perform confidential computation in the remote enclave. Our system uses this primitive for private computation.
3. Private Set Intersection
Private set intersection (PSI) is a well-studied and important problem. PSI refers to a setting where multiple parties each hold a private set and wish to learn the intersection of their sets without revealing any information except the intersection itself. The main existing approach is to use cryptographic primitives. We summarize conventional approaches from two aspects: methodology and security model.
Regarding methodology, the first method is based on the commutative properties of the Diffie–Hellman (DH) key exchange (de2010linear). It requires polynomial interpolation, which incurs a high computation cost. Huang et al. (huang2012private) describe a garbled circuit-based approach. Their proposed SCS circuit family improved on the efficiency achievable at that time. This approach resembles the secure hardware-based approach described later in that it leverages general-purpose secure computation. Oblivious transfer (OT) (otbased2014benny; rindal2017malicious; pinkas2018scalable) is one of the most promising approaches. While OT-based methods generally target semi-honest adversaries, (rindal2017malicious) extends the method to malicious adversaries using a dual execution technique (mohassel2006efficiency). Homomorphic encryption (HE) (chen2017fast; chen2018labeled) is suitable for the unbalanced setting because it can replace the oblivious pseudorandom function of the OT-based approach with leveled fully HE and substantially reduce the amount of data to be transmitted. Lastly, there is a method extended from private information retrieval (demmler2018pir). Although many such improvements have been proposed, no method yet achieves practical efficiency on large-scale data in both execution time and communication bandwidth.
Regarding the security model, adversaries are either semi-honest (goldreich2009foundations) or malicious. Roughly speaking, a semi-honest adversary follows the protocol correctly and does not craft the data sent and received, but tries to infer secret information from the information obtained. A malicious adversary crafts the data sent and received and executes the protocol as many times as possible to extract secret information. In general, the malicious setting requires a stricter security standard and higher costs. Which model to defend against depends on the application and situation; in our scenario, we should consider a malicious adversary because an untrusted server can, in general, be malicious.
We consider the secure hardware-based approach to be a better option. In terms of methodology, it does not need such cryptographic primitives. With Intel SGX, platform verification and transparent memory encryption are performed quickly by hardware, so highly efficient PSI can be achieved overall. Additionally, a TEE provides a refined security model against malicious adversaries. The work in (subramanyan2017formal) shows that operations and inputs expose no information about the internal state or data of the trusted execution environment. The remaining concerns are privacy leakage outside the TEE and software implementation bugs. The secure hardware-based method therefore provides a sufficient security model. Tamrakar et al. (tamrakar2017circle) present a design for private membership tests using a TEE. Although not directly related to the speed of PSI, their carousel approach is useful for improving query throughput.
We now compare existing cryptography-based PSI and secure hardware-based PSI more closely in terms of efficiency. We regard (rindal2017malicious) as one of the state-of-the-art approaches in the balanced setting and (chen2017fast) in the unbalanced setting. Although the latter is designed for semi-honest adversaries, it is faster than any malicious-secure method. Table 2 compares the properties of these methods with those of secure hardware (Intel SGX). It includes rough asymptotic bandwidth and computation costs in the online phase, expressed in terms of the client data size and the server data size. Asymptotic comparisons may be unreliable in this area because constant factors have a large impact. Nevertheless, with secure hardware, both the communication and computation costs are far lower, being proportional only to the client data size (Table 2). On the other hand, the secure hardware-based approach requires remote attestation in advance and hardware with special functionality, whereas cryptography-based methods need only an algorithm and no special devices. As shown in Table 2, the OT-based method (rindal2017malicious) reduces the communication cost below that of the naive approach by using a variant of Cuckoo hashing. In (chen2017fast), the communication cost is reduced further, while the server computation consists of homomorphic evaluations on large circuits. These facts show that the secure hardware-based method incurs extra costs, such as special hardware, but is the better option at large scale because of its significant efficiency. We describe an experiment comparing both methods when executing PSI in Section 8.1.
4. Problem Formulation
We first introduce the scenario of trajectory-based private contact tracing, and then we formulate our problem based on well-studied private proximity testing.
In our scenario, we assume that trajectory-based PCT is used to prevent the spread of COVID-19. We consider a centralized architecture that stores the trajectory data of infected patients on a central server and accepts PCT requests from users with their trajectory data. In practice, infected patients’ trajectories can be received in bulk from public institutions such as a government or health agency.
In the operation of the system, based on the incubation period of the virus, the server always keeps the trajectory data of infected patients for the past 14 days (up to 21 days). The data are updated periodically in batches (e.g., once per day), with records added and deleted. The server transforms the trajectory data into an appropriate structure in advance and is always ready to accept PCT requests from clients. The client sends encrypted trajectory data for the past 14 days as a PCT request, and the server performs contact detection and returns the results to the client. The results can be time-stamped and signed in SGX as needed so that they can be verified by a third party, allowing clients to present them to various agencies and at events to show that their risk of infection is low.
Finally, the threat model consists of a malicious server with software and hardware privileges, and honest clients. The client trusts only the SGX-equipped CPU package used on the server.
4.2. Problem Statement
Trajectory-based PCT. The trajectory-based PCT protocol is an asymmetric protocol between a client and a server. When a client wants to know whether she has been in contact with the trajectories stored on a server, the protocol returns 1 or 0 to the client, depending on the result, and does not disclose the client’s private information to the server. In the use case for infections, each client has the trajectory data of one person, and the server has the trajectory data of many infected patients.
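In conventional private proximity testing (narayanan2011location), when two users $u_a$ and $u_b$ hold geographic data $x_a$ and $x_b$ at time $t$, $u_a$ executes the protocol and obtains the following result:
\[
\begin{cases}
1 & \text{if } \lVert x_a - x_b \rVert \le \theta \\
0 & \text{otherwise}
\end{cases}
\]
where $\theta$ is the proximity threshold. After the protocol, $u_b$ does not learn any information about $x_a$, and $u_a$ does not learn any information about $x_b$ except the result.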
In the simplest form, trajectory-based PCT can be represented as an extension of this formulation. For contact tracing, a contact can be determined from time-series tracking data of people. We can therefore perform private proximity testing by extending a single geographic data point to time-series trajectory data. The threshold is likewise extended to a two-dimensional threshold to check spatiotemporal proximity. PCT allows indirect contact to be captured by confirming that a user and a patient were in the same place within a specific period. Denoting the trajectory data of user $u$ as $X_u$, a set of pairs $(t, x)$ of a time and a location, $u_a$ can obtain the result of contact with $u_b$ as follows:
\[
\begin{cases}
1 & \text{if } \exists (t_i, x_i) \in X_a,\ \exists (t_j, x_j) \in X_b \ \text{s.t.}\ \lVert x_i - x_j \rVert \le \theta_{geo} \ \text{and}\ \lvert t_i - t_j \rvert \le \theta_{time} \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]
where $\theta_{geo}$ is the spatial and $\theta_{time}$ is the temporal proximity threshold. Furthermore, $u_b$ does not learn any information about $X_a$, and $u_a$ can obtain only 1 or 0 about $X_b$ in this protocol. We define this as trajectory-based PCT.
Definition 4.1 (Trajectory-based PCT).
For users $u_a$ and $u_b$, if the protocol follows Eq. (1), the protocol is Trajectory-based PCT.
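The contact rule of trajectory-based PCT can be illustrated in the clear, that is, without any of the privacy protection the protocol is meant to provide. The sketch below is a plain reference check under stated assumptions: the function names are hypothetical, distance is measured with the haversine formula (the text does not fix a metric), and the thresholds are given in metres and seconds.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    r = 6371000.0
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def pct_plain(traj_a, traj_b, theta_geo_m, theta_time_s):
    """Return 1 if some pair of points is close in both space and time, else 0."""
    for t_a, x_a in traj_a:
        for t_b, x_b in traj_b:
            if abs(t_a - t_b) <= theta_time_s and haversine_m(x_a, x_b) <= theta_geo_m:
                return 1
    return 0

# Each trajectory is a list of (unix_time, (lat, lon)) tuples.
client = [(1000, (35.0000, 135.0000)), (4000, (35.0100, 135.0100))]
patient = [(1600, (35.00001, 135.00001))]  # about 1.4 m away, ten minutes later

indirect_contact = pct_plain(client, patient, theta_geo_m=10, theta_time_s=900)
```

With a 15-minute window the visit counts as a contact even though the two people were never present at the same moment; shrinking the time threshold to 5 minutes makes the same data return 0. The quadratic pairwise loop also shows why a cheaper formulation is desirable.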
We simplify this problem by mapping continuous data to a discrete space for computational efficiency. We denote by $S$ the set of all symbols in the discrete space and by $s_i$ its $i$-th element. By a mapping $f_\Theta$, any point in the trajectory data is mapped to a single symbol $s_i \in S$. We call this mapping “encoding” and introduce the corresponding method in Section 6. The encoding must be adjusted according to the parameter $\Theta = (\theta_{geo}, \theta_{time})$, which corresponds to the size of a predefined subspace in the 3D spatiotemporal space; each subspace corresponds to one unique symbol, as shown in Figure 2. The figure illustrates an example in which one trajectory point is mapped to one symbol and several other points that fall into a common subspace are mapped to the same symbol.
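We determine contact by considering the intersection of the symbol sets of $u_a$ and $u_b$. This can be formulated as follows:
\[
\begin{cases}
1 & \text{if } f_\Theta(X_a) \cap f_\Theta(X_b) \neq \emptyset \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]
where $f_\Theta(X)$ denotes the set of symbols obtained by encoding every point of $X$. $u_b$ does not learn any information about $X_a$, and $u_a$ can obtain only 1 or 0 about $X_b$ in this protocol. We name this Spatiotemporal Private Set Intersection.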
Definition 4.2 (Spatiotemporal Private Set Intersection).
For users $u_a$ and $u_b$, if the protocol follows Eq. (2), the protocol is Spatiotemporal Private Set Intersection.
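The discretized formulation can be sketched as follows. This is a plaintext illustration only: the `encode` function is a hypothetical stand-in that uses a uniform latitude/longitude grid and fixed time slots, not the encoding introduced in Section 6, and the cell size and slot length play the role of the granularity parameters.

```python
import math

def encode(point, cell_deg, slot_s):
    """Map (t, (lat, lon)) to the symbol of the grid cell and time slot containing it."""
    t, (lat, lon) = point
    return (math.floor(t / slot_s),
            math.floor(lat / cell_deg),
            math.floor(lon / cell_deg))

def st_psi(traj_a, traj_b, cell_deg, slot_s):
    """Return 1 if the two encoded symbol sets intersect, else 0."""
    symbols_a = {encode(p, cell_deg, slot_s) for p in traj_a}
    symbols_b = {encode(p, cell_deg, slot_s) for p in traj_b}
    return 1 if symbols_a & symbols_b else 0

client = [(1000, (35.00012, 135.00012))]
near_patient = [(1600, (35.00018, 135.00018))]  # same cell, same one-hour slot
far_patient = [(1600, (35.01000, 135.01000))]   # a different cell

hit = st_psi(client, near_patient, cell_deg=1e-4, slot_s=3600)
miss = st_psi(client, far_patient, cell_deg=1e-4, slot_s=3600)
```

Once every point is a symbol, contact detection reduces to a set intersection, whose cost with a hash-based set is linear in the number of symbols rather than quadratic in the number of point pairs.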
Theorem 4.3 ().
In this work, we regard the result of Spatiotemporal Private Set Intersection as contact between $u_a$ and $u_b$. Note that, even if Theorem 4.3 holds, the converse is not always true. The formulation is therefore an approximation, but we accept this for the sake of computational efficiency.
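One way to see why a discretized test is only an approximation of a distance-based test is to look at points near a cell boundary. The sketch below uses a hypothetical planar grid with 10 m cells (coordinates in metres) rather than the encoding of Section 6.

```python
import math

CELL_M = 10.0  # hypothetical cell edge length in metres

def cell(p):
    """Index of the grid cell containing the planar point p = (x, y)."""
    return (math.floor(p[0] / CELL_M), math.floor(p[1] / CELL_M))

a = (9.9, 5.0)
b = (10.1, 5.0)  # 0.2 m from a, but across a cell boundary
c = (0.1, 5.0)   # 9.8 m from a, in the same cell as a

same_cell_ab = cell(a) == cell(b)
same_cell_ac = cell(a) == cell(c)
```

Points `a` and `c` share a cell, so they are guaranteed to be within one cell diagonal of each other. Points `a` and `b` are much closer, yet they receive different symbols, so a symbol match implies proximity but proximity does not imply a symbol match.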
Efficiency. Trajectory-based PCT requires efficiency in several respects. The first is response throughput, since the server is constantly exposed to requests from a large number of users, which can be a substantial workload in such a centralized protocol. The second is bandwidth: since the protocol is applied to many users, the bandwidth used for communication must be reduced. The third is scalability. For instance, for COVID-19, the size of the infected patients’ data and the number of users may both increase if the infection spreads. The efficiency requirements depend entirely on the context in which PCT is deployed and are determined by the number of users, the frequency of use, the amount of data, and so on.
Security. In the threat model of trajectory-based PCT, the most prominent adversary that we must defend against is the malicious server. Informally, a malicious server tries to obtain information illegitimately without being constrained to follow the protocol. In a typical model, the server runs in an untrusted software and hardware environment, and a privileged attacker has full control over the OS and/or hypervisor and the memory hardware units, and can monitor packets in the network. Under this model, cryptographic indistinguishability is essential both in PCT processing on the server and in delivery over the network to protect user privacy, because the adversary can otherwise monitor raw data. Other parties, such as other clients, can be another threat; however, in our scenario they cannot interfere with other clients’ protocols. Short of compromising the server system, the most harmful thing they can do is monitor the network, so it suffices to consider the security of untrusted servers, including the communication part.
A more formal security definition for a malicious attacker on the remote attestation protocol using SGX follows (subramanyan2017formal; bahmani2017secure).
Flexibility and Accuracy. Flexibility, expressed slightly more formally, is the requirement that $\theta_{geo}$ and $\theta_{time}$ are parameters of the system. For example, these parameters need to be changed to minimal values if it is found after the system is released that only direct contact needs to be captured because of the virus’s capacity for transmission. Accuracy means that the PCT must not return a probabilistic answer; how strictly this applies depends heavily on the domain in which the PCT system is deployed. Since our system does not return statistics, we believe it is better to avoid probabilistic data structures whenever possible; a method that is both deterministic and efficient is the best option.
5. System Overview
We introduce an overview of the system. Table 3 shows the symbols and parameters that are used in the rest of the paper.
|number of raw trajectory data of infected people|
|raw trajectory data of infected people|
|number of clients|
|a client and all clients set|
|number of chunks of central data|
|mapped , array of efficient chunks|
|-th chunked data of , efficient representation (e.g., FSA)|
|client ’s query data (raw trajectory data)|
|merged and mapped query data (e.g., unique array)|
|unique size of|
|parameter of PCT,|
|-th row of trajectory data, time and location|
Figure 3 shows the overview of our architecture using a trusted enclave. Our method consists of several steps, including the transformation of data maintained on the server side and the transformation of data sent from the clients as follows.
First, we describe the data of infected patients on the server.
Step 1: (Update master data) The government updates the infected patient data in batch processing. The master data are stored as raw trajectory data, which do not need to include user IDs since there is no need to distinguish trajectory data by infected user.
Step 2: (Mapping) Step 2 is executed in the same batch processing as step 1. We map the raw data to an efficient dictionary representation with a mapping function. This mapping comprises encoding, chunking, and transformation into a dictionary representation. Encoding converts each trajectory point into a 1D string representation; it corresponds to the encoding function in Def. 4.2. Chunking splits the data set into chunks. Transformation converts each chunk into a dictionary representation, so that the result is an array of chunks, each of which fits within the enclave memory limit. How to represent the chunked data in a form specialized for PSI under the SGX memory constraint is our main challenge. The encoding and compression schemes are described in Section 6.
The next part is the processing of queries from clients.
Step 3: (Remote attestation) The client verifies the remote enclave through the remote attestation protocol before sending a request to the server. The client can confirm that the enclave has not been tampered with and then securely exchange keys with the enclave. Thereafter, the key is used to encrypt the data, which enables secret communication with the remote enclave through a secure channel.
Step 4: (Request) Many clients send PCT requests to the server. In the figure, a client sends query data containing her trajectory data. The trajectory data are encoded with the encoding function before encryption, so the server and the client share the PCT parameter. The query data are encrypted in all untrusted areas after leaving the client environment and are visible only in the verified enclave.
Step 5: (Queuing) Until a certain number of requests have accumulated, the encrypted queries are queued outside the enclave, and they are then passed to the enclave together by the loadToEnclave function. This function is implemented as an ECALL that invokes an SGX function. This technique is also used in (tamrakar2017circle), and we aim to optimize query processing for multiple (e.g., 1000) users through batch processing. This optimization also mitigates the ECALL overhead.
Step 6: (Mapping) After being uploaded to the trusted enclave, the data are finally decrypted. Inside the enclave, the queued queries are grouped together and mapped to a query representation using mapToUniqueArray. As in step 2, the function takes the query data and a granularity parameter. The aim is to convert the data into a structure suitable for PSI; basically, the merged query is represented as a unique array. If the trajectory representation is large, this part can become a bottleneck. Moreover, the query data are private and cannot be handled outside the enclave. Therefore, encoding the trajectory data into a small number of bytes (step 4) is critical.
Step 7: (Contact detection) The chunked data are imported into the enclave one by one, and we compute the set intersection of each chunk and the merged query data in the enclave. This can be done by string matching against the representation produced in step 2. The results are accumulated.
Step 8: (Response construction) After the iterations over all chunks are completed, responses for all clients are constructed inside the trusted enclave from the results and the complete query data by constructResponses. This can be done by simply checking whether each query contains data that appear in the results. Finally, the encrypted result is returned to each client through the secure channel.
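The data flow of steps 5 to 8 can be sketched in plain Python. This is a simplified model under stated assumptions rather than the enclave implementation: encryption, remote attestation, and the ECALL boundary are omitted, Python sets stand in for the compressed chunk representation, and the snake_case function names merely mirror mapToUniqueArray and constructResponses from the text.

```python
def map_to_unique_array(queries):
    """Step 6: merge all queued queries into one sorted array of distinct symbols."""
    return sorted({s for q in queries.values() for s in q})

def detect_contacts(unique_symbols, chunks):
    """Step 7: intersect the merged query symbols with each server chunk in turn."""
    hits = set()
    for chunk in chunks:  # only one chunk is held in enclave memory at a time
        hits.update(s for s in unique_symbols if s in chunk)
    return hits

def construct_responses(queries, hits):
    """Step 8: a client gets True if any of its symbols was found on the server."""
    return {cid: any(s in hits for s in q) for cid, q in queries.items()}

# Server side (steps 1-2): encoded patient data, already split into chunks.
chunks = [{"a1", "a2"}, {"b7", "b9"}]
# Client side (steps 4-5): one batch of queued queries, shown here decrypted.
queries = {"c1": ["a2", "z0"], "c2": ["y3"], "c3": ["b9", "a2"]}

unique_symbols = map_to_unique_array(queries)
responses = construct_responses(queries, detect_contacts(unique_symbols, chunks))
```

Merging the batch first means that a symbol shared by several clients, such as `"a2"` here, is looked up once per chunk instead of once per client per chunk.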
6. Trajectory Data Representation
In this section, we focus on a trajectory data representation that is well suited to PSI processing under the memory constraint of the secure hardware. The most important issue is how to represent each trajectory point. This is the encoding, which corresponds to the encoding function in Def. 4.2. We need to encode distinct trajectory data into distinct 1D data to perform PSI. We also have to make the encoded data small and compressible, as described in Section 5. A small representation improves the performance of the whole system, and compressibility improves the performance of the PSI part, which is the core component of our system. In particular, we have to carefully consider the dictionary representation obtained by the mapping in step 2. It should meet the following requirements. First, it should be a memory-efficient data structure for storing trajectory data, to overcome the severe memory constraint. Second, it should support fast search for fast PSI. Third, it should provide a deterministic search method for accurate PSI.
However, standard dictionary representations do not match our requirements. A well-known data structure for dictionary representation is the hash table, which we consider as a baseline. The hash table supports key-based search. While it provides desirable search performance and deterministic search, it fails to satisfy the first requirement because its size increases linearly with the size of the data. A smaller data structure is preferable in our setting because the overhead caused by the SGX memory constraint is heavy. Probabilistic data structures such as the Bloom filter provide search speed comparable to that of the hash table with superior memory efficiency, but they do not satisfy the third requirement because they cannot provide deterministic search.
Our proposed method for achieving the desired dictionary representation is a combination of encoding and a finite state automaton (FSA). Roughly speaking, the encoding process transforms trajectory data into highly similar string representations, and the FSA then exploits this similarity to create a compressed dictionary representation. The FSA we use is a deterministic acyclic finite state acceptor that permits the sharing of both prefixes and suffixes among nodes (described in Section 6.2).
We introduce the encoding used in Spatiotemporal Private Set Intersection. The encoding should satisfy the following three properties. First, there must be an injective function from distinct discretized trajectory data to unique strings. Obviously, if this property is not satisfied, PSI cannot be performed correctly. Second, the encoded size should be small. The space after mapping should be as small as possible because, if it is small, all data, including both the server data and the queries from the clients, will be small. This is the ideal situation for TEE-based secure computation. Our previous work (kato2020secure) lacks this aspect. Third, the encoded strings should share many similarities so that the FSA can compress them. We introduce the TrajectoryHash encoding and show that it satisfies these properties.
The trajectory data consist of an array of tuples of temporal data and geographical data, such as a UNIX epoch and a pair of latitude and longitude.
and are determined as and considering conditions such as a lifespan of the virus. Algorithm 1 shows the pseudocode of TrajectoryHash. This encoding is built from two encodings, QuadKeyEncode and PeriodicEncode, and a binary-level mixing function, Mixing. ST-Hash (guan2017st) is similar to our encoding: the binary-level mixing step is the same, but the two underlying encoding methods and the motivation are different. We use QuadKeyEncode and PeriodicEncode to preserve the similarity and hierarchical structure of trajectory data so that the trajectories compress well.
QuadKeyEncode is based on the quadkey introduced by Bing Maps (quadkey), which is a method of encoding into bits in the tile coordinate space, recursively dividing into two parts according to a given level, as shown in Figure 4. Note that in our method QuadKeyEncode outputs two separate binaries. As we can see in the figure, while we get "212"(=) using quadkey encoding, QuadKeyEncode outputs and . The detailed algorithm is described in Algorithm 1. The parameter and the approximate distance covered by the square in tile coordinates are shown in Table 5. For instance, given , , we get the outputs 1110000000111010 and 0110100100111110 as binaries. Using this encoding, we get unique binaries for each area distinguishable by . Moreover, the hierarchical structure and similarity of trajectory locations are preserved in the binary representation.
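A minimal sketch of the idea behind QuadKeyEncode follows, assuming the standard Bing Maps tile system (Web Mercator projection, tile X and Y indices at a given level). The function names `quadkey_encode` and `quadkey_string` are hypothetical and the details may differ from Algorithm 1. The point is only that the two tile indices are kept as separate bit strings, and that interleaving them bit by bit gives back the usual base-4 quadkey.

```python
import math

def quadkey_encode(lat_deg, lon_deg, level):
    """Return (x_bits, y_bits): tile X and Y indices at `level` as bit strings.

    Assumes the Bing Maps tile system; this is an illustrative sketch,
    not the paper's Algorithm 1.
    """
    # Clip to the latitude/longitude range covered by the Mercator projection.
    lat = min(max(lat_deg, -85.05112878), 85.05112878)
    lon = min(max(lon_deg, -180.0), 180.0)
    n = 1 << level  # number of tiles along each axis
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    tile_x = min(max(int(x * n), 0), n - 1)
    tile_y = min(max(int(y * n), 0), n - 1)
    return format(tile_x, f"0{level}b"), format(tile_y, f"0{level}b")

def quadkey_string(x_bits, y_bits):
    """Interleave the two bit strings into the usual base-4 quadkey."""
    return "".join(str(2 * int(yb) + int(xb)) for xb, yb in zip(x_bits, y_bits))
```

Under this convention, the quadkey "212" corresponds to the X bits 010 and the Y bits 101, since each quadkey digit is twice the Y bit plus the X bit.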
PeriodicEncode is optimized to discretize time data over a given period at given time intervals. This encoding outputs the minimum number of bits that can express distinct time intervals according to given in the period to . Given 2 weeks as the period, the relation between the parameter and the approximate time interval is shown in Table 5. Final output length is determined by both and (, ). For example, given
then, detailed processing is as follows.
Finally, we obtain 0011100100100 as the binary. In this way, we get a minimal representation of the trajectory time information that preserves the similarity of time representations within the period while adjusting the intervals to the given granularity parameter.
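The discretization described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact PeriodicEncode: the function name `periodic_encode` is hypothetical, and so is the `interval_seconds` parameter, which takes the interval width directly in seconds instead of the paper's granularity parameter. The sketch maps a timestamp to the index of its interval within the period and writes that index with the fewest bits that can distinguish all intervals.

```python
import math

def periodic_encode(t, t_start, t_end, interval_seconds):
    """Encode timestamp `t` (seconds) as a fixed-width bit string.

    The width is the minimum number of bits that can distinguish every
    interval of length `interval_seconds` in [t_start, t_end).
    """
    if not (t_start <= t < t_end):
        raise ValueError("timestamp outside the encoding period")
    n_intervals = math.ceil((t_end - t_start) / interval_seconds)
    width = max(1, (n_intervals - 1).bit_length())
    index = int((t - t_start) // interval_seconds)
    return format(index, f"0{width}b")
```

For a two-week period and 10-minute intervals there are 2016 intervals, so this sketch uses 11 bits per timestamp. Nearby timestamps share long bit prefixes, which is the similarity the FSA later exploits.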
Now we have three binaries: longitude 1110000000111010, latitude 0110100100111110, and periodic 0011100100100. Note that if the lengths of the three binaries differ, we have to call ZeroPadding to make them the same length, because Base8Encode encodes three bits at a time into one character. However, zero padding is not an issue: no matter how many such zero values there are, they are likely to share a single node in the FSA. Thus, the periodic encoding is changed into 0000011100100100 (from 0011100100100) by ZeroPadding.
Next, we mix them into one binary by Mixing (line 9). There can be several variants: mixing, or simply merging without mixing. A plausible option is to take one bit at a time from each binary in turn (mixTH), as shown in Figure 7. With this mixing, the 3D trajectory data is encoded as in Figure 8, where the 3D similarity of trajectory data is naturally preserved in the binaries in a manner balanced between time and location. A different mixture may be more desirable depending on the dataset. In some cases, it may be more compressible to append the periodic binary at the end without mixing (seqTH), as shown in Figure 7. For example, when people do not move much, such as when sleeping, a higher similarity than with mixTH is expected. Furthermore, because there is no mixing, the padding is unnecessary and the number of bytes of the periodic binary is minimized.
In particular, we can regard our previous work (kato2020secure) as a specific mixing in which we first mix the latitude and longitude binaries and then append the periodic string without mixing, as shown in Figure 12. That work uses geohash encoding (geohash) and a more naive periodic encoding than this version (Figure 12 (b)). However, the biggest difference is that TrajectoryHash assumes a binary-level merge, so we can minimize the size of the output. This is a big advantage in our system.
Lastly, we encode the binary into a string by Base8Encode for ease of transport. Base8Encode simply generates a string by assigning one of the characters 0-7 to every 3 bits (Figure 7). Although Base64 encoding looks more space efficient, the data are converted back to binaries when actually used in PCT, so either encoding basically works. Here, we use Base8 for more flexibility with respect to the length in bytes.
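The padding, mixing, and Base8 steps can be sketched together on the three example binaries from the text. The function names below are hypothetical, and two points are assumptions of this sketch rather than statements of the paper: the bit order within each mixed triple (longitude, latitude, time) and the reading of seqTH as "interleave longitude and latitude, then append the unpadded periodic bits".

```python
def zero_padding(bits, width):
    """Left-pad a bit string with zeros up to `width` characters."""
    return bits.rjust(width, "0")

def mix_th(lon_bits, lat_bits, time_bits):
    """mixTH-style mixing: take one bit from each binary in turn.

    The (lon, lat, time) order is an assumption of this sketch.
    """
    width = max(len(lon_bits), len(lat_bits), len(time_bits))
    padded = [zero_padding(b, width) for b in (lon_bits, lat_bits, time_bits)]
    return "".join(a + b + c for a, b, c in zip(*padded))

def seq_th(lon_bits, lat_bits, time_bits):
    """seqTH-style merge: interleave lon/lat, then append the time bits unmixed."""
    width = max(len(lon_bits), len(lat_bits))
    lon = zero_padding(lon_bits, width)
    lat = zero_padding(lat_bits, width)
    return "".join(a + b for a, b in zip(lon, lat)) + time_bits

def base8_encode(bits):
    """Map every 3 bits to one character in 0-7 (left-padding to a multiple of 3)."""
    bits = zero_padding(bits, -(-len(bits) // 3) * 3)
    return "".join(str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3))

lon = "1110000000111010"
lat = "0110100100111110"
periodic = "0011100100100"
mixed = base8_encode(mix_th(lon, lat, periodic))   # 16 characters
merged = base8_encode(seq_th(lon, lat, periodic))  # 15 characters
```

With the example inputs, the mixed variant needs 48 bits because the periodic binary is padded to 16 bits, while the merged variant needs only 45 bits, which illustrates why skipping the mixing can save space.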
6.2. Data structure
We now explain how FSA satisfies the three requirements mentioned above. We use FSA as a deterministic acyclic finite state acceptor; it can store and compress string data by sharing string prefixes and suffixes, starting from the root of the tree structure. It also provides fast string search as a dictionary, in time proportional to the maximum depth. Moreover, the search cost can be regarded as constant if the maximum length is small, which is asymptotically equivalent to the hash table and may even be advantageous because no hash function needs to be computed. Thus, it basically meets our requirements. (10.5555/2032366.2032380) shows its effectiveness through extensive experiments. Therefore, introducing FSA increases compression efficiency while satisfying our requirements. Trie (fredkin1960trie) has similar functions and is conventionally used for trajectory data storage (lee2011crowd); however, FSA is better in this case because it shares many more nodes.
Figure 9 shows an example of how well our encoding works with FSA. Assume that TrajectoryHash encodes trajectories as '5660211300766360', '5660211300762760', '5660211301043560', and so on, which are generally similar sequences because the trajectories are continuous. Indeed, in trajectory data collected from the real world, time changes continuously in small increments, and locations are likely to be close to each other or almost stationary. Therefore, the encoded data are expected to be highly similar, as in the example. Thus, FSA yields a much smaller data size than holding the data in a hash table or similar structure.
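To make the prefix and suffix sharing concrete, the following sketch builds a trie from the example strings and then merges equivalent subtrees bottom-up, which yields a minimal acyclic automaton. This is one simple construction chosen for illustration; it is not necessarily the one used in the paper's implementation, and all names are hypothetical.

```python
class Node:
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}     # character -> child node
        self.final = False  # True if a stored string ends here

def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root

def minimize(node, registry):
    """Merge equivalent subtrees bottom-up and return the canonical node."""
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    # Two nodes are equivalent if they agree on finality and on their
    # outgoing edges to (already canonical) children.
    key = (node.final, tuple(sorted((ch, id(c)) for ch, c in node.edges.items())))
    return registry.setdefault(key, node)

def count_states(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)

def contains(root, word):
    """Deterministic lookup: follow one edge per character."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final

words = ["5660211300766360", "5660211300762760", "5660211301043560"]
trie_states = count_states(build_trie(words))
fsa = minimize(build_trie(words), {})
fsa_states = count_states(fsa)
```

For these three strings the trie has 28 states and the minimized automaton has 22, because the common suffix "60" is stored once in addition to the shared prefix. Lookup stays deterministic and takes one step per character.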
We also have to consider how to construct the chunked FSA. At step 2 of Figure 3, we transform raw data into chunked FSA. Generally, chunking FSA is not a straightforward task, because the compression result depends on how the small FSAs are built or how a large FSA is divided, which is different from the hash table. With the hash table, performance does not depend on the chunking method because the data size is determined by the number of stored items. However, we can solve this problem with a simple operation. Before constructing the FSA, we sort the raw data in numerical order; we then iteratively take trajectory data from top to bottom and transform each group into a single FSA. With this operation, we can stably construct compressive FSA representations because the data within a chunk are highly similar.
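The effect of sorting before chunking can be illustrated with a small sketch. Here `make_chunks` and `trie_nodes` are hypothetical helpers, the data are synthetic fixed-length strings (for which string order equals numerical order), and the number of distinct prefixes is used as a rough stand-in for the size of the structure built from each chunk.

```python
def make_chunks(encoded, chunk_size):
    """Sort the encoded strings, then cut them into consecutive chunks."""
    ordered = sorted(set(encoded))
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]

def trie_nodes(words):
    """Number of distinct prefixes (including the empty one), a proxy for size."""
    return len({w[:i] for w in words for i in range(len(w) + 1)})

# Synthetic data: 1000 fixed-length strings, given in scrambled order.
data = [f"{(i * 37) % 1000:06d}" for i in range(1000)]

sorted_chunks = make_chunks(data, 100)
sorted_total = sum(trie_nodes(c) for c in sorted_chunks)

# Baseline: same number of chunks, but each chunk takes every 10th string.
ordered = sorted(data)
strided_chunks = [ordered[j::10] for j in range(10)]
strided_total = sum(trie_nodes(c) for c in strided_chunks)
```

In this toy setting the sorted chunks need far fewer nodes in total than the strided ones, because neighbouring strings in sorted order share long prefixes and therefore end up in the same chunk.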
In the system we introduce, the infected people's data is in raw format, not encrypted, which is based on the consent of the infected people. We think this setting is acceptable to society. However, we can also consider another model in which the consent of the infected people is not needed, even when the data come from different parties that do not trust each other. Basically, this can be achieved without changing the structure much.
The people or health agencies who register infected people's data perform remote attestation of the untrusted server and send encrypted data to the enclave. Using the SGX sealing capability (costan2016intel; sealing), we securely encrypt enclave secrets and store them persistently on disk, using a private Seal Key that is unique to the particular platform and enclave. Although extra overhead is incurred, we can preserve the confidentiality and integrity of the data. Even with this capability, the main bottleneck is that the chunks must be prepared inside the enclave, because the content of each trajectory data item must be known to construct the FSA. If the size of the infected trajectories becomes large, this preparation can cause an excessive workload in the enclave. Practically, it is difficult unless we find a way to avoid this bottleneck. However, once the chunks are prepared, sealed, and stored in untrusted storage, we can achieve the same performance through the same processing as with raw data.
7. System Analysis
7.1. Algorithm Analysis
Here, we discuss the asymptotic computational cost of PSI and some precautions. Algorithm 2 shows the PSI part of our trajectory-based PCT algorithm. Some of the functions are described in Section 5. Dictionary must implement the contains method, which returns a Boolean value indicating whether it includes the target. In the case of the hash table, this is the computation of the hash function, and in the case of FSA, the acceptance routine of the finite state automaton; both are asymptotically constant. The computational costs of trajectory-based PCT are as follows. Assume that the cost of a single key search for a dictionary is and the unique size of is , the calculation cost is
Seemingly, and the number of chunks is constant and PSI is completely scalable for an infected trajectory size. However, note that the size of depends on the memory constraints of SGX. When thousands of queries are processed together, their exact information, which can be several tens of MB in size, needs to be kept within the enclave to correctly reconstruct the responses; consequently, the space available for a chunk is not large. This means that there is actually a practical lower bound on . Lastly, our routine includes decryption and encryption. These are implemented with fast and simple methods, such as 128-bit AES-GCM, so their execution time is not dominant.
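The structure of the PSI routine with query multiplexing can be sketched as follows. This is a simplified illustration, not Algorithm 2 itself: the function name `multiplexed_psi` is hypothetical, Python sets stand in for the hash table or FSA dictionaries, and remote attestation, decryption, and encryption are omitted.

```python
from collections import defaultdict

def multiplexed_psi(chunks, queries):
    """Answer many clients' queries with one pass over the chunked data.

    chunks:  iterable of dictionaries supporting `key in chunk`
    queries: {client_id: list of encoded trajectory points}
    Returns  {client_id: True if any of the client's points is in the data}
    """
    # Multiplex: collect the unique keys over all clients.
    owners = defaultdict(set)
    for client_id, keys in queries.items():
        for key in keys:
            owners[key].add(client_id)

    # One iteration over the chunks; each unique key is searched per chunk.
    hits = set()
    for chunk in chunks:
        for key in owners:
            if key in chunk:
                hits.add(key)

    # Response construction: check whether each query has a matching key.
    return {cid: any(k in hits for k in keys) for cid, keys in queries.items()}
```

In this sketch the number of dictionary lookups is the number of chunks multiplied by the number of unique query keys, so the chunked data are read once per batch of queries instead of once per client.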
Finally, we show how our system meets the requirements we mentioned first.
Efficiency. There are four points: Intel SGX, chunking, data representation, and query multiplexing. First, as described in Section 3, SGX brings us efficient PSI. SGX allows software to perform secret computations transparently and eliminates the need for complicated and time-consuming cryptographic techniques to perform PSI. This fact is the basis of the efficiency of our system: the computational overhead is small and the overall speed is dramatically improved. Second, chunking, into , avoids the serious paging overhead caused by the severe memory constraints of SGX even when the infected patients' data become too large to fit into the enclave. Third, the memory-efficient dictionary representation (Section 6) reduces the number of chunks, which reduces the PSI execution time and the overhead of uploading data to the enclave. This is the core point of our system. Fourth, steps 5 and 6 (Section 5) implement query multiplexing and improve the throughput of query processing. Reading the chunked data , as in Step 7, is costly due to the iteration, and doing this for every query would incur large overheads.
Security. Our protocol follows the remote attestation and secure computation provided by Intel SGX. Previous studies (bahmani2017secure; subramanyan2017formal) show the security of the protocol. Informally, no state can be observed from outside the TEE, and for any inputs, no tampering with the state that a malicious server can perform will divulge any information about the client trajectories. Hence, it is guaranteed that the only information an attacker can observe is the information outside the TEE. In our system, all information observable outside the TEE is encrypted. Therefore, cryptographically strong security for the client's privacy against any external attacker is ensured, provided that proper encryption is used and there is no software vulnerability. More formal definitions require elaborate modeling of the attacker and the private information, but our setting is common and we defer to the earlier work (bahmani2017secure; subramanyan2017formal). Note that some side-channel attacks are out of scope both in their work and in ours. To protect against such attacks, we have to consider data-obliviousness (mishra2018oblix; ahmad2018obliviate) to make the side channels uniform.
Flexibility. We achieve flexibility by parametrizing the encoding of the data using a granularity parameter . The parameter is shared between the server and the clients. Once the data are encoded with the parameter, all we have to do is ordinary PSI. In other words, we parametrize not the PSI routine but the target data, through the parametrized encoding. With this approach, we have to update all the data when we want to change the rules for risky contact. However, because we have to update all the data once a day anyway, this is not a major cost. Thus, we can meet this requirement while keeping the processing light and efficient.
Accuracy. This requirement is met because our dictionary search is deterministic, so a query answer is never probabilistic.
8. Experiment and Evaluation
We conducted experiments using real trajectory data to demonstrate that the proposed architecture for PCT can achieve high query throughput and expected properties.
Experimental setup. We use an HP Z2 SFF G4 Workstation with a 4-core 3.80 GHz Intel Xeon E-2174G CPU (8 threads, 8 MB cache), 64 GB RAM, and a 1 TB disk. The CPU supports the SGX instruction set and has 128 MB of PRM (Processor Reserved Memory), of which 96 MB of EPC is available for user use. The host OS is Ubuntu 16.04 LTS with Linux kernel 4.4.0-178. We use version 1.1.2 of the Rust SGX SDK (https://github.com/apache/incubator-teaclave-sgx-sdk) (wang2019towards), which supports Intel SGX SDK v2.9.1, and Rust nightly-2020-04-07. Our experimental implementation is available on GitHub (https://github.com/ylab-public/PCT).
8.1. Preliminary Experiments
Before the main experiments, we verify the claim from Section 3 that secure hardware-based PSI is much more efficient than cryptography-based PSI. To show this, we compare both PSI executions in a setting similar to our scenario. For fairness, we compare a single end-to-end PSI query without the multiplexing optimization described in Section 5. Our secure hardware-based implementation is based on Intel SGX and simply uses a hash table to perform PSI inside the enclave, and the OT-based implementation (https://github.com/osu-crypto/libPSI) follows (rindal2017malicious). Table 6 compares the execution time of OT-based (rindal2017malicious) and secure hardware (Intel SGX)-based PSI in the balanced setting, where we assume that only the RA protocol is performed in advance and the online phase includes client-side encryption and decryption time. We vary the set size, and each data item has 128 bits. As shown in Table 6, Intel SGX easily outperforms the state-of-the-art method in the balanced setting. In particular, at larger set sizes the difference in execution time becomes significant because of the overhead of oblivious transfer, while SGX scales well in this range of sizes. Additionally, secure hardware substantially reduces the required bandwidth. The communication cost of SGX is almost the same as the original data size because the data we have to send are simply encrypted with a symmetric key such as AES-128. Assuming many clients, this is essential. Although the efficiency is better in both aspects, we should also pay attention to the last row of the table. Despite using Intel SGX, the execution is very slow. This is because of the memory constraint of SGX. At the largest set size (80 MB), the trusted enclave has to handle approximately 160 MB of data, which exceeds the memory limit (96 MB). As a result, it causes a serious overhead.
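Table 6. Execution time / communication cost (MB) of PSI in the balanced setting.
|Set size (data size)||Intel SGX||OT-based (rindal2017malicious)|
|(16 KB)||38 / 0.016||35 / 2|
|(0.16 MB)||52 / 0.16||207 / 22|
|(1.6 MB)||153 / 1.6||2389 / 235|
|(16 MB)||1552 / 16||27110 / 2482|
|(80 MB)||121526 / 80||154826 / 12502|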
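Table 7. Execution time / communication cost (MB) of PSI in the unbalanced setting. Values for (chen2017fast) are reference values taken from their paper.
|Server set size (data size)||Client set size||Intel SGX||(chen2017fast)|
|(1 MB)||5535||77 / 0.089||(600 / 2.6)|
|(1 MB)||11041||73 / 0.17||(1300 / 4.1)|
|(17 MB)||5535||72 / 0.089||(2200 / 5.6)|
|(17 MB)||11041||85 / 0.17||(4000 / 12.0)|
|(268 MB)||5535||249 / 0.089||(10600 / 11.0)|
|(268 MB)||11041||424 / 0.17||(16200 / 21.1)|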
Table 7 shows the results when using Intel SGX in the unbalanced setting. As a reference, we also show the results of (chen2017fast) taken from their paper, which are the total execution time (sender online and receiver encryption and decryption) with the best parameters and maximum multi-threading. These numbers are the best of their implementation, but the table shows that Intel SGX-based PSI is significantly faster and more efficient in the unbalanced setting, even though its security model is stricter than that of (chen2017fast) (semi-honest). Secure hardware-based PSI is basically not affected by the server-side data size, as shown in Table 7. However, when the data exceed the SGX memory constraint, the execution becomes slow due to paging overheads, as shown at (268 MB). In this case, the client size is so small that less paging is required, and the impact looks smaller than in the previous result.
In this way, we can achieve fast PSI by utilizing secure hardware. We expect that cryptography-based methods will gradually improve; however, they are unlikely to catch up with the secure hardware-based method in the near future. For deployment in a practical private contact tracing system, it is better to adopt secure hardware.
Datasets. We conduct the experiments with a real dataset containing people's trajectories in specific regions of Japan, available from JoRAS (http://www.csis.u-tokyo.ac.jp) of The University of Tokyo. We use the people flow datasets for Kinki and Tokyo to create our experimental dataset. We extract only the time and coordinate information and create our dataset by applying the encoding described in Section 6.
We now justify the scale of the experiment. Regarding the number of data points, consider Japan as a practical example: the maximum number of new infections in Japan as of July 25 was 981 per day, which means that the maximum number of new infections in 2 weeks is approximately 14 × 1000 = 14,000. If the trajectory data were collected every 10 minutes, the total number of trajectory data would be . Therefore, - rows of data are plausibly as large as the infected patient data in our experiment. Each client has 1440 trajectories for 2 weeks, which corresponds to collecting trajectories every 14 minutes over 2 weeks.
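The scale estimate can be checked with a short calculation. The client-side figure follows directly from the numbers in the text. The server-side total additionally assumes that each of the 14,000 infected people contributes a full two weeks of points at 10-minute intervals, which is an assumption of this worked example rather than a figure stated above.

```latex
\[
\frac{14 \times 24 \times 60\ \text{min}}{1440\ \text{points}} = 14\ \text{min per point},
\qquad
14{,}000 \times \frac{14 \times 24 \times 60}{10}
  = 14{,}000 \times 2016
  = 28{,}224{,}000 \approx 2.8 \times 10^{7}\ \text{points}.
\]
```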
The minimum trajectory data information, with the ID eliminated, consists of time, longitude, and latitude. Figure 11 shows the distribution of such 3D trajectory data for Kinki; the scatter plot contains 100000 trajectory points randomly sampled from one day of the dataset. We can see that the trajectories have some patterns and a biased distribution, and hence we expect them to be compressible. The trajectories of Tokyo are shown in Figure 11 and are clearly denser.
First, we show the compression results. In our method, a trajectory data point is represented as one string. Generally, the minimum size of a trajectory data point is 24 bytes, because the datetime information is 8 bytes and the longitude and latitude are 8 bytes each. Table 8 shows the names of the encoding methods we use in the experiments, together with the relationship between the parameter, the granularity, and the byte size of a single data point in the 2-week setting. th54 and th60 use TrajectoryHash with the shown parameters, gp10 (geohash and periodic) is our previous encoding (kato2020secure), and sep54 uses sepTH, described in Section 6.1, which merges the binaries without mixing. As shown in the table, th54, th60, and sep54 have finer granularity than gp10 in both the time and geographic scales.
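Table 8. Encoding methods used in the experiments (2-week setting).
|Encoding||Size per data point||Time granularity||Spatial granularity|
|th54||7 bytes||32 s||0.6 × 0.6 m|
|th60||8 bytes||32 s||0.15 × 0.15 m|
|gp10||14 bytes||1 min||0.6 × 1.0 m|
|sep54||6 bytes||32 s||0.6 × 0.6 m|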
Figure 12 shows the compression results for the trajectory data of Kinki and Tokyo and for random data. The size is measured with the same number of unique data in encoded form so that the comparison of compression capacity is fair. The figure shows the sizes of the hash table and the FSA holding the same data encoded by the above-mentioned encodings. Compared to random data (right side), all encodings make the FSA smaller with real trajectories. We can see that, compared to gp10, the compression rate of TrajectoryHash looks smaller. When using FSA, a naive encoding is not such a problem because shared nodes are ignored. On the other hand, as described in Section 6, because it is better for each data item to have a smaller size, we should use th54 or septh54 (Table 8). In both the Kinki (left side) and Tokyo (middle) datasets, the TrajectoryHash-based method gives the smallest data representation. Seemingly, septh54 has trade-offs depending on the dataset: th54 is better with 3D similarity and septh54 with 2D similarity. The distribution of Tokyo is relatively denser than that of Kinki, which affects the compression rate. Overall, TrajectoryHash contributes more by mapping a single trajectory data point compactly than by compressing multiple trajectories in the FSA.
In practice, we handle larger data as server data. As mentioned in Section 6.3, we sort the trajectories before chunking. Thus, when the data are more extensive, the chunked data can be more compressible because similar data are likely to be gathered together. Figure 13 shows the chunked data size for each central data size, and the band indicates the deviation. The chunk size is fixed, and the figure shows that, as the original data become larger, the compression rate of the chunked data increases up to approximately a certain value.
For our system to work correctly, we have to determine a proper chunk size. Figure 14 shows the PSI execution time for different chunk sizes using gp10. A chunk size that is too small or too large causes severe overheads because of the EPC limitation. From this observation, we can see that an optimal chunk size exists. In this setting, when the client size is 1000 to 3000, the optimal chunk size is approximately 27 MB when using FSA and approximately 28 MB when using the hash table. Considering the discussion in Section 7.1, this result is reasonable, because the space used for query data in the EPC is approximately 60 MB and the EPC limitation is 96 MB. For example, 2000 clients × 1440 data × 14 bytes ≈ 40 MB, and some more space is needed. Therefore, we can determine a chunk size by considering the client query size.
Next, we show the PSI throughput comparison in Figure 15. We compare our proposed method with FSA (th54), our previous method (gp10), and the fastest non-private set intersection (Plain), which is implemented with a hash table. The difference between th54 and gp10 is mainly due to the overhead of loading the data into the enclave; therefore, a more space-efficient data representation is better. We fix the client size and the chunk size and measure the PSI execution time for each central data size; in other words, we measure the time it takes to process 4000 PSI queries (each with 1440 trajectories). First, the figure shows that th54 is consistently better than our previous encoding gp10. Moreover, surprisingly, the performance of our proposed method is close to that of the non-private set intersection, within a single-digit factor. Therefore, the iteration overheads are acceptable, and th54 reduces them through its more compressive characteristic. However, this improvement is not that significant; the true advantage of th54 is described next. Other computations that should be performed in a trusted enclave include decryption, assembling multiple queries (step 6 in Section 5), and encrypted response construction. Figure 16 shows their total execution time. The upper figures show the results when the server data size is fixed, and the lower figures show the results when the client query size is fixed. The left side is th54, and the right side is gp10. Overall, th54 (left) is more efficient than gp10 (right). This is mainly due to the small size of each data item under th54, rather than to a speed-up of PSI. Large data cause overheads in decryption, assembly, and any other processing, especially inside the enclave. Because th54 has fewer bytes per item, as shown in Table 8, these differences arise. In this way, th54 is more scalable for many clients. Lastly, we stress that the overall execution time is kept low.
The time needed to process 4,000 queries, including each trajectory data, is only 2 seconds. Generally, secure computations such as PSI take an order of magnitude longer. However, our results show that such a secure computation is feasible at a practical scale. In particular, we proposed a system specialized in private contact tracing for epidemics.
9. Related work
9.1. Private Contact Tracing
DP3T (troncoso2020decentralized) and similar schemes (trieu2020epione; rivest2020pact; becker2019tracking; gvili2020security), as described in Section 1, are decentralized architectures using the devices' wireless signals and are the most popular implementation methods so far. In contrast, our proposed system directly handles trajectory data in a private manner to detect indirect contacts. Reichert et al. (Reichert2020PrivacyPreservingCT) propose a setting similar to our study. They also present a system for centralized contact tracing on a server using GPS data; their method is based on traditional multiparty computation with ORAM, which they say underperforms in practical scenarios. (culler2020covista; luo2020deepeye) propose the use of more comprehensive data while focusing on contact tracing. These works indicate that we should aim for a platform that can provide more value, not only a system that specializes in contact tracing. In that sense, our system can be a part of such platforms because it is specialized in private contact tracing.
We should also mention the open source project SafeTrace (https://github.com/enigmampc/SafeTrace) (enigma). SafeTrace also uses a TEE for secure computation for contact tracing. Their system is expected to run on an instance of IBM Cloud, which is a very feasible approach. It does not focus only on notifying users but intends to provide a data analysis platform on a TEE. Therefore, our methodology may help in designing an efficient PSI component. This project is a strong motivation for our efficient private contact tracing scheme on a TEE.
Other technologies such as blockchain (arifeen2020blockchain; marbouh2020blockchain) are also used for contact tracing. The blockchain network is considered a good option for storing private and verified trajectory data. (marbouh2020blockchain) discuss the feasibility of deploying a blockchain-based data tracking solution. In this paper, we do not discuss in detail how to collect and store data, but rather how to provide the data. Secure data management is also a matter of interest.
9.2. Trajectory data
The use of trajectory data is still increasing because of recent industry demands. Our research is related to trajectory data representation and compression. Quadkey (quadkey; lee2011crowd), which we use in TrajectoryHash, is one of the encodings for geodata. Another popular encoding is geohash (geohash). While both methods use recursive bisection on longitude and latitude, the main difference is the encoding. Geohash uses Base32 encoding and thus corresponds to a decomposition into a matrix with 4 columns and 8 rows, whereas quadkey corresponds to a 2 × 2 matrix. Additionally, our quadkey uses the Mercator projection and hence tile coordinates. Quadkey is better in that the obtained mesh is close to real-world squares and the encoding granularity is more adjustable, but its code size is larger than that of geohash. However, if we use them as binaries, there is actually no big difference. Both methods are used and applied in the literature (moussalli2015fast; zhou2017spatial; guo2019geographic). These studies focus on location data and utilize their tree structures or proximity to gain more utility, whereas we mix time data into the encoding. ST-Hash (guan2017st) is used to build spatiotemporal indexes for key-value stores. Although this work also mixes spatiotemporal data into a single encoding as we do, the inner encodings for both time and location are different. In particular, our encoding is suited to the case in which there is a specific period. Moreover, the motivation is totally different, and we consider storing the data as FSA for compression.
Another compression direction is the variants of Douglas-Peucker. This direction is basically route-wise compression. The basic idea is to reproduce the route by estimating it from a minimum number of points (song2014press). The compression is performed by approximating the route information rather than the position information. (chen2019trajcompressor) shows a method that compresses vehicle trajectories using route information from roads and maps. REST (zhao2018rest) extracts important trajectory routes and compresses trajectories using these routes. Our compression is based on the similarity of trajectory points and is therefore orthogonal to these compression methods. However, such route-wise compression may not work well in contact tracing, because it is not possible to tell whether people were really in contact just from intersecting routes. To detect contact, it is necessary to include some time information in the route information.
In this paper, we proposed a trajectory-based private contact tracing system using trusted hardware to control the spread of infectious diseases. We identified the problems of existing private contact tracing systems, clarified the requirements for trajectory-based private contact tracing, and presented a TEE-based architecture to achieve secure, efficient, flexible, and accurate contact tracing. Our experimental results with real data suggested that our proposed system can work on a realistic scale. We hope this study could stimulate different communities and help to develop solutions to combat COVID-19 as soon as possible.