Web search is without doubt the most widely used online service, with more than 3.5 billion queries sent daily to Google alone. These queries are generally stored by search engines to analyze user behavior and to personalize responses according to profiles inferred from users’ past queries (langville2011google; Hannak:2013:MPW:2488388.2488435). They are at the heart of the economic model of online services, which relies heavily on (personalized) advertising (yang2010analyzing). However, as pointed out by numerous studies, the collection and exploitation of search queries raise a number of privacy threats, as queries can disclose sensitive information about individuals (e.g., their age, sex, religious or political preferences, sexual orientation) (castelluccia2010private).
To deal with this issue, a number of solutions enabling users to query search engines in a privacy-preserving manner have been proposed in the literature. These solutions can be classified into three categories according to the guarantees they offer to users.
The first category of solutions are those enforcing unlinkability between a user and her search query. The most popular approaches in this category are anonymous communication protocols (e.g., Tor (dingledine2004tor), Dissent (corrigan2010dissent; wolinsky2012dissent), RAC (mokhtar2013rac)). These solutions are however limited for two main reasons: first, they typically suffer from poor performance because of the heavy cryptographic mechanisms they rely on; second, despite ensuring anonymity of the requester, it has been shown in (peddinti2014web) that the actual content of search queries may be sufficient to link back to the identity of the user.
To overcome this limitation, a second category of solutions aims at enforcing indistinguishability between user profiles/queries. To that end, they obfuscate user preferences/profiles in such a way that the search engine cannot distinguish between a user’s real interests and fake ones (e.g., TrackMeNot (howe2009trackmenot), GooPIR (domingo2009h)). These approaches generally operate by sending fake queries (also called dummy queries) on behalf of the user. It has been shown (petit2016simattack), however, that the external resources used for generating fake queries (e.g., RSS feeds, dictionaries) make it possible for search engines to easily distinguish fake from real traffic. A combination of unlinkability and indistinguishability has also been proposed in the literature, yet the only existing solution that we are aware of (PEAS (petit2015peas)) assumes a weak adversarial model of non-colluding proxy servers.
The last category of solutions comprises those enabling private information retrieval (PIR), e.g., (pang2010privacy; lindell2010private). These approaches rely on specialized search engines implementing cryptographic techniques (e.g., homomorphic encryption) that make it possible to answer a user request without having access to its content. These techniques are, however, still impractical due to their limited performance, with response times in the order of seconds for very large data stores (aguilar2016xpir), which is the case of search engines.
Based on these considerations, it appears clearly that to fully support privacy-preserving Web search one must address two main challenges. The first one is to provide a practical and secure unlinkability protocol, i.e., a protocol enabling the protection of the identity of the requester in a more realistic adversarial model, without compromising the interactiveness between the user and the search engine. The second one is to provide an effective indistinguishability protocol that generates realistic fake queries, i.e., difficult to distinguish from real queries.
This paper contributes X-Search, a novel privacy proxy enabling Internet users to access Web search engines in a privacy-preserving manner. X-Search relies on Intel software guard extensions (SGX) (costan_intel), a hardware technology that provides a trusted execution environment able to perform secure computations within an “enclave”. Instead of submitting her queries directly to the search engine, a user sends them to the X-Search proxy, which executes them on her behalf. The proxy executes attested code in a trusted SGX enclave (see §2.3 for details on the guarantees provided by SGX). The queries are encrypted while outside the enclave, and only accessible as plain text from within. The X-Search proxy then generates an obfuscated query by aggregating random past queries and the original one using the logical OR operator, in such a way that the search engine is not able to distinguish which one is the original query. As the obfuscation scheme can alter the results returned by the search engine by mixing results for the original query with results for the additional aggregated past queries, the X-Search proxy filters the results and forwards to the user only those related to the initial query.
We evaluate X-Search from three perspectives: privacy, accuracy, and performance. From the privacy perspective, we analytically show that X-Search offers stronger privacy guarantees than its competitors as it operates under a stronger adversarial model. Furthermore, we experimentally demonstrate using a data set of real search queries that X-Search is more resilient to state-of-the-art re-identification attacks than PEAS (by in average). From the accuracy perspective, we show that the impact of the obfuscation scheme of X-Search remains limited. For instance, with two fake queries in the obfuscated query, the user retrieves more than of the results returned for the initial query. From the performance perspective, we show that X-Search outperforms its competitors both in terms of latency and throughput. Specifically, the throughput of X-Search is one order of magnitude higher than the one of PEAS and two orders of magnitude higher than the one of Tor.
The contributions of this paper are as follows. First, we present a novel architecture for privacy-preserving Web search that exploits Intel SGX to operate under stronger adversarial models than existing systems in the literature. Second, we contribute a novel query obfuscation mechanism. Third, we present the implementation choices of our full prototype. Finally, we contribute an extensive evaluation, both analytical and experimental, using real-world datasets.
The remainder of the paper is organized as follows. We first introduce background concepts and overview related work in Section 2. Then we present the considered adversary model in Section 3 before presenting our proposed X-Search protocol in Section 4. Finally, we describe the experimental setup and the evaluation of X-Search in Section 5 and Section 6, respectively. Section 7 presents our conclusions.
2. Background and related work
We start by describing in this section the related work in private Web search (Section 2.1). Then, we discuss the limitations of existing solutions (Section 2.2). Finally, we present Intel software guard extensions (SGX) and discuss how this novel technology can be used to improve the state of research in the field of private Web search (Section 2.3).
2.1. Private Web Search
Private Web search has been an active research area over the last decade, aiming to counterbalance the numerous threats raised by the oversharing of users’ search queries by search engines. This research field is likely to gain even more attention due to the recent legislation change in the United States, which enables ISPs to sell user browsing history without their consent (http://www.nbcnews.com/news/us-news/trump-signs-measure-let-isps-sell-your-data-without-consent-n742316). In this context, existing solutions to private Web search can be classified into three main categories. The first two categories (presented in Sections 2.1.1 and 2.1.2, respectively) enable clients to use existing search engines while offering them additional privacy guarantees. The third category (see Section 2.1.3) includes alternative search engines implementing specific privacy-preserving protocols.
2.1.1. Enforcing unlinkability
This category of solutions includes a set of protocols enabling users to send their search queries anonymously to a search engine, thus enforcing unlinkability between the user identity (e.g., IP address) and her query.
The most popular protocol among these solutions is Tor (dingledine2004tor), an implementation of the Onion Routing protocol (goldschlag1999onion). Similarly to Onion Routing, Tor sends each query through multiple nodes using a cryptographic protocol. In this protocol, queries are encrypted using multiple keys of randomly selected nodes (creating an “onion” with multiple layers) and routed through these nodes. Then, each node deciphers the received cipher text (hence removing the outer-most layer of the onion) and forwards it to the next node until the onion reaches the exit node. The exit node retrieves the query and sends it to the search engine on behalf of the user. This protocol assumes the participating relays to faithfully forward the onions, which might not be true as some may behave selfishly (e.g., by dropping onions) or even maliciously (e.g., by injecting fake traffic to slow down the system).
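The layering described above can be illustrated with a minimal sketch. It is a toy model only: the SHA-256-based XOR keystream stands in for the real ciphers, the per-node keys are assumed to be already shared with the client, and the function names (`wrap_onion`, `route`) are hypothetical rather than part of Tor.

```python
import hashlib
import secrets

def keystream(key: bytes, length: int) -> bytes:
    """Derive a pseudo-random byte stream from a key (toy construction, not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xor_layer(key: bytes, data: bytes) -> bytes:
    """Add or remove one layer; XOR with the keystream is its own inverse."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def wrap_onion(query: bytes, path_keys: list) -> bytes:
    """Client side: add the exit node's layer first so the entry node's is outermost."""
    onion = query
    for key in reversed(path_keys):
        onion = xor_layer(key, onion)
    return onion

def route(onion: bytes, path_keys: list) -> bytes:
    """Each node on the path removes exactly one layer with its own key."""
    for key in path_keys:
        onion = xor_layer(key, onion)
    return onion

# Assumed three-hop path: entry, middle, and exit node keys.
path_keys = [secrets.token_bytes(16) for _ in range(3)]
query = b"privacy preserving web search"
onion = wrap_onion(query, path_keys)
recovered = route(onion, path_keys)
```

With XOR the layers happen to commute, so the peeling order is only meaningful for real ciphers; the sketch keeps the order to mirror the description.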
RAC (mokhtar2013rac) overcomes these limitations by enabling anonymous communication in the presence of malicious and selfish nodes. In this protocol, nodes are organized on several virtual rings such that, for a given ring, a node has a predecessor node and a successor node. A node might be part of several rings and thus have multiple predecessors and successors. To ensure that no message is dropped by a freerider, nodes have to broadcast all messages they relay. Broadcast messages have to circulate through all nodes in the ring such that if a node does not receive a message from one of its predecessors, it considers this predecessor as a freerider. The modifications made by RAC come at a performance cost: its throughput is orders of magnitude lower than that of Tor.
Another robust solution to anonymous communication is the Dissent protocol (corrigan2010dissent; wolinsky2012dissent). This protocol enforces accountability in presence of malicious and selfish participants. However, its performance is even worse than the one of RAC as it is a combination of two heavy cryptographic protocols: the dining cryptographers protocol (DC-NET) (chaum1988dining) and a data mining protocol used to permute a set of fixed-length messages with cryptographically strong anonymity (brickell2006efficient).
In addition to the performance issue, protocols enforcing unlinkability have also been shown not to resist re-identification attacks (petit2016simattack). Indeed, the issue comes from the fact that search queries themselves disclose enough information for breaking the unlinkability property.
2.1.2. Enforcing indistinguishability
To protect users against re-identification attacks, solutions enforcing indistinguishability have been proposed. The aim of these solutions is to prevent search engines from distinguishing between a user’s real interests and fake ones, hence protecting her privacy. This is generally achieved either by generating fake queries (e.g., TrackMeNot (howe2009trackmenot), GooPIR (domingo2009h)) or by altering the user’s query (e.g., QueryScrambler (arampatzis2013query)).
TrackMeNot is a Firefox plugin that periodically generates fake queries and sends them to the search engine on behalf of the user, independently of her real queries. Fake queries in TrackMeNot are generated using RSS feeds.
GooPIR introduces fake queries inside the user’s real query. All these queries (i.e., the real one and the fake ones) are separated by the logical OR operator and sent to the search engine. Fake queries in GooPIR are generated by using randomly selected keywords from a dictionary.
QueryScrambler protects users by replacing their queries by semantically related queries. More precisely, for each user query, it generates a set of related queries by generalizing the concepts used in the initial query. Then, by merging and filtering all the results obtained with these related queries, it retrieves the most plausible results for the initial query.
PEAS improves over existing solutions by combining an unlinkability protocol with an indistinguishability protocol. The former is based on two non-colluding proxy servers. The first one handles user identities without having access to their requests, while the second generates fake queries and sends them to the search engine on behalf of the user. To generate fake queries, PEAS uses a co-occurrence matrix built from past user queries.
One of the major limitations of these solutions is that it is still easy to discern fake queries from real ones, as shown by re-identification attacks (petit2016simattack). We highlight this issue in Figure 1. The figure shows the CCDF (i.e., Complementary Cumulative Distribution Function) of the maximum similarity between past queries of the AOL dataset and the fake queries generated by PEAS (i.e., based on the co-occurrence of terms in past queries) and by TrackMeNot (i.e., based on RSS feeds); see Section 5.4 for details on the dataset and the similarity metric. This result shows that in both cases most of the fake queries are significantly different from real queries.
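To make the idea of a co-occurrence matrix concrete, the following sketch counts how often two terms appear in the same past query and then draws a fake query from those counts. The text only states that PEAS builds such a matrix from past queries; the weighted random walk in `fake_query` is an assumption made for illustration, not the generation rule of PEAS, and all names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(past_queries):
    """Count how often two distinct terms appear together in a past query."""
    cooc = defaultdict(lambda: defaultdict(int))
    for query in past_queries:
        terms = sorted(set(query.lower().split()))
        for a, b in combinations(terms, 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc

def fake_query(cooc, length, rng):
    """Hypothetical rule: weighted random walk over co-occurring terms."""
    term = rng.choice(sorted(cooc))
    words = [term]
    while len(words) < length:
        # Only consider neighbours that are not already in the fake query.
        neighbours = {t: c for t, c in cooc[term].items() if t not in words}
        if not neighbours:
            break
        candidates = sorted(neighbours)
        weights = [neighbours[t] for t in candidates]
        term = rng.choices(candidates, weights=weights)[0]
        words.append(term)
    return " ".join(words)

past = ["cheap flights paris", "paris hotel deals", "cheap hotel london"]
cooc = build_cooccurrence(past)
example = fake_query(cooc, 3, random.Random(0))
```

Because the fake query is assembled term by term, it can combine words that never formed a real query together, which is consistent with the limitation discussed next.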
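The measurement behind such a CCDF can be sketched as follows. The similarity metric used in the paper is defined in a later section that is not reproduced here, so the Jaccard similarity over word sets below is a stand-in assumption, and the sample queries are invented for illustration.

```python
def jaccard(a, b):
    """Word-set Jaccard similarity (assumed stand-in for the paper's metric)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def max_similarity(fake, past_queries):
    """Similarity between a fake query and its closest past query."""
    return max((jaccard(fake, p) for p in past_queries), default=0.0)

def ccdf(values, thresholds):
    """Fraction of values strictly greater than each threshold."""
    if not values:
        raise ValueError("ccdf needs at least one value")
    return [sum(v > t for v in values) / len(values) for t in thresholds]

# Invented toy data, not taken from the AOL dataset.
past = ["cheap flights paris", "paris hotel deals"]
fakes = ["cheap flights paris", "stock market news", "paris deals"]
scores = [max_similarity(f, past) for f in fakes]
curve = ccdf(scores, [0.0, 0.5, 0.9])
```

A curve that drops quickly as the threshold grows indicates that few fake queries resemble any real past query.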
2.1.3. Alternative Search Engines
This category of solutions builds alternative search engines, generally based on Private Information Retrieval (PIR), thus enforcing privacy by design. In these systems, users access information stored on the remote server without revealing to the latter what information they access. The only information known by the search engine is that the user has sent a query. In general, PIR protocols consist of three algorithms: the construction of protected queries (keywords are at least encrypted), the execution of the information retrieval (preventing the search engine from accessing the query and its results), and finally the reconstruction of the result list. Part of these algorithms is performed on the clients, the other part on the remote server. These protocols generally rely on heavy and impractical (aguilar2016xpir) cryptographic protocols, especially when the accessed data stores contain millions of documents, the normal case for today’s search engines.
2.2. Open Challenges in Private Web Search
From the analysis of state-of-the-art private Web search solutions, we distinguish two major challenges: one for enforcing unlinkability and one for enforcing indistinguishability. The main open challenge for enforcing unlinkability is to design efficient protocols that resist strong adversaries. Indeed, existing protocols are either efficient but assume honest-but-curious servers (e.g., Tor, PEAS), or robust to malicious adversaries but impractical in terms of performance (e.g., Dissent, RAC).
In terms of indistinguishability, the main open challenge is to better resist re-identification attacks by effectively hiding the original query among fake queries. This requires the generation of realistic fake queries that are as close as possible to real queries.
The remainder of this section shows how to leverage Intel Software Guard Extensions to address the above two challenges.
2.3. Intel Software Guard Extensions
Cloud software runs on multitenant computing nodes, remotely maintained by third parties. From the clients’ point of view, the environment in which their software runs remotely can be compromised in several ways. The third party or the person in charge of managing its hardware may be malicious. System managers have total access privileges on their hardware and can potentially access or tamper with any stored information. Besides, the remote machine may run a compromised operating system, possibly controlled by another (malicious) tenant. It is therefore hard to trust software running in the cloud.
Homomorphic encryption (Gentry:2009:FHE:1536414.1536440) is an appealing solution for untrusted environments. A user encrypts data and sends it to an untrusted server, which can still process the ciphertext without having access to its content. The algorithms proposed so far prove that the concept is sound, but they remain impractical because of their enormous complexity. Preliminary yet partial solutions promise to improve the current situation (Naehrig:2011:HEP:2046660.2046682). To overcome this limitation, several hardware manufacturers extended their architectures with some form of trusted execution environment (TEE). In a nutshell, a TEE can certify what software it runs, and data stored inside it can only be accessed by its own software. With TEEs, users do not need to trust the infrastructure provider’s execution environment, because it can do no harm to their data; they only need to trust the TEE manufacturer.
We use a TEE to ensure the confidentiality and integrity of the X-Search proxy. It is the responsibility of the client to ensure that a certified proxy is running within a trustworthy TEE. The communication between client and proxy is then encrypted, and the user’s real interests are only accessible in the client domain and inside the TEE. In the following, we present Intel’s SGX (mckeen2013innovative), our platform of choice and TEE to implement the X-Search proxy.
Intel calls a TEE created with SGX an enclave. Enclaves are created and destroyed using specific privileged system calls. When an enclave is created, SGX allocates a memory region that is protected from all accesses from outside the enclave itself, including the kernel, the hypervisor and peripheral DMA. Applications can interact with enclaves via procedure calls, in both directions. Parameters and results are copied into and out of enclave memory when a call crosses the enclave border. Intel offers a software development kit to define and handle in- and out-calls and to manage the enclaves’ lifecycle.
The CPU keeps a page cache for each enclave and ensures that each page is assigned to exactly one enclave. System software, although untrusted, is responsible for assigning pages to enclaves. An initial set of pages is prepared by the system software, which loads unencrypted data and code into enclave pages. The CPU keeps a cryptographic hash of the memory pages assigned to each enclave. After all initial pages are loaded into the enclave, the system software issues an instruction to mark the enclave as initialized. At that moment the memory hash, or measurement hash, is computed. From this point on, loading unencrypted pages is disabled and application software can enter the protected environment through the enclave interface.
SGX offers instructions for managing keys and for signing certificates of an enclave. Communication between a remote entity and an enclave is done through a local, untrusted software proxy. The enclave can send its certificate to the remote entity, which can then verify it with an appropriate authority. An authentic certificate and a correct measurement hash attest that the correct program has been loaded inside an authentic enclave. This process is known as attestation. As certificates are signed within enclaves, remote entities can verify that they were neither forged nor modified by an untrusted proxy, and trusted channels can be built (using untrusted components).
Access to enclave memory is prevented by hardware, and all enclaves in a processor together can use up to approximately 90MB of protected memory called the EPC (enclave page cache). Paging can still be used to access larger address spaces. Enclave data residing in the processor’s internal cache is hashed and encrypted before being flushed to the EPC. Memory checks are made through a chain of stateful hash codes using random numbers created every time a page is encrypted. The chain is stored in untrusted memory, and its root is kept in the CPU, inaccessible from outside, which prevents any tampering attacks in memory, including replay. Paging is completely handled by untrusted software, in the local operating system.
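The role of the measurement hash in attestation can be sketched in a few lines. This is a deliberately simplified model: the hash here covers only the contents of the initial pages, the certificate signature check with the attestation authority is omitted, and `measure` and `attest` are hypothetical names rather than SGX instructions or SDK calls.

```python
import hashlib
import hmac

def measure(pages):
    """Hash the initial enclave pages in load order (simplified measurement)."""
    digest = hashlib.sha256()
    for page in pages:
        # Prefix each page with its length so page boundaries affect the hash.
        digest.update(len(page).to_bytes(8, "big"))
        digest.update(page)
    return digest.hexdigest()

def attest(reported_measurement, expected_measurement):
    """Remote party: accept the enclave only if it runs the expected code."""
    return hmac.compare_digest(reported_measurement, expected_measurement)

# The client knows the measurement of the proxy code it is willing to trust.
trusted_pages = [b"proxy code", b"initial data"]
expected = measure(trusted_pages)

genuine = attest(measure([b"proxy code", b"initial data"]), expected)
tampered = attest(measure([b"patched code", b"initial data"]), expected)
```

In this model, any change to the loaded code or data yields a different measurement, so the remote party rejects the enclave before sending it any query.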
2.4. Improving Security with SGX
SGX has been successfully used to improve the security and privacy of other systems. The code attestation mechanism, coupled with the trusted environment, provides an assurance that can enforce security guarantees in a plethora of systems, a few of which are described next.
Hoekstra et al. (hoekstra2013using) show how SGX improves the security of sensitive code and data within three scenarios. First, they use enclaves on the client side to store secrets shared with financial institutions, and to generate one-time passwords based on such secrets. Second, they present an enterprise-grade digital rights management system that stores document encryption keys within user enclaves. Such keys are distributed on demand, and discarded by the enclaves after use. The documents pass through the enclave for decryption, which in turn generates encrypted bitmaps using the GPU symmetric key. Third, they describe a video-conferencing application with IP-connected enclaves that exchange encrypted media content and interact with the local hardware using encrypted protocols. These systems prevent malicious software (including high-privilege software) from gaining access to the private data.
Verifiable confidential cloud computing (VC3) is a MapReduce implementation with data confidentiality and integrity for both code and data that guarantees that the distributed computation globally ran correctly to completion and was not tampered with (schuster2015vc3). To execute map and reduce tasks, VC3 instantiates enclaves with encrypted code in them. It implements a key distribution protocol that guarantees that any enclave contributing to the job runs the correct code and shares the necessary keys for decrypting code and data. All data sent to tasks is encrypted, as is all data produced by the tasks. Mapper and reducer tasks generate extra encrypted hashes that are used to verify that they properly processed all their input data. Leveraging enclaves, VC3 supports a threat model with powerful adversaries that may control all cloud software and hardware, except for the physical processors used in the tasks’ computations.
SCBR (secure content-based routing) implements a content-based publish/subscribe engine (pires2016secure) where all message filtering is done inside secure enclaves. All messages are encrypted when outside enclaves, and the filters operate on plaintext headers. It uses a hybrid encryption scheme with different keys for header and payload to avoid sending all data through the enclave boundary. This improves performance and reduces the enclave memory footprint. An experimental evaluation shows that SCBR adds small overheads when compared to insecure plaintext matching outside enclaves.
Kim et al. (kim2015first) explored the possibility of using enclaves to provide security and privacy in network applications. They first demonstrate how to use enclaves to prevent software-defined inter-domain routers from disclosing their routing policies, and how the Tor anonymity network (dingledine2004tor) can be strengthened by having its directory authorities attest each other. Attackers can still launch denial-of-service attacks but they cannot alter the directory behavior. Also, by putting onion routers within enclaves, their integrity can be attested and their admission can be done automatically, so directory authorities can be eliminated and the routers can simply keep track of their membership in a distributed hash table. Finally, they present how enclaves can be used to securely introduce in-network functionality into TLS sessions.
Recent work investigates the resilience of SGX enclaves against side-channel attacks (Weichbrodt2016; xu2015controlled). This problem is orthogonal to the one investigated in this paper, and is thus considered out of scope.
3. Adversary model
As further detailed in the following section, the protocol presented in this paper involves three parties: the client side, the X-Search proxy nodes running on cloud platforms, and the search engine.
We assume that the code and the platform on which client nodes run are trusted. Then, as further presented in the following section, our protocol relies on X-Search proxy nodes running on public cloud platforms. We assume that these nodes are untrusted and can behave in a Byzantine manner (lamport1982byzantine), that is, they can arbitrarily deviate from a correct behaviour (i.e., they can be subject to a failure, a bug or even behave maliciously). Finally, we assume that the search engine is honest but curious (Goldreich:2003:CCP:966037.966044). This means that the search engine behaves correctly when it comes to fetching answers to a specific request, but it may collect and exploit in all possible ways the information it receives from clients. In particular, we assume that the search engine was able to collect, as preliminary information about each user in the system, a set of past queries. This preliminary information is stored in user profile structures.
Moreover, we also assume that if the search engine identifies that the client is relying on a private Web search mechanism (e.g., an anonymous communication protocol or X-Search), it may run state-of-the-art re-identification attacks (e.g., (Gervais:2014:QWP:2660267.2660367)) in order to re-associate the received request with a known user profile. We further assume that the search engine may collude with proxy nodes (e.g., Tor relays or proxy nodes in X-Search) in order to learn more information about the anonymous client.
We start this section by presenting an overview of our X-Search protocol (Section 4.1). Then, we detail how the unlinkability is ensured (Section 4.2). Finally, we introduce the obfuscation and filtering mechanisms used to provide indistinguishability (Section 4.3).
4.1. Protocol Overview
To efficiently protect users during Web search, X-Search combines unlinkability and indistinguishability. As previously discussed in Section 2, these two schemes are complementary, as the former hides the identity of the requesting user while the latter hides her query. Figure 2 depicts the architecture and the execution flow of X-Search. Specifically, the user interacts with the search engine through an X-Search proxy node hosted on untrusted public cloud services. We assume the X-Search proxy to be deployed on physical nodes with available SGX instructions, a scenario that we expect to be common in the near future.
As this proxy node acts as an intermediate node between the search engine and the user, it hides the user identity (i.e., her IP address). The proxy node is also in charge of obfuscating the user queries, and filtering the results returned from the search engine before forwarding them back to the user.
More precisely, the user starts by sending her query to the X-Search proxy (Figure 2 – ❶). Then, the proxy node generates a new obfuscated query. To achieve that, the proxy retrieves random past queries (❷) and aggregates them with the original query in a random order using the logical OR operator. Next, the proxy stores the initial query in the table of past queries (❸) and sends one single obfuscated query to the search engine (❹). The search engine is queried by the proxy without using end-to-end encryption (HTTPS could also be supported by the SGX enclave). Contrary to state-of-the-art indistinguishability protocols, X-Search uses past queries sent by real users as fake queries. This yields fake queries that are effectively indistinguishable from the user’s real one. This is possible because past queries are securely stored inside the TEE with no correlation to the identity of their originating users, which prevents any malicious entity from exploiting them.
As the obfuscated query can alter the results returned by the search engine, e.g., by mixing results for the original query with results for the additional aggregated past queries, the proxy node includes a filtering step. Once the search engine sends back the results to the X-Search proxy (❺), the filtering removes the results returned by the search engine that are not associated with the original query. Finally, the remaining results are returned to the user (❻). These results are rewritten by the proxy to remove, for instance, any URL redirection used for analytics.
We note that the X-Search proxy node does not maintain individual profile structures associated with each user. Instead, it only updates a table containing the most recent past queries. To improve performance, the proxy uses multiple threads. The query table is kept in memory and shared among all threads. Moreover, the user sends her query to the proxy node through an encrypted tunnel with an end point inside the SGX enclave. Consequently, the original query is protected from the client all the way into the TEE of the proxy node. Once it leaves the proxy, in flight toward the search engine, the original query of the user is protected by the obfuscation mechanism.
4.2. Enforcing Unlinkability
The X-Search system offers end users search unlinkability by relying on a query broker. This broker runs within the client’s domain, for example as a local daemon process executing alongside the client’s Web browser. The broker is in charge of the SGX attestation step. When the user issues a Web search query, her Web client first connects to the local broker. Then, the broker encrypts the request and forwards the ciphertext to an X-Search node hosted by an untrusted cloud provider. The X-Search node receiving the ciphertext generates the obfuscated query, as further detailed in the following section. Before the obfuscated query is sent out, the original one is securely stored in the SGX reserved memory. When the search engine sends back the response to the X-Search node, the latter filters out the relevant results, i.e., those related to the original user query, encrypts them and delivers them back to the broker. Finally, the broker decrypts the result and delivers it to the Web client.
4.3. Enforcing Indistinguishability
To enforce indistinguishability, X-Search relies on an obfuscation mechanism. This mechanism (Algorithm 1) aims at hiding the user queries among multiple fake queries. More precisely, the proposed obfuscation mechanism randomly aggregates the original query with fake queries separated with logical OR operators (lines 2–8). These fake queries come from the table of past queries maintained in the private memory of the X-Search proxy (Algorithm 1, variable ). Indeed, to avoid building irrelevant fake queries and possibly easily identifiable by the adversary as fake (as discussed in Section 2.2), the obfuscation mechanism of X-Search leverages real past queries chosen at random. Using real past queries ensures that each sub-query of the obfuscated query can be mapped by an adversary conducting a re-identification attack to an existing user profile, thereby making the task of re-identification more complicated to perform.
As an SGX enclave has approximately 90MB of private memory (Section 2.3), we need to bound the memory usage of the X-Search proxy by limiting the size of to only keep the last queries sent by users. This size limitation acts as a sliding window where only the most recent queries are exploited. Once the obfuscated query is generated, the initial query is stored in the history (line 9).
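The obfuscation step and its bounded history can be sketched as follows. This is a minimal sketch, not the paper's Algorithm 1: the class name, the parameter names `history_size` and `k`, and the cold-start behaviour (using fewer fake queries while the history is still short) are assumptions, and a deployment would draw randomness from a cryptographically secure source such as `random.SystemRandom`.

```python
import random
from collections import deque

class QueryObfuscator:
    def __init__(self, history_size, k, rng=None):
        # deque(maxlen=...) acts as the sliding window over the most recent queries.
        self.history = deque(maxlen=history_size)
        self.k = k
        self.rng = rng if rng is not None else random.Random()

    def obfuscate(self, query):
        """Return the OR-aggregated query and the fake queries that were used."""
        # Assumed cold-start rule: use fewer fakes if the history is still short.
        n_fakes = min(self.k, len(self.history))
        fakes = self.rng.sample(list(self.history), n_fakes)
        parts = fakes + [query]
        self.rng.shuffle(parts)  # hide the position of the original query
        # The original query joins the history only after the fakes are drawn.
        self.history.append(query)
        return " OR ".join(parts), fakes

obfuscator = QueryObfuscator(history_size=3, k=2, rng=random.Random(1))
first, first_fakes = obfuscator.obfuscate("a b")
obfuscator.obfuscate("c d")
obfuscator.obfuscate("e f")
fourth, fourth_fakes = obfuscator.obfuscate("g h")
```

The proxy needs to remember which sub-queries were fake for the later filtering step, which is why the sketch returns them alongside the obfuscated query.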
This obfuscation mechanism impacts the results returned by the search engine. Indeed, the results of the search engine contain a mix of answers corresponding to all the aggregated queries (i.e., the fake queries and the initial one). Consequently, the X-Search proxy filters the returned results to remove those that are not related to the initial query. To perform this filtering step, the X-Search node exploits the initial query and the associated fake queries. Algorithm 2 describes this filtering process. For each result from the result set, the algorithm determines whether it corresponds to the initial query as follows. A similarity score is assigned to each query (lines 5–6) based on the title and the description of the result. The function nbCommonWords() computes the number of common words between a query and an element. A result is considered related to the initial query, and hence forwarded to the user, if the initial query has the largest score (lines 7–8).
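The filtering rule can be sketched as follows. It follows the description above rather than the exact listing of Algorithm 2: the result format (a dictionary with `title` and `description` keys), whitespace tokenization, and keeping a result when the initial query ties with a fake query are all assumptions.

```python
def nb_common_words(query, text):
    """Number of distinct words shared by a query and a piece of text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def filter_results(initial_query, fake_queries, results):
    """Keep the results for which the initial query obtains the largest score."""
    kept = []
    for result in results:
        text = result["title"] + " " + result["description"]
        initial_score = nb_common_words(initial_query, text)
        best_fake = max((nb_common_words(f, text) for f in fake_queries), default=0)
        # Assumption: ties are resolved in favour of the initial query.
        if initial_score >= best_fake:
            kept.append(result)
    return kept

results = [
    {"title": "Paris hotels", "description": "cheap hotel deals in paris"},
    {"title": "Python tutorial", "description": "learn python programming"},
]
kept = filter_results("paris hotel", ["python programming"], results)
```

Here the first result shares two words with the initial query and none with the fake one, so it is kept, while the second result matches only the fake query and is dropped.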
5. Experimental setup
In this section we present the experimental setup we used to evaluate X-Search. This comprises: the dataset we used, the comparison baselines we compared against, the evaluation methodology and the metrics used to assess the performance of X-Search.
5.1. Web Search Dataset
To assess X-Search, we use a real-world Web search dataset from the AOL query logs (pass2006picture). This dataset contains approximately 21 million queries, formulated by 650,000 unique users over three months (from March to May of 2006). For the sake of comparison, we use the same methodology as described in (petit2015peas) and focus our evaluation on the 100 most active users, as they are the most exposed to an adversary willing to unveil their identities. Indeed, the most active users have exposed more preliminary information to the search engine through their past querying activity. To reflect this preliminary information collected by the search engine, we built an off-line profile for each user. To do so, we split the dataset into a training set used to build these user profiles, and a testing set used to apply X-Search and evaluate its privacy. The training set contains two thirds of the user queries and the testing set the remaining third.
5.2. Comparison Baselines
We compare the robustness and quality of X-Search against two state-of-the-art baselines, namely Tor (dingledine2004tor) and PEAS (petit2015peas). As described in Section 2.1.1, Tor leverages a proxy chain to provide unlinkability. More precisely, this solution uses encryption schemes to hide the identity of a user from the search engine. PEAS, in turn, combines unlinkability and indistinguishability by hiding the identity of the requesting user as well as obfuscating the original query with fake queries. Specifically, the unlinkability property is ensured by a proxy composed of two trusted nodes relaying the original queries, while the obfuscation is performed locally on the client by aggregating fake queries with the original one in random order. These fake queries are generated from the graph of co-occurrence between terms in the history of user queries. Lastly, we also consider a Direct baseline, in which users send their queries directly to the search engine without any protection. We do not compare X-Search against PIR-based solutions because they require crypto-based search engines.
This section presents the methodology adopted to evaluate X-Search. We assess X-Search along three dimensions: the offered privacy (i.e., the protection of users’ queries), the achieved accuracy (i.e., the quality of the results returned by X-Search), and the pure system performance (i.e., the efficiency of X-Search in terms of throughput, latency and memory usage).
To evaluate privacy, we leverage SimAttack (petit2016simattack), a re-identification attack whose code is available and that has been shown to outperform previous attacks, including a machine learning attack presented in (peddinti2014web). To run this attack, we assume that the attacker holds a set of user profiles built from the learning part of the dataset. We then protect each query of the testing part using X-Search before sending it to the search engine. Finally, for each obfuscated query, the attack tries to re-identify both the requesting user and the initial query among the fake ones.
More precisely, SimAttack is based on a similarity metric that characterizes the proximity between a query and a user profile. This profile represents the preliminary information about a user collected by the adversary, which can be viewed as the history of that user's queries before they protected their Web search activities. In our case, the profile of a user contains the queries that belong to the training set of that user. The similarity metric used by SimAttack computes the cosine similarity between the query and every query in the user profile, and returns the exponential smoothing of these similarities ranked in ascending order. We set the smoothing factor empirically to the value that provides the best performance.
To re-identify the initial query and its requester from an obfuscated query of X-Search, we compute the similarity metric for each sub-query embedded in the obfuscated query and each user for whom the adversary has a profile. If exactly one (query, user) pair obtains the highest similarity, SimAttack returns this pair as the initial query and the initial requester. Otherwise, the attack is unsuccessful.
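The attack logic described above can be sketched in Python as follows. This is a simplified illustration and not the SimAttack implementation: representing queries as bag-of-words vectors, the function names, the `alpha` parameter for the smoothing factor, and treating an all-zero similarity as a failed attack are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two queries seen as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
        math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def profile_similarity(query, profile, alpha):
    """Exponentially smooth the cosine similarities, taken in ascending order."""
    sims = sorted(cosine(query, past) for past in profile)
    smoothed = 0.0
    for i, s in enumerate(sims):
        smoothed = s if i == 0 else alpha * s + (1 - alpha) * smoothed
    return smoothed


def reidentify(sub_queries, profiles, alpha):
    """Return the unique (sub-query, user) pair with the highest similarity, else None."""
    scores = {(q, user): profile_similarity(q, profile, alpha)
              for q in sub_queries for user, profile in profiles.items()}
    best = max(scores.values(), default=0.0)
    winners = [pair for pair, score in scores.items() if score == best]
    # Several maxima, or no similarity at all (assumption), count as a failure.
    return winners[0] if len(winners) == 1 and best > 0 else None
```

Because the similarities are smoothed in ascending order, the largest ones are processed last and therefore weigh the most in the final score.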
The obfuscation mechanism of X-Search (i.e., adding past queries) impacts the results returned by a search engine. Consequently, we evaluate the capacity of X-Search to filter results not related to the initial query before forwarding them back to the user. To achieve that, for a given initial query, we compare results returned by the search engine for this query and the results returned for the associated obfuscated query after the filtering step.
Our experiments use the Bing search engine. Search queries are directed to the http://www.bing.com/search?q= address. As the OR operator implemented by Bing only works with single-word queries, we simulated the execution of an obfuscated query by submitting each of its sub-queries independently and merging the result sets. To stay within the daily query limit imposed by Bing, for each number of fake queries we run the experiment on a random subset of the testing set composed of 100 queries. Unless otherwise specified, we consider the first 20 results in our accuracy-related experiments.
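The simulation of an obfuscated query can be sketched as below. The `search` callable is a hypothetical stand-in for whatever client queries the search engine, the `url` key used for de-duplication is an assumption, and applying the top-20 cut-off to each sub-query separately is one possible reading of the setup rather than a documented detail.

```python
def search_obfuscated(sub_queries, search, top_n=20):
    """Submit every sub-query separately and merge the results without duplicates."""
    merged, seen = [], set()
    for sub_query in sub_queries:
        # `search` is a hypothetical callable returning a ranked list of results.
        for result in search(sub_query)[:top_n]:
            if result["url"] not in seen:
                seen.add(result["url"])
                merged.append(result)
    return merged
```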
To evaluate the performance of X-Search from a system perspective, we implemented a fully functioning prototype. Our implementation uses C++ and relies on the Intel SGX SDK (v1.8) libraries and tools (intel-sdk). The prototype is deployed on a machine with an Intel® Core™ i7-6700 processor (intel:i7_6700) and RAM, running Ubuntu 14.04.1 LTS (kernel 4.2.0-42-generic).
The main performance bottlenecks when using Intel SGX are known to be the transitions between trusted and untrusted modes (inside/outside the enclaves) and the intensive usage of memory. The latter has two stages: (i) exceeding the processor's last cache level, which requires cache eviction and the consequent cryptographic and integrity checks; and (ii) exceeding the EPC size, which triggers memory swaps scheduled by the underlying operating system. Excessive memory usage can be caused by the management of the past queries inside the enclave's protected memory. We evaluate this aspect of X-Search in Section 6.3. Furthermore, in order to avoid unnecessary and costly mode transitions, we limit the enclave interface to essential operations that deal with sensitive information. Procedure calls made by the untrusted code into the enclave are called ecalls (enclave calls), whereas the ones made by the enclave's trusted code are called ocalls (outside calls). The enclave interface offered by the X-Search node is as follows:
| Call | Description |
|---|---|
| init( parameters ) | Set up options for X-Search. |
| request( sock, buff, len ) | Provision of data to the enclave, coming from the given socket. |
| sock connect( host, port ) | Performs the DNS lookup and connection to the server, returns the socket file descriptor. |
| send( sock, buff, len ) | Sends data through the given socket. |
| recv( sock, buff, len ) | Receives data from the given socket. |
| close( sock ) | Closes the socket file descriptor. |
We measured the system capacity by observing latency for increasing throughput configurations when X-Search was configured to reply immediately to requests. Memory usage was assessed by populating the past queries store inside the enclave with a real dataset and observing its occupancy. Finally, we measured response times considering the complete chain, including the search engine delays. Results are described in Section 6.3.
We consider three types of metrics in our evaluation. The privacy metric measures the level of protection offered by X-Search and its ability to preserve the users’ privacy. The accuracy metric, in turn, assesses the quality of the query results provided to users according to their original queries. Lastly, system metrics evaluate the performance and the effectiveness of our solution.
To assess privacy, we consider the re-identification rate, i.e., the proportion of protected queries for which the attack retrieves both the content of the initial query and the identity of the associated user. The re-identification rate is defined as follows:

$$\text{re-identification rate} = \frac{|Q_r|}{|Q|}$$

where $Q_r$ is the set of correctly re-identified queries (i.e., re-identification of both the initial query and the associated user), while $Q$ is the set of original queries sent by users. This metric takes values in $[0, 1]$, where $0$ represents the best solution (i.e., no re-identification) and $1$ represents the worst solution (i.e., all queries are re-identified).
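As a small worked illustration of this metric, the following Python sketch computes the rate from the attack output. The function name and the data layout (one guess per protected query, `None` when the attack fails) are assumptions made for the example.

```python
def reidentification_rate(guesses, truths):
    """Fraction of protected queries for which both query and user were recovered."""
    if not truths:
        raise ValueError("at least one original query is required")
    # A guess counts only if the (query, user) pair matches exactly.
    correct = sum(1 for guess, truth in zip(guesses, truths) if guess == truth)
    return correct / len(truths)


truths = [("q1", "u1"), ("q2", "u2"), ("q3", "u3"), ("q4", "u4")]
guesses = [("q1", "u1"), None, ("q3", "u9"), ("q4", "u4")]
rate = reidentification_rate(guesses, truths)  # 2 correct pairs out of 4
```

In this example the third guess recovers the query but not the user, so it does not count as a re-identification.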
The evaluation of the accuracy consists in comparing the list of results associated with the original query and the list of results returned for the obfuscated query, which aggregates the original query and fake ones. To measure the accuracy, we consider the precision (i.e., correctness) and the recall (i.e., completeness) as defined below:

$$\text{precision} = \frac{|R_o \cap R_x|}{|R_x|}, \qquad \text{recall} = \frac{|R_o \cap R_x|}{|R_o|}$$

where $R_o$ is the set of results returned by the search engine for the original query, and $R_x$ the set of results returned by X-Search. Both metrics are in $[0, 1]$. The best accuracy is obtained with a precision and a recall of $1$.
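A short Python sketch of these two metrics, applied to sets of result identifiers, is given below. The function name, the use of URLs as identifiers, and the convention of returning 0.0 for an empty set are assumptions for illustration.

```python
def precision_recall(original_results, xsearch_results):
    """Precision and recall of the X-Search results against the original results."""
    original, returned = set(original_results), set(xsearch_results)
    common = original & returned
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = len(common) / len(returned) if returned else 0.0
    recall = len(common) / len(original) if original else 0.0
    return precision, recall


precision, recall = precision_recall(
    ["u1", "u2", "u3", "u4"],  # results for the original query
    ["u1", "u2", "u5"],        # results forwarded after filtering
)
```

Here two of the three forwarded results are correct (precision 2/3), and two of the four original results are recovered (recall 1/2).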
5.4.3. System Metrics
To evaluate the behavior of X-Search from a systems perspective, we consider the following metrics. First, we measure the throughput (requests/second) to assess the scalability of X-Search by measuring its capability to operate properly (adequate response times) even with a growing number of users requesting the service. Second, looking at occupancy (in MB) using a memory profiler we assess the efficiency of our working prototype. Finally, we look at the latency to serve the search results back to the users once they send their queries.
This section presents the experimental evaluation of X-Search over three dimensions: the privacy, the accuracy and the system performance, respectively described in Sections 6.1, 6.2, and 6.3. Our evaluation draws the following conclusions: (1) X-Search better resists a state-of-the-art re-identification attack, (2) it has a limited impact on the accuracy of the results returned to users, and (3) system-wise, it outperforms its competitors, sometimes by orders of magnitude.
We start by evaluating the capacity of X-Search to preserve user privacy and to improve user protection compared to PEAS. To this end, we measure the robustness of X-Search against a classical re-identification attack. Figure 3 shows the re-identification rate for PEAS and X-Search for different numbers of fake queries. Results with no fake query represent the re-identification rate for a solution enforcing only unlinkability (e.g., Tor). In this case (i.e., without query obfuscation), an adversary using only the history of user queries as preliminary information is able to re-associate a sizable share of novel queries to their originating user (Figure 3). This confirms that unlinkability solutions alone are not sufficient to effectively protect users against re-identification attacks.
Adding only one fake query already lowers this re-identification rate, and more so for X-Search than for PEAS (Figure 3). This difference comes from the fake query generation process. Indeed, using real past queries makes X-Search more robust to the re-identification attack, as all sub-queries of the obfuscated query can be mapped to past queries of other users, which creates confusion on the attacker side. In contrast, generating fake queries based on the co-occurrence of terms does not guarantee that PEAS builds fake queries closer to a user profile than the original one.
The re-identification rate decreases as the number of fake queries increases. For all numbers of fake queries, X-Search provides better protection to the users (i.e., a lower re-identification rate) than PEAS. The size of this improvement depends on the number of fake queries (Figure 3).
The accuracy of X-Search can be measured by evaluating the impact of the obfuscation and the filtering mechanisms on the search results returned to users. Specifically, we study whether the filtering mechanism is able to remove results related to the fake queries while keeping the ones related to the initial query. Figure 4 depicts the precision and the recall of X-Search for an increasing number of fake queries. As expected, these curves show that both the recall and the precision slightly decrease as the number of fake queries grows. However, the results returned to users are still accurate. The recall indicates that most of the results returned to users with X-Search are the same as the ones returned if the original query was sent directly to the search engine. Moreover, the precision indicates that only a small share of the results returned to users can be associated with a fake query rather than with the initial query. These numbers confirm that X-Search preserves the quality of the results returned by the search engine.
6.3. System Performance
We evaluate the system performance of X-Search to answer the following questions: (1) is our implementation fast? (2) is it memory-efficient and can it be executed within the current SGX memory limitations? and (3) is it usable and responsive to end-users?
We begin by looking at the throughput/latency ratio of the X-Search proxy. To perform this experiment, we iteratively increase the rate at which requests are directed toward the X-Search proxy, until the point where the latency to handle each request becomes too high. For this experiment, we rely on the wrk2 workload generator (wrk2) to measure the throughput and latency based on the request rates issued to the X-Search proxy. Note that these measurements are taken without actually hitting the web search engine, to better understand the saturation point of the proxy. We compare against Tor and PEAS. (Note that PEAS and Tor require custom clients to forge messages following their protocol, whereas X-Search can be used with third-party clients issuing regular HTTP requests, such as wget or curl.) These results are presented in Figure 5. We plot the number of requests per second and the observed latency per request on the x-axis and y-axis, respectively. Due to the different magnitudes of performance, this plot uses a log-log scale.
We observe that X-Search scales well and keeps sub-second latencies at request rates well beyond those sustained by PEAS, whose latency deteriorates much faster (Figure 5). In our experiments, Tor performs very poorly: it handles as few as 100 requests/sec at an average reply latency of 8.86 milliseconds, around 10× slower than X-Search. This result confirms that our implementation is fast and scalable.
Next, we investigate how much memory is required by the obfuscation scheme. For this experiment, we used a much larger dataset than the one described in Section 5. Specifically, we use all the 6 million unique queries available in the AOL dataset. We leverage Valgrind's Massif (Seward:2008:VAD:1796426) to trace and profile the heap memory allocations executed by the xsearch process. Figure 6 presents our result. Observing the trend of the X-Search curve, it is clear that the EPC size is largely sufficient to store at least 1M queries, a number that can easily support the obfuscation mechanism.
We complete this part of the evaluation by assessing the user-perceived performance of the system, i.e., the end-to-end latency of a Web query from its submission to the reception of the results. Due to the rate limiting schemes adopted by the Bing search engine, in this experiment we only issue 100 queries, picked at random from the AOL dataset. We compare the observed latency in three different scenarios: (1) the client contacting the web search engine directly (hence without any privacy guarantees), (2) the same set of queries being routed via the Tor network, and finally (3) using X-Search. Figure 7 presents the results as a Cumulative Distribution Function (CDF) of the measured round-trip network latencies. We can observe that X-Search allows for much faster replies than Tor, both at the median and in the tail of the distribution. The results over the Tor network, measured at the time of our experiments (May 2017), are surprisingly bad from a user perspective. (We could not conduct a similar experiment using PEAS due to a bug in the code.) The Tor network largely exceeds well-known usability margins (Palmer:2002:WSU:767837.769618), while X-Search offers a usable and secure browsing experience.
User behavior tracking by major service providers is one of the main privacy threats in today's Internet. This is particularly the case with search engines, as they are among the most widely used online services and search queries reveal sensitive information about individual users, such as their age, sex, or religious or political preferences. Solutions exist in the literature for enabling users to access Web search engines in a privacy-preserving way. However, these solutions either do not resist malicious adversaries or are robust but perform poorly.
In this paper, we proposed a novel architecture for privacy-preserving Web search, which relies on a trusted execution environment (Intel SGX) to support stronger adversarial models than existing solutions. Our system, X-Search, operates as a proxy which stores and leverages user past queries within a protected SGX enclave and generates obfuscated queries on behalf of the user. It does so by aggregating random past queries in such a way that the search engine is not able to distinguish which one is the original query, but still provides relevant results for the user. Upon receiving a response from the search engine, the X-Search proxy filters results to only forward those related to the initial query.
We have implemented a working prototype and evaluated it both analytically and experimentally using real-world datasets. Our observations indicate that X-Search can indeed provide accurate results without disclosing personal information about individual users. Most importantly, X-Search does so with a throughput that is orders of magnitude higher than its competitors, i.e., the PEAS and Tor protocols.
The research leading to these results has received funding from the European Commission, Information and Communication Technologies, H2020-ICT-2015 under grant agreement number 690111 (SecureCloud project). Rafael Pires is also sponsored by CNPq, National Council of Technological and Scientific Development, Brazil.