Anonymising Queries by Semantic Decomposition

09/12/2019 · Danushka Bollegala et al. · University of Liverpool

Protecting the privacy of search engine users is an important requirement in many information retrieval scenarios. A user might not want a search engine to guess his or her information need despite requesting relevant results. We propose a method to protect the privacy of search engine users by decomposing the queries using semantically related and unrelated distractor terms. Instead of a single query, the search engine receives multiple decomposed query terms. Next, we reconstruct the search results relevant to the original query term by aggregating the search results retrieved for the decomposed query terms. We show that the word embeddings learnt using a distributed representation learning method can be used to find semantically related and distractor query terms. We derive the relationship between the anonymity achieved through the proposed query anonymisation method and the reconstructability of the original search results using the decomposed queries. We analytically study the risk of discovering the search engine users' information intents under the proposed query anonymisation method, and empirically evaluate its robustness against clustering-based attacks. Our experimental results show that the proposed method can accurately reconstruct the search results for user queries, without compromising the privacy of the search engine users.


1. Introduction

Information retrieval systems have become essential tools in our day-to-day activities. We constantly search for information on the Web using search engines, issuing keywords that describe our information needs. However, we might not always want the search engine to discover our intent through the keywords we use in a search session. For example, a patient with a particular disease might want to use a web search engine to find information about that disease, but at the same time might not want to disclose his/her health conditions.

As web search engine users, we are left with two options regarding our privacy. First, we can trust the search engine not to disclose the keywords that we use in a search session to third parties, nor to use them for any purpose other than providing search results to the users who issued the queries. However, the user agreements of most web search engines do not grant such rights. Although search engines pledge to protect the privacy of their users by encrypting queries and search results (https://goo.gl/JSBvpK), the encryption is between the user and the search engine: the original non-encrypted queries are still available to the search engine. The keywords issued by the users are a vital source of information for improving the relevance of the search engine and for displaying relevant adverts to the users. For example, in learning to rank (He et al., 2008), the keywords issued by a user and the documents clicked by that user are recorded by the search engine to learn the optimal dynamic ranking of the search results. Query logs have been used extensively to learn user interests and to extract attributes related to frequently searched entities (Pasca, 2014; Sadikov et al., 2010; Santos et al., 2010; Richardson, 2008; Pasca, 2007). Considering that placing advertisements for highly bid keywords is one of the main revenue sources for search engines, there are obvious commercial motivations for search engines to exploit user queries beyond simply providing relevant search results. For example, it has been reported that advertisements contribute 96% of Google's revenue (https://www.wordstream.com/articles/google-earnings). Therefore, it would be unwise to assume that user queries will not be exploited in a manner unintended by the users.

As an alternative approach that does not rely on the goodwill of the search engine companies, we propose a method (shown in Figure 1) in which we disguise the queries sent to a search engine such that it is difficult for the search engine to guess the real information need of the user by looking at the keywords, yet it remains possible for the user to reconstruct the relevant search results from what the search engine returns. The proposed method does not require any encryption, nor blind trust in the search engine companies or any third-party mediators. However, this is a non-trivial task because a search engine must be able to recognise the information need of a user in order to provide relevant results in the first place. Therefore, query anonymisation and the relevance of search results are in a direct trade-off.

Figure 1. Overview of the proposed method. The original query q is decomposed at the user end into a set of noisy related terms R and a set of distractor terms D. The search engine returns documents relevant to both R and D, denoted by 𝒟(R) and 𝒟(D). We ignore 𝒟(D) and reconstruct the search results for q using 𝒟(R).

Specifically, given a user query q, our proposed method first finds a set of noisy related terms R and a set of distractor terms D for q. We use pre-trained word embeddings for identifying the noisy related and distractor terms. We add Gaussian noise to the related terms such that it becomes difficult for the search engine to discover q using R. However, because R is derived from q, there is a risk that the search engine could perform some form of de-noising to unveil q from R. Therefore, using R alone as the keywords does not guarantee anonymity. To mitigate this risk, we generate a set of distractor terms D separately for each user query. We then issue the terms in R ∪ D in random order to the search engine and retrieve the corresponding search results. We then reconstruct the search results for q using the search results retrieved for the noisy related terms, and discard the search results retrieved for the distractor terms. It is noteworthy that at no stage of the proposed method do we issue q as a standalone query, nor in conjunction with any other terms, to the search engine. Moreover, we do not require access to the search index, which is typically not shared by search engine companies with the outside world.

Our contributions in this paper can be summarised as follows:

  • We propose a method to anonymise user queries sent to a search engine by semantic decomposition to protect the privacy of the search engine users. Our proposed method uses pretrained word embeddings.

  • We introduce the concepts of anonymity (i.e., how difficult it is to guess the original user query by looking at the auxiliary queries sent to the search engine) and reconstructability (i.e., how easy it is to reconstruct the search results for the original query from the search results for the auxiliary queries), and propose methods to estimate their values.

  • We theoretically derive the relationship between anonymity and reconstructability using known properties of distributed word representations.

  • We evaluate the robustness of the proposed query anonymisation method against clustering-based attacks, where a search engine clusters the keywords it receives within a single session to filter out distractors and predict the original query from the induced clusters. Our experimental results show that by selecting appropriate distractor terms, it is possible to guarantee query anonymity while reconstructing the relevant search results.

2. Query Anonymisation via Semantic Decomposition

2.1. Retrieval Model

Modern search engines use complex retrieval mechanisms that involve search result ranking, sessions, personalisation, etc. Moreover, the exact implementations of those mechanisms differ from one search engine to another and are not publicly disclosed. Therefore, to simplify the theoretical analysis and empirical validation, we resort to a classical inverted-index-based retrieval model, where we return all documents that contain all words in the query, without performing any relevance ranking.
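For concreteness, this retrieval model can be sketched as a small conjunctive inverted index (an illustrative sketch only; the helper names `build_inverted_index` and `retrieve` are ours, not part of the paper):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def retrieve(index, query):
    """Return all documents containing every query term (no ranking)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Conjunctive retrieval is deliberately strict: a document matches only if it contains every query term, which makes the later set-based analysis of reconstruction straightforward.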

2.2. Finding Noisy-Related Terms

Expanding a user query using related terms is a popular technique in information retrieval (Carpineto and Romano, 2012), and is particularly useful when the number of results for the original query is small or zero. For example, consider the scenario that a user wants to obtain search results for Bill Gates. In a typical search engine, we would search using the query Bill Gates and retrieve documents that contain the phrase Bill Gates. However, assuming that we did not find any documents containing Bill Gates, we could automatically expand the original query using its related terms to overcome the zero results problem. For example, we could expand Bill Gates using the related term Microsoft, which is a company founded by Bill Gates.

Although query expansion using related terms is motivated as a technique for improving the recall in a search engine, we take a different perspective in this paper – we consider query expansion as a method for anonymising the search intent of a user. Numerous methods have been proposed in prior work on query expansion to find good candidate terms for expanding a given user query such as using pre-compiled thesauri containing related terms and query logs (Carpineto and Romano, 2012). We note that any method that can find related terms for a given user query can be used for our purpose given that the following requirements are satisfied:

  1. The user query q must never be sent to the search engine when retrieving related terms for it, because this would obviously compromise the anonymisation goal.

  2. Repeated queries to the search engine must be minimised in order to reduce the burden on the search engine. We assume that the query anonymisation process takes place outside of the search engine, using a publicly available search API. Although modern Web search engines would gracefully scale with the number of users/queries, anonymisation methods that send excessively large numbers of queries are likely to be banned by the search engines because of the processing overhead. Therefore, it is important that we limit the number of queries we issue to the search engine when computing the related terms.

  3. The related term identification method must not require any information about the distribution of documents, nor access to the search index. If we had access to the index of the search engine, then we could easily find the terms that co-occur with the user query, thereby identifying related terms. However, we assume that the query anonymisation process happens outside of the search engine. None of the major commercial web search engines such as Google, Bing or Baidu provide direct access to their search indices due to security concerns. Therefore, it is realistic to assume that we will not have access to the search index at any time during the anonymisation process, including the step where we find related terms for a given user query.

  4. The related terms must not be too similar to the original user query q, because that would enable the search engine to guess q via the related terms it receives. For this purpose, we add noise to the user query and find noisy related neighbours that are less similar to q.

We propose a method that uses pre-trained word embeddings to find related terms for a user query, satisfying all of the above-mentioned requirements. Context-independent word embedding methods such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) can represent the meanings of words using low-dimensional dense vectors. Using word embeddings is also computationally attractive because they are low dimensional (typically a few hundred dimensions are sufficient), consuming less memory and allowing faster computation of similarity scores. Although we focus on single-word queries for ease of discussion, we note that by using context-sensitive phrase embeddings such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) we can obtain vectors representing multi-word queries, which we defer to future work.

We denote the pre-trained word embedding of a term w by v_w. To perturb the word embeddings, we add a noise vector η, sampled independently for each query from the d-dimensional Gaussian with zero mean and unit variance, and measure the cosine similarity between the noise-added vector v_q + η and each word w′ in a predefined and fixed vocabulary V, using their word embeddings v_{w′}. We then select the top n most similar words as the noisy related terms R of q.
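This selection step can be sketched as follows (an illustrative sketch assuming `embeddings` is a dict from terms to numpy vectors; `noisy_related_terms` is a hypothetical helper name, and setting `sigma = 0` recovers plain nearest neighbours):

```python
import numpy as np

def noisy_related_terms(query, embeddings, vocab, n=5, sigma=1.0, seed=None):
    """Perturb the query embedding with Gaussian noise and return the
    n vocabulary terms most cosine-similar to the noise-added vector,
    excluding the query itself."""
    rng = np.random.default_rng(seed)
    v_q = embeddings[query]
    v_noisy = v_q + sigma * rng.standard_normal(v_q.shape)
    v_noisy = v_noisy / np.linalg.norm(v_noisy)
    scored = []
    for w in vocab:
        if w == query:
            continue  # requirement: never expose the query itself
        v_w = embeddings[w] / np.linalg.norm(embeddings[w])
        scored.append((float(v_noisy @ v_w), w))
    scored.sort(reverse=True)
    return [w for _, w in scored[:n]]
```

Larger `sigma` pushes the perturbed vector further from v_q, so the retrieved neighbours are less similar to the original query, trading retrieval accuracy for anonymity.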

Let us denote the set of documents retrieved using a query x by 𝒟(x). If we use a sufficiently large number of related terms R for q, we will be able to retrieve exactly 𝒟(q) using

  𝒟(q) = ⋃_{r ∈ R} 𝒟(r)    (1)

However, in practice we are limited to using a truncated list of related terms, both for computational efficiency and to limit the number of queries sent to the search engine. Therefore, in practice ⋃_{r ∈ R} 𝒟(r) will not be exactly equal to 𝒟(q). Nonetheless, we assume the equality in (1) to hold, and later analyse the approximation error. We do not consider the problem of ranking the search results in this work, and focus only on reconstructing the set of search results. If the number of relevant search results is large and we would like to rank the most relevant search results at the top, we can still use static or dynamic ranking information provided by the search engine (He et al., 2008).

2.3. Anonymisation via Distractor Terms

Searching using only the noisy related terms R of a user query q does not guarantee the anonymity of the user. The probability of predicting the original user query increases with the number of related terms used. Therefore, we require further mechanisms to ensure that it will be difficult for the search engine to predict q from the queries it has seen. For this purpose, we select a set of unrelated terms D, which we refer to as the distractor terms.

Several techniques can be used to find the distractor terms for a given query q. For example, we can randomly select terms from the vocabulary as the distractor terms. However, such randomly selected distractor terms are unlikely to be coherent, and could easily be singled out from the related terms by the search engine. If we know the semantic category of q (e.g. whether q is a person or a location), then we can limit the distractor terms to the same semantic category as q. This guarantees that both the related terms and the distractor terms are semantically related in the sense that they both represent the same category. Therefore, it will be difficult for the search engine to discriminate between related terms and distractor terms. Information about the semantic categories of terms can be obtained in different ways, such as from Wikipedia category pages, from taxonomies such as WordNet (Miller, 1995), or by named entity recognition (NER) tools.

We propose a method to find distractor terms for each query q using pre-trained word embeddings, as illustrated in Figure 2. Let us consider a set of candidate terms C from which we must select the distractor terms. For example, C could be a randomly selected subset of the vocabulary of the corpus used to train the word embeddings. First, we select a random hyperplane (represented by its normal vector u) in the embedding space that passes through the point corresponding to q. Next, we split C into two mutually exclusive sets C₁ and C₂, depending on which side of the hyperplane each word is located. Let C_l and C_s be respectively the larger and smaller of the two sets C₁ and C₂ (i.e. |C_l| ≥ |C_s|). Next, we remove from C_l the words that are most similar to the original query q. We then use this reduced C_l as C and repeat this process until we are left with the desired number of distractor terms in D. Intuitively, we are partitioning the candidate set into two groups in each iteration considering some attribute (dimension) of the word embedding of the query (possibly representing some latent meaning of the query), and removing similar terms in that subspace.

Figure 2. Selecting distractor terms for a given query q. We first compute the noise-added vector v_q + η for q, and then search for terms located inside a cone that forms an angle θ with the noise-added vector. This ensures that the distractor terms are sufficiently similar to the noise component, and therefore difficult to distinguish from the noisy related terms R.

2.4. Reconstructing Search Results

For a query q, once we have identified a set of noisy related terms R and a set of distractor terms D, we issue those terms as queries to the search engine and retrieve the relevant search results for each individual term. We issue the related and distractor terms in a random sequence, and ignore the results returned by the search engine for the distractor terms. Finally, we reconstruct the search results for q using (1).
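The issue-and-reconstruct step can be sketched as follows, with `search_fn` standing in for a search engine API call (a hypothetical function returning a set of document ids for a single term):

```python
import random

def anonymised_search(related_terms, distractor_terms, search_fn, seed=0):
    """Send related and distractor terms to the engine in a random order,
    then keep only the results retrieved for the related terms and return
    their union, i.e. the reconstruction in (1)."""
    order = list(related_terms) + list(distractor_terms)
    random.Random(seed).shuffle(order)  # this is all the search engine observes
    results = {term: set(search_fn(term)) for term in order}
    reconstructed = set()
    for r in related_terms:             # distractor results are discarded
        reconstructed |= results[r]
    return reconstructed
```

Note that the engine sees only the shuffled term sequence; the association between terms and the roles "related" vs. "distractor" stays on the user side.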

3. Anonymity vs. Reconstructability

Our proposed query decomposition method strikes a fine balance between two factors: (a) the difficulty for the search engine to guess the original user query q from the set of terms R ∪ D that it receives, and (b) the ease with which we can reconstruct the search results 𝒟(q) for the original user query q using the search results for the noisy related terms, following (1). We refer to (a) as the anonymity, and to (b) as the reconstructability, of the proposed query decomposition process.

3.1. Anonymity

We define anonymity, α(q), as the difficulty of guessing the user query q from the terms issued to the search engine, and compute it as follows:

  α(q) = 1 − (1 / |R ∪ D|) Σ_{w ∈ R ∪ D} cos(v_q, v_w)    (2)

Specifically, we measure the average cosine similarity between the word embedding v_q of the original user query q and the word embeddings of the issued search terms. The higher this average similarity, the easier it becomes for the search engine to guess q from the search terms. The difference between this average similarity and 1 (i.e. the maximum value of the average similarity) is a measure of the anonymity we can guarantee through the proposed query decomposition process.
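A direct implementation of this anonymity score, treating `issued_terms` as the combined set of related and distractor terms, might look as follows (illustrative only):

```python
import numpy as np

def anonymity(query, issued_terms, embeddings):
    """alpha(q) = 1 - average cosine similarity between the query embedding
    and the embeddings of the terms issued to the search engine."""
    v_q = embeddings[query] / np.linalg.norm(embeddings[query])
    sims = [float(v_q @ (embeddings[w] / np.linalg.norm(embeddings[w])))
            for w in issued_terms]
    return 1.0 - sum(sims) / len(sims)
```

The score is 0 when every issued term points in the same direction as the query embedding, and grows towards 1 as the issued terms become orthogonal to it.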

3.2. Reconstructability

We reconstruct the search results for q using the search results for the queries in R, following (1). Let 𝒟′(q) denote the reconstructed set of search results. We define reconstructability, ρ(q), as a measure of the accuracy of this reconstruction process:

  ρ(q) = |𝒟(q) ∩ 𝒟′(q)| / |𝒟(q)|    (3)

A document retrieved by only a single noisy related term might not be relevant to the original user query q. A more robust reconstruction procedure is to consider a document as relevant only if it has been retrieved by at least k different noisy related terms. If a user query q can be represented by a set of documents where each document is retrieved by at least k different noisy related terms, then we say q is k-reconstructable. In fact, the reconstruction process defined in (1) corresponds to the special case where k = 1. Increasing the value of k decreases the number of relevant documents retrieved for the original user query q, but is likely to increase the relevance of the retrieved documents.
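The k-reconstructable variant can be sketched by counting, for each document, how many related terms retrieved it; the recall-style `reconstructability` score below is our simplified stand-in for the accuracy measure defined above:

```python
from collections import Counter

def reconstruct(results_per_related_term, k=1):
    """Keep documents retrieved by at least k distinct noisy related terms.
    k = 1 is the plain union of (1); larger k trades recall for precision."""
    counts = Counter()
    for docs in results_per_related_term:
        counts.update(set(docs))
    return {doc for doc, c in counts.items() if c >= k}

def reconstructability(true_results, reconstructed):
    """Fraction of the true result set recovered by the reconstruction."""
    if not true_results:
        return 0.0
    return len(set(true_results) & set(reconstructed)) / len(set(true_results))
```

Raising k shrinks the reconstructed set monotonically, which is exactly the recall/precision trade-off described in the text.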

4. Relationship between Anonymity and Reconstructability

In this section, we derive the relationship between anonymity and reconstructability. Because anonymity can be increased arbitrarily by adding distractor terms, we ignore distractor terms in this analysis. The result can therefore be seen as a lower bound on the anonymity that can be obtained without using any distractor terms. We first discuss the case where we have only one related term (i.e. n = 1), and then consider the k = n reconstructability case.

4.1. The n = 1 Case

Let us consider the case where n = 1. Here, for a given query q, we have only a single related term r. In this case 𝒟′(q) = 𝒟(r), and we consider all documents retrieved using r as relevant for q. We first note that reconstructability, ρ, can be written as

  ρ = |𝒟(q) ∩ 𝒟′(q)| / |𝒟(q)|    (4)

from the definition of reconstructability.

Because we have the single noisy related term r, we have 𝒟′(q) = 𝒟(r). By substituting this in (4), we get

  ρ = |𝒟(q) ∩ 𝒟(r)| / |𝒟(q)|    (5)

If we consider the co-occurrence context of two terms to be the document in which they co-occur, then (5) can be written as the conditional probability given by (6):

  ρ = p(q, r) / p(q) = p(r | q)    (6)

Theorem 2.2 in (Arora et al., 2016) provides a useful connection between the probability of a word (or the joint probability of two words) and their word representations, which we summarise below:

  log p(w) = ‖v_w‖² / (2d) − log Z + ε    (7)
  log p(w, w′) = ‖v_w + v_{w′}‖² / (2d) − 2 log Z + ε′    (8)

Here, Z is the partition function, d is the dimensionality of the word embeddings, and ε, ε′ are approximation errors.

By taking the logarithm of both sides in (6) and applying (7) and (8), we obtain

  log ρ = log p(q, r) − log p(q) = (‖v_r‖² + 2 v_q⊤v_r) / (2d) − log Z + ε    (9)

Anonymity for a single query term can be computed using cosine similarity as follows:

  α = 1 − (v_q⊤v_r) / (‖v_q‖ ‖v_r‖)    (10)

By substituting (10) in (9) we get

  log ρ = (‖v_r‖² + 2(1 − α) ‖v_q‖ ‖v_r‖) / (2d) − log Z + ε    (11)

Because q is a given query, ‖v_q‖ is a constant. Moreover, if we assume that different related terms have similar norms ‖v_r‖ (from (7) it follows that such related terms must have similar frequencies of occurrence in the corpus), then from (11) we see that there exists a linear inverse relationship between log ρ and α. Because the logarithm is monotonically increasing, (11) implies an inverse relationship between ρ and α.

4.2. The k = n Case

Let us now extend the relationship given by (11) to the case where we consider a document to be relevant only if it can be retrieved from all n of the related terms r_1, …, r_n. In other words, we have k = n reconstructability in this case. Because each search result is retrieved by all n terms, we have

  𝒟′(q) = ⋂_{i=1}^{n} 𝒟(r_i)    (12)

Reconstructability can be computed in this case as follows:

  ρ = p(r_1, …, r_n | q) = ∏_{i=1}^{n} p(r_i | q)    (13)

In (13) we have assumed that the related terms are mutually independent given the query q.

Let us take the logarithm on both sides of (13), and use (7) and (8) in the same manner as in Section 4.1 to derive the relationship given by (14):

  log ρ = Σ_{i=1}^{n} (‖v_{r_i}‖² + 2 v_q⊤v_{r_i}) / (2d) − n log Z + ε    (14)

In the k = n case, anonymity can be computed as follows:

  α = 1 − (1/n) Σ_{i=1}^{n} (v_q⊤v_{r_i}) / (‖v_q‖ ‖v_{r_i}‖)    (15)

Let us further assume that all related terms occur approximately the same number of times in the corpus. From (7) it then follows that ‖v_{r_i}‖ ≈ c for some constant c. By plugging (15) into (14) and using this approximation, we arrive at the following relationship between ρ, α and n:

  log ρ = n (c² + 2c ‖v_q‖ (1 − α)) / (2d) − n log Z + ε

In the general case of k-reconstructability with k < n, a subset of k related terms retrieves each document. The reconstructability given by the relationship above must be considered a lower bound for this general case, because we will still be able to reconstruct the search results using subsets of k related terms selected from the full set of n related terms.

5. Experiments and Results

Figure 3. Relationship between anonymity and reconstructability under different levels of added noise and no distractor terms (left: no-noise, middle: medium-level of noise, and right: high-level of noise).
Figure 4. Relationship between anonymity and reconstructability under different levels of added noise and with 20 distractor terms (left: no-noise, middle: medium-level of noise, and right: high-level of noise).
Figure 5. Relationship between anonymity and reconstructability under different levels of added noise and with 40 distractor terms (left: no-noise, middle: medium-level of noise, and right: high-level of noise).
Figure 6. Hit rates for the k-means clustering attacks for increasing numbers of clusters (k) and distractor terms (left: no distractors, middle: 20 distractors, and right: 40 distractors). In each figure, we show results for three levels of added noise.

To evaluate the proposed method we create a dataset where we select popular queries from Wikipedia query logs and associate them with the relevant Wikipedia articles. The 50 query terms used in our experiments are as follows: airfield, alex, anthropology, antoine, antony, autonomous, belfast, ben, benares, benet, benz, biodiversity, broadway, carol, commercial, consciousness, crown, custer, earths, elena, gallery, haddad, haig, helmut, hughes, hugo, irit, judith, kahn, katarina, leith, marshal, masaaki, memorial, negro, oakley, outlaw, product, rings, runaway, sammy, santa, sine, stawell, steve, toole, tube, wait, wilkerson, angel.

We use the December 2015 dump of English Wikipedia for this purpose and build a keyword-based inverted search index. We use 300-dimensional pretrained GloVe (Pennington et al., 2014) embeddings, trained on a 42 billion token Web-crawled corpus (https://nlp.stanford.edu/projects/glove/), as the word embeddings for computing related terms. Figures 3-5 show the anonymity and the logarithm of the reconstructability values for the 50 queries in our dataset at three different levels of noise and different numbers of distractor terms. Specifically, we add Gaussian noise with zero mean and standard deviations of 0.6 and 1.0 to simulate medium and high levels of noise respectively, whereas the no-noise case corresponds to not perturbing the word embeddings.

We see a negative correlation between anonymity and reconstructability in all plots, as predicted by the relationship derived in Section 4.2. Moreover, the absolute value of the negative correlation increases with the noise level for all numbers of distractor terms. The addition of noise affects the selection of related terms, but not the selection of distractor terms; however, the related terms influence both anonymity and reconstructability. Because the Gaussian noise is added to the word embedding of the original query, and the nearest neighbours of this noise-added embedding are selected as the related terms, this process helps to increase anonymity. On the other hand, the search results obtained using noisy related terms will be less relevant to the original user query. Therefore, reconstructing the search results for the original user query from the search results for the noisy related terms becomes more difficult, decreasing the reconstructability. The overall effect of increasing anonymity and decreasing reconstructability is shown by the increased negative gradient of the line of best fit in the figures.

5.1. Robustness against Attacks

An important aspect of a query anonymisation method is its robustness against attacks. Given that the proposed method sends two groups of terms (related and distractor) to a search engine, a natural line of attack is for the search engine to cluster the received terms to filter out the distractor terms, and then guess the user query from the related terms. We call such attacks clustering attacks in this paper.

As a concrete example, we simulate an attacker who applies k-means clustering to the received terms. The similarity between terms for the purpose of clustering is computed using the cosine similarity between the corresponding word embeddings. Any clustering algorithm can be used for this purpose; we use k-means clustering because of its simplicity. Next, the attacker must identify the single cluster that is likely to contain the related terms. For this purpose, we measure the coherence, c(S), of a cluster S, given by (16):

  c(S) = (1 / (|S| (|S| − 1))) Σ_{w_i ≠ w_j ∈ S} cos(v_{w_i}, v_{w_j})    (16)

Here, w_i and w_j are two distinct terms in S. Because a cluster containing related terms will be more coherent than a cluster containing distractor terms, the attacker selects the cluster with the highest coherence as the relevant cluster. Finally, the attacker picks the term from the entire vocabulary that is closest to the centroid of that cluster as the guess of the original user query q. We define the hit rate to be the proportion of the queries disclosed by the clustering attack. Figure 6 shows the hit rates for the clustering attacks under different numbers of distractor terms.
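The simulated attack can be sketched end-to-end as below; the tiny k-means implementation (deterministically initialised with the first k points) and the helper names are ours, standing in for any off-the-shelf clustering routine:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means, deterministically initialised with the first k points."""
    centroids = X[:k].astype(float)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

def coherence(vectors):
    """Average pairwise cosine similarity within a cluster, cf. (16)."""
    n = len(vectors)
    if n < 2:
        return 0.0
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ V.T
    return float((sims.sum() - n) / (n * (n - 1)))

def clustering_attack(received_terms, embeddings, vocab, k=2):
    """Cluster the received terms, pick the most coherent cluster, and guess
    the query as the vocabulary term closest to that cluster's centroid."""
    X = np.stack([np.asarray(embeddings[t], dtype=float) for t in received_terms])
    labels = kmeans(X, k)
    clusters = [X[labels == j] for j in range(k) if (labels == j).any()]
    best = max(clusters, key=coherence)
    centroid = best.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    def sim(w):
        v = np.asarray(embeddings[w], dtype=float)
        return float(centroid @ v) / float(np.linalg.norm(v))
    return max(vocab, key=sim)
```

With tightly clustered related terms and angularly spread distractors, the related cluster wins on coherence and its centroid points back towards the query, which is exactly why adding incoherent-but-plausible distractors matters.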

From Figure 6 (left) we see that the hit rate is high when we do not use any distractor terms. In this case, the set of candidate terms consists purely of the related terms R, and if we cluster all the related terms into one cluster (k = 1), the attacker can easily pick the original query by measuring similarity to the centroid of the cluster. The hit rate drops when we add noise to the word embeddings, but even with the highest level of noise, the original query can be discovered 19% of the time. However, the hit rate drops significantly for all levels of noise when we add distractor terms, as shown in the middle and right plots of Figure 6. Further results are presented in the SI. These results show the importance of using distractor terms.

The hit rate is maximal when we set k = 2, which is an ideal choice for the number of clusters given that there are two groups of terms (related terms and distractors) among the candidates. Increasing k also increases the possibility of splitting the related terms into multiple clusters, thereby decreasing the probability of discovering the original query from a single cluster. We see that the hit rates under no or medium levels of noise drop when we increase the number of distractor terms from 20 to 40, but the effect on candidates with a high level of added noise is less prominent. This result suggests that we could increase the number of distractor terms while keeping the level of noise to a minimum.

We show the related and distractor terms for two example queries, Hitler in Table 1 and mass murder in Table 2. We see that terms related to the original queries can be accurately identified from the word embeddings. Moreover, by adding a high level of noise to the embeddings, we can generate distractor terms that are sufficiently far from the original queries. Consequently, both anonymity and reconstructability are relatively high for these examples. Interestingly, the clustering attack is unable to discover the original queries, irrespective of the number of clusters produced.

Query Hitler
noise high-level
related terms nazi, führer, gun, wehrmacht, guns, nra, pistol, bullets
distractors schenectady, fairfield, columbia, hanover, lafayette, bronx, evansville, youngstown, tallahassee, alexandria, northampton
anonymity 0.867
reconstructability 0.831
Clustering Attack Revealed Query
k=1 montgomery
k=2 albany, george
k=3 smith, albany
k=4 smith, fresno
k=5 rifle, albany
Table 1. Related and distractor terms for the query Hitler. Both anonymity and reconstructability are high for this query, even with a small number of distractor terms. Clustering attacks with different numbers of clusters (k) do not reveal the original query.
Query mass murder
noise high-level
related terms terrorism, killed, wrath, full-grown
distractors roselle, morristown, rockville, schenectady, utica, albany, ashland, hartford, salem, columbus
anonymity 0.789
reconstructability 0.747
Clustering Attack Revealed Query
k=1 richmond
k=2 fremont, death
k=4 pasadena, words
k=5 pasadena, anderson
Table 2. Related and distractor terms for the query mass murder with 10 distractor terms. We see that neither the query nor either of its two tokens is revealed by the clustering attacks with different k values.

5.2. Trade-off between Reconstructability and the Hit Rate in Clustering Attacks

If the keywords (related and distractor terms) sent to the search engine are related to the original user query, the search engine can return relevant search results that we can use to reconstruct the search results for the original user query. However, this also increases the risk that the search engine can guess the original user query using an attack such as the k-means clustering attack described above. The hit rate was defined as the proportion of user queries correctly predicted by the clustering attack, and is a measure of the robustness of the proposed query anonymisation method against k-means clustering attacks. Therefore, a natural question is: what is the relationship between the reconstructability and the hit rate?

To empirically study the relationship between reconstructability and hit rate, we conduct the following experiment. We randomly select 109 user queries and add Gaussian noise with zero mean and standard deviations of 0 (no noise), 0.6, 1.0, 1.4 and 1.8. In each case, we vary the number of distractor terms from 0 to 120. We then apply k-means clustering attacks with k values of 1, 2, 3, 4 and 5. In total, for a fixed k value and number of distractor terms, we have 545 clustering attacks. To make the evaluation more conservative, in this section we consider the vocabulary terms closest to the respective centroids of all clusters, and not only of the most coherent one as specified in Section 5.1. If the original query matches any of those terms, we consider it a hit. We randomly sample data points from even intervals of reconstructability values and plot them in Figure 7.

We see a positive relationship between the reconstructability and the hit rate in all figures. This indicates a trade-off between the two: if we try to increase the reconstructability by selecting keywords more relevant to the original user-query, we simultaneously increase the risk of the search engine discovering the original user-query via a clustering attack. Moreover, we see that when we increase the number of distractor terms the hit rate drops for the same value of reconstructability. This result shows that to overcome the trade-off between the reconstructability and the hit rate we can simply increase the number of distractor terms, thereby making the query anonymisation method more robust against clustering attacks. The drop due to distractor terms is most prominent for the single-cluster (k=1) attacks, because both related and distractor terms are then contained in one cluster, from which it is difficult to guess the original user-query. Therefore, multiple clusters are required for a successful k-means clustering attack.

Overall, the hit rate drops for every k when we increase the number of distractor terms, and at a fixed number of distractor terms the hit rate is higher for larger k. This result suggests that if one wants to increase the hit rate, an effective strategy is to increase the number of clusters, because we count a hit whenever the user-query is found via any of the clusters. Intuitively, if we form more clusters and pick the vocabulary terms closest to each of the centroids, then the likelihood of predicting the original user-query increases with the number of clusters formed. In practice, however, an attacker would need to further select one term from across all the clusters. The hit rate obtained in this manner is therefore a conservative estimate; in reality it will be lower, and the method correspondingly more robust against attacks.

Figure 7. Hit rate shown against reconstructability for k-means attacks with (left) no distractor terms, (middle) 60 distractor terms and (right) 120 distractor terms.

6. Human Evaluation

The goal of our work was to anonymise queries sent to search engines such that the search engine cannot guess the user’s information intent. However, it is an interesting question whether a human, not a search engine, could guess the original query from the set of related and distractor terms suggested by the proposed method. This can be seen as an upper baseline for anonymisation. To empirically evaluate the difficulty for humans to predict the original query, we devise a query prediction game, where a group of human subjects are required to predict the original query from the related and distractor terms suggested by the proposed method.

The query prediction game is conducted in two stages. In the first stage, we randomly shuffle the related and distractor term sets extracted by the proposed method for a hidden query. The human subject is unaware of which of the terms are related to the original user-query and which are distractors. The human subject has a single guess to predict the user-query, and wins only if the original query is correctly predicted. If the human subject fails at this first stage, we remove all distractor terms and display only the related terms. The human subject then has a second chance to predict the original query from the related set of terms. If the human subject correctly predicts the original query in the second stage, we consider it a winning case. Otherwise, the current round of the game is terminated and the next set of terms is shown to the human subject. Winning rate is defined as the fraction of games in which the human subjects correctly predicted the original user query.
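The two-stage protocol above can be sketched as a small simulation, with the human subject abstracted as a guessing function. The names (`play_round`, `winning_rate`, `guess_fn`) are hypothetical, introduced here only to make the game's control flow explicit.

```python
import random

def play_round(query, related, distractors, guess_fn, seed=0):
    """One round of the two-stage query prediction game.

    Stage 1: the subject sees the shuffled union of related and distractor
    terms and has one guess. Stage 2 (only on failure): distractors are
    removed and the subject guesses again from the related terms alone.
    Returns the stage at which the round was won (1 or 2), or 0 for a loss."""
    rng = random.Random(seed)
    stage1_terms = list(related) + list(distractors)
    rng.shuffle(stage1_terms)  # hide which terms are distractors
    if guess_fn(stage1_terms) == query:
        return 1
    if guess_fn(sorted(related)) == query:
        return 2
    return 0

def winning_rate(outcomes):
    """Fraction of rounds won at either stage."""
    return sum(1 for o in outcomes if o > 0) / len(outcomes)
```

A subject who only succeeds once the distractors are removed wins at stage 2, which is the asymmetry the evaluation measures.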

Figures 8 and 9 show the winning rates for the first and second stages of the query prediction game against the anonymity of the queries. All queries selected for the prediction game have reconstructability scores greater than 0.3, which indicates that the search results for the original query can be accurately reconstructed from the related and distractor terms shown to the human subjects. We see that the winning rate for the first stage is lower than for the second stage, indicating that it is significantly easier for human subjects to guess the original query once the distractor terms are removed. This result suggests that the distractor terms found by the proposed method can distract not only search engines but also humans. From Figure 9 we see a gradual negative correlation between winning rate and anonymity. This implies that the more anonymous the terms are, the more difficult it becomes for the human subjects to predict the original query, which is a desirable property for a query anonymisation method.

Figure 8. Winning rate vs. anonymity for the first-stage of the query prediction game
Figure 9. Winning rate vs. anonymity for the second-stage of the query prediction game

7. Related Work

Our work is closely related to several different research fields such as query anonymisation, user profile unlinking, user unidentifiability and Private Information Retrieval (PIR). Traditionally, information retrieval systems such as Web search engines have been primarily text-based interfaces where users enter keywords that describe their information need and the search engine returns relevant documents as the search results. The queries entered by Web search engine users often reveal intimate private information that the users would not like to be known to the general public. One of the early incidents of query logs leaking private information into the public domain is AOL's release of query log data in 2006 (https://tinyurl.com/y9qx9ufz). AOL released 20 million search queries from over 600K users, drawn from AOL's search data for March, April and May of 2006. The data contained the query, session id, anonymised user id, and the rank and domain of the clicked result. Despite the user ids being anonymised, various private information about the users could easily be triangulated from the released queries; after nine days AOL issued an apology, removed the website and terminated a number of employees responsible for the decision, including the CTO. Following this incident, various methods have been proposed to anonymise user queries, such as token-based hashing (Kumar et al., 2007) and query-log bundling (Jones et al., 2008). However, in these approaches anonymisation happens only on the Web search engine's side without any intervention by the users, and the users must trust the good intentions of the search engine with respect to their privacy. Moreover, Kumar et al. (2007) showed that hashing alone does not guarantee user privacy.

Accessing Web search engines via an anonymised proxy server such as onion routing (Goldschlang et al., 1999), TOR (Dingledine et al., 2004), Dissent (Corrigan-Gibbs and Ford, 2010) or RAC (Mokhtar et al., 2013) is a popular strategy among common users. The goal is to prevent the search engine from linking the queries issued by a user to his or her user profile. Unfortunately, hiding the IP address of a user alone does not guarantee privacy, as evident from the 2006 AOL incident, in which user IPs had already been replaced by random ids in the released query logs. To completely unlink their profiles, users must continuously change the proxy servers they use and clean caches in the form of cookies and embedded JavaScript, which is a cumbersome process.

A complementary approach for ensuring the unidentifiability of users by the search engines is to issue a mixture of noisy or unrelated keywords alongside the keywords that describe the information need of the users. Several browser add-ons that automatically append unrelated fake terms have been developed, such as TrackMeNot (Howe and Nissenbaum, 2009), OptimiseGoogle, Google Privacy and the Private Web Search tool (Saint-Jean et al., 2007). Although this approach is similar to our proposal to append user queries with distractor terms, previously proposed methods have relied on pre-compiled ontologies (Petit et al., 2014) such as WordNet, or on queries issued by other users shared via a peer network. Such approaches have scalability issues because most named entities that appear in search queries do not appear in WordNet, and it is unlikely that users would openly share their keywords to be used by their peers.

The goal in Private Information Retrieval (Yekhanin, 2010) is to retrieve data from a database without revealing the query, sending only some encrypted or obfuscated version of it (Ostrovsky and Skeith, 2007; Chor et al., 1997). For example, in homomorphic encryption-based methods the user (client) submits encrypted keywords and the search engine (server) performs a blinded lookup, returning the results in encrypted form, which can then be decrypted by the user. Embellishing queries with decoy terms further protects the privacy of the users. However, unlike our proposed method, PIR methods require search engines to accommodate the client-side encryption, which is a critical limitation because modern commercial Web search engines do not allow this.

Although we considered text-based queries, the proposed framework is not limited to text-based information retrieval, and can easily be extended to other types of search platforms. For example, in the case of voice-based information retrieval (Lee and Pan, 2009), we can use the spectrum of the voice input and add noise to it, such as white noise, to find the distractors. Likewise, in image-based information retrieval, we can add noise to the image embedding produced by, for example, a convolutional neural network-based filter (Lu et al., 2017; Goodfellow et al., 2015; Szegedy et al., 2014). We plan to extend the proposed method to other types of information retrieval tasks in the future.
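The noise-based decomposition idea carries over to any modality that can be embedded in a vector space: perturb the query embedding and retrieve the terms nearest to the perturbed point. The following is a minimal sketch of that idea; the function name `noisy_neighbours` and the toy vocabulary are our own illustrative assumptions, not part of the proposed method's implementation.

```python
import numpy as np

def noisy_neighbours(query_vec, vocab, vocab_vecs, n_terms, noise_std, seed=0):
    """Perturb the query embedding with zero-mean Gaussian noise and return
    the vocabulary terms nearest to the noisy point. Small noise_std yields
    terms that behave like related terms; large noise_std yields distractors."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(query_vec, dtype=float) + rng.normal(
        0.0, noise_std, size=np.shape(query_vec))
    dists = np.linalg.norm(np.asarray(vocab_vecs, dtype=float) - noisy, axis=1)
    return [vocab[i] for i in np.argsort(dists)[:n_terms]]
```

For voice or image search, `query_vec` would be a spectrum or a convolutional feature vector instead of a word embedding; the retrieval step is unchanged.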

8. Conclusion

We proposed a method to anonymise queries sent to a Web search engine by decomposing the query into a set of related terms and a set of distractor terms. We then reconstruct the search results for the original query using the search results obtained for the related terms, discarding the search results for the distractor terms. We theoretically studied the relationship between the anonymity and the reconstructability obtained using the proposed method under different noise levels. We empirically showed that the proposed anonymisation method is robust against k-means clustering attacks. Moreover, a human evaluation task, implemented as a query prediction game, showed that it is difficult even for humans to predict the original query from the anonymisation produced by the proposed method.

References

  • S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski (2016) A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguistics 4, pp. 385–399. Cited by: §4.1.
  • C. Carpineto and G. Romano (2012) A survey of automatic query expansion in information retrieval. ACM Computing Surveys 44 (1), pp. 1 – 50. Cited by: §2.2, §2.2.
  • B. Chor, N. Gilboa, and M. Naor (1997) Private information retrieval by keywords. Technical report Department of Computer Science, Technion, Israel Institute of Technology. Cited by: §7.
  • H. Corrigan-Gibbs and B. Ford (2010) Dissent: accountable anonymous group messaging. Cited by: §7.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT. Cited by: §2.2.
  • R. Dingledine, N. Mathewson, and P. Syversion (2004) TOR: the second generation onion router. Cited by: §7.
  • D. Goldschlang, M. Reed, and P. Syverson (1999) Onion routing. Communications of the ACM 42 (2). Cited by: §7.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. Cited by: §7.
  • C. He, C. Wang, Y. Zhong, and R. Li (2008) A survey on learning to rank. pp. 1734 – 1739. Cited by: §1, §2.2.
  • D. C. Howe and H. Nissenbaum (2009) TrackMeNot: resisting surveillance in web search. In Lessons from the Identity Trail: Anonymity, Privacy and Identity in a Networked Society. Cited by: §7.
  • R. Jones, R. Kumar, B. Pang, and A. Tomkins (2008) Vanity fair: privacy in querylog bundles. pp. 853–862. Cited by: §7.
  • T. Kenter, A. Borisov, and M. de Rijke (2016) Siamese CBOW: optimizing word embeddings for sentence representations. pp. 941–951.
  • R. Kumar, J. Novak, B. Pang, and A. Tomkins (2007) On anonymizing query logs via token-based hashing. New York, NY, USA, pp. 629–638. External Links: Document, ISBN 978-1-59593-654-7 Cited by: §7.
  • L. Lee and Y. Pan (2009) Voice-based information retrieval – how far are we from the text-based information retrieval?. pp. 26–43. Cited by: §7.
  • J. Lu, H. Sibai, E. Fabry, and D. Forsyth (2017) NO need to worry about adversarial examples in object detection in autonomous vehicles. External Links: 1707.03501 Cited by: §7.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. Cited by: §2.2.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39 – 41. Cited by: §2.3.
  • S. B. Mokhtar, G. Berthou, A. Diarra, V. Quéma, and A. Shoker (2013) RAC: a freerider-resilient scalable, anonymous communication protocol. Cited by: §7.
  • R. Ostrovsky and W. I. Skeith (2007) A survey of single-database pir: techniques and applications. pp. 393–411. Cited by: §7.
  • M. Pasca (2007) Organizing and searching the world wide web of facts-step two: harnessing the wisdom of the crowds. pp. 101–110. Cited by: §1.
  • M. Pasca (2014) Queries as a source of lexicalized commonsense knowledge. pp. 1081–1091. Note: uses queries for extracting facts Cited by: §1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proc. of EMNLP, pp. 1532–1543. Cited by: §2.2, §5.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. External Links: arXiv:1802.05365. Cited by: §2.2.
  • A. Petit, S. Ben Mokhtar, L. Brunie, and H. Kosch (2014) Towards efficient and accurate privacy preserving web search. Proceedings of the 9th Workshop on Middleware for Next Generation Internet Computing - MW4NG ’14. External Links: Document, ISBN 9781450332224 Cited by: §7.
  • M. Richardson (2008) Learning about the world through long term query logs. ACM Transactions on the Web 2 (4). Cited by: §1.
  • E. Sadikov, J. Madhavan, L. Wang, and A. Halevy (2010) Clustering query refinements by user intent. pp. 841–850. Cited by: §1.
  • F. Saint-Jean, A. Johnson, D. Boneh, and J. Feigenbaum (2007) Private web search. pp. 84–90. Cited by: §7.
  • R. L. T. Santos, C. Macdonald, and I. Ounis (2010) Exploiting query reformulations for web search result diversification. pp. 881–890. Cited by: §1.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. arXiv. Cited by: §7.
  • S. Yekhanin (2010) Private information retrieval. Commun. ACM 53 (4), pp. 68–73. External Links: Document, ISSN 0001-0782 Cited by: §7.