
Stochastic Privacy

Online services such as web search and e-commerce applications typically rely on the collection of data about users, including details of their activities on the web. Such personal data is used to enhance the quality of service via personalization of content and to maximize revenues via better targeting of advertisements and deeper engagement of users on sites. To date, service providers have largely followed the approach of either requiring or requesting consent from users for opting in to share their data. Users may be willing to share private information in return for better quality of service or for incentives, or in return for assurances about the nature and extent of the logging of data. We introduce stochastic privacy, a new approach to privacy centering on a simple concept: a guarantee is provided to users about the upper bound on the probability that their personal data will be used. Such a probability, which we refer to as privacy risk, can be assessed by users as a preference or communicated as a policy by a service provider. Service providers can work to personalize and to optimize revenues in accordance with preferences about privacy risk. We present procedures, proofs, and an overall system for maximizing the quality of services, while respecting bounds on allowable or communicated privacy risk. We demonstrate the methodology with a case study and evaluation of the procedures applied to web search personalization. We show how we can achieve near-optimal utility of accessing information with provable guarantees on the probability of sharing data.






Figure 1: Overview of stochastic privacy.

Online services such as web search, recommendation engines, social networks, and e-commerce applications typically rely on the collection of data about activities (e.g., click logs, queries, and browsing information) and personal information (e.g., location and demographics) of users. The availability of such data enables providers to personalize services to individuals and also to learn how to enhance the service for all users (e.g., improved relevance of search results). User data is also important to providers for optimizing revenues via better targeted advertising, extended user engagement and popularity, and even the selling of user data to third-party companies. Permissions are typically obtained via broad consent agreements that request user permission to share their data through system dialogs, or via complex Terms of Service. Such notices are typically difficult to understand and are ignored by more than 40 percent of users [Technet2012]. In other cases, a plethora of requests for information such as user location may be shown in system dialogs at run-time or installation time. Beyond the normal channels for sharing data, potential breaches of information are possible via attacks by malicious third parties and malware, and through surprising situations such as the AOL data release [Arrington2006, Adar2007] and the de-anonymization of released Netflix logs [Narayanan and Shmatikov2008]. The charges by the Federal Trade Commission against Facebook [FTC2011] and Google [FTC2012] highlight increasing concerns by privacy advocates and government institutions about the large-scale recording of personal data.

Ideal approaches to privacy in online services would enable users to benefit from machine learning over data from populations of users, yet consider users’ preferences as a top priority. Prior research in this realm has focused on designing privacy-preserving methodologies that can provide for control of a privacy-utility tradeoff [Adar2007, Krause and Horvitz2008]. Research has also explored the feasibility of incorporating user preferences over what type of data can be logged [Xu et al.2007, Cooper2008, Olson, Grudin, and Horvitz2005, Krause and Horvitz2008].

We introduce a new approach to privacy that we refer to as stochastic privacy. Stochastic privacy centers on the simple idea of providing a guarantee to users about the maximum likelihood that their data will be accessed and used by a service provider. We refer to this measure as the assessed or communicated privacy risk, which may be increased in return for increases in the quality of service or other incentives. Very small probabilities of sharing data may be tolerated by individuals (just as lightning strikes are tolerated as a rare event), yet offer providers sufficient information to optimize over a large population of users. Stochastic privacy depends critically on harnessing inference and decision making to make choices about data collection within the constraints of a guaranteed privacy risk.

We explore procedures that can be employed by service providers when preferences about the sharing of data are represented as privacy risk. The goal is to maximize the utility of service using data extracted from a population of users, while abiding by the agreement reached with users on privacy risk. We show that optimal selection of users under these constraints is NP-hard and thus intractable, given the massive size of the online systems. As a solution, we propose two procedures, RandGreedy and SPGreedy, that combine greedy value of information analysis with obfuscation to offer mechanisms for tractable optimization, while satisfying stochastic privacy guarantees. We present performance bounds for the expected utility achievable by these procedures compared to the optimal solution. Our contributions can be summarized as follows:

  • Introduction of stochastic privacy, an approach that represents preferences about the probability that data will be shared, and methods for trading off privacy risk, incentives, and quality of service.

  • A tractable end-to-end system for implementing a version of stochastic privacy in online services.

  • RandGreedy and SPGreedy procedures for sampling users under the constraints of stochastic privacy, with theoretical guarantees on the acquired utility.

  • Evaluation to demonstrate the effectiveness of the proposed procedures on a case study of user selection for personalization in web search.

Stochastic Privacy Overview

Figure 1 provides an overview of stochastic privacy in the context of a particular design of a system that implements the methodology. The design is composed of three main components: (i) a user preference component, (ii) a system preference component, and (iii) an optimization component for guiding the system’s data collection. We now provide details about each of the components and then formally specify the optimization problem for the selective sampling module.

User Preference Component

The user preference component interacts with users (e.g., during signup) and establishes an agreement between a user and service provider on a tolerated probability that the user’s data will be shared in return for better quality of service or incentives. Representing and capturing users’ tolerated privacy risk allows users to move beyond the binary choice of yes or no on the sharing of data. The incentives offered to users can be personalized based on the metalevel information available for a user (e.g., general location information inferred from a previously shared IP address) and can vary from guarantees of improved service [Krause and Horvitz2010] to complimentary software and entries in a lottery to win cash prizes (as done by the comScore service [Wikipedia-comScore2006]).

Formally, let $W$ be the population of users signed up for a service. Each user $w \in W$ is represented by the tuple $(r_w, c_w, o_w)$, where $o_w$ is the metadata information (e.g., IP address) available for user $w$ prior to selecting and logging finer-grained data about the user, $r_w$ is the privacy risk assessed by the user, and $c_w$ is the corresponding incentive provided in return for the user assuming the risk. The elements of this tuple can be updated through interactions between the system and the user. For simplicity of analysis, we shall assume that the pool $W$ and user preferences are static.

System Preference Component

The goal of the service provider is to optimize the quality of service. For example, a provider may wish to personalize web search and to improve the targeting of advertising for maximization of revenue. The service provider may record the activities of a subset of users (e.g., sets of queries issued, sites browsed, etc.) and use this data to provide better service globally or to a specific cohort of users. We model the private data of activity logs of user $w$ by the variable $l_w \in 2^L$, where $L$ represents the web-scale space of activities (e.g., set of queries issued, sites browsed, etc.). However, $l_w$ is observed by the system only after $w$ is selected and the data from $w$ is logged. We model the system’s uncertain belief of $l_w$ by a random variable $Y_w$, with $y_w$ being its realization distributed according to the conditional probability distribution $P(Y_w = y_w \mid o_w)$. In order to make an informed decision about user selection, the distribution $P(Y_w = y_w \mid o_w)$ is learned by the system using data available from the user and recorded logs of other users. We quantify the utility of the application from logging the activities of selected users $S$ through a function $g : 2^L \to \mathbb{R}$, given by $g(\bigcup_{s \in S} l_s)$. The expected value of the utility that the system can expect to gain by selecting users $S$ with observed attributes $O_S$ is characterized by the distribution $P$ and utility function $g$ as: $\tilde{g}(S) \equiv \mathbb{E}_{Y_S}\big[g(\bigcup_{s \in S} y_s)\big] = \sum_{y_S} P(Y_S = y_S \mid O_S) \cdot g(\bigcup_{s \in S} y_s)$. However, the application itself may be using the logs in a complex manner (such as training a ranker [Bennett et al.2011]) and evaluating this on complex user metrics [Hassan and White2013]. Hence, the system uses a surrogate utility function $f(S) \approx \tilde{g}(S)$ to capture the utility through a simple metric, for example, coverage of query-clicks obtained from the sampled users [Singla and White2010] or reduction in uncertainty of click phenomena [Krause and Horvitz2008].

In our model, we require the set function $f$ to be non-negative, monotone (i.e., whenever $A \subseteq A' \subseteq W$, it holds that $f(A) \le f(A')$), and submodular. Submodularity is an intuitive notion of diminishing returns, stating that, for any sets $A \subseteq A' \subseteq W$ and any given user $a \notin A'$, it holds that $f(A \cup \{a\}) - f(A) \ge f(A' \cup \{a\}) - f(A')$. These conditions are general and are satisfied by many realistic, as well as complex, utility functions [Krause and Guestrin2007], such as reduction in click entropy [Krause and Horvitz2008]. As a concrete example, consider the setting where attributes $O$ represent geo-coordinates of the users and $D : O \times O \to \mathbb{R}$ computes the geographical distance between any two users. The goal of the service provider is to provide location-based personalization of web search. For such an application, click information from local users provides valuable signals for personalizing search [Bennett et al.2011]. The system’s goal is to select a set of users $S$, and to leverage data from these users to enhance the service for the population. For search queries originating from any other user $w$, it uses the click data from the nearest user in $S$, given by $\arg\min_{s \in S} D(o_s, o_w)$. One approach for finding such a set $S$ is solving the k-medoid problem, which aims to minimize the sum of pairwise distances between the selected set and the remaining population [Mirzasoleiman et al.2013, Kaufman and Rousseeuw2009]. Concretely, this can be captured by the following submodular utility function:

$$f(S) = \frac{1}{|W|} \sum_{w \in W} \Big( \min_{x \in X} D(o_x, o_w) - \min_{s \in S \cup X} D(o_s, o_w) \Big) \qquad (1)$$

Here, $X$ is any one (or a set of) fixed reference location(s), for example, simply representing origin coordinates, and is used to ensure that the function $f$ is non-negative and monotone. Lemma 1 formally states the properties of this function.
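As a rough illustration of this kind of utility, the sketch below computes the average reduction in each user's distance to the nearest selected user, measured against a fixed reference set. It is only a sketch under stated assumptions: the names `dist` and `utility` are hypothetical, users are reduced to 2-D points, and Euclidean distance stands in for geographic distance.

```python
import math


def dist(a, b):
    """Euclidean distance between two 2-D points (a stand-in for geographic distance)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def utility(selected, population, reference):
    """Average reduction in distance to the nearest selected user.

    selected   -- points of the chosen users
    population -- points of all users in the pool
    reference  -- non-empty list of fixed reference points
    """
    anchors = list(reference) + list(selected)
    total = 0.0
    for w in population:
        base = min(dist(x, w) for x in reference)  # distance when nobody is selected
        best = min(dist(s, w) for s in anchors)    # distance once the selection is available
        total += base - best                       # never negative, since anchors include reference
    return total / len(population)


# Toy pool on a line; the reference point is the origin (illustrative values only).
population = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
reference = [(0.0, 0.0)]

print(utility([], population, reference))
print(utility([(4.0, 0.0)], population, reference))
print(utility([(4.0, 0.0), (5.0, 0.0)], population, reference))
```

Because the reference points are always among the anchors, the utility of the empty selection is zero and adding users can never decrease it, which is the role the reference set plays in the text. The second user at (5, 0) adds much less once (4, 0) is already selected, a small instance of diminishing returns.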

Procedure     Competitive utility   Privacy guarantees   Polynomial runtime
Opt           yes                   no                   no
Greedy        yes                   no                   yes
Random        no                    yes                  yes
RandGreedy    yes                   yes                  yes
SPGreedy      yes                   yes                  yes

Table 1: Properties of different procedures. RandGreedy and SPGreedy satisfy all the desirable properties.

Optimization Component

To make informed decisions about data access, the system computes the expected value of information (VOI) of logging the activities of a particular user, i.e., the marginal utility that the application can expect by logging the activity of this user [Krause and Horvitz2008]. In the absence of sufficient information about user attributes, the VOI may be small, and hence needs to be learned from the data. The system can randomly sample a small set of users from the population that can be used to learn and improve the models of VOI computation (explorative sampling in Figure 1). For example, for optimizing the service for a user cohort speaking a specific language, the system may choose to collect logs from a subset of users to learn how languages spoken by users map to geography. If preferences about privacy risk were not being regarded, VOI could be used to select which users to log with the goal of maximizing the utility for the service provider (selective sampling in Figure 1). Given that the utility function of the system is submodular, a greedy selection rule makes near-optimal decisions about data access [Krause and Guestrin2007]. However, this simple approach could violate the privacy guarantees made with users. To act in accordance with the assessed privacy risk, we design selective sampling procedures that couple obfuscation with VOI analysis to select the set of users to provide data.

The system needs to ensure that both the explorative and selective sampling approaches respect the privacy guarantees made to users: the likelihood of sampling any user $w \in W$ throughout the execution of the system must not exceed the privacy risk factor $r_w$. The system tracks the sampling risk (likelihood of sampling) that user $w$ faces during phases of the execution of explorative sampling, denoted $r_w^{ES}$, and selective sampling, denoted $r_w^{SS}$. The privacy guarantee for a user is preserved as long as: $r_w - \big(1 - (1 - r_w^{ES}) \cdot (1 - r_w^{SS})\big) \ge 0$. This difference between the assessed risk and the risk faced by a user can be viewed as the sampling budget of that user.
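The bookkeeping for a single user's sampling budget can be sketched as follows. This is a minimal sketch, assuming the two phases sample independently so that the combined risk is one minus the probability of being skipped in both phases; the function names and the numeric values are hypothetical.

```python
def combined_risk(r_explore, r_select):
    """Probability of being sampled in at least one phase.

    Assumes explorative and selective sampling act independently.
    """
    return 1.0 - (1.0 - r_explore) * (1.0 - r_select)


def sampling_budget(r_user, r_explore, r_select):
    """Assessed privacy risk minus the risk actually faced by the user."""
    return r_user - combined_risk(r_explore, r_select)


def guarantee_holds(r_user, r_explore, r_select):
    """True when the user's remaining sampling budget is non-negative."""
    return sampling_budget(r_user, r_explore, r_select) >= 0.0


# Illustrative numbers only: a user who tolerates a risk of 1 in 1000.
print(combined_risk(0.0002, 0.0005))
print(sampling_budget(0.001, 0.0002, 0.0005))
print(guarantee_holds(0.001, 0.0002, 0.0005))
```

A system could run such a check before each sampling phase and skip any user whose remaining budget would become negative.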

Optimization Problem for Selective Sampling

We now focus primarily on the selective sampling module and formally introduce the optimization problem. The goal is to design a sampling procedure $M$ that abides by guarantees of stochastic privacy, yet optimizes the utility of the application in decisions about accessing user data. Given a budget constraint $B$, the goal is to select users $S^M$:

$$S^M = \arg\max_{S \subseteq W} f(S) \qquad (2)$$

$$\text{subject to} \quad \sum_{s \in S} c_s \le B \quad \text{and} \quad r_w - r_w^M \ge 0 \;\; \forall w \in W.$$

Here, $r_w^M$ is the likelihood of selecting $w \in W$ by procedure $M$, and hence $r_w - r_w^M \ge 0$ captures the constraint of the stochastic privacy guarantee for $w$. Note that we interchangeably write the utility acquired by procedure $M$ as $f(M)$ to denote $f(S^M)$, where $S^M$ is the set of users selected by running $M$. We shall now consider a simpler setting of a constant privacy risk rate $r$ for all users and unit cost per user (thus reducing the budget constraint to a simpler cardinality constraint, given by $|S| \le B$). These assumptions lead to defining $B \le |W| \cdot r$, as that is the maximum possible set size that can be sampled by any procedure for Problem 2.

Selective Sampling with Stochastic Privacy

We shall now propose desiderata of the selection procedures, discuss the hardness of the problem and review several different tractable approaches, as summarized in Table 1.

Desirable Properties of Sampling Procedures

The problem defined by Equation 2 requires solving an NP-hard discrete optimization problem, even when the stochastic privacy constraint is removed. The algorithm for finding the optimal solution of this problem without the privacy constraint, referred to as Opt, is intractable [Feige1998]. We address this intractability by exploiting the submodular structure of the utility function and offer procedures providing provable near-optimal solutions in polynomial time. We aim at designing procedures that satisfy the following desirable properties: (i) provides competitive utility w.r.t. Opt with provable guarantees, (ii) preserves stochastic privacy guarantees, and (iii) runs in polynomial time.

Random Sampling: Random

Random simply samples $B$ users at random, without any consideration of cost and utility. The likelihood of any user being selected by the algorithm is $r^{Random} = B / |W|$, and hence the privacy risk guarantees are trivially satisfied (since $B \le |W| \cdot r$, as defined in Problem 2). In general, Random can perform arbitrarily poorly in terms of acquired utility, specifically for applications targeting particular user cohorts.
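A minimal sketch of this baseline, assuming a uniform draw without replacement; the function name `random_selection` and the toy numbers are hypothetical. Every user has the same chance of being picked, namely the budget divided by the pool size.

```python
import random


def random_selection(users, budget, rng):
    """Pick `budget` users uniformly at random, without replacement."""
    return rng.sample(list(users), budget)


users = list(range(1000))   # toy pool of user ids
risk = 0.01                 # assumed constant privacy risk rate
budget = 5                  # chosen so that budget / len(users) stays below risk

chosen = random_selection(users, budget, random.Random(0))
selection_probability = budget / len(users)

print(chosen)
print(selection_probability, selection_probability <= risk)
```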

Greedy Selection: Greedy

Next, we explore a greedy sampling strategy that maximizes the expected marginal utility at each iteration to guide the decision about selecting the next user to log. Formally, Greedy starts with the empty set $S = \emptyset$. At iteration $i$, it greedily selects a user $s_i^* = \arg\max_{w \in W \setminus S} f(S \cup \{w\}) - f(S)$ and adds it to the current selection of users, $S = S \cup \{s_i^*\}$. It stops when $|S| = B$.

A fundamental result by Nemhauser, Wolsey, and Fisher (1978) states that the utility obtained by this greedy selection strategy is guaranteed to be at least $(1 - 1/e)$ times that obtained by Opt. This result is tight under reasonable complexity assumptions ($P \neq NP$) [Feige1998]. However, such a greedy selection clearly violates the stochastic privacy constraint in Problem 2. Consider the user $w^*$ with the highest marginal value: $w^* = \arg\max_{w \in W} f(\{w\})$. The likelihood that this user will be selected by the algorithm is $r_{w^*}^{Greedy} = 1$, regardless of the requested privacy risk $r_{w^*}$.
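The greedy rule can be sketched in a few lines. The example below is a sketch under assumptions: it uses a toy coverage utility (the number of distinct items covered by the selected users), which is monotone and submodular but is not the utility used in the paper, and the names `coverage`, `greedy`, and `items_of` are hypothetical.

```python
def coverage(selected, items_of):
    """Toy submodular utility: number of distinct items covered by the selection."""
    covered = set()
    for user in selected:
        covered |= items_of[user]
    return len(covered)


def greedy(candidates, budget, f):
    """Repeatedly add the candidate with the largest marginal gain under f."""
    selected = []
    remaining = list(candidates)
    while len(selected) < budget and remaining:
        base = f(selected)
        best = max(remaining, key=lambda u: f(selected + [u]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical click data: each user covers a set of query ids.
items_of = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

chosen = greedy(sorted(items_of), 2, lambda s: coverage(s, items_of))
print(chosen, coverage(chosen, items_of))
```

The selection is fully deterministic: the user with the largest initial gain is picked on every run, which is exactly why this rule cannot honor a privacy risk below 1 for that user.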

Sampling and Greedy Selection: RandGreedy

We combine the ideas of Random and Greedy to design the procedure RandGreedy, which provides guarantees on stochastic privacy and competitive utility. RandGreedy is an iterative procedure that samples a small batch of users $\psi(s)$ at each iteration, then greedily selects $s^* \in \psi(s)$ and removes the entire set $\psi(s)$ from further consideration. By keeping the batch size $|\psi(s)| \le |W| \cdot r / B$, the procedure ensures that the privacy guarantees are satisfied. As our user pool $W$ is static, to reduce complexity, we consider a simpler version of RandGreedy that defers the greedy selection. Formally, this is equivalent to first sampling the users from $W$ at rate $r$ to create a subset $\tilde{W}$ such that $|\tilde{W}| = |W| \cdot r$, and then running the Greedy algorithm on $\tilde{W}$ to greedily select a set of users of size $B$.

The initial random sampling ensures a guarantee on the privacy risk for users during the execution of the procedure. In fact, for any user $w \in W$, the likelihood of $w$ being sampled and included in the subset $\tilde{W}$ is $r_w^{RandGreedy} \le r$. We further analyze the utility obtained by this procedure in the next section and show that, under reasonable assumptions, the approach can provide competitive utility compared to Opt.
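The deferred variant described above (sample a pool first, then run greedy selection inside it) might look like the following. This is a sketch under assumptions: the pool size is the pool fraction rounded down, the coverage utility is a toy stand-in for the real utility, and all names (`rand_greedy`, `items_of`, and so on) are hypothetical.

```python
import random


def coverage(selected, items_of):
    """Toy submodular utility: number of distinct items covered by the selection."""
    covered = set()
    for user in selected:
        covered |= items_of[user]
    return len(covered)


def greedy(candidates, budget, f):
    """Repeatedly add the candidate with the largest marginal gain under f."""
    selected, remaining = [], list(candidates)
    while len(selected) < budget and remaining:
        base = f(selected)
        best = max(remaining, key=lambda u: f(selected + [u]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


def rand_greedy(users, budget, risk, f, rng):
    """Sample a pool of about len(users) * risk users, then select greedily within it."""
    pool_size = int(len(users) * risk)          # rounding down keeps the inclusion rate <= risk
    pool = rng.sample(list(users), pool_size)   # each user lands in the pool with prob. pool_size / len(users)
    return pool, greedy(pool, min(budget, pool_size), f)


# Hypothetical data: 100 users, each covering two synthetic item ids.
items_of = {u: {u % 7, (3 * u) % 11 + 7} for u in range(100)}
pool, chosen = rand_greedy(list(items_of), budget=3, risk=0.1,
                           f=lambda s: coverage(s, items_of), rng=random.Random(0))
print(sorted(pool))
print(chosen, coverage(chosen, items_of))
```

Only users in the sampled pool can ever be selected, so no user's selection probability can exceed the probability of entering the pool.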

Greedy Selection with Obfuscation: SPGreedy

SPGreedy uses an inverse approach of mixing Random and Greedy: it does greedy selection, followed by obfuscation, as illustrated in Procedure 1. It assumes an underlying distance metric $D : W \times W \to \mathbb{R}$ which captures the notion of distance or dissimilarity among users. As in Greedy, it operates in iterations and selects the element $s^*$ with maximum marginal utility at each iteration. However, to ensure stochastic privacy, it obfuscates $s^*$ with similar users using the distance metric $D$ to create a set $\psi(s^*)$ of size $1/r$, then samples one user randomly from $\psi(s^*)$ and removes the entire set $\psi(s^*)$ from further consideration.

The guarantees on privacy risk hold by the following argument. During the execution of the algorithm, any user $w$ becomes a possible candidate for selection if the user is part of $\psi(s^*)$ in some iteration (e.g., iteration $i$). Given that $|\psi(s^*)| \ge 1/r$ and that the algorithm randomly samples one user from $\psi(s^*)$, the likelihood of $w$ being selected in iteration $i$ is at most $r$. The fact that the set $\psi(s^*)$ is removed from the available pool $W'$ at the end of the iteration ensures that $w$ can become a possible candidate for selection only once.

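Input: users $W$; cardinality constraint $B$; privacy risk $r$; distance metric $D : W \times W \to \mathbb{R}$
Initialize:
  • Outputs: selected users $S \leftarrow \emptyset$
  • Variables: remaining users $W' \leftarrow W$

while $|S| < B$ do
    $s^* \leftarrow \arg\max_{w \in W'} f(S \cup \{w\}) - f(S)$
    Set $\psi(s^*) \leftarrow \{s^*\}$
    while $|\psi(s^*)| < 1/r$ do
        $v \leftarrow \arg\min_{w \in W' \setminus \psi(s^*)} D(w, s^*)$
        $\psi(s^*) \leftarrow \psi(s^*) \cup \{v\}$
    Randomly select $\tilde{s}^* \in \psi(s^*)$
    $S \leftarrow S \cup \{\tilde{s}^*\}$
    $W' \leftarrow W' \setminus \psi(s^*)$

Procedure 1: SPGreedy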
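A runnable sketch of greedy selection with obfuscation is given below. It is an illustration under assumptions rather than the authors' implementation: users live on a line, the utility is a 1-D distance-reduction function measured against a reference point at the origin, the obfuscation set is the top candidate plus its nearest remaining neighbors, and the loop stops early if too few users remain to form a full obfuscation set. All function and variable names are hypothetical.

```python
import math
import random


def make_utility(position, reference=0.0):
    """Build a 1-D distance-reduction utility over users keyed by id."""
    def f(selected):
        total = 0.0
        for w in position:
            base = abs(position[w] - reference)
            best = min([base] + [abs(position[s] - position[w]) for s in selected])
            total += base - best
        return total / len(position)
    return f


def sp_greedy(users, budget, risk, f, dist, rng):
    """Greedy choice, then a uniform draw from the choice's obfuscation set."""
    group_size = math.ceil(1.0 / risk)      # obfuscation set has at least 1/risk members
    selected, remaining = [], list(users)
    # Safeguard (an assumption of this sketch): stop if a full set cannot be formed.
    while len(selected) < budget and len(remaining) >= group_size:
        base = f(selected)
        top = max(remaining, key=lambda u: f(selected + [u]) - base)
        # Put the greedy choice first, then the remaining users by distance to it.
        ranked = sorted(remaining, key=lambda u: (u != top, dist(u, top)))
        group = ranked[:group_size]
        selected.append(rng.choice(group))  # each member is picked with prob. 1 / group_size
        remaining = [u for u in remaining if u not in group]  # the whole set is retired
    return selected


position = {u: float(u) for u in range(20)}   # toy users at integer positions
f = make_utility(position)


def dist(a, b):
    return abs(position[a] - position[b])


picked = sp_greedy(list(position), budget=3, risk=0.25, f=f, dist=dist, rng=random.Random(0))
print(picked, f(picked))
```

Since each user can belong to at most one obfuscation set and is drawn from it with probability one over the set size, repeated runs with different seeds should select any given user in no more than roughly a fraction `risk` of the runs.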

Performance Analysis

We now analyze the performance of the proposed procedures in terms of the utility acquired compared to that of Opt as a baseline. We first analyze the problem in a general setting and then under a set of practical assumptions on the structure of the underlying utility function $f$ and the population of users $W$. The proofs of all the results are available in an anonymized online supplement.

General Case

In the general setting, we show that one cannot do better than $r \cdot f(\mathrm{Opt})$ in the worst case. Consider a population of users $W$ where only one user $w^*$ has a utility value of 1, and the rest of the users $W \setminus \{w^*\}$ have a utility of 0. Opt gets a utility of 1 by selecting $S^{Opt} = \{w^*\}$. Consider any procedure $M$ that has to respect the guarantees on privacy risk. If the privacy risk rate of $w^*$ is $r$, then $M$ can select $w^*$ with a maximum probability of only $r$. Hence, the maximum expected utility that any procedure $M$ for Problem 2 can achieve is $r$.

On a positive note, a trivial algorithm can always achieve a utility of $(1 - 1/e) \cdot r \cdot f(\mathrm{Opt})$ in expectation. This result can be achieved by running Greedy to select a set $S^{Greedy}$ and then choosing the final solution to be $S^{Greedy}$ with probability $r$, and otherwise outputting the empty set. Theorem 1 formally states these results for the general problem setting.

Theorem 1.

Consider Problem 2 of optimizing a submodular function $f$ under cardinality constraint $B$ and privacy risk rate $r$. For any distribution of marginal utilities of the population $W$, a trivial procedure can achieve an expected utility of at least $(1 - 1/e) \cdot r \cdot f(\mathrm{Opt})$. In contrast, there exists an underlying distribution for which no procedure can have an expected utility of more than $r \cdot f(\mathrm{Opt})$.
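The lower bound for the trivial procedure can be traced in three steps. The derivation below is a sketch, assuming the procedure outputs the greedy set with probability $r$ and the empty set otherwise, that $f$ is non-negative, and that the greedy set satisfies the $(1 - 1/e)$ approximation guarantee of Nemhauser, Wolsey, and Fisher; $S$ denotes the random output of the procedure.

```latex
\begin{align*}
\mathbb{E}[f(S)]
  &= r \cdot f(S^{\mathrm{Greedy}}) + (1 - r) \cdot f(\emptyset) \\
  &\ge r \cdot f(S^{\mathrm{Greedy}})
     && \text{(non-negativity of } f\text{)} \\
  &\ge \left(1 - \tfrac{1}{e}\right) \cdot r \cdot f(\mathrm{Opt})
     && \text{(greedy approximation guarantee)}
\end{align*}
```

The privacy constraint holds because every user in the greedy set is selected with probability exactly $r$, and every other user is selected with probability 0.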

Smoothness and Diversification Assumptions

In practice, we can hope to do much better than the worst-case results described in Theorem 1 by exploiting the underlying structure of users’ attributes and the utility function. We start with the assumption that there exists a distance metric $D : W \times W \to \mathbb{R}$ which captures the notion of distance or dissimilarity among users. For any given $w \in W$, let us define its $\alpha$-neighborhood to be the set of users within a distance $\alpha$ from $w$ (i.e., $\alpha$-close to $w$): $N_\alpha(w) = \{v : D(v, w) \le \alpha\}$. We assume that the population of users is large and that the number of users in $N_\alpha(w)$ is large. We formally capture these requirements in Theorems 2 and 3.

Firstly, we consider utility functions that change gracefully with changes in inputs, similar to the notion of $\lambda$-Lipschitz set functions used in Mirzasoleiman et al. (2013). We formalize the notion of smoothness in the utility function $f$ w.r.t. the metric $D$ as follows:

Definition 1.

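For any given set of users $S$, let us consider a set $\tilde{S}_\alpha$ obtained by replacing every $s \in S$ with any $w \in N_\alpha(s)$. Then, $|f(S) - f(\tilde{S}_\alpha)| \le \lambda_f \cdot \alpha \cdot |S|$, where the parameter $\lambda_f$ captures the notion of smoothness of the function $f$.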

Secondly, we consider utility functions that favor diversity or dissimilarity of users in the subset selection w.r.t. the distance metric $D$. We formalize this notion of diversification in the utility function as follows:

Definition 2.

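Let us consider any given set of users $S \subseteq W$ and a user $w \in W$. Let $\alpha = \min_{s \in S} D(s, w)$. Then, $f(S \cup \{w\}) - f(S) \le \Upsilon_f \cdot \alpha$, where the parameter $\Upsilon_f$ captures the notion of diversification of the function $f$.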

The utility function introduced in Equation 1 satisfies both the above assumptions as formally stated below.

Lemma 1.

Consider the utility function $f$ in Equation 1. $f$ is submodular, and satisfies the properties of smoothness and diversification, i.e., it has bounded $\lambda_f$ and $\Upsilon_f$.

We note that for functions with unbounded smoothness and diversification parameters (i.e., $\lambda_f \to \infty$ and $\Upsilon_f \to \infty$), we recover the general problem setting (equivalent to making no assumptions), and hence the results of Theorem 1 apply.

(a) Vary budget
(b) Vary privacy risk
(c) Analyze SPGreedy
Figure 2: In Fig. 2(a), for a fixed privacy risk $r$, the budget $B$ (the number of users selected) is increased, showing the competitiveness of our procedures w.r.t. Greedy. In Fig. 2(b), a fixed budget $B$ is used, and the level of privacy risk $r$ is reduced. The results demonstrate that the performance of RandGreedy and SPGreedy degrades smoothly. Fig. 2(c) analyzes the execution of the procedure SPGreedy and illustrates that the loss incurred in marginal utility at every step via obfuscation is very low.

Performance Bounds

Under the assumption of smoothness (i.e., a bounded smoothness parameter $\lambda_f$), we can show the following bound on the utility of RandGreedy:

Theorem 2.

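Consider Problem 2 for a function $f$ with bounded $\lambda_f$. Let $S^{Opt}$ be the set returned by Opt for Problem 2 after relaxing the privacy constraints. For a desired $\epsilon < 1$, let $\alpha_{rg} = \arg\min_\alpha \{\alpha : |N_\alpha(s)| \ge \frac{1}{r} \log(B/\epsilon) \;\forall s \in S^{Opt}, \text{ and } N_\alpha(s_i) \cap N_\alpha(s_j) = \emptyset \;\forall s_i \ne s_j \in S^{Opt}\}$. Then, with probability at least $(1 - \epsilon)$,

$$\mathbb{E}[f(\mathrm{RandGreedy})] \ge (1 - 1/e) \cdot \big(f(\mathrm{Opt}) - \alpha_{rg} \cdot \lambda_f \cdot B\big)$$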

Under the assumptions of smoothness and diversification (i.e., bounded parameters $\lambda_f$ and $\Upsilon_f$), we can show the following bound on the utility of SPGreedy:

Theorem 3.

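Consider Problem 2 for a function $f$ with bounded $\lambda_f$ and $\Upsilon_f$. Let $S^{Greedy}$ be the set returned by Greedy for Problem 2. Let $\alpha_{spg} = \arg\min_\alpha \{\alpha : |N_\alpha(s)| \ge 1/r \;\forall s \in S^{Greedy}\}$. Then,

$$\mathbb{E}[f(\mathrm{SPGreedy})] \ge (1 - 1/e) \cdot f(\mathrm{Opt}) - 2 \cdot (\lambda_f + \Upsilon_f) \cdot \alpha_{spg} \cdot B$$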

Intuitively, these results imply that both RandGreedy and SPGreedy achieve competitive utility w.r.t Opt, and the performance degrades smoothly as the privacy risk is decreased or the bounds on smoothness and diversification increase.

Experimental Evaluation

We shall now report on experiments we performed to build insights about the performance of the stochastic privacy procedures on a case study of the selective collection of user data in support of web search personalization.

Benchmarks and Metrics

We compare the performance of the RandGreedy and SPGreedy procedures against the baselines of Random and Greedy. While Random provides a trivial lower benchmark for any procedure, Greedy is a natural upper bound on the utility, given that Opt itself is intractable. To analyze the robustness of our procedures, we then vary the level of privacy risk $r$. We further carried out experiments to understand the loss incurred from the obfuscation phase during the execution of SPGreedy.

Experimental Setup

We considered the application of providing location-based personalization for queries issued for the business domain (e.g., real-estate, financial services, etc.). The goal is to select a set of users who are expert web search users in this domain. We seek to leverage the click data from these users to improve the relevance of search results shown to those searching for local businesses. The experiments are based on using a surrogate utility function as introduced in Equation 1. As we are interested in the specific domain of business-related queries, we modify the utility function in Equation 1 by restricting $S$ to users who are experts in the domain, as further described below. The acquired utility can be interpreted as the average reduction in the distance for any user $w$ in the population to the nearest expert $s \in S$.

The primary source of data for this study is interaction logs from a major web search engine. We considered a fraction of users who issued at least one query in the month of October 2013, restricted to queries coming from IP addresses located within ten neighboring states in the western region of the United States. This resulted in a pool of seven million users. We considered a setting where the system has access to metadata information of geo-coordinates of the users, as well as a probe of the last 20 search-result clicks for each user, which together constitute the observed attributes of user $w$, denoted as $o_w$. Each of these clicks is then classified into a topical hierarchy from a popular web directory, the Open Directory Project (ODP), using automated techniques [Bennett, Svore, and Dumais2010]. With a similar objective to White, Dumais, and Teevan (2009), the system then uses this classification to identify users who are experts in the business domain. We used a simple rule of classifying a user as an expert if at least one click was issued in the domain of interest. With this, the system marks a set of users as experts, and the set $S$ in Equation 1 is restricted to these experts. We note that the specific thresholds or variable choices do not affect the overall results below.


We now discuss the findings from our experiments.

Varying the budget $B$: In our first set of experiments, we vary the budget $B$, or equivalently the number of users selected, and measure the utility acquired by the different procedures. The privacy risk rate $r$ is held fixed. Figure 2(a) illustrates that both RandGreedy and SPGreedy are competitive w.r.t. Greedy and clearly outperform the naive baseline of Random.

Varying the privacy risk $r$: We then vary the level of privacy risk, for a fixed budget $B$, to measure the robustness of RandGreedy and SPGreedy. The results in Figure 2(b) demonstrate that the performance of RandGreedy and SPGreedy degrades smoothly, as per the performance analysis in Theorems 2 and 3.

Analyzing performance of SPGreedy: Lastly, we perform experiments to understand the execution of SPGreedy and the loss incurred from the obfuscation step. SPGreedy removes $1/r$ users from the pool at every iteration. As a result, for small privacy risk $r$, the relative loss from obfuscation (i.e., the relative % difference in marginal utility acquired by a user chosen by greedy selection, compared to the one finally picked after obfuscation) could possibly increase over the execution of the procedure, as illustrated in Figure 2(c), using a moving average of window size 10. However, the diminishing returns property of the utility ensures that SPGreedy incurs very low absolute loss in marginal utility from obfuscation at every step.


We introduced stochastic privacy, a new approach to privacy that centers on service providers abiding by guarantees about not exceeding a specified likelihood of logging data, and maximizing information collection in accordance with these guarantees. We presented procedures and an overall system design for maximizing the quality of services while respecting privacy risks agreed with populations of users.

Directions for this research include the assessments of user preferences about the probability of sharing data, including how users trade increases in privacy risk for enhanced service and monetary incentives. Directions also include exploring the rich space of designs for interactive and longer-term controls and settings of a tolerated risk of sharing data. Opportunities include policies and analyses based on the sharing of data as a privacy risk rate over time. As an example, systems might one day consider decisions about logging one or more search sessions of users where privacy risk is assessed in terms of the risk of sharing search sessions over time. In another research direction, designs can include models where users are notified when they are selected to share data and are provided with a special reward and option of declining to share at that time. Iterative analyses can be developed where subsets of users are actively engaged with the option to assume higher levels of privacy risk or to simply provide additional information in return for special incentives. Inferences about the likely preferences on privacy risk and about incentives for subpopulations could be folded into the selection procedures.


  • [Adar2007] Adar, E. 2007. User 4xxxxx9: Anonymizing query logs. In Workshop on Query Log Analysis at WWW’07.
  • [Arrington2006] Arrington, M. 2006. Aol proudly releases massive amounts of private data. http://techcrunch
  • [Bennett et al.2011] Bennett, P. N.; Radlinski, F.; White, R. W.; and Yilmaz, E. 2011. Inferring and using location metadata to personalize web search. In Proc. of SIGIR, 135–144.
  • [Bennett, Svore, and Dumais2010] Bennett, P. N.; Svore, K.; and Dumais, S. T. 2010. Classification-enhanced ranking. In Proc. of WWW, 111–120.
  • [Cooper2008] Cooper, A. 2008. A survey of query log privacy-enhancing techniques from a policy perspective. ACM Trans. Web 2(4):19:1–19:27.
  • [Feige1998] Feige, U. 1998. A threshold of ln n for approximating set cover. Journal of the ACM 45:314–318.
  • [FTC2011] FTC. 2011. FTC charges against Facebook.
  • [FTC2012] FTC. 2012. FTC charges against Google.
  • [Hassan and White2013] Hassan, A., and White, R. W. 2013. Personalized models of search satisfaction. In Proc. of CIKM, 2009–2018.
  • [Kaufman and Rousseeuw2009] Kaufman, L., and Rousseeuw, P. J. 2009. Finding groups in data: an introduction to cluster analysis, volume 344. Wiley.com.
  • [Krause and Guestrin2005] Krause, A., and Guestrin, C. 2005. A note on the budgeted maximization on submodular functions. Technical Report CMU-CALD-05-103, Carnegie Mellon University.
  • [Krause and Guestrin2007] Krause, A., and Guestrin, C. 2007. Near-optimal observation selection using submodular functions. In Proc. of AAAI, Nectar track.
  • [Krause and Horvitz2008] Krause, A., and Horvitz, E. 2008. A utility-theoretic approach to privacy and personalization. In Proc. of AAAI.
  • [Krause and Horvitz2010] Krause, A., and Horvitz, E. 2010. A utility-theoretic approach to privacy in online services. Journal of Artificial Intelligence Research (JAIR).
  • [Mirzasoleiman et al.2013] Mirzasoleiman, B.; Karbasi, A.; Sarkar, R.; and Krause, A. 2013. Distributed submodular maximization: Identifying representative elements in massive data. In Proc. of NIPS.
  • [Narayanan and Shmatikov2008] Narayanan, A., and Shmatikov, V. 2008. Robust de-anonymization of large sparse datasets. In Proc. of the IEEE Symposium on Security and Privacy, 111–125.
  • [Nemhauser, Wolsey, and Fisher1978] Nemhauser, G.; Wolsey, L.; and Fisher, M. 1978. An analysis of the approximations for maximizing submodular set functions. Math. Prog. 14:265–294.
  • [Olson, Grudin, and Horvitz2005] Olson, J.; Grudin, J.; and Horvitz, E. 2005. A study of preferences for sharing and privacy. In Proc. of CHI.
  • [Singla and White2010] Singla, A., and White, R. W. 2010. Sampling high-quality clicks from noisy click data. In Proc. of WWW, 1187–1188.
  • [Technet2012] Technet. 2012. Privacy and technology in balance.
  • [White, Dumais, and Teevan2009] White, R. W.; Dumais, S. T.; and Teevan, J. 2009. Characterizing the influence of domain expertise on web search behavior. In Proc. of WSDM, 132–141.
  • [Wikipedia-comScore2006] Wikipedia-comScore. 2006. ComScore-#Data_collection_and_reporting. http://en.
  • [Xu et al.2007] Xu, Y.; Wang, K.; Zhang, B.; and Chen, Z. 2007. Privacy-enhancing personalized web search. In Proc. of WWW, 591–600. ACM.

Appendix A Proof of Lemma 1

We prove Lemma 1 by proving three auxiliary lemmas (Lemmas 2, 3, and 4) that do not appear in the main paper. In Lemma 2, using the decomposable property of the function from Equation 1, we prove that the function is non-negative, monotone (non-decreasing), and submodular. Then, we show that the function satisfies the properties of smoothness (in Lemma 3) and diversification (in Lemma 4) by showing an upper bound on the value of the corresponding parameter in each case.

Lemma 2.

The utility function in Equation 1 is non-negative, monotone (non-decreasing), and submodular.


We begin by noting that the utility function is decomposable, i.e., it can be written as a sum of simpler functions:


where is given by:


Next, we prove that each of these functions is non-negative, non-decreasing and submodular. To prove that the function is non-decreasing, consider any two sets . Then,


In step 5, the inequality holds as the distance to the nearest user for in cannot be more than that in , hence proving that is non-decreasing. Also, it is easy to see that = 0, which along with the non-decreasing property, ensures that the function is non-negative.

To prove that the function is submodular, consider any two sets , and any given user . When , submodularity holds trivially as we have using the non-decreasing property. Let us consider the case when , i.e., is assigned as the nearest user to from the set , given by . In this case, is also the nearest user to from the set . Then, we can write down the difference of marginal gains as follows:


In step 6, the inequality holds as the function is non-decreasing, thus showing that the marginal gains diminish and hence proving the submodularity of the function .

Because these properties are preserved under linear combination with non-negative weights (all equal to from Equation 3), the utility function is non-negative, non-decreasing, and submodular. ∎
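The three properties in Lemma 2 can be spot-checked by brute force on a small instance. The sketch below is illustrative only: Equation 1 is not reproduced here, so the `utility` function is an assumed facility-location-style form (the average reduction in each user's distance to the nearest selected user, measured against a fixed reference point called `phantom`). The names `utility`, `check_lemma2`, and `phantom` are hypothetical.

```python
import itertools
import math
import random


def utility(selected, users, phantom):
    """Assumed utility (not taken verbatim from the paper): average reduction
    in each user's distance to the nearest point, relative to the distance
    to a fixed phantom reference point."""
    total = 0.0
    for w in users:
        base = math.dist(w, phantom)
        nearest = min([base] + [math.dist(w, s) for s in selected])
        total += base - nearest
    return total / len(users)


def check_lemma2(users, phantom, tol=1e-12):
    """Enumerate all subsets and test non-negativity, monotonicity,
    and submodularity (diminishing marginal gains)."""
    n = len(users)
    subsets = [frozenset(c) for k in range(n + 1)
               for c in itertools.combinations(range(n), k)]
    f = {A: utility([users[i] for i in A], users, phantom) for A in subsets}
    for A in subsets:
        if f[A] < -tol:
            return False  # non-negativity violated
        for B in subsets:
            if not A <= B:
                continue
            if f[A] > f[B] + tol:
                return False  # monotonicity violated
            for s in range(n):
                if s in B:
                    continue
                gain_small = f[A | {s}] - f[A]
                gain_large = f[B | {s}] - f[B]
                if gain_small < gain_large - tol:
                    return False  # submodularity violated
    return True


random.seed(0)
demo_users = [(random.random(), random.random()) for _ in range(6)]
print(check_lemma2(demo_users, phantom=(2.0, 2.0)))
```

The enumeration is exponential in the number of users, so it serves only as a sanity check on toy instances and does not replace the proof.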

Lemma 3.

The utility function in Equation 1 satisfies the property of smoothness, i.e., it has a bounded smoothness parameter.


For any given set of users , let us consider a set obtained by replacing every with any . The goal is to show that always holds for a fixed and bounded .

Let us again use the simpler functions from the decomposition of the utility function in Equation 3 and consider the difference . Then,


In step 7, the inequality holds as the deviation in distance to the nearest user for in cannot be more than . Using this result, we have


The inequality in step 8 holds by using the result of step 7, and the inequality in step 9 holds trivially as . Hence, the smoothness parameter of the function is bounded by . ∎

Lemma 4.

The utility function in Equation 1 satisfies the property of diversification, i.e., it has a bounded diversification parameter.


For any given set of users and any new user , let us define . The goal is to show that always holds for a fixed and bounded .

Again, let us consider the function and consider the marginal gain of adding to , given by . When is not the nearest user to in the set , we have . Let us consider the case where , i.e., is assigned as the nearest user to from the set , given by . Let us denote the nearest user assigned to before adding to the set by . Then, we have:


Step 10 uses the triangle inequality of the underlying metric space. In step 11, the inequality holds by the definition of . Then, we have


The inequality in step 12 holds by using the result of step 11. Hence, the diversification parameter of the function is bounded by . ∎

Proof of Lemma 1.

The proof follows directly from the results in Lemmas 2, 3, and 4. ∎
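The smoothness and diversification arguments of Lemmas 3 and 4 can also be illustrated numerically. The sketch below again assumes a facility-location-style utility with a hypothetical `phantom` reference point, since Equation 1 is not reproduced here. As a simplification, each selected user is replaced by an arbitrary point within distance `alpha` rather than by an actual neighboring user. Under these assumptions, the change in utility should not exceed `alpha`, and the marginal gain of a new user should not exceed its distance to the nearest already-selected user.

```python
import math
import random


def utility(selected, users, phantom):
    """Assumed utility (not taken verbatim from the paper): average reduction
    in distance to the nearest point, relative to a phantom reference point."""
    total = 0.0
    for w in users:
        base = math.dist(w, phantom)
        nearest = min([base] + [math.dist(w, s) for s in selected])
        total += base - nearest
    return total / len(users)


def jitter(point, alpha, rng):
    """Return a point at distance at most alpha from the given 2-D point."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    radius = rng.uniform(0.0, alpha)
    return (point[0] + radius * math.cos(angle),
            point[1] + radius * math.sin(angle))


def check_lemmas_3_and_4(users, phantom, alpha, budget, trials, rng, tol=1e-9):
    """Randomized spot check of the two bounds on the assumed utility."""
    for _ in range(trials):
        picked = rng.sample(users, budget + 1)
        chosen, new_user = picked[:-1], picked[-1]

        # Smoothness: moving every selected user by at most alpha
        # changes the utility by at most alpha.
        moved = [jitter(s, alpha, rng) for s in chosen]
        change = abs(utility(chosen, users, phantom)
                     - utility(moved, users, phantom))
        if change > alpha + tol:
            return False

        # Diversification: the marginal gain of a new user is at most
        # its distance to the nearest already-selected user.
        d_min = min(math.dist(new_user, s) for s in chosen)
        gain = (utility(chosen + [new_user], users, phantom)
                - utility(chosen, users, phantom))
        if gain > d_min + tol:
            return False
    return True


rng = random.Random(0)
demo_users = [(rng.random(), rng.random()) for _ in range(50)]
print(check_lemmas_3_and_4(demo_users, (3.0, 3.0), 0.1, 4, 200, rng))
```

Both checks rest on the same observation used in the proofs: by the triangle inequality, the distance from any user to its nearest selected user cannot change by more than the displacement of the selected users.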

Appendix B Proof of Theorem 2

Proof of Theorem 2.

Let be the set returned by Opt for Problem 2 without the privacy constraints. By the hypothesis of the theorem, for each element , the neighborhood of contains a set of at least users. Furthermore, by hypothesis, these sets of size at least can be constructed to be mutually disjoint for every ; let us denote these mutually disjoint sets by . Formally, this means that for , we have and for any pair of , we have .

Recall that the simpler version of RandGreedy first samples the users from at rate to create a subset such that . We first show that sampling at a rate by RandGreedy ensures that with high probability (given by ), at least one user is sampled from for each of the . Consider the process of sampling for and . Each of the users in has probability of being sampled given by . Hence, the probability that none of the users in are included in for a given is given by:

By the union bound, the probability that none of the users in gets included in for any is bounded by (given by ). Hence, with probability at least , the sampled set contains at least one user from for every .
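The sampling argument above can be made concrete with a short calculation. The sketch below assumes that each user is sampled independently at the given rate and that the neighborhood sets are disjoint and of equal size. The function names and the parameters `rate`, `group_size`, `num_groups`, and `delta` are hypothetical stand-ins for the quantities in the proof.

```python
import math


def coverage_probability(rate, group_size, num_groups):
    """Exact probability that every one of num_groups disjoint groups
    contributes at least one sampled user (independent sampling assumed)."""
    miss_one_group = (1.0 - rate) ** group_size
    return (1.0 - miss_one_group) ** num_groups


def union_bound(rate, group_size, num_groups):
    """Lower bound on the same probability obtained from the union bound."""
    miss_one_group = (1.0 - rate) ** group_size
    return max(0.0, 1.0 - num_groups * miss_one_group)


def min_group_size(rate, num_groups, delta):
    """Smallest group size for which the union bound guarantees
    a failure probability of at most delta."""
    return math.ceil(math.log(num_groups / delta) / -math.log(1.0 - rate))


m = min_group_size(rate=0.1, num_groups=5, delta=0.05)
print(m, union_bound(0.1, m, 5), coverage_probability(0.1, m, 5))
```

The exact probability is never smaller than the union bound, which is why the bound suffices for the high-probability statement in the proof.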

This is equivalent to saying that, with probability at least , the sampled set contains a set that can be obtained by replacing every with some , and hence (by the definition of the smoothness property). Moreover, running Greedy on ensures that the utility obtained is at least . Hence, with probability at least ,

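The two-stage structure analyzed in this proof (sample first, then select greedily from the sample) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the utility is an assumed facility-location-style function with a hypothetical `phantom` reference point, sampling is done independently per user at the given rate, and the names `greedy` and `rand_greedy` are illustrative.

```python
import math
import random


def utility(selected, users, phantom):
    """Assumed utility (not taken verbatim from the paper): average reduction
    in distance to the nearest point, relative to a phantom reference point."""
    total = 0.0
    for w in users:
        base = math.dist(w, phantom)
        nearest = min([base] + [math.dist(w, s) for s in selected])
        total += base - nearest
    return total / len(users)


def greedy(candidates, users, phantom, budget):
    """Repeatedly add the candidate with the largest resulting utility."""
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < budget:
        best = max(remaining,
                   key=lambda s: utility(chosen + [s], users, phantom))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def rand_greedy(users, phantom, budget, rate, rng):
    """Sample each user independently at the given rate, then run greedy
    selection restricted to the sampled users."""
    sample = [u for u in users if rng.random() < rate]
    return greedy(sample, users, phantom, budget)


rng = random.Random(1)
demo_users = [(rng.random(), rng.random()) for _ in range(200)]
picked = rand_greedy(demo_users, (3.0, 3.0), budget=5, rate=0.2, rng=rng)
print(len(picked), utility(picked, demo_users, (3.0, 3.0)))
```

Because only sampled users are ever considered, no user's data can be selected unless that user passed the initial sampling step, which happens with probability equal to the sampling rate in this sketch.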
Appendix C Proof of Theorem 3

Proof of Theorem 3.

Let be the set returned by Greedy for Problem 2 without the privacy constraints. By the hypothesis of the theorem, for each element , the neighborhood of contains a set of at least users. The loss of utility for the procedure SPGreedy relative to Greedy at iteration can be attributed to the following two reasons: obfuscation of with set to select , where the size of is , and removal of the entire set from further consideration. We analyze these two factors separately to get the desired bounds on the utility of SPGreedy.
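The two sources of loss described above correspond to two steps in each iteration, which the following sketch makes explicit. It is a hedged illustration only: the utility is an assumed facility-location-style function with a hypothetical `phantom` reference point, the obfuscation set is taken to be the greedy choice together with its nearest remaining users, and its size is exposed as a hypothetical parameter `group_size` because the exact size is not reproduced here.

```python
import math
import random


def utility(selected, users, phantom):
    """Assumed utility (not taken verbatim from the paper): average reduction
    in distance to the nearest point, relative to a phantom reference point."""
    total = 0.0
    for w in users:
        base = math.dist(w, phantom)
        nearest = min([base] + [math.dist(w, s) for s in selected])
        total += base - nearest
    return total / len(users)


def sp_greedy(users, phantom, budget, group_size, rng):
    """Greedy selection with obfuscation: the greedy choice is hidden among
    its nearest remaining users, one member of that group is picked uniformly
    at random, and the whole group is removed from the pool."""
    pool = list(users)
    chosen = []
    while len(chosen) < budget and len(pool) >= group_size:
        # Step 1: the user that plain greedy selection would pick.
        best = max(pool, key=lambda s: utility(chosen + [s], users, phantom))
        # Step 2: obfuscation set = best plus its nearest remaining users.
        group = sorted(pool, key=lambda u: math.dist(u, best))[:group_size]
        # Step 3: pick one member at random (first source of loss).
        chosen.append(rng.choice(group))
        # Step 4: remove the entire group (second source of loss).
        pool = [u for u in pool if u not in group]
    return chosen


rng = random.Random(2)
demo_users = [(rng.random(), rng.random()) for _ in range(100)]
picked = sp_greedy(demo_users, (3.0, 3.0), budget=4, group_size=5, rng=rng)
print(len(picked), utility(picked, demo_users, (3.0, 3.0)))
```

In this sketch each user enters at most one obfuscation set and is picked from it with probability one over `group_size`, so choosing `group_size` to be at least the reciprocal of the tolerated privacy risk keeps every user's selection probability within that risk.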

We begin by stating a more general result on the approximation guarantees of Greedy from [Krause and Guestrin2005] when the submodular objective function can only be evaluated approximately within an absolute error of . The result from [Krause and Guestrin2005] states that the utility obtained by this noisy greedy selection is guaranteed to be at least , where is the budget.

Now, consider an alternate procedure that operates similarly to SPGreedy, by obfuscating with set to pick at each iteration . However, this alternate procedure does not eliminate the entire set of users from the pool, but only removes . Instead, it tags the users of as , i.e., these users are marked as invalid and are tagged with the iteration at which they became invalid (if a user was already marked as invalid, the iteration tag is not updated). Let us denote this alternate procedure by . This can alternatively be viewed as similar to Greedy, though it can pick the user at every iteration only approximately, because of the noise added by obfuscation. We now bound the absolute value of this approximation error at every iteration. As is obfuscated with a set of users of size nearest to from the hypothesis of the theorem, we are certain that set is contained within a radius of neighborhood. Now, from the smoothness assumptions, the maximum absolute error that could be introduced by the obfuscation compared to greedy selection (i.e., the difference in marginal utilities of and ) at a given iteration is bounded by . Hence, the utility obtained by can be lower-bounded as:


Next, we consider the loss associated with the removal of the entire set at iteration . Let us consider the execution of and let be the first iteration when the obfuscation set created by the procedure contains at least one element marked as invalid, with the associated iteration of invalidity as . Note that when , there is no loss associated with this step of removing and hence we only consider the case when . As the users are embedded in Euclidean space, this means that the centered around and overlaps and hence . From the diversification assumption, this means that the marginal utility of cannot be more than . Furthermore, submodularity ensures that for all , the marginal utility of the users selected can be no greater than the marginal utility of .

Let us consider a truncated version of that stops after steps, denoted by , where denotes the fact that this procedure is always valid as it never touches invalid marked users. The utility of the truncated version can be lower-bounded as follows:


Step 14 follows by using the result in step 13. For the first iterations, the execution of the mechanism SPGreedy is exactly the same as . Hence, SPGreedy acquires utility at least equal to that acquired by , which completes the proof. ∎