Privacy-Utility Trades in Crowdsourced Signal Map Obfuscation

01/13/2022
by   Jiang Zhang, et al.

Cellular providers and data aggregating companies crowdsource cellular signal strength measurements from user devices to generate signal maps, which can be used to improve network performance. Recognizing that this data collection may be at odds with growing awareness of privacy concerns, we consider obfuscating such data before the data leaves the mobile device. The goal is to increase privacy such that it is difficult to recover sensitive features from the obfuscated data (e.g. user ids and user whereabouts), while still allowing network providers to use the data for improving network services (i.e. create accurate signal maps). To examine this privacy-utility tradeoff, we identify privacy and utility metrics and threat models suited to signal strength measurements. We then obfuscate the measurements using several preeminent techniques, spanning differential privacy, generative adversarial privacy, and information-theoretic privacy techniques, in order to benchmark a variety of promising obfuscation approaches and provide guidance to real-world engineers who are tasked with building signal maps that protect privacy without hurting utility. Our evaluation results, based on multiple, diverse, real-world signal map datasets, demonstrate the feasibility of concurrently achieving adequate privacy and utility, with obfuscation strategies which use the structure and intended use of datasets in their design, and target average-case, rather than worst-case, guarantees.


1 Introduction

Network providers and data aggregating companies crowdsource mobile user data for a variety of reasons. This data can reveal network performance, allow for the generation of signal strength maps, inform decisions on where to deploy cell towers or sensors, and provide insight on how to improve user experience. The measurements are collected directly from user devices, via standalone mobile apps opensignal, or measurement software development kits tutela integrated into popular partnering apps. Providers and aggregators then sell this data to network operators, regulators, and device and equipment manufacturers. For the operators, regulators, and manufacturers, this crowdsourced data offers clear value for network planning. For the user, contributing data can in turn be useful, given that it leads to better network performance. However, participation also raises legitimate privacy concerns.

For example, some cellular providers have allegedly been selling their users’ real-time location data to credit agencies, bail bondsmen, and other third parties att_lawsuit. Furthermore, while these measurements are assumed to be sparse in space and time and over thousands of users, previous work has shown that identities are inferable from anonymized data personal_discovery.

In recent years, privacy issues have come to the forefront of news, politics, and public opinion nature_privacy; nytimes; ibm, and pioneering privacy laws have been enacted URL_GDPR; URL_CCPA. To protect user privacy, a plethora of data masking, or obfuscation, schemes have been proposed, see, for example, privacycrowdsensing. However, by obfuscating the original data for the sake of privacy, the data can no longer provide the exact insights it once could, sacrificing data utility for privacy infotheor_util.

In this work we examine the privacy-utility tradeoff in the context of cellular signal strength measurements, focusing on device-level obfuscation where the measurement is obfuscated, or privatized, before it leaves the user’s phone. The goal is to increase privacy such that it is difficult to recover sensitive features from the obfuscated measurements, including user ids and whereabouts, while still allowing network providers to use the measurement for improving network services, i.e. create accurate signal maps. To examine this privacy-utility tradeoff, we identify privacy and utility metrics and threat models suited to the signal map application at hand. We then obfuscate the measurements using a number of promising approaches at the forefront of privacy research, in order to benchmark them and provide guidance to real-world engineers who are tasked with building signal maps that provide (some) privacy while maintaining (adequate) utility. To evaluate the different approaches, we use multiple, diverse, real-world signal map datasets to ensure real world applicability of our findings.

We implement four strategies for obfuscating signal strength measurements to assess and compare their application-specific performance, selecting preeminent methods from the literature that span a range of complexities and privacy guarantees. Specifically, the first is a noise-adding privatizer, which adds independent, identically distributed Gaussian noise across the features of the data. Albeit simple, this scheme provides intuition into the privacy-utility tradeoff via the choice of how much noise to add. The second is based on differential privacy (DP) dp, a leading approach to data obfuscation which provides probabilistic worst-case guarantees against any arbitrary adversary, including one with unlimited resources and access to side-information. In this work we apply the popular local Gaussian mechanism dp, as well as the recent Truncated Laplacian mechanism geng2018privacy. The third leverages the idea of generative adversarial networks to allow a data-driven method of learning an obfuscation scheme. This method, referred to as generative adversarial privacy (GAP) gap, positions a privatizer and an adversary, both modeled as neural networks, against each other. The privatizer learns to obfuscate the data such that the adversary cannot infer sensitive features, and the adversary simultaneously learns to infer sensitive features. While this method cannot offer the formal worst-case guarantees of the differentially private methods, the learning approach offers the potential to leverage structure in the dataset and take advantage of the specific utility objectives in the network. The fourth strategy is motivated by an information-theoretic treatment of the problem. Considering mutual information as a convex metric for privacy performance, we frame a formal optimization problem as finding the obfuscation strategy which maximizes privacy subject to a constraint on utility. This approach maximizes user privacy in an average sense, but sacrifices the worst-case guarantees offered by the differentially private methods. Section 5 discusses these privatizers in more detail.

We analyze the performance of each of these privatizers using three diverse, real-world signal map datasets. The first one is collected from cellular users over a seven-month period in the city of Chania, Greece manos. The second one is collected over a period of four months by Android smartphones on the University of California Irvine campus emmanouil2017using. The last one is sampled from the Radiocell dataset URL_RADIOCELL, one of the largest publicly available datasets, with millions of measurements from nearly one million macrocells around the world. The sample we work with contains signal strength measurements from hundreds of users over a one-year period in the UK countryside. Section 3.1 discusses these datasets in detail.

An important aspect of our study is to identify privacy and utility metrics (Section 4) as well as threat models (Section 3.2) suited to the signal map application. We assess our obfuscation schemes against specific adversaries, modeled as neural networks (Section 3.5 discusses adversary models in detail), which estimate private user information from observing obfuscated data, and we take the adversary's estimation performance as a practical, application-specific privacy metric. We also consider more robust privacy guarantees, such as DP, which do not depend on any specific adversary implementation. With respect to utility, we consider two metrics. First, we consider a received signal strength (RSS) model which accurately predicts signal maps when trained with unobfuscated data. We train this model with the obfuscated data, and use, as an application-specific utility metric, the distance between the parameters of the RSS model trained with obfuscated versus unobfuscated data. As a general utility metric, we use an overall assessment of data distortion. This serves as a proxy for utility under a wide variety of other potential mobile data applications.

The main contributions of this work are: (1) a framework for formalizing privacy and utility for the crowdsourced signal map application, (2) a systematic exploration of the parameterized privacy-utility tradeoff when obfuscating signal strength measurements, (3) an analysis and comparison of four obfuscation schemes, and (4) an evaluation of the feasibility of achieving different notions of privacy in the signal map application. Our findings show that while local DP provides privacy guarantees under worst-case conditions, it comes with a substantial cost in utility. GAP and IT privacy can offer a significantly improved privacy-utility tradeoff by sacrificing worst-case guarantees and incorporating application-specific context, such as structure in the datasets and network objectives.

In the next section, we briefly discuss relevant work in privacy, especially as it relates to mobile network data. Section 3 describes our system model, including the three real-world datasets that we use in our evaluation, the threat models we consider, the privatizer and adversary models we implement, and the service provider model we consider. Section 4 rigorously defines our privacy and utility metrics. Section 5 presents each of the four obfuscation schemes. In Section 6, we evaluate and compare these schemes, and analyze our results. We discuss limitations and future work in Section 7, and present our conclusions in Section 8.

2 Related Work

2.1 Privacy mechanisms

Differential privacy (DP) dworkdp; dp; dpresults is a mathematically rigorous definition of privacy which is useful for quantifying privacy loss and designing randomized algorithms with privacy guarantees. Motivated by statistical disclosure control, or providing accurate statistics while protecting individual survey respondents, DP approaches the problem of releasing coarse-grained information while keeping fine-grained details private. This popular approach to data privatization is studied under the local localdp and global models. The global model assumes a trusted data analyst has access to the dataset and wants to release queries computed on it in a privacy-preserving fashion. The local model assumes the absence of a trusted server, thus the data is randomized prior to aggregation. This work applies the local model of differential privacy.

Generative Adversarial Privacy (GAP) gap; cagap offers an alternative to noise-adding mechanisms in that it is context-aware, meaning it takes the dataset statistics into account. GAP learns from the dataset without the need to explicitly model the underlying distribution. Leveraging recent advancements in generative adversarial networks (GANs) gan1; gan2; cgan, GAP allows the privatizer to learn obfuscation schemes from the dataset itself. Like the generator and discriminator in a GAN, the privatizer and adversary optimize a minimax game.

Information-theoretic (IT) privacy infotheor_util; info_theo; infotheoretic provides an alternative in which privacy metrics are motivated by concepts from information theory. For example, mutual information mutualinformation is a measure of how much one random variable tells us about another. Obfuscation schemes which minimize mutual information intuitively provide privacy. Unlike DP, which provides worst-case privacy guarantees, mutual information is an expectation, i.e. it provides guarantees on average privacy.

2.2 Theoretical studies of privacy-utility trades

Previous work has analyzed distortion in the context of DP dp_distortion or attempted to minimize the utility loss incurred by DP dp_lopub. Previous work in GAP maximizes privacy subject to a constraint on distortion gap. Additionally, previous IT privacy metrics have been considered in the context of theoretically motivated utility metrics ITutilitymetrics. In contrast, in this work we consider utility metrics beyond distortion which are specific to our application and are both more intuitive and relevant for mobile network data. We also formally compare the performance of context-free (Gaussian noise-adding, local DP) and context-aware (GAP, IT) approaches in the context of our application.

Prior theoretical studies on the privacy-utility tradeoff include infotheor_util; PU_trade; pu_ghosh. The authors of infotheor_util formally define an analytical model for trading equivocation (a privacy metric based on Shannon entropy) and distortion (a utility metric which could be Euclidean distance, Hamming distortion, Kullback-Leibler divergence, etc.). This model is designed for "universal" metrics, but is not generalized for non-i.i.d. datasets or datasets lacking strong structural properties. A so-called geometric mechanism is presented in pu_ghosh as a utility-maximizing alternative to the Laplace or Gaussian mechanisms typically used in differential privacy, where utility is the expected loss of any symmetric, monotonic loss function. In PU_trade, the authors define a bound on the information-theoretic min-entropy leakage of ε-differential privacy, and a bound on utility (where utility is roughly the number of differing dataset entries). Our work uniquely examines this tradeoff for all of these approaches in the unifying context of a single application, allowing us to present additional insight.

2.3 Prior work on mobile network data privacy

Previous work on privacy in mobile network data has considered strategic sampling, distribution modeling, and noise addition as obfuscation strategies. In privacycrowdsensing, the authors exploit compressive sensing techniques to sample and compress received signal strength values in a privacy-preserving RSS map generation scheme. While privacy is gained in sampling and compression, the authors of privacycrowdsensing do not take a formal approach to quantifying privacy. In distributednoise, distributed algorithms for Gaussian and exponential noise addition are explored in a crowdsourced data setting. Local differential privacy is applied to the user-tracking problem for indoor positioning systems in mobile_dp. The authors of syntheticdata present a relaxed version of differential privacy, probabilistic DP, which accounts for certain worst-case privacy scenarios being highly unlikely. They apply this to the generation of synthetic data which maps daily commutes. In jin2016inception, a novel privacy-preserving incentive mechanism is proposed for mobile crowd sensing, where the authors employed DP to perturb aggregated data. In each of privacycrowdsensing; distributednoise; mobile_dp; syntheticdata; jin2016inception, utility is not rigorously considered. Our work takes a formal approach to both privacy and utility.

In recent years, researchers have grown interested in studying the privacy-utility tradeoff in mobile network applications. Shokri et al. in shokri2012protecting propose an optimal strategy against location attacks based on Stackelberg Bayesian game theory, which provides the best location privacy while satisfying the user's service quality requirements. Bordenabe et al. in bordenabe2014optimal formulate the tradeoff optimization problem between geo-indistinguishability and quality of service, and propose a method based on linear optimization to solve it. Chen et al. in chen2021optimized design a differentially private obfuscation scheme based on reinforcement learning to optimize the privacy budget allocation for each location in vehicle trajectory obfuscation, which balances geolocation obfuscation and semantic security and thus yields a better privacy-utility tradeoff. In boukoros2019lack, the authors design novel privacy and utility metrics for location privacy, and perform a large-scale evaluation and analysis of several existing location-privacy preserving mechanisms. In KIM2021102464, the authors provide a survey of DP-based obfuscation approaches for location privacy protection and compare the privacy-utility tradeoff performance of these approaches. However, these works only focus on location privacy without considering other privacy metrics for mobile user data. Moreover, the approaches proposed in shokri2012protecting and bordenabe2014optimal are based on linear programming and discrete locations, which cannot be easily applied under our threat models (continuous locations and a non-linear adversary). The mechanism proposed by chen2021optimized does not formally optimize the privacy-utility tradeoff during trajectory obfuscation. boukoros2019lack is an empirical study with no formal analysis or obfuscation schemes that formally consider both privacy and utility in their design, and KIM2021102464 only surveys DP-based obfuscation approaches without considering other obfuscation schemes.

In gursoy2018differentially, the authors propose a novel framework, DP-star, for publishing trajectory data with differential privacy guarantees while preserving high utility. In adatrace the authors present AdaTrace, a utility-aware location trace synthesizer which provides a differential privacy guarantee and inference attack resilience. This work is closely related to ours in that the authors employ both learning and noise-adding to generate datasets which they evaluate for statistical utility, and analyze how the choice of privacy parameter affects utility. In messaoud2018privacy, the authors propose a privacy-preserving and utility-aware mechanism based on mutual information optimization, with application to the data uploading phase in participatory sensing. Zhang et al. in zhang2020aggregation also propose an information-theoretic approach based on mutual information optimization, which protects the user's location privacy while satisfying the user's utility constraints when releasing location aggregates. However, these works only consider the database-level threat model during dataset publishing, which requires a trustworthy third party to distort data before release, and they cannot be directly applied to the device-level obfuscation in our application.

In liu2019privacy and Raval2019OlympusSP, GANs are leveraged to achieve utility-aware obfuscation of mobile sensor data. However, these works focus on obfuscating image sensor data to reduce sensitive information leakage in mobile apps, where both the dataset structure and threat model are different from ours. Moreover, Raval2019OlympusSP does not compare GANs with other formal obfuscation schemes, and liu2019privacy does not compare against obfuscation schemes that formally consider both privacy and utility in their design.

While the advantages and disadvantages of a range of obfuscation methods are to some extent known in principle from prior work, how they perform and compare in the signal map application is unclear. In this work, we implement representative obfuscation schemes based on preeminent approaches, apply them to the important real-world application of generating signal maps via crowdsourcing, and compare their performance. Our performance results can serve as benchmarks, offering insights about how to design real-world systems that generate accurate signal maps while protecting user privacy.

3 System Model

Fig. 1 illustrates the system model we consider, which involves mobile users, a service provider or a third party, and an adversary. User devices record network measurement data and transmit it to a service-provider or third-party server. Since the reported data contains information that the users may deem private (e.g. user location, see Section 3.1), users apply device-level privatizers to obfuscate their data locally before uploading them to the server (see Section 3.3). The goal of the service provider is to train an RSS model based on the aggregated obfuscated user measurement data, which can be used to generate signal maps and thus guide network planning and operation taufique2017planning (see Section 3.6). Finally, an adversary with access to the obfuscated data estimates the whereabouts of users, by estimating the user ID and location corresponding to the incoming measurements (see Section 3.5). Note that we assume the adversary has access to the obfuscated data as it arrives at the server, but no side information that directly reveals the identity of users (see Section 3.2 for a detailed description of the threat model).

Figure 1: Overview of system model. 1) Users collect measurement data and obfuscate it before uploading it to the service provider; 2) The service provider or third party aggregates obfuscated user data to train a signal map model, i.e. RSS prediction model; 3) The adversary has access to the obfuscated user data and uses it to estimate the user ID and locations.

3.1 User Data

We use three real-world datasets collected from different countries and over different time periods to evaluate the performance of our privatization schemes under different environments and user behaviors, and thus make our findings more conclusive.

The first dataset is taken from users in Chania, Greece, and will be referred to as the Chania dataset, which contains measurements from nine users over seven months in 2014. The nine users are mobile device owners who carry their devices with them throughout the day collecting measurements. Each measurement contains 24 features: device address, timestamp (to the second), received signal strength (RSS) in dBm, latitude, longitude, cellID identifying the base station, downlink carrier frequency, uplink carrier frequency, mobile network code, etc.

The second dataset contains measurements from seven users over four months in the University of California Irvine (UCI) campus in 2017, and will be referred to as the UCI dataset. Each measurement consists of 15 features including latitude, longitude, reference signal received power (RSRP) in dBm, reference signal received quality (RSRQ) in dB, timestamp, deviceID, cellID, etc.

The third dataset is collected by Radiocell.org URL_RADIOCELL, which has been crowdsourcing wireless network measurements from mobile users worldwide since 2009. It is the largest open-source mobile network dataset we have access to. We sample about 0.5 million measurements from 219 mobile users in the UK in 2019, and refer to it as the Radiocell dataset. (We choose the UK since most of the measurements collected in 2019 come from mobile users in the UK, 10 million measurements in total; to limit computational complexity, we select the three cells containing the largest amount of data.) Each measurement has 23 features including latitude, longitude, altitude, speed, signal strength (SS) in dBm, country code, mobile network code, etc.

The most relevant features to this paper are tabulated in Table 1 along with an indicator of their sensitivity. User ID and location are assumed sensitive features (private), whereas RSS/RSRP and others are not sensitive (public).

Feature     | User ID | Latitude | Longitude | RSS    | Others
Sensitivity | Private | Private  | Private   | Public | Public
Table 1: Dataset Features
Figure 2: Chania Dataset (colored by user)

For visualization purposes, we plot the data of the first dataset over the geographic region in Fig. 2. The colors indicate user ID, and it is apparent that one cannot easily infer user ID based on location alone.

3.2 Threat Model

Our adversary has the goal of gathering private information that may be revealed by users operating in the mobile data network who are sending data reports to the service provider. The adversary may use this information for purposes not in the users' interest, or even to aid criminal attacks such as identity theft. To accomplish this goal, the adversary will seek to obtain access to as many user feature reports as possible. Since the primary information sought by the adversary may not be explicitly present in the reports, e.g., if the reports are intentionally obfuscated, the adversary will perform inference attacks to estimate the private user information they desire. The nature of the threat may vary depending on the specific mobile data application and the capabilities of the adversary. With this in mind, we consider the following properties as part of the definition of the threat model:

  • Whether the adversary can access individual user reports directly, or whether their access is limited to the aggregated reports of all users,

  • whether the adversary should be assumed to have bounded computational resources,

  • whether the adversary has access to relevant side information, and

  • whether users are primarily concerned with potential exposure of private information from their reports on average or in the worst-case.

Side information is any additional information that may be available to an adversary that could be used to supplement the information collected from the user reports to increase the efficacy of an inference attack. This could include public databases from organizations like the US Census Bureau or the Department of Transportation which allow an adversary to associate data features, e.g., addresses with names.

Typical mobile network data threat model: For most mobile network data applications and users, we apply the following threat model:

  • The adversary can access individual user reports directly,

  • the adversary’s computational resources are bounded,

  • the adversary has limited access to side information, and

  • users are primarily concerned with privacy exposure on average.

We consider that many users are likely to have reservations about providing private data to a service provider, either because they do not trust the provider to adequately protect their data or they believe the service provider will themselves use the data in ways that do not align with the user’s interests. For this reason, we assume a threat model where users must be able to protect their private information at the local level, e.g., at the user device. We also recognize that some users likely will not have such reservations, and thus a minority of users can be incentivized, e.g., through discounts, to trust a data aggregator with their data, allowing for the possibility of training or tuning privacy schemes based on real user data. Adversary computational resources are assumed to be bounded, recognizing that other methods outside the scope of the data network could be employed to reliably obtain the same private information if an adversary is assumed to have limitless resources. Adversary access to side information is assumed to be limited for the same reason. Finally, we assume users will typically be concerned with the exposure of their private information on average. For most mobile data applications, a user will likely operate with the network over a long period of time and will generate many feature reports as a result. Further, exposure of the private data of any one report will typically pose a much lower risk than exposure through the aggregation of many reports over a period of time. Thus, protecting against an adversary attack on any single report under worst-case conditions is unnecessary for typical applications.

Worst-case mobile network data threat model: Due to the wide variety of potential mobile network data applications and possible user privacy concerns, we acknowledge there may be some use cases where a worst-case threat model is appropriate. To account for this, we also treat such a model in our analysis. This adversary can access individual reports directly, but in contrast to the typical threat model we assume the adversary has unbounded computational resources and unlimited access to side information. Also, users are concerned with exposure of any single feature report, and their private information in each report must be protected from exposure under worst-case conditions.

                                         | LDP                        | GAP                        | IT
Threat model:
  Adversary computational resources      | Unlimited                  | Limited                    | Unlimited
  Adversary side-info access             | Unlimited                  | Limited                    | Unlimited
  Type of privacy-loss guarantee         | Worst-case                 | On-average                 | On-average
  Provable adversary privacy protection  | Against any adversary      | Against trained adversary  | Against any adversary
Context:
  Privatizer access to data for training | Not necessary but helpful* | Yes                        | No
  Privatizer access to data distribution | Not necessary but helpful* | No                         | Yes
  Utility protection type                | None/Some*                 | Maximize utility           | Lower bound on utility

* As discussed in detail in Section 5.2, LDP requires clipping. While clipping can be done in a manner which is agnostic to the data thakkar2019differentially; mcmahan2018general, this may result in large utility loss. As a result, clipping is usually performed using information about the data to ensure the added noise is calibrated with the range of data values, see Eq. (16).

Table 2: Context used by privatizers (last 4 rows) and properties of threat models (first 6 rows).

3.3 Data Obfuscation and Privatizers

To protect against the adversary threat, privacy can be preserved through obfuscation of the feature data provided by individual users before it is released to the service provider. At a minimum, the feature set is stripped of user ID. Remaining features are then obfuscated according to the selected privatization scheme, or "privatizer" for short. This is needed because the adversary may learn patterns in the data which associate public and private features; it is therefore not sufficient to obfuscate only the private features.

The privatizer produces an obfuscated version of each measurement feature report, where the mapping from original to obfuscated report depends on the design of the privatizer. We will consider several privatizers, described fully in Section 5. Some privatizers leverage actual user data in their design. We assume such data is collected either through opt-in surveys and service provider incentives, or else collected by the provider through other means such as wardriving. In our analysis, we use 70% of our available dataset for training our adversary (see Section 3.5 for more details) as well as for training, fitting models, and/or choosing parameters of the privatizers (see Section 5 for more details). The remainder of the dataset is used to test our privatizers against the adversary.

3.4 Context

User data is a type of application-specific context, and different privatizers may use the actual data, data distributions, or merely data moments like mean and variance. There are other types of application-specific context, e.g. privacy and utility metrics of interest, which privatizers may optimize over. Since mobile service providers know what they want to use the data for, and may ask their clients about privacy concerns, such metrics may indeed be available to be used in the design of privatizers.

Using context has implications for the threat model. For example, optimizing over a particular privacy metric guarantees protection with respect to that metric, but not with respect to arbitrary functions of the data. As another example, if a privatizer optimizes its design under a known data distribution, or is trained on a given dataset, its performance is not guaranteed under different distributions and datasets.

Using context may also offer utility guarantees since optimizing over, or putting a constraint on a utility metric, restricts the privatizer from making obfuscation decisions that reduce utility below acceptable levels.

Table 2 compares the different privatizers with respect to how much context they use and which threat model properties they can protect against. LDP offers stronger privacy protection than the rest as it provides worst-case privacy guarantees against any adversary with potentially unlimited resources and side-information access. However, it does not have a formal mechanism to guarantee a minimum level of utility. In contrast, GAP and IT are aware of application-specific utility metrics which they include in their optimization setups, and thus provide utility guarantees. The GAP privatizer in particular optimizes a multi-objective function which considers both privacy and utility. That said, it is optimized against, and can offer formal guarantees only for, the particular adversary in its training loop. These fundamental distinctions among the different obfuscation approaches are discussed in more detail in Section 5, and their implications for the privacy-utility tradeoff are presented and discussed in detail in Section 6.

3.5 Adversary Model

Depending on whether users upload measurements to the server one at a time or in batches, the adversary may or may not know whether a sequence of measurements originated from a single user or multiple users. Consider first the scenario where each user uploads one obfuscated measurement at a time. Given that the user ID of each obfuscated measurement is unknown, the adversary takes as input one measurement from the obfuscated dataset and predicts the user ID and true location (the unobfuscated latitude and longitude) from which the measurement originated. Now consider the scenario where, for the sake of reduced system complexity, each user uploads a sequence of obfuscated measurements each time (in practice, the service provider can require users to upload their data weekly or monthly). While the user ID of each obfuscated measurement in the database is unknown, the adversary knows that measurements in the same batch belong to the same user, and can take advantage of correlations across measurements to improve estimation. In this case the adversary takes as input a measurement sequence from a single user and predicts a single user ID and the true location of each measurement in the sequence. In Section 6 we investigate the performance under both scenarios; see Section 6.2 for a direct comparison between the two.

The adversary's estimation is a mapping from obfuscated measurements to user ID and location estimates, and one may use a number of approaches to perform that mapping. In theory, one could discretize the continuous features and use empirical conditional probabilities and maximum likelihood estimation, but in practice the state space would explode. Given the availability of real-world datasets, learning is a better choice. We experimented with linear and non-linear models for user ID estimation, and chose a deep neural network (DNN) to model our adversary (see Fig. 3), given the effectiveness of DNNs in approximating non-linear functions.

Specifically, our adversary is modeled as a fully-connected DNN containing two hidden layers with 256 neurons each. Between layers we employ Rectified Linear Unit (ReLU) activations, and our optimization relies on Adaptive Moment Estimate (Adam) stochastic gradient descent with a learning rate of 0.001. These values were empirically selected to maximize the adversary’s performance when given the unobfuscated data as input.
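To make this setup concrete, the sketch below mirrors the adversary architecture in PyTorch. The two-headed output (per-user logits plus a latitude/longitude regression) and the example feature and user counts (roughly the Chania dataset's 9 users and its features with the user ID stripped) are our own illustrative assumptions, not the exact implementation used in the paper.

```python
# Minimal adversary sketch: two hidden layers of 256 units, ReLU activations,
# trained with Adam (lr = 0.001). One head outputs per-user logits, the other
# an estimated (latitude, longitude) pair.
import torch
import torch.nn as nn

class Adversary(nn.Module):
    def __init__(self, num_features: int, num_users: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(num_features, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.id_head = nn.Linear(256, num_users)   # user ID logits
        self.loc_head = nn.Linear(256, 2)          # latitude, longitude

    def forward(self, x):
        h = self.body(x)
        return self.id_head(h), self.loc_head(h)

# Example instantiation (hypothetical sizes: 23 features after stripping the
# user ID, 9 users).
adversary = Adversary(num_features=23, num_users=9)
optimizer = torch.optim.Adam(adversary.parameters(), lr=1e-3)
```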

Assume that each input measurement contains a certain number of features and that there are a certain number of users (both depend on the dataset, see Section 3.1). Each input batch then consists of a number of measurements containing these features. The output of the adversary neural network is a matrix representing estimates of user ID and location. Each row of this matrix contains the likelihoods that the corresponding measurement belongs to each of the users, along with the estimated latitude and longitude of the original measurement. The loss function used to train the adversary is a weighted sum of the categorical cross entropy loss of the user ID estimate vector and the Euclidean distance between the actual location and the location estimate. The user ID estimate error, location estimate error, and adversary loss functions are defined in Section 4.

We provide our adversary 70% of the obfuscated dataset to train on, for which it has access to the unobfuscated user IDs and locations, and test it on the remaining 30% of the data. Providing the adversary such a high portion of the data for training makes our privacy results conservative. In our threat model we have assumed some access to side information but comprehensively modeling access to possible forms of side information is intractable. The adversary’s access to 70% of the dataset with obfuscated and true user ID and location labels serves as an approximation of some form of side information. Side information may include known user locations at certain times, or inputs from the adversary’s own devices on the network to establish ground truth. The training set could also correspond to the adversary simulating published privatizers which may be revealed by the service provider to help convince users regarding their ability to preserve privacy.

Figure 3: A diagram of adversary implementation.

3.6 Signal Map Model

The service provider trains an RSS predictor based on the aggregated user data such that it can generate accurate signal maps. Specifically, the model input features include the (obfuscated) latitude, longitude and other features, and the model output is the RSS value in dBm.

There is a long line of research on RSS prediction models, see, for example, erceg1999empirically; han2013double; alimpertis2019city. We first consider a simple path loss model molisch2012wireless but find its accuracy to be underwhelming. We also consider a linear model and a neural network model and find that both have comparably good accuracy, yet the former is easier to work with. Notably, its parameters can be estimated in one step, which allows us to calculate application-specific utility metrics more efficiently (see Section 4.2). We thus select a linear RSS prediction model. Specifically, we use the following model:

$\hat{\text{RSS}}(x) = \theta^{T} x, \qquad (1)$

where $d$ is the total number of features in a measurement (excluding the RSS itself), $x \in \mathbb{R}^{d}$ is the feature vector, and $\theta \in \mathbb{R}^{d}$ is the parameter vector. Given a set of measurements arranged in a matrix $M$, the parameter vector of the RSS prediction model can be estimated via linear (least-squares) regression as follows:

$\hat{\theta} = (X^{T} X)^{-1} X^{T} y, \qquad (2)$

where $y$ is the third column of $M$ (the RSS values) and $X$ is the remaining columns of $M$ without the third column. Similarly, given a set of obfuscated measurements $M'$ with corresponding $X'$ and $y'$, the parameter vector of the RSS prediction model can be estimated as $\hat{\theta}' = (X'^{T} X')^{-1} X'^{T} y'$.
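A few lines of NumPy suffice to compute this estimate; the sketch below assumes, as in the description above, that the RSS values sit in the third column of the measurement matrix.

```python
# Least-squares estimate of the linear RSS model parameters (Eq. (2)).
import numpy as np

def fit_rss_model(measurements: np.ndarray) -> np.ndarray:
    """measurements: (n, d+1) array whose third column holds RSS in dBm."""
    y = measurements[:, 2]                      # RSS column
    X = np.delete(measurements, 2, axis=1)      # all remaining feature columns
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# The map-error utility metric (Section 4.2) compares the parameters fitted
# on the original data with those fitted on the obfuscated data.
```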

4 Definition of Metrics

In this section we define the metrics used to evaluate privacy and utility.

4.1 Privacy

Let $m$ denote the number of measurements per batch, $u_i$, $i = 1, \dots, m$, the user ID of each collected measurement, and $\hat{u}_i$ the adversary's estimate of $u_i$. The adversary computes a probability distribution over the space of possible user IDs and selects for each measurement the user ID estimate with the maximum likelihood. We define the adversary estimate accuracy as the fraction of correct user ID estimates, that is, $\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}[\hat{u}_i = u_i]$, where the indicator function is equal to 1 if the estimate is correct and 0 otherwise. Since high values of accuracy correspond to low values of privacy, we define the first privacy metric as

$P_{\mathrm{ID}} = 1 - \frac{1}{m}\sum_{i=1}^{m}\mathbb{1}[\hat{u}_i = u_i]. \qquad (3)$

Let $(\mathrm{lat}_i, \mathrm{lon}_i)$ denote the true latitude and longitude of measurement $i$ and $(\widehat{\mathrm{lat}}_i, \widehat{\mathrm{lon}}_i)$ the adversary's estimates. While the user ID estimate is a probability distribution, the latitude and longitude estimates specify an exact location. Our second privacy metric is the Euclidean distance between the true location and the adversary's estimate, averaged over the batch, defined by

$P_{\mathrm{loc}} = \frac{1}{m}\sum_{i=1}^{m}\sqrt{(\mathrm{lat}_i - \widehat{\mathrm{lat}}_i)^2 + (\mathrm{lon}_i - \widehat{\mathrm{lon}}_i)^2}, \qquad (4)$

where the subscript $i$ corresponds to the $i$-th measurement in the batch of size $m$. This metric captures how well the adversary is able to recover the original user location. High values of adversary location error correspond to high privacy.

Since both user IDs and locations are considered private and sensitive information in our application, we further define the following composite privacy metric:

$P = w_1 P_{\mathrm{ID}} + w_2 P_{\mathrm{loc}}, \qquad (5)$

where $w_1$ and $w_2$ are parameters controlling the weights of the two aforementioned privacy metrics. $P_{\mathrm{ID}}$ and $P_{\mathrm{loc}}$ are the privacy metrics we use throughout the paper to compare the performance of different privatizers.
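For concreteness, the metrics above can be computed as follows; the array shapes and the default weights are illustrative assumptions.

```python
# Privacy metrics of Eqs. (3)-(5) for one batch of m measurements.
import numpy as np

def privacy_metrics(user_ids, id_estimates, locs, loc_estimates, w1=0.5, w2=0.5):
    """user_ids/id_estimates: (m,) int arrays; locs/loc_estimates: (m, 2) arrays."""
    p_id = 1.0 - np.mean(user_ids == id_estimates)                  # Eq. (3)
    p_loc = np.mean(np.linalg.norm(locs - loc_estimates, axis=1))   # Eq. (4)
    return p_id, p_loc, w1 * p_id + w2 * p_loc                      # Eq. (5)
```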

4.1.1 Additional privacy metrics

The composite privacy metric defined above is not differentiable, because $P_{\mathrm{ID}}$ is not, which is a problem for adversary training. To handle this we use the cross entropy loss of the user ID estimate,

$L_{\mathrm{ID}} = -\frac{1}{m}\sum_{i=1}^{m}\log \hat{p}_i(u_i), \qquad (6)$

where $\hat{p}_i(u)$ is the estimated likelihood of user ID $u$ for measurement $i$, and define the loss function of the adversary as

$L_{\mathrm{adv}} = w_1 L_{\mathrm{ID}} + w_2 P_{\mathrm{loc}}, \qquad (7)$

which is used in the training of the adversary and of the GAP neural networks (the GAP privatizer and adversary used in the iterative training, see Section 5.3).

Our IT privacy approach, see Section 5.4, is motivated by the use of mutual information as a measure of privacy. The mutual information $I(X;Y)$ between two random variables $X$ (e.g., our input) and $Y$ (e.g., the obfuscated data) quantifies how much information about one random variable is obtained by observing the other. It is given by

$I(X;Y) = \sum_{x}\sum_{y} p_{X,Y}(x,y)\,\log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}, \qquad (8)$

where $p_{X,Y}$ is the joint probability mass function and $p_X$, $p_Y$ are the marginal probability mass functions.
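As a rough illustration, mutual information between two continuous quantities can be estimated empirically by discretizing them; the binning below is an arbitrary choice for illustration, not part of the paper's method.

```python
# Empirical estimate of Eq. (8) on binned data, using scikit-learn.
import numpy as np
from sklearn.metrics import mutual_info_score

def empirical_mi(x: np.ndarray, y: np.ndarray, bins: int = 20) -> float:
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    y_binned = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
    return mutual_info_score(x_binned, y_binned)   # result in nats
```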

Last, the privacy metrics defined above are well suited for the typical threat model discussed in Section 3.2. However, for the worst-case threat model involving adversaries with unbounded computational resources and auxiliary information, where users seek protection of any single report (see Section 3.2), we resort to differential privacy (DP) dpresults. Specifically, let $\mathcal{A}$ be a randomized function applied to the input dataset. $\mathcal{A}$ gives $(\epsilon,\delta)$-differential privacy if for all datasets $D_1$ and $D_2$ which differ in at most one element, and all $S \subseteq \mathrm{Range}(\mathcal{A})$,

$\Pr[\mathcal{A}(D_1) \in S] \le e^{\epsilon}\,\Pr[\mathcal{A}(D_2) \in S] + \delta, \qquad (9)$

where the probability is taken over the randomness in $\mathcal{A}$. $\epsilon$ and $\delta$ bound the difference between the outputs of $\mathcal{A}$ on $D_1$ and $D_2$, thus making it hard to guess the input ($D_1$ versus $D_2$) by observing the output. DP is a strong guarantee, since it does not make any assumptions about the computational power and auxiliary information available to the adversary, and $\epsilon$ and $\delta$ serve as metrics for privacy; see dp for more details.

4.2 Utility

Let $d$ be the number of features of each measurement, excluding the user ID which is stripped from the input. The output of the privatizer is the obfuscated data; e.g., each obfuscated measurement contains an obfuscated latitude and longitude. Our utility metrics quantify the difference between the input data and the obfuscated data. Recall that we consider $m$ measurements per batch, thus the input and output are collections of $m$ measurement vectors $x_i$ and $y_i$, $i = 1, \dots, m$, respectively. We consider several utility metrics motivated by real-world applications of crowdsourced network data.

The first metric quantifies the overall distortion of the dataset, considering all features, via the L2-norm distance between input and obfuscated data, averaged over all batch measurements:

$U_{\mathrm{dist}} = -\frac{1}{m}\sum_{i=1}^{m}\|x_i - y_i\|_2. \qquad (10)$

Intuitively, high values of distortion correspond to low utility, hence the minus sign in Eq. (10).

The second utility metric is related to the RSS prediction model described in Section 3.6. Recall that the goal of the service provider is to estimate an accurate RSS prediction model based on the aggregated user data. However, with obfuscated user data, the estimated parameters of the RSS prediction model differ from those estimated with unobfuscated user data (i.e. the estimated parameter vector changes from $\hat{\theta}$ to $\hat{\theta}'$, see Eq. (2)). To minimize the difference between them, we define our second utility function as the negative L1-norm distance between $\hat{\theta}$ and $\hat{\theta}'$:

$U_{\mathrm{map}} = -\|\hat{\theta} - \hat{\theta}'\|_1, \qquad (11)$

where $\hat{\theta}$ represents the RSS prediction model parameters estimated from unobfuscated user data (i.e. the privatizer's input) and $\hat{\theta}'$ represents the parameters estimated from obfuscated user data (i.e. the privatizer's output). We refer to $\|\hat{\theta} - \hat{\theta}'\|_1$ as the generated map error, where higher values of map error correspond to lower utility. While many metrics could be used to measure the distance between the two models, comparing the fitted parameters over this bounded space provides a simple, effective loss function. Note that this map error does not capture how well an RSS prediction model generated from the obfuscated data could predict RSS values at a new location, but rather captures the "distance" between maps generated before and after obfuscation.

Envisioning that, in practice, the service provider may care about more than a single application-specific utility metric like $U_{\mathrm{map}}$, we further define a composite utility metric as

$U = w_3 U_{\mathrm{dist}} + w_4 U_{\mathrm{map}}, \qquad (12)$

where $w_3$ and $w_4$ are parameters adjusting the weights of each utility metric.
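A compact sketch of both utility metrics follows; the third-column RSS convention and the default weights are illustrative assumptions carried over from the earlier sketches.

```python
# Utility metrics of Eqs. (10)-(12) for one batch of m measurements.
import numpy as np

def _fit_rss(m: np.ndarray) -> np.ndarray:        # least-squares fit, Section 3.6
    y = m[:, 2]
    X = np.delete(m, 2, axis=1)
    return np.linalg.lstsq(X, y, rcond=None)[0]

def utility_metrics(x: np.ndarray, y: np.ndarray, w3=0.5, w4=0.5):
    u_dist = -np.mean(np.linalg.norm(x - y, axis=1))            # Eq. (10)
    u_map = -np.sum(np.abs(_fit_rss(x) - _fit_rss(y)))          # Eq. (11)
    return u_dist, u_map, w3 * u_dist + w4 * u_map              # Eq. (12)
```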

5 Privatizers

In this section we introduce in detail each of the four privatizers, which represent different types of obfuscation schemes. We first select a Gaussian noise-adding privatizer for its simplicity and as a benchmark. We then select a locally differentially private privatizer (LDP), motivated by the well-known strengths of differential privacy. Next, we select a privatizer based on GANs (referred to as the GAP privatizer), given the recent interest in how adversarial learning may be used to train privatizers by positioning them against adversaries. Last, we select the so-called IT privatizer, since it is a good representative of obfuscation schemes which use mutual information as a privacy metric and use optimization to design the obfuscation.

5.1 Gaussian Noise-Adding Privacy

Our Gaussian noise-adding privatizer (Noise privatizer) takes the simplest approach to data obfuscation. For each input batch of $m$ points with $d$ features each, we add an $m \times d$ matrix of Gaussian noise. Each element in this noise matrix is normally distributed with a mean of 0 and a standard deviation of $\sigma$. Since the data is also normalized such that each feature has a mean of zero and a standard deviation of 1, values of $\sigma$ close to 1 add a significant amount of noise, and we choose to vary $\sigma$ between 0 and 1. Fig. 4(a) provides a side-by-side visualization of the normalized input data, the obfuscated data, and the adversary's estimate when using the Noise privatizer with a low value of $\sigma$. Fig. 4(b) shows the signal maps generated before and after obfuscation. This shows that even in the presence of obfuscation, we can generate representative signal maps with the obfuscated data. Figures 5(a) and 5(b) show the same plots for a high value of $\sigma$. Note that while the privacy is improved, i.e. the adversary estimate is further from the input data, the signal maps differ significantly.
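The Noise privatizer amounts to a one-liner; the sketch below assumes the batch has already been normalized per feature.

```python
# Gaussian noise-adding privatizer: i.i.d. N(0, sigma^2) noise on every feature.
import numpy as np

def noise_privatizer(batch: np.ndarray, sigma: float, rng=None) -> np.ndarray:
    """batch: (m, d) normalized feature matrix; returns the obfuscated batch."""
    rng = rng if rng is not None else np.random.default_rng()
    return batch + rng.normal(0.0, sigma, size=batch.shape)
```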

For reference, a privatizer which releases a completely random dataset (from a normal distribution with variance of 1.0) regardless of input data would observe the errors shown in Table 3.

Metric              | No Obfuscation | Random Data
Distortion          | 0              | 5.74
Generated Map Error | 0              | 2.41
Table 3: Utility Reference Values
(a) Input Data, Obfuscated Data, Adversary Estimate
(b) Signal Maps Before and After Obfuscation
Figure 4: Noise privatizer when σ = 0.1.
(a) Input Data, Obfuscated Data, Adversary Estimate
(b) Signal Maps Before and After Obfuscation
Figure 5: Noise privatizer when σ = 0.9.

5.2 Local Differential Privacy

We implement two local DP (LDP) privatizers which provide mathematical guarantees for privacy (see Eq. (9)) under the worst-case threat model discussed in Section 3.2.

The first approach is the Gaussian Mechanism with parameters $\epsilon$ and $\delta$, which we refer to as GLDP dp. This mechanism adds zero-mean Gaussian noise with variance $\sigma^2$ to each feature. This variance is defined by

$\sigma^2 = \frac{2\ln(1.25/\delta)\,\Delta_2^2}{\epsilon^2}, \qquad (13)$

where $\Delta_2$ is the L2-sensitivity of a function $f$, given by

$\Delta_2 = \max_{D_1, D_2}\|f(D_1) - f(D_2)\|_2, \qquad (14)$

where $D_1$ and $D_2$ are subsets of the dataset differing only by one element. Generally, in the local DP model, one can think of $D_1$ and $D_2$ as datasets of size 1 (i.e. one data point) and $f$ as an identity function. Therefore the sensitivity becomes the greatest L2-distance between any two data points. In practice, we use an analytically calibrated Gaussian Mechanism, which is shown to extend better to the very high privacy regime and the low privacy regime; see Algorithm 1 in analytical_gaussian for the exact calculation of the variance of the added noise.
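For reference, a sketch of the classical calibration of Eq. (13) is given below; the paper itself uses the analytic calibration of analytical_gaussian, which replaces this closed form with a numerical computation.

```python
# Classical Gaussian mechanism: per-feature noise scaled by the L2-sensitivity.
import numpy as np

def gaussian_mechanism(x: np.ndarray, sensitivity: float,
                       epsilon: float, delta: float, rng=None) -> np.ndarray:
    rng = rng if rng is not None else np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon  # Eq. (13)
    return x + rng.normal(0.0, sigma, size=x.shape)
```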

The second approach is the Truncated Laplacian Mechanism with parameters $\epsilon$ and $\delta$, which we refer to as LLDP, recently proposed in geng2018privacy. This mechanism adds noise following the truncated Laplacian distribution, with probability density function

$f(x) = \begin{cases} B\,e^{-|x|/\lambda}, & x \in [-A, A], \\ 0, & \text{otherwise}, \end{cases} \qquad (15)$

where $\lambda = \Delta/\epsilon$, $A = \frac{\Delta}{\epsilon}\ln\!\left(1 + \frac{e^{\epsilon}-1}{2\delta}\right)$, $B = \frac{1}{2\lambda\,(1 - e^{-A/\lambda})}$, and $\Delta$ is defined in Eq. (14).

For both approaches, we follow standard practice and use a value of $\delta$ much smaller than the inverse of the number of data points dptutorial, and values of $\epsilon$ between 1 and 10 (larger values yield a very loose bound, see Eq. (9)), where low values of $\epsilon$ guarantee better privacy.

Moreover, following standard practice again, we clip each data point to have L2-norm at most $C$. Then, by invoking the triangle inequality, we can ensure that the sensitivity is no greater than $2C$. Specifically, for both the Gaussian and Laplacian mechanisms, we clip each data point $x$ according to the following function:

$\mathrm{clip}(x) = x \cdot \min\!\left(1, \frac{C}{\|x\|_2}\right). \qquad (16)$

To choose $C$, we use the rule of thumb that clipping should occur 5% of the time. Using the pilot dataset to approximate how much of the data would be clipped for a given value of $C$, we choose the clipping threshold accordingly and use this parameter during testing.
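The clipping step of Eq. (16) is shown below; choosing the threshold from a pilot dataset so that roughly 5% of points are clipped follows the rule of thumb above.

```python
# L2-norm clipping (Eq. (16)): rescale any point whose norm exceeds C.
import numpy as np

def clip_l2(x: np.ndarray, C: float) -> np.ndarray:
    norm = np.linalg.norm(x)
    return x if norm <= C else x * (C / norm)

def choose_clip_threshold(pilot: np.ndarray, clip_fraction: float = 0.05) -> float:
    """Pick C so that about clip_fraction of the pilot points would be clipped."""
    norms = np.linalg.norm(pilot, axis=1)
    return float(np.quantile(norms, 1.0 - clip_fraction))
```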

Table 4 compares the GLDP and LLDP privatizers with respect to our privacy and utility metrics on the Chania dataset. We notice that both the GLDP and LLDP privatizers incur quite large utility losses. From this table, it is evident that GLDP achieves sizably higher privacy than LLDP with respect to both the user ID and location estimate errors, especially for larger $\epsilon$ values. Although the GLDP privatizer has a larger loss in utility, neither the GLDP nor the LLDP privatizer can offer any utility protection. Hence, we use GLDP, with its higher privacy, in the rest of the paper when comparing LDP with the other approaches under the typical threat model, see Section 6.

Note that while our Noise and LDP privatizers both add normally distributed noise, the key difference between the two is the noise clipping step. Intuitively, this ensures that no two data points are too different. This gives an anonymity to each measurement that is crucial to privacy under a worst-case threat model.

ε   | Mechanism | User ID error ($P_{\mathrm{ID}}$) | Location error ($P_{\mathrm{loc}}$) | Distortion | Generated map error
1   | GLDP      | 0.68 | 0.94 | 113.76 | 2.96
1   | LLDP      | 0.68 | 0.94 | 84.60  | 2.95
10  | GLDP      | 0.63 | 0.91 | 16.46  | 2.39
10  | LLDP      | 0.49 | 0.69 | 8.50   | 2.49
100 | GLDP      | 0.32 | 0.36 | 4.21   | 2.44
100 | LLDP      | 0.05 | 0.10 | 0.93   | 2.20
Table 4: Comparison of the GLDP and LLDP privatizers on the Chania dataset.

5.3 Generative Adversarial Privacy

Generative Adversarial Privacy is a data-driven approach to obfuscation which learns a privatization strategy by positioning the privatizer and adversary against each other in a minimax game gap; cagap. Our privatizer is a fully-connected feedforward neural network with a structure similar to that of our adversary. It has two hidden layers of 256 units each. Between layers we employ Rectified Linear Unit (ReLU) activations, and our optimization relies on Adaptive Moment Estimate (Adam) stochastic gradient descent with a learning rate of 0.001. The privatizer takes an input batch of $m$ measurements and outputs a batch of $m$ obfuscated measurements, where each measurement has been obfuscated independently. (We treat the case where measurements are grouped into batches and then jointly obfuscated in Section 6.2.)

Our privatizer wants to minimize the following loss function:

$L_{\mathrm{p}} = -\rho\, U - (1 - \rho)\, L_{\mathrm{adv}}, \qquad (17)$

where $U$ is the composite utility metric defined in Eq. (12) and $L_{\mathrm{adv}}$ is the adversary loss function defined in Eq. (7), which is a differentiable version of the composite privacy metric and depends on the adversary's estimation error of the user ID and location.

Notice that as the adversary's loss decreases (implying less privacy), the privatizer's loss increases. The parameter $\rho \in [0, 1]$ quantifies the penalty on utility loss, as opposed to privacy loss. Utility losses have a large effect on the privatizer when $\rho$ is close to 1, and privacy losses have a large effect on the privatizer when $\rho$ is close to 0.

We take an iterative approach to training the two neural networks. We first train the adversary: we fix the neural network (NN) weights of the privatizer and perturb the NN weights of the adversary along the negative gradient of $L_{\mathrm{adv}}$ for a number of epochs. We then train the privatizer, that is, we perturb the NN weights of the privatizer along the negative gradient of $L_{\mathrm{p}}$ for a number of epochs, and so on and so forth. When both have converged, we have found the equilibrium of our minimax game. We then fix the weights of both NNs during testing.
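The alternating procedure can be summarized in a short training skeleton; adversary_loss and utility stand for hypothetical implementations of Eqs. (7) and (12), and the epoch counts, loader format, and handling of ρ are illustrative assumptions rather than the paper's exact training schedule.

```python
# Alternating GAP training: adversary epochs against a frozen privatizer, then
# privatizer epochs against a frozen adversary, repeated until convergence.
import torch

def train_gap(privatizer, adversary, loader, rho,
              rounds=50, adv_epochs=5, priv_epochs=5):
    opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
    opt_p = torch.optim.Adam(privatizer.parameters(), lr=1e-3)
    for _ in range(rounds):
        for _ in range(adv_epochs):                  # adversary step, Eq. (7)
            for x, uid, loc in loader:
                y = privatizer(x).detach()           # privatizer frozen
                loss_a = adversary_loss(adversary(y), uid, loc)
                opt_a.zero_grad(); loss_a.backward(); opt_a.step()
        for _ in range(priv_epochs):                 # privatizer step, Eq. (17)
            for x, uid, loc in loader:
                y = privatizer(x)
                loss_a = adversary_loss(adversary(y), uid, loc)
                loss_p = -rho * utility(x, y) - (1.0 - rho) * loss_a
                opt_p.zero_grad(); loss_p.backward(); opt_p.step()
```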

The GAP privatizer incorporates the privacy and utility metrics in its loss function and trains against an adversary with the same loss function as the one used to evaluate privatizers. While it is advantageous to incorporate specific privacy metrics, for generality we evaluate the GAP privatizer’s performance against other loss functions too, see Section 6.5.

5.4 Information-theoretic Privacy

For this approach, we consider the privacy-utility tradeoff in an analytically tractable fashion under a formal optimization framework.

Considering $X$ and $Y$ as random variables describing our input and obfuscated data respectively, our IT privatizer tries to minimize the mutual information $I(X;Y)$, see Eq. (8), subject to a constraint on utility. The privatizer is specified by the conditional probability distribution $p_{Y|X}(y|x)$, i.e. the probability of releasing $y$ given input data $x$. Without the utility constraint, the optimal solution would be to release $Y$ independent of $X$.

Formally, the problem becomes:

$\min_{p_{Y|X}} \; I(X;Y) \qquad (18a)$
$\text{s.t.} \quad \mathbb{E}[U(X,Y)] \ge U_{\min}, \qquad (18b)$
$\qquad\;\; p_{Y|X}(y|x) \ge 0 \;\; \forall x, y, \qquad (18c)$
$\qquad\;\; \textstyle\sum_{y} p_{Y|X}(y|x) = 1 \;\; \forall x, \qquad (18d)$

where (18b) is a constraint on the composite utility defined in Eq. (12), and constraints (18c) and (18d) ensure that $p_{Y|X}$ is a valid probability distribution.

We approach this constrained minimization problem by rewriting it as a Lagrange function whose optimal point is a global minimum over the domain of the choice variables and a global maximum over the Karush-Kuhn-Tucker (KKT) multipliers KKT. Analyzing the KKT conditions (19a)-(19d), solving for the optimal conditional probability distribution (Eq. (20)), summing both sides (Eq. (21)), and substituting the resulting expression back into Eq. (20), we obtain

$p_{Y|X}(y|x) = \frac{p_Y(y)\, e^{\beta U(x,y)}}{K(x)}, \qquad (22)$

where $\beta$ is the KKT multiplier associated with the utility constraint and $K(x)$ is a normalization term over $y$. From this formal treatment, and reminiscent of our previous work matt; lillymatt, we derive two important characteristics of the optimal solution: (i) $p_{Y|X}(y|x)$ should increase exponentially with utility, and (ii) it should increase linearly with $p_Y(y)$, the probability that $y$ is reported for any input, i.e. we should reuse released datasets to the extent practical.

Given the above qualities of an optimal solution, we design the following heuristic approach. We use the pilot dataset to empirically estimate the data distribution using multivariate Gaussian kernel density estimation. We then sample from this distribution a fixed number of times to create a "codebook" which approximates the sample space of released data. Limiting the codebook size allows us to reuse released datasets, as mentioned above.

The weight of each "code", i.e. each possible released value $y$, is given by

$w(y \mid x) = \frac{e^{\beta U(x,y)}}{\sum_{y'} e^{\beta U(x,y')}}, \qquad (23)$

where $\beta$ is our KKT multiplier and the sum runs over the codebook. Given an input $x$, our information-theoretic privatizer selects a $y$ from the codebook with probability $w(y \mid x)$. This ensures the likelihood of reporting a given $y$ increases exponentially with its utility. As $\beta$ increases, the IT privatizer offers higher utility but lower privacy. By contrast, as $\beta$ approaches zero, the IT privatizer achieves higher privacy at the cost of lower utility.

In our implementation, we use a codebook of size 51. This codebook size was empirically determined to be large enough that one or more codes provide good utility, yet small enough that codes are reused to the extent practical. Note that we bias the codebook by including a copy of the unobfuscated data (i.e. 50 obfuscated codes + 1 unobfuscated code). This ensures at least one code has very high utility even for relatively small codebooks. Also, to reduce computational complexity, we split the measurements into batches and for each batch we select a batch-sized code from the codebook.
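A sketch of this heuristic is given below; composite_utility stands for a hypothetical implementation of Eq. (12), and the per-batch codebook construction is a simplification of the procedure described above rather than the exact implementation.

```python
# Heuristic IT privatizer: sample a codebook from a KDE fit to pilot data,
# keep one unobfuscated copy, and release a code with probability
# proportional to exp(beta * utility), cf. Eq. (23).
import numpy as np
from scipy.stats import gaussian_kde

def it_privatizer(x, pilot, beta, composite_utility, n_codes=50, rng=None):
    """x: (m, d) input batch; pilot: (n, d) pilot data; returns one (m, d) code."""
    rng = rng if rng is not None else np.random.default_rng()
    kde = gaussian_kde(pilot.T)                       # multivariate Gaussian KDE
    codebook = [kde.resample(x.shape[0]).T for _ in range(n_codes)]
    codebook.append(x.copy())                         # bias: one unobfuscated code
    utilities = np.array([composite_utility(x, y) for y in codebook])
    weights = np.exp(beta * (utilities - utilities.max()))   # stable softmax
    probs = weights / weights.sum()
    return codebook[rng.choice(len(codebook), p=probs)]
```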

6 Performance evaluation

In this section we compare the performance of the privatizers against different adversaries, when users upload a single measurement or a batch of measurements, and evaluate where each privatizer sits in the privacy-utility design trade space. All performance comparisons in this section are under the typical threat model (bounded adversary). We use the three real-world traces introduced in Section 3.1 in our evaluation. Unless otherwise stated, the default trace is the Chania dataset.

6.1 Comparison of Privatizers

Figure 6: Privacy and utility of different privatizers: (a) user ID estimate error rate, (b) user location estimate error, (c) distortion, (d) generated map error. Note that Noise, GLDP, GAP, and IT refer to the Gaussian noise-adding, local Gaussian Mechanism DP, GAP, and information-theoretic privatizers, respectively.

Consider the scenario where users upload a single measurement at a time. Fig. 6(a) and Fig. 6(b) show the adversary's user ID estimate error and location estimate error against each privatizer (its privacy), while Fig. 6(c) and Fig. 6(d) show each privatizer's distortion and generated map error (its utility). The x-axis in these and the following plots represents the parameterization of each privatizer: the noise level for the Noise privatizer, the privacy budget ε for GLDP, the loss tradeoff weight for GAP, and λ for IT.

As expected, for the Noise privatizer, as the noise level increases from 0 to 1 the adversary's user ID and location estimate errors increase, demonstrating higher privacy (larger values of both privacy metrics). At the same time, both the distortion and the generated map error increase, demonstrating lower utility (smaller values of both utility metrics). For the GLDP/GAP/IT privatizers, decreasing ε, the GAP tradeoff weight, and λ, respectively, leads to higher privacy (i.e. an increase in the adversary's user ID and location estimate errors) and lower utility (i.e. an increase in distortion and generated map error).
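A sweep of this kind can be produced along the lines of the snippet below, which varies the noise level of a Gaussian noise-adding privatizer and records a distortion-based utility proxy together with the error of a simple nearest-neighbor user-ID adversary as a privacy proxy. The synthetic data, the k-NN adversary, and both proxies are simplified stand-ins for the metrics and adversaries defined earlier in the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in data: (lat, lon, RSS) measurements from 10 users.
n_users, per_user = 10, 200
centers = rng.normal(size=(n_users, 3))
X = np.vstack([c + 0.1 * rng.normal(size=(per_user, 3)) for c in centers])
uid = np.repeat(np.arange(n_users), per_user)

for sigma in [0.0, 0.25, 0.5, 0.75, 1.0]:
    Y = X + sigma * rng.normal(size=X.shape)       # Gaussian noise-adding privatizer
    distortion = np.sqrt(np.mean((Y - X) ** 2))    # utility proxy (lower is better)
    adversary = KNeighborsClassifier(n_neighbors=5).fit(Y, uid)
    id_error = 1.0 - adversary.score(Y, uid)       # privacy proxy (higher is better)
    print(f"noise={sigma:.2f}  distortion={distortion:.3f}  user-ID error={id_error:.3f}")
```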

Among these privatizers, the GLDP privatizer consistently achieves high privacy for typical values of ε. Specifically, against the GLDP privatizer at such ε values, the adversary's user ID estimation error is around 70% and its location estimate error is close to 1. These numbers can be explained as follows. In the absence of any intelligible patterns due to obfuscation, the adversary learns to assume all measurements came from the geographic center of the dataset, thus its error is on the same order as the spread of the input data, i.e. roughly 1. Both the IT and GAP privatizers approach this privacy performance as their respective parameters get close to 0.1 or smaller. As for the user ID estimation error, the user with the most measurements contributes roughly 30% of them, thus a simple adversary assigning this user's ID to all measurements would have a 70% user ID error; hence 70% can be considered an upper bound on this privacy metric.

With respect to utility, the GLDP privatizer offers the worst performance. The GAP privatizer outperforms the others for parameter values in the range [0.0, 0.4] (i.e. the high privacy region), while the IT privatizer achieves the best utility for values in the range [0.4, 1.0] (i.e. the low privacy region). As will become clear in the next couple of sections, a major reason why GLDP has the best privacy and worst utility is that, for the range of ε values considered, it distorts the data to a larger extent than the rest of the approaches. We discuss the differences between the four privatizers in more detail in Section 6.3.

Figure 7: Visualization of obfuscated measurements generated by different privatizers under the same privacy level: (a) without obfuscation, (b) obfuscated by the Noise privatizer, (c) obfuscated by the GAP privatizer, (d) obfuscated by the IT privatizer. Note that each cell in these figures represents a geographical location, and its color represents the average signal strength value in that location; lighter colors represent higher signal strength values.
Privatizer  | None | Noise | GAP  | IT
RMSE (dBm)  | 2.46 | 2.90  | 2.54 | 2.48
Table 5: RMSE (root mean square error) of the RSS prediction model trained with obfuscated measurements, under the same privacy level as in Fig. 7. Note that we use the RSS prediction model trained with non-obfuscated measurements as a baseline (privatizer is None).

Furthermore, we visualize the obfuscated measurements generated by different privatizers as heat maps in Fig. 7. Note that the x-axis and y-axis of each heat map represent longitude and latitude respectively, and the color of each cell represents the signal strength value. Compared with the heat map generated by the Noise privatizer, the heat maps generated by the GAP and IT privatizers are more similar to the heat map without obfuscation under the same privacy level, indicating that the GAP and IT privatizers inject less distortion into the measurement data during obfuscation.

Lastly, we report the root mean square error (RMSE) of the RSS prediction model trained on obfuscated measurements in Table 5. Recall that the goal of the service provider is to train an RSS prediction model based on the obfuscated measurements uploaded by users; the more accurate this prediction model is, the higher the utility the obfuscation scheme provides. As illustrated in Table 5, under the same privacy level, the RSS prediction model trained with measurements obfuscated by the IT privatizer achieves the lowest RMSE, which is close to the RMSE of the model trained with non-obfuscated measurements. The model trained with measurements obfuscated by the Noise privatizer achieves the highest RMSE, indicating that the Noise privatizer provides the worst utility.
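The check in Table 5 can be reproduced in spirit with a few lines of scikit-learn: train a regressor that predicts RSS from location on (possibly obfuscated) data and evaluate it on held-out clean measurements. The random-forest model, the synthetic data, and the crude noise privatizer below are placeholder choices, not the exact prediction model or schemes evaluated in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Synthetic measurements: (lat, lon) locations and RSS in dBm.
loc = rng.uniform(size=(2000, 2))
rss = -70 + 20 * np.sin(3 * loc[:, 0]) * np.cos(3 * loc[:, 1]) + rng.normal(0, 2, 2000)
train, test = slice(0, 1500), slice(1500, 2000)

def map_rmse(obf_loc, obf_rss):
    """Train an RSS predictor on (possibly obfuscated) data, evaluate on clean held-out data."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(obf_loc[train], obf_rss[train])
    pred = model.predict(loc[test])
    return float(np.sqrt(np.mean((rss[test] - pred) ** 2)))   # RMSE in dBm

print("clean     :", round(map_rmse(loc, rss), 2))
noisy_loc = loc + 0.05 * rng.normal(size=loc.shape)     # crude Gaussian noise-adding stand-in
noisy_rss = rss + 3.0 * rng.normal(size=rss.shape)
print("obfuscated:", round(map_rmse(noisy_loc, noisy_rss), 2))
```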

Figure 8: Effect of leveraging measurement sequences on the adversary's user ID estimate accuracy: (a) only the adversary leverages sequences, (b) only the privatizer leverages sequences, (c) both leverage sequences.

6.2 Leveraging Measurement Sequences

To directly investigate the effect of correlations and predictable patterns when considering mobile measurements as a time sequence, we consider an adversary which takes measurement sequences as input, i.e. time sequences of lengths 1, 5, 10, and 20 which belong to a single user, and estimates the (common) user ID of all these measurements, taking advantage of correlations across data of the same user. In practice, the adversary can do this when users upload measurements in batches.

The adversary we consider is trained via supervised learning on the final output of the converged GAP privatizer. The GAP privatizer is a good choice for studying sequences of data, as it can be trained to consider correlations within sequences and to privatize batches of data in one shot.
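As an illustration of how such a sequence-aware adversary can be set up, the sketch below feeds fixed-length windows of (obfuscated) measurements from one user into a small LSTM classifier that outputs a single user-ID prediction per window. The architecture, dimensions, and stand-in data are assumptions for illustration, not the exact adversary used in our evaluation.

```python
import torch
import torch.nn as nn

FEAT_DIM, HIDDEN, NUM_USERS, SEQ_LEN = 3, 32, 20, 10

class SequenceAdversary(nn.Module):
    """Estimates a single (common) user ID from a sequence of obfuscated measurements."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, NUM_USERS)

    def forward(self, seq):                 # seq: (batch, SEQ_LEN, FEAT_DIM)
        _, (h, _) = self.lstm(seq)          # h: (1, batch, HIDDEN), final hidden state
        return self.head(h[-1])             # one user-ID logit vector per sequence

adversary = SequenceAdversary()
opt = torch.optim.Adam(adversary.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

# One training step on random stand-in data (batches of per-user windows).
seq = torch.randn(64, SEQ_LEN, FEAT_DIM)
uid = torch.randint(0, NUM_USERS, (64,))
loss = ce(adversary(seq), uid)
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```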

The results shown in Fig. 8 consider three cases: only the adversary, only the privatizer, and both of them consider sequences of data. Fig. 8(a) shows results when only the adversary considers measurement sequences: the longer the sequence, the better the adversary's performance, as the adversary achieves smaller error for the same data distortion. Fig. 8(b) shows results when only the privatizer considers measurement sequences: the longer the sequence, the better the privatizer's performance, as the privatizer forces the adversary to a higher error for the same data distortion. Thus, sequences of measurements help both the adversary and the privatizer, which is expected in the presence of inter-measurement correlations. That said, the tradeoff in both cases is the additional computational and memory resources required to handle input sequences as opposed to single measurements. Lastly, Fig. 8(c) shows results when both the adversary and the privatizer consider sequences of the same length. We observe that longer sequences result in better privacy, as the adversary's user ID estimate error increases.

6.3 Analysis of Privacy-Utility Trade Space

Figure 9: Privacy-utility tradeoff of different privatizers with composite metrics: (a) the Chania dataset, (b) the UCI dataset, (c) the Radiocell dataset. Note that Noise, GLDP, GAP, and IT refer to the Gaussian noise-adding, local Gaussian Mechanism DP, GAP, and information-theoretic privatizers, respectively.
Figure 10: Privacy-utility tradeoff of different privatizers under the Chania dataset, for four combinations of non-composite privacy and utility metrics (panels (a)-(d)).

Fig. 9 and Fig. 10 illustrate where each privatizer sits in the privacy-utility tradeoff space under real-world datasets with different metrics. Specifically, in Fig. 9 we consider the composite privacy and utility metrics on the Chania, UCI, and Radiocell datasets: the x axis shows the composite privacy defined in Eq. (5) and the y axis shows the composite utility defined in Eq. (12), each with its corresponding weights. We consider such composite metrics because, in practice, a service provider may care about both user ID and location privacy, and about both low distortion and accurate generated maps.
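A composite metric of this form is simply a weighted combination of the individual metrics; the tiny snippet below illustrates the idea with made-up weights and values (the actual combinations are defined by Eqs. (5) and (12)).

```python
# Composite privacy/utility as weighted combinations of the individual metrics.
def composite(metrics, weights):
    return sum(w * m for w, m in zip(weights, metrics))

privacy = composite([0.70, 0.95], [0.5, 0.5])    # e.g. user-ID error and location error
utility = composite([-0.4, -0.3], [0.5, 0.5])    # e.g. negative distortion and map error
print(privacy, utility)
```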

In Fig. 10, we further consider four different combinations of non-composite privacy and utility metrics, and we use the Chania dataset as an example to illustrate how the privacy-utility tradeoff curves of different privatizers change with different non-composite metrics.

In all these plots, the ideal privatizer would sit in the top right corner, implying high privacy and high utility. While the three traces were collected in different countries, areas, and years, the results are qualitatively the same. From both figures we conclude that the GAP and IT privatizers outperform the Noise and GLDP privatizers. It is important to remind the reader that this comparison is under the typical threat model, where the adversary is bounded, whereas the GLDP privatizer is the only one that provides privacy guarantees under the worst-case threat model. As discussed in detail in Section 3.2, we focus on the typical threat model as it is more relevant to our context/application.

A major reason why the GAP and IT privatizers perform well is that they rely on the notion of context, as we have already discussed in Section 3.4. The GAP privatizer gains some insight into the structure of the dataset through data-driven learning. It also tries to minimize the difference between the true and obfuscated data while achieving privacy, as encoded in its loss function. In summary, GAP uses the privacy and utility metrics and the data itself. The IT privatizer gains some insight into the structure of the dataset through Gaussian kernel density estimation. It does well because it releases obfuscated datasets which inherently mirror the true dataset's structure, thanks to its constraint on utility. In summary, IT uses the utility metric and the data distribution. In contrast to GAP and IT, GLDP only needs information about the data to perform clipping without hurting utility too much (in our implementation we used the data directly for this purpose, see Eq. (16)), and Noise only needs the variance of the data to normalize the amount of Gaussian noise that it adds.

Comparing GAP with IT: because GAP tries to prevent an adversary from estimating sensitive features of the input given the obfuscated output, this strategy can be thought of as a data-driven approach to what IT does, i.e. minimizing mutual information. Yet while the IT strategy adds privacy by choosing the released data randomly (with appropriate weights), the GAP privatizer maintains a model of a rational adversary which it intentionally tries to deceive. Training against an adversary with the same loss function as the adversary used to test the privatizers might be perceived as unfair. To address this, in Section 6.5 we test the privatizers against adversaries with different loss functions.

The GLDP privacy-utility curve shown in Fig. 9 includes values of ε up to 100. Note that this is an order of magnitude greater than the ε values shown in Fig. 6(c)/Fig. 6(d), and such high values yield a very loose bound in Eq. (9); we include them to show that the Noise and GLDP privatizers meet when their noise levels are similar. Note also that the largest ε values lie along the asymptotic part of the GLDP curve for each of the Chania, UCI, and Radiocell datasets.

Finally, note that when we train the GAP privatizer and compute the codebook of the IT privatizer to generate the results of Fig. 10, we use the composite privacy and utility metrics to avoid retraining/recomputing them for each case. Interestingly, this doesn’t deteriorate their performance in a visible manner. While real-world engineers could retrain/recompute the GAP/IT privatizers for the specific privacy and utility metrics they care about, in practice this may be cumbersome.

6.4 Constraining Distortion Levels

Previously we have considered privacy and utility as two components of our objective. Suppose instead we wish to maximize privacy subject to a constraint on utility. In Fig. 11 we re-frame previous results to demonstrate how to choose the appropriate parameters to meet a constraint on distortion, which acts as an empirical measure of how different the obfuscated data is from the original data. Fig. 11 can be interpreted as a continuous lookup table: for a given distortion constraint, one can read off the parameter value of each privatizer that meets it. The plot also offers a sense of which range of distortion each approach can achieve over its selected parameter range.
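In practice, such a lookup can be implemented by sweeping each privatizer's parameter, recording the resulting distortion, and interpolating. The sketch below does this for a hypothetical parameter-to-distortion curve; the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical sweep results for one privatizer: parameter values and the
# empirical distortion each produced (replace with measured values).
params     = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
distortion = np.array([0.0, 0.3, 0.7, 1.2, 1.8, 2.5])   # monotone in the parameter

def param_for_distortion(target):
    """Parameter value at which the interpolated distortion equals the target."""
    return float(np.interp(target, distortion, params))

print(param_for_distortion(1.0))   # parameter meeting a distortion constraint of 1.0
```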

Figure 11: Choosing parameters under a constraint on distortion.
Privatizer | Parameter | Utility | Privacy against different adversaries
           |           |         | Baseline | Unobfuscated | Aggregate | Alternative
Noise      |           | -2.5    | 0.13     | 0.30         | 0.80      | 0.13 / 0.14
GAP        |           | -2.5    | 0.95     | 1.54         | 1.33      | 1.05 / 0.88
IT         |           | -2.5    | 0.70     | 1.26         | 1.18      | 0.68 / 0.70
Table 6: Evaluation results against common adversaries. Baseline reports the privacy of a privatizer against the adversary trained with its own obfuscated data. Unobfuscated is trained on unobfuscated data. Aggregate is trained on the aggregate obfuscated data. Alternative is trained using a different loss function (two alternative weight settings).

6.5 Performance against different adversaries

So far we have tested each privatizer against an adversary trained on the obfuscated data generated by the privatizer. We refer to this as the privatizer’s “own” adversary. What is more, the GAP privatizer is explicitly trained to beat its own adversary and it would be informative to investigate its performance against other adversaries.

Motivated by the above, we investigate how the four privatizers perform against the following three adversaries: (i) the “Unobfuscated” adversary, which is trained with the unobfuscated data via supervised learning (rather than the obfuscated data we have used so far); (ii) the “Aggregate” adversary, which has access to all obfuscated data generated by all privatizers and is trained with this aggregated obfuscated data; and (iii) the “Alternative” adversary, which is trained with a loss function different from the one used so far (the latter has also been used for training the GAP adversary inside the iterative GAP loop). Specifically, alternative adversaries use different weights for the terms of the adversary loss function.

Recall that each privatizer has different parameter settings to trade privacy and utility. For a fair comparison, we first set a target utility value and use, for each privatizer, the parameter value that achieves this utility. Table 6 shows the corresponding parameter values for a composite target utility of -2.5. This value is motivated by Table 3 and Fig. 9: the former shows the (negative) utility of a random dataset and the latter shows the entire privacy-utility spectrum considered. (Notice that for GLDP to achieve a -2.5 utility value it would need too large an ε value (100); thus we omit this line from the table.) We report the privacy achieved by each privatizer against its own adversary (Baseline) and against the three adversaries introduced above.

Interestingly, the GAP privatizer outperforms all the other privatizers not only when privatizers are evaluated against their own adversaries (see also Section 6.1) but also against the other adversaries, namely Unobfuscated, Aggregate, and Alternative. That said, the performance gap narrows, which can be explained by the fact that the GAP privatizer is trained against an adversary whose loss function now differs from that of the adversary used to test the privatizers.

As expected, all privatizers achieve the lowest privacy against their own adversary (Baseline), since the latter is trained with the obfuscated data of each privatizer. Also, all privatizers achieve the highest privacy against the Unobfuscated adversary. This is also expected, as the Unobfuscated adversary is trained on unobfuscated data and is thus weaker than the others.

7 Limitations and Future work

Points of interest: The adversary we consider predicts user IDs and all locations from which measurements are collected. However, an adversary may be particularly interested in learning specific users' points of interest (POIs). For instance, the adversary may want to predict a target user's home or work location. We do not consider this in the paper, since users can choose not to collect measurements around POIs as a defense mechanism.

Side information: We assume the adversary has access only to the obfuscated user data shared with the service provider, which does not contain user ID information. A stronger adversary might leverage side information to estimate the user ID of each measurement. For example, the adversary might be able to monitor the network connection between the service provider and mobile users, such that it knows which device each obfuscated measurement comes from and thus the user ID. This adversary could then build a user whereabouts model. Since it is much harder for an adversary to gain access to such information than to merely access database updates, we do not consider this threat model.

Federated learning: Mobile crowdsourcing applications lend themselves to a federated learning implementation kairouz2019advances; li2020federated, which can provide some privacy for mobile users. Recent works show that federated learning could also leak user privacy wang2019beyond; nasr2019comprehensive; lyu2020threats; truong2021privacy; FANG2021102199; bakopoulou2021location. Nevertheless, it would be a reasonable way to use opt-in mobile users to collect training data for the GAP privatizer and to estimate data distributions for the IT privatizer.

Another avenue for future work is to investigate how federated learning can be applied, with additional privacy mechanisms, to achieve privacy-preserving training of an RSS predictor. For instance, one may add noise to local model updates or carefully select the measurements used for local model training epochs, to weaken data reconstruction attacks (see, for example, the DLG attack proposed in zhu2020deep).
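As a rough sketch of the first idea (noising local updates before they leave the device), the snippet below clips a client's model update and adds Gaussian noise before it would be uploaded. The clipping norm and noise scale are illustrative placeholders, and this is not a calibrated differential-privacy mechanism.

```python
import numpy as np

rng = np.random.default_rng(3)

def privatize_update(local_weights, global_weights, clip_norm=1.0, noise_std=0.1):
    """Clip the local model update and add Gaussian noise before uploading it."""
    delta = local_weights - global_weights                  # the client's model update
    norm = np.linalg.norm(delta)
    delta = delta * min(1.0, clip_norm / (norm + 1e-12))    # bound the update's L2 norm
    return delta + rng.normal(0.0, noise_std, size=delta.shape)

# Example: a flattened weight vector of a small RSS-prediction model.
global_w = rng.normal(size=100)
local_w = global_w + 0.05 * rng.normal(size=100)            # after local training epochs
print(np.linalg.norm(privatize_update(local_w, global_w)))
```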

8 Conclusions

In this work, we have systematically examined the privacy-utility tradeoff which exists in crowdsourced mobile network data obfuscation. We have considered four preeminent privatizers employing different obfuscation strategies. To compare them, we have identified several privacy and utility metrics, as well as a number of adversaries under two different threat models suited to crowdsourced mobile network data, and have evaluated the privacy-utility tradeoff of the privatizers on three diverse real-world mobile network datasets. The main takeaway is that under a typical threat model with a bounded adversary, which is of more practical interest in the context of our application, incorporating the structure and intended use of datasets in obfuscation can provide privacy gains without significant utility losses.

Acknowledgment

We would like to thank Athina Markopoulou and Emmanouil Alimpertis from the University of California Irvine for access to the dataset used in this work manos. This material is based upon work supported by the National Science Foundation under grants CNS-1618450 and CNS-1901488.

References