
GeoPointGAN: Synthetic Spatial Data with Local Label Differential Privacy

Synthetic data generation is a fundamental task for many data management and data science applications. Spatial data is of particular interest, and its sensitive nature often leads to privacy concerns. We introduce GeoPointGAN, a novel GAN-based solution for generating synthetic spatial point datasets with high utility and strong individual level privacy guarantees. GeoPointGAN's architecture includes a novel point transformation generator that learns to project randomly generated point co-ordinates into meaningful synthetic co-ordinates that capture both microscopic (e.g., junctions, squares) and macroscopic (e.g., parks, lakes) geographic features. We provide our privacy guarantees through label local differential privacy, which is more practical than traditional local differential privacy. We seamlessly integrate this level of privacy into GeoPointGAN by augmenting the discriminator to the point level and implementing a randomized response-based mechanism that flips the labels associated with the 'real' and 'fake' points used in training. Extensive experiments show that GeoPointGAN significantly outperforms recent solutions, improving by up to 10 times compared to the most competitive baseline. We also evaluate GeoPointGAN using range, hotspot, and facility location queries, which confirm the practical effectiveness of GeoPointGAN for privacy-preserving querying. The results illustrate that a strong level of privacy is achieved with little-to-no adverse utility cost, which we explain through the generalization and regularization effects that are realized by flipping the labels of the data during training.


1. Introduction

Generating synthetic datasets of high utility is a fundamental challenge in many data management tasks, such as private data sharing (e.g., Machanavajjhala2008; Zhang2017; Ge2021; Cunningham2021a; Cunningham2021), benchmark generation (e.g., Gu2015; Ghazal2013), and database management system testing (Bruno2005; Binnig2007; Torlak2012).

A key priority when generating synthetic data is preserving the privacy of sensitive information while maintaining high utility. Differential privacy (DP) and its local variant, LDP, have become the de facto privacy standards for synthetic data generation owing to their mathematically rigorous privacy guarantees, and several studies have tackled the issue of generating private synthetic datasets (e.g., Zhang2021; Huang2019; Cunningham2021). Whereas the centralized setting of DP relies on a trusted aggregator, LDP offers a stronger degree of privacy by allowing users to perturb their data before sharing it with the aggregator.

In this paper, we develop a locally private machine learning-based solution for generating synthetic data from real data, with the aim of faithfully preserving the characteristics of the original data. We focus on spatial point data as the proliferation of mobile technologies and location-based services has made (private) spatial data increasingly valuable to data scientists, companies, and researchers. Spatial datasets are also used in several societally important fields, such as ecology (Velazquez2016), geology (Zuo2009), and epidemiology (Gatrell1996), many of which must preserve the privacy of their data subjects (e.g., protecting animals from illegal poaching, or individuals from being identified through contact tracing apps).

However, when dealing with spatial data, the traditional form of LDP can be unnecessarily restrictive in terms of how it deals with sensitive data, as it normally adopts an “all-or-nothing” approach in which all data needs to be perturbed (Malek2021). This affects the utility of the synthetic data for common location analytics tasks. Label privacy (Chaudhuri2011) provides a more practical means for achieving the necessary privacy protection without sacrificing utility. It is based on the notion that the features of a point are public (and so do not need to be perturbed), whereas the label associated with the data is private (and so does need to be perturbed). Applied to the setting of spatial data, label privacy leads to the following idea: as all location information is public knowledge, it is only a person’s association with a particular location point at a particular time that is private and in need of perturbation. Consequently, we combine the idea of label privacy with LDP and utilize label-LDP to provide sufficient privacy protections when generating synthetic spatial data.

In theory, GANs offer a general purpose solution for the data generation problem due to their objective of learning the optimal functional mapping from some random noise input to a faithful representation of real data (Goodfellow2014). However, most existing GAN architectures for point data stem from computer vision and are designed for capturing simplified continuous shapes and meshes. As such, they are not optimized to handle complex spatial patterns observed in the real world, as our experiments show. Furthermore, they do not support local privacy and are unsuitable for generating large-scale spatial datasets. Our solution, GeoPointGAN, addresses these limitations through several key technical novelties in learning the multi-scale partitioned structures of spatial point data. GeoPointGAN’s architecture uses extended PointNets in both the generator and discriminator and, rather than sampling from a lower-dimensional latent vector, the generator ingests randomly generated pseudo-co-ordinates and learns a transformation to generate realistic outputs. The discriminator provides outputs on the point level, rather than the batch level, which allows for highly localized training and the seamless integration of local privacy mechanisms.

Given that they operate on data with ‘real’ and ‘fake’ labels, GANs present an intuitive setting for implementing label-LDP. GeoPointGAN incorporates label-LDP through a randomized response mechanism that flips the labels provided to the discriminator, thereby providing plausible deniability to each individual’s association with a location. We also outline how, as labels are only flipped once, label-LDP does not suffer from the same vulnerability as traditional LDP, in which locations can be revealed through repeated querying. Beyond its privatization properties, label flipping also has potential generalization and regularization effects on the model performance, which we contextualize with related literature. In most settings, local privacy mechanisms are known to introduce substantial noise, and a large database is typically needed to generate useful output. However, as we demonstrate in our experiments, incorporating label-LDP into our training process has negligible effects on model performance, which indicates that GeoPointGAN is effective at minimizing the impacts of the added noise.

We evaluate GeoPointGAN against three recent GAN-based approaches using four real-world datasets, each of which exhibits different characteristics. The first set of experiments shows that GeoPointGAN significantly outperforms existing GAN-based methods in generating spatial point data, improving by up to 10 times compared to the most competitive baseline. In the second set of experiments, we use the synthetic data generated by GeoPointGAN to answer location analytics queries, namely range, hotspot, and facility location queries. GeoPointGAN performs well in most settings, and the synthetic data yields answers very similar to those obtained from the real data. Interestingly, in some settings, privatized GeoPointGANs perform slightly better than non-private GeoPointGANs, which supports our hypothesis that label flipping can help to realize generalization and regularization benefits.

Our strong results confirm that GeoPointGAN is a robust method for generating practical private synthetic data that can be a worthy substitute for non-private real data for many data science tasks, and it can also be used to accurately answer other queries, such as proximity queries and clustering-based analysis. Finally, generating datasets with privacy guarantees motivates further downstream applications. For example, by harnessing the ubiquity of mobile devices, open data portals can be populated with the private mobility patterns of millions of individuals, with little overhead cost due to the distributed nature of the data collection.

2. Related Work

GANs. GANs have been utilized for a range of data types, including image data (Goodfellow2014), audio streams (Akbari2018), text data (Chen2018), traffic patterns (Zhang2020), and gene expressions (Dizaji2018). In the geospatial domain, GANs have been used for generating digital elevation maps (Klemmer2021a) and global surface temperatures (Klemmer2021d). Existing GAN approaches for continuous spatial point co-ordinates mostly deal with point clouds that are simplified, continuous representations of shapes and surfaces. The first GAN tailored to point clouds (r-GAN) (Achlioptas2018) builds on advances in processing point clouds in neural networks, most notably PointNet (Qi2017). Generating faithful shapes based on point cloud datasets, such as ShapeNet (Chang2015), is an active research challenge (e.g., Li2018; Shu2019; Gal2020). Further studies have utilized GANs for point cloud upsampling (Li2019a), shape completion (Sarmad2019), or against adversarial attacks (Zhou2020). Their applications to real-world data have been limited thus far, with a few recent exceptions such as an application to Lidar data (Caccia2019).

Geospatial datasets have different characteristics, which means their point patterns are very different from the point clouds that describe shapes and meshes. They do not have a mesh-like structure, they are typically more complex and noisy, and may be governed by underlying dynamics such as self-excitement. Few studies have tackled this class of data using GANs: Xiao2017 use Wasserstein GANs to learn temporal (one-dimensional) point processes, while Klemmer2019a learn conditional GANs contextualized by the co-ordinates of continuous spatial point data. However, these works provide no intuition for point transformations or for producing new spatial point patterns similar to the input. The challenging nature of generating spatial point patterns and the lack of existing work addressing this problem help to motivate our work.

Private GANs. In recent years, there have been a number of studies proposing differentially private GANs, including DPGAN (Xie2018), DP-CGAN (Torkzadehmahani2019), PATE-GAN (Yoon2018), and the work of Frigerio2019, which extends DPGAN to continuous, discrete, and time series data. Existing private GANs have focused on other specific domains, such as medical (Yoon2018; Beaulieu-Jones2019; Torfi2020), image (Torkzadehmahani2019), or time series data (Wang2020), as opposed to spatial data. Furthermore, all existing private GANs use centralized DP, normally by clipping and adding noise to the gradient during training (Xie2018; Frigerio2019) or applying existing private frameworks (e.g., Yoon2018; Augenstein2019). This different privacy setting means we cannot compare them to our work.

Location Privacy. Both DP and LDP have increasingly been applied with location-specific variants, such as geoindistinguishability (Andres2013). Most work has focused on data publication with DP (e.g., Mohammed2011; Chen2012; Acs2014; Cormode2012; Xiao2015; Ghane2018; Gursoy2018), as opposed to data synthesis and LDP. Some other work exists on private trajectory synthesis and publication  (e.g., He2015; Gursoy2020; Qu2020), but recently proposed solutions, in both the centralized (e.g., He2015; Gursoy2020) and local settings (Cunningham2021a), all possess a common limitation. They all produce outputs that correspond to arbitrary grid cells or places of interest, whereas we generate co-ordinate data (i.e., the same form as the input data). While one could extend these solutions to generate individual points (e.g., by using uniform sampling), Cunningham2021 show that adapting existing solutions in this way fails to produce high-quality synthetic spatial point data. Their purpose-built solution is designed for the centralized DP setting, which leaves generating high-quality spatial point data in local privacy settings as an important, yet unaddressed, challenge. We note that there has been some work on location data in local settings (i.e., LDP and variants thereof). Chen2016 use personalized LDP for spatial data aggregation, Xiong2019 focus on continuous location sharing using randomized response, and Cunningham2021a publish LDP-compliant sequences of places of interest. However, extensions of these works to our setting are unviable owing to their fundamentally different problem and/or privacy settings.

Label DP. Label-DP was formally introduced by Chaudhuri2011 and has since been the focus of several studies (Wang2019b; Ghazi2021; Esfandiari2021; Yuan2021; Malek2021). All of these works are based on the same premise as our work: only the labels attached to data are sensitive, with the data itself being non-sensitive. However, almost all of this prior work has been in the centralized setting; only BusaFekete2021 consider the local setting. Hence, we are the first to apply label-LDP to GANs to generate privacy-preserving spatial point data with LDP-style guarantees.

3. Method

Figure 1. GeoPointGAN modeling pipeline including the label-LDP mechanism

Before introducing GeoPointGAN, we first present some background on GANs, and discuss the privacy setting we follow in our work.

3.1. GANs

GANs are a family of models that seek to learn the data generating process $p_{data}$ of observed data $x$. Learning is facilitated through two networks: a generator $G$, and a discriminator $D$. The generator $G$ with parameters $\theta_G$ maps a random noise input $z \sim p_z$ to the feature space of the real data so that $\hat{x} = G(z)$. The discriminator $D$, parameterized by $\theta_D$, then attempts to distinguish real data $x$ from synthetic samples $\hat{x}$, so that $D(x), D(\hat{x}) \in [0, 1]$, with 0 and 1 denoting labels of ‘fake’ and ‘real’ data points respectively. The learning process follows from a min-max game between $G$ and $D$, which is given as:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \tag{1}$$

In our case, the input feature vector $x$ represents the spatial co-ordinates of a point in $n$-dimensional space.
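As a rough illustration of the value function in Equation 1, the sketch below estimates both expectations from samples in a toy one-dimensional setting. The functions `toy_generator`, `toy_discriminator`, and `value_estimate`, as well as the Gaussian data and noise, are hypothetical choices made for illustration only; they are not part of GeoPointGAN.

```python
import math
import random

def toy_generator(z, shift):
    """Hypothetical generator: shifts a noise sample by an offset."""
    return z + shift

def toy_discriminator(x, center):
    """Hypothetical discriminator: sigmoid score that is highest near `center`."""
    return 1.0 / (1.0 + math.exp((x - center) ** 2 - 1.0))

def value_estimate(real, noise, shift, center):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(toy_discriminator(x, center)) for x in real) / len(real)
    fake_term = sum(
        math.log(1.0 - toy_discriminator(toy_generator(z, shift), center))
        for z in noise
    ) / len(noise)
    return real_term + fake_term

rng = random.Random(0)
real = [rng.gauss(3.0, 0.5) for _ in range(2000)]   # "real" data centred on 3
noise = [rng.gauss(0.0, 0.5) for _ in range(2000)]  # noise centred on 0

# A generator whose samples stay far from the real data is easy to detect,
# so the discriminator's value is higher than when the generator matches.
v_bad = value_estimate(real, noise, shift=0.0, center=3.0)
v_good = value_estimate(real, noise, shift=3.0, center=3.0)
```

With a fixed discriminator, the generator that moves its samples onto the real data lowers the value, which is the direction the minimization over $G$ pushes in.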

3.2. Privacy Setting

3.2.1. Local Differential Privacy

The centralized model of DP (Dwork2006) offers strong privacy guarantees through a level of plausible deniability but assumes that the data aggregator can be trusted, which may not always be appropriate, especially when highly sensitive data (e.g., location data) is concerned. Therefore, we focus on the local setting as it achieves a stronger level of privacy by allowing users to perturb the data before sharing it with the data aggregator.

Definition 3.1 ($\epsilon$-local differential privacy (Duchi2013)).

A randomized mechanism $\mathcal{M}$ satisfies $\epsilon$-local differential privacy if, for any two inputs $x, x'$ and output $y$:

$$\Pr[\mathcal{M}(x) = y] \le e^{\epsilon} \Pr[\mathcal{M}(x') = y] \tag{2}$$

where $\epsilon$ is the privacy budget that controls the level of privacy protection. The intuition with LDP is that, given the output $y$, an adversary cannot (with high confidence) identify the input value.

One important property of LDP is post-processing. Formally, for any two mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$, if $\mathcal{M}_1$ satisfies $\epsilon$-LDP, then the composition $\mathcal{M}_2 \circ \mathcal{M}_1$ satisfies $\epsilon$-LDP (regardless of whether $\mathcal{M}_2$ satisfies $\epsilon$-LDP itself) (DelRey2020). In a practical sense, the post-processing property allows the output of an LDP mechanism to be used and manipulated indefinitely, without affecting its privacy guarantee, as long as there is no further interaction with the private data.

LDP can be achieved by using randomized response (Warner1965), in which a user reports their true data with probability $p$, and reports an alternative response with probability $1 - p$. LDP attaches strong privacy guarantees to randomized response outputs. To satisfy $\epsilon$-LDP in the generalized randomized response setting, the probability a user reports true information is $p = \frac{e^{\epsilon}}{e^{\epsilon} + k - 1}$, where $k$ is the size of the output set. Hence, the probability of reporting any other single output is $\frac{1}{e^{\epsilon} + k - 1}$.
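The generalized randomized response probabilities can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function names `grr_probabilities` and `grr_report`, the four-item domain, and the value of the privacy budget are assumptions made for the example.

```python
import math
import random

def grr_probabilities(epsilon, k):
    """Return (p_true, p_other) for generalized randomized response over k outputs."""
    denom = math.exp(epsilon) + k - 1
    return math.exp(epsilon) / denom, 1.0 / denom

def grr_report(value, domain, epsilon, rng):
    """Report `value` with probability p_true, else one of the other items uniformly."""
    p_true, _ = grr_probabilities(epsilon, len(domain))
    if rng.random() < p_true:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)

epsilon, domain = 1.0, ["a", "b", "c", "d"]   # hypothetical budget and output set
p_true, p_other = grr_probabilities(epsilon, len(domain))

# The probabilities over the k outputs sum to one, and the worst-case ratio
# between two inputs for a fixed output is exactly e^epsilon.
total = p_true + (len(domain) - 1) * p_other
ratio = p_true / p_other

rng = random.Random(42)
reports = [grr_report("a", domain, epsilon, rng) for _ in range(20000)]
empirical_true_rate = reports.count("a") / len(reports)
```

The ratio `p_true / p_other` is the quantity bounded in the LDP definition, so choosing the probabilities this way meets the bound with equality.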

3.2.2. Label Local Differential Privacy

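The notion of LDP is extended to label-LDP if we consider each feature vector $x$ to have a label $y$, together denoted as $(x, y)$.

Definition 3.2 ($\epsilon$-label-LDP).

A randomized mechanism $\mathcal{M}$ satisfies $\epsilon$-label-LDP if, for any labeled feature vector $(x, y)$ with the input labels $y, y'$ and output label $\tilde{y}$:

$$\Pr[\mathcal{M}(x, y) = \tilde{y}] \le e^{\epsilon} \Pr[\mathcal{M}(x, y') = \tilde{y}] \tag{3}$$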

It is intuitive to observe that label-LDP possesses the same post-processing property as traditional LDP, and that randomized response can be used to ensure label-LDP.

As explained in Section 1, label-LDP provides more practical, yet sufficiently private, protection to data by only perturbing the label attached to a feature vector. This is appropriate for our problem given that we deem information about locations in a region to be sufficiently public with only one’s association with a location being private information. From Definition 3.2, the intuition is that an adversary cannot (with high confidence) identify whether the person was at the reported location or not.

We assume that the dataset of real points covers a sufficiently large proportion of the domain in which fake points can be generated. This is to prevent the labels from being so strongly correlated with the features that privacy is undermined (BusaFekete2021). For example, if the ratio between land area and total area was too small, many fake points could be generated in nonsensical locations (e.g., oceans), which would allow an adversary to identify fake points and, by extension, real points. While this assumption does not affect the mechanism in any way, we impose this soft constraint as an extra layer of protection against privacy leakage.

Finally, although one could naïvely apply randomized response directly to the real data, this limits the dataset size to approximately , which heavily limits the range of analytics tasks for which the private data can be used. Hence, a more flexible solution that can generate datasets of any size is necessary.

3.2.3. Examples

To further illustrate the use of label-LDP in our setting and its advantage over traditional LDP, consider the following two example scenarios. In both examples, although having the original locations with perturbed labels is useful in itself, the samples may not be representative and, in any case, they will be noisy, owing to perturbation. As such, being able to generate datasets of any size based on the original distribution is important, and gives end users more flexibility.

For our first example, consider that a city’s government wants to know the distribution of its residents’ locations at 10am. This might help to determine the approximate proportion of people working from home, which is helpful for managing working conditions or reducing disease spread. Given that the government will know every resident’s address (e.g., through voter rolls or council tax information), this information is non-private. The private element is where each person is at 10am, and residents will want a degree of plausible deniability, which is provided by label-LDP. In comparison, traditional LDP is unnecessarily restrictive here as it would require the perturbation of the location of each resident at 10am, even if their home address is known.

Moreover, repeated use of traditional LDP (through perturbation of the location) would eventually reveal the true location. While continuous data sharing is a common need for many real-life applications, traditional LDP approaches do not address this most practical setting. With label-LDP, by contrast, the location itself is deemed to be non-private, so each daily report can be considered independent, and repeated querying does not degrade the overall privacy level provided to each user. This further motivates label-LDP as a more robust privacy setting in many practical cases.

Our second example differs from the first as the government no longer has a plausible location for each individual for which they want to ask a yes-no question. Imagine a town with 100,000 people, all of whom are asked to privately share their location at 8pm, which is also the time at which a large number of residents are attending a concert. With traditional LDP, the location of each individual would need to be perturbed, which would introduce a large amount of noise into the dataset and greatly affect utility. However, public knowledge (e.g., news sources) would indicate that many people are at the concert and so we should expect a large number of reported locations to be associated with the concert. As such, label-LDP still gives each concert-goer a degree of plausible deniability regarding their presence at the concert, while preserving the overall popularity distribution, which is publicly known (to some extent).

3.3. Model Architecture

Existing GAN architectures for point data, mostly stemming from computer vision research, aim to capture object shape point clouds or mesh-structured data. As such, they are designed to model simple shapes and object outlines. Spatial point patterns, such as location data from mobile devices, typically have a noisy, multi-scale partitioned structure, and may cover the whole observational area, rather than having clear outlines. For example, while the point cloud of a chair can roughly be segmented into six elements (four legs, seat, back), spatial point patterns in the real world can consist of hundreds of intricate macroscopic (e.g., terrain, cities) and microscopic (e.g., roads, junctions) elements. GeoPointGAN includes several novel approaches to address the challenges of generating this data. Its data-processing pipeline and architecture, outlined in Figure 1, consist of three main components, which we explain next.

3.3.1. Generator

GeoPointGAN samples a noise vector $z$ of the same dimensionality as the desired output. This is equivalent to sampling random points with co-ordinates in the same space as a real point $x$. This is in contrast to traditional GAN approaches, which sample a Gaussian noise vector from a lower-dimensional latent space. Rather than learning a model that ‘upsamples’ from a low-dimensional latent space to a higher-dimensional output space, we thus aim to learn a model that transforms data from an $n$-dimensional latent space to a meaningful representation in the same $n$-dimensional space.

We use this noise sampling strategy to design a novel PointNet-based generator. PointNet (Qi2017) was originally devised for classification and segmentation of raw point clouds. While it has been used as the basis for GAN discriminators before (see Achlioptas2018), we propose the first GAN that utilizes PointNet in the generator. In particular, its ability to provide transformation-invariant properties for unordered data is desirable for spatial point generation. This is achieved by running the input through symmetric functions (e.g., max pooling operators) to compute global point set features. This step is followed by a segmentation network that combines the global information (e.g., city boundaries, rivers) with local, point-wise information (e.g., roads, junctions) to learn a combined representation. Lastly, a spatial transformer network (STN) (Jaderberg2015) aligns the learned global and local point set features with the output space. We find that the traditional STN architecture is unable to resolve the complexities of spatial point datasets sufficiently, owing to the shallowness of the neural networks deployed. In particular, while macroscopic structures (e.g., coastlines) can be captured reliably, the STN is incapable of learning small-scale patterns (e.g., minor roads). Consequently, we extend the STN such that we have five one-dimensional convolutional layers and four fully connected layers, with batch normalization between each layer and its ReLU activation (except for the last layer). Altogether, we refer to this altered PointNet as ‘Large PointNet’.

As a final step, our generator takes the transformation-invariant, aligned features of Large PointNet and projects them into the $n$-dimensional output space using four fully connected layers. This is similar to the prediction head for point segmentation, only that we produce $n$ features per point (our synthetic co-ordinates), rather than one. These design choices are informed by extensive testing and are validated in our experiments.
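To make the role of the symmetric function concrete, the sketch below shows that element-wise max pooling over per-point features yields a global feature that does not depend on the order of the input points, and that this global feature can be concatenated with each point's local features. The feature map `pointwise_features` is a hypothetical stand-in for learned shared layers, not the architecture used in GeoPointGAN.

```python
import random

def pointwise_features(point):
    """Hypothetical per-point feature map (stands in for learned shared layers)."""
    x, y = point
    return (x + y, x - y, x * y)

def global_feature(points):
    """Symmetric aggregation: element-wise max over all per-point features."""
    feats = [pointwise_features(p) for p in points]
    return tuple(max(col) for col in zip(*feats))

def combine(points):
    """Concatenate each point's local features with the shared global feature."""
    g = global_feature(points)
    return [pointwise_features(p) + g for p in points]

rng = random.Random(1)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(8)]
shuffled = points[:]
rng.shuffle(shuffled)

# Reordering the input leaves the global feature unchanged, and every point
# still receives the same combined local-plus-global representation.
same_global = global_feature(points) == global_feature(shuffled)
same_per_point = sorted(combine(points)) == sorted(combine(shuffled))
```

Any symmetric aggregation (sum, mean, max) would give the same order invariance; max pooling is the example named in the text.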

3.3.2. Discriminator

Our discriminator architecture is inspired by Achlioptas2018, but comes with some fundamental technical improvements and a critical change that lets us incorporate our label-LDP mechanism. First, we run points through our ‘Large PointNet’ module. This balances the capacity for learning point set representations between the generator and discriminator, allowing for an evenly matched min-max game. The discriminator’s prediction head consists of two fully connected layers and a sigmoid activation. Second, we alter the last fully connected layer to produce predictions at the point level rather than the batch level. That is, the discriminator’s task is to determine whether each individual point is real or fake, as opposed to each batch of points. Constructing the discriminator in this way allows us to seamlessly incorporate a localized privacy mechanism into model training.
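The difference between point-level and batch-level outputs can be illustrated with a small loss computation. The sketch below is an assumption-laden toy: the probabilities are made-up discriminator outputs, binary cross-entropy is used as a generic classification loss, and `batch_level_loss` pools predictions by averaging purely for illustration.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one prediction and one 0/1 label."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def point_level_loss(probs, labels):
    """One prediction and one (possibly perturbed) label per point."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)

def batch_level_loss(probs, batch_label):
    """One pooled prediction and a single label for the whole batch."""
    pooled = sum(probs) / len(probs)
    return bce(pooled, batch_label)

probs = [0.9, 0.8, 0.2, 0.7]   # hypothetical discriminator outputs
labels = [1, 1, 0, 1]          # per-point labels, e.g. after flipping

loss_points = point_level_loss(probs, labels)
loss_batch = batch_level_loss(probs, batch_label=1)
```

Only the point-level form has a slot for an individual label per point, which is what a per-point label perturbation needs.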

3.3.3. Privacy Mechanism

We integrate point-level privacy guarantees into GeoPointGAN by probabilistically flipping the labels of the real and fake points (using randomized response) before showing the data samples to the discriminator. Whereas LDP would require perturbation of each co-ordinate point (using, say, the Laplace or exponential mechanism), with label-LDP, perturbing the label is sufficient. In our setting, we have two (pseudo-)labels, ‘real’ and ‘fake’, which means that $k = 2$ and that flipping can be conducted using biased coin tosses in which the probability that a label’s true status is maintained is $p = \frac{e^{\epsilon}}{e^{\epsilon} + 1}$, and the probability that a label’s status is flipped is $1 - p = \frac{1}{e^{\epsilon} + 1}$. The labels of real points are flipped on users’ devices during data pre-processing, which ensures that the central agent (i.e., the GAN networks) never has access to this information. Should a real point be sampled several times throughout training, it will always have the same (flipped) label. From Definition 3.2, the intuition follows that the discriminator cannot determine (with high probability) that a point with a real label is actually real, or whether it is a fake point masquerading as a real one (or vice versa).
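A minimal sketch of the binary flipping step is given below. It simulates all users in a single process for convenience, whereas the text describes each user flipping their own label on their device; the names `keep_probability` and `flip_labels_once`, the seed, and the chosen budget are assumptions for illustration.

```python
import math
import random

def keep_probability(epsilon):
    """Probability that a binary label is reported unchanged."""
    return math.exp(epsilon) / (math.exp(epsilon) + 1.0)

def flip_labels_once(labels, epsilon, seed=None):
    """Apply binary randomized response to each 0/1 label exactly once."""
    rng = random.Random(seed)
    p_keep = keep_probability(epsilon)
    return [y if rng.random() < p_keep else 1 - y for y in labels]

epsilon = 2.0                                  # hypothetical privacy budget
real_labels = [1] * 10000                      # every real point starts as 'real'
perturbed = flip_labels_once(real_labels, epsilon, seed=7)

# The perturbed labels are stored and reused: sampling a point again during
# training looks up the stored label rather than flipping a second time.
stored = dict(enumerate(perturbed))
first_lookup = stored[123]
second_lookup = stored[123]

kept_fraction = sum(perturbed) / len(perturbed)
```

Storing the perturbed label, instead of re-drawing it on every use, is what keeps repeated sampling of the same point from releasing additional information about its true label.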

In a practical setting, all training is conducted by a central agent (who can be trusted or untrusted) on a remote server. Individual data is collected through mobile devices and each individual is responsible for flipping the label associated with their location. Hence, the only data transferred from this device is the location and the flipped label, which means that no central agent can determine the true label with certainty. Importantly, the discriminator does not know which points are generated by the generator, and which points are transmitted to the server by users; it can only distinguish points based on their perturbed labels. In summary, the discriminator has no way of (definitively) knowing whether any one point is real with a real label, real with a fake label, fake with a real label, or fake with a fake label.

3.3.4. Effects on Training and Generalization

A label flipping approach such as ours does not necessarily reduce model performance, but can even have beneficial effects. In predictive models, randomly flipping labels can act as a regularizer, preventing the model from overfitting and improving generalization (Xie2016). When working with noisy labels, label flipping can incorporate the uncertainty of the labels into the model (Nguyen2019). There is a vast collection of literature that focuses on GAN regularization and robustness, and addresses problems such as limited data availability or generator-discriminator imbalance. Manipulating the (pseudo)-labels of GANs has proven to be a successful strategy to this end. Specifically, adding noise to the labels or applying one-sided label smoothing have been shown to improve GAN training and these are common best practices (Salimans2016). Finally, Jiang2021 provide a study that is closely related to our approach. The authors propose to feed the discriminator with fake data masquerading as real data (i.e., fake data with real labels). While this is proposed mainly as an augmentation strategy for sparse data environments, it is very similar to our label flipping approach, although they do not feed the discriminator real data with fake labels as we do. The authors also provide a theoretical intuition for training convergence and their approach. Their proof highlights how a GAN trained with label flipping augmentation minimizes the Jensen-Shannon divergence between the (smoothed) real and synthetic data distributions and is, in theory, able to perfectly capture the data generating process. Hence, we expect that private GeoPointGANs will (to a certain extent) perform as well as non-private GeoPointGANs (i.e., ones with no label flipping).

3.4. Model Training

Algorithm 1 describes the training of GeoPointGAN. Note that Lines 1-2 are conducted on user devices, although we include the steps here for completeness. Before training, each real point is assigned the ‘real’ label: $y = 1$ (Line 1). These labels are then flipped with probability $1 - p$, where $p$ is controlled by the privacy budget, $\epsilon$ (Line 2). Perturbed labels are denoted as $\tilde{y}$. We then initiate the training loop (Line 3). At each training step, $m$ points are sampled from the real data without replacement (so as to not oversample points from high-density areas) (Line 4). $m$ random co-ordinates $z$ are also drawn from the noise prior $p_z$ (Line 5) and transformed in $G$ to generate fake points: $\hat{x} = G(z)$. The label of each fake point, $\hat{y} = 0$, is flipped with probability $1 - p$ to obtain $\tilde{\hat{y}}$ (Line 6). Real points and fake points are then classified as real or fake by $D$, after which $D$ is updated using the optimizer (Line 7). We then train $G$ by generating new fake points (Line 8), flipping their labels with probability $1 - p$ (Line 9), and once more classifying them using $D$. $G$ is then updated using its optimizer (Line 10). This concludes one training step. $D$ and $G$ continuously play this game, with $G$ getting better and better at generating synthetic data. After $T$ training steps, the label-LDP generator $G$ is published (Line 11).

Input: real points, privacy budget, number of points per step, number of training steps
1: Assign real points real labels
2: Flip real labels with the flip probability to obtain perturbed labels
3: for each training step do
4:    Sample real points with flipped labels
5:    Sample random co-ordinates from noise prior
6:    Flip fake point labels with the flip probability to obtain perturbed labels
7:    Update the discriminator by ascending its stochastic gradient
8:    Sample random co-ordinates from noise prior
9:    Flip fake point labels with the flip probability to obtain perturbed labels
10:   Update the generator by ascending its stochastic gradient
11: return the trained generator
Algorithm 1 GeoPointGAN training
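
The label handling in Lines 1, 2, and 4 of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: it assumes standard binary randomized response, in which a label is flipped with probability 1 / (1 + e^ε), and the names `flip_probability`, `flip_labels`, and the toy data are hypothetical.

```python
import math
import random


def flip_probability(epsilon: float) -> float:
    """Flip probability of binary randomized response (an assumption here).

    Written as e^-eps / (1 + e^-eps), which equals 1 / (1 + e^eps) but does
    not overflow for large eps; eps = inf gives 0 (no flipping).
    """
    t = math.exp(-epsilon)
    return t / (1.0 + t)


def flip_labels(labels, epsilon, rng):
    """Flip each binary label (1 = real, 0 = fake) independently."""
    p = flip_probability(epsilon)
    return [1 - y if rng.random() < p else y for y in labels]


rng = random.Random(0)
real_points = [(rng.random(), rng.random()) for _ in range(1000)]  # toy data

# Lines 1-2: label every real point 'real', then flip once, before training.
real_labels = flip_labels([1] * len(real_points), epsilon=1.0, rng=rng)

# Line 4: sample a batch without replacement; each point keeps its stored label.
batch_idx = rng.sample(range(len(real_points)), k=100)
batch = [(real_points[i], real_labels[i]) for i in batch_idx]
```

Because the perturbed labels are stored alongside the points, a point that is resampled in a later training step reuses the same label. An infinite privacy budget gives a flip probability of zero, which corresponds to the non-private model.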

3.5. Privacy Analysis

We now discuss some important aspects of the mechanism from a privacy perspective. First, we show that the proposed label flipping approach satisfies ε-label-LDP.

Theorem 3.3.

GeoPointGAN satisfies ε-label-LDP.

Proof.

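For ease of understanding, in this proof, we distinguish between points with real labels and points with fake labels. The probability that a real label is reported to the discriminator as real is the probability that it is not flipped. Hence, the probability that a real label is reported to the discriminator as fake is the flip probability. The same two probabilities apply, symmetrically, to fake labels. Substituting these probabilities into Equation 3 bounds their ratio by e^ε, which is the condition for ε-label-LDP.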
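
As a worked illustration, assume that the flip probability is that of standard binary randomized response, p = 1 / (1 + e^ε). This value, and the symbols ℓ (true label), ℓ̃ (reported label), and p, are notation chosen here and may differ from the paper's own. Under that assumption the ratio works out as follows:

```latex
\begin{align*}
\Pr[\tilde{\ell} = \text{real} \mid \ell = \text{real}]
  &= 1 - p = \frac{e^{\varepsilon}}{1 + e^{\varepsilon}}, \\
\Pr[\tilde{\ell} = \text{real} \mid \ell = \text{fake}]
  &= p = \frac{1}{1 + e^{\varepsilon}}, \\
\frac{\Pr[\tilde{\ell} = \text{real} \mid \ell = \text{real}]}
     {\Pr[\tilde{\ell} = \text{real} \mid \ell = \text{fake}]}
  &= \frac{1 - p}{p} = e^{\varepsilon}.
\end{align*}
```

The case of a reported fake label is symmetric, and the inverse ratio, e^{-ε}, is also at most e^ε for any non-negative ε.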

As discussed in Section 3.2, label-LDP has the same post-processing properties as LDP, which means that privatized data can be manipulated freely without affecting the privacy guarantee (as long as the true data is not ‘touched’ again). In our setting, the labels are perturbed by users before they are shown to the discriminator, and the true label is never used again. This means that the entire training procedure operates under post-processing, and our privacy guarantee remains intact throughout training.

Many DP mechanisms suffer when points are repeatedly sampled, as the privacy leakage increases each time a point is sampled. Our mechanism avoids this problem because point labels are flipped once, and once only, before training begins. That is, whenever a point is sampled during any training step, it always has the same perturbed label. This means that there is no additional privacy leakage even if a point is sampled more than once during different training steps. Note that, as we sample without replacement, the same point cannot be sampled multiple times during a single training step.

Figure 2. Sample plots of real and synthetic data; from top to bottom: Porto, Beijing, New York City and 3D Road

4. GAN Evaluation

We evaluate GeoPointGAN in two parts using four real spatial datasets. The first part, presented in this section, evaluates both non-private and private versions of GeoPointGAN using two fundamental point cloud and GAN metrics. We compare GeoPointGAN against three alternative GAN-based approaches. In the second part (Section 5), we evaluate GeoPointGAN’s practical query-based performance using three common location analytics tasks.

4.1. Experiment Set-Up

Data.

All experiments use the following open-source, real-world datasets. They exhibit a range of characteristics (e.g., alignment of the data with the road network, structure of the road network), which allow us to study GeoPointGAN in a variety of contexts.

We use two taxi trajectory datasets from Porto (Porto2015) and Beijing (Yuan2010; Yuan2011). We extract the latitude and longitude co-ordinates from the raw data and remove all points that fall within nonsensical geographic regions (e.g., bodies of water). Although points in the real data are linked, we remove this spatio-temporal linkage and consider each point individually. Note that doing this has no effect on privacy, as each point has its own privacy guarantee. The final Porto dataset contains 79,360 points over an area of 24.7 km², and the Beijing dataset contains 158,260 points across a region with an area of 104.7 km².

We also use 311 call data from New York City (NYC) (NYC2020), filtering the dataset to only include data from Manhattan. The dataset has 163,220 points, each of which represents the location provided by the caller. Unlike the other datasets, the New York data is closely aligned with the road network, which itself is grid-like and ordered.

Our final dataset, ‘3D road’, provides three-dimensional spatial co-ordinates (latitude, longitude, and altitude) of the road network in Jutland, Denmark (Kaul2013). The dataset comprises over 430,000 points, covering an area of 185 × 135 km².

Training Setting. We follow a standardized training process. At each training step, 7,500 points are randomly sampled from the real dataset. We use the Adam optimizer with decoupled weight decay (Loshchilov2019). Starting from its initial value, the learning rate is decreased by a factor of 10 after 5,000, 50,000, and 90,000 training steps. We train 1,000 steps per epoch for a total of 100 epochs (100,000 training steps). All training is conducted on a single RTX 2080 GPU. With this set-up, model training times do not exceed two hours. Overall, GeoPointGAN, like r-GAN, trains reliably and consistently: at no point during any of our training runs do we experience mode collapse or exploding gradients.
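
The step schedule in this training setting can be written as a small function. This is a sketch under assumptions: the function name and the placeholder initial value in the example are hypothetical, and "after N steps" is read as "from step N onward".

```python
def learning_rate(step, initial_lr, milestones=(5_000, 50_000, 90_000), factor=10.0):
    """Return the learning rate in effect at a given (0-indexed) training step.

    The rate is divided by `factor` once for every milestone already reached.
    """
    drops = sum(1 for m in milestones if step >= m)
    return initial_lr / (factor ** drops)


# 1,000 steps per epoch for 100 epochs gives 100,000 training steps in total.
total_steps = 1_000 * 100

# Placeholder initial value for illustration only.
schedule = [learning_rate(s, initial_lr=1.0) for s in (0, 4_999, 5_000, 50_000, 90_000)]
```

In a PyTorch set-up the same behavior is usually obtained from a built-in step scheduler rather than a hand-written function.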

Benchmarks. We compare GeoPointGAN against three state-of-the-art GANs. The first – r-GAN (Achlioptas2018) – is the method that is most closely related to GeoPointGAN and it is designed to operate on raw point clouds (point co-ordinates). The other two baselines, tree-GAN (Shu2019) and PCGAN (Arshad2020), are designed for graph-structured point clouds (e.g., meshes, shapes). All baselines are trained according to the configuration outlined by the original authors.

Although other private GANs have been developed (e.g., DPGAN, PATEGAN), these works all use centralized DP, which is fundamentally different from our label-LDP setting. Similarly, although other private methods for synthetic spatial data generation exist, these methods also use different forms of privacy. For example, Chen2016 use personalized LDP, and Cunningham2021 use centralized DP. As our privacy setting differs from all of these works, a direct comparison with them would not be meaningful.

Evaluation Metrics. To evaluate GeoPointGAN’s ability to preserve the underlying distribution of the real data, we use two common utility measures: Chamfer distance (CD) and earth mover’s distance (EMD). As a continuous and pairwise smooth function, Chamfer distance is a well-established measure of point cloud distance. For a set of real points and a set of synthetic points, CD is defined as:

(4)

where the distance measure is applied between individual real and synthetic points and can be chosen freely. We use normalized Euclidean distance in our evaluation.
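
One common form of the Chamfer distance can be sketched as follows. The sketch assumes a symmetric, mean-normalized variant with plain Euclidean distance on co-ordinates that have already been normalized; Equation 4 may differ in details such as squaring or summation, and the function name is hypothetical.

```python
import math


def chamfer_distance(real, synth):
    """Symmetric Chamfer distance between two non-empty point sets.

    For each point, take the distance to its nearest neighbor in the other
    set; average within each direction and add the two directions.
    This brute-force version is quadratic in the number of points.
    """
    def mean_nearest(a, b):
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)

    return mean_nearest(real, synth) + mean_nearest(synth, real)


identical = chamfer_distance([(0.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (0.0, 0.0)])
shifted = chamfer_distance([(0.0, 0.0)], [(3.0, 4.0)])
```

For samples of 7,500 points, a spatial index such as a k-d tree would normally replace the inner `min` loop.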

EMD, a common metric for evaluating GANs, can be viewed as an optimization problem that seeks to transform one probability distribution into another while minimizing the cost of this operation. While the computational cost of obtaining the exact distance is too high for it to be used in deep learning algorithms (and hence approximations are used), we use its exact version as an evaluation measure. Defining a bijection between the real and synthetic point sets, EMD is defined as:

(5)
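
To make the bijection-based definition concrete, the following sketch computes an exact EMD for very small point sets by trying every bijection. It assumes equal-sized sets, Euclidean distance, and an unnormalized sum of matched distances; Equation 5 may normalize differently, and the function name is hypothetical. Brute force is used only for illustration: at realistic sample sizes an assignment solver (e.g., the Hungarian algorithm) is needed.

```python
import itertools
import math


def emd_exact(real, synth):
    """Exact EMD between two equal-sized point sets via brute-force matching.

    Tries every bijection (permutation) and returns the minimum total
    distance. Only feasible for a handful of points (n! permutations).
    """
    if len(real) != len(synth):
        raise ValueError("a bijection requires equal-sized point sets")
    n = len(real)
    return min(
        sum(math.dist(real[i], synth[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(n))
    )


same_set = emd_exact([(0.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (0.0, 0.0)])
translated = emd_exact([(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)])
```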

4.2. Results

4.2.1. Baseline Comparison

Figure 2 depicts plots of the real data, alongside samples from GeoPointGAN and r-GAN. We do not include visualizations for tree-GAN and PCGAN as they performed poorly. While r-GAN succeeds at capturing macroscopic structures, such as the outline of Manhattan and Central Park, it lacks the capacity to model more intricate structures within the outline, such as individual streets or junctions. On the other hand, GeoPointGAN is able to reproduce microscopic structures to a reasonable extent.

We calculate mean CD and EMD values by taking 60 samples of 7,500 points each from the fully trained generators. These values are shown in Table 1. Note that, here, we use a non-private GeoPointGAN (i.e., one with no label flipping) to ensure a fair comparison with the baselines, which have no privacy mechanism. The tree-based methods perform particularly poorly as they fail to learn the spatial data distribution. This justifies the decision to use generative models that are capable of operating on raw point co-ordinates, rather than shapes or meshes. GeoPointGAN offers substantial improvements of up to 70% over r-GAN, with consistently better CD and EMD values, which reflects its ability to preserve microscopic features and generate accurate spatial point data.

                           Chamfer Distance                Earth Mover’s Distance
Method                     NYC    Por.   Beij.  3DR        NYC    Por.   Beij.  3DR
tree-GAN (Shu2019)         0.649  0.437  0.651  1.247      0.601  1.085  0.733  1.080
PCGAN (Arshad2020)         0.160  0.348  0.092  0.796      0.305  0.831  0.283  0.896
r-GAN (Achlioptas2018)     0.226  0.034  0.032  0.264      0.364  0.275  0.084  0.526
GeoPointGAN                0.014  0.019  0.021  0.074      0.031  0.027  0.032  0.085
Table 1. Mean CD and EMD values
Figure 3. Real and synthetic data for different privacy budgets; data for New York City, zoomed in on Central Park
Figure 4. Chamfer distance across different datasets: (a) Porto, (b) Beijing, (c) New York, (d) 3D Road

4.2.2. Effect of Privatization

To assess how the application of our label-LDP mechanism affects GeoPointGAN performance, with particular focus on the sensitivity of GeoPointGAN to the privacy budget, we again draw 60 samples of 7,500 points from the real and synthetic datasets. We consider seven privacy budgets. As none of the baselines are competitive with GeoPointGAN and they were not designed with privacy-preserving mechanisms in mind, we exclude them from this part of the evaluation. Figure 2 shows the effect of privatization on the generated data. We observe that lower privacy budgets have a negative impact on visual similarity. Figure 3 further illustrates how the preservation of macroscopic geographic features (e.g., Central Park in New York City) is affected by changing the privacy budget.

Figure 4, which shows the variation in CD as the privacy budget varies, highlights more interesting behavior. First, as expected, a very low privacy budget results in poor utility, with utility improving as ε increases. However, beyond a certain privacy budget, the CD values start to increase again, leading to (albeit subtle) U-shaped curves. This supports our hypothesis that label flipping can aid performance: the minimum of each curve marks the degree of label flipping that gives the best utility. In some cases, better utility is gained using private GeoPointGANs (cf. Beijing), which further demonstrates the power of regularization through privatization.

Interestingly, the structure of the underlying data also appears to influence this behavior. In New York, where the data is more closely aligned with a strict grid structure, the U-shape is more pronounced, indicating that the regularization properties of GeoPointGAN are more influential here. Conversely, in Porto, where the true data (and road network) lacks a clear structure, the U-shape is more subtle and very high privacy budgets correct the curve downwards. Finally, for 3D Road, we see large CD values and larger variations in these values, which suggests that the complexity and spatial extent of the dataset push GeoPointGAN to its limits.

5. Data Analytics Tasks

We now evaluate GeoPointGAN using two data analytics tasks, range and hotspot queries, both of which are popular for location analytics. We then apply our solution to the data-driven spatio-temporal task of facility location. We assess the ability of GeoPointGAN to preserve the answers to these queries to demonstrate the applicability of our solution in a database and data science setting. Given the poor qualitative and quantitative performance of the baselines, and their non-private nature, it is not meaningful to compare GeoPointGAN against them for these queries. As such, the aim of this evaluation is to compare GeoPointGAN to the optimum (i.e., maximum similarity with the real data).

Figure 5. Variation in MAE/MPE as the query radius varies: (a)-(c) MAE for Porto, Beijing, and New York; (d)-(f) MPE for Porto, Beijing, and New York

5.1. Range Queries

Range queries are commonly used in databases and as a primitive for location analytics. For example, they can be used to quickly assess how many customers are potentially available to a business, or to measure accessibility to key services, such as schools, hospitals, or vaccination centers, within a certain time. To assess this, we specify a set of 200 arbitrary places in each city to be the basis of our range queries; these places are randomly selected from the set of nodes in each city’s road network. The extent of each range query is the circular region of a given radius centered on one of these places. For each synthetic sample, we answer the set of range queries for all places and all radii, and we quantify the error using the mean absolute error (MAE) and mean percentage error (MPE). Having conducted range queries for all 60 synthetic data samples, we take the mean of the MAE and MPE values. We consider the following radii: {50, 100, 200, 500, 1000} meters.
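
The range-query evaluation can be sketched as follows. Several details are assumptions rather than the paper's stated method: co-ordinates are taken to be planar and in meters, the real and synthetic samples are assumed to be the same size so that raw counts are comparable, MPE is computed as the absolute error relative to the true count (skipping queries whose true count is zero), and all function names are hypothetical.

```python
import math


def range_count(points, center, radius):
    """Number of points inside the circle of the given radius around center."""
    return sum(1 for p in points if math.dist(p, center) <= radius)


def range_query_errors(real, synth, centers, radii):
    """Return (MAE, MPE in percent) over all center/radius combinations."""
    abs_errors, pct_errors = [], []
    for center in centers:
        for radius in radii:
            true_count = range_count(real, center, radius)
            synth_count = range_count(synth, center, radius)
            err = abs(synth_count - true_count)
            abs_errors.append(err)
            if true_count > 0:  # percentage error is undefined for empty ranges
                pct_errors.append(100.0 * err / true_count)
    mae = sum(abs_errors) / len(abs_errors)
    mpe = sum(pct_errors) / len(pct_errors) if pct_errors else float("nan")
    return mae, mpe


real = [(0.0, 0.0), (10.0, 0.0), (100.0, 0.0)]
synth = [(0.0, 0.0), (100.0, 0.0), (100.0, 1.0)]
mae, mpe = range_query_errors(real, synth, centers=[(0.0, 0.0)], radii=[50, 200])
```

Averaging the returned values over the 60 synthetic samples would then give the reported means.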

Figure 5 shows the effect that the query radius has on the MAE/MPE of the query answer. As expected, MAE increases as the radius increases, although MPE decreases; both of these trends are acceptable when considered together. As a particularly impressive outcome, we see that synthetic data generated with higher privacy budgets performs as well as, and sometimes better than, the data generated using the non-private version of GeoPointGAN. Close inspection indicates that the privacy budget at which the privatized GeoPointGAN performs best is concordant with the findings in Section 4.2.2.

5.2. Hotspot Analysis

Hotspot analysis identifies regions with a high number of points and, like range queries, it is fundamental in location analytics. For example, businesses need to identify popular regions for advertising, and city agencies have to manage congestion and traffic flow. We analyze hotspots for the three two-dimensional datasets by generating kernel density estimates (KDEs) for the real and synthetic datasets. The KDE uses the two-dimensional Gaussian kernel defined over a uniform grid whose dimensions are determined by the granularity. We use a range of granularities and define hotspots to be grid cells in which the density is greater than the 95th percentile. For each of the 60 samples, we assess query response similarity between the real and synthetic data using the Sørensen-Dice coefficient (SDC):

(6)

where the two sets being compared are the sets of hotspots in the real and synthetic data.
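
The hotspot comparison can be sketched as follows, using the standard definition of the Sørensen-Dice coefficient (twice the size of the intersection divided by the sum of the set sizes). The sketch assumes the densities have already been evaluated on the grid (the KDE step is not shown) and uses a nearest-rank percentile; both choices, and the function names, are illustrative assumptions.

```python
def hotspots(density, percentile=95):
    """Return the grid cells whose density exceeds the given percentile.

    `density` maps a grid cell, e.g. (row, col), to its estimated density.
    The threshold is a nearest-rank percentile of all cell densities.
    """
    values = sorted(density.values())
    rank = max(1, -(-percentile * len(values) // 100))  # ceiling division
    threshold = values[rank - 1]
    return {cell for cell, value in density.items() if value > threshold}


def sdc(a, b):
    """Sørensen-Dice coefficient between two sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Toy 1 x 100 grid: the real density peaks on the right, the synthetic one
# has its two highest cells displaced to the left.
real_density = {(0, i): float(i) for i in range(1, 101)}
synth_density = dict(real_density)
synth_density[(0, 1)], synth_density[(0, 100)] = 100.0, 1.0
synth_density[(0, 2)], synth_density[(0, 99)] = 99.0, 2.0

real_hot = hotspots(real_density)
synth_hot = hotspots(synth_density)
similarity = sdc(real_hot, synth_hot)
```

An SDC of 1 means the two hotspot sets coincide, and 0 means they share no cells.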

Figure 6. Variation in SDC as the hotspot granularity varies: (a) Porto, (b) Beijing, (c) New York

Figure 6 shows the variation in mean SDC values as the hotspot granularity increases and, once again, we observe similar findings. Namely, (a) poor performance is observed at very low privacy budgets, while other privacy budgets are competitive with the non-private GeoPointGAN; (b) private GeoPointGANs sometimes outperform the non-private version; and (c) private GeoPointGANs perform best with a middling privacy budget, though the exact value depends on the city.

5.3. Facility Location Queries

Facility location is an example (of which there are many) of an end application for our work. Facility location queries are more complex as they are a combination of range and hotspot queries. We consider two variants: Max-Inf and Min-Dist. In the Max-Inf case, we aim to select the most influential candidate facilities, where influence is commonly defined as the total number of customers that the facilities attract. The Min-Dist query instead selects facilities that minimize the total distance between customers and the facilities. For both queries, individual location data is needed to accurately model the behaviors of potential customers, which motivates the use of models like GeoPointGAN.

Outline. Imagine a hot dog salesperson who wishes to locate outlets throughout Manhattan. Since more business could potentially be generated if her outlets were located at the intersections of busy streets, we use the same location set as used for the range queries (as this was a selection of road intersections). Each location in this set represents a candidate facility; we consider 100 facilities and assume that there are no existing facilities. We then compare the sets of facilities selected when the real and synthetic data are used, quantifying their similarity with the SDC (Equation 6), with the sets of selected facilities in place of the sets of hotspots.

Results. Figure 7 shows the variation in SDC values for the Max-Inf query. GeoPointGAN produces synthetic data that answers queries with high accuracy, and the non-private GeoPointGAN performs exceptionally well in most cases, especially in Porto, for which it obtains near-optimal results. The effect of changing the privacy budget is also noticeable, and we continue to see the phenomenon of privatized versions of GeoPointGAN performing as well as non-private versions. We conduct the same analysis for the Min-Dist query and similarly strong results are obtained across all settings. We omit the corresponding plots due to space limitations.

These strong results demonstrate the practical benefit of our approach and illustrate that facility location can tolerate the noise that is inherent in GAN-based sampling and that is required for label-LDP. This robustness can also be exploited for other data science tasks, such as nearest neighbor queries and clustering.

Figure 7. Variation in SDC for the Max-Inf query: (a) Porto, (b) Beijing, (c) New York

6. Final Remarks

As demonstrated through our experiments, GeoPointGAN is a robust method for generating large synthetic spatial datasets with practical levels of privacy and high utility, both statistically and with respect to several location analytics tasks. Indeed, in some settings, private GeoPointGANs perform better than non-private versions – a remarkable and important observation. This phenomenon is possible due to the design of our privacy mechanism, which exploits the inherent noise in label flipping to harness the regularization effects that can be realized when training GANs.

Beyond these findings, our work provides further insights about the level of privacy provided. While GeoPointGAN demonstrably fulfills the requirements of label-LDP, the actual level of privacy provided is higher in many practical settings. As Malek2021 note, simply removing the sensitive labels of a public dataset and training a model in an unsupervised fashion complies with label-(L)DP. We take this further by accounting for situations where, even if we remove the identifier (e.g., taxi ID, 311 caller name), we are still conscious of leaking information based on knowledge regarding the veracity of each point (e.g., if locations can help to identify the caller). Specifically, we use pseudo-labels (to denote whether points are real or fake) to obfuscate the training data by removing certainty within the model as to what input represents real data. In this sense, the level of privacy provided is stronger than what is necessarily required with label-LDP, although quantifying this achieved level of privacy is non-trivial. While this level of privacy may not necessarily be strong enough to satisfy LDP in theory, it may satisfy LDP to some extent in practice. We leave further exploration of these ideas and challenges for future work.

Acknowledgments. This work is supported in part by the UK Engineering and Physical Sciences Research Council.

References