The right to privacy is enshrined in the Universal Declaration of Human Rights. However, as artificial intelligence permeates more and more of our daily lives, data sharing is increasingly locking horns with data privacy concerns. Differential privacy (DP), a probabilistic mechanism that provides an information-theoretic privacy guarantee, has emerged as a de facto standard for implementing privacy in data sharing. For instance, DP has been adopted by several tech companies and will also be used in connection with the release of the Census 2020 data [3, 2].
(i) Utility guarantees are usually provided only for a fixed set of queries. This means that either DP has to be used in an interactive scenario or the queries have to be specified in advance.
(ii) There are no utility guarantees for more complex, but very common, machine learning tasks such as clustering or classification.
(iii) DP can suffer from a poor privacy-utility tradeoff, leading either to insufficient privacy protection or to data sets of rather low utility, thereby making DP of limited use in many applications.
Another approach to enable privacy in data sharing is based on the concept of synthetic data. The goal of synthetic data is to create a dataset that maintains the statistical properties of the original data while not exposing sensitive information. The combination of differential privacy with synthetic data has been suggested as a best-of-both-worlds solution [24, 9, 31, 35, 13]. While combining DP with synthetic data can indeed provide more flexibility and thereby partially address some of the issues in (i), in and of itself it is not a panacea for the aforementioned problems.
One possibility to construct differentially private synthetic datasets that are not tailored to a priori specified queries is to simply add independent Laplacian noise to each data point. However, the amount of noise that has to be added to achieve sufficient DP is too large to maintain satisfactory utility even for basic counting queries, not to mention more sophisticated machine learning tasks.
This raises the fundamental question of whether it is even possible to construct, in a numerically efficient manner, differentially private synthetic data that come with rigorous utility guarantees for a wide range of (possibly complex) queries, while achieving a favorable privacy-utility tradeoff. In this paper we answer this question in the affirmative.
1.2. A private measure
A main objective of this paper is to construct a private measure on a given metric space $(T,\rho)$. Namely, we design an algorithm that transforms a probability measure $\mu$ on $T$ into another probability measure $\nu$ on $T$, such that this transformation is both private and accurate.
For clarity, let us first consider the special case of empirical measures, where our goal can be understood as creating differentially private synthetic data. Specifically, we are looking for a computationally tractable algorithm that transforms true input data $X=(X_1,\ldots,X_n)\in T^n$ into synthetic output data $Y=(Y_1,\ldots,Y_m)\in T^m$ for some $m$, which is $\varepsilon$-differentially private (see Definition 2.1), and such that the empirical measures
$$\mu_X=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}\quad\text{and}\quad\mu_Y=\frac{1}{m}\sum_{i=1}^{m}\delta_{Y_i}$$
are close to each other in the Wasserstein 1-metric (recalled in Section 2.2.2):
$$\mathbb{E}\,W_1(\mu_X,\mu_Y)\le\gamma,\tag{1.1}$$
where $\gamma>0$ is as small as possible.
The main result of this paper is a computationally effective private algorithm whose accuracy is expressed in terms of the multiscale geometry of the metric space $(T,\rho)$. A consequence of this result, Theorem 9.6, states that if the metric space has Minkowski dimension $d\ge 1$, then, ignoring the dependence on $\varepsilon$ and lower-order terms in the exponent, we have
$$\mathbb{E}\,W_1(\mu_X,\mu_Y)\sim n^{-1/d}.\tag{1.2}$$
The dependence on $n$ is optimal and quite intuitive. Indeed, if the true data $X$ consist of $n$ i.i.d. random points chosen uniformly from the unit cube $[0,1]^d$, then the average spacing between these points is of the order $n^{-1/d}$. So our result shows that privacy can be achieved by a microscopic perturbation, one whose magnitude is roughly the same as the average spacing between the points.
Our more general result, Theorem 7.2, holds for arbitrary compact metric spaces and, more importantly, for general input measures (not just empirical ones). To be able to work in such generality, we employ the notion of metric privacy which reduces to differential privacy when we specialize to empirical measures (Section 2.1).
1.3. Uniform accuracy over Lipschitz statistics
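The choice of the Wasserstein 1-metric to quantify accuracy ensures that all Lipschitz statistics are preserved uniformly. Indeed, by the Kantorovich-Rubinstein duality theorem, (1.1) yields
$$\mathbb{E}\sup_{f}\Big|\frac{1}{n}\sum_{i=1}^{n}f(X_i)-\frac{1}{m}\sum_{i=1}^{m}f(Y_i)\Big|\le\gamma,\tag{1.3}$$
where the supremum is over all $1$-Lipschitz functions $f:T\to\mathbb{R}$.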
Standard private synthetic data generation methods that come with rigorous accuracy guarantees do so with respect to a predefined set of linear queries, such as low-dimensional marginals, see e.g. [8, 44, 22, 13]. While this may suffice in some cases, there is no assurance that the synthetic data behave in the same way as the original data under more complex, but frequently employed, machine learning techniques. For instance, if we want to apply a clustering method to the synthetic data, we cannot be sure that the results we get are close to those for the true data. This can drastically limit effective and reliable analysis of synthetic data.
In contrast, since the synthetic data constructed via our proposed method satisfy a uniform bound (1.3), this provides data analysts with a vastly increased toolbox of machine learning methods for which one can expect outcomes that are similar for the original data and the synthetic data.
As concrete examples let us look at two of the most common tasks in machine learning, namely clustering and classification. While not every clustering method will satisfy a Lipschitz property, there do exist Lipschitz clustering functions that achieve state-of-the-art results, see e.g. [32, 55]. Similarly, there is distinct interest in Lipschitz function based classifiers, since they are more robust and less susceptible to adversarial attacks. This includes conventional classification methods such as support vector machines as well as classifiers based on Lipschitz neural networks [50, 10]. These are just a few examples of complex machine learning tools that can be reliably applied to the synthetic data constructed via our private measure algorithm. Moreover, since our results hold for general compact metric spaces, this paves the way for creating private synthetic data for a wide range of data types. We will present a detailed algorithmic and numerical investigation of the proposed method in a forthcoming paper.
1.4. A superregular random walk
The most popular way to achieve privacy is by adding random noise, typically an appropriate amount of either Laplacian or Gaussian noise (these methods are aptly referred to as the Laplacian mechanism and the Gaussian mechanism, respectively). We, too, can try to make a probability measure $\mu$ on $T$ private by discretizing $T$ (replacing it with a finite set of points) and then adding random noise to the weights of the points. Going this route, however, yields suboptimal results. For example, it is not difficult to check that if $T$ is the interval $[0,1]$, the accuracy of the Laplacian mechanism cannot be better than $n^{-1/2}$, which is suboptimal compared to the optimal accuracy $n^{-1}$ in (1.2).
This loss of accuracy is caused by the accumulation of additive noise. Indeed, adding $n$ independent random variables of unit variance produces noise of the order $n^{1/2}$. This prompts a basic probabilistic question: can we construct random variables that are “close” to being independent, but whose partial sums cancel more perfectly than those of independent random variables? We answer this question affirmatively in Theorem 3.1, where we construct random variables $Z_1,\ldots,Z_n$ whose joint distribution is as regular as that of i.i.d. Laplacian random variables, yet whose partial sums grow logarithmically as opposed to $n^{1/2}$:
$$\max_{1\le k\le n}\mathbb{E}\,(Z_1+\cdots+Z_k)^2=O(\log^3 n).$$
One can think of this as a random walk that is locally similar to the one with i.i.d. steps, but globally is much more bounded. Our construction is a nontrivial modification of Lévy’s construction of Brownian motion. It may be interesting and useful beyond applications to privacy.
1.5. Comparison to existing work
The numerically efficient construction of accurate differentially private synthetic data is highly non-trivial. As a case in point, Ullman and Vadhan showed (under standard cryptographic assumptions) that in general it is NP-hard to make private synthetic Boolean data which approximately preserve all two-dimensional marginals. There exists a substantial body of work on generating privacy-preserving synthetic data, cf. e.g. [4, 15, 1, 17, 36], but, unlike our work, without any rigorous privacy or accuracy guarantees. Those papers on synthetic data that do provide rigorous guarantees are limited to accuracy bounds for a finite set of a priori specified queries, see for example [8, 12, 44, 22, 13, 14]; see also the tutorial. As discussed before, this may suffice for specific purposes, but in general severely limits the impact and usefulness of synthetic data. In contrast, the present work provides accuracy guarantees for a wide range of machine learning techniques. Furthermore, our results hold for general compact metric spaces, as we establish metric privacy instead of just differential privacy.
A special example of the topic investigated in this paper is the publication of differentially private histograms, which is a well studied problem in the privacy literature, see e.g. [27, 40, 37, 54, 53, 38, 56, 2] and Chapter 4 in . In the specific context of histograms, the Haar function based approach to construct a superregular random walk proposed in our paper is related to the wavelet-based method  and to other hierarchical histogram partitioning methods [27, 40, 56]. Like our approach, [27, 53] obtain consistency of counting queries across the hierarchical levels, owing to the specific way that noise is added. Also, the accuracy bounds obtained in [27, 53] are similar to ours, as they are also polylogarithmic (although we are able to obtain a smaller exponent). There are, however, several key differences. While our approach gives a convenient way to generate accurate and differentially private synthetic data from true data , the methods of the aforementioned papers are not suited to create synthetic data. Instead, these methods release answers to queries. Moreover, accuracy is proven for just a single given range query and not simultaneously for all queries like we do. This limitation makes it impossible to create accurate synthetic data with the algorithms in [27, 53]. Moreover, unlike the aforementioned papers, our work allows the data to be quite general, since we prove metric privacy and not just differential privacy. Furthermore, our results apply to multi-dimensional data, and are not limited to the one-dimensional setting.
There exist several papers on the private estimation of density and other statistical quantities[28, 19], and sampling from distributions in a private manner is the topic of . While definitely interesting, that line of work is not concerned with synthetic data, and thus there is little overlap with this work.
1.6. The architecture of the paper
The remainder of this paper is organized as follows. We introduce some background material and notation in Section 2, such as the concept of metric privacy, which generalizes differential privacy. In Section 3 we construct a superregular random walk (Theorem 3.1). We analyze metric privacy in more detail in Section 4, where we also provide a link from the general private measure problem to private synthetic data (Lemma 4.1). In Section 5 we use the superregular random walk to construct a private measure on the interval $[0,1]$ (Theorem 5.4). In Section 6 we use a link between the Traveling Salesman Problem and minimum spanning trees to devise a folding technique, which we apply in Section 7 to “fold” the interval into a space-filling curve to construct a private measure on a general metric space (Theorem 7.2). Postprocessing the private measure with quantization and splitting, we then generate private synthetic data in a general metric space (Corollary 7.4). In Section 8 we turn to lower bounds for private measures (Theorem 8.5) and synthetic data (Theorem 8.6) on a general metric space. We do this by employing a technique of Hardt and Talwar, presented in Proposition 8.1, which identifies general limitations for synthetic data. In Section 9 we illustrate our general results on a specific example of a metric space: the Boolean cube. We construct a private measure (Corollary 9.1) and private synthetic data (Corollary 9.2) on the cube, and show near optimality of these results in Corollary 9.3 and Corollary 9.4, respectively. Results similar to the ones for the $d$-dimensional cube hold for arbitrary metric spaces of Minkowski dimension $d$. For any such space, we prove asymptotically sharp min-max results for private measures (Theorem 9.5) and synthetic data (Theorem 9.6).
2. Background and Notation
The motivation behind the concept of differential privacy is the desire to protect an individual’s data, while publishing aggregate information about the database. Adding or removing the data of one individual should have a negligible effect on the query outcome, as formalized in the following definition.
Definition 2.1 (Differential Privacy).
A randomized algorithm $\mathcal{M}$ gives $\varepsilon$-differential privacy if for any input databases $D$ and $D'$ differing on at most one element, and any measurable subset $S\subseteq\mathrm{range}(\mathcal{M})$, we have
$$\mathbb{P}\{\mathcal{M}(D)\in S\}\le e^{\varepsilon}\,\mathbb{P}\{\mathcal{M}(D')\in S\},$$
where the probability is with respect to the randomness of $\mathcal{M}$.
2.1. Defining metric privacy
While differential privacy is a concept of the discrete world (where datasets can differ in a single element), it is often desirable to have more freedom in the choice of input data. The following general notion (which seems to be known under slightly different, and somewhat less general, versions, see e.g.  and the references therein) extends the classical concept of differential privacy.
Definition 2.2 (Metric privacy).
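Let $(T,\rho)$ be a compact metric space and $E$ be a measurable space. A randomized algorithm $\mathcal{A}:T\to E$ is called $\alpha$-metrically private if, for any inputs $x,x'\in T$ and any measurable subset $S\subseteq E$, we have
$$\mathbb{P}\{\mathcal{A}(x)\in S\}\le\exp\big(\alpha\,\rho(x,x')\big)\,\mathbb{P}\{\mathcal{A}(x')\in S\}.\tag{2.1}$$
To see how metric privacy encompasses differential privacy, consider a product space $T=\mathcal{X}^n$ and equip it with the Hamming distance
$$\rho_H(x,x')=\big|\{i\in\{1,\ldots,n\}:\;x_i\ne x'_i\}\big|.\tag{2.2}$$
The $\varepsilon$-differential privacy of an algorithm $\mathcal{A}:\mathcal{X}^n\to E$ can be expressed as
$$\mathbb{P}\{\mathcal{A}(x)\in S\}\le e^{\varepsilon}\,\mathbb{P}\{\mathcal{A}(x')\in S\}\quad\text{whenever }\rho_H(x,x')\le 1.\tag{2.3}$$
Note that (2.3) is equivalent to (2.1) for $\alpha=\varepsilon$. Obviously, (2.1) implies (2.3). The converse implication can be proved by replacing the coordinates of $x$ one at a time by the corresponding coordinates of $x'$, applying (2.3) $\rho_H(x,x')$ times, and telescoping. Let us summarize: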
Lemma 2.3 (MP vs. DP).
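Let $\mathcal{X}$ be an arbitrary set. Then an algorithm $\mathcal{A}:\mathcal{X}^n\to E$ is $\varepsilon$-differentially private if and only if $\mathcal{A}$ is $\varepsilon$-metrically private with respect to the Hamming distance (2.2) on $\mathcal{X}^n$.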
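As an illustration of Lemma 2.3, the following minimal sketch checks numerically that a Laplace mechanism for a counting query satisfies the metric privacy bound with respect to the Hamming distance. The counting query, the helper names and the evaluation grid are assumptions made for this example; they are not part of the construction in this paper.

```python
import math

def hamming(x, y):
    """Hamming distance (2.2): number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def laplace_log_density(t, center, scale):
    """Log-density at t of the Laplace distribution with the given center and scale."""
    return -math.log(2.0 * scale) - abs(t - center) / scale

def count_ones(x):
    """Hypothetical query: number of entries equal to 1. Editing one entry changes it by at most 1."""
    return sum(1 for a in x if a == 1)

def max_log_ratio(x, y, eps, grid):
    """Largest log-ratio of the output densities of count_ones(.) + Lap(1/eps) over a grid."""
    scale = 1.0 / eps
    return max(
        laplace_log_density(t, count_ones(x), scale)
        - laplace_log_density(t, count_ones(y), scale)
        for t in grid
    )

eps = 0.5
x = (1, 0, 1, 1, 0, 0)
y = (1, 1, 0, 0, 0, 0)          # differs from x in three coordinates
grid = [k / 10.0 for k in range(-100, 161)]
worst = max_log_ratio(x, y, eps, grid)

# Metric privacy w.r.t. the Hamming distance: log-ratio <= eps * rho_H(x, y).
print(worst, "<=", eps * hamming(x, y))
```

The bound is not tight here because the counts of `x` and `y` differ by less than their Hamming distance; the metric privacy inequality only requires an upper bound.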
Unlike differential privacy, metric privacy goes beyond product spaces, and thus allows the data to be quite general. In this paper, for example, the input data are probability measures. Moreover, metric privacy does away with the assumption that the data sets be different in a single element. This assumption is sometimes too restrictive: general measures, for example, do not break down into natural single elements.
2.2. Distances between measures
This paper will use three classical notions of distance between measures.
2.2.1. Total variation
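The total variation (TV) norm [20, Section III.1] of a signed measure $\mu$ on a measurable space $(\Omega,\mathcal{F})$ is defined as (the factor $1/2$ is chosen for convenience)
$$\|\mu\|_{TV}=\frac{1}{2}\sup\sum_{i}|\mu(A_i)|,$$
where the supremum is over all partitions of $\Omega$ into countably many parts $A_i\in\mathcal{F}$. If $\Omega$ is countable, we have
$$\|\mu\|_{TV}=\frac{1}{2}\sum_{\omega\in\Omega}|\mu(\{\omega\})|.$$
The TV distance between two probability measures $\mu$ and $\nu$ is defined as the TV norm of the signed measure $\mu-\nu$. Equivalently,
$$\|\mu-\nu\|_{TV}=\sup_{A\in\mathcal{F}}|\mu(A)-\nu(A)|.$$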
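For finitely supported measures, the two descriptions of the TV distance can be compared directly. The sketch below is illustrative only: it represents measures as dictionaries (an assumption of this example) and evaluates the supremum over sets by brute force, which is feasible only for very small supports.

```python
from itertools import combinations

def tv_half_l1(mu, nu):
    """TV distance as half the l1 distance between the weight vectors."""
    keys = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(k, 0.0) - nu.get(k, 0.0)) for k in keys)

def tv_sup_over_sets(mu, nu):
    """TV distance as the supremum of |mu(A) - nu(A)| over all subsets A (brute force)."""
    keys = sorted(set(mu) | set(nu))
    best = 0.0
    for r in range(len(keys) + 1):
        for subset in combinations(keys, r):
            diff = sum(mu.get(k, 0.0) - nu.get(k, 0.0) for k in subset)
            best = max(best, abs(diff))
    return best

# Two hypothetical probability measures on the points a, b, c, d.
mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.2, "b": 0.3, "d": 0.5}
print(tv_half_l1(mu, nu), tv_sup_over_sets(mu, nu))
```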
2.2.2. Wasserstein distance
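Let $(T,\rho)$ be a bounded metric space. We define the Wasserstein 1-distance (henceforth simply referred to as Wasserstein distance) between probability measures $\mu$ and $\nu$ on $T$ as
$$W_1(\mu,\nu)=\inf_{\gamma}\int_{T\times T}\rho(x,y)\,d\gamma(x,y),$$
where the infimum is over all couplings $\gamma$ of $\mu$ and $\nu$, i.e. probability measures on $T\times T$ whose marginals on the first and second coordinates are $\mu$ and $\nu$, respectively. In other words, $W_1(\mu,\nu)$ is the minimal transportation cost between the “piles of earth” $\mu$ and $\nu$.
The Kantorovich-Rubinstein duality theorem gives an equivalent representation:
$$W_1(\mu,\nu)=\sup_{f}\Big(\int f\,d\mu-\int f\,d\nu\Big),$$
where the supremum is over all continuous, $1$-Lipschitz functions $f:T\to\mathbb{R}$.
For probability measures $\mu$ and $\nu$ on $\mathbb{R}$, the Wasserstein distance has the following representation, according to Vallender:
$$W_1(\mu,\nu)=\|F_\mu-F_\nu\|_{L^1(\mathbb{R})}.\tag{2.7}$$
Here $F_\mu(x)=\mu\big((-\infty,x]\big)$ is the cumulative distribution function of $\mu$, and similarly for $F_\nu$.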
Vallender’s identity (2.7) can be used to define Wasserstein distance for signed measures on . Moreover, for signed measures on , the Wasserstein distance defined this way is always finite, and it defines a pseudometric.
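Vallender's identity makes the Wasserstein distance on the line easy to compute for finitely supported measures: the difference of the two cumulative distribution functions is piecewise constant between consecutive support points. The following sketch assumes measures given as arrays of points and weights (a representation chosen for this example) and also checks the Kantorovich-Rubinstein bound on one hypothetical $1$-Lipschitz statistic.

```python
import numpy as np

def wasserstein1_on_line(x, p, y, q):
    """W1 between sum_i p_i delta_{x_i} and sum_j q_j delta_{y_j} via the L1 norm of F_mu - F_nu."""
    points = np.concatenate([np.asarray(x, float), np.asarray(y, float)])
    signed_mass = np.concatenate([np.asarray(p, float), -np.asarray(q, float)])
    order = np.argsort(points, kind="stable")
    points, signed_mass = points[order], signed_mass[order]
    # F_mu - F_nu is constant on each gap between consecutive sorted points.
    cdf_gap = np.cumsum(signed_mass)[:-1]
    return float(np.sum(np.abs(cdf_gap) * np.diff(points)))

# mu is uniform on {0, 1}; nu is the point mass at 0.5.
x, p = [0.0, 1.0], [0.5, 0.5]
y, q = [0.5], [1.0]
w = wasserstein1_on_line(x, p, y, q)

# A hypothetical 1-Lipschitz statistic f(t) = |t - 0.3|.
f = lambda t: np.abs(np.asarray(t, float) - 0.3)
lipschitz_gap = abs(float(np.dot(p, f(x)) - np.dot(q, f(y))))
print(w, lipschitz_gap)
```

By the duality theorem, the gap in any $1$-Lipschitz statistic can never exceed the Wasserstein distance, which is what the last two lines compare.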
3. A superregular random walk
The classical random walk with $n$ independent steps of unit variance is not bounded: it deviates from the origin at the expected rate $n^{1/2}$. Surprisingly, there exists a random walk whose joint distribution of steps is as regular as that of independent Laplacians, yet that deviates from the origin logarithmically slowly.
Theorem 3.1 (A superregular random walk).
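For every $n\in\mathbb{N}$, there exists a probability density of the form $f(z)=e^{-V(z)}$ on $\mathbb{R}^n$ that satisfies the following two properties.
(Regularity): the potential $V$ is $1$-Lipschitz in the $\ell^1$ norm, i.e.
$$|V(x)-V(y)|\le\|x-y\|_1\quad\text{for all }x,y\in\mathbb{R}^n.\tag{3.1}$$
(Boundedness): a random vector $Z=(Z_1,\ldots,Z_n)$ distributed according to the density $f$ satisfies
$$\mathbb{E}\max_{1\le k\le n}|Z_1+\cdots+Z_k|\le C\log^2 n\tag{3.2}$$
and
$$\max_{1\le k\le n}\mathbb{E}\,(Z_1+\cdots+Z_k)^2\le C\log^3 n,\tag{3.3}$$
where $C$ is a universal constant.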
The first candidate for the random walk could be a discretization of the standard Brownian motion on the interval . The reflection principle yields the boundedness property in the theorem, even without any logarithmic loss. However, the regularity property fails miserably.
To achieve regularity, we would like to grant Brownian motion more freedom. This will be done by modifying Lévy’s construction of Brownian motion. In this construction, the path of a Brownian motion on an interval is defined as a random series with respect to the Schauder basis of the space of continuous functions, see also [11, Section IX.1].
To that end, we recall the definition of the Schauder basis of triangular functions of . Let be the semi-open interval and the set of all integers such that . For any integer there exists a unique pair of integers such that where . Let
and define for and
The modification of this definition from to is obvious by dilation: .
Thus, the basis functions are defined by levels . At level , we have two functions and , and each level contains functions supported in disjoint intervals of length . Throughout this section, will denote the level the function belongs to, e.g. , , , etc. See Figure 1 for an illustration of these functions.
Lévy’s definition of the standard Brownian motion on the interval is
where are i.i.d. standard normal random variables.
To grant more freedom to Brownian motion, we get rid of the suppressing factors in the Lévy construction (3.4). The resulting series will be divergent, but we can truncate it, defining
The random walk in the theorem could then be defined as . It is more volatile than the Brownian motion, but still can be shown to satisfy the boundedness assumption.
This is essentially the idea of the construction that yields Theorem 3.1. We make two minor modifications though. First, since regularity is defined using the $\ell^1$ norm, it is more natural to use the Laplacian distribution for the coefficients instead of the normal distribution; we will take the coefficients to be i.i.d. Laplacian random variables. Second, instead of defining the random walk and then taking its differences to define the steps, it is more convenient to define the differences directly. This corresponds to working with the derivative of the random walk, i.e. with “white noise”.
3.2. Formal construction
First observe that the regularity property (3.1) of a probability distribution on $\mathbb{R}^n$ passes on to the marginal distributions. For example, regularity of a random vector means that
for all . In particular,
Taking integral with respect to on both sides yields
which is equivalent to the regularity of the random vector . The same argument works in higher dimensions.
Thus, by dropping at most terms if necessary, we can assume without loss of generality that
Consider i.i.d. random variables and the Haar basis of introduced in the previous subsection. Define the random function by
Define the increments by for . The construction is complete. It remains to check boundedness and regularity.
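The construction can be sketched numerically. The code below is a sketch under an assumed normalization, not necessarily the one used in this paper: the constant function equals $1/n$, and a Haar function supported on a block of length $s$ equals $+2/s$ on its first half and $-2/s$ on its second half, so that all partial sums of each basis function lie in $[-1,1]$. Under this normalization the coefficient vector of a point mass has $\ell^1$ norm $1+L/2$ for $n=2^L$, which is why the Laplacian scale is set to $1+L/2$; all function names are hypothetical.

```python
import numpy as np

def superregular_increments(levels, rng):
    """Sample Z = sum_j lambda_j psi_j on n = 2**levels points (assumed normalization).

    Returns the increments and the coefficient of the constant function.
    """
    n = 1 << levels
    scale = 1.0 + levels / 2.0            # Laplacian scale (assumption, see lead-in)
    lam0 = rng.laplace(0.0, scale)
    z = np.full(n, lam0 / n)              # contribution of the constant function 1/n
    for level in range(levels):
        s = n >> level                    # block length on this level
        lam = rng.laplace(0.0, scale, size=1 << level)
        sign = np.where(np.arange(s) < s // 2, 1.0, -1.0)
        # One Haar function per block: +2/s on the first half, -2/s on the second.
        z += np.repeat(lam, s) * np.tile(sign, 1 << level) * (2.0 / s)
    return z, lam0

def max_partial_sum(z):
    """max_k |Z_1 + ... + Z_k|."""
    return float(np.max(np.abs(np.cumsum(z))))

rng = np.random.default_rng(0)
levels = 18
n = 1 << levels
super_max = np.mean([max_partial_sum(superregular_increments(levels, rng)[0])
                     for _ in range(5)])
iid_max = np.mean([max_partial_sum(rng.laplace(0.0, 1.0, size=n))
                   for _ in range(5)])
print("superregular:", super_max, " i.i.d. Laplacian:", iid_max)
```

Both samplers have a potential that is $1$-Lipschitz in the $\ell^1$ norm, so the comparison is between walks of equal regularity. Since every Haar function sums to zero, the final partial sum equals the coefficient of the constant function, which gives a quick sanity check of the implementation.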
Fix . We would like to bound the partial sum
(Here we use the inner product in , and denote by the indicator of the discrete interval .)
For every we have
Moreover, the random variables are subexponential,333For basic facts about subexponential random variables used in this argument, refer e.g. to [48, Section 2.8]. and . Hence
Furthermore, we claim that for each , at most terms are nonzero. Indeed, let us fix and recall that the definition of Haar functions yields
On any given level , the Haar functions have disjoint support, so there is a single for which . Therefore, for each level , there can be at most one nonzero coefficient . Two more nonzero coefficients can be on level , coming from the functions and . This proves our claim that, for each , there number of nonzero coefficients is bounded by .
Summarizing, we showed that for each , the sum is a sum of at most independent mean zero subexponential random variables that satisfy (3.7). Applying Bernstein’s inequality (see [48, Theorem 2.8.1]), we obtain for every and :
Let and apply this bound for where is a sufficiently large absolute constant. We obtain
where we used that and (3.6) in the last step. Taking the union bound over , we get
This implies that and proves the first boundedness property (3.2).
The second boundedness property (3.3) follows similarly, and even in a simpler way, if we choose and bypass the union bound.
Recall that the Haar functions form an orthogonal basis of . However, this basis is not orthonormal, as the norm of each function on level satisfies . Thus, every function admits the orthogonal decomposition
The key property of the coefficient vector is its approximate sparsity, which we can express via the norm.
Lemma 3.2 (Sparsity).
For any function , the coefficient vector satisfies
First, let us prove the lemma for the indicator of any single point , i.e. for . Here we have
By construction, any function on level takes on three values: and . Moreover, on any given level , the functions have disjoint support, so there is a single well-defined for which . Therefore, among all functions on a given level , only one can make nonzero, namely the one with , and such a nonzero value always equals . Summarizing, the level contributes two nonzero coefficients , while each further level contributes only one. Hence has nonzero coefficients , each taking values . Therefore, .
To extend this bound to a general function , decompose it as . Then, by linearity, , so
The bound from the first part of the argument completes the proof of the lemma. ∎
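The sparsity of Haar coefficients can be checked numerically on a small example. The sketch below uses an assumed normalization (constant function equal to $1/n$; a Haar function on a block of length $s$ equal to $\pm 2/s$), under which the coefficient vector of a point mass has $\ell^1$ norm $1+L/2$ for $n=2^L$; the constants in Lemma 3.2 may differ under the normalization used in this paper.

```python
import numpy as np

def haar_rows(levels):
    """Rows are the (orthogonal, not orthonormal) Haar functions on n = 2**levels points."""
    n = 1 << levels
    rows = [np.full(n, 1.0 / n)]
    for level in range(levels):
        s = n >> level
        for block in range(1 << level):
            psi = np.zeros(n)
            start = block * s
            psi[start:start + s // 2] = 2.0 / s
            psi[start + s // 2:start + s] = -2.0 / s
            rows.append(psi)
    return np.array(rows)

def coefficients(f, rows):
    """Coefficients <f, psi_j> / ||psi_j||_2^2 of f in the orthogonal basis."""
    return rows @ f / np.sum(rows * rows, axis=1)

levels = 4
rows = haar_rows(levels)
n = rows.shape[1]

delta = np.zeros(n)
delta[5] = 1.0                       # indicator of a single point
c_delta = coefficients(delta, rows)

rng = np.random.default_rng(0)
f = rng.normal(size=n)               # an arbitrary test function
c_f = coefficients(f, rows)

print(np.count_nonzero(c_delta), np.abs(c_delta).sum())
print(np.abs(c_f).sum(), (1 + levels / 2) * np.abs(f).sum())
```

The point mass has one nonzero coefficient per level plus one for the constant function, and the bound for a general function follows by linearity, mirroring the two steps of the proof.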
We are ready to prove regularity. Consider the random function constructed in Subsection 3.2. In our new notation, the coefficient vector of is . We have for any :
To see this, recall that the map is a linear bijection on . Hence for any and for the unit ball of , we have
Taking the limit on both sides as and applying the Lebesgue differentiation theorem yield (3.8).
By construction, the coefficients of the random vector are i.i.d. random variables. Hence
By the triangle inequality and Lemma 3.2, we have
If we express the density in the form , the bound we proved can be written as
or . Swapping with yields . The proof of Theorem 3.1 is complete. ∎
The reader might wonder if the logarithmic factors are necessary in Theorem 3.1. While we do not know if this is the case for the bound (3.3), the logarithmic factor can not be completely removed from the uniform bound (3.2):
Proposition 3.3 (Logarithm is needed in (3.2)).
Let be a natural number and consider a probability density of the form on . Assume that the potential is -Lipschitz in the -norm, i.e. (3.1) holds. Then a random vector distributed according to the density satisfies
Since triangle inequality yields , it suffices to check that
Assume for contradiction that this bound fails. Then, considering the cube
we obtain by Markov’s inequality that .
Let denote the standard basis vectors in and consider the following translates of the cube :
Note the following two properties. First, the cubes are disjoint. Second, since is -Lipschitz in the -norm, for each , the densities of the random vectors and and differ by a multiplicative factor of at most pointwise. Therefore,
Hence, using these two properties we get
It follows that , which contradicts the assumption of the lemma. The proof is complete. ∎
3.6. Beyond the $\ell^1$ norm?
One may wonder why specifically the $\ell^1$ norm appears in the regularity property of Theorem 3.1. As we will see shortly, regularity with respect to the $\ell^1$ norm is exactly what is needed in our applications to privacy. However, it might be interesting to see if there are natural extensions of Theorem 3.1 for general $\ell^p$ norms. The proposition below rules out one such avenue, showing that if a potential is Lipschitz with respect to the $\ell^p$ norm for some $p>1$, the corresponding random walk deviates at least polynomially fast (as opposed to logarithmically fast).
Proposition 3.4 (No boundedness for -regular potentials).
Let and consider a probability density of the form on . Assume that the potential is -Lipschitz in the -norm. Then a random vector distributed according to the density satisfies
We can write where . Since and is 1-Lipschitz in the norm, the densities of the random vectors and differ by a multiplicative factor of at most pointwise. Therefore,
Rearranging the terms, we deduce that
which completes the proof. ∎
4. Metric privacy
4.1. Private measures
The superregular random walk we just constructed will become the main tool in solving the following private measure problem. We are looking for a private and accurate algorithm $\mathcal{A}$ that transforms a probability measure $\mu$ on a metric space $(T,\rho)$ into another finitely-supported probability measure $\nu$ on $T$.
We need to specify what we mean by privacy and accuracy here. Metric privacy offers a natural framework for our problem. Namely, we consider Definition 2.2 for the space of all probability measures on $T$ equipped with the TV metric (recalled in Section 2.2.1). Thus, for any pair of input measures $\mu$ and $\mu'$ on $T$ that are close in the TV metric, we would like the distributions of the (random) output measures $\nu$ and $\nu'$ to be close:
$$\mathbb{P}\{\nu\in S\}\le\exp\big(\alpha\,\|\mu-\mu'\|_{TV}\big)\,\mathbb{P}\{\nu'\in S\}.\tag{4.1}$$
The accuracy will be measured via the Wasserstein distance (recalled in Section 2.2.2). We hope to make $\mathbb{E}\,W_1(\nu,\mu)$ as small as possible. The reason for choosing $W_1$ as distance is that it allows us to derive accuracy guarantees for general Lipschitz statistics, as outlined below.
4.2. Synthetic data
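The private measure problem has an immediate application to differentially private synthetic data. Let $(T,\rho)$ be a compact metric space. We hope to find an algorithm $\mathcal{B}$ that transforms the true data $X=(X_1,\ldots,X_n)\in T^n$ into synthetic data $Y=(Y_1,\ldots,Y_m)\in T^m$ for some $m$ such that the empirical measures
$$\mu_X=\frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}\quad\text{and}\quad\mu_Y=\frac{1}{m}\sum_{i=1}^{m}\delta_{Y_i}$$
are close in the Wasserstein distance, i.e. we hope to make $\mathbb{E}\,W_1(\mu_X,\mu_Y)$ small. This would imply that synthetic data accurately preserve all Lipschitz statistics, i.e.
$$\frac{1}{n}\sum_{i=1}^{n}f(X_i)\approx\frac{1}{m}\sum_{i=1}^{m}f(Y_i)$$
for any Lipschitz function $f:T\to\mathbb{R}$.
This goal can be immediately achieved if we solve a version of the private measure problem, described in Section 4.1, with the additional requirement that $\nu$ be an empirical measure. Indeed, define the algorithm $\mathcal{B}$ by feeding the empirical measure $\mu_X$ into $\mathcal{A}$, i.e. set $\mathcal{B}(X)=\mathcal{A}(\mu_X)$. The accuracy follows, and the differential privacy of $\mathcal{B}$ can be seen as follows.
For any pair of input data $X,X'\in T^n$ that differ in a single element, the corresponding empirical measures differ by at most $1/n$ with respect to the TV distance, i.e.
$$\|\mu_X-\mu_{X'}\|_{TV}\le\frac{1}{n}.$$
Then, for any subset $S$ in the output space, we can use (4.1) to get
$$\mathbb{P}\{\mathcal{B}(X)\in S\}\le\exp(\alpha/n)\,\mathbb{P}\{\mathcal{B}(X')\in S\}.$$
Thus, if $\varepsilon=\alpha/n$, the algorithm $\mathcal{B}$ is $\varepsilon$-differentially private. Let us record this observation formally.
Lemma 4.1 (Private measure yields private synthetic data).
Let $(T,\rho)$ be a compact metric space. Let $\mathcal{A}$ be an algorithm that inputs a probability measure on $T$, and outputs something. Define the algorithm $\mathcal{B}$ that takes data $X=(X_1,\ldots,X_n)\in T^n$ as an input, creates the empirical measure $\mu_X$ and feeds it into the algorithm $\mathcal{A}$, i.e. set $\mathcal{B}(X)=\mathcal{A}(\mu_X)$. If $\mathcal{A}$ is $\alpha$-metrically private in the TV metric and $\varepsilon=\alpha/n$, then $\mathcal{B}$ is $\varepsilon$-differentially private.
Thus, our main focus from now on will be on solving the private measure problem; private synthetic data will follow as a consequence.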
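The key step in this reduction, that changing one data point moves the empirical measure by at most $1/n$ in the TV distance, can be verified on a toy data set. The data values, the helper names and the metric privacy parameter `alpha` below are assumptions made for this example.

```python
from collections import Counter

def empirical_measure(data):
    """Empirical measure of the data, as a dictionary point -> weight."""
    n = len(data)
    return {point: count / n for point, count in Counter(data).items()}

def tv_distance(mu, nu):
    """TV distance between finitely supported measures (half the l1 distance)."""
    keys = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(k, 0.0) - nu.get(k, 0.0)) for k in keys)

x = [0.1, 0.4, 0.4, 0.9, 0.7]
x_prime = [0.1, 0.4, 0.2, 0.9, 0.7]    # one entry changed
n = len(x)

tv = tv_distance(empirical_measure(x), empirical_measure(x_prime))

alpha = 10.0        # hypothetical metric privacy parameter of the measure algorithm
eps = alpha / n     # resulting differential privacy parameter for the data algorithm
print(tv, "<=", 1 / n, "; exponent", alpha * tv, "<=", eps)
```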
5. A private measure on the line
In this section, we construct a private measure on the interval . Later we will extend this construction to general metric spaces.
5.1. Discrete input space
Let us start with a somewhat restricted goal, and then work toward wider generality. In this subsection, we will (a) assume that the input measure is always supported on some fixed finite subset
and (b) allow the output to be a signed measure. We will measure accuracy with the Wasserstein distance.
5.1.1. Perturbing a measure by a superregular random walk
Apply the Superregular Random Walk Theorem 3.1 and rescale the random variables by setting . The regularity property of the random vector takes the form
and the boundedness property takes the form
Let us make the algorithm perturb the measure on by the weights , i.e. we set
Any measure on can be identified with the vector by setting . Then, for any measure on , we have
Fix two measures and on . By above, we have
This shows that the algorithm is -metrically private in the TV metric.
If and are signed measures on , then with the definition (2.7) of the Wasserstein metric for signed measures, we have
where we set .
Applying this general observation for and using (5.3), we obtain
Take expectation on both sides and use (5.2) to conclude that
The following result summarizes what we have proved.
Proposition 5.1 (Input in discrete space, output signed measure).
Let $\Omega$ be a finite subset of $[0,1]$ and let $n=|\Omega|$. Let $\alpha>0$. There exists a randomized algorithm $\mathcal{A}$ that takes a probability measure $\mu$ on $\Omega$ as an input and returns a signed measure $\nu$ on $\Omega$ as an output, with the following two properties.
(Privacy): the algorithm is -metrically private in the TV metric.
(Accuracy): for any input measure , the expected accuracy of the output signed measure in the Wasserstein distance is
Let $\nu$ be the signed measure obtained in Proposition 5.1. Let $\hat{\nu}$ be a probability measure on $\Omega$ that minimizes $W_1(\hat{\nu},\nu)$. (The minimizer could be non-unique.) Note that this is a convex problem, since by the formula for the Wasserstein distance between signed measures on $\Omega$ given above, this problem is equivalent to minimizing
under the constraints and .
By minimality, . So