The ability to answer statistical queries on a sensitive data set in a privacy-preserving way is one of the most fundamental primitives in private data analysis. In particular, this task has been at the center of the literature of differential privacy since its emergence [DN03, DMNS06, BLR13] and is central to the upcoming 2020 US Census release [DLS17]. In its basic form, the problem of differentially private query release can be described as follows. Given a class of queries (in this work, we focus on classes of binary functions, known in the DP literature as counting queries) defined over some domain , and a data set of i.i.d. drawn samples from some unknown distribution over , the goal is to construct an -differentially private algorithm that, given and , outputs a mapping such that for every , gives an accurate estimate for the true mean up to some additive error .
A central question in private query release is concerned with characterizing the private sample complexity, which is the least amount of private samples required to perform this task up to some additive error . This question has been extensively studied in the literature on differential privacy [DN03, HR10, MN12, BLR13, BUV18, SU15]. For general query classes, it was shown that the optimal bound on the private sample complexity in terms of , , and the privacy parameters is attained by the Private Multiplicative Weights (PMW) algorithm due to Hardt and Rothblum [HR10]. This optimality was established by the lower bound due to Bun et al. [BUV18]. This result implied the impossibility of differentially private query release for certain classes of infinite size. Moreover, subsequent results by [BNSV15, ALMM19] implied that this impossibility is true even for a simple class such as one-dimensional thresholds over . On the other hand, without any privacy constraints, the query release problem is equivalent to attaining uniform convergence over , and hence the sample complexity is given by where is the VC-dimension of .
In practice, it is often feasible to collect some amount of “public” data that poses no privacy concerns. For example, in the language of consumer privacy, there is a considerable amount of data collected from so-called “opt-in” users, who voluntarily offer or sell their data to companies or organizations. Such data is deemed by its original owner to pose no threat to personal privacy. There are also a variety of other sources of public data that can be harnessed.
Motivated by the above observation, and by the limitations in the standard model of differentially private query release, in this work, we study a relaxed setting of this problem, which we call Public-data-Assisted Private (PAP) query release. In this setting, the query-release algorithm has access to two types of samples from the unknown distribution: private samples that contain personal and sensitive information (as in the standard setting) and public samples. The goal is to design algorithms that can exploit as little public data as possible to achieve non-trivial savings in sample complexity over standard DP query-release algorithms, while still providing strong privacy guarantees for the private dataset.
1.1 Our results
In this work we study the private and public sample complexities of PAP query-release algorithms, and give upper and lower bounds on both. To describe our results, we will use and to denote the VC-dimension and dual VC-dimension of the query class , respectively. We will use to denote the target error for query release.
Upper bounds: We give a construction of a PAP algorithm that solves the query release problem for any query class using only public samples, and private samples. Recall that samples are necessary even without privacy constraints; therefore, our upper bound on the public sample complexity shows a nearly quadratic saving.
Lower bound on private sample complexity: We show that there is a query class with VC-dimension and dual VC-dimension such that any PAP algorithm either requires public samples or requires total samples. Thus the dependence above is unavoidable. For this class, public samples are enough to solve the problem with no private samples, and private samples are enough to solve the problem with no public samples. Thus, for this function class public samples essentially do not help improve the overall sample complexity, unless there are nearly enough public samples to solve the problem without private samples.
Lower bound on public sample complexity: We show that if the class has infinite Littlestone dimension (the Littlestone dimension is a combinatorial parameter that arises in online learning [Lit88, BDPSS09]), then any PAP query-release algorithm for must have public sample complexity . The simplest example of such a class is the class of one-dimensional thresholds over . This class has VC-dimension , and therefore demonstrates that the upper bound above is nearly tight.
The first key step in our construction for the upper bound is to use the public data to construct a finite class that forms a “good approximation” of the original class . Such approximation is captured via the notion of an -cover (Definition 2.1). The number of public examples that suffices to construct such approximation is about [ABM19]. Given this finite class , we then reduce the original domain to a finite set of representative domain points, which is defined via an equivalence relation induced by over (Definition 2.1). Using Sauer’s Lemma, we can show that the size of such a representative domain is at most , where is the dual VC-dimension of . At this point, we can reduce our problem to DP query-release for the finite query class and the finite domain , which we can solve via the PMW algorithm [HR10, DR14].
Lower bound on private sample complexity:
The proof of the lower bound is based on robust tracing attacks [BUV18, DSS15, SU15]. That work proves privacy lower bounds for the class of decision stumps over the domain , which contains queries of the form for some . Specifically, they show that for any algorithm that takes at most samples, and releases the class of decision stumps with accuracy , there is some attacker that can “detect” the presence of at least of the samples. Therefore, if the number of public samples is at most , the attacker can detect the presence of one of the private samples, which means the algorithm cannot be differentially private with respect to the private samples.
Lower bound on public sample complexity:
This lower bound is derived in two steps. First, we show that PAP query-release for a class implies PAP learning (studied in [BNS13, ABM19], where this notion was termed “semi-private” learning) of the same class with the same amount of public data. This step follows from a straightforward generalization of an analogous result by [BNS13] in the standard DP model with no public data. Second, we invoke the lower bound of [ABM19] on the public sample complexity of PAP learning.
1.3 Other related work
To the best of our knowledge, our work is the first to formally study differentially private query release assisted with public data. There has been work on the private supervised-learning setting with access to limited public data, that is, PAP learning. In particular, the notion of differentially private PAC learning assisted with public data was introduced by Beimel et al. in [BNS13], where it was called “semi-private learning.” They gave a construction of a learning algorithm in this setting, and derived upper bounds on the private and public sample complexities. The paper [ABM19] revisited this problem and gave nearly optimal bounds on the private and public sample complexities in the agnostic PAC model. The work of [ABM19] emphasizes the notion of $\alpha$-covers as a useful tool in the construction of such learning algorithms. Our PAP algorithm leverages this notion in the query-release setting.
In a similar vein, there has been work on other relaxations of private learning that do not require all parts of the data to be private. For example, [CH11, BNS13] studied the notion of “label-private learning,” where only the labels in the training set are considered private. Another line of work considers the setting in which the learning task can be reduced to privately answering classification queries [HCB16, PAE17, PSM18, BTT18, NB19], where the goal is to construct a differentially private classification algorithm that predicts the labels of a sequence of public feature-vectors such that the predictions are differentially private in the private training dataset.
We use to denote an arbitrary data universe, to denote a distribution over , and to denote a binary hypothesis class.
2.1 Tools from learning theory
The VC dimension of a binary hypothesis class is denoted by .
We will use the following notion of coverings:
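[$\alpha$-cover for a hypothesis class] A family of hypotheses $\widetilde{\mathcal{H}}$ is said to form an $\alpha$-cover for a hypothesis class $\mathcal{H}$ with respect to a distribution $\mathcal{D}$ over $\mathcal{X}$ if for every $h \in \mathcal{H}$, there is $\tilde{h} \in \widetilde{\mathcal{H}}$ such that $\Pr_{x \sim \mathcal{D}}[h(x) \neq \tilde{h}(x)] \le \alpha$.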
[Representative domain (a.k.a. the dual class)] Let be a hypothesis class. Define an equivalence relation on by if and only if . The representative domain induced by on , denoted by , is a complete set of distinct representatives from for this equivalence relation. For example, let be a class of binary thresholds over given by: iff , . Then, a representative domain in this case is a set of distinct elements; one from each of the following intervals More generally, if is a class of halfspaces then a representative domain contains exactly one point in each cell of the hyperplane arrangement induced by (see Figure 1).
Note that when is finite then any representative domain for has size at most , since the equivalence class of each is determined by the binary vector . Moreover, one can also make the following simple claim, which is a direct consequence of the Sauer-Shelah Lemma [Sau72] together with the fact that a representative domain has a one-to-one correspondence with the dual class of . Below, we use to denote the dual VC-dimension of a hypothesis class , namely, is the VC-dimension of the dual class of . Let be a finite class of binary functions defined over a domain . Then, the size of a representative domain satisfies: .
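As a concrete illustration, the following minimal Python sketch computes a representative domain for a finite class over a finite pool of points. It assumes, as the remark above indicates, that two points are equivalent exactly when every hypothesis labels them identically, so that each equivalence class is identified by its binary label vector. The function name `representative_domain` and the threshold example are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Hypothesis = Callable[[int], int]


def representative_domain(points: Iterable[int],
                          hypotheses: Sequence[Hypothesis]) -> List[int]:
    """Return one representative per equivalence class of the relation
    x ~ x'  iff  h(x) == h(x') for every h in `hypotheses`."""
    reps: Dict[Tuple[int, ...], int] = {}
    for x in points:
        # The label vector (h_1(x), ..., h_m(x)) identifies the class of x.
        signature = tuple(h(x) for h in hypotheses)
        # Keep the first point seen in each class as its representative.
        reps.setdefault(signature, x)
    return list(reps.values())


# Hypothetical example: three thresholds h_t(x) = 1 iff x >= t.
thresholds = [lambda x, t=t: int(x >= t) for t in (2, 5, 8)]
reps = representative_domain(range(10), thresholds)
print(reps)
```

In this example the three thresholds split the ten points into four intervals, so the representative domain has four elements, far fewer than the trivial bound of two to the power of the class size.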
The following useful fact gives a worst-case upper bound on the dual VC-dimension in terms of the VC-dimension. [[Ass83]] Let be a binary hypothesis class. The dual VC-dimension of satisfies: .
In Section 3, we will use the following notation. Let be a hypothesis class defined over a domain . For any , we define as the representative such that , where is the equivalence relation described in Definition 2.1. Note that by definition this is unique. Moreover, for any and any we define .
[Query Release] Given a distribution over and a binary hypothesis class , a query release data structure (equivalently, ) estimates the expected label for all . The worst-case error is defined as
2.2 Tools from Differential Privacy
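Two datasets $S, S'$ are neighboring when they differ on one element. [Differential Privacy] A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if for all neighboring $S, S'$ and all sets of outputs $\mathcal{O}$,
$$\Pr[\mathcal{A}(S) \in \mathcal{O}] \le e^{\epsilon} \, \Pr[\mathcal{A}(S') \in \mathcal{O}] + \delta.$$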
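To make the definition concrete, the following minimal sketch answers a single counting query with the Laplace mechanism, a standard way to obtain pure differential privacy (that is, the case where the additive privacy parameter is zero). It is not the mechanism used in this paper. The names `laplace_noise` and `private_mean` and the toy data are illustrative assumptions.

```python
import random
from typing import Callable, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_mean(data: Sequence[int], query: Callable[[int], int],
                 epsilon: float, rng: random.Random) -> float:
    """Release the mean of a {0,1}-valued query with epsilon-DP.

    Replacing one element of `data` changes the mean by at most 1/n,
    so Laplace noise with scale 1/(n * epsilon) suffices.
    """
    n = len(data)
    true_mean = sum(query(x) for x in data) / n
    return true_mean + laplace_noise(1.0 / (n * epsilon), rng)


# Hypothetical usage: fraction of records that are at least 5.
data = [1, 4, 6, 9, 3, 7, 8, 2]
estimate = private_mean(data, lambda x: int(x >= 5), epsilon=1.0,
                        rng=random.Random(0))
```

Answering every query in a large class this way would split the privacy budget across all queries, which is one reason more refined mechanisms are used for query release.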
Private Multiplicative Weights (PMW):
In our construction in Section 3, we will use, as a black box, a well-studied algorithm in differential privacy known as Private Multiplicative-Weights [HR10]. We will use a special case of the offline version of the PMW algorithm. Namely, the input query class is finite, and PMW runs over all the queries in the input class (in any order) to perform its updates, and finally outputs a query release data structure . When the input private data set is drawn i.i.d. from some unknown distribution , the accuracy goal is to have a data structure such that is small. The outline (inputs and output of the PMW algorithm) is described in Algorithm 1.
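For intuition only, the sketch below shows one multiplicative-weights update on a distribution over a finite domain, which is the kind of update PMW applies to its internal synthetic distribution. This is a non-private illustration: in PMW the target value would be a noisy estimate of the query answer on the private data, and updates are triggered only when the current estimate is far from that noisy answer. The function name `mw_update`, the learning rate `eta`, and the toy numbers are assumptions.

```python
import math
from typing import List, Sequence


def mw_update(weights: Sequence[float], query_values: Sequence[float],
              target: float, eta: float) -> List[float]:
    """One multiplicative-weights step on a distribution over a finite domain.

    `weights[i]` is the current mass on domain point i and
    `query_values[i]` is the query's value on that point.
    """
    estimate = sum(w * q for w, q in zip(weights, query_values))
    # Move mass toward points where the query is 1 if we underestimate,
    # and away from them if we overestimate.
    direction = 1.0 if target > estimate else -1.0
    raised = [w * math.exp(direction * eta * q)
              for w, q in zip(weights, query_values)]
    total = sum(raised)
    return [w / total for w in raised]


weights = [0.25, 0.25, 0.25, 0.25]
query_values = [1, 1, 0, 0]
updated = mw_update(weights, query_values, target=0.9, eta=0.5)
new_estimate = sum(w * q for w, q in zip(updated, query_values))
```

After the update the distribution's answer to the query moves from 0.5 toward the target while remaining a valid probability distribution.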
The following lemma is an immediate corollary of the accuracy guarantee of the PMW algorithm [HR10, DR14]. In particular, it follows from combining the empirical accuracy of PMW with a standard uniform-convergence argument.
2.3 Our Model: PAP Query Release
In this paper, we study an extension of the problem of differentially private query release [DR14] where the input data have two types: private and public. Formally, let be any distribution over a data domain . Let be a class of binary queries. We consider a family of algorithms whose generic description (namely, inputs and outputs) is given by Algorithm 2.
Given the query class , a private data set (i.e., a data set whose elements belong to private users), and a public data set (i.e., a data set whose elements belong to users with no privacy constraint), the algorithm outputs a query release data structure . Such an algorithm is required to be -differentially private but only with respect to the private data set. We call such an algorithm a Public-data-Assisted Private (PAP) query-release algorithm. The accuracy/utility of the algorithm is determined by the worst-case estimation error incurred by its output data structure on any query (hypothesis) .
[ PAP query-release algorithm] Let be a query class. Let be a randomized algorithm in the family outlined in Algorithm 2. We say that is Public-data-Assisted Private (PAP) query-release algorithm for with private sample size and public sample size if the following conditions hold:
For every distribution over , given data sets and as inputs to , with probability at least (over the choice of and the random coins of ), outputs a function (data structure) satisfying
For all is -differentially private.
In our description in Algorithm 2, the algorithm is required to output a data structure and not necessarily a “synthetic” data set for some number as in what is referred to as “private proper sanitizers” in [BNS13]. In that special case, obviously the output data set can be used to define a data structure ; namely, for any , . Moreover, in the general case, ignoring computational complexity, the output data structure can also be used to construct a data set as pointed out in Remark 2.18 of [BNS13]. In particular, given a data structure whose error , then it suffices to find a data set , where , such that for all , and hence the accuracy requirement would follow by the triangle inequality. Also, we know that this data set must exist. This is because by a standard uniform-convergence argument, a data set will, with a non-zero probability, satisfy for all , and hence, by the triangle inequality, for all .
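The first direction of this remark can be sketched in a few lines: a synthetic data set defines a data structure by averaging each query over the synthetic points. The sketch below also evaluates the worst-case error over a finite class against an explicitly given finite distribution. The helper names, the uniform distribution, and the threshold queries are all illustrative assumptions.

```python
from typing import Callable, Dict, Sequence

Query = Callable[[int], int]


def data_structure_from_dataset(synthetic: Sequence[int]) -> Callable[[Query], float]:
    """Turn a synthetic data set into a data structure G, where G(h) is
    the empirical mean of h over the synthetic points."""
    def G(h: Query) -> float:
        return sum(h(x) for x in synthetic) / len(synthetic)
    return G


def worst_case_error(G: Callable[[Query], float], queries: Sequence[Query],
                     distribution: Dict[int, float]) -> float:
    """Largest gap between G(h) and the true mean of h, over a finite class."""
    return max(abs(G(h) - sum(p * h(x) for x, p in distribution.items()))
               for h in queries)


# Hypothetical setup: uniform distribution on {0, ..., 9}, three thresholds.
distribution = {x: 0.1 for x in range(10)}
queries = [lambda x, t=t: int(x >= t) for t in (2, 5, 8)]
G = data_structure_from_dataset([0, 3, 6, 9])
err = worst_case_error(G, queries, distribution)
```

The reverse direction described in the remark, searching for a data set whose empirical means match a given data structure, is an exhaustive search in general and is not attempted here.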
3 A PAP Query-Release Algorithm for Classes of Finite VC-Dimension
We now describe a construction of a public-data-assisted private query release algorithm that works for any class with a finite VC-dimension.
Our construction is given by Algorithm 3. The key idea of the construction is to use the public data to create a finite -cover for the input query class (see Definition 2.1), then, run the PMW algorithm on the finite cover and the representative domain given by the dual of (see Definition 2.1).
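The cover-building step can be sketched as follows, under two loudly stated assumptions: that the cover is obtained by keeping one hypothesis for each distinct labeling of the public points (in the spirit of [ABM19]), and that the class is represented by a finite list of candidate hypotheses. The names `build_cover` and `project` are hypothetical and are not taken from Algorithm 3.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Hypothesis = Callable[[int], int]


def build_cover(public_points: Sequence[int],
                candidates: Sequence[Hypothesis]) -> List[Hypothesis]:
    """Keep one candidate per distinct labeling of the public points."""
    chosen: Dict[Tuple[int, ...], Hypothesis] = {}
    for h in candidates:
        labeling = tuple(h(x) for x in public_points)
        chosen.setdefault(labeling, h)
    return list(chosen.values())


def project(h: Hypothesis, public_points: Sequence[int],
            cover: Sequence[Hypothesis]) -> Hypothesis:
    """Return the cover element that agrees with h on every public point."""
    labeling = tuple(h(x) for x in public_points)
    for g in cover:
        if tuple(g(x) for x in public_points) == labeling:
            return g
    raise ValueError("h was not among the candidates used to build the cover")


# Hypothetical example: thresholds h_t(x) = 1 iff x >= t, for t = 0, ..., 10.
candidates = [lambda x, t=t: int(x >= t) for t in range(11)]
public_points = [3, 7]
cover = build_cover(public_points, candidates)
proxy = project(candidates[5], public_points, cover)
```

Under these assumptions, a query is answered by returning the privately released answer for its proxy in the cover, so only the finite cover ever needs to be passed to the private mechanism.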
[Upper Bound] (Algorithm 3) is an public-data-assisted private query-release algorithm for whose private and public sample complexities satisfy:
where and .
By Fact 1, we can further bound the private sample complexity for general query classes as
[Lemma 3.3 in [ABM19]] Let . Then, with probability at least the family constructed in Step 3 of Algorithm 3 is an -cover for w.r.t. In particular, for every , we have where (see Algorithm 3 for the definition of ), as long as
Proof of Theorem 3.
First, note that for any realization of the public data set is -differentially private w.r.t. the private data set. Indeed, the private data set is only used to construct , which is the input data set to the PMW algorithm. The output of PMW is then used to construct the output data structure . Moreover, for any pair of neighboring data sets , the pair cannot differ in more than one element. Hence, -differential privacy of our construction follows from -differential privacy of the PMW algorithm together with the fact that differential privacy is closed under post-processing.
Next, we prove the accuracy guarantee of our construction. By Lemma 3, it follows that with probability at least , we have where Hence, it suffices to show that with probability at least , (recall that is the output data structure of the PMW algorithm). Note that by Sauer’s lemma, we know where . From the setting of in the theorem statement, we hence have
Moreover, by Claim 2.1, we have
where . Thus, given the setting of in the theorem statement, Lemma 1 implies that with probability at least , our instantiation of the PMW algorithm yields that satisfies
which completes the proof. ∎
4 A Lower Bound for Releasing Decision Stumps
In this section we give an example of a hypothesis class—decision stumps on over the domain —where additional public data “does not help” for private query release. This concept class can be released using either private samples and no public samples, or using public samples and no private samples. However, we show that every PAP query-release algorithm requires either private samples or public samples. That is, making some samples public does not reduce the overall sample complexity until the number of the public samples is nearly enough to solve the problem on its own.
The class of decision stumps on has dual-VC-dimension , but VC-dimension just , so this lower bound implies that the polynomial dependence on the dual-VC-dimension in Theorem 3 cannot be improved—there are classes with dual-VC-dimension that require either private samples or public samples.
[Binary Decision Stumps] For any , let be a hypothesis class of hypotheses consisting of all hypotheses of the form for .
[Lower Bound for Releasing Decision Stumps] Fix any and . Suppose is a PAP algorithm that takes private samples and public samples, satisfies -differential privacy, and is -accurate for the class of decision stumps . Then either or .
Thus, if , then the number of private samples must scale proportionally to as in our upper bound in Theorem 3.
The main ingredient in the proof is a result of Dwork et al. [DSS15]. Informally, this theorem says that any algorithm that releases accurate answers to the class of decision stumps using too small a dataset admits an attacker who can identify a large number of that algorithm’s samples.
[Special Case of Theorem 17 of [DSS15] ] For every and , there exists a number and a number such that the following holds: For every query-release algorithm with total sample size that is -accurate for the class of decision stumps on , there exists a distribution over and an attacker who takes as input the vector of answers and an example and outputs either or such that
Proof of Theorem 4.
Fix and let and be the values specified in Theorem 4. Suppose that is a PAP algorithm that is -accurate for the class with private samples and public samples. We will show that either or . First, note that the accuracy condition of implies that we must have by the standard lower bound on the sample complexity of query release even without any privacy constraints. Thus, to prove the theorem statement, it suffices to show that if and is -accurate, then cannot satisfy -differential privacy w.r.t. its private samples. Indeed, this would imply that either or .
where the first inequality follows from the assumption that , and the last inequality follows from the assumption that .
That is, with high probability, the attacker identifies at least samples in the dataset. Let be the indicator of the event . Therefore, we have
where the second step follows from Markov’s inequality. Since
we can conclude that
Therefore, there must exist a private sample such that
Now, consider the dataset where we replace in with an independent sample but the rest of the samples in is the same as in . In this experiment is now an independent sample from , so Theorem 4 states that
However, note that the joint distribution is a distribution over pairs of datasets that differ on at most one private sample. Therefore, we have shown that cannot satisfy -differential privacy for its private samples unless
Therefore, in particular, for , cannot be -differentially private. ∎
5 A Lower Bound on Public Sample Complexity
The goal of this section is to show a general lower bound on the public sample complexity of PAP query release. Our lower bound holds for classes with infinite Littlestone dimension. The Littlestone dimension is a combinatorial parameter of hypothesis classes that characterizes mistake and regret bounds in Online Learning [Lit88, BDPSS09]. There are many examples of classes that have finite VC-dimension, but infinite Littlestone dimension. The simplest example is the class of threshold functions over whose VC-dimension is , but has infinite Littlestone dimension. In [ALMM19], it was shown that if a class has infinite Littlestone dimension, then it is not privately learnable.
Our lower bound is formally stated in the following theorem.
[Lower bound on public sample complexity] Let be any query class that has infinite Littlestone dimension. Any PAP query-release algorithm for must have public sample complexity , where is the desired accuracy. We stress that the above lower bound on the public sample complexity holds regardless of the number of private samples, which can be arbitrarily large.
In the light of our upper bound in Section 3, our lower bound on the public sample complexity exhibits a tight dependence on the accuracy parameter . That is, one cannot hope to attain public sample complexity that is .
In the proof of the above theorem, we will refer to the following notion of private PAC learning with access to public data that was defined in [ABM19]. For completeness, we restate this definition here.
[ PAP Learner] Let be a hypothesis class. A randomized algorithm is PAP learner for with private sample size and public sample size if the following conditions hold:
For every distribution over , given data sets and as inputs to , with probability at least (over the choice of and the random coins of ), outputs a hypothesis satisfying
where, for any hypothesis
For all is -differentially private.
We say that is a proper PAP learner if with probability 1.
We prove the above theorem in two simple steps that follow from prior works: the first step shows that PAP query-release implies PAP learning, and the second step invokes a known lower bound on PAP learning of classes with infinite Littlestone dimension. Both steps are formalized in the lemmas below.
[General version of Theorem 5.5 in [BNS13] ] Let be any class of binary functions. If there exists an PAP query-release algorithm for with private sample complexity and public sample complexity , then there exists an PAP learner for with private sample complexity , and public sample complexity .
[Theorem 4.1 in [ABM19]] Let be any class with an infinite Littlestone dimension (e.g., the class of thresholds over ). Then, any PAP learner for must have public sample complexity , where is the excess error.
Given these two lemmas, the proof is straightforward. To elaborate, note that Lemma 5 shows that for any class , a PAP query-release algorithm for with public sample complexity implies the existence of a PAP learner for with the same public sample complexity (and essentially the same accuracy and privacy parameters). Hence, by Lemma 5, if has infinite Littlestone dimension, then such public sample complexity must satisfy . This proves our theorem.
Although the proof of Lemma 5 is almost straightforward given Theorem 5.5 in [BNS13], we will elaborate on a couple of minor details. First, note that even though the reduction in [BNS13] involves pure differentially private algorithms, the same construction in their reduction would also work for the case of -differential privacy with minor and obvious changes in the privacy analysis. Second, we note that the reduction in [BNS13] is for “proper sanitizers,” which are query-release algorithms that are restricted to output a data set from the input domain rather than any data structure that maps to . As discussed in Remark 2.3, ignoring computational complexity, any PAP query-release algorithm satisfying Definition 2 can be transformed into a PAP query-release algorithm that outputs a data set from the input domain and has the same accuracy (up to a constant factor). Now, given these minor details and since any PAP algorithm can obviously be viewed as a differentially private algorithm operating on the private data set (by “hardwiring” the public data set into the algorithm), Lemma 5 simply follows by invoking the reduction in [BNS13]. ∎
RB’s research is supported by NSF Awards AF-1908281, SHF-1907715, Google Faculty Research Award, and OSU faculty start-up support. AC and JU were supported by NSF grants CCF-1718088, CCF-1750640, CNS-1816028, and CNS-1916020. AN is supported by an Ontario ERA, and an NSERC Discovery Grant RGPIN-2016-06333. ZSW is supported by a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Mozilla research grant. Part of this work was done while RB, AC, AN, JU, and ZSW were visiting the Simons Institute for the Theory of Computing.
- [ABM19] Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public data. arXiv:1910.11519 [cs.LG] (appeared at NeurIPS 2019), 2019.
- [ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. STOC 2019, pp. 852-860 (arXiv preprint arXiv:1806.00949), 2019.
- [Ass83] Patrick Assouad. Densité et dimension. In Annales de l’Institut Fourier, volume 33, pages 233–282, 1983.
- [BDPSS09] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, volume 3, page 1, 2009.
- [BLR13] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM), 60(2):1–25, 2013.
- [BNS13] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 363–378. Springer, 2013.
- [BNSV15] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 634–649, 2015.
- [BTT18] Raef Bassily, Abhradeep Guha Thakurta, and Om Dipakbhai Thakkar. Model-agnostic private learning. In Advances in Neural Information Processing Systems, pages 7102–7112, 2018.
- [BUV18] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. SIAM Journal on Computing, 47(5):1888–1938, 2018.
- [CH11] Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 155–186, 2011.
- [DLS17] Aref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Leclerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M. Abowd. The modernization of statistical disclosure limitation at the U.S. Census Bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.
- [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
- [DN03] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 202–210, 2003.
- [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
- [DSS15] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. Robust traceability from trace amounts. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 650–669. IEEE, 2015.
- [HCB16] Jihun Hamm, Yingjun Cao, and Mikhail Belkin. Learning privately from multiparty data. In International Conference on Machine Learning, pages 555–563, 2016.
- [HR10] Moritz Hardt and Guy N Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 61–70. IEEE, 2010.
- [Lit88] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2(4):285–318, 1988.
- [MN12] Shanmugavelayutham Muthukrishnan and Aleksandar Nikolov. Optimal private halfspace counting via discrepancy. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 1285–1292, 2012.
- [NB19] Anupama Nandi and Raef Bassily. Privately answering classification queries in the agnostic PAC model. arXiv preprint arXiv:1907.13553. To appear in ALT 2020, 2019.
- [PAE17] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. stat, 1050, 2017.
- [PSM18] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with PATE. arXiv preprint arXiv:1802.08908, 2018.
- [Sau72] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.
- [SU15] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. arXiv preprint arXiv:1501.06095, 2015.