1 Introduction
Increasing privacy concerns have become a major obstacle to collecting, analyzing and sharing data, as well as to communicating the results of data analyses in sensitive domains. For example, the second Netflix Prize competition was canceled in response to a lawsuit and Federal Trade Commission privacy concerns, and the National Institutes of Health decided in August 2008 to remove aggregate Genome-Wide Association Studies (GWAS) data from its public web site after learning about a potential privacy risk. These concerns are well-grounded in the Big-Data era, as stories about privacy breaches from improperly handled data sets appear very regularly (e.g., medical records [3], Netflix [27], NYC Taxi [36]). These incidents highlight the need for formal methods that provably protect the privacy of individual-level data points while retaining database-level utility comparable to that of their non-private counterparts.
There is a long history of attempts to address these problems and the risk-utility tradeoff in statistical agencies [9, 19, 8], but most of the methods developed do not provide clear and quantifiable privacy guarantees. Differential privacy [14, 10] succeeds in the first task. While it allows a clear quantification of the privacy loss, it provides a worst-case guarantee, and in practice it often requires adding noise with a very large magnitude (if finite at all), hence resulting in unsatisfactory utility, cf. [32, 37, 16].
A growing literature focuses on weakening the notion of differential privacy to make it more widely applicable and to obtain a more favorable privacy-utility trade-off. Popular attempts include $(\epsilon,\delta)$-approximate differential privacy [13], personalized differential privacy [15, 22], random differential privacy [17] and so on. They each have pros and cons and are useful in their specific contexts. There is a related literature addressing the folklore observation that “differential privacy implies generalization” [12, 18, 30, 11, 5, 35].
The implication of generalization is a minimal property that we feel any notion of privacy should have. This brings us to the natural question:
Is there a weak notion of privacy that is equivalent to generalization?
In this paper, we provide a partial answer to this question. Specifically, we define On-Average Kullback-Leibler (KL)-Privacy and show that it characterizes On-Average Generalization (both quantities are formally defined below) for algorithms that draw a sample from an important class of maximum entropy/Gibbs distributions, i.e., distributions with probability/density proportional to
$$\exp\big(-\mathcal{L}(\text{Data}, \text{Output})\big)\,\pi(\text{Output})$$
for a loss function $\mathcal{L}$ and (possibly improper) prior distribution $\pi$.
We argue that this is a fundamental class of algorithms that covers a large portion of the tools in modern data analysis, including Bayesian inference, empirical risk minimization in statistical learning, as well as the private release of database queries through Laplace and Gaussian noise-adding. From here onwards, we will refer to this class of distributions as “MaxEnt distributions” and to the algorithm that outputs a sample from a MaxEnt distribution as “posterior sampling”.
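To make the notion of posterior sampling concrete, the following is a minimal sketch of drawing a hypothesis from a Gibbs distribution over a finite grid. The grid, the half-squared loss, the flat prior and the function names (`gibbs_probs`, `posterior_sample`) are illustrative assumptions, not part of the paper.

```python
import math
import random

def gibbs_probs(data, hypotheses, loss, log_prior=lambda h: 0.0):
    """p(h | data) proportional to exp(-sum_i loss(z_i, h) + log_prior(h))."""
    log_w = [-sum(loss(z, h) for z in data) + log_prior(h) for h in hypotheses]
    shift = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def posterior_sample(data, hypotheses, loss, rng):
    """Draw one hypothesis from the Gibbs distribution (flat prior assumed)."""
    probs = gibbs_probs(data, hypotheses, loss)
    return rng.choices(hypotheses, weights=probs, k=1)[0]

# Hypothetical toy data and hypothesis grid.
data = [0.1, 0.4, 0.3]
grid = [i / 10 for i in range(-10, 11)]
half_sq_loss = lambda z, h: 0.5 * (z - h) ** 2

probs = gibbs_probs(data, grid, half_sq_loss)
mode = grid[probs.index(max(probs))]  # grid point with the smallest total loss
sample = posterior_sample(data, grid, half_sq_loss, random.Random(0))
```

The mode of the Gibbs distribution coincides with the empirical risk minimizer on the grid, while a sample spreads mass over nearby hypotheses; that spread is what the privacy analysis exploits.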
This work is closely related to the various notions of algorithmic stability in learning theory [21, 7, 26, 29]. In fact, we can treat differential privacy as a very strong notion of stability. Thus On-Average KL-privacy may well be called On-Average KL-stability. Stability implies generalization in many different settings, but it is often only a sufficient condition. Exceptions include [26, 29], who show that notions of stability are also necessary for the consistency of empirical risk minimization and for the distribution-free learnability of any algorithm. Our specific stability definition, its equivalence to generalization and its properties as a privacy measure have not been studied before. KL-Privacy first appears in [4] and is shown to imply generalization in [5]. On-Average KL-privacy further weakens KL-privacy. A high-level connection can be made to leave-one-out cross validation, which is often used as a (slightly biased) empirical measure of generalization, e.g., see [25].
2 Symbols and Notation
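We will use the standard statistical learning terminology where $z \in \mathcal{Z}$ is a data point, $h \in \mathcal{H}$ is a hypothesis and $\ell: \mathcal{Z} \times \mathcal{H} \rightarrow \mathbb{R}$ is the loss function. One can think of the negative loss function as a measure of the utility of $h$ on data point $z$. Lastly, $\mathcal{A}$ is a possibly randomized algorithm that maps a data set $Z \in \mathcal{Z}^n$ to some hypothesis $h \in \mathcal{H}$. For example, if $\mathcal{A}$ is empirical risk minimization (ERM), then $\mathcal{A}$ chooses $h^* = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^n \ell(z_i, h)$.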
Many data analysis tasks can be cast in this form. For example, in linear regression, $z_i = (x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, $h$ is the coefficient vector and $\ell$ is just $(y_i - x_i^T h)^2$; in k-means clustering, $z \in \mathbb{R}^d$ is just the feature vector, $h = \{h_1, \dots, h_k\}$ is the collection of $k$ cluster centers and $\ell(z, h) = \min_j \|z - h_j\|^2$. Simple calculations of statistical quantities can often be represented in this form too, e.g., calculating the mean is equivalent to linear regression with identity design, and calculating the median is the same as ERM with loss function $|z - h|$.
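The claim that the mean and the median are both instances of ERM can be checked directly. The sketch below minimizes the empirical risk over a fine grid of candidate hypotheses; the data values, the grid and the helper name `erm` are assumptions chosen only for illustration.

```python
import statistics

def erm(data, hypotheses, loss):
    """Return the hypothesis with the smallest total loss on the data."""
    return min(hypotheses, key=lambda h: sum(loss(z, h) for z in data))

# Hypothetical data set and a grid of candidate hypotheses on [0, 10].
data = [1.0, 2.0, 2.5, 7.0, 10.0]
grid = [i / 100 for i in range(0, 1001)]

h_squared = erm(data, grid, lambda z, h: (z - h) ** 2)   # squared loss
h_absolute = erm(data, grid, lambda z, h: abs(z - h))    # absolute loss

sample_mean = statistics.mean(data)
sample_median = statistics.median(data)
```

With squared loss the minimizer lands on the sample mean, and with absolute loss it lands on the sample median, up to the grid resolution.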
We also consider cases when the loss function is defined over the whole data set $Z$; in this case the loss is evaluated on the whole data set by the structured loss $\mathcal{L}(Z, h)$. We do not require $Z$ to be drawn from some product distribution, but rather from any distribution $\mathcal{D}$. Generally speaking, $Z$ could be a string of text, a news article, a sequence of transactions of a credit card user, or simply the entire data set of $n$ iid samples. We will revisit this generalization with more concrete examples later. However, we would like to point out that this is equivalent to the above case when we only have one (much more complicated) data point and the algorithm $\mathcal{A}$ is applied to only one sample.
3 Main Results
We first describe differential privacy, after which it will become intuitive where KL-privacy and On-Average KL-privacy come from. Roughly speaking, differential privacy requires that for any data sets $Z$ and $Z'$ that differ by only one data point, the algorithms $\mathcal{A}(Z)$ and $\mathcal{A}(Z')$ sample their output $h$ from two distributions that are very similar to each other. Define the “Hamming distance”
$$d(Z, Z') := \#\{\, i = 1, \dots, n : z_i \neq z'_i \,\}.$$
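Definition 1 ($\epsilon$-Differential Privacy [14, 10])
We call an algorithm $\mathcal{A}$ $\epsilon$-differentially private (or in short $\epsilon$-DP), if
$$\mathbb{P}(\mathcal{A}(Z) \in H) \le \exp(\epsilon)\,\mathbb{P}(\mathcal{A}(Z') \in H)$$
for all $Z, Z'$ obeying $d(Z, Z') \le 1$ and any measurable subset $H \subseteq \mathcal{H}$.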
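For mechanisms with densities, the inequality in Definition 1 amounts to a bound on the log density ratio between the output distributions on adjacent data sets. As a sketch, the snippet below checks this for a Laplace mechanism releasing a scalar query. The sensitivity value, the privacy parameter and the output grid are hypothetical choices for illustration.

```python
import math

def laplace_logpdf(x, loc, scale):
    """Log-density of the Laplace distribution with the given location and scale."""
    return -abs(x - loc) / scale - math.log(2.0 * scale)

def worst_log_ratio(loc_a, loc_b, scale, outputs):
    """Largest absolute log density ratio over a grid of candidate outputs."""
    return max(
        abs(laplace_logpdf(h, loc_a, scale) - laplace_logpdf(h, loc_b, scale))
        for h in outputs
    )

sensitivity = 1.0              # assumed: the query changes by at most 1 between adjacent data sets
epsilon = 0.5                  # assumed target privacy parameter
scale = sensitivity / epsilon  # standard Laplace-mechanism calibration

# Query values on two adjacent data sets differ by the full sensitivity.
outputs = [i / 10 for i in range(-200, 201)]
worst = worst_log_ratio(0.0, sensitivity, scale, outputs)
```

The worst-case log ratio equals `epsilon` here, attained at every output outside the interval between the two query values, which is why the guarantee is driven entirely by the most extreme pair of adjacent data sets.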
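More transparently, assuming the range of $\mathcal{A}$ is the whole space $\mathcal{H}$, and also assuming $\mathcal{A}(Z)$ defines a density on $\mathcal{H}$ with respect to a base measure on $\mathcal{H}$ (these assumptions are only for presentation simplicity; the notion of On-Average KL-privacy can naturally handle mixtures of densities and point masses), then $\epsilon$-Differential Privacy requires
$$\sup_{Z, Z' :\, d(Z,Z') \le 1}\ \sup_{h \in \mathcal{H}}\ \log \frac{p_{\mathcal{A}(Z)}(h)}{p_{\mathcal{A}(Z')}(h)} \le \epsilon.$$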
Replacing the second supremum with an expectation over $h \sim \mathcal{A}(Z)$, we get the maximum KL-divergence over the output from two adjacent data sets. This is KL-Privacy as defined in Barber and Duchi [4], and by replacing both supremums with expectations we get what we call On-Average KL-Privacy. For $Z \in \mathcal{Z}^n$ and $z \in \mathcal{Z}$, denote by $[Z_{-1}, z] \in \mathcal{Z}^n$ the data set obtained from replacing the first entry of $Z$ by $z$. Also recall that the KL-divergence between two distributions $F$ and $G$ is $D_{\mathrm{KL}}(F \,\|\, G) = \mathbb{E}_{F} \log \frac{dF}{dG}$.
Definition 2 (On-Average KL-Privacy)
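We say $\mathcal{A}$ obeys $\epsilon$-On-Average KL-privacy for some distribution $\mathcal{D}$ if
$$\mathbb{E}_{Z \sim \mathcal{D}^n,\, z \sim \mathcal{D}}\ D_{\mathrm{KL}}\big(\mathcal{A}(Z) \,\|\, \mathcal{A}([Z_{-1}, z])\big) \le \epsilon.$$
Note that by the property of KL-divergence, the On-Average KL-Privacy is always nonnegative, and the divergence inside the expectation is $0$ if and only if the two distributions are the same almost everywhere. In the above case, this happens when $z = z_1$.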
Unlike differential privacy, which provides a uniform privacy guarantee for any user in $\mathcal{Z}$, On-Average KL-Privacy is a distribution-specific quantity that measures the average privacy loss that an average data point $z \sim \mathcal{D}$ suffers from running the data analysis $\mathcal{A}$ on a data set $Z$ drawn iid from the same distribution $\mathcal{D}$.
We argue that this kind of average privacy protection is practically useful because it is able to adapt to benign distributions and is much less sensitive to outliers. After all, when differential privacy fails to provide a meaningful $\epsilon$ due to peculiar data sets that exist in $\mathcal{Z}^n$ but rarely appear in practice, we would still be interested in gauging how a randomized algorithm protects a typical user’s privacy.
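The definition of On-Average KL-Privacy lends itself to Monte Carlo estimation whenever the KL-divergence between the two output distributions has a closed form. The sketch below assumes a toy setting not taken from the paper: standard normal data and an algorithm that outputs $h \sim N(\bar{Z}, 1/n)$, where $\bar{Z}$ is the sample mean. In that setting the expected KL-divergence works out to $1/n$.

```python
import random

def kl_gauss_same_var(mu_a, mu_b, var):
    """KL( N(mu_a, var) || N(mu_b, var) ) for a shared variance."""
    return (mu_a - mu_b) ** 2 / (2.0 * var)

def on_average_kl(n, trials, rng):
    """Monte Carlo estimate of E_{Z,z} KL( A(Z) || A([Z_{-1}, z]) )."""
    total = 0.0
    for _ in range(trials):
        Z = [rng.gauss(0.0, 1.0) for _ in range(n)]   # data set drawn iid
        z = rng.gauss(0.0, 1.0)                        # fresh replacement point
        mean_Z = sum(Z) / n
        mean_replaced = mean_Z + (z - Z[0]) / n        # mean after replacing the first entry
        total += kl_gauss_same_var(mean_Z, mean_replaced, 1.0 / n)
    return total / trials

n = 20
estimate = on_average_kl(n, trials=20000, rng=random.Random(0))
closed_form = 1.0 / n  # E[(z - z_1)^2] = 2 for unit-variance data, so E[KL] = 2 / (2n)
```

Note that differential privacy gives no finite guarantee for this mechanism on unbounded data, since the log density ratio is unbounded in $|z - z_1|$, whereas the on-average quantity is finite and small.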
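Now we define what we mean by generalization. Let the empirical risk be $\hat{R}(h, Z) = \frac{1}{n}\sum_{i=1}^n \ell(z_i, h)$ and the actual risk be $R(h) = \mathbb{E}_{z \sim \mathcal{D}}\, \ell(z, h)$.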
Definition 3 (On-Average Generalization)
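We say an algorithm $\mathcal{A}$ has on-average generalization error $\epsilon$ if $\big|\mathbb{E}\, R(\mathcal{A}(Z)) - \mathbb{E}\, \hat{R}(\mathcal{A}(Z), Z)\big| \le \epsilon$.
This is slightly weaker than the standard notion of generalization in machine learning, which requires $\mathbb{E}\,\big|R(\mathcal{A}(Z)) - \hat{R}(\mathcal{A}(Z), Z)\big| \le \epsilon$. Nevertheless, on-average generalization is sufficient for the purpose of proving consistency for methods that approximately minimize the empirical risk.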
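The on-average generalization error can also be estimated by simulation. The sketch below assumes a toy setting that is not from the paper: standard normal data, the Gaussian negative log-likelihood as loss, and an algorithm that samples $h \sim N(\bar{Z}, 1/n)$, i.e. a Gibbs distribution under a flat prior. For this setting a direct calculation gives a gap of exactly $1/n$.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def loss(z, h):
    """Negative log-likelihood of z under N(h, 1)."""
    return 0.5 * (z - h) ** 2 + 0.5 * LOG_2PI

def actual_risk(h):
    """E_{z ~ N(0,1)} loss(z, h), available in closed form for this toy model."""
    return 0.5 * (1.0 + h * h) + 0.5 * LOG_2PI

def generalization_gap(n, trials, rng):
    """Monte Carlo estimate of E R(A(Z)) - E R_hat(A(Z), Z)."""
    total = 0.0
    for _ in range(trials):
        Z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        h = rng.gauss(sum(Z) / n, 1.0 / math.sqrt(n))  # posterior sample N(mean, 1/n)
        empirical = sum(loss(z, h) for z in Z) / n
        total += actual_risk(h) - empirical
    return total / trials

n = 20
gap = generalization_gap(n, trials=20000, rng=random.Random(1))
expected_gap = 1.0 / n
```

The additive constant in the loss cancels in the difference, so only the quadratic part contributes to the gap.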
3.1 The equivalence to generalization
It turns out that when $\mathcal{A}$ assumes a special form, that is, sampling from a Gibbs distribution, we can completely characterize the generalization of $\mathcal{A}$ using On-Average KL-Privacy. This class of algorithms includes the most general mechanism for differential privacy, the exponential mechanism [23], which contains many other noise-adding procedures as special cases. We will discuss a more compelling reason why restricting our attention to this class is not limiting in the section on MaxEnt distributions.
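Theorem 4 (On-Average KL-Privacy $\Leftrightarrow$ Generalization)
Let the loss function $\ell(z, h) = -\log p_h(z)$ for some model $p$ parameterized by $h$, and let
$$\mathcal{A}(Z): \quad h \sim p(h \mid Z) \propto \exp\Big(-\sum_{i=1}^n \ell(z_i, h)\Big)\,\pi(h).$$
If in addition $\mathcal{A}$ obeys that for every $Z$, the distribution $p(h \mid Z)$ is well-defined (in that the normalization constant is finite), then $\mathcal{A}$ satisfies $\epsilon$-On-Average KL-Privacy if and only if $\mathcal{A}$ has on-average generalization error $\epsilon$.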
The proof, given in the Appendix, uses a ghost sample trick and the fact that the expected normalization constants of the sampling distribution over $Z$ and $[Z_{-1}, z]$ are the same.
Remark 1 (Structural Loss)
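Take $n = 1$, and let the loss function be $\mathcal{L}$. Then for an algorithm $\mathcal{A}$ that samples $h$ with probability proportional to $\exp(-\mathcal{L}(Z, h))\,\pi(h)$: $\epsilon$-On-Average KL-Privacy is equivalent to $\epsilon$-generalization of the structural loss.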
Remark 2 (Dispersion parameter )
The case when for a constant can be handled by redefining . In that case, -On-Average KL-Privacy with respect to implies generalization with respect to . For this reason, larger may not imply strictly better generalization.
Remark 3 (Comparing to differential Privacy)
3.2 Preservation of other properties of DP
We now show that despite being much weaker than DP, On-Average KL-privacy does inherit some of the major properties of differential privacy (under mild additional assumptions in some cases).
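Lemma 5 (Closedness under Post-processing)
Let $f$ be any (possibly randomized) measurable function from $\mathcal{H}$ to another domain $\mathcal{H}'$, then for any $Z, Z' \in \mathcal{Z}^n$,
$$D_{\mathrm{KL}}\big(f(\mathcal{A}(Z)) \,\|\, f(\mathcal{A}(Z'))\big) \le D_{\mathrm{KL}}\big(\mathcal{A}(Z) \,\|\, \mathcal{A}(Z')\big).$$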
This directly follows from the data processing inequality for the Rényi divergence in Van Erven and Harremoës [33, Theorem 1].
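The data processing inequality behind the post-processing lemma is easy to observe on discrete distributions, where post-processing by a deterministic map simply merges outcomes. The two distributions and the merging map below are arbitrary illustrative choices.

```python
import math

def kl(p, q):
    """KL-divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push_forward(p, f, num_outcomes):
    """Distribution of f(X) when X is distributed as p over indices 0..len(p)-1."""
    out = [0.0] * num_outcomes
    for i, pi in enumerate(p):
        out[f(i)] += pi
    return out

# Hypothetical output distributions of an algorithm on two adjacent data sets.
p = [0.5, 0.2, 0.2, 0.1]
q = [0.1, 0.3, 0.3, 0.3]

merge_pairs = lambda i: i // 2  # post-processing: merge outcomes {0,1} and {2,3}

kl_before = kl(p, q)
kl_after = kl(push_forward(p, merge_pairs, 2), push_forward(q, merge_pairs, 2))
```

Merging outcomes can only discard information that distinguishes the two distributions, so `kl_after` never exceeds `kl_before`.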
Lemma 6 (Small group privacy)
An immediate corollary of the above connection is that we can now significantly simplify the proof for “max-information generalization” for posterior sampling algorithms.
Let be a posterior sampling algorithm. implies that generalizes with rate .
We now compare to mutual information and draw connections to [28].
Definition 8 (Mutual Information)
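The mutual information
$$I(\mathcal{A}(Z); Z) = \mathbb{E}_{Z}\, \mathbb{E}_{h \sim \mathcal{A}(Z)} \log \frac{p(h, Z)}{p(h)\, p(Z)},$$
where $p(h, Z) = p(h \mid Z)\, p(Z)$, and $p(h) = \mathbb{E}_{Z}\, p(h \mid Z)$.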
Lemma 9 (Relationship to Mutual Information)
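For any randomized algorithm $\mathcal{A}$, let $\mathcal{A}(Z)$ be an RV, and let $Z, Z'$ be two independent data sets of size $n$. We have
$$I(\mathcal{A}(Z); Z) = \mathbb{E}_{Z}\, \mathbb{E}_{h \sim \mathcal{A}(Z)} \log p_{\mathcal{A}(Z)}(h) - \mathbb{E}_{Z}\, \mathbb{E}_{h \sim \mathcal{A}(Z)} \log \mathbb{E}_{Z'}\, p_{\mathcal{A}(Z')}(h),$$
which by Jensen’s inequality implies
$$I(\mathcal{A}(Z); Z) \le \mathbb{E}_{Z, Z'}\, D_{\mathrm{KL}}\big(\mathcal{A}(Z) \,\|\, \mathcal{A}(Z')\big).$$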
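The relationship in Lemma 9 can be checked on a small discrete example, where the data set takes finitely many values and the algorithm is a conditional distribution (a channel) over outputs. The sketch below computes the mutual information and compares it with the average KL-divergence between outputs on two independently drawn data sets. The input distribution and the channel are hypothetical.

```python
import math

def kl(p, q):
    """KL-divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(p_z, channel):
    """I(h; Z) = sum_z p(z) KL( p(h|z) || p(h) ) with p(h) = sum_z p(z) p(h|z)."""
    num_outputs = len(channel[0])
    p_h = [sum(pz * row[j] for pz, row in zip(p_z, channel)) for j in range(num_outputs)]
    return sum(pz * kl(row, p_h) for pz, row in zip(p_z, channel))

def average_pairwise_kl(p_z, channel):
    """E_{Z,Z'} KL( p(h|Z) || p(h|Z') ) for independent Z and Z'."""
    return sum(
        pa * pb * kl(row_a, row_b)
        for pa, row_a in zip(p_z, channel)
        for pb, row_b in zip(p_z, channel)
    )

p_z = [0.5, 0.5]                      # two equally likely data sets
channel = [[0.9, 0.1], [0.2, 0.8]]    # row z is the output distribution p(h | z)

mi = mutual_information(p_z, channel)
upper = average_pairwise_kl(p_z, channel)

# An algorithm that ignores its input leaks nothing.
mi_independent = mutual_information(p_z, [[0.3, 0.7], [0.3, 0.7]])
```

The gap between `mi` and `upper` is exactly the slack introduced by Jensen’s inequality, which moves the expectation over $Z'$ outside the logarithm.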
A natural observation is that for MaxEnt defined with , mutual information lower bounds its generalization error. On the other hand, Proposition 1 in Russo and Zou  states that under the assumption that is -subgaussian for every , then the on-average generalization error is always smaller than Similar results hold for sub-exponential [28, Proposition 3].
Note that in their bounds, is the mutual information between the choice of hypothesis and the loss function for which we are defining generalization on. By data processing inequality, we have . Further, when is posterior distribution, it only depends on through , namely is a sufficient statistic for . As a result . Therefore, we know . Combine this observation with Lemma 9 and Theorem 4, we get the following characterization of generalization through mutual information.
Corollary 10 (Mutual information and generalization)
Let be an algorithm that samples , and is -subgaussian for any , then
If is -subexponential with parameter instead, then we have a weaker upper bound .
The corollary implies that for each we have an intriguing bound that says for any distribution of , and such that is -subgaussian. One interesting case is when . This gives
The lower bound is therefore sharp up to a multiplicative factor of .
4 Connections to Other Attempts to Weaken DP
We compare and contrast On-Average KL-Privacy with other notions of privacy that are designed to weaken the original DP. The (certainly incomplete) list includes $(\epsilon,\delta)$-approximate differential privacy (Approx-DP) [13], random differential privacy (Rand-DP) [17], Personalized Differential Privacy (Personal-DP) [15, 22] and Total-Variation-Privacy (TV-Privacy) [4, 5]. Table 1 summarizes and compares these definitions.
A key difference between On-Average KL-Privacy and almost all previous definitions of privacy is that, in the latter, the probability is defined only over the random coins of the private algorithm. For this reason, even if we convert our bound into the high-probability form, the meaning of the small probability $\delta$ would be very different from that in Approx-DP. The only exception in the list is Rand-DP, which assumes, like we do, that the data points in adjacent data sets $Z$ and $Z'$ are drawn iid from a distribution. Ours is weaker than Rand-DP in that ours is a distribution-specific quantity.
Among these notions of privacy, Pure-DP and Approx-DP have been shown to imply generalization with high probability [12, 5], and TV-privacy was shown to imply generalization (in expectation) for a restricted class of queries (loss functions). The relationship between our proposal and these known results is illustrated in Fig. 1. To the best of our knowledge, our result is the first of its kind that crisply characterizes generalization.
Lastly, we would like to point out that while each of these definitions retains some properties of differential privacy, they might not possess all of them simultaneously and satisfactorily. For example, $(\epsilon,\delta)$-approx-DP does not have a satisfactory group privacy guarantee, as $\delta$ grows exponentially with the group size.
5 Experiments
In this section, we validate our theoretical results through numerical simulation. Specifically, we use two simple examples to compare the $\epsilon$ of differential privacy, the $\epsilon$ of On-Average KL-Privacy, the generalization error, as well as the utility, measured in terms of the excess population risk.
The first example is the private release of the mean: we consider to be the mean of samples from standard normal distribution truncated between. Hypothesis space , loss function . samples with probability proportional to . Note that this is the simple Laplace mechanism for differential privacy and the global sensitivity is , as a result this algorithm is -differentially private.
The second example we consider is a simple linear regression in 1D. We generate the data from a simple univariate linear regression model , where and the noise are both sampled iid from a uniform distribution defined on. The true is chosen to be . Moreover, we use the standard square loss . Clearly, the data domain and if we constrain to be within a bounded set , and the posterior sampling with parameter obeys -DP.
Fig. 2 plots the results over an exponential grid of parameter values. In these two examples, we calculate On-Average KL-Privacy using known formulas for the KL-divergence of Laplace and Gaussian distributions. Then we stochastically estimate the expectation over data. We estimate the generalization error in the direct formula by evaluating on fresh samples. As we can see, appropriately scaled On-Average KL-Privacy characterizes the generalization error precisely as the theory predicts. On the other hand, if we just compare the privacy losses, the average $\epsilon$ from a random data set given by On-Avg KL-Privacy is smaller than that for the worst case in DP by orders of magnitude.
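The closed-form KL-divergences mentioned above can be written down and sanity-checked against numerical integration. The sketch below covers the equal-scale Laplace case and the equal-variance Gaussian case only. The location and scale values are arbitrary assumptions, and the code is not the authors' experimental script.

```python
import math

def kl_laplace(mu_a, mu_b, scale):
    """KL( Lap(mu_a, scale) || Lap(mu_b, scale) ) = d + exp(-d) - 1, d = |mu_a - mu_b| / scale."""
    d = abs(mu_a - mu_b) / scale
    return d + math.exp(-d) - 1.0

def kl_gaussian(mu_a, mu_b, var):
    """KL( N(mu_a, var) || N(mu_b, var) ) = (mu_a - mu_b)^2 / (2 var)."""
    return (mu_a - mu_b) ** 2 / (2.0 * var)

def laplace_logpdf(mu, scale):
    return lambda x: -abs(x - mu) / scale - math.log(2.0 * scale)

def gaussian_logpdf(mu, var):
    return lambda x: -(x - mu) ** 2 / (2.0 * var) - 0.5 * math.log(2.0 * math.pi * var)

def kl_numeric(logpdf_a, logpdf_b, lo=-40.0, hi=40.0, steps=160000):
    """Midpoint-rule approximation of the integral of p_a(x) log(p_a(x) / p_b(x))."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        la = logpdf_a(x)
        total += math.exp(la) * (la - logpdf_b(x)) * dx
    return total

lap_closed = kl_laplace(0.0, 0.7, 1.0)
lap_numeric = kl_numeric(laplace_logpdf(0.0, 1.0), laplace_logpdf(0.7, 1.0))
gau_closed = kl_gaussian(0.0, 0.7, 1.0)
gau_numeric = kl_numeric(gaussian_logpdf(0.0, 1.0), gaussian_logpdf(0.7, 1.0))
```

Averaging such closed-form values over randomly drawn pairs of adjacent data sets gives the stochastic estimate of the expectation over data described in the text.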
6 Conclusion
We presented On-Average KL-privacy as a new notion of privacy (or stability) on average. We showed that this new definition preserves properties of differential privacy including closedness under post-processing, small group privacy and adaptive composition. Moreover, we showed that On-Average KL-privacy/stability characterizes a weak form of generalization for a large class of sampling distributions that simultaneously maximize entropy and utility. This equivalence and the connections to certain information-theoretic quantities allowed us to provide the first lower bound on generalization using mutual information. Lastly, we conducted numerical simulations, which confirm our theory and demonstrate a substantially more favorable privacy-utility trade-off.
Appendix 0.A Proofs of technical results
Proof (Proof of Theorem 4)
We prove this result using a ghost sample trick.
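The ghost-sample computation can be written out as follows. This is a sketch in the notation of Theorem 4, with some assumed symbols: $Z' = [Z_{-1}, z]$ denotes the data set whose first entry is replaced by the ghost sample $z \sim \mathcal{D}$, $K_Z$ denotes the normalization constant of $p(h \mid Z)$, and the prior weight is written $\pi(h)$.

```latex
\begin{align*}
\mathbb{E}_{Z,z}\, D_{\mathrm{KL}}\big(\mathcal{A}(Z)\,\|\,\mathcal{A}(Z')\big)
&= \mathbb{E}_{Z,z}\,\mathbb{E}_{h\sim\mathcal{A}(Z)}
   \big[\log p(h\mid Z) - \log p(h\mid Z')\big] \\
&= \mathbb{E}_{Z,z}\,\mathbb{E}_{h\sim\mathcal{A}(Z)}
   \big[\ell(z,h) - \ell(z_1,h)\big]
   + \mathbb{E}_{Z,z}\big[\log K_{Z'} - \log K_{Z}\big] \\
&= \mathbb{E}\,R(\mathcal{A}(Z)) - \mathbb{E}\,\hat{R}(\mathcal{A}(Z), Z)
   + \mathbb{E}_{Z,z}\big[\log K_{Z'} - \log K_{Z}\big].
\end{align*}
```

The second line holds because all loss terms other than those of $z_1$ and $z$, as well as $\log \pi(h)$, cancel in the log ratio. The third line uses that $z$ is independent of $(Z, h)$, so $\mathbb{E}\,\ell(z, h) = \mathbb{E}\,R(h)$, and that the Gibbs distribution treats $z_1, \dots, z_n$ symmetrically, so $\mathbb{E}\,\ell(z_1, h) = \mathbb{E}\,\frac{1}{n}\sum_{i=1}^n \ell(z_i, h)$.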
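Here $K_Z$ and $K_{Z'}$ are the partition functions of $p(h \mid Z)$ and $p(h \mid Z')$ respectively, where $Z' = [Z_{-1}, z]$. Since $Z$ and $Z'$ have the same distribution, we know $\mathbb{E} \log K_Z = \mathbb{E} \log K_{Z'}$. The proof is complete by noting that the On-Average KL-privacy is always non-negative and so is the difference of the actual risk and expected empirical risk (therefore we can take absolute value without changing the equivalence). ∎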
Proof (Proof of Lemma 6)
Let , we have
The technical issue is that the second term does not have the correct distribution to take expectation over. By the property of being a posterior sampling algorithm, we can rewrite the second term of the above equation into
where and are normalization constants of and respectively. The expected log-partition functions are the same so we can replace them with normalization constants of and . By adding and subtracting the missing log-likelihood functions on , we get
This completes the proof for . Apply the same argument recursively by different decompositions of , we get the results for .
The second statement follows by the same argument with all “” changed into “”. ∎
Proof (Proof of the adaptive composition lemma)
Take over and we get the adaptive composition result for KL-Privacy. Take over and such that , we get the adaptive composition result for On-Average KL-Privacy.∎
Proof (Proof of the max-information lemma)
By Lemma 12 in Dwork et al. , .
where and are normalization constants for distribution and respectively.
Take expectation over and on both sides, by symmetry, the expected normalization constants are equal no matter which size subset of this posterior distribution is defined over. Define . Let be the normalization constant of and be the normalization constant of . We get
Collecting the three systems of inequalities above, we get that is -On-Average-KL-Privacy as claimed. ∎
Proof (Proof of Lemma 9)
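Denote $p(h \mid Z) = p_{\mathcal{A}(Z)}(h)$ and $p(h, Z) = p(h \mid Z)\, p(Z)$. The marginal distribution of $h$ is therefore $p(h) = \mathbb{E}_{Z'}\, p(h \mid Z')$. By definition,
$$\begin{aligned}
I(\mathcal{A}(Z); Z) &= \mathbb{E}_{Z}\, \mathbb{E}_{h \mid Z} \log \frac{p(h \mid Z)}{p(h)} \\
&= \mathbb{E}_{Z}\, \mathbb{E}_{h \mid Z} \log p(h \mid Z) - \mathbb{E}_{Z}\, \mathbb{E}_{h \mid Z} \log \mathbb{E}_{Z'}\, p(h \mid Z') \\
&\le \mathbb{E}_{Z}\, \mathbb{E}_{h \mid Z} \log p(h \mid Z) - \mathbb{E}_{Z, Z'}\, \mathbb{E}_{h \mid Z} \log p(h \mid Z') = \mathbb{E}_{Z, Z'}\, D_{\mathrm{KL}}\big(\mathcal{A}(Z) \,\|\, \mathcal{A}(Z')\big).
\end{aligned}$$
The last line follows from Jensen’s inequality. ∎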
References
[1] Akaike, H.: Likelihood of a model and information criteria. Journal of econometrics 16(1), 3–14 (1981)
[2] Altun, Y., Smola, A.: Unifying divergence minimization and statistical inference via convex duality. In: Learning theory, pp. 139–153. Springer (2006)
[3] Anderson, N.: “anonymized” data really isn’t and here’s why not. http://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/ (2009)
[4] Barber, R.F., Duchi, J.C.: Privacy and statistical risk: Formalisms and minimax bounds. arXiv preprint arXiv:1412.4451 (2014)
[5] Bassily, R., Nissim, K., Smith, A., Steinke, T., Stemmer, U., Ullman, J.: Algorithmic stability for adaptive data analysis. arXiv preprint arXiv:1511.02513 (2015)
[6] Berger, A.L., Pietra, V.J.D., Pietra, S.A.D.: A maximum entropy approach to natural language processing. Computational linguistics 22(1), 39–71 (1996)
[7] Bousquet, O., Elisseeff, A.: Stability and generalization. Journal of Machine Learning Research 2, 499–526 (2002)
[8] Duncan, G.T., Elliot, M., Salazar-González, J.J.: Statistical Confidentiality: Principle and Practice. Springer (2011)
[9] Duncan, G.T., Fienberg, S.E., Krishnan, R., Padman, R., Roehrig, S.F.: Disclosure limitation methods and information loss for tabular data. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies pp. 135–166 (2001)
[10] Dwork, C.: Differential privacy. In: Automata, Languages and Programming, pp. 1–12. Springer (2006)
[11] Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A.: Generalization in adaptive data analysis and holdout reuse. In: Advances in Neural Information Processing Systems (NIPS-15). pp. 2341–2349 (2015)
[12] Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A.L.: Preserving statistical validity in adaptive data analysis. In: ACM on Symposium on Theory of Computing (STOC-15). pp. 117–126. ACM (2015)
[13] Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., Naor, M.: Our data, ourselves: Privacy via distributed noise generation. In: Advances in Cryptology-EUROCRYPT 2006, pp. 486–503. Springer (2006)
[14] Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Theory of cryptography, pp. 265–284. Springer (2006)
[15] Ebadi, H., Sands, D., Schneider, G.: Differential privacy: Now it’s getting personal. In: ACM Symposium on Principles of Programming Languages. pp. 69–81. ACM (2015)
[16] Fienberg, S.E., Rinaldo, A., Yang, X.: Differential privacy and the risk-utility tradeoff for multi-dimensional contingency tables. In: Privacy in Statistical Databases. pp. 187–199. Springer (2010)
[17] Hall, R., Rinaldo, A., Wasserman, L.: Random differential privacy. arXiv preprint arXiv:1112.2680 (2011)
[18] Hardt, M., Ullman, J.: Preventing false discovery in interactive data analysis is hard. In: IEEE Symposium on Foundations of Computer Science (FOCS-14). pp. 454–463. IEEE (2014)
[19] Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholt, E.S., Spicer, K., De Wolf, P.P.: Statistical disclosure control. John Wiley & Sons (2012)
[20] Jaynes, E.T.: Information theory and statistical mechanics. Physical review 106(4), 620 (1957)
[21] Kearns, M., Ron, D.: Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation 11(6), 1427–1453 (1999)
[22] Liu, Z., Wang, Y.X., Smola, A.: Fast differentially private matrix factorization. In: ACM Conference on Recommender Systems (RecSys’15). pp. 171–178. ACM (2015)
[23] McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: IEEE Symposium on Foundations of Computer Science (FOCS-07). pp. 94–103 (2007)
[24] Mir, D.J.: Information-theoretic foundations of differential privacy. In: Foundations and Practice of Security, pp. 374–381. Springer (2013)
[25] Mosteller, F., Tukey, J.W.: Data analysis, including statistics (1968)
[26] Mukherjee, S., Niyogi, P., Poggio, T., Rifkin, R.: Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics 25(1-3), 161–193 (2006)
[27] Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: Security and Privacy, 2008. SP 2008. IEEE Symposium on. pp. 111–125. IEEE (2008)
[28] Russo, D., Zou, J.: Controlling bias in adaptive data analysis using information theory. In: International Conference on Artificial Intelligence and Statistics (AISTATS-16) (2016)
[29] Shalev-Shwartz, S., Shamir, O., Srebro, N., Sridharan, K.: Learnability, stability and uniform convergence. Journal of Machine Learning Research 11, 2635–2670 (2010)
[30] Steinke, T., Ullman, J.: Interactive fingerprinting codes and the hardness of preventing false discovery. arXiv preprint arXiv:1410.1228 (2014)
[31] Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. arXiv preprint physics/0004057 (2000)
[32] Uhler, C., Slavković, A., Fienberg, S.E.: Privacy-preserving data sharing for genome-wide association studies. The Journal of privacy and confidentiality 5(1), 137 (2013)
[33] Van Erven, T., Harremoës, P.: Rényi divergence and Kullback-Leibler divergence. Information Theory, IEEE Transactions on 60(7), 3797–3820 (2014)
[34] Wang, Y.X., Fienberg, S.E., Smola, A.: Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In: International Conference on Machine Learning (ICML-15) (2015)
[35] Wang, Y.X., Lei, J., Fienberg, S.E.: Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle. Journal of Machine Learning Research (2016), to appear
[36] Yau, N.: Lessons from improperly anonymized taxi logs. http://flowingdata.com/2014/06/23/lessons-from-improperly-anonymized-taxi-logs/ (2014)
[37] Yu, F., Fienberg, S.E., Slavković, A.B., Uhler, C.: Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of biomedical informatics 50, 133–141 (2014)
[38] Zhou, S., Lafferty, J., Wasserman, L.: Compressed and privacy-sensitive sparse regression. Information Theory, IEEE Transactions on 55(2), 846–866 (2009)