The fundamental problem of distribution learning concerns the design of algorithms (i.e., estimators) that, given samples generated from an unknown distribution , output an “approximation” of . While the literature on distribution learning is vast and has a long history dating back to the late nineteenth century, the problem of distribution learning under privacy constraints is relatively new and unexplored.
In this paper, we work with the notion of differential privacy which was introduced by Dwork et al. [DworkMNS06] as a rigorous and practical notion of data privacy. Roughly speaking, differential privacy guarantees that no single data point can influence the output of an algorithm too much, which intuitively provides privacy by “hiding” the contribution of each individual. Differential privacy is the de facto standard for modern private analysis which has seen widespread impact in both industry and government [ErlingssonPK14, BittauEMMRLRKTS17, DingKY17, AppleDP17, DajaniLSKRMGDGKKLSSVA17].
In recent years, there has been a flurry of activity in differentially private distribution learning. A number of techniques have been developed in the literature for this problem. In the pure differentially private setting, Bun et al. [BunKSW19] recently introduced a method to learn a class of distributions when the class admits a finite cover, i.e. when the entire class of distributions can be well-approximated by a finite number of representative distributions. In fact, they show that this is an exact characterization of distributions which can be learned under pure differential privacy in the sense that a class of distributions is learnable under pure differential privacy if and only if the class admits a finite cover [HardtT10, BunKSW19]. As a consequence of this result, they obtained pure differentially private algorithms for learning Gaussian distributions provided that the means of the Gaussians are bounded and the covariance matrices of the Gaussians are spectrally bounded. (Footnote 1: When we say that a matrix is spectrally bounded, we mean that there are such that .) Moreover, such restrictions on the Gaussians are necessary under the constraint of pure differential privacy.
One way to remove the requirement of having a finite cover is to relax to a weaker notion of privacy known as approximate differential privacy. With this notion, Bun et al. [BunKSW19] introduced another method to learn a class of distributions that, instead of requiring a finite cover, requires a “locally small” cover, i.e. a cover where each distribution in the class is well-approximated by only a small number of elements within the cover. They prove that the class of Gaussians with arbitrary mean and a fixed, known covariance matrix has a locally small cover which implies an approximate differentially private algorithm to learn this class of distributions. Later, Aden-Ali, Ashtiani, and Kamath [Aden-AliAK21] proved that the class of mean-zero Gaussians (with no assumptions on the covariance matrix) admits a locally small cover. This can then be used to obtain an approximate differentially private algorithm to learn the class of all Gaussians.
It is a straightforward observation that if a class of distributions admits a finite cover then the class of its mixtures also admits a finite cover. Combined with the aforementioned work of Bun et al., this implies a pure differentially private algorithm for learning mixtures of Gaussians with bounded mean and spectrally bounded covariance matrices. It is natural to wonder whether an analogous statement holds for locally small covers. In other words, if a class of distributions admits a locally small cover then does the class of mixtures also admit a locally small cover? If so, this would provide a fruitful direction to design differentially private algorithms for learning mixtures of arbitrary Gaussians. Unfortunately, there are simple examples of classes of distributions that admit a locally small cover yet whose mixtures do not. This leaves open the question of designing private algorithms for many classes of distributions that are learnable in the non-private setting. One concrete open problem is for the class of mixtures of two arbitrary univariate Gaussian distributions. A more general problem is private learning of mixtures of axis-aligned (or general) Gaussian distributions.
1.1 Main Results
We demonstrate that it is indeed possible to privately learn mixtures of unbounded univariate Gaussians. More generally, we give sample complexity upper bounds for learning mixtures of unbounded -dimensional axis-aligned Gaussians. In the following theorem and the remainder of the paper, denotes the number of samples that is given to the algorithm.
Theorem 1.1 (Informal).
Let and .
The sample complexity of learning a mixture of -dimensional axis-aligned Gaussians to -accuracy in total variation distance under -differential privacy and success probability is
The formal statement of this theorem can be found in Theorem 5.1. We note that the condition on is standard in the differential privacy literature. Indeed, for useful privacy, should be “cryptographically small”, i.e., .
Even for the univariate case, our result is the first sample complexity upper bound for learning mixtures of Gaussians under differential privacy for which the variances are unknown and the parameters of the Gaussians may be unbounded. In the non-private setting, it is known that samples are necessary and sufficient to learn an axis-aligned Gaussian in [SureshOAJ14, AshtianiBHLMP20]. In the private setting, the best known sample complexity lower bound is under -DP when [KamathLSU19]. Obtaining improved upper or lower bounds in this setting remains an open question.
If the covariance matrix of each component of the mixture is the same and known or, without loss of generality, equal to the identity matrix, then we can improve the dependence on the parameters and obtain a result that is in line with the non-private setting.
Theorem 1.2 (Informal).
Let and . The sample complexity of learning a mixture of -dimensional Gaussians with identity covariance matrix to -accuracy in total variation distance under -differential privacy and success probability is
We relegate the formal statement and the proof of this theorem to the appendix (see Appendix E). Note that the work of [NissimRS07] implies an upper bound of for private learning of the same class albeit in the incomparable setting of parameter estimation.
Comparison with locally small covers.
While the results in [BunKSW19, Aden-AliAK21] for learning Gaussian distributions under approximate differential privacy do not yield finite-time algorithms, they do give strong information-theoretic upper bounds. This is achieved by showing that certain classes of Gaussians admit locally small covers. It is thus natural to ask whether it is possible to use this approach based on locally small covers to obtain sharper upper bounds than our main result. Unfortunately, we cannot hope to do so because it is not possible to construct locally small covers for mixture classes in general. While univariate Gaussians admit locally small covers [BunKSW19], the following simple example shows that mixtures of univariate Gaussians do not.
Proposition 1.3 (Informal version of Proposition B.6).
No cover for the class of mixtures of two univariate Gaussians is locally small.
To prove our result, we devise a novel technique which reduces the problem of privately learning mixture distributions to the problem of private list-decodable learning of distributions. The framework of list-decodable learning was introduced by Balcan, Blum, and Vempala [BalcanBS08] and Balcan, Röglin, and Teng [BalcanRT09] in the context of clustering but has since been studied extensively in the literature in a number of different contexts [CharikarSV17, DiakonikolasKS18b, KarmalkarKK19, CherapanamjeriMY20, DiakonikolasKK20, RaghavendraY20, RaghavendraY20b, BakshiK21]. The problem of list-decodable learning of distributions is as follows. There is a distribution of interest that we are aiming to learn. However, we do not receive samples from ; rather we receive samples from a corrupted distribution where is some arbitrary distribution. In our application, will be quite close to . In other words, most of the samples are corrupted. The goal in list-decodable learning is to output a short list of distributions with the requirement that is close to at least one of the ’s. The formal definition of list-decodable learning can be found in Definition 2.8. Informally, the reduction can be summarized by the following theorem which is formalized in Section 3.
Theorem 1.4 (Informal).
If a class of distributions is privately list-decodable then mixtures of distributions from are privately learnable.
Roughly speaking, the reduction from learning mixtures of distributions to list-decodable learning works as follows. Suppose that there is an unknown distribution which is a mixture of distributions . A list-decodable learner would then receive samples from as input and output a short list of distributions so that for every there is some element in that is close to . In particular, some mixture of distributions from must be close to the true distribution . Since is a small finite set, the set of possible mixtures must also be relatively small. This last observation allows us to make use of private hypothesis selection which selects a good hypothesis from a small set of candidate hypotheses [BunKSW19, Aden-AliAK21]. In Section 3, we formally describe the aforementioned reduction. We note that a similar connection between list-decodable learning and learning mixture distributions was also used by Diakonikolas et al. [DiakonikolasKS18b]. However, our reduction is focused on the private setting.
The reduction shows that to privately learn mixtures, it is sufficient to design differentially private list-decodable learning algorithms that work for (corrupted versions of) the individual mixture components. To devise list-decodable learners for (corrupted) univariate Gaussians, we utilize “stability-based” histograms [KorolovaKMN09, BunNS16] that satisfy approximate differential privacy.
To design a list-decodable learner for corrupted univariate Gaussians, we follow a three-step approach that is inspired by the seminal work of Karwa and Vadhan [KarwaV18]. First, we use a histogram to output a list of variances one of which approximates the true variance of the Gaussian. As a second step, we would like to output a list of means which approximate the true mean of the Gaussian. This can be done using histograms provided that we roughly know the variance of the Gaussian. Since we have candidate variances from the first step, we can use a sequence of histograms where the width of the bins of each of the histograms is determined by the candidate variances from the first step. As a last step, using the candidate variances and means from the first two steps, we are able to construct a small set of distributions one of which approximates the true Gaussian to within accuracy . In the axis-aligned Gaussians setting, we use our solution for the univariate case as a subroutine on each dimension separately. Now that we have a list-decodable learner for axis-aligned Gaussians, we use our reduction to obtain a private learning algorithm for learning mixtures of axis-aligned Gaussians.
1.3 Open Problems
Many interesting open problems remain for privately learning mixtures of Gaussians. The simplest problem is to understand the exact sample complexity (up to constants) for learning mixtures of univariate Gaussians under approximate differential privacy. We make the following conjecture based on known bounds for privately learning a single Gaussian [KarwaV18].
Conjecture 1.5 (Informal).
The sample complexity of learning a mixture of , univariate Gaussians to within total variation distance with high probability under -DP is
Another wide open question is whether it is even possible to privately learn mixtures of high-dimensional Gaussians when each Gaussian can have an arbitrary covariance matrix. We believe it is possible, and make the following conjecture, again based on known results for privately learning a single high-dimensional Gaussian with no assumptions on the parameters [BunKSW19, Aden-AliAK21].
Conjecture 1.6 (Informal).
The sample complexity of learning a mixture of , -dimensional Gaussians to within total variation distance with high probability under -DP is
1.4 Additional Related Work
Recently, [BunKSW19] showed how to learn spherical Gaussian mixtures where each Gaussian component has bounded mean under pure differential privacy. Acharya, Sun and Zhang [AcharyaSZ20] were able to obtain lower bounds in the same setting that nearly match the upper bounds of Bun, Kamath, Steinke and Wu [BunKSW19]. Both [NissimRS07, KamathSSU19] consider differentially private learning of Gaussian mixtures; however, their focus is on parameter estimation and they therefore require additional assumptions such as separation or boundedness of the components.
There has been a flurry of activity on differentially private distribution learning and parameter estimation in recent years for many problem settings [NissimRS07, BunUV14, DiakonikolasHS15, SteinkeU17a, SteinkeU17b, DworkSSUV15, BunSU17, KarwaV18, KamathLSU19, CaiWZ19, BunKSW19, DuFMBG20, AcharyaSZ20, KamathSU20, BiswasDKU20, LiuKKO21]. There has also been a lot of work in the locally private setting [DuchiJW17, WangHWNXYLQ16, KairouzBR16, AcharyaSZ19, DuchiR18, DuchiR19, JosephKMW19, YeB18, GaboardiRS19]. Other work on differentially private estimation includes [DworkL09, Smith11, BarberD14, AcharyaSZ18, BunS19, CanonneKMUZ19, ZhangKKW20]. For a more comprehensive review of differentially private statistics, see [KamathU20].
For any , denotes the set . Let denote a random variable sampled from the distribution . Let denote an i.i.d. random sample of size from distribution . For a vector, we refer to the th element of vector as . For any , we define the -dimensional probability simplex to be . For a vector and a positive semidefinite matrix , we use to denote the multivariate normal distribution with mean and covariance matrix .
We define to be the class of univariate Gaussians and to be the class of axis-aligned Gaussians.
Definition 2.1 (-net).
Let be a metric space. A set is an -net for under the metric if for all , there exists such that .
Proposition 2.2.
For any and , there exists an -net of under the -norm of size at most .
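To make the notion of a net of the probability simplex concrete, here is a minimal sketch of one standard grid construction. It assumes the sup-norm (l-infinity) as the metric; the function names and the parameter `m = ceil(1/eps)` are illustrative choices, not taken from the paper, and the size of this grid is not claimed to match the bound stated above.

```python
from itertools import combinations
from math import ceil


def simplex_grid_net(k, eps):
    """Return all points of the simplex in R^k whose coordinates are
    multiples of 1/m, where m = ceil(1/eps) (hypothetical construction)."""
    m = ceil(1 / eps)
    net = []
    # Stars and bars: place k-1 bars among m + k - 1 slots.
    for bars in combinations(range(m + k - 1), k - 1):
        prev = -1
        counts = []
        for b in bars:
            counts.append(b - prev - 1)
            prev = b
        counts.append(m + k - 2 - prev)
        net.append(tuple(c / m for c in counts))
    return net


def round_to_net(w, eps):
    """Round a simplex point w to a grid point at sup-norm distance
    at most 1/m <= eps."""
    m = ceil(1 / eps)
    floors = [int(x * m) for x in w]
    leftover = m - sum(floors)
    # Hand the leftover units to the largest fractional parts.
    order = sorted(range(len(w)), key=lambda i: w[i] * m - floors[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return tuple(c / m for c in floors)
```

Rounding changes each coordinate by less than 1/m, which is why the grid is a net at scale `eps` in the sup-norm.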
Definition 2.3 ().
Let be a class of probability distributions. Then the class of -mixtures of , written , is defined as
2.1 Distribution Learning
A distribution learning method is a (potentially randomized) algorithm that, given a sequence of i.i.d. samples from a distribution , outputs a distribution as an estimate of . The focus of this paper is on absolutely continuous probability distributions (distributions that have a density with respect to the Lebesgue measure), so we refer to a probability distribution and its probability density function interchangeably. The specific measure of “closeness” between distributions that we use is the total variation (TV) distance.
Let and be two probability distributions defined over and let be the Borel sigma-algebra on . The total variation distance between and is defined as
where denotes the probability measure that assigns to . Moreover, if is a set of distributions over a common domain, we define .
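As a numerical illustration of the total variation distance for univariate Gaussians, the sketch below uses the standard fact that, for distributions with densities, the TV distance equals half the L1 distance between the densities. The function names, the integration range of 12 standard deviations, and the step count are assumptions made for this example only.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and std sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))


def tv_numeric(mu1, s1, mu2, s2, steps=200_000):
    """Approximate 0.5 * integral |p - q| with a midpoint Riemann sum."""
    lo = min(mu1 - 12 * s1, mu2 - 12 * s2)
    hi = max(mu1 + 12 * s1, mu2 + 12 * s2)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += abs(normal_pdf(x, mu1, s1) - normal_pdf(x, mu2, s2))
    return 0.5 * total * h


def tv_equal_variance(mu1, mu2, sigma):
    """Closed form when both Gaussians share sigma:
    2 * Phi(|mu1 - mu2| / (2 sigma)) - 1, written with erf."""
    return erf(abs(mu1 - mu2) / (2 * sqrt(2) * sigma))
```

For two unit-variance Gaussians whose means differ by one standard deviation, both routines give a distance of about 0.383.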
We now formally define a PAC learner.
Definition 2.5 (PAC learner).
We say Algorithm is a PAC-learner for a class of distributions which uses samples, if for every , every , and every the following holds: if the algorithm is given parameters and a sequence of i.i.d. samples from as inputs, then it outputs an approximation such that with probability at least . (Footnote 2: The probability is over samples drawn from and the randomness of the algorithm.)
We work with a standard additive corruption model often studied in the list-decodable setting that is inspired by the work of Huber [Huber64]. In this model, a sample is drawn from a distribution of interest with some probability, and with the remaining probability is drawn from an arbitrary distribution. Our list-decodable learners take samples from these “corrupted” distributions as input.
Definition 2.6 (-corrupted distributions).
Fix some distribution and let . We define a -corrupted distribution of as any distribution such that
for an arbitrary distribution . We define to be the set of all -corrupted distributions of .
Observe that is monotone increasing in , i.e. for all . To see this, note that if then we can also rewrite
where . Hence, .
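The monotonicity argument can be checked numerically on discrete distributions. The sketch below assumes the additive model described above, in which a corrupted distribution is the clean distribution `g` with weight `1 - gamma` plus an arbitrary distribution `h` with weight `gamma`; the names `g`, `h`, `gamma`, `gamma_prime`, and both functions are illustrative and not from the paper.

```python
def mix(a, wa, b, wb):
    """Pointwise combination wa * a + wb * b of two pmfs given as dicts."""
    keys = set(a) | set(b)
    return {x: wa * a.get(x, 0.0) + wb * b.get(x, 0.0) for x in keys}


def recorrupt(g, h, gamma, gamma_prime):
    """Return h2 such that (1 - gamma) g + gamma h equals
    (1 - gamma_prime) g + gamma_prime h2, for gamma <= gamma_prime."""
    if not (0.0 <= gamma <= gamma_prime <= 1.0) or gamma_prime == 0.0:
        raise ValueError("need 0 <= gamma <= gamma_prime <= 1 and gamma_prime > 0")
    # Move the extra (gamma_prime - gamma) mass of g into the noise part.
    return mix(g, (gamma_prime - gamma) / gamma_prime, h, gamma / gamma_prime)
```

The returned `h2` is itself a probability distribution because its two weights are non-negative and sum to one, so a distribution that is corrupted at level `gamma` is also corrupted at any larger level.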
We note that in this work, we will most often deal with -corrupted distributions where is quite close to ; in other words, the vast majority of the samples are corrupted.
Now we define list-decodable learning. In this setting, the goal is to learn a distribution given samples from a -corrupted distribution of . Since is close to , instead of finding a single distribution that approximates , our goal is to output a list of distributions, one of which is accurate. This turns out to be a useful primitive to design algorithms for learning mixture distributions.
Definition 2.8 (list-decodable learner).
We say algorithm is an -list-decodable learner for a class of distributions using samples if for every , , , and , the following holds: given parameters and a sequence of i.i.d. samples from as inputs, outputs a set of distributions with such that with probability no less than we have .
2.2 Differential Privacy
Let be the set of all datasets of arbitrary size over a domain set . We say two datasets are neighbours if and differ by at most one data point. Informally, an algorithm is differentially private if its outputs on neighbouring datasets are similar. Formally, differential privacy (DP) has the following definition. (Footnote 3: We will use the acronym DP to refer to both the terms “differential privacy” and “differentially private”. Which term we are using will be clear from the specific sentence.)
Definition 2.9 ([DworkMNS06, DworkKMMN06]).
A randomized algorithm is -differentially private if for all , for all neighbouring datasets , and for all measurable subsets ,
If , we say that is -differentially private.
We refer to -DP as pure DP, and -DP for as approximate DP. We make use of the following property of differentially private algorithms which asserts that adaptively composing differentially private algorithms remains differentially private. By adaptive composition, we mean that we run a sequence of algorithms where the choice of algorithm may depend on the outputs of .
Lemma 2.10 (Composition of DP [DworkMNS06, DworkRV10]).
If is an adaptive composition of differentially private algorithms then the following two statements hold:
If are -differentially private, then is -differentially private for
If are -differentially private for some , then for any , is -differentially private for
The first statement in Lemma 2.10 is often referred to as basic composition and the second statement is often referred to as advanced composition. We also make use of the fact that post-processing the output of a differentially private algorithm does not impact privacy.
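For a sense of scale, the sketch below evaluates the two composition rules in their commonly cited forms: privacy parameters add up under basic composition, and the advanced composition bound follows the usual statement attributed to [DworkRV10]. The exact expressions in Lemma 2.10 are not recoverable from the text here, so the constants below are an assumption, as are the function and parameter names.

```python
from math import exp, log, sqrt


def basic_composition(params):
    """params: list of (eps_i, delta_i). Returns the summed (eps, delta)."""
    eps = sum(e for e, _ in params)
    delta = sum(d for _, d in params)
    return eps, delta


def advanced_composition(eps, delta, k, delta_slack):
    """k-fold composition of (eps, delta)-DP algorithms, commonly cited form:
    eps_total = sqrt(2 k ln(1/delta_slack)) * eps + k * eps * (e^eps - 1),
    delta_total = k * delta + delta_slack."""
    eps_total = sqrt(2 * k * log(1 / delta_slack)) * eps + k * eps * (exp(eps) - 1)
    delta_total = k * delta + delta_slack
    return eps_total, delta_total
```

With 10,000 compositions at `eps = 0.01`, basic composition gives a total of 100, while the advanced bound with `delta_slack = 1e-6` gives roughly 6.3.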
Lemma 2.11 (Post Processing).
If is -differentially private, and is any randomized function, then the algorithm is -differentially private.
We now define -DP PAC learners and -DP -List-Decodable learners.
Definition 2.12 (-DP PAC learner).
We say algorithm is an -DP PAC learner for a class of distributions that uses samples if:
Algorithm is a PAC Learner for that uses samples.
Algorithm satisfies -DP.
Definition 2.13 (-DP list-decodable learner).
We say algorithm is an -DP -list-decodable learner for a class of distributions that uses samples if:
Algorithm is a -list-decodable learner for that uses samples.
Algorithm satisfies -DP.
3 List-decodability and Learning Mixtures
In this section, we describe our general technique which reduces the problem of private learning of mixture distributions to private list-decodable learning of distributions. We show that if we have a differentially private list-decodable learner for a class of distributions then this can be transformed, in a black-box way, to a differentially private PAC learner for the class of mixtures of such distributions. In the next section, we describe private list-decodable learners for the class of Gaussians and thereby obtain private algorithms for learning mixtures of Gaussians.
First, let us begin with some intuition in the non-private setting. Suppose that we have a distribution which can be written as . Then we can view as a -corrupted distribution of for each . Any list-decodable algorithm that receives samples from as input is very likely to output a candidate set which contains distributions that are close to for each . Hence, if we let , then must be close to some distribution in . The only remaining task is to find a distribution in that is close to ; this final task is known as hypothesis selection and has a known solution [DevroyeL01]. We note that the above argument can be easily generalized to the setting where is a non-uniform mixture, i.e. where .
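The observation that a mixture is a corrupted version of each of its components can be spelled out as a small computation. In this sketch, `weights` are the mixing weights, the corruption level for component `i` is taken to be one minus its weight, and the remaining components are renormalised to form the "noise" distribution; the function name and return format are hypothetical.

```python
def corruption_view(weights, i):
    """View the mixture sum_j w_j f_j as (1 - gamma) f_i + gamma * h, where
    h = sum over j != i of residual[j] * f_j.
    Returns (gamma, residual)."""
    gamma = 1.0 - weights[i]
    if gamma == 0.0:
        # The mixture is just f_i, so there is no noise component.
        return 0.0, {}
    residual = {j: w / gamma for j, w in enumerate(weights) if j != i}
    return gamma, residual
```

For a uniform mixture of four components, every component sees a corruption level of 0.75, with the other three components sharing the noise part equally.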
The above establishes a blueprint that we can follow in order to obtain a private learner for mixture distributions. In particular, we aim to come up with a private list-decoding algorithm which receives samples from to produce a set . Thereafter, one can construct a candidate set as mixtures of distributions from . Note that this step does not access the samples and therefore maintains privacy. In order to choose a good candidate from , we make use of private hypothesis selection [BunKSW19, Aden-AliAK21].
We now formalize the above argument. Algorithm 1 shows how a list-decodable learner can be used as a subroutine for learning mixture distributions. In the algorithm, we also make use of a subroutine for private hypothesis selection [BunKSW19, Aden-AliAK21]. In hypothesis selection, an algorithm is given i.i.d. sample access to some unknown distribution as well as a list of distributions to pick from. The goal of the algorithm is to output a distribution in the list that is close to the unknown distribution.
Lemma 3.1 ([Aden-AliAK21],Theorem 27).
Let . There exists an -DP algorithm with the following property: for every , and every set of distributions , when PHS is given , and a dataset of i.i.d. samples from an unknown (arbitrary) distribution as input, it outputs a distribution such that
with probability no less than so long as
We now formally relate the two problems via the theorem below.
Theorem 3.2.
Let and . Suppose that is -DP -list-decodable using samples. Then Algorithm 1 is an -DP PAC learner for that uses
We begin by briefly showing that Algorithm 1 satisfies -DP before arguing about its accuracy.
We now proceed to show that Algorithm 1 PAC learns . In step 1 of Algorithm 1, we use the -DP -list-decodable learner to obtain a set of distributions of size at most . Note that for any mixture component , is a -corrupted distribution of since
Let denote the set of non-negligible components. We first show that for any non-negligible component , there exists that is close to .
If then for all with probability at least .
Steps 1 and 1 of Algorithm 1 construct a candidate set of mixture distributions using and a net of the probability simplex . The next claim shows that as long as is small for every non-negligible , is small as well.
If for every , then . In addition, .
Step 1 constructs a set which is an -net of the probability simplex in the -norm. By the hypothesis of the claim, for each , there exists such that . Recall that . Let such that . Now let . Note that . Moreover, a straightforward calculation shows that (see Proposition C.1 for the detailed calculations). This proves that .
Lastly, to bound we have . Note that since it is the output of an -list-decodable learner and by Proposition 2.2. This implies the claimed bound on . ∎
The only remaining step is to select a good hypothesis from . This is achieved using the private hypothesis selection algorithm from Lemma 3.1 which guarantees that step 1 of Algorithm 1 returns satisfying with probability as long as
Finally, the claimed sample complexity bound follows from the samples required to construct (which follows from Claim 3.3) and the samples required for private hypothesis selection which is given in Eq. (1). ∎
This reduction is quite useful because it is conceptually much simpler to devise list-decodable learners for a given class . In what follows, we will devise such list-decodable learners for certain classes and use Theorem 3.2 to obtain private PAC learners for mixtures of these classes.
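The candidate-construction step of the reduction can be sketched as follows. This is an illustration under stated assumptions rather than Algorithm 1 itself: `components` stands for the list returned by a list-decodable learner (represented here as density functions), `weight_net` stands for a net of the probability simplex, and both function names are hypothetical. The step never touches the samples, so by post-processing it does not affect privacy.

```python
from itertools import product


def candidate_mixtures(components, weight_net, k):
    """Enumerate every (weights, k-tuple of components) pair.
    The result has len(weight_net) * len(components) ** k entries."""
    return [
        (w, comps)
        for w in weight_net
        for comps in product(components, repeat=k)
    ]


def mixture_pdf(candidate, x):
    """Evaluate the density of a candidate mixture at x."""
    weights, comps = candidate
    return sum(wi * pdf(x) for wi, pdf in zip(weights, comps))
```

A hypothesis selection routine would then be run over the returned candidates; the size of this set is what drives the sample cost of that final step.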
4 Learning Mixtures of Univariate Gaussians
Let be the class of all univariate Gaussians. In this section we consider the problem of privately learning univariate Gaussian mixtures, . In the previous section, we showed that it is sufficient to design private list-decodable learners for univariate Gaussians. As a warm-up and to build intuition about our techniques, we begin with the simpler problem of constructing private list-decodable learners for Gaussians with a single known variance . In what follows, we often use “tilde” (e.g. ) to denote sets that are meant to be coarse, or constant, approximations and “hat” (e.g. ) to denote sets that are meant to be fine, say , approximations.
4.1 Warm-up: Learning Gaussian Mixtures with a Known, Shared Variance
In this sub-section we will construct a private list-decodable learner for univariate Gaussians with a known variance . A useful algorithmic primitive that we will use throughout this section and the next is the stable histogram algorithm.
Lemma 4.1 (Histogram learner [KorolovaKMN09, BunNS16]).
Let , and . Let be a dataset of points over a domain . Let be a countable index set and be a collection of disjoint bins defined on , i.e. and for . Finally, let . There is an -DP algorithm that takes as input parameters , dataset and bins , and outputs estimates such that for all ,
with probability no less than so long as
For any fixed we define to be the set of all univariate Gaussians with variance . For the remainder of this section, we let and . (Recall that means that for some distribution .) Algorithm 2 shows how we privately output a list of real numbers, one of which is close to the mean of given samples from .
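The stable histogram primitive of Lemma 4.1 can be sketched as follows. This is one common variant of a stability-based histogram, not necessarily the exact algorithm of [KorolovaKMN09, BunNS16]: Laplace noise is added only to non-empty bins, and a bin is released only if its noisy count clears a threshold. The noise scale `2/eps`, the threshold `1 + 2 ln(2/delta)/eps`, and all names are assumptions made for illustration; a production implementation would also need care with floating-point noise and with the order in which bins are reported.

```python
import random
from collections import Counter
from math import log


def stable_histogram(data, bin_of, eps, delta, rng=None):
    """Return {bin: estimated frequency} for bins whose noisy count is large.
    bin_of maps a data point to a hashable bin index."""
    rng = rng or random.Random()
    n = len(data)
    counts = Counter(bin_of(x) for x in data)
    scale = 2.0 / eps                                 # assumed Laplace scale
    threshold = 1.0 + 2.0 * log(2.0 / delta) / eps    # assumed cutoff
    released = {}
    for b, c in counts.items():                       # empty bins are never visited
        # Difference of two exponentials is Laplace-distributed with this scale.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy = c + noise
        if noisy >= threshold:
            released[b] = noisy / n
    return released
```

Because empty bins are never examined, the index set of bins can be countably infinite, which is what allows the histogram to cover the whole real line.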
The following lemma shows that the output of Algorithm 2 is a list of real numbers with the guarantee that at least one element in the list is close to the true mean of a Gaussian which has been corrupted. Note that the lemma assumes the slightly weaker condition where the algorithm receives an approximation to the standard deviation instead of the true standard deviation. This additional generality is used in the next section.
Lemma 4.2.
Algorithm 2 is an -DP algorithm such that for any and , when it is given parameters , , and dataset of i.i.d. samples from as input, it outputs a set of real numbers of size
Furthermore, with probability no less than there is an element such that
so long as
Let us begin by gathering several straightforward observations about the algorithm. Let be the probability that a sample drawn from lands in bin . Let be the actual number of samples drawn from that have landed in . Let . It is a simple calculation to check that . Thus, we would like to show that or, equivalently, that . As a first step, we show that many samples actually land in bin .
If then with probability at least .
A standard Chernoff bound (Lemma A.3) implies that with probability at least provided for some constant . As this implies . ∎
Next, we claim that the output of the stable histogram approximately preserves the weight of all the bins and, moreover, that the output does not have too many heavy bins. The first assertion implies that since bin is heavy, the stable histogram also determines that bin is heavy. The second assertion implies that the algorithm does not fail. Let be the output of the stable histogram, as defined in Algorithm 2.
If then with probability , we have (i) for all and (ii) .
The first assertion directly follows from Lemma 4.1 with . In the event that , we now show that . Note that it suffices to argue that if then . Since , this implies that . Indeed, we argue the contrapositive. If then and, hence, . ∎
Proof of Lemma 4.2.
We briefly prove that the algorithm is private before proceeding to the other assertions of the lemma.
Bound on .
For the bound on , observe that if then the algorithm fails so deterministically.
Let be as defined in the statement of the lemma. We now show that there exists such that . Let . For the remainder of the proof, we assume that .
Claim 4.3 asserts that, with probability , we have . Claim 4.4 asserts that, with probability , and that . By a union bound, with probability , we have that and the algorithm does not fail. This implies that so . Finally, note that where the last inequality uses the assumption that . ∎
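The idea behind the mean decoder can be illustrated with a non-private sketch. It assumes bins of width equal to the (approximate) standard deviation and reports the centre of every bin holding at least a given fraction of the data; the function name, the bin layout, and the `min_fraction` parameter are hypothetical and do not reproduce Algorithm 2. The choice of bin width rests on a standard fact: an interval of length one standard deviation that contains the mean of a Gaussian has probability at least about 0.34 under it.

```python
from collections import Counter
from math import floor


def heavy_bin_centres(data, sigma, min_fraction):
    """Bin the data into intervals [j * sigma, (j + 1) * sigma) and return
    the centres of all bins holding at least min_fraction of the points."""
    n = len(data)
    counts = Counter(floor(x / sigma) for x in data)
    return sorted(
        (j + 0.5) * sigma
        for j, c in counts.items()
        if c / n >= min_fraction
    )
```

Replacing the exact counts with the estimates of a stable histogram gives a private variant of the same idea, since the selection of heavy bins is then post-processing.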
For any and , there is an -DP -list-decodable learner for with known where , and the number of samples used is
The algorithm is simple; we run Univariate-Mean-Decoder and obtain the set . Let be an -net of the set of intervals of size , i.e.
We then return . Finally, Lemma 4.2 and post-processing (Lemma 2.11) imply that the algorithm is -DP while Lemma 4.2 and Proposition A.1 imply the accuracy guarantee. (Footnote 4: Note that we can only use Proposition A.1 for target as large as . For any target , we can simply run the algorithm with .) ∎
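The refinement from coarse to fine candidates in this proof can be sketched as a grid around each coarse mean. The sketch assumes each coarse candidate is within `radius * sigma` of the true mean and uses grid spacing `eps * sigma`; the function name, the `radius` parameter, and the spacing are illustrative assumptions, not the constants used in the paper.

```python
from math import ceil


def refine_means(coarse_means, sigma, eps, radius=1.0):
    """Around each coarse mean m, place grid points m + t * eps * sigma
    covering the interval [m - radius * sigma, m + radius * sigma]."""
    steps = ceil(radius / eps)
    fine = set()
    for m in coarse_means:
        for t in range(-steps, steps + 1):
            fine.add(m + t * eps * sigma)
    return sorted(fine)
```

Any true mean within `radius * sigma` of a coarse candidate is then within `eps * sigma / 2` of some fine candidate. Since this step uses only the coarse candidates and not the data, it is post-processing and costs no additional privacy.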
For any and , there is an -DP PAC learner for with known that uses
Similar ideas can also be used to privately learn the class . The details can be found in Appendix E.
4.2 Learning Arbitrary Univariate Gaussian Mixtures
In this section, we construct a list-decodable learner for , the class of all univariate Gaussians. First, in Algorithm 3, we design an -DP algorithm that receives samples from where and outputs a list of candidate values for the standard deviation, one of which approximates the standard deviation of with high probability. Then, in Algorithm 4, we use Algorithm 2 and Algorithm 3 to design an -DP list-decoder for .
4.2.1 Estimating the variance
We begin with a method to estimate the variance. Algorithm 3 shows how to take a set of samples and output a list of standard deviations, one of which approximates the true standard deviation up to a factor of .