The primary model we study is distributed learning with the constraint of local differential privacy (LDP) [Warner, EvfimievskiGS03, KasiviswanathanLNRS11]. In this model each client (or user) holds an individual data point and a server can communicate with the clients. The goal of the server is to perform some statistical analysis of the data stored at the clients. In addition, the server is not trusted and the communication should not reveal significant private information about the users’ data. Specifically, the entire protocol needs to satisfy differential privacy [DworkMNS:06]. In the general version of the model, the executed protocol can involve an arbitrary number of rounds of interaction between the server and the clients. In practice, however, network latencies significantly limit the number of rounds of interaction that can be executed. Indeed, currently deployed systems that use local differential privacy are non-interactive [ErlingssonPK14, appledp, DKY17-Microsoft]. Namely, the server sends each client a request; based on the request, each client runs some differentially private algorithm on its data and sends a response back to the server. The server then analyzes the data it received (without further communication with the clients). See Section 2.1 for a formal definition of the model.
This motivates the question: which problems can be solved by non-interactive LDP protocols? This question was first formally addressed by Kasiviswanathan et al. [KasiviswanathanLNRS11], who also established an equivalence, up to polynomial factors, between algorithms in the statistical query (SQ) framework of Kearns [Kearns:98] and LDP protocols. (More formally, the equivalence holds for a more restricted way of measuring privacy, based on composition of the privacy parameters of each message sent by a user.) In this equivalence, non-interactive protocols correspond to non-adaptive SQ algorithms. Unfortunately, most SQ learning algorithms are adaptive and thus, for most problems, this equivalence only gives interactive LDP protocols. Using this equivalence, Kasiviswanathan et al. [KasiviswanathanLNRS11] also constructed an artificial learning problem which requires an exponentially larger number of samples to solve by any non-interactive LDP protocol than it does when interaction is allowed.
Motivated by the industrial applications of the LDP model, Smith et al. [SmithTU17] studied the complexity of solving stochastic convex loss minimization problems by non-interactive LDP algorithms. In these problems we are given a family of loss functions convex in and a convex body . For a distribution over the goal is to find an approximate minimizer of
over . They gave a non-interactive LDP algorithm that uses an exponential in number of samples. Additionally, they showed that such dependence is unavoidable for the commonly used optimization algorithms whose queries rely solely on the information in the neighborhood of the query point (such as gradients or Hessians). Their bounds have been strengthened and generalized in a number of subsequent works [DuchiRY18, WoodworthWSMS18, BalkanskiSinger18, DiakonikolasGuzman18, WangGX18] but the question of whether a non-interactive LDP protocol for optimizing convex functions with polynomial sample complexity exists remained open.
A recent work of Daniely and Feldman [DanielyF18] shows that there exist natural learning problems that are exponentially harder to solve by LDP protocols without interaction. Specifically, they consider PAC learning a class of Boolean functions over a domain . A PAC learning algorithm for receives i.i.d. samples where is drawn from an unknown distribution and , and its goal is to find which achieves a classification error of at most , namely
Daniely and Feldman [DanielyF18] show that the number of samples required by any non-interactive LDP protocol to learn with a non-trivial error is lower bounded by a polynomial in the margin complexity of . The margin of a linear separator over captures how well the points with are separated from those with , and is formally defined as
The margin complexity of is the inverse of the largest margin that can be achieved by embedding into such that every can be realized as a linear separator with margin at least . It is a well-studied notion within learning theory and communication complexity, measuring the complexity of Boolean function classes and their corresponding sign matrices in (e.g. [Novikoff:62, AizermanBR67, BoserGV92, ForsterSS01, Ben-DavidES02, Sherstov:08, LinialS:2009, KallweitSimon:11]). There exist known classes of functions, such as decision lists and general linear separators, that are PAC learnable by (interactive) SQ algorithms but have exponentially large margin complexity. Thus, non-interactive LDP protocols require an exponentially larger number of samples for PAC learning such classes than interactive ones. This result also leads to the question of whether all classes with inverse polynomial margin complexity can be learned efficiently non-interactively (see [DanielyF19:open] for a more detailed discussion). Such large-margin linear classifiers are much more common in practice and are significantly easier to learn than general linear separators. For example, a simple Perceptron algorithm can be used instead of more involved algorithms, such as the Ellipsoid method, that are used when the margin is exponentially small.
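To make the contrast concrete, the following is a minimal sketch of the classic Perceptron on toy data with a guaranteed margin. It is the standard sample-based and inherently adaptive algorithm, not the SQ or LDP implementation cited in this paper. The margin function assumes the usual normalized definition (the minimum over points of |<w, x>| / (|w| |x|)), and the data set, the separator `w_star`, and the threshold 0.1 are hypothetical choices made only for illustration.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def normalize(x):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(x, x))
    return [v / norm for v in x]


def margin(w, points):
    """Normalized margin of sign(<w, x>) over a finite point set (assumed definition)."""
    w_norm = math.sqrt(dot(w, w))
    return min(abs(dot(w, x)) / (w_norm * math.sqrt(dot(x, x))) for x in points)


def perceptron(samples, max_passes=1000):
    """Classic Perceptron: on every mistake, add y * x to the weight vector."""
    w = [0.0] * len(samples[0][0])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, y in samples:
            if y * dot(w, x) <= 0:  # misclassified (or on the boundary)
                w = [a + y * b for a, b in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return w, mistakes


# Hypothetical toy data: unit vectors in R^3 labeled by w_star, margin >= 0.1.
rng = random.Random(0)
w_star = normalize([1.0, -2.0, 0.5])
points = []
while len(points) < 200:
    x = normalize([rng.gauss(0, 1) for _ in range(3)])
    if abs(dot(w_star, x)) >= 0.1:
        points.append(x)
samples = [(x, 1 if dot(w_star, x) > 0 else -1) for x in points]

gamma = margin(w_star, points)
w_hat, mistakes = perceptron(samples)
```

On unit-norm data the classical mistake bound for the Perceptron is 1/gamma^2, independent of the dimension, which is why large margin makes the problem easy when adaptive access to the data is allowed.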
1.1 Our results
We show that both learning large-margin linear separators and learning of linear models with a convex loss require an exponential number of samples in the non-interactive LDP model. Formally, we define the margin relative to a distribution on as the margin relative to the support of the distribution: . We give the following lower bound for learning large-margin linear classifiers.
Fix , and . Let be a randomized, non-interactive -LDP learning algorithm over using samples. Assume that for any linear separator and distribution over with margin , outputs a hypothesis with an expected error of . Then, , where depends only on .
In particular, this lower bound is always exponential either in the margin or in the dimension of the problem. Note that linear separators with margin can be learned with error by an -LDP algorithm with rounds of interaction and using samples. This can be done by using a standard SQ implementation of the Perceptron algorithm [BlumFKV:97, FeldmanGV:15] (after a random projection to remove the dependence on the dimension) or via a reduction to convex loss minimization described below together with an LDP algorithm for convex optimization from [DuchiJW:13focs]. Our lower bound is also essentially tight in terms of the achievable error. There exists an efficient non-interactive algorithm achieving an error of , while is impossible for all .
As in the prior work [KasiviswanathanLNRS11, DanielyF18], we exploit the connection to statistical query algorithms. Here, we assume a distribution over and instead of i.i.d. samples from , an SQ algorithm has access to an SQ oracle for . Given a query function an SQ oracle for with tolerance parameter returns the value with some added noise of magnitude bounded by [Kearns:98]. Such an algorithm is non-adaptive if its queries do not depend on the answers to prior queries. Our lower bound is effectively a lower bound against non-adaptive statistical query algorithms together with the known simulation of a non-interactive LDP protocol by a non-adaptive SQ algorithm [KasiviswanathanLNRS11]. The SQ model captures a broad class of learning algorithms and thus our lower bound can be viewed as showing the importance of interactive access to data beyond the distributed learning setting.
Our lower bound for non-adaptive SQ algorithms is based on a new technique for constructing hard to distinguish pairs of distributions over data. The key technical element of this construction is a pair of distributions over that have nearly matching moments but whose supports are nearly linearly separable with significant margin. To design such distributions we rely on tools from the classical moment problem (see Sec. 2.3 for details). A more detailed overview of the proof requires some of the preliminaries and appears in Section 3.1.
Convex loss optimization of linear models:
We now spell out the implications of our lower bound in Theorem 1 for stochastic convex optimization. Our lower bounds will apply to optimization of the simple class of convex linear models. These models are defined by some loss function for some that is convex in the first parameter for every . In our reduction the label is in and the loss function can be further simplified as for a fixed convex function . In our reduction and are in , the unit ball of . We show that there exists -Lipschitz, -smooth and -strongly convex such that the following lower bound holds.
For any parameters , and , there exists a loss function where is convex, -Lipschitz, -smooth and -strongly convex, such that any non-interactive -LDP algorithm that outputs satisfying , requires
samples, where is a universal constant.
This implies that with -Lipschitzness and -smoothness, the sample complexity is exponential either in or in , and if we add the assumption of -strong convexity, the sample complexity can be exponential in . For comparison, for general convex functions the only known upper bounds are exponential in the dimension [SmithTU17, WangGX18]. For linear models, by polynomial approximation it is possible to obtain bounds without an exponential dependence on the dimension: for example, Zheng et al. [ZhengML17collect] showed that logistic regression can be solved with roughly
samples and Wang et al. [WangSX19] study general linear models. Efficient non-interactive LDP algorithms exist for least squares linear regression [SmithTU17, Wang2019principal] since for these tasks low order statistics suffice for finding a solution. See Section 4 for a more general statement of Theorem 2 and its proof.
Communication constrained setting:
An additional benefit of proving the lower bound via statistical queries is that we can extend our results to other models known to be related to statistical queries. In particular, we consider distributed protocols in which only a small number of bits is communicated from each client. Namely, each client applies a function with range to their input and sends the result to the server (for some ). As the server only has to communicate a random seed which is practically small and can provably be compressed to bits, this model is useful when the communication cost is high and the complete sample is expensive to send, for example, when its dimension is large. In the context of learning this model was introduced by Ben-David and Dichterman [Ben-DavidD98] and generalized by Steinhardt et al. [SteinhardtVW16]. Identical and closely related models are often studied in the context of distributed statistical estimation with communication constraints (e.g. [luo2005universal, rajagopal2006universal, ribeiro2006bandwidth, ZhangDJW13, SteinhardtD15, suresh2016distributed, acharya2018inference, acharya2019hadamard, acharya2019distributed, acharya2019inference]). As in the setting of LDP, the number of rounds of interaction that the server uses to solve a learning problem is a critical resource. Using the equivalence between this model and SQ learning we immediately obtain analogous lower bounds for this model. In particular, we show that either or is required for learning non-interactively. See Section 5 for additional details.
Our work provides nearly tight lower bounds for learning by non-interactive or one-round LDP protocols. An important question left open is whether linear classification and convex optimization can be solved by algorithms using a small number of rounds of interaction in the above models. Such lower bounds are not known even for the harder problem considered in [DanielyF18]. In contrast, known techniques for solving these problems require a polynomial number of rounds (see [SmithTU17] for a discussion). We hope that the construction in this paper will provide a useful step toward lower bounds against multi-round SQ or LDP algorithms. We remark, however, that general multi-round LDP protocols can be stronger than statistical query algorithms [JosephMR19] and thus may require an entirely different approach (see discussion in Section 2.1 for more details).
1.2 Related work
Most positive results for the non-interactive LDP model concern relatively simple data analysis tasks, such as computing counts and histograms (e.g. [HsuKR12, ErlingssonPK14, BassilyS15, BunNS18, ErlingssonFMRTT18]). Efficient non-interactive algorithms for learning large-margin classifiers and convex linear models can be obtained given access to public unlabeled data [DanielyF18, WangZGX2019]. A number of lower bounds on the sample complexity of LDP algorithms demonstrate that (non-interactive) LDP protocols are less efficient than the central model of differential privacy [KasiviswanathanLNRS11, DuchiWJ13:nips, Ullman18, duchi2019lower].
Joseph et al. [JosephMNR:19, JosephMR19] explore a different aspect of interactivity in LDP. Specifically, they distinguish between two types of interactive protocols: fully-interactive and sequentially-interactive ones. Fully-interactive protocols place no restrictions on interaction, whereas sequentially-interactive ones only allow asking one query per user. They give a separation showing that sequentially-interactive protocols may require exponentially more samples than fully-interactive ones. This separation is orthogonal to ours since our lower bounds are against completely non-interactive protocols and we separate them from sequentially-interactive protocols. Acharya et al. [acharya2018inference] implicitly consider another related model: one-way non-interactive protocols where the server does not communicate the choice of a randomizer to the clients or, equivalently, cannot share a random string with clients. They give a polynomial separation between one-way non-interactive protocols and non-interactive protocols for the problem of identity testing for a discrete distribution over elements ( vs samples).
2.1 Models of computation
Local differential privacy:
In the local differential privacy (LDP) model [Warner, EvfimievskiGS03, KasiviswanathanLNRS11] it is assumed that each of users holds a sample of some dataset . In the general version of the model the users can communicate with the server arbitrarily. The protocol is said to satisfy -LDP if the algorithm that outputs the transcript of the protocol (that is, the set of all messages sent in the protocol) given the dataset satisfies the standard definition of -differential privacy [DworkMNS:06].
We are interested in the non-interactive (one-round) LDP protocols. Such protocols can equivalently be described as non-interactively accessing the following oracle:
An -DP local randomizer is a randomized algorithm that given an input , outputs a message , such that and , . For a dataset , an oracle takes as an input an index and a local randomizer and outputs a random value obtained by applying . An algorithm is non-interactive -LDP if it accesses only via the oracle with -DP local randomizers, each sample is accessed at most once and all of its queries are determined before observing any of the oracle’s responses.
We remark that for non-interactive protocols, querying the same sample multiple times (subject to the entire communication satisfying -DP) does not affect the model. Also for non-interactive protocols, allowing -differential privacy instead of -DP does not affect the power of the model [BunNS18] (as long as is sufficiently small).
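As a concrete instance of a local randomizer, the sketch below implements binary randomized response, the simplest mechanism of this kind. It is an illustration only and not a construction used in this paper. The function names, the choice epsilon = 1, and the toy data set are assumptions. The privacy condition is read as: for every message, the output probabilities under any two inputs differ by a factor of at most e^epsilon.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true bit under randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bit, epsilon, rng):
    """Local randomizer: report the true bit w.p. e^eps / (1 + e^eps), else flip it."""
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit


def output_probability(message, bit, epsilon):
    """Pr[randomizer(bit) = message], used to check the privacy ratio exactly."""
    p = keep_probability(epsilon)
    return p if message == bit else 1.0 - p


def estimate_mean(messages, epsilon):
    """Debias the noisy reports: E[message] = (1 - p) + (2p - 1) * mean."""
    p = keep_probability(epsilon)
    observed = sum(messages) / len(messages)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


epsilon = 1.0
rng = random.Random(1)

# Worst-case ratio of output probabilities over all input pairs and messages.
worst_ratio = max(
    output_probability(m, b1, epsilon) / output_probability(m, b2, epsilon)
    for m in (0, 1) for b1 in (0, 1) for b2 in (0, 1)
)

# Non-interactive protocol: the same randomizer is fixed for every user up
# front, and each sample is touched exactly once.
data = [1 if i % 10 < 3 else 0 for i in range(20000)]  # true mean is 0.3
true_mean = sum(data) / len(data)
messages = [randomized_response(b, epsilon, rng) for b in data]
estimate = estimate_mean(messages, epsilon)
```

The server never sees `data`, only `messages`, and its analysis (here a debiased average) requires no further communication with the clients.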
The statistical query model of Kearns [Kearns:98] is defined by having access to a statistical query oracle to the data distribution instead of i.i.d. samples from . The oracle is defined as follows:
Given a domain , a statistical query is any (measurable) function . A statistical query oracle with tolerance receives a statistical query and outputs an arbitrary value such that .
To solve a learning problem in this model an algorithm has to succeed for any oracle’s responses that satisfy the guarantees on the tolerance. In other words, the guarantees of the algorithm should hold in the worst case over the responses of the oracle. A randomized learning algorithm needs to succeed for any SQ oracle whose responses may depend on all the queries asked so far but not on the internal randomness of the learning algorithm.
We say that an SQ algorithm is non-interactive (or non-adaptive) if all its queries are determined before observing any of the oracle’s responses. Kasiviswanathan et al. [KasiviswanathanLNRS11] show that one can simulate a non-interactive -LDP algorithm using a non-adaptive SQ algorithm.
Theorem 3 ([KasiviswanathanLNRS11]).
Let be an -LDP algorithm that makes non-interactive queries to for drawn i.i.d. from some distribution . Then for every there is a non-adaptive SQ algorithm that in expectation makes queries to for and whose output distribution has a total variation distance of at most from the output distribution of .
We remark that this simulation extends to interactive LDP protocols as long as they rely on local randomizers with the sum of privacy parameters used on every point being at most . Such protocols, first defined in [KasiviswanathanLNRS11], are referred to as compositional -LDP. They are known to be exponentially weaker than the general interactive LDP protocols, although the separation is known only for rather unnatural problems [JosephMR19]. The converse of this connection is also known: SQ algorithms can be simulated by -compositional LDP protocols (and this simulation preserves the number of rounds of interaction) [KasiviswanathanLNRS11].
2.2 Boolean Fourier analysis
Boolean Fourier analysis is concerned with the Fourier coefficients of functions of Boolean inputs, . Let be the uniform distribution over , and for any , define the coefficient
As is an orthonormal basis of the space of functions , can be decomposed as . Plancherel’s theorem states that
and Parseval’s theorem is the special case where . For a distribution over we define the Fourier coefficient as the coefficients of the function , namely,
Lastly, note that for a distribution and a function , it follows from Plancherel’s theorem that
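The identities of this section can be checked by brute force on a small cube. The sketch below assumes the standard conventions: inputs in {-1, 1}^n, characters chi_S(x) equal to the product of x_i over i in S, and coefficients defined as the expectation of f(x) chi_S(x) under the uniform distribution. For a distribution, it treats the probability mass function as the function being expanded, which gives a factor of 2^n in the last identity; the normalization used in the paper may differ. The test functions and the toy distribution are hypothetical.

```python
from itertools import combinations, product
from math import prod

n = 3
cube = list(product([-1, 1], repeat=n))
subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]


def chi(S, x):
    """Character chi_S(x): product of x_i over i in S (1 for the empty set)."""
    return prod(x[i] for i in S)


def fourier(f):
    """All Fourier coefficients of f under the uniform distribution on the cube."""
    return {S: sum(f(x) * chi(S, x) for x in cube) / len(cube) for S in subsets}


def majority(x):
    return 1 if sum(x) > 0 else -1


def first_bit(x):
    return x[0]


f_hat = fourier(majority)
g_hat = fourier(first_bit)

# Plancherel: E[f g] equals the sum over S of f_hat(S) * g_hat(S).
plancherel_lhs = sum(majority(x) * first_bit(x) for x in cube) / len(cube)
plancherel_rhs = sum(f_hat[S] * g_hat[S] for S in subsets)

# A toy distribution on the cube, expanded via its probability mass function.
weights = {x: 1 + 0.5 * x[0] * x[1] for x in cube}
total = sum(weights.values())
pmf = {x: w / total for x, w in weights.items()}
d_hat = fourier(lambda x: pmf[x])

# Expectation of f under the distribution, directly and via Fourier coefficients.
expectation_direct = sum(pmf[x] * majority(x) for x in cube)
expectation_fourier = 2 ** n * sum(d_hat[S] * f_hat[S] for S in subsets)
```

The last identity is what makes Fourier coefficients useful here: two distributions whose coefficients nearly agree give nearly the same expectation to every bounded query.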
2.3 The classical moment problem
Given a probability distribution and , it is natural to try to characterize all distributions that have the same first moments as , namely, distributions with for all . There is a rich literature on this topic, e.g. [akhiezer1965classical, krein1977markov] (see [benjamini2012k] for an application in computer science). The study uses the notion of orthogonal polynomials:
Let be a probability distribution over with all moments finite. We say that a sequence of polynomials are orthogonal with respect to if the satisfy the following:
For all , is of degree and has a positive leading coefficient.
For all , .
Denote the above sequence of polynomials as the orthogonal polynomials with respect to .
It is known that there is a unique sequence of orthogonal polynomials with respect to , hence we call them the orthogonal polynomials (w.r.t. ). Given the orthogonal polynomials , define the function as follows:
These functions characterize the amount of mass that can be concentrated on the point by distributions that match the first moments of :
Theorem 4 ([akhiezer1965classical], Theorem 2.5.2).
Let be a distribution with finite moments, fix and and let be defined with respect to . The following holds:
There exists a distribution matching the first moments of with .
Any distribution that matches the first moments of satisfies: .
3 Proof of Theorem 1
Let , , and define . Let be a non-adaptive statistical query algorithm such that for any linear separator and distribution over with margin , returns a hypothesis with . If has access to statistical queries with tolerance , then requires at least queries, where is a constant depending only on .
We start with a brief sketch of the proof. Let and . Our proof is based on a construction of two distributions and over and two linear functions and that are hard to distinguish, yet almost always disagree on the label . Specifically, they have the following properties:
Any satisfies for , and additionally, and have -classification margin over the supports of and , respectively.
and have nearly the same Fourier coefficients: for any , is exponentially small.
for nearly all values of : , for where .
Given these two distributions, we can create a hard family of distributions containing many pairs obtained from the original pair by a simple translation. Any efficient SQ algorithm would find most pairs of distributions impossible to distinguish. That is, the algorithm cannot distinguish which of the two distributions in the pair is the correct one. As a consequence, it will not be able to predict the correct label of for most values of .
In the rest of this section we describe how and are constructed. The construction involves multiple consecutive steps that we describe below. We start with two distributions and over that satisfy:
and have matching first moments.
and , where .
The distribution is a mixture in which the value has weight
and a scaled and shifted exponential distribution defined on has weight . To show that there exists a distribution which matches the first moments of and satisfies , it suffices to show that , where is the function from Eq. (5), which is defined by the orthogonal polynomials of . We calculate these polynomials as a linear combination of the orthogonal polynomials of the exponential distribution, for which a closed formula is known. We remark that instead of the exponential distribution other distributions can be used to get a similar bound on .
Based on and , we create two distributions and over which satisfy:
and nearly match all Fourier coefficients.
To draw we first draw and then draw each bit of independently with mean . Similarly, we draw given . The Fourier coefficients of and correspond to the moments of and , respectively: and similarly for and . Hence the Fourier coefficients of and nearly match (note that we have only shown that and match the first moments; however, the higher moments are exponentially small and negligible). The second property of and follows from the second property of and (except with some small failure probability which we can condition out).
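The correspondence between Fourier coefficients and moments can be verified exactly on a small example. In the sketch below a bias a is drawn from a finite mixing distribution and each bit in {-1, 1} is then drawn independently with mean a; the expectation of the character chi_S under the resulting distribution equals the |S|-th moment of the mixing distribution. The mixing distribution, the value n = 4, and the function names are hypothetical, and the check is stated for the unnormalized quantity E[chi_S(x)], which may differ from the paper's coefficient by a factor depending only on n.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod


def mixture_pmf(mixing, n):
    """Exact pmf on {-1, 1}^n: draw a bias a, then n i.i.d. bits with mean a."""
    pmf = {}
    for x in product([-1, 1], repeat=n):
        # A bit with mean a equals xi with probability (1 + a * xi) / 2.
        pmf[x] = sum(w * prod((1 + a * xi) / 2 for xi in x) for a, w in mixing.items())
    return pmf


def character_expectation(pmf, S):
    """E[chi_S(x)] under the given pmf, where chi_S(x) is the product of x_i, i in S."""
    return sum(p * prod(x[i] for i in S) for x, p in pmf.items())


def moment(mixing, k):
    """k-th moment of the mixing distribution over biases."""
    return sum(w * a ** k for a, w in mixing.items())


# Hypothetical mixing distribution over biases, kept exact with Fractions.
mixing = {Fraction(-1, 2): Fraction(1, 4), Fraction(1, 3): Fraction(3, 4)}
n = 4
pmf = mixture_pmf(mixing, n)

checks = [
    (S, character_expectation(pmf, S), moment(mixing, len(S)))
    for k in range(n + 1)
    for S in combinations(range(n), k)
]
```

Consequently, two mixing distributions with matching first k moments yield distributions over the cube that agree on every character of degree at most k.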
Next, we explain the distributions and and the functions and that appear in the first paragraph: is defined as a majority over the first bits, and is a majority over the last bits, . To draw , we independently draw , and . Then, we set . We define nearly the same way, with the only difference that . From the properties of and , all properties of and presented in the first paragraph are satisfied.
3.2 Proof of Theorem 5
We begin with some notations:
Given a statistical query , denote .
We use to denote universal constants or constants depending only on . In the proof we will allow redundant constants depending on (e.g. the advantage will be rather than ).
Let denote the uniform distribution over a finite set , let denote the total variation distance of two distributions and let denote the support of a probability distribution .
In contrast to the presentation in the intro, we conveniently assume that the distributions are only over rather than over .
The general idea is to split the bits of into two bit-sets, each containing bits. The value of will be a function of one of these sets, however any efficient non-adaptive algorithm would not be capable of finding the correct subset. Moreover, intuitively speaking, the incorrect subset will almost always lie by claiming the wrong value for .
We begin with two distributions and that nearly match all Fourier coefficients, however, for any while with probability for .
There exist two distributions, and over , such that the following holds:
, where is obtained by drawing and outputting , and is a constant depending only on .
Any satisfies .
and are nearly indistinguishable: for any , , where is a constant depending only on .
The proof utilizes results from the classical moment problem, and involves calculating the orthogonal polynomials of some distribution, as will be elaborated in Section 3.3.
Given and , we construct two distribution-function pairs and which are hard to distinguish, in a sense that will be clear later. The function is a majority of the first coordinates, and is a majority of the last bits, . A random is drawn by drawing independently , , and setting . Note that , where is the value drawn above. A random is drawn in the same way, with the following distinction: . Here, notice that .
Since is nearly distributed as , with high probability over , the majority of the first coordinates of is almost always the opposite of the majority of the last coordinates (and similarly when ). In particular, if one does not know whether the true function equals or , it is impossible to predict given with probability significantly greater than a half.
Utilizing the fact that the building blocks of and , namely and , nearly match their Fourier coefficients, we can generate a family of hard distributions by simple translations of and : for any define the pairs and as follows: and is obtained by drawing and setting for . Similarly, and is obtained by drawing and setting . The following are simple properties of the defined distributions, which follow mainly from Lemma 1, and are proved in Section 3.4.
Fix . Then, and satisfy the following properties:
where depends only on (recall that ).
Next, we claim that for any set of statistical queries and for nearly all values of , the queries will have nearly the same value for both and . This follows from the fact that and have all their Fourier coefficients close to each other.
Fix a set of statistical queries for . Then,
where depend only on .
The proof will be presented in Section 3.5. Next, we define the exact statistical query setting: define the number of allowed queries and tolerance to ensure that the algorithm cannot distinguish between and : and , for the constants from Lemma 3. We define the SQ oracle such that it gives the same answers to and for most : given a statistical query , it acts as follows:
If the true distribution-function pair is for some then return the true value .
If the pair is and then return .
Otherwise return .
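The three cases above can be sketched as a small function. This is one reading of the rule, under the assumption that the oracle answers truthfully for the first pair and, for the second pair, substitutes the first pair's value whenever the two values are within the tolerance. The function name, argument names, and numeric values are hypothetical.

```python
def adversarial_answer(true_value, reference_value, tau, is_reference):
    """Answer one statistical query under the assumed three-case rule.

    true_value      -- exact value of the query on the true distribution
    reference_value -- exact value of the query on the paired reference distribution
    tau             -- tolerance of the SQ oracle
    is_reference    -- True if the true distribution is the reference one
    """
    if is_reference:
        return true_value  # case 1: answer truthfully
    if abs(true_value - reference_value) <= tau:
        return reference_value  # case 2: mimic the reference pair
    return true_value  # case 3: the query distinguishes the pair, answer truthfully


tau = 0.05
# Query whose values on the two distributions are close: both get the same answer.
same_a = adversarial_answer(0.30, 0.30, tau, is_reference=True)
same_b = adversarial_answer(0.32, 0.30, tau, is_reference=False)
# Query whose values differ by more than tau: the answers differ.
diff_b = adversarial_answer(0.50, 0.30, tau, is_reference=False)
```

In every case the returned value is within tau of the true value, so this is a legitimate SQ oracle, and for queries that do not separate the pair the algorithm sees identical answers under both distributions.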
To conclude the proof, recall that Lemma 2 states that for nearly all values of , . In particular, if one cannot distinguish between these two functions, then one cannot know the true classification of . There is one subtlety that requires care: if the total variation distance between and were large, it would be possible, given , to guess whether it was drawn from or with a non-negligible success probability. However, Lemma 2 ensures that this is not the case. The formal proof is presented below:
Proof of Theorem 5.
We start by assuming that the algorithm is deterministic and then extend to randomized algorithms. From this assumption it follows that the statistical queries are deterministic as well. Fix such that the responses of the oracle to are the same as for . From Lemma 3 and from the definition of the oracle, nearly all are such. For these , the algorithm has to learn some hypothesis without knowing if the true distribution-function pair is or . Let denote the learned hypothesis given . For these hard values of , . Let , where is the constant from Lemma 2. Applying Lemma 2 multiple times, we obtain that for any such :
where depends only on . Lastly, assume that the algorithm is randomized. Any randomized algorithm is just a distribution over deterministic algorithms, hence Eq. (7) will hold even if the algorithm is allowed to be randomized and the probability is taken over and the randomness of the algorithm. ∎
3.3 Proof of Lemma 1
Throughout the proof we will use the following parameters: , and . From the assumptions in Theorem 5, .
The first step is to find two distributions over of a particular shape whose first moments match. The first distribution is a mixture that samples with probability
and an exponential random variable with probability . By calculating the orthogonal polynomials of and applying Theorem 4, we find a distribution that matches the first moments of , and additionally, .
In the second step, we shift, scale and condition and , to obtain two distributions and that have nearly matching moments and satisfy the following conditions: ; ; is supported on and is supported on .
In the third step, we use and to generate and , respectively. To generate , we first draw and then, conditioned on , we draw each i.i.d. from the distribution over with expectation . The distribution is similarly defined using , except that we additionally condition on the high-probability event that . It follows from a simple argument that the Fourier coefficients satisfy and similarly, . We obtain that all Fourier coefficients of and nearly match.
Lastly, we claim that . To obtain this, first note that , as both and have mass on . As and are obtained from and using nearly the same transformation, we can apply the data processing inequality (presented in Section A) to bound .
We divide the proof into four parts, according to the steps described above.
Step 1: Distributions and over that match the first moments
We start by constructing two distributions over with matching first moments. Let distribution be the following mixture: with probability sample from the exponential distribution with parameter , and with probability sample . We start with the following lemma:
There exists a distribution that matches the first moments of and additionally, , where is a universal constant.
Before proving this lemma, we give some intuition: By Theorem 4, it suffices to show that , where is as defined in Section 2.3 with respect to the moments of . The same theorem implies that since , then ; and since is continuous, for any sufficiently small . To show that , we calculate the orthogonal polynomials of as linear combinations of the Laguerre polynomials, the orthogonal polynomials for the exponential distribution. Recall that is defined as a function of these polynomials, which allows us to bound .
First, we present the orthogonal polynomials of the exponential distribution:
The orthogonal polynomials for the exponential distribution with parameter are the Laguerre polynomials
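The orthonormality of the Laguerre polynomials can be checked exactly with rational arithmetic. The sketch below assumes the exponential distribution has rate 1, uses the standard closed form L_k(x) = sum_{i=0}^{k} C(k, i) (-1)^i x^i / i!, and uses the fact that the m-th moment of a rate-1 exponential is m!. Since the standard L_k has leading coefficient (-1)^k / k!, it is multiplied by (-1)^k to meet the positive-leading-coefficient convention of Section 2.3. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def laguerre_coefficients(k):
    """Coefficients c_0, ..., c_k of the standard Laguerre polynomial L_k."""
    return [Fraction((-1) ** i * comb(k, i), factorial(i)) for i in range(k + 1)]


def positive_leading(k):
    """(-1)^k * L_k, which has a positive leading coefficient 1 / k!."""
    return [(-1) ** k * c for c in laguerre_coefficients(k)]


def inner_product(p, q):
    """Exact E[p(X) q(X)] for X ~ Exp(1), using E[X^m] = m!."""
    return sum(
        a * b * factorial(i + j)
        for i, a in enumerate(p)
        for j, b in enumerate(q)
    )


degree = 6
polys = [positive_leading(k) for k in range(degree)]
# Gram matrix of the first few polynomials; orthonormality means it is the identity.
gram = [[inner_product(p, q) for q in polys] for p in polys]
```

For an exponential distribution with a different rate, the same check applies after rescaling the variable by the rate.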
Using a simple calculation, one obtains that the orthogonal polynomials for equal
is a normalizing constant. To verify this formula it suffices to check that for and that and these equations uniquely define (up to sign changes).
To get a closed form equation of the orthogonal polynomial, we use the identity
to obtain that
Using the above formula, we can prove the following bound on :
Assume that . Then,