New, more powerful computational hardware and access to substantial amounts of data have made it possible to fit models for image classification, text translation, physical particle prediction, astronomical observation, and other predictive tasks with accuracy that was previously infeasible [11, 7, 49]. In many modern applications, data comes from measurements on small-scale devices with limited computation and communication ability (remote sensors, fitness monitors), making fitting large-scale predictive models both computationally and statistically challenging. Moreover, as more modes of data collection and computing move to peripherals (watches, power-metering, internet-enabled home devices, and even lightbulbs), issues of privacy become ever more salient.
Such large-scale data collection motivates substantial work. Stochastic gradient methods are now the de facto approach to large-scale model-fitting [68, 18, 60, 28], and recent work of McMahan et al. describes systems (which they term federated learning) for aggregating multiple stochastic model-updates from distributed mobile devices. Yet even if only updates to a model are transmitted, leaving all user or participant data on user-owned devices, it is easy to compromise the privacy of users [37, 57]. To see why this issue arises, consider any generalized linear model based on a data vector $x$, target $y$, and with loss of the form $\ell(\theta; x, y) = \phi(\langle \theta, x \rangle, y)$. Then trivially one has $\nabla_\theta \ell(\theta; x, y) = c \cdot x$ for some $c \in \mathbb{R}$, hence a scalar multiple of the user's clear data $x$, which is a clear compromise of privacy. In this paper, we describe an approach to fitting such large-scale models both privately and practically.
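To make the leakage concrete, the following minimal sketch computes the gradient of a logistic loss, one instance of a generalized linear model, and recovers the data vector from it. The function name `logistic_loss_grad` and the numeric values are illustrative assumptions, not part of any system described here.

```python
import math

def logistic_loss_grad(theta, x, y):
    """Gradient in theta of log(1 + exp(-y * <theta, x>)), plus the scalar c."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    # Derivative of the scalar loss in s = <theta, x>; the gradient is c * x.
    c = -y / (1.0 + math.exp(margin))
    return [c * xi for xi in x], c

# Hypothetical private data vector and model parameter.
x = [0.5, -1.0, 2.0]
theta = [0.1, 0.2, -0.3]
grad, c = logistic_loss_grad(theta, x, y=1)

# Anyone who sees the update can rescale it to recover x up to the scalar c.
recovered_direction = [g / c for g in grad]
```

Because the scalar `c` is nonzero for the logistic loss, the transmitted gradient always reveals the direction of `x` exactly.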
A natural approach to addressing the risk of information disclosure in such federated learning scenarios is to use differential privacy, which provides strong guarantees on the risk of compromising any user's data. To implement differential privacy, one defines a mechanism $M$, a randomized mapping from a sample $x$ of data points to some space $\mathcal{Z}$, which is $\varepsilon$-differentially private if

$$\frac{P(M(x) \in S)}{P(M(x') \in S)} \le e^{\varepsilon} \qquad (1)$$

for all measurable sets $S \subset \mathcal{Z}$ and all samples $x$ and $x'$ differing in at most one entry, i.e. if one element is present in one sample and absent in the other. Because of its strength and protection properties, differential privacy (and its variants) is now essentially the standard privacy definition in data analysis and machine learning [22, 32, 23]. Nonetheless, implementing such an algorithm presumes a level of trust between users and a centralized data analyst, which may be undesirable or even untenable, as the data analyst has essentially unfettered access to a user's data. Another approach to protecting individual updates is to use secure multiparty computation (SMC), sometimes in conjunction with differential privacy protections; see, for example, Bonawitz et al. Traditional approaches to SMC require substantial communication and computation, making them untenable for large-scale data collection schemes, and Bonawitz et al. address a number of these difficulties, though individual user communication and computation still increase with the number of users submitting updates and require multiple rounds of communication, which may be unrealistic when estimating models from peripheral devices.
An alternative to these approaches is to use locally private algorithms [65, 36, 31], in which an individual keeps his or her data private even from the data collector. Such scenarios are natural in distributed (or federated) learning, where individuals provide data from their devices [53, 8] but wish to maintain privacy. In our learning context, where a user has data $x \in \mathcal{X}$ that he or she wishes to remain private, a randomized mechanism $M : \mathcal{X} \to \mathcal{Z}$ is $\varepsilon$-local differentially private if for all $x, x' \in \mathcal{X}$ and sets $S \subset \mathcal{Z}$,

$$\frac{P(M(x) \in S)}{P(M(x') \in S)} \le e^{\varepsilon}. \qquad (2)$$

Roughly, a mechanism satisfying inequality (2) guarantees that even if an adversary knows that the initial data is one of $x$ or $x'$, the adversary cannot distinguish them given an outcome $Z$ (the probability of error must be at least $1/(1 + e^{\varepsilon})$). Taking as motivation this testing definition, the “typical” recommendation for the parameter $\varepsilon$ is to take $\varepsilon$ as a small constant [66, 35, 32].
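As a small illustration of the likelihood ratio bound and the testing interpretation, consider binary randomized response, a classical locally private mechanism that is not one of the mechanisms developed in this paper. The sketch below assumes a single private bit; all function names are hypothetical. With equal prior weight on the two values, the best test simply reports the observed bit, and its error matches the lower bound $1/(1 + e^{\varepsilon})$.

```python
import math
import random

def randomized_response_probs(eps):
    """Return (P(Z = x | X = x), P(Z != x | X = x)) for binary randomized response."""
    p_keep = math.exp(eps) / (1.0 + math.exp(eps))
    return p_keep, 1.0 - p_keep

def randomized_response(bit, eps, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_keep, _ = randomized_response_probs(eps)
    return bit if rng.random() < p_keep else 1 - bit

def min_test_error(eps):
    """Lower bound on the average error of any test between two inputs."""
    return 1.0 / (1.0 + math.exp(eps))

eps = 1.0
p_keep, p_flip = randomized_response_probs(eps)
likelihood_ratio = p_keep / p_flip   # equals e^eps, so the bound holds with equality
```

For large privacy parameters the bound becomes very weak: at $\varepsilon = 10$ it only forces an error of roughly $4.5 \times 10^{-5}$, which is the sense in which the testing interpretation makes large $\varepsilon$ look nearly vacuous.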
Local privacy protections provide numerous benefits: they allow easier compliance with regulatory strictures, reduce risks (such as hacking) associated with maintaining private data, and allow more transparent protection of user privacy, because private data never leaves an individual’s device in the clear. Yet substantial work in the statistics, machine learning, and computer science communities has shown that local differential privacy and its relaxations cause nontrivial challenges for learning systems. Indeed, Duchi et al. [30, 31] show that in a minimax (worst case population distribution) sense, learning with local differential privacy must suffer a degradation in sample complexity that scales linearly in the dimension of the problem, at least for privacy parameters . Duchi and Ruan  develop this approach further, arguing that a worst-case analysis is too conservative and may not accurately reflect the difficulty of problem instances one actually encounters, so that an instance-specific theory of optimality is necessary. In spite of this instance-specific optimality theory for locally private procedures—that is, fundamental limits on learning that apply to the particular problem at hand—Duchi and Ruan’s results suggest that local notions of privacy as currently conceptualized restrict some of the deployments of learning systems.
We consider an alternative conceptualization of privacy protections and the concomitant guarantees from differential privacy and the likelihood ratio bound (2). The testing interpretation of differential privacy suggests that when $\varepsilon \gg 1$, the definition (2) is almost vacuous. We argue that, at least in large-scale learning scenarios, this testing interpretation is unrealistic, and allowing mechanisms with $\varepsilon \gg 1$ may provide meaningful privacy protections. Rather than providing protections against arbitrary inferences, we wish to provide protection against accurate reconstruction of an individual's data $X$. In the large-scale learning scenarios we consider, an adversary given a random observation likely has little prior information about $X$, so that protecting only against reconstructing (functions of) $X$ under some assumptions on the adversary's prior knowledge allows substantially improved model-fitting.
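The formal setting for our problems is as follows. Given data $X_i \in \mathcal{X}$, $i = 1, \ldots, n$, drawn from a distribution $P$, we seek a parameter vector $\theta \in \Theta$ that will have good future performance when evaluated under a loss $\ell(\theta; x)$, that is, solve the population risk minimization problem

$$\mathop{\mathrm{minimize}}_{\theta \in \Theta} \; L(\theta) := \mathbb{E}_P[\ell(\theta; X)]. \qquad (3)$$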
The standard approach to such problems is to construct the empirical risk minimizer $\hat\theta_n = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ell(\theta; X_i)$. In this paper, however, we consider the stochastic minimization problem (3) while providing both local privacy to individual data $X_i$ and (to maintain the satisfying guarantees of centralized differential privacy (1)) stronger guarantees on the global privacy of the output of our procedure. With this as motivation, we describe our contributions at a high level. As above, we argue that large local privacy (2) parameters, $\varepsilon \gg 1$, still provide reasonable privacy protections. We develop new mechanisms and privacy protecting schemes that more carefully reflect the statistical aspects of problem (3), which we demonstrate are (in a sense) theoretically optimal. A substantial portion of this work is devoted to providing practical procedures with meaningful local privacy guarantees, which currently do not exist. Consequently, we provide extensive empirical results that demonstrate the tradeoffs in private federated (distributed) learning scenarios, showing that it is possible to achieve performance comparable to federated learning procedures without privacy safeguards.
1.1 Our approach and results
We propose and investigate a two-pronged approach to model fitting under local privacy. Motivated by the difficulties associated with local differential privacy we discuss in the immediately subsequent section, we reconsider the threat models (or types of disclosure) in locally private learning. Instead of considering an adversary with access to all data, we consider “curious” onlookers, who wish to decode individuals' data but have little prior information on them. Formalizing this (as we discuss in Section 2) allows us to consider substantially relaxed values for the privacy parameter $\varepsilon$, sometimes even scaling with the dimension of the problem, while still providing protection. While this brings us away from the standard guarantees of differential privacy, we can still provide privacy guarantees for the type of onlookers we consider.
This model of privacy is natural in federated learning scenarios [53, 8], where we wish to perform distributed model training. Here, by providing protections against curious onlookers, a company can protect its users against reconstruction of their data by, for example, internal employees. By encapsulating this more relaxed local privacy model within a broader central differential privacy layer, we can still provide satisfactory privacy guarantees to users, protecting them against strong external adversaries as well.
We make several contributions to achieve these goals. In Section 2, we describe a model for curious adversaries, with appropriate privacy guarantees, and demonstrate how (for these curious adversaries) it is still nearly impossible to accurately reconstruct individuals’ data. We then detail a prototypical private federated learning system in Section 3. In this direction, we develop new (near-optimal) privacy mechanisms for privatization of high-dimensional vectors in unit balls (Section 4). These mechanisms yield substantial improvements over the minimax optimal schemes Duchi et al. [31, 30] develop, providing order of magnitude improvements over classical noise addition schemes, and we provide a unifying theorem showing the asymptotic behavior of a stochastic-gradient-based private learning scheme in Section 4.4. We conclude our development in Section 5 with several large-scale distributed model-fitting problems, showing how the tradeoffs we make allow for practical procedures. Our approaches allow substantially improved model-fitting and prediction schemes; in situations where local differential privacy with smaller privacy parameter fails to learn a model at all, we can achieve models with performance near non-private schemes.
1.2 Why local privacy makes model fitting challenging
To motivate our approaches, we discuss why local privacy causes some difficulties in a classical learning problem. Duchi and Ruan help to elucidate the precise reasons for the difficulty of estimation under $\varepsilon$-local differential privacy, and we can summarize them heuristically here, focusing on the machine learning applications of interest. To do so, we begin with a brief detour through classical statistical learning and the usual convergence guarantees that are (asymptotically) possible.
Consider the population risk minimization problem (3), and let $\theta^\star$ denote its minimizer. We perform a heuristic Taylor expansion to understand the difference between the empirical risk minimizer $\hat\theta_n$ and $\theta^\star$. Indeed, we have

$$0 = \frac{1}{n} \sum_{i=1}^n \nabla \ell(\hat\theta_n; X_i) = \frac{1}{n} \sum_{i=1}^n \nabla \ell(\theta^\star; X_i) + \Big(\frac{1}{n} \sum_{i=1}^n \nabla^2 \ell(\theta^\star; X_i) + E_n\Big)(\hat\theta_n - \theta^\star)$$

(for an error term $E_n$ in the Taylor expansion of $\nabla \ell$), which, when carried out rigorously, implies

$$\hat\theta_n - \theta^\star = -\frac{1}{n} \sum_{i=1}^n \nabla^2 L(\theta^\star)^{-1} \nabla \ell(\theta^\star; X_i) + o_P(n^{-1/2}).$$

The influence function of the parameter $\theta$, which measures the effect that changing a single observation has on the resulting estimator $\hat\theta_n$, is $\psi(x) := -\nabla^2 L(\theta^\star)^{-1} \nabla \ell(\theta^\star; x)$, and typically a problem is “easy” when the variance of the function $\psi(X)$ is small; thus, individual observations do not change the estimator substantially. In the case of local differential privacy, however, as Duchi and Ruan demonstrate, (optimal) locally private estimators typically have the form

$$\hat\theta_n - \theta^\star = \frac{1}{n} \sum_{i=1}^n \big[\psi(X_i) + W_i\big] + o_P(n^{-1/2}), \qquad (5)$$

where $W_i$ is a noise term that must be taken so that $\psi(x) + W_i$ and $\psi(x') + W_i$ are indistinguishable for all $x, x'$. Essentially, a local differentially private procedure cannot leave $\psi(x) + W_i$ small even when $\psi(x)$ is typically small (i.e. the problem is easy) because it could be large for some value $x$. In locally private procedures, this means that differentially private tools for typically “insensitive” quantities cannot apply, as an individual term in the sum (5) is (obviously) sensitive to arbitrary changes in $x$. The consequences of this are striking, and extend even to weakenings of local differential privacy: it makes adaptivity to easy problems essentially impossible for standard $\varepsilon$-locally-differentially private procedures, at least when $\varepsilon$ is small, and introduces substantial dimension-dependent penalties in the error. Thus, to enable high-quality estimates for quantities of interest in machine learning tasks, we explore locally differentially private settings with larger privacy parameter $\varepsilon$.
2 Privacy protections
In developing any statistical machine learning system providing privacy protections, it is important to consider the types of attacks that we wish to protect against. In distributed model fitting and federated learning scenarios, we consider two potential attackers: the first is a curious onlooker who can observe all updates to a model and communication from individual devices, and the second is a powerful external adversary who can observe the final (shared) model or other information about individuals who may participate in data collection and model-fitting. For the latter adversary, essentially the only effective protection is to use a small privacy parameter $\varepsilon$ in a localized or centralized differentially private scheme [54, 35, 59]. For the curious onlookers, however (for example, internal employees of a company fitting large-scale models), we argue that protecting against reconstruction is reasonable.
2.1 Reconstruction breaches
Abstractly, we work in a setting in which a user or study participant has data $X$ he or she wishes to keep private. Via some process, this data is transformed into a vector $W$, which may simply be an identity transformation, but may also be a gradient of the loss $\ell$ on the datum $X$ or other derived statistic. We then privatize $W$ via a randomized mapping $W \to Z$. An onlooker may then wish to estimate or evaluate some function $f$ on the private data $X$, that is, $f(X)$. Thus, we have the Markov chain

$$X \to W \to Z,$$

and the onlooker, who observes only $Z$, wishes to estimate $f(X)$. In most scenarios with a curious onlooker, however, if $X$ or $f(X)$ is suitably high dimensional, the onlooker has limited prior information about $X$, so that relatively little obfuscation is required in the mapping from $W \to Z$.
As a motivating example, consider an image processing scenario. A user has an image , where are wavelet coefficients of (in some prespecified wavelet basis) ; without loss of generality, we assume we normalize to have energy . Let be a low-dimensional version of (say, based on the first 1/8 of wavelet coefficients); then (at least intuitively, and we can make this rigorous) taking to be a noisy version of such that —that is, noise on the scale of the energy —should be sufficient to guarantee that the observer is unlikely to be able to reconstruct to any reasonable accuracy. Moreover, a simple modification of the techniques of Hardt and Talwar  shows that for , and -differentially private quantity for satisfies whenever . That is, we might expect that even very large provide protections against reconstruction.
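With this in mind, let us formalize a reconstruction breach in our scenario. Here, the onlooker (or adversary) has a prior $\pi$ on $X$, and there is a known (but randomized) mechanism $M$, $Z = M(W)$. We then have the following definition.

Definition 2.1 [Reconstruction breach] Let $\pi$ be a prior on $\mathcal{X}$, and let $X, W, Z$ be generated with Markov structure $X \to W \to Z = M(W)$ for a mechanism $M$. Let $f$ be the target of reconstruction and $L_{\mathrm{rec}}$ be a loss function. Then an estimator $\hat v$ provides an $(\alpha, p, f)$-reconstruction breach for the loss $L_{\mathrm{rec}}$ if there exists $z$ such that

$$P\big(L_{\mathrm{rec}}(f(X), \hat v(z)) \le \alpha \mid M(W) = z\big) > p. \qquad (6)$$

If for every estimator $\hat v$,

$$\sup_z P\big(L_{\mathrm{rec}}(f(X), \hat v(z)) \le \alpha \mid M(W) = z\big) \le p,$$

then the mechanism $M$ is $(\alpha, p, f)$-protected against reconstruction for the loss $L_{\mathrm{rec}}$.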
Key to Definition 2.1 is that it applies uniformly across all possible observations $z$ of the mechanism $M$: there are no rare breaches of privacy. (With that said, we ignore measurability issues here; in our setting, all random variables are mutually absolutely continuous and are generated by regular conditional probability distributions, so the conditioning on $z$ in Def. 2.1 has no issues.) This requires somewhat stringent conditions on mechanisms and also disallows recent relaxed privacy definitions [33, 21, 59].
2.2 Separated private mechanisms
We will enforce (analogues of) central differential privacy (1) in our overall model-fitting system; here, McMahan et al.  show that certain noise-adding strategies that we discuss in the sequel allow efficient modeling.
The more challenging part of our development, then, is to consider mechanisms for providing privacy in the local model. Motivated by the difficulties we outline in Section 1.2 for locally private model fitting (in particular, that estimating the magnitude of a gradient or influence function is challenging), we consider mechanisms that transmit information by privatizing a pair $(U, R)$, where $Z_1 = M_1(U)$ and $Z_2 = M_2(R)$ are the privatized versions of $U$ and $R$.
Continuing our running thread of federated learning scenarios, when $W$ is a vector, we split $W$ into its direction $U = W / \|W\|_2$ and magnitude $R = \|W\|_2$. This splitting allows us to develop mechanisms that more carefully reflect the challenges of transmitting vectors that may have varying scales, which is especially important in situations like the one we described in Section 1.2.
More succinctly, we consider mechanisms as the pair $(M_1, M_2)$; we would like to guarantee that the pair $(Z_1, Z_2)$ contains little information about $W$ (and so in turn, little about $X$). As we use this separated scheme repeatedly, we make the following definition.
[Separated Differential Privacy] A pair of mechanisms $(M_1, M_2)$ mapping from $\mathcal{U} \times \mathcal{R}$ to $\mathcal{Z}_1 \times \mathcal{Z}_2$ (i.e. a channel with the Markovian structure of Fig. 1) is $(\varepsilon_1, \varepsilon_2)$-separated differentially private if $M_1$ is $\varepsilon_1$-locally differentially private and $M_2$ is $\varepsilon_2$-locally differentially private (Eq. (2)).
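The direction and magnitude split is simple to state in code. The sketch below is illustrative only: `split` and `recombine` are hypothetical helper names, and the private mechanisms applied to each part are deliberately left out. If the two privatized parts are conditionally independent unbiased estimates of the direction and the magnitude, their product is an unbiased estimate of the original vector.

```python
import math

def split(w):
    """Split a nonzero vector into its unit direction and its l2 magnitude."""
    r = math.sqrt(sum(wi * wi for wi in w))
    if r == 0.0:
        raise ValueError("direction is undefined for the zero vector")
    return [wi / r for wi in w], r

def recombine(z1, z2):
    """Rebuild a vector estimate from a (privatized) direction and magnitude."""
    return [z2 * zi for zi in z1]

# Without any privatization, splitting and recombining is the identity map.
u, r = split([3.0, 4.0])
w_back = recombine(u, r)
```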
2.3 Protecting against reconstruction
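We can now develop guarantees against reconstruction based on our previous definitions. We begin with a simple claim.

Lemma 2.3. Let the pair of mechanisms $(M_1, M_2)$ satisfy $(\varepsilon_1, \varepsilon_2)$-separated differential privacy, with $(U, R)$ generated from $X$ as in Fig. 1, and let $V = f(X)$ for a measurable function $f$. Then for any prior $\pi$ on $X$ and measurable sets $A$, the posterior distribution $\pi(\cdot \mid z_1, z_2)$ (for $z_1 = M_1(U)$ and $z_2 = M_2(R)$) satisfies

$$\frac{\pi(V \in A \mid z_1, z_2)}{\pi(V \in A)} \le e^{\varepsilon_1 + \varepsilon_2}.$$

The result is immediate by Bayes' rule.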
Based on Lemma 2.3, we can show the following result, which guarantees that difficulty of reconstruction of a signal is preserved under private mappings.
Assume that the prior on is such that for a tolerance , probability , target function , and loss , we have
for all . Let satisfy -separated differential privacy. Then the pair is -protected against reconstruction for the loss .
Lemma 2.3 immediately implies that for any estimator based on , we have for any realized and
The final quantity is for , as desired. ∎
Let us now provide a more explicit example of loss and reconstruction that is natural in the distributed learning scenarios we consider. We assume that $X \in \mathbb{R}^d$, and for $k \le d$, consider a matrix $A \in \mathbb{R}^{k \times d}$ with orthonormal rows, so that $A A^\top = I_k$ and $A^\top A$ is a projection matrix. We consider the problem of reconstructing

$$f_A(x) = \frac{A x}{\|A x\|_2}. \qquad (7)$$

Thus, the adversary seeks to reconstruct a (normalized) $k$-dimensional projection of the vector $x$. For example, if $x$ is an image or other signal, $A$ may be the first $k$ rows of the standard Fourier transform matrix, so that we seek low frequency information about $x$. (In a wavelet scenario, this may be the first level of the wavelet hierarchy.) In these cases, reconstructing $f_A(x)$ is enough for an adversary to get a general sense of the private data, and protecting against reconstruction is more challenging for small $k$.
Now, for a prior on , let be the induced prior on , and let be the uniform prior on . We define
We use the -distance on the sphere as our loss, (when , otherwise setting ). For uniform and , we have , so that thresholds of the form with small are the most natural to consider in the reconstruction breaches (6). The following proposition demonstrates that locally differentially private mechanisms protect against reconstruction (and as an immediate consequence, that any separated differentially private scheme, Def. 1, does as well). Let be -locally differentially private (2) and . Let as in Eq. (8), and let . Then is -protected against reconstruction for
Simplifying this slightly and rewriting, assuming the reconstruction takes values in , we have
for any of the form and . That is, unless or are of the order of , the probability of obtaining reconstructions better than (nearly) random guessing is extremely low.
Let and . We then have
We collect a number of more or less standard facts on the uniform distribution on the sphere in Appendix C, which we reference frequently. Using Lemma C and Eq. (29), we have for all and that
Because has prior such that , we obtain
Then Lemma 2.3 gives the first result of the proposition.
2.4 Other privacy definitions and existing mechanisms
In our separated mechanisms, we decompose a vector into its unit direction and magnitude . In the sequel, we design, under appropriate conditions, optimal differentially private mechanisms acting on both and , which allow us to provide strong convergence guarantees in different stochastic optimization and learning problems. Given the numerous relaxations of differential privacy [59, 33, 21], however—many of which permit noise addition schemes with simple Gaussian noise addition as well as a number of the benefits of differential privacy—a natural idea is to simply add noise satisfying one of these weaker definitions to . First, these weakenings can never actually protect against a reconstruction breach for all possible observations (Definition 2.1)—they can only protect conditional on the observation lying in some appropriately high probability set (cf. [13, Thm. 1]). Second, as we discuss now, most standard mechanisms add more noise than ours. As we show in the sequel (Section 4), our -differentially private mechanisms release such that and , which we show is optimal.
The first (standard) approach in differential privacy and its weakenings is to add noise via , where is mean zero noise independent of . For utility, we then consider the mean squared error . Because two vectors can satisfy , the standard Laplace mechanism  adds noise vector with , yielding . The -extension of the Laplace mechanism  adds noise with density , so that marginally and . These are evidently of the wrong order of magnitude for .
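To see the order of magnitude of the Laplace mechanism's error on unit vectors, the following sketch simulates it. It assumes the usual calibration: two unit $\ell_2$ vectors can differ by as much as $2\sqrt{d}$ in $\ell_1$-norm, so each coordinate receives Laplace noise of scale $2\sqrt{d}/\varepsilon$, which gives mean squared error $8 d^2 / \varepsilon^2$ (a Laplace variable of scale $b$ has variance $2b^2$). These constants are our own derivation under that assumption, and all function names are hypothetical.

```python
import math
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def laplace_mechanism(u, eps, rng):
    """Add coordinate-wise Laplace noise calibrated to l1-sensitivity 2 * sqrt(d)."""
    d = len(u)
    scale = 2.0 * math.sqrt(d) / eps
    return [ui + laplace_noise(scale, rng) for ui in u]

def theoretical_mse(d, eps):
    """d coordinates, each with variance 2 * (2 * sqrt(d) / eps) ** 2."""
    return 8.0 * d * d / (eps * eps)

def empirical_mse(d, eps, trials, rng):
    u = [1.0 / math.sqrt(d)] * d  # a fixed unit vector
    total = 0.0
    for _ in range(trials):
        z = laplace_mechanism(u, eps, rng)
        total += sum((zi - ui) ** 2 for zi, ui in zip(z, u))
    return total / trials
```

Even at a privacy parameter as large as the dimension, this error remains of order $d$ times larger than the squared norm of the unit vector being released.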
Duchi et al.  provide an alternative sampling strategy that enjoys substantially better dimension dependence. In this case, we sample
To debias the vector , one sets ; as , this mechanism satisfies
This linear scaling in dimension is in fact optimal (and better than the noise addition strategies detailed above); unfortunately, it does not converge to zero as grows, and relaxing privacy does not give a more accurate estimator beyond squared -error.
As we mention above, alternative strategies using weakenings of differential privacy do not provide the strong anti-reconstruction guarantees possible with pure differential privacy. Nonetheless, we discuss three briefly in turn: Rényi , concentrated [33, 21], and approximate differential privacy . In the local privacy case where —the high privacy regime—Duchi and Ruan  show that none of these weakenings offers any benefits in terms of statistical power. Each of these is easiest to describe with a small amount of additional notation. Let denote the distribution of the randomized mechanism conditional on .
Rényi differential privacy
The -Rényi divergence between distributions and is
where . Then a mechanism with distribution is -Rényi-differentially private  if for all . For , if denotes a prior belief on the possible values of and the posterior belief given the observation , then Rényi differential privacy is equivalent to the condition that
that is, the prior and posterior odds ratios do not change (on average) much. Clearly-differential privacy provides -Rényi privacy for all . For these weakenings, the basic mechanism for privatizing a single vector in the sphere is the Gaussian mechanism for . Using that , we see that the choice is sufficient for -Rényi privacy, yielding errors scaling as . It is not completely clear for which values of Rényi-differential privacy provides appropriate protections, but for , this is again evidently of larger magnitude than our differentially private schemes.
Concentrated differential privacy
Approximate differential privacy
The mechanism with conditional satisfies -approximate differential privacy if for all . (Here one thinks of as being a very small value, typically sub-polynomial in the sample size.) In the case that we wish to privatize vectors , the Gaussian mechanism is (to within constants) optimal , adding noise for a numerical constant and yielding error . When , this error too is of higher order than the mechanisms we develop, while providing weaker privacy guarantees. (See also the discussion of McSherry .)
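For comparison, a sketch of the Gaussian mechanism on unit vectors follows. It uses the classical textbook calibration $\sigma = \Delta \sqrt{2 \log(1.25/\delta)} / \varepsilon$ with $\ell_2$-sensitivity $\Delta = 2$, which is known to be valid only for $\varepsilon < 1$; this calibration is an assumption of the sketch and may differ in constants from the one referenced above. Function names are hypothetical.

```python
import math
import random

def gaussian_sigma(eps, delta, sensitivity=2.0):
    """Classical (eps, delta) noise level; the formula is only valid for 0 < eps < 1."""
    if not (0.0 < eps < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("this calibration assumes 0 < eps < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def gaussian_mechanism(u, eps, delta, rng):
    """Release u plus isotropic Gaussian noise."""
    sigma = gaussian_sigma(eps, delta)
    return [ui + rng.gauss(0.0, sigma) for ui in u]

def expected_sq_error(d, eps, delta):
    """Expected squared l2 error: d independent coordinates of variance sigma^2."""
    return d * gaussian_sigma(eps, delta) ** 2
```

The error grows linearly in the dimension and in $\log(1/\delta)$, and under this calibration it scales as $1/\varepsilon^2$.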
3 Applications in federated learning
Our overall goal is to implement federated learning, where distributed units send private updates to a shared model to a centralized location. Recalling our population risk (3), basic distributed learning procedures (without privacy) iterate as follows [16, 25, 20]:
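1. A centralized parameter $\theta$ is distributed among a batch of $b$ workers, each with a local sample $X_i$, $i = 1, \ldots, b$.
2. Each worker computes an update $\Delta_i$ to the model parameters.
3. The centralized procedure aggregates $\{\Delta_i\}_{i=1}^b$ into a global update $\Delta$ and updates $\theta \leftarrow \theta + \Delta$.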
The prototypical situation is to use a stochastic gradient method to implement steps 1–3, so that $\Delta_i = -\eta \nabla \ell(\theta; X_i)$ for some stepsize $\eta > 0$ in step 2, and $\Delta$ is simply the average of the (scaled) stochastic gradients at each sample in step 3.
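The three steps above, in their non-private stochastic gradient form, can be sketched as follows. The squared loss $\frac{1}{2}\|\theta - x\|_2^2$ and the names `grad_sq_loss` and `sgd_round` are illustrative assumptions chosen to keep the example self-contained.

```python
def grad_sq_loss(theta, x):
    """Gradient in theta of the illustrative loss 0.5 * ||theta - x||^2."""
    return [t - xi for t, xi in zip(theta, x)]

def sgd_round(theta, batch, stepsize):
    """One round: broadcast theta, compute local updates, average, and apply."""
    # Step 2: each worker computes its update from its own local sample.
    updates = [[-stepsize * g for g in grad_sq_loss(theta, x)] for x in batch]
    # Step 3: the aggregator averages the updates and moves the central parameter.
    d = len(theta)
    avg = [sum(u[j] for u in updates) / len(updates) for j in range(d)]
    return [t + a for t, a in zip(theta, avg)]

# Two workers holding the samples (1, 1) and (3, 1).
new_theta = sgd_round([0.0, 0.0], [[1.0, 1.0], [3.0, 1.0]], stepsize=0.5)
```

With these inputs the averaged update is $(1.0, 0.5)$, so the central parameter moves halfway toward the mean of the two samples.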
In our private distributed learning context, we elaborate steps 2 and 3 so that each provides privacy protections: in the local update step 2, we use locally private mechanisms to protect individuals' private data, satisfying Definition 2.1 on protection against reconstruction breaches. Then in the central aggregation step 3, we apply centralized differential privacy mechanisms to guarantee that any models communicated to users in the broadcast step 1 are globally private. The overall feedback loop then provides meaningful privacy guarantees, as a user's data is never transmitted in the clear to the centralized server, and strong centralized privacy guarantees mean that the final and intermediate parameters provide no sensitive disclosures.
3.1 A private distributed learning system
Let us more explicitly describe the implementation of a distributed learning system. The outline of our system is similar to the development of Duchi et al. [30, 31, Sec. 5.2] and the system that McMahan et al.  outline; we differ in that we allow more general updates and privatize individual users’ data before communication, as the centralized data aggregator may not be completely trusted. We decompose the system into five components: (1) transmission of the model to users, (2) computing local updates, (3) transmission of the privatized update, (4) centralized aggregation of the updates, and (5) aggregate model privatization.
Transmission of central model
The central aggregator maintains a global model parameter $\theta$ as well as a collection of hyperparameters, which govern the behavior of the individual (distributed) updates as well as the aggregation strategies. The parameter $\theta$ and the hyperparameters are distributed to a subset of the worker nodes (e.g. a device, user, or individual observation holder). This subset is a random subset of expected size $qN$, where $N$ is the total number of potential workers and $q \in (0, 1]$ is the subsampling rate, where each user $i$ is selected independently with probability $q$.
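The subsampling step amounts to independent coin flips, one per potential worker. A minimal sketch, with the hypothetical name `sample_workers`, is:

```python
import random

def sample_workers(num_workers, q, rng):
    """Select each worker independently with probability q.

    The selected set has random size with mean q * num_workers.
    """
    return [i for i in range(num_workers) if rng.random() < q]

rng = random.Random(0)
chosen = sample_workers(10000, 0.1, rng)
```

Because the batch size is random, any normalization by the batch size in later aggregation has to choose between the realized size and the expected size; the two differ by fluctuations on the order of the square root of the expected size.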
Local updates
On the local workers, we leverage the user data to update the central model parameters $\theta$. We consider a generic method $\mathsf{Update}$ that denotes the update rule on each worker. Assuming that worker $i$ has a local sample $x_i$, each performs

$$\theta_i \leftarrow \mathsf{Update}(\theta; x_i).$$

To make this abstract description somewhat more concrete, we note that there are many possible updates for solving the problem (3). Perhaps the most popular rule is to apply a gradient update, where for a stepsize $\eta$ we apply

$$\theta_i \leftarrow \theta - \eta \nabla \ell(\theta; x_i).$$
Transmission of privatized updates
After completing the local update to $\theta_i$, we define the local difference $\Delta_i := \theta_i - \theta$ and transmit this to the aggregator using a separated mechanism (Definition 1). In particular, we privatize both the direction $\Delta_i / \|\Delta_i\|_2$ and magnitude $\|\Delta_i\|_2$ using the privacy preserving mechanisms we detail in Section 4: letting $Z_{i,1}$ be an unbiased (private) estimate of $\Delta_i / \|\Delta_i\|_2$ and $Z_{i,2}$ an unbiased estimate of $\|\Delta_i\|_2$, we return $\hat\Delta_i := Z_{i,1} Z_{i,2}$. Our development of mechanisms in the sequel shows the possible norms and privacy levels here.
Centralized aggregation
Given the collection of privatized updates $\{\hat\Delta_i\}$, we aggregate by projecting each update onto an $\ell_2$-ball of radius $\rho$. Letting $\mathrm{proj}_\rho$ denote this projection, we define the aggregated update

$$\hat\Delta := \frac{1}{qN} \sum_{i} \mathrm{proj}_\rho(\hat\Delta_i). \qquad (10)$$

The presence of the projection is relatively innocuous: for online stochastic gradient settings, where the update is a stochastic gradient scaled by a stepsize that decreases to zero as the iterative procedure continues, the projection is eventually inactive (with probability 1) for any fixed radius. (A formal argument using Borel-Cantelli is straightforward but tedious; we omit this.) The key is that the truncation to a fixed radius allows centralized privacy protections.
Aggregate Model Privatization
Our local privatization with separated differential privacy provides safeguards against reconstruction breaches from adversaries with diffuse prior knowledge; to provide stronger global protections, we incorporate centralized (approximate) differential privacy at the centralized model. The update (10) has -sensitivity at most (modifying a single update can cause to change by at most in -distance). Consequently, addition of appropriate Gaussian noise, as described above, allows us to guarantee -approximate differential privacy . In particular, at the th global update we modify the shared parameter via
so that reflects the $\ell_2$-sensitivity. One must choose the parameter to enforce the desired privacy guarantees after a sufficient number of rounds; in our experimental work, we use the “moments accountant” analysis of Abadi et al. to guarantee centralized privacy.
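The aggregation and central privatization steps can be sketched together as below. This is a sketch under stated assumptions rather than the exact procedure: we assume the projected updates are summed and divided by the expected batch size, and that Gaussian noise is added with standard deviation proportional to `radius / expected_batch`, the scale on which a single update can move the aggregate. The multiplier `sigma` and all function names are hypothetical, and choosing `sigma` to meet a target privacy level requires a separate accounting analysis that is not shown.

```python
import math
import random

def project_l2(v, radius):
    """Project v onto the l2-ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= radius:
        return list(v)
    return [radius * x / norm for x in v]

def aggregate(updates, radius, expected_batch):
    """Sum the projected updates and normalize by the expected batch size."""
    d = len(updates[0])
    projected = [project_l2(u, radius) for u in updates]
    return [sum(p[j] for p in projected) / expected_batch for j in range(d)]

def privatized_step(theta, updates, radius, expected_batch, sigma, rng):
    """Apply the aggregated update plus Gaussian noise scaled to the sensitivity."""
    delta = aggregate(updates, radius, expected_batch)
    noise_std = sigma * radius / expected_batch
    return [t + dj + rng.gauss(0.0, noise_std) for t, dj in zip(theta, delta)]
```

Setting `sigma` to zero recovers the non-private aggregate, which is convenient for checking the projection and averaging logic in isolation.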
3.2 Asymptotic Analysis
To provide a fuller story and demonstrate the general consequences of our development, we now turn to an asymptotic analysis of our distributed statistical learning scheme for solving problem (3) under locally privatized updates. We ignore the central privatization term (11), as it is asymptotically negligible in our setting. To set the stage, we consider minimizing the population risk using an i.i.d. sample for some population .
The simplest strategy in this case is to use the stochastic gradient method, which (using a stepsize sequence ) performs updates
where . In this case, under the assumptions that is in a neighborhood of with and that for some and all , Polyak and Juditsky  provide the following result. [Theorem 2 ] Let . Assume the stepsizes for some . Then under the preceding conditions,
We now consider the impact that local privacy has on this result. Let be a local privatizing mechanism, and define . We assume that each application of the mechanism is (conditional on the pair ) independent of all others. In this case, the stochastic gradient update becomes
As a consequence of Corollary 3.2, we have Let the conditions of Corollary 3.2 hold. Assume that . Assume additionally that the privatization mechanism is (conditional on unbiased and that there exists such that . Then
Key to Corollary 3.2 is that—as we describe in the next section—we can design privatization schemes for which . This is (in a sense) the “correct” scaling of the problem with dimension and local privacy level . This scaling is in contrast to previous work in local privacy (e.g. that by Duchi et al. ). In this work, the best such asymptotics (see Section 5.2.1 of ) have asymptotic mean-square error of . Already , and the given operator norms are potentially much worse than the asymptotics Corollary 3.2 reveals. Thus, our ability to develop mechanisms that are (near) optimal in high regimes allows model fitting that was impossible with previous mechanisms and guarantees of local privacy.
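The following toy simulation illustrates the kind of procedure this asymptotic analysis covers: stochastic gradient steps with unbiased but noisy gradients, followed by averaging of the iterates. It is a scalar sketch under our own assumptions: the loss is $\frac{1}{2}(\theta - x)^2$, the stepsizes have the polynomially decaying form $\eta_0 k^{-\beta}$, and additive Gaussian noise stands in for an unbiased privatization mechanism (it is not itself one of the private mechanisms of this paper). The name `averaged_private_sgd` is hypothetical.

```python
import random

def averaged_private_sgd(samples, eta0, beta, noise_std, rng):
    """Scalar SGD on 0.5 * (theta - x)^2 with noisy gradients and iterate averaging."""
    theta = 0.0
    running_sum = 0.0
    for k, x in enumerate(samples, start=1):
        grad = theta - x                      # exact stochastic gradient
        z = grad + rng.gauss(0.0, noise_std)  # unbiased noisy surrogate for the gradient
        theta -= eta0 * k ** (-beta) * z      # stepsize eta0 * k^(-beta)
        running_sum += theta
    return running_sum / len(samples)         # averaged iterate
```

In this toy problem the averaged iterate concentrates around the population minimizer (the mean of the data), with fluctuations that grow with the variance of the injected noise, mirroring the role the mechanism's second moment plays in the asymptotic covariance.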
4 Private Mechanisms for Releasing High Dimensional Vectors
The main application of the privacy mechanisms we develop is to private (distributed) learning scenarios, where we wish to perform stochastic gradient-like updates to a shared parameter. The key to efficiency in all of these applications is to have accurate estimates of the actual update (frequently simply the stochastic gradient), so in this section, we develop new (and optimal) mechanisms for privatizing $d$-dimensional vectors. We consider two regimes of the highest interest: Euclidean settings [61, 60] (where we wish to privatize vectors belonging to $\ell_2$ balls) and the highly non-Euclidean scenarios that arise in high-dimensional estimation and optimization, e.g. mirror descent settings [60, 14] (where we wish to privatize vectors belonging to $\ell_\infty$ balls). In each application, we consider mechanisms that release an estimate of the magnitude of the vector and of its direction. We thus describe mechanisms for releasing unit vectors in $\ell_2$- and $\ell_\infty$-balls, after which we show how to release the scalar magnitude; the combination allows us to release (optimally accurate) unbiased vector estimates, which we can employ in distributed and online statistical learning problems. We conclude the section with an asymptotic normality result on the convergence of stochastic gradient procedures that unifies our entire development, providing a convergence guarantee that is better than any available for previous locally differentially private learning procedures.
4.1 Privatizing unit vectors with high accuracy
We begin with the Euclidean case, which arises in most classical applications of stochastic gradient-like methods [69, 61, 60]. In this case, we have a vector $u \in \mathbb{S}^{d-1}$ (i.e. $\|u\|_2 = 1$), and we wish to generate an $\varepsilon$-differentially private vector $Z$ with the property that

$$\mathbb{E}[Z \mid u] = u \quad \text{for all } u \in \mathbb{S}^{d-1},$$

where the size $\|Z\|_2$ is as small as possible to maximize the efficiency in Corollary 3.2.
We modify Duchi et al.'s sampling strategy (9) to develop an optimal mechanism. Our insight is that instead of sampling from the hemispheres $\{v : \langle v, u \rangle \ge 0\}$ and $\{v : \langle v, u \rangle < 0\}$, we can instead sample from spherical caps. That is, we draw our vector from a cap $\{v \in \mathbb{S}^{d-1} : \langle v, u \rangle \ge \gamma\}$ with some probability $p \ge 1/2$ or from its complement with probability $1 - p$, where $\gamma$ and $p$ are constants we shift to trade accuracy and privacy more precisely. In Figure 2, we give a visual representation comparing the approach of Duchi et al. and our mechanism, which we term $\mathrm{PrivUnit}_2$ (see Algorithm 1); in the next subsection we demonstrate the choices of $\gamma$, $p$, and scaling factors to make the scheme differentially private and unbiased. Algorithm 1 takes as input $u$, $\gamma$, and $p$ and returns $Z$ satisfying $\mathbb{E}[Z \mid u] = u$. In the algorithm, we require the incomplete beta function

$$B(x; \alpha, \beta) := \int_0^x t^{\alpha - 1} (1 - t)^{\beta - 1} \, dt.$$
For completeness, in Appendix D we show how to approximately sample conditional vectors on the surface of a hypersphere for .
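As a point of reference, a naive way to draw a uniform vector from a spherical cap (or from its complement) is rejection sampling. The sketch below is not the appendix procedure and is only practical when the cap has non-negligible mass, since the acceptance probability decays quickly as the threshold or the dimension grows. The names `uniform_unit_vector` and `sample_cap`, and the threshold parameter `gamma`, are hypothetical.

```python
import math
import random

def uniform_unit_vector(d, rng):
    """Draw a uniform point on the unit sphere by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def sample_cap(u, gamma, rng, in_cap=True, max_tries=100000):
    """Rejection-sample v uniform on {<v, u> >= gamma} (or on its complement)."""
    for _ in range(max_tries):
        v = uniform_unit_vector(len(u), rng)
        inner = sum(a * b for a, b in zip(u, v))
        if (inner >= gamma) == in_cap:
            return v
    raise RuntimeError("acceptance probability too small for rejection sampling")
```

A mechanism of the cap-sampling type would call `sample_cap` with `in_cap=True` with some probability and `in_cap=False` otherwise, then rescale the draw to remove bias; the probability and the scaling are exactly the quantities that need care and are not reproduced here.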