1 Introduction
In many practical applications, the amount of labeled data available from the domain of interest is too modest to train an accurate model. Instead, the learner must resort to labeled samples from an alternative source domain, whose distribution is expected to be close to that of the target domain. Additionally, a large amount of unlabeled data from the target domain is typically also at one’s disposal.
The problem of generalizing from that distinct source domain to a target domain for which few or no labeled data are available is a fundamental challenge in learning theory and algorithmic design known as the domain adaptation problem. We study a privacy-constrained, and thus even more demanding, scenario of domain adaptation, motivated by critical data restrictions in modern applications: in practice, the labeled data available from the source domain is often public with no privacy constraints, while the unlabeled data from the target domain is subject to privacy constraints.
Differential privacy has become the gold standard of privacy-preserving data analysis, as it offers formal and quantitative privacy guarantees and enjoys many attractive properties from an algorithmic design perspective [DR14]. Despite the remarkable progress in the field of differentially private machine learning, the problem of differentially private domain adaptation is still not well-understood. In this work, we present several new differentially private adaptation algorithms for the scenario described above, which we show benefit from strong generalization guarantees. We also report the results of experiments demonstrating their effectiveness. Note that there has been a sequence of publications that provide formal differentially private learning guarantees assuming access to public data
[CH11, BNS13, BTT18, ABM19, NB20, BCM20]. However, their results are not applicable to the adaptation problem we study, since they assume that the source and target domains coincide. The design of our algorithms and their guarantees benefit from the theoretical analysis of domain adaptation in a series of prior publications, starting with the introduction of a distance between distributions by [KBG04] and [BBCP06]. These authors used this notion to derive learning bounds for the zero-one loss (see also the follow-up publications [BCK08, BDBC10]) in terms of a quantity denoted by that depends on the hypothesis set
and the distribution and that cannot be estimated from observations. Later,
[MMR09] and [CM14] introduced the notion of discrepancy, which they used to give a general analysis of single-source adaptation for arbitrary loss functions. The notion of discrepancy is a divergence measure tailored to domain adaptation that coincides with the
distance in the special case of the zero-one loss. Unlike other divergence measures between distributions such as an distance, discrepancy takes into account the loss function and the hypothesis set and, crucially, can be estimated from finite samples. The authors presented Rademacher complexity learning bounds in terms of the discrepancy for arbitrary hypothesis sets and loss functions, as well as pointwise learning bounds for kernel-based hypothesis sets. For regression problems with no privacy constraints on the source or target data, [CM14] gave a discrepancy minimization algorithm based on a reweighting of the losses of sample points. They further presented a series of experimental results demonstrating that their algorithm outperformed all other baselines on a series of tasks. Building on that approach, we design new differentially private discrepancy-based algorithms for adaptation from a source domain with public labeled data to a target domain with unlabeled private data. In Section 3, we briefly present some background material on the discrepancy analysis of adaptation motivating that approach.
The design and analysis of our private algorithms crucially hinge upon several key properties we prove for a smooth approximation of the weighted discrepancy, such as its smoothness with respect to the norm and the sensitivity of its gradient (Section 4). In Section 5, we present new two-stage adaptation algorithms that can be viewed as private counterparts of the discrepancy minimization algorithm of [CM14]. As with that algorithm, the first stage consists of finding a reweighting of the source sample that minimizes the discrepancy; the second stage consists of minimizing a regularized weighted empirical loss based on the reweighting found in the first stage. Since the second stage does not involve private data, only the first stage requires a private solution. Our solutions are based on private variants of the Frank-Wolfe and Mirror Descent algorithms, and they are computationally efficient. We describe these solutions in detail and prove privacy and generalization guarantees for both algorithms. We further compare the benefits of these algorithms as a function of the sample sizes.
In Section 6, we present a new, computationally efficient, single-stage differentially private adaptation algorithm seeking to directly minimize the sum of the weighted empirical loss and the discrepancy. Since attaining the minimum in this case is generally intractable due to the nonconvexity of the objective, our algorithm instead finds an approximate stationary point of this objective. Our algorithm is comprised of a sequence of Frank-Wolfe updates, where each update consists of a differentially private update of the weights and a non-private update of the predictor. In fact, our algorithm can be used in much more general settings of private nonconvex optimization over (a Cartesian product of) domains with different geometries. We formally prove the privacy and convergence guarantees of our algorithm in a general problem setting, and then derive its generalization guarantees in the context of adaptation. Finally, in Section 7, we report our experimental results.
We start with the introduction of preliminary concepts and definitions relevant to our analysis.
2 Preliminaries
Let denote the input space and the output space, which we assume to be a measurable subset of . We assume that is included in the ball of radius , . We will also assume that is included in a bounded interval of diameter . Let be a family of hypotheses mapping from to . We focus on the family of linear hypotheses . We will be mainly interested in the regression setting, though some of our results can be extended to other contexts. For any , we denote by the familiar squared loss of for the labeled point . We denote by an upper bound on the loss: , for all .
Learning scenario:
We identify a domain with a distribution over and refer to the source domain as the one corresponding to a distribution and the target domain, the one corresponding to a distribution . We assume that the learner receives a sample of labeled points drawn i.i.d. from a distribution over and it also has access to a large sample of unlabeled points drawn i.i.d. from , the input marginal distribution associated to . We view the data from , that is sample , as public data, and the data from , sample , as private data.
The objective of the learner is to use the samples and to select a hypothesis with small expected loss with respect to the target domain: . In the absence of any constraints, this coincides with the standard problem of singlesource domain adaptation, studied in a very broad recent literature, starting with the theoretical studies of [BBCP06, MMR09, CM14].
Discrepancy notions:
Clearly, the success of adaptation depends on the closeness of the distributions and , which can be measured according to various divergences. The notion of discrepancy has been shown to be an appropriate measure of divergence between distributions in the context of domain adaptation. We will distinguish the so-called discrepancy , which can only be estimated when sufficient labeled data is available from both distributions, from the standard discrepancy , which can be estimated from finite unlabeled samples from both distributions:
We will be using the two-sided versions of these expressions. For example, we will use
though part of our analysis holds with one-sided definitions too.
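For intuition, the unlabeled discrepancy between two finite samples can be estimated by maximizing the difference of empirical losses over pairs of hypotheses. The following sketch (our own illustration, not code from this paper) does this by brute force for a small finite class of threshold classifiers under the zero-one loss:

```python
import numpy as np

def empirical_discrepancy(xs_src, xs_tgt, hypotheses, loss):
    """Two-sided discrepancy estimate: max over hypothesis pairs (h, h2) of
    the absolute difference between source and target empirical losses."""
    gap = 0.0
    for h in hypotheses:
        for h2 in hypotheses:
            src = np.mean([loss(h(x), h2(x)) for x in xs_src])
            tgt = np.mean([loss(h(x), h2(x)) for x in xs_tgt])
            gap = max(gap, abs(src - tgt))
    return gap
```

For example, with two threshold classifiers and samples that fall on different sides of their disagreement region, the estimate is large; identical samples yield zero. For linear hypotheses and the squared loss, this brute-force maximization is replaced by a closed-form spectral-norm expression (Lemma 1 in Section 4).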
Matrix definitions:
We will adopt the following matrix definitions and notation. We denote by the set of real-valued matrices and by the subset of formed by symmetric matrices. We will denote by the Frobenius product defined for all by . For any matrix , we denote by the th eigenvalue of in decreasing order, and we will also denote by its largest eigenvalue and by its smallest eigenvalue. We also denote by the vector of eigenvalues of . For any , we will denote by the Schatten norm of defined by . Note that corresponds to the spectral norm: , which we also denote by .
Smoothness:
We will say that a continuously differentiable function defined over a vector space is smooth for norm if , where is the dual norm associated with . When is twice differentiable, it is known that the condition on the Hessian implies that is smooth [Sid19, Chapter 5, Lemma 8].
Differential Privacy [DMNS06, DKM06]:
Let . A (randomized) algorithm is differentially private if for all pairs of datasets that differ in exactly one entry, and every measurable , we have: We consider differentially private algorithms that have access to an auxiliary public dataset in addition to their input private dataset . In such a case, we view the public set as being “hardwired” into the algorithm, and the constraint of differential privacy is imposed only w.r.t. the private dataset.
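As a concrete illustration of the definition, a standard way to satisfy differential privacy is to add noise calibrated to the sensitivity of a statistic, i.e., the maximum change of that statistic over neighboring datasets. The sketch below (our own, with illustrative parameter choices; not an algorithm from this paper) privatizes a bounded mean query with the classical Gaussian mechanism:

```python
import numpy as np

def gaussian_mechanism_mean(data, lo, hi, eps, delta, rng):
    """(eps, delta)-DP release of the mean of values clipped to [lo, hi].
    Changing one entry moves the clipped mean by at most (hi - lo) / n."""
    n = len(data)
    clipped = np.clip(data, lo, hi)
    sensitivity = (hi - lo) / n
    # Standard Gaussian-mechanism noise scale for eps <= 1.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return float(np.mean(clipped) + rng.normal(0.0, sigma))
```

The clipping step is what makes the sensitivity bound hold regardless of the input data; without it, a single outlier could change the mean arbitrarily.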
3 Background on discrepancybased generalization bounds
In this section, we briefly present some background material on discrepancy-based generalization guarantees. A more detailed discussion is presented in Appendix A. Let the output label-discrepancy be defined as follows:
where is the labeled version of (i.e., is associated with its true (hidden) labels). Note that measures the difference of the distributions on the input domain. In contrast, accounts for the difference of the output labels in and . Note that under the covariate-shift and separability assumptions, we have . In general, adaptation is not possible when is large, since the labels received on the training sample would then be very different from the target ones. Thus, we will assume, as in previous work, that we have . Then, the following learning bound, expressed in terms of the empirical unlabeled discrepancy , , and the Rademacher complexity of
, holds with probability at least
for all and all distributions over [CM14, CMMM19]:(1) 
The following more explicit upper bound on the Rademacher complexity holds when is the class of linear predictors and the support of is included in the ball of radius : [MRT18]. [CM14] proposed an adaptation algorithm motivated by these learning bounds and other pointwise guarantees expressed in terms of the discrepancy. Their algorithm can be viewed as a two-stage method seeking to minimize the first two terms of this learning bound. It consists of first finding a minimizer of the weighted discrepancy (second term) and then minimizing a (regularized) weighted empirical loss (first term) w.r.t. for that value of .
We will design private adaptation algorithms following a similar two-stage approach, as well as a single-stage approach seeking to choose and to directly minimize the first two terms of the bound. The privacy and accuracy guarantees of our algorithms crucially rely on a careful analysis of a smooth approximation of the discrepancy term, which we present in the following section.
4 Discrepancy analysis and smooth approximation
4.1 Analysis
For the squared loss and , the weighted discrepancy term of the learning bound (1) can be expressed in terms of the spectral norm of a matrix that is an affine function of .
Lemma 1 ([MMR09]).
For any distribution over , the following inequality holds:
where and where , and , .
For completeness, the short proof is given in Appendix B. In view of that, the learning bound (1) suggests seeking and to minimize the first two terms:
(2) 
Note that the second term of the bound is subdifferentiable but it is not differentiable both because of the underlying maximum operator and because the maximum eigenvalue is not differentiable at points where its multiplicity is more than one. Furthermore, the first term of the objective function is convex with respect to and convex with respect to , but it is not jointly convex.
The private algorithms we design require both the smoothness of the objective, which would not hold given the first issue mentioned, and bounded sensitivity of the gradients. Thus, we will instead use a uniform smooth approximation of the second term, for which we analyze in detail the smoothness and gradient sensitivity.
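To make the role of Lemma 1 concrete, the following sketch (our own notation: `q` is the vector of source weights, `X_src` and `X_tgt` the sample matrices) evaluates the matrix of Lemma 1, which is affine in the weights, and its spectral norm, which controls the weighted discrepancy term up to the constants omitted here:

```python
import numpy as np

def discrepancy_matrix(q, X_src, X_tgt):
    """M(q) = sum_i q_i x_i x_i^T - (1/m) sum_j x'_j x'_j^T:
    a symmetric matrix that is an affine function of the weights q."""
    M_src = (X_src * q[:, None]).T @ X_src   # weighted source second moments
    M_tgt = X_tgt.T @ X_tgt / len(X_tgt)     # uniform target second moments
    return M_src - M_tgt

def weighted_discrepancy_bound(q, X_src, X_tgt):
    # Spectral norm (largest singular value) of the symmetric matrix M(q).
    return float(np.linalg.norm(discrepancy_matrix(q, X_src, X_tgt), ord=2))
```

When the reweighted source second moments match those of the target sample, the spectral norm vanishes, which is exactly what the first stage of the two-stage approach of Section 5 seeks.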
4.2 Softmax smooth approximation
A natural approximation of is based on the softmax approximation:
Note that while is a function of the eigenvalues, which are not differentiable everywhere, it is in fact infinitely differentiable, since it can be expressed in terms of the trace of the matrix exponential or the trace of powers of : . The matrix exponential can be computed in , using an SVD of the matrix . The following inequalities follow directly from the properties of the softmax:
(3) 
Note that we have . Thus, for , gives a uniform approximation of . Note that the components of the gradient of are given by
(4) 
where denotes the Frobenius inner product. Both the smoothness and the sensitivity of will be needed for the derivation of our algorithm. We now analyze these properties of the function , using the function , which is defined for any symmetric matrix as follows:
The following result provides the desired smoothness result needed for , which we prove by using the smoothness of .
Theorem 1.
The softmax approximation is smooth for .
The proof is given in Appendix C.1. Next, we analyze the sensitivity of , that is, the maximum variation in norm of when a single point in the sample of size drawn from is changed to another one .
Theorem 2.
The gradient of the softmax approximation is sensitive.
The proof is given in Appendix C.1.
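The uniform-approximation property (3) is easy to verify numerically. The sketch below is our own illustration (with `mu` denoting the softmax temperature parameter, a name we introduce here): it computes the softmax approximation of the largest eigenvalue via the log-trace of the matrix exponential, factoring out the top eigenvalue for numerical stability. For a symmetric matrix of dimension d, the gap to the true largest eigenvalue is at most log(d)/mu:

```python
import numpy as np

def softmax_lambda_max(A, mu):
    """Smooth approximation (1/mu) * log tr(exp(mu * A)) of the largest
    eigenvalue of a symmetric matrix A, computed from its spectrum with the
    top eigenvalue factored out to avoid overflow."""
    evals = np.linalg.eigvalsh(A)
    top = evals.max()
    return top + np.log(np.sum(np.exp(mu * (evals - top)))) / mu
```

The approximation always overestimates the true largest eigenvalue, and the overestimate shrinks as mu grows, at the price of a larger smoothness constant, which is the trade-off the theorems in this section quantify.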
Note that the softmax function is known to be convex [BV14]. Since is an affine function of and composition with affine functions preserves convexity, is also a convex function. The following further shows that is Lipschitz.
Theorem 3.
For any , the gradient of is bounded as follows: .
The proof is given in Appendix C.1. In view of the expression of the weighted discrepancy , the smooth approximation of the maximum eigenvalue of leads immediately to a smooth approximation of , with
Thus, inherits the key properties of gathered in the following corollary.
Corollary 1.
The following properties hold for :

is convex and is a uniform approximation of .

is smooth for .

is sensitive.

for any , .
5 Two-stage private adaptation algorithms
Here, we discuss private solutions for a two-stage approach that consists of first finding that minimizes the empirical discrepancy and then fixing to that value and minimizing the empirical weighted loss over . In the absence of privacy constraints, this coincides with the two-stage algorithm of [CM14]. The first stage consists of seeking to minimize an regularized version of the discrepancy; the second stage simply consists of fixing the solution obtained in the first stage and seeking to minimize the weighted empirical loss:
(5) 
where is the Euclidean ball in of radius . Equivalently, we can define an regularized version of the weighted empirical loss and minimize it over ; namely, solve
where
is a hyperparameter. Regularization in the first stage is done to ensure that the resulting weights
are not too sparse, since sparse solutions can lead to a poor output model in the second stage of the adaptation algorithm. In the second stage, no private data is involved. Thus, in this section, we give two private algorithms for the first stage of discrepancy minimization. Our private algorithms aim at minimizing an regularized version of the smooth approximation, , of the discrepancy discussed in Section 4.2. To emphasize its dependence on the private unlabeled dataset , we will use the notation . Namely, our algorithms aim at privately minimizing an regularized version of :
As mentioned earlier, the regularization term is used to avoid sparse solutions that may impact the accuracy of the output model in the second stage of the adaptation algorithm. Our algorithms are based on private variants of the Frank-Wolfe algorithm and the Mirror Descent algorithm. The general structure of these algorithms follows known private constructions devised in the context of differentially private empirical risk minimization [TGTZ15, BGN21, AFKT21]. However, we note that the guarantees of both algorithms crucially rely on the smoothness and sensitivity properties of the approximation proved in the previous section. Solving the optimization with respect to the smooth approximation of the discrepancy enables us to bound the sensitivity of the gradients (see Theorem 2), which helps us devise private solutions for this problem.
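To convey the flavor of such constructions, here is a schematic noisy Frank-Wolfe loop over the probability simplex (a simplified illustration of our own, not the exact algorithm of Appendix D): since the vertices of the simplex are the standard basis vectors, the linear-minimization step reduces to picking the coordinate with the smallest noisy gradient entry, which is why only per-coordinate gradient noise is needed:

```python
import numpy as np

def noisy_frank_wolfe_simplex(grad_fn, n, steps, noise_scale, rng):
    """Schematic private Frank-Wolfe over the probability simplex: each
    round picks the vertex with the (noisily) smallest gradient coordinate
    and takes a convex step toward it, so iterates stay on the simplex."""
    q = np.full(n, 1.0 / n)  # start at the uniform distribution
    for t in range(steps):
        g = grad_fn(q)
        if noise_scale > 0:
            g = g + rng.laplace(0.0, noise_scale, size=n)
        vertex = np.argmin(g)  # linear minimization over the simplex
        gamma = 2.0 / (t + 2)  # standard Frank-Wolfe step size
        q = (1 - gamma) * q
        q[vertex] += gamma
    return q
```

With the noise turned off, this is the classical Frank-Wolfe method; the noise scale in a real private instantiation would be calibrated to the gradient sensitivity established in Theorem 2.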
We defer the description of these algorithms to Appendix D. We state below their formal guarantees.
Theorem 4.
The smoothness we established for also enables us to use a private variant of the Frank-Wolfe algorithm, whose optimization error scales only logarithmically with .
Theorem 5.
Note that, compared to the guarantees of the private Frank-Wolfe algorithm in Theorem 4, the optimization error of the Noisy Mirror Descent algorithm (Theorem 5) exhibits a better dependence on at the expense of a worse dependence on . In Appendix D, we give full proofs of these theorems.
Note that, by standard stability arguments, the minimum weighted empirical loss of the second stage when training with is close to the minimum weighted empirical loss when training with when the discrepancy between and is small [MMR09]. Theorems 4 and 5 precisely supply guarantees for that closeness in discrepancy via the inequality , thereby guaranteeing the closeness of the loss of our private predictor (the output of the second stage) to the minimum weighted empirical loss. This, together with the learning bound (1), immediately provides a bound on the expected loss of our private predictor.
6 Single-stage private adaptation algorithm
In this section, we give a novel private algorithm for that outputs an approximate stationary point of the smooth approximation of the learning bound: Here, is the smooth approximation of the discrepancy discussed in Section 4.2 (where the subscript in and is used to emphasize the dependence on the private dataset ).
As discussed earlier, the function is generally nonconvex in . Since attaining a global minimizer of is generally intractable, a reasonable alternative is to find an (approximate) stationary point of . Note that is smooth in w.r.t. (as discussed in Section 4.2) and smooth in w.r.t. (due to the nature of the squared loss). These smoothness properties allow us to design our private solution. Given the approximation guarantee (3), the data-dependent terms in the learning bound (1) can thus be approximated by . Hence, our strategy here is to find an approximate stationary point of via our private algorithm, and then derive a learning bound in terms of . The formal definition of an approximate stationary point is given next.
Definition 1 (approximate stationary point).
Let be a differentiable function over a convex and compact subset of a normed vector space. Let . We say that is an approximate stationary point of if the stationarity gap of at , defined as is at most .
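For Frank-Wolfe-type methods, the stationarity gap of Definition 1 is the maximum over the feasible set of the inner product between the gradient and the displacement from the current point, and over the simplex this maximum is attained at a vertex, giving a closed form. A small sketch in our own notation:

```python
import numpy as np

def fw_gap_simplex(grad, x):
    """Frank-Wolfe stationarity gap over the probability simplex:
    max_{v in C} <grad f(x), x - v> = <grad, x> - min_i grad_i."""
    return float(grad @ x - grad.min())
```

A gap of zero certifies stationarity; for a linear objective, any vertex minimizing the gradient coordinate achieves gap zero.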
First, we will give a generic differentially private algorithm for approximating a stationary point of smooth nonconvex objectives (defined by a private dataset ) that satisfy certain smoothness and Lipschitzness conditions. We give formal definitions of these conditions below.
Definition 2 (Lipschitz function).
Consider a function , where is a convex set whose diameter is bounded by (we refer to as bounded set), and is a convex bounded set. Let . We say that is Lipschitz if for any , is Lipschitz w.r.t. over , and for every , is Lipschitz w.r.t. over .
Definition 3 (smooth function).
This notion is defined analogously. We say that is smooth if for any , is smooth w.r.t. over , and for every , is smooth w.r.t. over .
Our private algorithm (Algorithm 1 below) takes as input an objective , where is a convex polyhedral set with bounded diameter and is a convex set with bounded diameter. Hence, our objective mentioned earlier is a special case. The algorithm is comprised of a number of rounds, where in each round, two private Frank-Wolfe update steps are performed: one for and another for . The privacy mechanism for each is different due to the different geometries of and . We note that in the special case where , there is no need to privatize the Frank-Wolfe step for , since that update step depends only on the weighted empirical loss over the public data and differential privacy is closed under post-processing (the previous update step for is carried out in a differentially private manner).
When satisfies the Lipschitzness and smoothness properties defined above w.r.t. and , we give formal convergence guarantees to a stationary point in terms of a high-probability bound on the stationarity gap of the output (see Definition 1). Despite the different geometries of and , our final bound is roughly the sum of the bounds we would obtain if we ran two separate Frank-Wolfe algorithms (one over and the other over ). This is mainly due to the hybrid Lipschitzness and smoothness conditions ( for and for ), which enable us to decompose the bound on the convergence rate over and .
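The two coupled updates can be sketched as follows (a simplified illustration of our own with exact gradients and generic step sizes, not Algorithm 1 itself): the step in the weights is a noisy Frank-Wolfe step over the simplex, while the step in the predictor performs linear minimization over an L2 ball, whose minimizer is the scaled negative gradient direction:

```python
import numpy as np

def alternating_frank_wolfe(grad_q, grad_w, q0, w0, radius, steps,
                            noise_scale, rng):
    """Schematic single-stage scheme: each round takes a (noisy) Frank-Wolfe
    step in q over the simplex and a plain Frank-Wolfe step in w over an
    L2 ball of the given radius."""
    q, w = q0.copy(), w0.copy()
    for t in range(steps):
        gamma = 2.0 / (t + 2)
        gq = grad_q(q, w)
        if noise_scale > 0:
            gq = gq + rng.laplace(0.0, noise_scale, size=len(q))
        q = (1 - gamma) * q
        q[np.argmin(gq)] += gamma          # simplex vertex step
        gw = grad_w(q, w)
        norm = np.linalg.norm(gw)
        v = -radius * gw / norm if norm > 0 else np.zeros_like(w)
        w = (1 - gamma) * w + gamma * v    # ball vertex step
    return q, w
```

Both iterates remain feasible by construction, since each update is a convex combination of the current point and a vertex of the respective feasible set.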
Theorem 6.
Algorithm 1 is differentially private. Assume that the objective is Lipschitz and smooth. Assume further that for all and , . Then, for any , there exists a choice of and such that with probability at least (over the algorithm’s randomness), the stationarity gap of the output is upper bounded by
The proof is given in Appendix E. We note that our adaptation objective satisfies all the conditions in Theorem 6. In Appendix E.1, we give a detailed discussion regarding instantiating Algorithm 1 with and the specific settings of all the parameters in this special case. As a result, we immediately reach the following corollary.
Corollary 2.
Let be the input to Algorithm 1. Let . There exists a choice of and for which, with probability at least , the output of the algorithm is an approximate stationary point of with stationarity gap upper bounded as
Hence, bound (1) implies that w.p. over the choice of the public and private datasets and the algorithm’s internal randomness, the expected loss of the predictor (defined by the output ) w.r.t. the target domain is bounded as
Remark.
Note that is an approximate stationary point of . In practice, can be an approximation of a good local minimum of as demonstrated by our experiments. In such situations, the above bound implies a good prediction accuracy for the output predictor. Note also that the bound above is given in terms of the softmax approximation parameter . In general, this parameter should be treated as a hyperparameter and tuned appropriately to minimize the above bound. One reasonable choice of can be obtained by balancing the bound on the stationarity gap with the error term due to the softmax approximation. In such case, .
7 Experiments
Figure 1: (a) Results of private discrepancy minimization for different values of . (b) Comparison of the single-stage and two-stage private adaptation algorithms.
The objective of this section is to provide proof-of-concept experiments demonstrating that reasonable privacy guarantees can be achieved when using our private domain adaptation algorithms. We use a setting similar to that of [CM14, Section 7.1] and demonstrate that the utility of private adaptation degrades gracefully with increased privacy guarantees and that the single-stage Frank-Wolfe algorithm performs best in most scenarios.
We carried out experiments with the following synthetic dataset. Let and . Let be a spherical Gaussian centered at and with variance in all directions. Let be a Gaussian distribution with mean and with variance in all directions. We defined the labeling function via if , otherwise, where . We chose the target distribution to be and the source distribution as a mixture of and with the weight of set to . We fixed the number of source samples to be and varied the number of unlabeled target samples from to . All experiments were repeated ten times for statistical consistency. We set , , the privacy parameter , and varied in the experiments. The standard deviations were calculated over runs.
In this setup, we first ran differentially private discrepancy minimization using Algorithms 2 and 3. We plotted for different values of in Figure 1(a). The performance of the noisy Frank-Wolfe algorithm degrades smoothly with and improves with . However, the performance of the noisy mirror descent algorithm is much worse. This is in line with the theoretical guarantees, as in these experiments and the noisy Frank-Wolfe algorithm has a better convergence guarantee in this regime. We expect mirror descent to perform better with much larger values of . Furthermore, observe that noisy mirror descent has a high standard deviation compared to the Frank-Wolfe algorithm, as the noise added in mirror descent scales polynomially in , whereas it scales only logarithmically in for the Frank-Wolfe algorithm.
We next compared our single-stage (Algorithm 1) and two-stage differentially private algorithms with the model trained only on the public dataset (Figure 1(b)). As an oracle baseline, we also plotted the model trained with the labeled private dataset. Note that this model uses extra information that is not available during training and is plotted for illustration purposes only. The single-stage Frank-Wolfe algorithm without privacy achieves the same performance as the model trained on the labeled private dataset. It performs better than the two-stage Frank-Wolfe algorithm; however, the gap decreases as the privacy guarantee improves. The performance of the mirror descent algorithm without differential privacy is similar to that of the Frank-Wolfe algorithm; however, as the theory indicates, its performance degrades quickly with the privacy parameter. As in Figure 1(a), the performance of the noisy mirror descent algorithm is much worse and has a high standard deviation.
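For readers who wish to reproduce the general setup, a data-generation sketch in the spirit of this section is given below. The specific means, variances, and mixture weight used in our experiments do not survive in the text above, so every numeric choice in this sketch is an illustrative assumption, not the paper's setting:

```python
import numpy as np

def make_adaptation_data(n_src, n_tgt, d, shift, mix_weight, rng):
    """Synthetic source/target pair: the target is a standard spherical
    Gaussian; the source mixes a shifted spherical Gaussian with the
    target distribution. All numeric choices are illustrative."""
    tgt = rng.normal(0.0, 1.0, size=(n_tgt, d))
    shifted = rng.normal(shift, 1.0, size=(n_src, d))
    from_tgt = rng.normal(0.0, 1.0, size=(n_src, d))
    pick = rng.random(n_src) < mix_weight   # which mixture component
    src = np.where(pick[:, None], shifted, from_tgt)
    return src, tgt
```

Varying the shift and the mixture weight controls how far the source distribution is from the target, and hence how large the discrepancy term is.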
8 Conclusion
We presented new differentially private adaptation algorithms benefiting from strong theoretical guarantees. Our analysis can form the basis for the study of privacy in other related adaptation scenarios, including scenarios where a small amount of (private) labeled data is also available from the target domain and those with multiple sources. Our single-stage private algorithm is further likely to be of independent interest for private optimization of other similar objective functions.
Acknowledgements
This work was done while RB was visiting Google, NY. RB’s research at OSU is supported by NSF Award AF1908281, NSF Award 2112471, Google Faculty Research Award, and NSF CAREER Award 2144532.
References
 [ABM19] Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public data. NeurIPS 2019, also available at arXiv:1910.11519 [cs.LG], 2019.
 [ACG16] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
 [AFKT21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in L1 geometry. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 393–403. PMLR, 2021.
 [BBCP06] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Bernhard Schölkopf, John C. Platt, and Thomas Hofmann, editors, Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4–7, 2006, pages 137–144. MIT Press, 2006.
 [BCK08] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008.
 [BCM20] Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, and Steven Wu. Private query release assisted by public data. In International Conference on Machine Learning, pages 695–703. PMLR, 2020.
 [BDBC10] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1–2):151–175, 2010.
 [BGM21] Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private stochastic optimization: New results in convex and nonconvex settings. Advances in Neural Information Processing Systems, 34, 2021.
 [BGN21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. arXiv preprint arXiv:2103.01278, 2021.
 [BLST10] Raghav Bhaskar, Srivatsan Laxman, Adam Smith, and Abhradeep Thakurta. Discovering frequent patterns in sensitive data. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 503–512, 2010.
 [BNS13] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 363–378. Springer, 2013.
 [BTT18] Raef Bassily, Abhradeep Thakurta, and Om Thakkar. Model-agnostic private learning. In Advances in Neural Information Processing Systems 31, pages 7102–7112. Curran Associates, Inc., 2018.
 [BV14] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2014.
 [CH11] Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 155–186, 2011.
 [CM14] Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theor. Comput. Sci., 519:103–126, 2014.
 [CMMM19] Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation based on generalized discrepancy. J. Mach. Learn. Res., 20:1:1–1:30, 2019.
 [DKM06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.
 [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
 [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
 [JN08] Anatoli Juditsky and Arkadi Nemirovski. Large deviations of vectorvalued martingales in 2smooth normed spaces. Rapport de recherche hal00318071, HAL, 2008.
 [KBG04] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Mario A. Nascimento, M. Tamer Özsu, Donald Kossmann, Renée J. Miller, José A. Blakeley, and K. Bernhard Schiefer, editors, (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, VLDB 2004, Toronto, Canada, August 31 – September 3, 2004, pages 180–191. Morgan Kaufmann, 2004.
 [MMR09] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT 2009 – The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18–21, 2009, 2009.
 [MRT18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, 2018. Second edition.
 [NB20] Anupama Nandi and Raef Bassily. Privately answering classification queries in the agnostic pac model. In Algorithmic Learning Theory, pages 687–703, 2020.
 [Nes07] Yurii Nesterov. Smoothing technique and its applications in semidefinite optimization. Math. Program., 110:245–259, 2007.
 [NJLS09] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
 [NY83] A.S. Nemirovsky and D.B. Yudin. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication. Wiley, 1983.
 [Sid19] Aaron Sidford. Introduction to optimization theory  MS&E213 / CS269O. Stanford Course Notes, 2019. https://web.stanford.edu/~sidford/courses/19fa_opt_theory/.
 [TGTZ15] Kunal Talwar, Abhradeep Guha Thakurta, and Li Zhang. Nearly optimal private lasso. Advances in Neural Information Processing Systems, 28:3025–3033, 2015.
Appendix A Background on discrepancybased generalization bounds
In this section, we briefly present some background material on discrepancybased generalization guarantees.
The following learning bound was given by [CMMM19]: for any , with probability at least over the draw of a sample , for any distribution over , for all , the following inequality holds:
(6) 
This bound is tight in the sense that, for the hypothesis reaching the maximum in the definition of the discrepancy, the bound coincides with the standard Rademacher complexity bound on [CMMM19]. The bound suggests choosing and the distribution to minimize the right-hand side. The first term of the bound is not jointly convex with respect to and . Instead, the algorithm suggested by [CM14] (see also [MMR09]) consists of a two-stage procedure: first choose to minimize the weighted empirical discrepancy, then fix and choose to minimize the weighted empirical loss .
In practice, we do not have labeled data from , or too little to accurately minimize the discrepancy, since otherwise adaptation would not even be necessary and we could directly use labeled data from for training. Instead, we upper bound the discrepancy in terms of the discrepancy and the output label-discrepancy defined as follows:
where is the labeled version of (i.e., is associated with its true (hidden) labels). Note that measures the difference of the distributions on the input domain. In contrast, accounts for the difference of the output labels in and . We will assume that . Note that under the covariate-shift and separability assumptions, we have . In general, adaptation is not possible when can be large, since the labels received on the training sample can be different from the target ones.
We will say that a loss function is admissible if for all and [CMMM19]. Note that this is a slightly weaker condition than that of Lipschitzness of the loss with respect to its first argument.
Theorem 7.
Let be a admissible loss. Then, the following upper bound holds:
The proof is given in Appendix B. Note that the squared loss is admissible: since the function is 2-Lipschitz on , we have . Thus, the learning bound (6) can be expressed in terms of the discrepancy and the Rademacher complexity of as follows, using the fact [MRT18, Prop. 11.2]:
We will be considering a family of linear hypotheses and will be assuming that the support of is included in the ball of radius . The following more explicit upper bound on the Rademacher complexity then holds when the support of is included in the ball of radius : [MRT18].
Appendix B Discrepancy analysis and bounds
See 7
Proof.
For any hypothesis in , we can write