
Private Domain Adaptation from a Public Source

A key problem in a variety of applications is that of domain adaptation from a public source domain, for which a relatively large amount of labeled data with no privacy constraints is at one's disposal, to a private target domain, for which a private sample is available with very few or no labeled data. In regression problems with no privacy constraints on the source or target data, a discrepancy minimization algorithm based on several theoretical guarantees was shown to outperform a number of other adaptation algorithm baselines. Building on that approach, we design differentially private discrepancy-based algorithms for adaptation from a source domain with public labeled data to a target domain with unlabeled private data. The design and analysis of our private algorithms critically hinge upon several key properties we prove for a smooth approximation of the weighted discrepancy, such as its smoothness with respect to the ℓ_1-norm and the sensitivity of its gradient. Our solutions are based on private variants of Frank-Wolfe and Mirror-Descent algorithms. We show that our adaptation algorithms benefit from strong generalization and privacy guarantees and report the results of experiments demonstrating their effectiveness.



1 Introduction

In a variety of applications in practice, the amount of labeled data available from the domain of interest is too modest to train an accurate model. Instead, the learner must resort to using labeled samples from an alternative source domain, whose distribution is expected to be close to that of the target domain. Additionally, typically a large amount of unlabeled data from the target domain is also at one’s disposal.

The problem of generalizing from that distinct source domain to a target domain for which little or no labeled data is available is a fundamental challenge in learning theory and algorithmic design, known as the domain adaptation problem. We study a privacy-constrained, and thus even more demanding, scenario of domain adaptation, motivated by critical data restrictions in modern applications: in practice, the labeled data available from the source domain is often public, with no privacy constraints, while the unlabeled data from the target domain is subject to privacy constraints.

Differential privacy has become the gold standard of privacy-preserving data analysis, as it offers formal and quantitative privacy guarantees and enjoys many attractive properties from an algorithmic design perspective [DR14]. Despite the remarkable progress in the field of differentially private machine learning, the problem of differentially private domain adaptation is still not well understood. In this work, we present several new differentially private adaptation algorithms for the scenario described above, which we show benefit from strong generalization guarantees. We also report the results of experiments demonstrating their effectiveness. Note that there has been a sequence of publications providing formal differentially private learning guarantees assuming access to public data [CH11, BNS13, BTT18, ABM19, NB20, BCM20]. However, their results are not applicable to the adaptation problem we study, since they assume that the source and target domains coincide.

The design of our algorithms and their guarantees benefit from the theoretical analysis of domain adaptation in a series of prior publications, starting with the introduction of the $d_{\mathcal{A}}$-distance between distributions by [KBG04] and [BBCP06]. These authors used this notion to derive learning bounds for the zero-one loss (see also the follow-up publications [BCK08, BDBC10]), expressed in terms of a quantity that depends on the hypothesis set and the distributions and that cannot be estimated from observations. Later, [MMR09] and [CM14] introduced the notion of discrepancy, which they used to give a general analysis of single-source adaptation for arbitrary loss functions. The notion of discrepancy is a divergence measure tailored to domain adaptation that coincides with the $d_{\mathcal{A}}$-distance in the special case of the zero-one loss. Unlike other divergence measures between distributions, such as an $\ell_1$-distance, the discrepancy takes into account the loss function and the hypothesis set and, crucially, can be estimated from finite samples. The authors presented Rademacher complexity learning bounds in terms of the discrepancy for arbitrary hypothesis sets and loss functions, as well as pointwise learning bounds for kernel-based hypothesis sets.

For regression problems with no privacy constraints on the source or target data, [CM14] gave a discrepancy minimization algorithm based on a reweighting of the losses of sample points. They further presented a series of experimental results demonstrating that their algorithm outperformed all other baselines in a series of tasks. Building on that approach, we design new differentially private discrepancy-based algorithms for adaptation from a source domain with public labeled data to a target domain with unlabeled private data. In Section 3, we briefly present some background material on the discrepancy analysis of adaptation motivating that approach.

The design and analysis of our private algorithms crucially hinge upon several key properties we prove for a smooth approximation of the weighted discrepancy, such as its smoothness with respect to the $\ell_1$-norm and the sensitivity of its gradient (Section 4). In Section 5, we present new two-stage adaptation algorithms that can be viewed as private counterparts of the discrepancy minimization algorithm of [CM14]. As with that algorithm, the first stage consists of finding a reweighting of the source sample that minimizes the discrepancy, and the second stage of minimizing a regularized weighted empirical loss based on the reweighting found in the first stage. Since the second stage does not involve private data, only the first stage requires a private solution. Our solutions are based on private variants of the Frank-Wolfe and Mirror Descent algorithms, and they are computationally efficient. We describe these solutions in detail and prove privacy and generalization guarantees for both algorithms. We further compare the benefits of these algorithms as a function of the sample sizes.

In Section 6, we present a new, computationally efficient, single-stage differentially private adaptation algorithm seeking to directly minimize the sum of the weighted empirical loss and the discrepancy. Since attaining the minimum in this case is generally intractable due to non-convexity of the objective, instead, our algorithm finds an approximate stationary point of this objective. Our algorithm is comprised of a sequence of Frank-Wolfe updates, where each update consists of a differentially private update of the weights and a non-private update of the predictor. In fact, our algorithm can be used in much more general settings of private non-convex optimization over (a Cartesian product of) domains with different geometries. We formally prove the privacy and convergence guarantees of our algorithm in a general problem setting, and then derive its generalization guarantees in the context of adaptation. Finally, in Section 7, we report our experimental results.

We start with the introduction of preliminary concepts and definitions relevant to our analysis.

2 Preliminaries

Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space, which we assume to be a measurable subset of $\mathbb{R}$. We assume that $\mathcal{X}$ is included in the Euclidean ball of radius $r$. We will also assume that $\mathcal{Y}$ is included in a bounded interval. Let $\mathcal{H}$ be a family of hypotheses mapping from $\mathcal{X}$ to $\mathcal{Y}$. We focus on the family of linear hypotheses $\mathcal{H} = \{x \mapsto w \cdot x \colon \|w\| \le \Lambda\}$. We will be mainly interested in the regression setting, though some of our results can be extended to other contexts. For any $h \in \mathcal{H}$, we denote by $\ell(h(x), y) = (h(x) - y)^2$ the familiar squared loss of $h$ for the labeled point $(x, y)$. We denote by $M$ an upper bound on the loss: $\ell(h(x), y) \le M$, for all $h \in \mathcal{H}$ and $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

Learning scenario:

We identify a domain with a distribution over $\mathcal{X} \times \mathcal{Y}$ and refer to the source domain as the one corresponding to a distribution $\mathcal{P}$ and to the target domain as the one corresponding to a distribution $\mathcal{Q}$. We assume that the learner receives a sample $\mathcal{S}$ of $m$ labeled points drawn i.i.d. from the source distribution $\mathcal{P}$ over $\mathcal{X} \times \mathcal{Y}$, and that it also has access to a large sample $\mathcal{T}$ of $n$ unlabeled points drawn i.i.d. from $\mathcal{Q}_{\mathcal{X}}$, the input marginal distribution associated to $\mathcal{Q}$. We view the data from $\mathcal{P}$, that is the sample $\mathcal{S}$, as public data, and the data from $\mathcal{Q}_{\mathcal{X}}$, the sample $\mathcal{T}$, as private data.

The objective of the learner is to use the samples $\mathcal{S}$ and $\mathcal{T}$ to select a hypothesis $h \in \mathcal{H}$ with small expected loss with respect to the target domain: $\mathcal{L}_{\mathcal{Q}}(h) = \mathbb{E}_{(x, y) \sim \mathcal{Q}}[\ell(h(x), y)]$. In the absence of any privacy constraints, this coincides with the standard problem of single-source domain adaptation, studied in a broad recent literature, starting with the theoretical studies of [BBCP06, MMR09, CM14].

Discrepancy notions:

Clearly, the success of adaptation depends on the closeness of the distributions $\mathcal{P}$ and $\mathcal{Q}$, which can be measured according to various divergences. The notion of discrepancy has been shown to be an appropriate measure of divergence between distributions in the context of domain adaptation. We will distinguish the so-called $\mathcal{Y}$-discrepancy $\mathrm{disc}_{\mathcal{Y}}$, which can only be estimated when sufficient labeled data is available from both distributions, and the standard discrepancy $\mathrm{disc}$, which can be estimated from finite unlabeled samples from both distributions:

$\mathrm{disc}_{\mathcal{Y}}(\mathcal{P}, \mathcal{Q}) = \sup_{h \in \mathcal{H}} \Big| \mathbb{E}_{(x, y) \sim \mathcal{P}}\big[\ell(h(x), y)\big] - \mathbb{E}_{(x, y) \sim \mathcal{Q}}\big[\ell(h(x), y)\big] \Big|,$

$\mathrm{disc}(\mathcal{P}_{\mathcal{X}}, \mathcal{Q}_{\mathcal{X}}) = \sup_{h, h' \in \mathcal{H}} \Big| \mathbb{E}_{x \sim \mathcal{P}_{\mathcal{X}}}\big[\ell(h(x), h'(x))\big] - \mathbb{E}_{x \sim \mathcal{Q}_{\mathcal{X}}}\big[\ell(h(x), h'(x))\big] \Big|.$

These are the two-sided versions of the definitions, which we will be using throughout, though part of our analysis holds with one-sided definitions too.

Matrix definitions:

We will adopt the following matrix definitions and notation. We denote by $\mathbb{R}^{d \times d}$ the set of $d \times d$ real-valued matrices and by $\mathbb{S}^{d \times d}$ the subset of $\mathbb{R}^{d \times d}$ formed by symmetric matrices. We denote by $\langle \cdot, \cdot \rangle_F$ the Frobenius product defined for all $A, B \in \mathbb{R}^{d \times d}$ by $\langle A, B \rangle_F = \mathrm{Tr}(A^\top B)$. For any matrix $A \in \mathbb{S}^{d \times d}$, we denote by $\lambda_k(A)$ the $k$th eigenvalue of $A$ in decreasing order; we also denote by $\lambda_{\max}(A)$ its largest eigenvalue, by $\lambda_{\min}(A)$ its smallest eigenvalue, and by $\lambda(A)$ the vector of eigenvalues of $A$. For any $p \ge 1$, we denote by $\|A\|_p$ the $p$-Schatten norm of $A$, defined by $\|A\|_p = \|\lambda(A)\|_p$. Note that $p = \infty$ corresponds to the spectral norm, which we also denote by $\|A\|$.

Smoothness:

We will say that a continuously differentiable function $f$ defined over a vector space is $\beta$-smooth for a norm $\|\cdot\|$ if $\|\nabla f(x) - \nabla f(y)\|_* \le \beta\, \|x - y\|$ for all $x, y$, where $\|\cdot\|_*$ is the dual norm associated to $\|\cdot\|$. When $f$ is twice differentiable, it is known that the Hessian condition $\sup_{\|u\| \le 1} \|\nabla^2 f(x)\, u\|_* \le \beta$ for all $x$ implies that $f$ is $\beta$-smooth [Sid19, Chapter 5, Lemma 8].

Differential Privacy [DMNS06, DKM06]:

Let $\epsilon, \delta \ge 0$. A (randomized) algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if for all pairs of datasets $\mathcal{T}, \mathcal{T}'$ that differ in exactly one entry, and every measurable set $O$ of outputs, we have $\Pr[\mathcal{A}(\mathcal{T}) \in O] \le e^{\epsilon}\, \Pr[\mathcal{A}(\mathcal{T}') \in O] + \delta$. We consider differentially private algorithms that have access to an auxiliary public dataset $\mathcal{S}$ in addition to their input private dataset $\mathcal{T}$. In such a case, we view the public set $\mathcal{S}$ as being “hardwired” into the algorithm, and the constraint of differential privacy is imposed only with respect to the private dataset $\mathcal{T}$.
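As a concrete, generic illustration of this definition (not one of the algorithms of this paper), the following sketch privatizes a bounded-sensitivity statistic with the standard Gaussian mechanism; the function name and all parameters are illustrative.

import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=None):
    """Release `value` (a scalar or array) with (epsilon, delta)-differential privacy.

    Uses the classical Gaussian-mechanism calibration
    sigma = l2_sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon,
    which is valid for epsilon <= 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    value = np.asarray(value, dtype=float)
    return value + rng.normal(scale=sigma, size=value.shape)

# Example: privately release the mean of a private sample with entries in [0, 1].
# Replacing one entry changes the mean by at most 1/n, so the l2-sensitivity is 1/n.
private_sample = np.random.rand(1000)
n = len(private_sample)
noisy_mean = gaussian_mechanism(private_sample.mean(), l2_sensitivity=1.0 / n,
                                epsilon=0.5, delta=1e-5)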

3 Background on discrepancy-based generalization bounds

In this section, we briefly present some background material on discrepancy-based generalization guarantees. A more detailed discussion is presented in Appendix A. Let the output label-discrepancy $\eta_{\mathcal{H}}$ denote the term that accounts for the difference between the labels of the two domains, defined with respect to $\overline{\mathcal{T}}$, the labeled version of $\mathcal{T}$ (that is, $\mathcal{T}$ associated with its true, hidden labels). Note that the discrepancy $\mathrm{disc}$ measures the difference of the distributions on the input domain; in contrast, $\eta_{\mathcal{H}}$ accounts for the difference of the output labels in the source and target domains. Under the covariate-shift and separability assumptions, we have $\eta_{\mathcal{H}} = 0$. In general, adaptation is not possible when $\eta_{\mathcal{H}}$ is large, since the labels received on the training sample would then be very different from the target ones. Thus, we will assume, as in previous work, that $\eta_{\mathcal{H}}$ is small. Then, the following learning bound, expressed in terms of the empirical unlabeled discrepancy $\mathrm{disc}(\mathsf{q})$ between the $\mathsf{q}$-reweighted empirical source distribution and the empirical target distribution, the label-discrepancy $\eta_{\mathcal{H}}$, and the Rademacher complexity $\mathfrak{R}_m(\mathcal{H})$ of $\mathcal{H}$, holds with probability at least $1 - \rho$ for all $h \in \mathcal{H}$ and all distributions $\mathsf{q}$ over the source sample [CM14, CMMM19]:

$\mathcal{L}_{\mathcal{Q}}(h) \;\le\; \sum_{i=1}^{m} \mathsf{q}_i\, \ell\big(h(x_i), y_i\big) \;+\; \mathrm{disc}(\mathsf{q}) \;+\; \eta_{\mathcal{H}} \;+\; 2\, \mathfrak{R}_m(\mathcal{H}) \;+\; M \sqrt{\frac{\log(1/\rho)}{2m}}.$    (1)

The following more explicit upper bound on the Rademacher complexity holds when $\mathcal{H}$ is the class of linear predictors with $\|w\| \le \Lambda$ and the support of the input distribution is included in the ball of radius $r$: $\mathfrak{R}_m(\mathcal{H}) \le \frac{r \Lambda}{\sqrt{m}}$ [MRT18]. [CM14] proposed an adaptation algorithm motivated by these learning bounds and other pointwise guarantees expressed in terms of discrepancy. Their algorithm can be viewed as a two-stage method seeking to minimize the first two terms of this learning bound: it consists of first finding a minimizer $\mathsf{q}$ of the weighted discrepancy (second term) and then minimizing a (regularized) $\mathsf{q}$-weighted empirical loss (first term) with respect to $h$ for that value of $\mathsf{q}$.

We will design private adaptation algorithms following a similar two-stage approach, as well as a single-stage approach seeking to choose $\mathsf{q}$ and $h$ to directly minimize the first two terms of the bound. The privacy and accuracy guarantees of our algorithms crucially rely on a careful analysis of a smooth approximation of the discrepancy term, which we present in the following section.

4 Discrepancy analysis and smooth approximation

4.1 Analysis

For the squared loss and linear hypotheses, the weighted discrepancy term of the learning bound (1) can be expressed in terms of the spectral norm of a matrix that is an affine function of $\mathsf{q}$.

Lemma 1 ([MMR09]).

For any distribution $\mathsf{q}$ over the source sample, the following inequality holds:

$\mathrm{disc}(\mathsf{q}) \;\le\; 4 \Lambda^2\, \big\| \mathbf{M}(\mathsf{q}) \big\|,$

where $\mathbf{M}(\mathsf{q}) = \sum_{i=1}^{m} \mathsf{q}_i\, x_i x_i^\top - \frac{1}{n} \sum_{j=1}^{n} x'_j {x'_j}^\top$, with $x_1, \dots, x_m$ the source sample points and $x'_1, \dots, x'_n$ the target sample points.

For completeness, the short proof is given in Appendix B. In view of that, the learning bound (1) suggests seeking $\mathsf{q}$ and $h$ to minimize the first two terms:

$\min_{h \in \mathcal{H},\, \mathsf{q}} \;\; \sum_{i=1}^{m} \mathsf{q}_i\, \ell\big(h(x_i), y_i\big) \;+\; 4 \Lambda^2\, \big\| \mathbf{M}(\mathsf{q}) \big\|.$    (2)

Note that the second term of this objective is sub-differentiable but not differentiable, both because of the underlying maximum operator and because the maximum eigenvalue is not differentiable at points where its multiplicity is more than one. Furthermore, the first term of the objective function is convex with respect to $h$ and convex with respect to $\mathsf{q}$, but it is not jointly convex.

The private algorithms we design require both the smoothness of the objective, which does not hold in view of the first issue just mentioned, and a bound on the sensitivity of its gradients. Thus, we will instead use a uniform smooth approximation of the second term, for which we analyze in detail the smoothness and the gradient sensitivity.
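To make the quantities in Lemma 1 and objective (2) concrete, here is a minimal numerical sketch, assuming the matrix $\mathbf{M}(\mathsf{q})$ is the difference between the $\mathsf{q}$-weighted source second-moment matrix and the uniform target second-moment matrix as reconstructed above; the constant $4\Lambda^2$ and all names are illustrative.

import numpy as np

def M_of_q(q, X_src, X_tgt):
    """Affine map q -> M(q): q-weighted source second-moment matrix minus
    the (uniform) target second-moment matrix."""
    src_moment = (X_src * q[:, None]).T @ X_src       # sum_i q_i x_i x_i^T
    tgt_moment = X_tgt.T @ X_tgt / X_tgt.shape[0]     # (1/n) sum_j x'_j x'_j^T
    return src_moment - tgt_moment

def disc_upper_bound(q, X_src, X_tgt, Lambda=1.0):
    """Spectral-norm bound on the weighted discrepancy for the squared loss
    and linear hypotheses with ||w|| <= Lambda (cf. Lemma 1)."""
    M = M_of_q(q, X_src, X_tgt)
    spectral_norm = np.linalg.norm(M, ord=2)          # largest singular value
    return 4.0 * Lambda**2 * spectral_norm

rng = np.random.default_rng(0)
X_src = rng.normal(size=(200, 5))
X_tgt = rng.normal(loc=0.3, size=(300, 5))
q_uniform = np.full(200, 1.0 / 200)
print(disc_upper_bound(q_uniform, X_src, X_tgt))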

4.2 Softmax smooth approximation

A natural approximation of the maximum eigenvalue is given by the softmax approximation: for any symmetric matrix $A$ and $\mu > 0$,

$F_\mu(A) = \mu \log\Big( \mathrm{Tr}\big[ \exp(A/\mu) \big] \Big) = \mu \log\Big( \sum_{k=1}^{d} e^{\lambda_k(A)/\mu} \Big).$

Note that while $F_\mu$ is a function of the eigenvalues, which are not differentiable everywhere, it is in fact infinitely differentiable, since it can be expressed in terms of the trace of the matrix exponential (or in terms of traces of powers of the matrix). The matrix exponential can be computed in $O(d^3)$ time, using an eigendecomposition of the matrix. The following inequalities follow directly from the properties of the softmax:

$\lambda_{\max}(A) \;\le\; F_\mu(A) \;\le\; \lambda_{\max}(A) + \mu \log d.$    (3)

Thus, for $\mu = \frac{\epsilon}{\log d}$, $F_\mu$ gives a uniform $\epsilon$-approximation of $\lambda_{\max}$. The components of the gradient of $\mathsf{q} \mapsto F_\mu\big(\mathbf{M}(\mathsf{q})\big)$ are given by

$\Big[ \nabla_{\mathsf{q}} F_\mu\big(\mathbf{M}(\mathsf{q})\big) \Big]_i = \Big\langle \frac{\exp\big(\mathbf{M}(\mathsf{q})/\mu\big)}{\mathrm{Tr}\big[\exp\big(\mathbf{M}(\mathsf{q})/\mu\big)\big]},\; x_i x_i^\top \Big\rangle_F,$    (4)

where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product. Both the smoothness and the sensitivity of this gradient will be needed for the derivation of our algorithms. We now analyze these properties using the function $F_\mu$, defined for any symmetric matrix $A$ as above.
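The following short numerical sketch illustrates the softmax approximation of the largest eigenvalue and checks the approximation guarantee (3); the function name softmax_eig and the test matrix are illustrative.

import numpy as np
from scipy.special import logsumexp

def softmax_eig(A, mu):
    """F_mu(A) = mu * log(trace(exp(A / mu))) for a symmetric matrix A,
    computed stably via the eigenvalues of A."""
    eigvals = np.linalg.eigvalsh(A)
    return mu * logsumexp(eigvals / mu)

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
A = (B + B.T) / 2.0                     # symmetric test matrix
d = A.shape[0]

lam_max = np.linalg.eigvalsh(A)[-1]
for mu in [1.0, 0.1, 0.01]:
    F = softmax_eig(A, mu)
    # Approximation guarantee: lam_max <= F_mu(A) <= lam_max + mu * log(d).
    assert lam_max <= F + 1e-9 and F <= lam_max + mu * np.log(d) + 1e-9
    print(mu, lam_max, F)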

The following result provides the desired smoothness guarantee for the approximation, which we prove by using the smoothness of $F_\mu$.

Theorem 1.

The softmax approximation $\mathsf{q} \mapsto F_\mu\big(\mathbf{M}(\mathsf{q})\big)$ is smooth with respect to the $\ell_1$-norm, with a smoothness parameter inversely proportional to $\mu$.

The proof is given in Appendix C.1. Next, we analyze the sensitivity of the gradient, that is, the maximum variation of $\nabla_{\mathsf{q}} F_\mu\big(\mathbf{M}(\mathsf{q})\big)$ when a single point in the sample of size $n$ drawn from $\mathcal{Q}_{\mathcal{X}}$ is changed to another one.

Theorem 2.

The gradient of the softmax approximation has bounded sensitivity, scaling inversely with the private sample size $n$.

The proof is given in Appendix C.1.

Note that the softmax function is known to be convex [BV14]. Since $\mathbf{M}(\mathsf{q})$ is an affine function of $\mathsf{q}$ and composition with an affine function preserves convexity, $\mathsf{q} \mapsto F_\mu\big(\mathbf{M}(\mathsf{q})\big)$ is also a convex function. The following result further shows that it is Lipschitz.

Theorem 3.

For any $\mathsf{q}$, the gradient of $F_\mu\big(\mathbf{M}(\mathsf{q})\big)$ is bounded in norm by a quantity that depends only on the radius $r$.

The proof is given in Appendix C.1. In view of the expression of the weighted discrepancy in Lemma 1, the smooth approximation of the maximum eigenvalue of $\mathbf{M}(\mathsf{q})$ leads immediately to a smooth approximation $\widetilde{\mathrm{disc}}_\mu$ of the weighted discrepancy $\mathrm{disc}$, obtained by replacing the spectral-norm term with its softmax counterpart.

Thus, $\widetilde{\mathrm{disc}}_\mu$ inherits the key properties of $F_\mu$, gathered in the following corollary.

Corollary 1.

The following properties hold for $\widetilde{\mathrm{disc}}_\mu$:

  1. $\widetilde{\mathrm{disc}}_\mu$ is convex and is a uniform approximation of $\mathrm{disc}$, up to an additive error controlled by $\mu$.

  2. $\widetilde{\mathrm{disc}}_\mu$ is smooth with respect to the $\ell_1$-norm, with a smoothness parameter inversely proportional to $\mu$.

  3. The gradient of $\widetilde{\mathrm{disc}}_\mu$ has bounded sensitivity, scaling inversely with the private sample size $n$.

  4. The gradient of $\widetilde{\mathrm{disc}}_\mu$ is uniformly bounded.

The proof is given in Appendix C.2. In Appendix C.3, we also present and analyze a $p$-Schatten-norm smooth approximation of the discrepancy. This approximation can be used to design private adaptation algorithms with a relative deviation guarantee that can be more favorable in some contexts.

5 Two-stage private adaptation algorithms

Here, we discuss private solutions for a two-stage approach that consists of first finding a distribution $\mathsf{q}$ that minimizes the empirical discrepancy, and next fixing $\mathsf{q}$ to that value and minimizing the empirical $\mathsf{q}$-weighted loss over $h \in \mathcal{H}$. In the absence of privacy constraints, this coincides with the two-stage algorithm of [CM14]. The first stage consists of seeking $\mathsf{q}$ to minimize a regularized version of the discrepancy; the second stage simply consists of fixing the solution $\mathsf{q}$ obtained in the first stage and seeking $h$ minimizing the $\mathsf{q}$-weighted empirical loss:

$\mathsf{q}^{\star} \in \operatorname*{argmin}_{\mathsf{q} \in \Delta_m \cap B} \mathrm{disc}(\mathsf{q}), \qquad h^{\star} \in \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{m} \mathsf{q}^{\star}_i\, \ell\big(h(x_i), y_i\big),$    (5)

where $B$ is a Euclidean ball in $\mathbb{R}^m$ of bounded radius. Equivalently, we can define an $\ell_2$-regularized version of the weighted empirical discrepancy and minimize it over the simplex, where the regularization coefficient is a hyperparameter. Regularization in the first stage is done to ensure that the resulting weights $\mathsf{q}$ are not too sparse, since sparse solutions can lead to a poor output model in the second stage of the adaptation algorithm.

In the second stage, no private data is involved. Thus, in this section, we give two private algorithms for the first stage of discrepancy minimization. Our private algorithms aim at minimizing a regularized version of the smooth approximation $\widetilde{\mathrm{disc}}_\mu$ of the discrepancy discussed in Section 4.2. To emphasize its dependence on the private unlabeled dataset $\mathcal{T}$, we will use the notation $\widetilde{\mathrm{disc}}_{\mathcal{T}}$. Namely, our algorithms aim at privately minimizing, over the simplex, an $\ell_2$-regularized version of $\widetilde{\mathrm{disc}}_{\mathcal{T}}$.

As mentioned earlier, the regularization term is used to avoid sparse solutions that may impact the accuracy of the output model in the second stage of the adaptation algorithm. Our algorithms are based on private variants of the Frank-Wolfe algorithm and of the Mirror Descent algorithm. The general structure of these algorithms follows known private constructions devised in the context of differentially private empirical risk minimization [TGTZ15, BGN21, AFKT21]. However, we note that the guarantees of both algorithms crucially rely on the smoothness and sensitivity properties of the approximation proved in the previous section. Solving the optimization with respect to the smooth approximation of the discrepancy enables us to bound the sensitivity of the gradients (see Theorem 2), which in turn allows us to devise private solutions for this problem.
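For illustration, the following is a minimal sketch of one plausible noisy Frank-Wolfe scheme over the probability simplex, with Laplace noise added to the per-vertex scores; it is not the exact Algorithm 2 of Appendix D, and the noise scale, step size, and gradient oracle are stand-ins that would have to be calibrated to the sensitivity bounds of Section 4 and to the privacy budget.

import numpy as np

def noisy_frank_wolfe_simplex(grad_fn, m, T, noise_scale, rng=None):
    """Frank-Wolfe over the probability simplex with Laplace noise added to the
    per-vertex scores (report-noisy-min), as a sketch of a private variant.

    grad_fn(q) must return the gradient of the (smoothed, regularized)
    objective at q; noise_scale stands in for noise calibrated to the
    gradient's sensitivity and to the privacy budget.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.full(m, 1.0 / m)                          # start from the uniform weights
    for t in range(1, T + 1):
        scores = grad_fn(q) + rng.laplace(scale=noise_scale, size=m)
        vertex = np.zeros(m)
        vertex[np.argmin(scores)] = 1.0              # noisy linear minimizer over the simplex
        eta = 2.0 / (t + 2)                          # standard Frank-Wolfe step size
        q = (1 - eta) * q + eta * vertex
    return q

# Toy usage: minimize a smooth quadratic over the simplex.
rng = np.random.default_rng(0)
c = rng.random(50)
q_priv = noisy_frank_wolfe_simplex(lambda q: 2 * (q - c), m=50, T=200,
                                   noise_scale=0.05, rng=rng)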

We defer the description of these algorithms to Appendix D. We state below their formal guarantees.

Theorem 4.

The Noisy Frank-Wolfe algorithm (Algorithm 2 in Appendix D.1) is $(\epsilon, \delta)$-differentially private. There exists a choice of the parameters of Algorithm 2 such that, with high probability over the algorithm's internal randomness, the output $\mathsf{q}$ approximately minimizes the regularized smoothed discrepancy, with the explicit error bound given in Appendix D.1.

The smoothness of the approximation also enables us to use a private variant of the Frank-Wolfe algorithm whose optimization error scales only logarithmically with the dimension.

Theorem 5.

The Noisy Mirror Descent algorithm (Algorithm 3 in Appendix D.2) is $(\epsilon, \delta)$-differentially private. There exists a choice of the parameters of Algorithm 3 such that, with high probability over the algorithm's randomness, the output $\mathsf{q}$ approximately minimizes the regularized smoothed discrepancy, with the explicit error bound given in Appendix D.2.

Note that, compared to the guarantees of the private Frank-Wolfe algorithm in Theorem 4, the optimization error of the Noisy Mirror Descent algorithm (Theorem 5) exhibits a better dependence on the sample size at the expense of a worse dependence on the dimension. In Appendix D, we give full proofs of these theorems.
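Analogously, the following is a minimal sketch of a noisy entropic mirror descent (multiplicative-weights) scheme over the simplex; again this only illustrates the structure, not the exact Algorithm 3 of Appendix D, and the noise calibration and step size are placeholders.

import numpy as np

def noisy_mirror_descent_simplex(grad_fn, m, T, step_size, noise_scale, rng=None):
    """Entropic mirror descent (multiplicative weights) over the simplex with
    Gaussian noise added to the gradients, as a sketch of a private variant.

    noise_scale stands in for noise calibrated to the gradient sensitivity and
    the privacy budget; step_size is the mirror-descent learning rate.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.full(m, 1.0 / m)
    avg = np.zeros(m)
    for _ in range(T):
        g = grad_fn(q) + rng.normal(scale=noise_scale, size=m)
        q = q * np.exp(-step_size * g)               # entropic (KL) mirror step
        q /= q.sum()                                 # renormalize onto the simplex
        avg += q / T
    return avg                                       # averaged iterate

# Toy usage on the same quadratic objective as above.
rng = np.random.default_rng(1)
c = rng.random(50)
q_md = noisy_mirror_descent_simplex(lambda q: 2 * (q - c), m=50, T=500,
                                    step_size=0.1, noise_scale=0.05, rng=rng)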

Note that, by standard stability arguments, the minimum weighted empirical loss in the second stage when training with the weights output by our private algorithms is close to the minimum weighted empirical loss when training with the discrepancy-minimizing weights, provided the discrepancy values of the two weightings are close [MMR09]. Theorems 4 and 5 precisely supply guarantees for that closeness in discrepancy, thereby guaranteeing the closeness of the loss of our private predictor (the output of the second stage) to the minimum weighted empirical loss. Together with the learning bound (1), this immediately provides a bound on the expected loss of our private predictor.

6 Single-stage private adaptation algorithm

In this section, we give a novel private algorithm for adaptation that outputs an approximate stationary point of the smooth approximation of the learning bound, namely of the objective

$G_{\mathcal{T}}(\mathsf{q}, h) = \sum_{i=1}^{m} \mathsf{q}_i\, \ell\big(h(x_i), y_i\big) + \widetilde{\mathrm{disc}}_{\mathcal{T}}(\mathsf{q}),$

where $\widetilde{\mathrm{disc}}_{\mathcal{T}}$ is the smooth approximation of the discrepancy discussed in Section 4.2 (the subscript $\mathcal{T}$ in $G_{\mathcal{T}}$ and $\widetilde{\mathrm{disc}}_{\mathcal{T}}$ is used to emphasize the dependence on the private dataset $\mathcal{T}$).

As discussed earlier, the function $G_{\mathcal{T}}$ is generally non-convex in $(\mathsf{q}, h)$. Since attaining a global minimizer of $G_{\mathcal{T}}$ is generally intractable, a reasonable alternative is to find an (approximate) stationary point of $G_{\mathcal{T}}$. Note that $G_{\mathcal{T}}$ is smooth in $\mathsf{q}$ with respect to the $\ell_1$-norm (as discussed in Section 4.2) and smooth in $h$ with respect to the $\ell_2$-norm (due to the nature of the squared loss). These smoothness properties allow us to design our private solution. Given the approximation guarantee (3), the data-dependent terms in the learning bound (1) can thus be approximated by $G_{\mathcal{T}}$. Hence, our strategy here is to find an approximate stationary point of $G_{\mathcal{T}}$ via our private algorithm, and then derive a learning bound in terms of $G_{\mathcal{T}}$. The formal definition of an approximate stationary point is given next.

Definition 1 ($\alpha$-approximate stationary point).

Let $f$ be a differentiable function over a convex and compact subset $\mathcal{K}$ of a normed vector space. Let $\alpha > 0$. We say that $x \in \mathcal{K}$ is an $\alpha$-approximate stationary point of $f$ if the stationarity gap of $f$ at $x$, defined as $\mathcal{G}_f(x) = \max_{y \in \mathcal{K}} \langle \nabla f(x),\, x - y \rangle$, is at most $\alpha$.

First, we will give a generic differentially-private algorithm for approximating a stationary point of smooth non-convex objectives (defined by a private dataset ) that satisfy certain smoothness and Lipschitzness conditions. We give formal definitions of these conditions below.

Definition 2 ($(L_1, L_2)$-Lipschitz function).

Consider a function $G \colon \mathcal{Q} \times \mathcal{W} \to \mathbb{R}$, where $\mathcal{Q}$ is a convex set whose $\ell_1$-diameter is bounded (we refer to such a set as $\ell_1$-bounded), and $\mathcal{W}$ is a convex $\ell_2$-bounded set. Let $L_1, L_2 > 0$. We say that $G$ is $(L_1, L_2)$-Lipschitz if for any $w \in \mathcal{W}$, $G(\cdot, w)$ is $L_1$-Lipschitz with respect to $\|\cdot\|_1$ over $\mathcal{Q}$, and for every $\mathsf{q} \in \mathcal{Q}$, $G(\mathsf{q}, \cdot)$ is $L_2$-Lipschitz with respect to $\|\cdot\|_2$ over $\mathcal{W}$.

Definition 3 ($(\beta_1, \beta_2)$-smooth function).

This notion is defined analogously. We say that $G$ is $(\beta_1, \beta_2)$-smooth if for any $w \in \mathcal{W}$, $G(\cdot, w)$ is $\beta_1$-smooth with respect to $\|\cdot\|_1$ over $\mathcal{Q}$, and for every $\mathsf{q} \in \mathcal{Q}$, $G(\mathsf{q}, \cdot)$ is $\beta_2$-smooth with respect to $\|\cdot\|_2$ over $\mathcal{W}$.

Our private algorithm (Algorithm 1 below) takes as input an objective $G \colon \mathcal{Q} \times \mathcal{W} \to \mathbb{R}$, where $\mathcal{Q}$ is a convex polyhedral set with bounded $\ell_1$-diameter and $\mathcal{W}$ is a convex set with bounded $\ell_2$-diameter. Hence, our objective $G_{\mathcal{T}}$ mentioned earlier is a special case. The algorithm is comprised of a number of rounds, where in each round two private Frank-Wolfe update steps are performed: one for $\mathsf{q}$ and another for $h$. The privacy mechanism for each is different due to the different geometries of $\mathcal{Q}$ and $\mathcal{W}$. We note that in the special case where $G = G_{\mathcal{T}}$, there is no need to privatize the Frank-Wolfe step for $h$, because that update step depends only on the $\mathsf{q}$-weighted empirical loss over the public data and because differential privacy is closed under post-processing (the previous update step for $\mathsf{q}$ is carried out in a differentially private manner).

When $G$ satisfies the Lipschitzness and smoothness properties defined above, we give formal convergence guarantees to a stationary point in terms of a high-probability bound on the stationarity gap of the output (see Definition 1). Despite the different geometries of $\mathcal{Q}$ and $\mathcal{W}$, our final bound is roughly the sum of the bounds we would obtain if we ran two separate Frank-Wolfe algorithms (one over $\mathcal{Q}$ and the other over $\mathcal{W}$). This is mainly due to the hybrid Lipschitzness and smoothness conditions (with respect to the $\ell_1$-norm for $\mathsf{q}$ and the $\ell_2$-norm for $h$), which enable us to decompose the bound on the convergence rate over $\mathcal{Q}$ and $\mathcal{W}$.

0:  Input: private dataset $\mathcal{T}$; privacy parameters $\epsilon, \delta$; a convex $\ell_1$-bounded polyhedral set $\mathcal{Q}$ with a finite set of vertices; a convex $\ell_2$-bounded set $\mathcal{W}$; a function $G$ (defined via the dataset $\mathcal{T}$); bounds on the global sensitivities of the partial gradients of $G$ with respect to $\mathsf{q}$ and with respect to $w$; step size $\eta$; number of iterations $T$.
1:  Set the noise scales for the two update steps from the sensitivity bounds and the privacy budget.
2:  Choose an arbitrary initial point $(\mathsf{q}_0, w_0) \in \mathcal{Q} \times \mathcal{W}$.
3:  for $t = 1$ to $T$ do
4:     Perform a noisy Frank-Wolfe update of $\mathsf{q}$ over $\mathcal{Q}$, using a privatized partial gradient of $G$ with respect to $\mathsf{q}$ (noisy linear minimization over the vertices of $\mathcal{Q}$).
5:     Perform a noisy Frank-Wolfe update of $w$ over $\mathcal{W}$, using a privatized partial gradient of $G$ with respect to $w$.
6:  end for
7:  return a pair of iterates $(\mathsf{q}, w)$.
Algorithm 1 Private Frank-Wolfe for approximating stationary points of $G$
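The following sketch mirrors the two-block structure just described: a Laplace-noised Frank-Wolfe step over the simplex for the weights and a Gaussian-noised Frank-Wolfe step over a Euclidean ball for the predictor. All names, noise scales, and step sizes are illustrative assumptions rather than the exact parameter settings of Algorithm 1.

import numpy as np

def private_two_block_frank_wolfe(grad_q, grad_w, m, dim, radius, T,
                                  lap_scale, gauss_scale, rng=None):
    """Alternating noisy Frank-Wolfe updates over the simplex (weights q) and
    over a Euclidean ball of the given radius (predictor w).

    grad_q(q, w) and grad_w(q, w) return the partial gradients of the
    objective; lap_scale and gauss_scale stand in for noise calibrated to the
    corresponding gradient sensitivities and the privacy budget.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.full(m, 1.0 / m)
    w = np.zeros(dim)
    iterates = []
    for t in range(1, T + 1):
        eta = 2.0 / (t + 2)
        # q-block: noisy linear minimization over the simplex vertices (report-noisy-min).
        scores = grad_q(q, w) + rng.laplace(scale=lap_scale, size=m)
        v_q = np.zeros(m)
        v_q[np.argmin(scores)] = 1.0
        q = (1 - eta) * q + eta * v_q
        # w-block: noisy linear minimization over the Euclidean ball.
        g = grad_w(q, w) + rng.normal(scale=gauss_scale, size=dim)
        v_w = -radius * g / max(np.linalg.norm(g), 1e-12)
        w = (1 - eta) * w + eta * v_w
        iterates.append((q.copy(), w.copy()))
    # Return a uniformly random iterate, as is standard for stationarity guarantees.
    return iterates[rng.integers(len(iterates))]

The alternating structure makes it possible to use the ℓ1 geometry (and hence noise that grows only logarithmically with the number of simplex vertices) for the weights, while keeping the usual Euclidean treatment for the predictor.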
Theorem 6.

Algorithm 1 is $(\epsilon, \delta)$-differentially private. Assume that the objective $G$ is $(L_1, L_2)$-Lipschitz and $(\beta_1, \beta_2)$-smooth, and that its partial gradients are uniformly bounded over $\mathcal{Q} \times \mathcal{W}$. Then, for any failure probability, there exists a choice of the step size $\eta$ and the number of iterations $T$ such that, with high probability over the algorithm's randomness, the stationarity gap of the output is upper bounded by the quantity derived in Appendix E.

The proof is given in Appendix E. We note that our adaptation objective $G_{\mathcal{T}}$ satisfies all the conditions in Theorem 6. In Appendix E.1, we give a detailed discussion of instantiating Algorithm 1 with $G_{\mathcal{T}}$ and the specific settings of all the parameters in this special case. As a result, we immediately obtain the following corollary.

Corollary 2.

Let $G_{\mathcal{T}}$ be the objective given as input to Algorithm 1. There exists a choice of the step size $\eta$ and the number of iterations $T$ for which, with high probability, the output of the algorithm is an approximate stationary point of $G_{\mathcal{T}}$, with the stationarity gap bounded by the instantiation of Theorem 6 for this setting.

Hence, bound (1) implies that, with high probability over the choice of the public and private datasets and the algorithm's internal randomness, the expected loss with respect to the target domain of the predictor defined by the output is bounded accordingly.

Remark.

Note that the output is an approximate stationary point of $G_{\mathcal{T}}$. In practice, it can be an approximation of a good local minimum of $G_{\mathcal{T}}$, as demonstrated by our experiments. In such situations, the above bound implies good prediction accuracy for the output predictor. Note also that the bound above is given in terms of the softmax approximation parameter $\mu$. In general, this parameter should be treated as a hyperparameter and tuned appropriately to minimize the bound. One reasonable choice of $\mu$ can be obtained by balancing the bound on the stationarity gap against the error term due to the softmax approximation; in such a case, $\mu$ is determined by equating the two terms.

7 Experiments

Figure 1: (a) Value of the spectral norm for the output of noisy Frank-Wolfe (solid lines) and noisy Mirror Descent (dotted lines) discrepancy minimization, as a function of the number of samples from the private dataset. (b) Test error as a function of the number of samples from the private dataset. The solid lines correspond to the single-stage algorithm, the dotted lines to the two-stage Mirror Descent algorithm, and the dashed lines to the two-stage Frank-Wolfe algorithm.

The objective of this section is to provide proof-of-concept experiments demonstrating that reasonable privacy guarantees can be achieved when using our private domain adaptation algorithms. We use a setting similar to that of [CM14, Section 7.1] and show that the utility of private adaptation degrades gracefully with increased privacy guarantees and that the single-stage Frank-Wolfe algorithm performs best in most scenarios.

We carried out experiments with the following synthetic dataset. The two input distributions are spherical Gaussians with different means and the same variance in all directions, and the labeling function is a thresholded function of the input. We chose the target distribution to be one of the Gaussians and the source distribution to be a mixture of the two Gaussians. We fixed the number of source samples and varied the number of unlabeled target samples. We fixed the privacy parameter $\delta$ and varied $\epsilon$ in the experiments. All experiments were repeated ten times for statistical consistency, and the reported standard deviations were computed over these runs.

In this setup, we first ran differentially private discrepancy minimization using Algorithms 2 and 3. We plotted the resulting spectral norm for different values of $\epsilon$ in Figure 1(a). The performance of the noisy Frank-Wolfe algorithm degrades smoothly as the privacy guarantee is strengthened and improves with the number of private samples. However, the performance of the noisy Mirror Descent algorithm is much worse. This is in line with the theoretical guarantees: for the sample sizes used in these experiments, the noisy Frank-Wolfe algorithm has a better convergence guarantee, and we expect Mirror Descent to perform better with much larger sample sizes. Furthermore, observe that noisy Mirror Descent has a high standard deviation compared to the Frank-Wolfe algorithm, as the noise added in Mirror Descent scales polynomially with the dimension, whereas it scales only logarithmically with the dimension for the Frank-Wolfe algorithm.

We next compared our single-stage algorithm (Algorithm 1) and the two-stage differentially private algorithms with the model trained only on the public dataset (Figure 1(b)). As an oracle baseline, we also plotted the model trained with the labeled private dataset. Note that this model uses extra information that is not available during training and is plotted for illustration purposes only. The single-stage Frank-Wolfe algorithm without privacy attains the same performance as the model trained on the labeled private dataset. It performs better than the two-stage Frank-Wolfe algorithm, though the gap decreases as the privacy guarantee improves. The performance of the Mirror Descent algorithm without differential privacy is similar to that of the Frank-Wolfe algorithm; however, as the theory indicates, its performance degrades quickly with the privacy parameter. As in Figure 1(a), the noisy Mirror Descent algorithm performs much worse and has a high standard deviation.

8 Conclusion

We presented new differentially private adaptation algorithms benefitting from strong theoretical guarantees. Our analysis can form the basis for the study of privacy for other related adaptation scenarios, including scenarios where a small amount of (private) labeled data is also available from the target domain and those with multiple sources. Our single-stage private algorithm is further likely to be of independent interest for private optimization of other similar objective functions.

Acknowledgements

This work was done while RB was visiting Google, NY. RB’s research at OSU is supported by NSF Award AF-1908281, NSF Award 2112471, Google Faculty Research Award, and NSF CAREER Award 2144532.

References

  • [ABM19] Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public data. In Advances in Neural Information Processing Systems (NeurIPS), 2019. Also available at arXiv:1910.11519 [cs.LG].
  • [ACG16] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • [AFKT21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in L1 geometry. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 393–403. PMLR, 2021.
  • [BBCP06] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Bernhard Schölkopf, John C. Platt, and Thomas Hofmann, editors, Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 137–144. MIT Press, 2006.
  • [BCK08] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008.
  • [BCM20] Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, and Steven Wu. Private query release assisted by public data. In International Conference on Machine Learning, pages 695–703. PMLR, 2020.
  • [BDBC10] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [BGM21] Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. Advances in Neural Information Processing Systems, 34, 2021.
  • [BGN21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. arXiv preprint arXiv:2103.01278, 2021.
  • [BLST10] Raghav Bhaskar, Srivatsan Laxman, Adam Smith, and Abhradeep Thakurta. Discovering frequent patterns in sensitive data. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 503–512, 2010.
  • [BNS13] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 363–378. Springer, 2013.
  • [BTT18] Raef Bassily, Abhradeep Thakurta, and Om Thakkar. Model-agnostic private learning. In Advances in Neural Information Processing Systems 31, pages 7102–7112. Curran Associates, Inc., 2018.
  • [BV14] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2014.
  • [CH11] Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 155–186, 2011.
  • [CM14] Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theor. Comput. Sci., 519:103–126, 2014.
  • [CMMM19] Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation based on generalized discrepancy. J. Mach. Learn. Res., 20:1:1–1:30, 2019.
  • [DKM06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.
  • [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [JN08] Anatoli Juditsky and Arkadi Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. Rapport de recherche hal-00318071, HAL, 2008.
  • [KBG04] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Mario A. Nascimento, M. Tamer Özsu, Donald Kossmann, Renée J. Miller, José A. Blakeley, and K. Bernhard Schiefer, editors, (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, VLDB 2004, Toronto, Canada, August 31 - September 3 2004, pages 180–191. Morgan Kaufmann, 2004.
  • [MMR09] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009.
  • [MRT18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, 2018. Second edition.
  • [NB20] Anupama Nandi and Raef Bassily. Privately answering classification queries in the agnostic pac model. In Algorithmic Learning Theory, pages 687–703, 2020.
  • [Nes07] Yurii Nesterov. Smoothing technique and its applications in semidefinite optimization. Math. Program., 110:245–259, 2007.
  • [NJLS09] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
  • [NY83] A.S. Nemirovsky and D.B. Yudin. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication. Wiley, 1983.
  • [Sid19] Aaron Sidford. Introduction to optimization theory - MS&E213 / CS269O. Stanford Course Notes, 2019. https://web.stanford.edu/~sidford/courses/19fa_opt_theory/.
  • [TGTZ15] Kunal Talwar, Abhradeep Guha Thakurta, and Li Zhang. Nearly optimal private lasso. Advances in Neural Information Processing Systems, 28:3025–3033, 2015.

Appendix A Background on discrepancy-based generalization bounds

In this section, we briefly present some background material on discrepancy-based generalization guarantees.

The following learning bound was given by [CMMM19]: for any $\rho > 0$, with probability at least $1 - \rho$ over the draw of a sample of size $m$ from the source domain, for any distribution $\mathsf{q}$ over that sample and for all $h \in \mathcal{H}$, the following inequality holds:

$\mathcal{L}_{\mathcal{Q}}(h) \;\le\; \sum_{i=1}^{m} \mathsf{q}_i\, \ell\big(h(x_i), y_i\big) \;+\; \mathrm{disc}_{\mathcal{Y}}(\mathsf{q}) \;+\; 2\, \mathfrak{R}_m(\mathcal{H}) \;+\; M \sqrt{\frac{\log(1/\rho)}{2m}}.$    (6)

This bound is tight in the sense that, for the hypothesis reaching the maximum in the definition of the $\mathcal{Y}$-discrepancy, the bound coincides with the standard Rademacher complexity bound [CMMM19]. The bound suggests choosing $h$ and the distribution $\mathsf{q}$ to minimize the right-hand side. The first term of the bound is not jointly convex with respect to $h$ and $\mathsf{q}$. Instead, the algorithm suggested by [CM14] (see also [MMR09]) consists of a two-stage procedure: first choose $\mathsf{q}$ to minimize the $\mathsf{q}$-weighted empirical discrepancy, next fix $\mathsf{q}$ and choose $h$ to minimize the $\mathsf{q}$-weighted empirical loss.

In practice, we have no labeled data from the target domain, or too little to be able to accurately minimize the $\mathcal{Y}$-discrepancy; otherwise, adaptation would not even be necessary and we could directly use labeled target data for training. Instead, we upper bound the $\mathcal{Y}$-discrepancy in terms of the (unlabeled) discrepancy and the output label-discrepancy $\eta_{\mathcal{H}}$, defined with respect to $\overline{\mathcal{T}}$, the labeled version of $\mathcal{T}$ (that is, $\mathcal{T}$ associated with its true, hidden labels). The discrepancy $\mathrm{disc}$ measures the difference of the distributions on the input domain; in contrast, $\eta_{\mathcal{H}}$ accounts for the difference of the output labels in the source and target domains. We will assume that $\eta_{\mathcal{H}}$ is small. Note that under the covariate-shift assumption and in the separable case, we have $\eta_{\mathcal{H}} = 0$. In general, adaptation is not possible when $\eta_{\mathcal{H}}$ can be large, since the labels received on the training sample can then be very different from the target ones.

We will say that a loss function $\ell$ is $\sigma$-admissible if $|\ell(h(x), y) - \ell(h'(x), y)| \le \sigma\, |h(x) - h'(x)|$ for all $h, h' \in \mathcal{H}$ and $(x, y) \in \mathcal{X} \times \mathcal{Y}$ [CMMM19]. Note that this is a slightly weaker condition than $\sigma$-Lipschitzness of the loss with respect to its first argument.

Theorem 7.

Let $\ell$ be a $\sigma$-admissible loss. Then, the $\mathcal{Y}$-discrepancy is upper bounded by the sum of the (unlabeled) discrepancy and a term proportional to the output label-discrepancy $\eta_{\mathcal{H}}$.

The proof is given in Appendix B. Note that the squared loss is admissible: since the function $u \mapsto u^2$ is 2-Lipschitz on $[-1, 1]$, we have $\big|(h(x) - y)^2 - (h'(x) - y)^2\big| \le 2\, |h(x) - h'(x)|$ whenever predictions and labels lie in that interval. Thus, the learning bound (6) can be expressed in terms of the discrepancy and the Rademacher complexity of $\mathcal{H}$, using Theorem 7 and [MRT18, Prop. 11.2].

We will be considering the family of linear hypotheses $\mathcal{H} = \{x \mapsto w \cdot x \colon \|w\| \le \Lambda\}$ and will be assuming that the support of the input distribution is included in the ball of radius $r$. The following more explicit upper bound on the Rademacher complexity then holds: $\mathfrak{R}_m(\mathcal{H}) \le \frac{r \Lambda}{\sqrt{m}}$ [MRT18].
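For completeness, the standard derivation of this bound (using the Cauchy-Schwarz and Jensen inequalities and the independence of the Rademacher variables; this is the textbook argument, not specific to this work):

$\mathfrak{R}_m(\mathcal{H})
  = \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[ \sup_{\|w\| \le \Lambda} \sum_{i=1}^{m} \sigma_i\, w \cdot x_i \Big]
  \le \frac{\Lambda}{m}\, \mathbb{E}_{\sigma}\Big\| \sum_{i=1}^{m} \sigma_i x_i \Big\|
  \le \frac{\Lambda}{m} \sqrt{\mathbb{E}_{\sigma}\Big\| \sum_{i=1}^{m} \sigma_i x_i \Big\|^{2}}
  = \frac{\Lambda}{m} \sqrt{\sum_{i=1}^{m} \|x_i\|^{2}}
  \le \frac{r \Lambda}{\sqrt{m}}.$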

Appendix B Discrepancy analysis and bounds

Theorem 7 (restated).

Proof.

For any hypothesis $h$ in $\mathcal{H}$, we can write