1 Introduction
Machine learning today relies on large quantities of high-quality data. However, due to the expense and difficulty of data acquisition, one often cannot obtain a sufficient amount of training data, and may therefore want to make use of external datasets provided by other data-rich organizations. Such scenarios frequently appear in many real-world applications. For example, in Figure 1, small or newly established hospitals may not have sufficient labeled medical records, and they may wish to borrow knowledge from large public hospitals to boost the performance of their prediction models. Transfer learning [Pan and Yang 2010] is a proven tool for this purpose, as an established hospital may already have much experience in the form of labeled medical data, which can be transferred to a newly established medical service.
When knowledge transfer is conducted, data privacy becomes a serious concern. Recently, several laws have even been enacted on the privacy issue. One of the most famous is Europe's General Data Protection Regulation (GDPR) (https://en.wikipedia.org/wiki/General_Data_Protection_Regulation), which regulates the protection of private data and data transmission between organizations. Such regulations and public requirements raise challenges for cross-organizational transfer learning. When applying transfer learning, the source and target datasets may belong to different institutes or companies. As a result, the privacy of users in a source organization may be compromised at the target organization. As far as we know, there is no existing transfer learning algorithm designed to solve this problem. We believe that it is time for machine learning researchers to design new methods to tackle this problem and make transfer learning compliant with the law.
In the past, researchers have designed various ways to protect privacy in data publishing. The theory of differential privacy [Dwork et al. 2006b, Dwork 2008] has been developed as a standard for ensuring the privacy of data when data is exchanged between organizations. To design an algorithm with a privacy-preserving guarantee, carefully calibrated noise is often added to the original data so that the output of the analysis does not reveal any individual record. However, for machine learning, such injection of noise often incurs a major degradation of learning performance. As a result, many machine learning algorithms have been modified to achieve differential privacy while preserving learning performance, including logistic regression [Chaudhuri, Monteleoni, and Sarwate 2011], tree models [Emekçi et al. 2007, Jagannathan, Pillaipakkamnatt, and Wright 2009, Fong and Weber-Jahnke 2012], and deep neural networks [Shokri and Shmatikov 2015, Abadi et al. 2016]. In our scenario, we do not consider tree models, as they are usually difficult to transfer. Deep neural networks are also not a good choice, as they are hard to interpret, which can be a problem when facing the law (e.g., GDPR). Linear models, however, are simple and easy to understand, and their differentially private variants also have rich and rigorous theoretical guarantees [Chaudhuri, Monteleoni, and Sarwate 2011, Bassily, Smith, and Thakurta 2014, Hamm, Cao, and Belkin 2016, Kasiviswanathan and Jin 2016]. These considerations motivate us to use linear models for privacy-preserving transfer learning.
In this paper, we start from the hypothesis transfer learning (HTL) method of [Kuzborskij and Orabona 2013], which we combine with privacy-preserving logistic regression [Chaudhuri, Monteleoni, and Sarwate 2011]. We show how to integrate these methods to solve the knowledge sharing problem with a privacy-preserving guarantee. However, a naive combination of transfer learning with privacy-preserving constraints can suffer from poor learning performance. Specifically, to preserve privacy, the noise added to the source learning models can be too high when the target sample sizes are small or when the feature dimensions are high. The challenge is how to strike the best balance, which is a serious problem for transferring knowledge to the target domain. In summary, our contributions are as follows:

We propose a novel method, based on HTL, to solve the knowledge sharing problem from the source to the target. This method strikes a balance among privacy-preserving concerns, transfer learning performance, and target-domain data sizes;

We prove that the algorithm has a differential privacy guarantee for both the source and the target. We also analyze the generalization performance of the proposed method, which shows that it is indeed better than the simple combination when the privacy budgets are the same; and

In addition to the popular MNIST dataset, we also test our method on a real-world RUIJIN dataset. This dataset contains medical records from thousands of people and is collected from hospitals across China. The results show that, given the same privacy budget, our method achieves much better performance than other state-of-the-art methods.
The rest of the paper is organized as follows. Section 2 reviews related work; the proposed algorithm is presented in Section 3; experiments are reported in Section 4; and concluding remarks are given in Section 5.
Notation
In this paper, vectors are denoted by lowercase boldface letters. We use subscripts to denote variables in the source and target domains. We use superscripts to denote variables associated with the $k$-th split of features. Parameters with a privacy guarantee are indicated by a bar above the symbol, e.g., $\bar{\mathbf{w}}$.
2 Related Works
1. Differential Privacy.
Differential privacy [Dwork et al. 2006b, Dwork and Roth 2014] is established as a rigorous standard to guarantee privacy for algorithms that access private data. We say that an algorithm preserves differential privacy if a one-entry difference in the dataset does not affect the likelihood of any specific output of the algorithm by more than a multiplicative factor of $e^{\epsilon}$. The formal definition is as follows.
Definition 1.
[Dwork et al. 2006b] A randomized mechanism $\mathcal{M}$ is $\epsilon$-differentially private if, for every output $t$ of $\mathcal{M}$ and for all databases $D$ and $D'$ which differ by at most one element, we have $\Pr[\mathcal{M}(D) = t] \le e^{\epsilon} \Pr[\mathcal{M}(D') = t]$.
The parameter $\epsilon$ is called the privacy budget. To meet an $\epsilon$-differential privacy guarantee, carefully generated noise usually needs to be added into the learning algorithm. A smaller $\epsilon$ provides a stricter privacy guarantee, but usually requires adding more noise and leads to a larger drop in learning performance.
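To make the privacy/utility trade-off concrete, the following sketch shows the classical Laplace mechanism on a counting query. This is a generic illustration of how noise is calibrated to the budget $\epsilon$ (smaller $\epsilon$, larger noise scale); it is not the mechanism used later in the paper, which perturbs the learning objective instead.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    # Adding Laplace noise with scale = sensitivity / epsilon gives
    # epsilon-differential privacy for a query whose output changes by at
    # most `sensitivity` when one record in the database changes.
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
# A counting query has sensitivity 1: one record changes the count by at most 1.
exact_count = 120
strict = laplace_mechanism(exact_count, sensitivity=1.0, epsilon=0.1, rng=rng)
loose = laplace_mechanism(exact_count, sensitivity=1.0, epsilon=8.0, rng=rng)
```

With `epsilon=0.1` the noise scale is 10, so the released count is far less reliable than with `epsilon=8.0` (scale 0.125), mirroring the budget/performance trade-off described above.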
2. Privacy-Preserving Logistic Regression (PLR).
The state-of-the-art privacy-preserving learning method with logistic regression (LR) was proposed in [Chaudhuri, Monteleoni, and Sarwate 2011]. Given a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where the $\mathbf{x}_i$'s are samples and the $y_i$'s are the corresponding labels, they design an $\epsilon$-differentially private variant of
$$\min_{\mathbf{w}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i \mathbf{w}^\top \mathbf{x}_i) + \frac{\lambda}{2}\|\mathbf{w}\|^2, \quad (1)$$
where $\ell$ is the logistic loss, $\mathbf{w}$ is the learning parameter, and $\lambda$ is a hyper-parameter. The following assumption is made on (1).
Assumption 1.
The regularizer $\frac{1}{2}\|\mathbf{w}\|^2$ is 1-strongly convex. (A function $f$ is $m$-strongly convex if $f(\alpha \mathbf{w}_1 + (1-\alpha)\mathbf{w}_2) \le \alpha f(\mathbf{w}_1) + (1-\alpha) f(\mathbf{w}_2) - \frac{m}{2}\alpha(1-\alpha)\|\mathbf{w}_1 - \mathbf{w}_2\|^2$ for any $\alpha \in [0,1]$.)
However, directly releasing the minimizer of (1) fails to reach this goal [Dwork et al. 2006b]. Thus, Chaudhuri et al. (2011) proposed to change the objective to
$$\min_{\mathbf{w}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i \mathbf{w}^\top \mathbf{x}_i) + \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{n}\mathbf{b}^\top \mathbf{w} + \frac{\Delta}{2}\|\mathbf{w}\|^2, \quad (2)$$
where two extra terms, i.e., the random linear term $\frac{1}{n}\mathbf{b}^\top \mathbf{w}$ and the quadratic term $\frac{\Delta}{2}\|\mathbf{w}\|^2$, are added to (1). Specifically, $\Delta$ is a constant depending on $n$, $\lambda$ and $\epsilon$, which is a requirement of the proof; $\mathbf{b}$ is drawn from a distribution with density proportional to $\exp(-\frac{\epsilon}{2}\|\mathbf{b}\|)$ (to draw $\mathbf{b}$, one can first pick $\|\mathbf{b}\|$ from the Gamma distribution $\Gamma(d, 2/\epsilon)$ and pick a unit vector uniformly at random [Chaudhuri, Monteleoni, and Sarwate 2011]), and the noise level is determined by $\epsilon$. The complete mechanism based on (2) is given in Algorithm 1, and its privacy guarantee is stated in Proposition 1.
Proposition 1.
[Chaudhuri, Monteleoni, and Sarwate 2011] Algorithm 1 has an $\epsilon$-differential privacy guarantee if Assumption 1 is satisfied.
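A minimal sketch of this objective-perturbation mechanism in the spirit of Chaudhuri et al. (2011) is given below. The constant `c = 0.25` (the curvature bound of the logistic loss), the budget-adjustment formulas, the use of SciPy's L-BFGS solver, and the data assumptions ($\|\mathbf{x}_i\| \le 1$, $y_i \in \{-1, +1\}$) are our reading of that paper, not details stated in this one.

```python
import numpy as np
from scipy.optimize import minimize

def private_lr_objective_perturbation(X, y, lam, epsilon, rng):
    """Objective perturbation for L2-regularized logistic regression.
    Assumes each row of X has norm <= 1 and y in {-1, +1}."""
    n, d = X.shape
    c = 0.25  # upper bound on the second derivative of the logistic loss
    # Adjust the budget; if too small, add the extra quadratic slack Delta.
    eps_prime = epsilon - 2.0 * np.log(1.0 + c / (n * lam))
    if eps_prime > 0:
        Delta = 0.0
    else:
        Delta = c / (n * (np.exp(epsilon / 4.0) - 1.0)) - lam
        eps_prime = epsilon / 2.0
    # Noise vector b: norm ~ Gamma(d, 2/eps'), direction uniform on the sphere.
    norm = rng.gamma(shape=d, scale=2.0 / eps_prime)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    b = norm * direction

    def obj(w):
        margins = y * (X @ w)
        loss = np.mean(np.logaddexp(0.0, -margins))  # stable logistic loss
        return (loss + 0.5 * lam * (w @ w)
                + (b @ w) / n + 0.5 * Delta * (w @ w))

    return minimize(obj, np.zeros(d), method="L-BFGS-B").x
```

The returned weight vector is the perturbed minimizer; its distance from the non-private solution grows as `epsilon` shrinks, which is exactly the performance loss quantified by Proposition 2.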
However, in a machine learning model, we are also concerned with model performance. Having a good privacy guarantee does not mean that the model has good learning performance; in fact, in practice, the performance often degrades dramatically. Thus, given the same privacy budget, algorithms with a better learning guarantee are desired. Proposition 2 provides such a guarantee for Algorithm 1.
Proposition 2.
[Chaudhuri, Monteleoni, and Sarwate 2011] Let $\mathbf{w}_0$ be a reference predictor and $\mathcal{D}$ be the data distribution. Then, there exists a constant $C$ such that for $\delta > 0$, if the $n$ training examples are drawn i.i.d. according to $\mathcal{D}$, and
$$n > C \max\!\left( \frac{\|\mathbf{w}_0\|^2 \log(1/\delta)}{\epsilon_g^2}, \; \frac{d \log(d/\delta)\,\|\mathbf{w}_0\|}{\epsilon_g\,\epsilon}, \; \frac{\|\mathbf{w}_0\|^2}{\epsilon_g\,\epsilon} \right), \quad (3)$$
then the output $\bar{\mathbf{w}}$ of Algorithm 1 satisfies $\Pr[L(\bar{\mathbf{w}}) \le L(\mathbf{w}_0) + \epsilon_g] \ge 1 - \delta$, where $L(\mathbf{w})$ is the expected loss of predictor $\mathbf{w}$ over the distribution $\mathcal{D}$.
Proposition 2 shows how many samples are needed to come within $\epsilon_g$ error of the desired reference classifier. Basically, the smaller the R.H.S. in (3), the better the learning performance. Compared with the non-private model (1), which only needs the first term in (3) samples to reach the same error [Shalev-Shwartz and Srebro 2008], there are two extra terms depending on the privacy budget $\epsilon$ inside the max operator in Proposition 2. Thus, the learning performance of Algorithm 1 is indeed deteriorated.
3. Hypothesis Transfer Learning (HTL).
Transfer learning [Pan and Yang 2010] is a powerful and promising approach to extract useful knowledge from a source domain for a target domain. In this paper, we consider hypothesis transfer learning (HTL) [Orabona et al. 2009, Kuzborskij and Orabona 2013, Kuzborskij and Orabona 2017]. Using (1) as an example, HTL improves the performance on the target domain by biasing the regularizer toward the source predictor, i.e.,
$$\min_{\mathbf{w}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i \mathbf{w}^\top \mathbf{x}_i) + \frac{\lambda}{2}\|\mathbf{w} - \mathbf{w}_S\|^2, \quad (4)$$
where $\ell$ is the logistic loss, $\lambda$ is a hyper-parameter, $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ is the data in the target domain, and $\mathbf{w}_S$ is the predictor obtained from the source domain.
Although there are other techniques for transfer learning, such as domain adaptation and feature representation transfer [Pan and Yang 2010], as we will show in Section 3, HTL is a very good choice because it is a general method that integrates well with privacy guarantees.
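The HTL idea above can be sketched in a few lines: the only change from ordinary logistic regression is the regularizer, which pulls the target weights toward the source predictor instead of toward zero. The solver choice and the variable names (`w_source` for the source predictor, `lam` for the hyper-parameter) are our own.

```python
import numpy as np
from scipy.optimize import minimize

def htl_logistic_regression(X_t, y_t, w_source, lam):
    """Hypothesis transfer learning: fit a target logistic regression whose
    regularizer is biased toward the source predictor w_source."""
    def obj(w):
        margins = y_t * (X_t @ w)
        loss = np.mean(np.logaddexp(0.0, -margins))  # stable logistic loss
        diff = w - w_source
        return loss + 0.5 * lam * (diff @ diff)      # biased regularizer
    return minimize(obj, x0=w_source.astype(float), method="L-BFGS-B").x
```

A large `lam` keeps the target model close to the source hypothesis (useful when target data is scarce), while a small `lam` lets the target data dominate.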
4. Learning from Multiple Parties.
Finally, we discuss learning privately from multiparty data [Pathak, Rane, and Raj 2010], which is the scenario most related to ours. In this case, each party has its own private data, and the task is to obtain final predictions using data from all parties. There are two lines of research. The first trains classifiers locally in each party; the problem is then how to privately combine their predictions. A carefully designed protocol was proposed in [Pathak, Rane, and Raj 2010] to solve this problem. Its performance was surpassed by [Hamm, Cao, and Belkin 2016], which used a second-level classifier to combine the different predictions. Later, public unlabeled non-sensitive data was introduced to improve the second-level classifier in [Papernot et al. 2017]. However, our goal here is to boost learning performance in the target domain, which these works cannot achieve. The second line of research trains a model simultaneously among all parties. Stochastic gradient descent across parties is used in [Rajkumar and Agarwal 2012], and a multi-task learning method is proposed in [Xie et al. 2017]. While these works improve on the aggregation-based methods, they gradually lose their privacy guarantee during the iterations of the algorithms. Thus, they do not apply here either.
3 The Proposed Approach
First, in Section 3.1, we show how a simple combination of privacy-preserving LR and HTL can be done. As this combination can suffer from small sample sizes in the target, we improve privacy-preserving LR in Section 3.2, and use it to solve the knowledge sharing problem in Section 3.3.
3.1 First Attempt: A Simple Combination
Basically, HTL can already solve the knowledge sharing problem described in Section 1. We can obtain a source predictor $\mathbf{w}_S$ from (1) using the source data, and then obtain the desired predictor on the target data from (4). The problem is that there is no privacy guarantee yet. However, Algorithm 1 can be used to solve this issue. The key idea is to perturb both objectives. Thus, we solve
$$\min_{\mathbf{w}_S} \; \frac{1}{n_S}\sum_{i=1}^{n_S} \ell(y_i \mathbf{w}_S^\top \mathbf{x}_i) + \frac{\lambda_S}{2}\|\mathbf{w}_S\|^2 + \frac{1}{n_S}\mathbf{b}_S^\top \mathbf{w}_S + \frac{\Delta_S}{2}\|\mathbf{w}_S\|^2 \quad (5)$$
for the source. Then, we solve
$$\min_{\mathbf{w}} \; \frac{1}{n_T}\sum_{i=1}^{n_T} \ell(y_i \mathbf{w}^\top \mathbf{x}_i) + \frac{\lambda_T}{2}\|\mathbf{w} - \bar{\mathbf{w}}_S\|^2 + \frac{1}{n_T}\mathbf{b}_T^\top \mathbf{w} + \frac{\Delta_T}{2}\|\mathbf{w}\|^2 \quad (6)$$
for the target, where $\bar{\mathbf{w}}_S$ is obtained from (5). The complete procedure for this simple approach is given in Algorithm 2, and it can be easily seen that differential privacy is guaranteed for both the source and the target.
Simple combination issues.
However, such a direct combination has two main problems:
(A) Algorithm 2 suffers from poor performance when the dimension is high. This problem comes from the fact that Algorithm 1 is sensitive to the dimension $d$, which determines the noise level. The theoretical analysis in Proposition 2 also shows that, with a limited privacy budget, more samples are needed to preserve the generalization performance when $d$ increases. The problem is even worse in our setting, as the number of target samples is usually small.
(B) Algorithm 2 treats all features equally and thus does not consider feature importance. Note that the noise is added equally to all features in Algorithm 2. However, if we could add less noise to features that are more important while keeping the same privacy guarantee, we would likely obtain better learning performance.
Overview of proposed method.
To address the above two problems (A) and (B), we propose to split the features of the source data into disjoint subsets and transfer the learned model from each subset to the target domain separately. For models trained on these subsets, we can hope that each of them suffers a smaller impact from the noise while preserving privacy. Besides, when feature importance is available, less noise can be added to subsets with larger importance. These are the intuitions behind our work. The framework is shown in Figure 2.
3.2 Building Block: PLR with Feature Splitting
In order to accomplish the above procedure, we first propose a new algorithm, which incorporates feature splitting into Algorithm 1. This algorithm will also act as the key building block of our approach.
The proposed method is presented in Algorithm 3. In the sequel, let $\{\mathcal{F}_k\}_{k=1}^K$ be a split of all features into $K$ subsets, and let the importance of each split be $p_k$, with $p_k \ge 0$ and $\sum_k p_k = 1$. First, the features in the dataset are split into $K$ subsets according to $\{\mathcal{F}_k\}$ in step 1. Then, as in Algorithm 1, the regularization parameter is refined in step 2. However, to ensure privacy, the features in each subset are also rescaled (step 4), and extra terms involving $\mathbf{b}^{(k)}$ and $\Delta^{(k)}$ are added (steps 5-10). Finally, we train a logistic regression within each subset of the dataset (step 11).
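The feature-splitting step can be sketched as follows. This is an illustrative reading of the setup, not the paper's exact Algorithm 3: we assume that each group's weight $p_k$ is its share of the privacy budget, so a larger $p_k$ means less noise for that group, and that importance-based splitting assigns weights proportional to the total importance inside each group.

```python
import numpy as np

def split_features(d, K, importance=None, seed=0):
    """Partition d features into K disjoint groups and assign each group a
    weight p_k (the p_k sum to 1). With importance scores, groups holding
    more important features receive larger p_k, i.e. less noise."""
    rng = np.random.default_rng(seed)
    if importance is None:
        groups = np.array_split(rng.permutation(d), K)   # random split
        p = np.full(K, 1.0 / K)                          # equal budget shares
    else:
        importance = np.asarray(importance, dtype=float)
        order = np.argsort(importance)[::-1]             # most important first
        groups = np.array_split(order, K)
        group_scores = np.array([importance[g].sum() for g in groups])
        p = group_scores / group_scores.sum()            # budget ~ importance
    return groups, p
```

Each group can then be fed to a private logistic regression with its budget share $p_k \epsilon$, which is the intuition behind adding less noise to more important features.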
The privacy guarantee is given in Proposition 3. This proposition extends the previous one (Proposition 1) for Algorithm 1. However, the extension is not trivial. First, we have multiple learning problems in Algorithm 3, and they are not independent of each other due to the shared labels. Second, a different importance $p_k$ exists for each subset of features.
Proposition 3.
If each regularizer is 1-strongly convex, then Algorithm 3 has an $\epsilon$-differential privacy guarantee.
As multiple predictors are generated by Algorithm 3, unlike Proposition 2, which gives the learning performance of Algorithm 1 as a whole, we can only give the learning performance of each predictor generated by Algorithm 3. This is done in Proposition 4.
Proposition 4.
Let $L$ denote the expected loss over the $k$-th feature subset. There exists a constant $C$ such that for $\delta > 0$, if the training samples are drawn i.i.d. according to $\mathcal{D}$, and if
$$n > C \max\!\left( \frac{\|\mathbf{w}_0^{(k)}\|^2 \log(1/\delta)}{\epsilon_g^2}, \; \frac{d_k \log(d_k/\delta)\,\|\mathbf{w}_0^{(k)}\|}{\epsilon_g\, p_k \epsilon}, \; \frac{\|\mathbf{w}_0^{(k)}\|^2}{\epsilon_g\, p_k \epsilon} \right), \quad (7)$$
where $d_k$ is the dimension of the $k$-th subset and $\mathbf{w}_0^{(k)}$ is a reference predictor, then $\bar{\mathbf{w}}^{(k)}$ in Algorithm 3 satisfies $\Pr[L(\bar{\mathbf{w}}^{(k)}) \le L(\mathbf{w}_0^{(k)}) + \epsilon_g] \ge 1 - \delta$.
Comparing with Proposition 2, the main difference is that the full dimension $d$ is replaced with the subset dimension $d_k$ in Proposition 4. Note that, as the feature vectors in Algorithm 3 are rescaled (step 4), the norm of each per-subset reference predictor is smaller than that of the full reference predictor. When $n$ is small and the dimension is large, the second term in (3) and (7) dominates. In this case, the sample complexity in Proposition 4 is strictly smaller than that in Proposition 2: if features are randomly split and their importance is treated equally, then $p_k = 1/K$, and the ratio of the two sample complexities vanishes as $K$ grows. This leads to a strict improvement over Proposition 2. Besides, if feature importance is known, $p_k$ is larger for the important feature groups, and their bounds become smaller. While the bounds for the less important feature groups become worse, important features are more informative, so we will see that splitting based on features' importance can yield a significant improvement over random splitting on datasets where the variance of feature importance is large.
3.3 Proposed PPTL-FS Approach
Now, we are ready to present the proposed algorithm to solve the privacypreserving transfer learning problem in Figure 1. The complete procedure is given in Algorithm 4, and illustrated in Figure 2. The detailed steps are as follows:
Feature Splitting.
First, we require both the source and the target to use the same splitting of features. This can be done either randomly or based on features' importance.
Source (step 1).
With the agreed feature splitting, Algorithm 3 is run on the source data to produce the private source predictors.
Target (steps 2-4).
With the private predictors from the source, we then try to learn a high-quality model in the target domain. Following Section 3.1, we adapt HTL by taking the source predictors into the regularizer. This again can be done with Algorithm 3. However, the learning process produces $K$ predictors in the target, and we know neither which predictor should be used nor how to combine them. To solve this problem, we propose to use a variant of stacked generalization [Ting and Witten 1997].
We randomly split the target training data into two disjoint parts, each containing half of the samples (step 2). Then, Algorithm 3 is trained on the first part (the level-0 model), and we obtain $K$ predictors for the target (step 3). After that, we take the labels predicted on the second part by these predictors as features, and train a privacy-preserving logistic regression on them (the level-1 model), which can be done by Algorithm 1 (step 4 in Algorithm 4). The final prediction is then given by the output of the level-1 model. In this way, the level-1 model acts as an aggregator that combines the $K$ level-0 predictors in a data-driven manner.
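The prediction path of this stacked design can be sketched as follows. Here each level-0 model is a `(feature_indices, weights)` pair acting only on its own feature group; we use sigmoid scores as the meta-features (a stand-in for the predicted labels the paper feeds to the level-1 model), and the names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stacked_predict(level0_models, level1_weights, X):
    """level0_models: list of (feature_indices, weight_vector) pairs, one per
    feature group. Their per-group scores on X become the meta-features fed
    to the level-1 linear aggregator, whose output is the final prediction."""
    meta = np.column_stack(
        [sigmoid(X[:, idx] @ w) for idx, w in level0_models])
    return sigmoid(meta @ level1_weights)
```

In training, the level-0 pairs would come from Algorithm 3 on the first half of the target data, and `level1_weights` from a private logistic regression fit on the meta-features of the second half.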
| ε | 0.5 | 1.0 | 2.0 | 4.0 | 8.0 |
| --- | --- | --- | --- | --- | --- |
| PPTL-FS(W) | 0.9007 ± 0.0391 | 0.9500 ± 0.0204 | 0.9825 ± 0.0051 | 0.9895 ± 0.0019 | 0.9921 ± 0.0017 |
| PPTL-FS(R) | 0.7687 ± 0.0998 | 0.8459 ± 0.0701 | 0.9361 ± 0.0214 | 0.9761 ± 0.0095 | 0.9964 ± 0.0004 |
| SimComb | 0.7005 ± 0.0613 | 0.8088 ± 0.0812 | 0.9642 ± 0.0073 | 0.9906 ± 0.0018 | 0.9943 ± 0.0009 |
| SourceD | 0.6527 ± 0.0927 | 0.7510 ± 0.0854 | 0.8806 ± 0.0431 | 0.9401 ± 0.0180 | 0.9523 ± 0.0088 |
| Direct | 0.6467 ± 0.0787 | 0.6978 ± 0.0651 | 0.8657 ± 0.0429 | 0.9632 ± 0.0117 | 0.9877 ± 0.0039 |
| ε | 0.5 | 1.0 | 2.0 | 4.0 | 8.0 |
| --- | --- | --- | --- | --- | --- |
| PPTL-FS(W) | 0.7108 ± 0.0611 | 0.7469 ± 0.0319 | 0.7649 ± 0.0255 | 0.7581 ± 0.0392 | 0.7564 ± 0.0362 |
| PPTL-FS(R) | 0.6235 ± 0.0732 | 0.6778 ± 0.0495 | 0.7107 ± 0.0474 | 0.7372 ± 0.0337 | 0.7226 ± 0.0397 |
| SimComb | 0.5570 ± 0.1134 | 0.6023 ± 0.0854 | 0.6255 ± 0.0763 | 0.6416 ± 0.0706 | 0.7386 ± 0.0311 |
| SourceD | 0.6067 ± 0.0678 | 0.6298 ± 0.0420 | 0.6546 ± 0.0330 | 0.6489 ± 0.0462 | 0.7349 ± 0.0320 |
| Direct | 0.5112 ± 0.0733 | 0.5476 ± 0.0882 | 0.6216 ± 0.0815 | 0.6649 ± 0.0552 | 0.7231 ± 0.0381 |
Algorithm Analysis.
Theorem 5.
The final output of Algorithm 4 is differentially private with respect to the target data, and the released source predictors are differentially private with respect to the source data.
As the learning guarantee of stacked generalization is still an open issue [Ting and Witten 1999], we cannot offer a generalization performance guarantee for Algorithm 4 here. However, we can still expect Algorithm 4 to be much better than Algorithm 2, for the following reasons. First, when less noise is introduced, better generalization performance can be expected in the target domain, as verified by Proposition 4; this means that the predictors from Algorithm 4 perform better than their counterparts from Algorithm 2. Second, while feature interactions may be slightly weakened by the feature splitting, the level-1 model can alleviate this problem by aggregating the predictions from the different splits of the level-0 model in a supervised way. Finally, feature importance can be incorporated in Algorithm 4, which can further boost learning performance by adding less noise to more important features.
4 Experiment
In the sequel, experiments are performed on a server with an Intel(R) Xeon(R) E5 CPU and 250GB memory. All code is implemented in Python.
4.1 Data Description
Two datasets are used in our experiments. The first is MNIST, which consists of small images of handwritten digits and is popularly used for handwriting recognition [LeCun et al. 1998]. The second is a private dataset provided by Shanghai RUIJIN hospital, used to train and test for early signs of chronic diseases, such as diabetes, that affect a large sector of the Chinese population. Our problem setup (Figure 1) also comes from the application background of this dataset. The details of the data are as follows. (Due to the noise, the dimension in private applications is much smaller than in non-private applications; in our experiments, the dimension of the datasets is as large as in previous works [Chaudhuri, Monteleoni, and Sarwate 2011, Abadi et al. 2016].)
Mnist.
We generate a toy dataset of two binary classification tasks here: one pair of digits is taken as the source task and another as the target. We apply PCA on the whole dataset and select the top 100 features with the largest variance. To simulate the case where the task dataset is not large, we randomly pick 2000 samples for the source and 1000 samples for the target.
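The PCA preprocessing step above can be sketched as follows; this is a generic top-k projection via SVD under our own assumptions, not the paper's exact pipeline.

```python
import numpy as np

def pca_top_k(X, k):
    """Project X onto its top-k principal components, i.e. the k directions
    of largest variance, using the SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    # Singular values are returned in descending order, so the first k rows
    # of Vt are the directions with the largest variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
# Synthetic stand-in data with decaying per-feature variance.
X = rng.normal(size=(500, 30)) * np.linspace(5.0, 0.1, 30)
Z = pca_top_k(X, k=10)
```

The resulting 100-dimensional representation (here, 10 on the toy data) is what the private learners would then consume.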
Ruijin.
This is a real dataset for which user privacy is a big concern, as the data contains patients' personal medical records. The dataset was collected in two medical investigations over 22 medical centres distributed across different locations in China, conducted by Shanghai RUIJIN Hospital in 2010 and 2013, respectively. In total, 105,763 participants took part in both investigations. The first investigation consists of questionnaires and laboratory test records collecting demographic information, disease information, lifestyle information and physical examination results. The second investigation includes the diagnosis of diabetes, which is used as the ground truth (labels) for diabetes prediction. As suggested by the hospital, we use 52 important features (Table 3); the first two centres are combined as the source, and the remaining centres serve as targets. When we make predictions for patients from a new centre, we wish to rely on previously built models for other related but different centres; otherwise we would have to do a lot of new manual labor at each new centre. When doing this transfer learning, user privacy is a key concern. Thus, we wish to apply privacy-preserving transfer learning algorithms to solve these problems.
| centre id | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- |
| #sample | 7882 | 4820 | 4334 | 4739 | 6121 | 2327 |

| centre id | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- |
| #sample | 5619 | 6360 | 4966 | 5793 | 6215 | 3659 |

| centre id | 13 | 14 | 15 | 16 | 17 | 18 |
| --- | --- | --- | --- | --- | --- | --- |
| #sample | 5579 | 2316 | 4285 | 6017 | 6482 | 4493 |
| centre | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- |
| PPTL-FS(W) | 0.7469 ± 0.0321 | 0.7360 ± 0.0317 | 0.7395 ± 0.0400 | 0.7143 ± 0.0397 | 0.7664 ± 0.0388 | 0.7072 ± 0.0169 |
| PPTL-FS(R) | 0.6778 ± 0.0495 | 0.7237 ± 0.0373 | 0.6517 ± 0.1026 | 0.7081 ± 0.0329 | 0.6529 ± 0.0697 | 0.6627 ± 0.0365 |
| SimComb | 0.6023 ± 0.0853 | 0.6080 ± 0.0782 | 0.5279 ± 0.0623 | 0.5632 ± 0.0670 | 0.5776 ± 0.0750 | 0.6012 ± 0.0307 |
| SourceD | 0.6298 ± 0.0420 | 0.6151 ± 0.0661 | 0.5270 ± 0.0565 | 0.5743 ± 0.0631 | 0.5592 ± 0.0690 | 0.6017 ± 0.0308 |
| Direct | 0.5476 ± 0.0882 | 0.6203 ± 0.0556 | 0.6355 ± 0.0458 | 0.5788 ± 0.0749 | 0.5327 ± 0.0575 | 0.6132 ± 0.0350 |

| centre | 9 | 10 | 11 | 12 | 13 | 14 |
| --- | --- | --- | --- | --- | --- | --- |
| PPTL-FS(W) | 0.7205 ± 0.0464 | 0.7529 ± 0.0417 | 0.7010 ± 0.0233 | 0.6983 ± 0.0361 | 0.7357 ± 0.0462 | 0.7384 ± 0.0451 |
| PPTL-FS(R) | 0.6817 ± 0.0336 | 0.6916 ± 0.0444 | 0.6346 ± 0.0267 | 0.6437 ± 0.0499 | 0.6350 ± 0.0536 | 0.6452 ± 0.0610 |
| SimComb | 0.5804 ± 0.0708 | 0.5827 ± 0.0566 | 0.5468 ± 0.0665 | 0.5167 ± 0.0747 | 0.5652 ± 0.0587 | 0.5469 ± 0.0893 |
| SourceD | 0.5757 ± 0.0769 | 0.5362 ± 0.0768 | 0.5465 ± 0.0476 | 0.5351 ± 0.0585 | 0.5625 ± 0.0582 | 0.5463 ± 0.0712 |
| Direct | 0.5609 ± 0.0764 | 0.5842 ± 0.0450 | 0.5151 ± 0.0652 | 0.5546 ± 0.0608 | 0.5531 ± 0.0659 | 0.5203 ± 0.0889 |

| centre | 15 | 16 | 17 | 18 | averaged ranking |
| --- | --- | --- | --- | --- | --- |
| PPTL-FS(W) | 0.7459 ± 0.0520 | 0.6614 ± 0.0942 | 0.6968 ± 0.0233 | 0.6039 ± 0.0124 | 1 |
| PPTL-FS(R) | 0.7177 ± 0.0647 | 0.6435 ± 0.0439 | 0.6470 ± 0.0619 | 0.5667 ± 0.0362 | 2 |
| SimComb | 0.5917 ± 0.0806 | 0.6152 ± 0.0708 | 0.5583 ± 0.0652 | 0.5239 ± 0.0269 | 3.81 |
| SourceD | 0.5554 ± 0.0830 | 0.6068 ± 0.0916 | 0.5515 ± 0.0620 | 0.5274 ± 0.0246 | 3.93 |
| Direct | 0.6193 ± 0.0701 | 0.5634 ± 0.0259 | 0.5578 ± 0.0602 | 0.5169 ± 0.0527 | 4.18 |
4.2 Compared Methods
We compare the following methods: (i) Direct: directly training a differentially private logistic regression model on the target dataset; (ii) SourceD: directly using the differentially private source classifier to predict on the target dataset; (iii) SimComb: the simple approach (Algorithm 2); and (iv) two variants of the proposed method (Algorithm 4): PPTL-FS(R), where features are split uniformly at random into groups, and PPTL-FS(W), where features are split into groups of approximately equal size based on their importance. For MNIST, the feature importance is given by the variance obtained from PCA; for RUIJIN, feature importance is given by doctors at RUIJIN hospital.
For all experiments, we randomly split both the source and target datasets into training and testing sets. All other parameters (for all methods) are tuned by cross-validation. Moreover, the number of splits $K$ is fixed for our proposed method. In the transfer phase of our proposed method, we use part of the target training set for the level-0 models and the remainder for the level-1 model. For performance evaluation, we use the area under the curve (AUC) [Hanley and McNeil 1983], which is the measurement most often used for classification problems; a higher AUC is desired. For each dataset, we repeat the experiment several times.
4.3 Performance
Mnist.
We vary the privacy budget ε over {0.5, 1.0, 2.0, 4.0, 8.0}. The results are shown in Table 1 and Figure 3(a). (In the sequel, the highest and statistically comparable AUCs according to the pairwise t-test are highlighted; due to lack of space, larger versions of the figures are in the appendix.) First, we can see that when ε gets smaller, the AUC of all methods decreases. This is because more noise needs to be introduced for smaller ε. Then, SimComb consistently performs better than SourceD and Direct, which verifies the benefits of hypothesis transfer learning. Besides, when ε gets smaller, PPTL-FS(R) can outperform SimComb. This observation is consistent with our Proposition 4, which shows that better learning performance can be achieved when the sample size is small and the dimension is large. Finally, PPTL-FS(W) is the best method, as it further improves over PPTL-FS(R) by adding less noise to more important features.
Ruijin.
Here, we fix the privacy budget ε. The results are shown in Table 4. Unlike the previous case, SimComb does not always perform better than SourceD and Direct, which is perhaps due to the noise introduced in the features. However, PPTL-FS(R) improves over SimComb by feature splitting, and consistently performs better than SourceD, Direct and SimComb. Finally, PPTL-FS(W), which considers feature importance, is the best method.
We then vary ε and plot the testing AUC on centre 3 in Figure 3(b). The observations are the same as for the MNIST dataset: the testing AUC of all methods decreases with smaller ε, and PPTL-FS(W) is the best among the methods with a privacy guarantee.
Influence of the splitting size K.
In the previous experiments we fixed the split size K. Here we demonstrate the influence of K. On the RUIJIN and MNIST datasets, we test the performance of PPTL-FS(W) and PPTL-FS(R) for different values of K; other parameters are tuned as in the previous experiments, and the experiment is repeated several times. The results are shown in Figure 4.
As we can see, there is a peak value of K for our proposed method on both datasets. When K increases, the dimension within each subset decreases, and thus the noise level decreases, which improves performance; this is consistent with Proposition 4. When K is too large, however, the interactions among features cannot be learned well, so performance decreases. Another observation is that the peak K of PPTL-FS(W) is smaller than that of PPTL-FS(R). This is because PPTL-FS(W) tends to put important features in subsets that can already be learned well with a relatively small K. Note that for all values of K on both datasets, the performance of PPTL-FS(R) is not better than that of PPTL-FS(W) at the tested significance level.
5 Conclusion and Future Works
In this paper, we presented a novel privacy-preserving transfer learning method for the knowledge sharing problem. Unlike previous privacy-preserving methods, we pay particular attention to learning performance in the target domain, even after much noise is added in the source domain for privacy preservation. This is done by using a stacked learning model whose components come from the source domain split by features. We show that this approach addresses both the privacy-preservation issue and the model-performance issue, which is critical for machine learning under privacy laws in many parts of the world.
In the future, we first plan to explore (ε, δ)-differential privacy, which is a relaxed version of Definition 1 [Dwork et al. 2006a]. This usually leads to a weaker privacy guarantee but better prediction performance. We will also consider extensions that transfer other types of learning algorithms, such as deep learning, kernel learning and Bayesian learning.
References
 [Abadi et al.2016] Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H. B.; Mironov, I.; Talwar, K.; and Zhang, L. 2016. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 308–318. ACM.
 [Bassily, Smith, and Thakurta2014] Bassily, R.; Smith, A.; and Thakurta, A. 2014. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Annual Symposium on Foundations of Computer Science, 464–473. IEEE.
 [Chaudhuri, Monteleoni, and Sarwate2011] Chaudhuri, K.; Monteleoni, C.; and Sarwate, A. D. 2011. Differentially private empirical risk minimization. Journal of Machine Learning Research 12(Mar):1069–1109.
 [Dwork and Roth2014] Dwork, C., and Roth, A. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9(3–4):211–407.
 [Dwork et al.2006a] Dwork, C.; Kenthapadi, K.; McSherry, F.; Mironov, I.; and Naor, M. 2006a. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, 486–503.
 [Dwork et al.2006b] Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. 2006b. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, 265–284. Springer.
 [Dwork2008] Dwork, C. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, 1–19. Springer.

 [Emekçi et al.2007] Emekçi, F.; Sahin, O. D.; Agrawal, D.; and El Abbadi, A. 2007. Privacy preserving decision tree learning over multiple parties. Data & Knowledge Engineering 63(2):348–361.
 [Fong and Weber-Jahnke2012] Fong, P. K., and Weber-Jahnke, J. H. 2012. Privacy preserving decision tree learning using unrealized data sets. IEEE Transactions on Knowledge and Data Engineering 24(2):353–364.
 [Hamm, Cao, and Belkin2016] Hamm, J.; Cao, Y.; and Belkin, M. 2016. Learning privately from multiparty data. In International Conference on Machine Learning, 555–563.

 [Hanley and McNeil1983] Hanley, J. A., and McNeil, B. J. 1983. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148(3):839–843.
 [Jagannathan, Pillaipakkamnatt, and Wright2009] Jagannathan, G.; Pillaipakkamnatt, K.; and Wright, R. N. 2009. A practical differentially private random decision tree classifier. In IEEE International Conference on Data Mining Workshops, 114–121. IEEE.
 [Kasiviswanathan and Jin2016] Kasiviswanathan, S. P., and Jin, H. 2016. Efficient private empirical risk minimization for highdimensional learning. In International Conference on Machine Learning, 488–497.
 [Kuzborskij and Orabona2013] Kuzborskij, I., and Orabona, F. 2013. Stability and hypothesis transfer learning. In International Conference on Machine Learning, 942–950.
 [Kuzborskij and Orabona2017] Kuzborskij, I., and Orabona, F. 2017. Fast rates by transferring from auxiliary hypotheses. Machine Learning 106(2):171–195.
 [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradientbased learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 [Orabona et al.2009] Orabona, F.; Castellini, C.; Caputo, B.; Fiorilla, A. E.; and Sandini, G. 2009. Model adaptation with leastsquares SVM for adaptive hand prosthetics. In IEEE International Conference on Robotics and Automation, 2897–2903.
 [Pan and Yang2010] Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.
 [Papernot et al.2017] Papernot, N.; Abadi, M.; Erlingsson, U.; Goodfellow, I.; and Talwar, K. 2017. Semisupervised knowledge transfer for deep learning from private training data. In International Conference on Learning Representations.
 [Pathak, Rane, and Raj2010] Pathak, M.; Rane, S.; and Raj, B. 2010. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems, 1876–1884.
 [Rajkumar and Agarwal2012] Rajkumar, A., and Agarwal, S. 2012. A differentially private stochastic gradient descent algorithm for multiparty classification. In Artificial Intelligence and Statistics, 933–941.
 [ShalevShwartz and Srebro2008] ShalevShwartz, S., and Srebro, N. 2008. SVM optimization: inverse dependence on training set size. In International Conference on Machine Learning, 928–935. ACM.
 [Shokri and Shmatikov2015] Shokri, R., and Shmatikov, V. 2015. Privacypreserving deep learning. In ACM SIGSAC conference on computer and communications security, 1310–1321.
 [Ting and Witten1997] Ting, K. M., and Witten, I. H. 1997. Stacked generalization: when does it work? In International Joint Conference on Artifical Intelligence.
 [Ting and Witten1999] Ting, K. M., and Witten, I. H. 1999. Issues in stacked generalization. Journal of Artificial Intelligence Research 10:271–289.
 [Xie et al.2017] Xie, L.; Baytas, I.; Lin, K.; and Zhou, J. 2017. Privacypreserving distributed multitask learning with asynchronous updates. In International Conference on Knowledge Discovery and Data Mining, 1195–1204.