1 Introduction
Multitask learning (MTL) [12] refers to the paradigm of learning multiple related tasks together. By contrast, single-task learning (STL) refers to the paradigm of learning each individual task independently. MTL often leads to better trained models because the commonalities among related tasks may assist in the learning process for each specific task. For example, the capability of an infant to recognize a cat might help in the development of his/her capability to recognize a dog. In recent years, MTL has received considerable interest in a broad range of application areas, including computer vision [38, 62][2] and health informatics [52, 55].
Because MTL approaches explore and leverage the commonalities among related tasks within the learning process, either explicitly or implicitly, they pose a potential security risk. Specifically, an adversary may participate in the MTL process through a participating task, thereby acquiring the model information for another task. Consider the example of adversarial attacks [53, 41, 36], which have recently received significant research attention. In such an attack, adversarial perturbations are added to data samples to fool the models into making incorrect predictions. Knowledge of the architecture and parameters of the attacked model makes the execution of a successful attack much easier; such a scenario is termed a white-box attack [59]. A black-box attack, in which at least the parameter values are not known to the adversary, is much more complicated and requires more conditions to be satisfied [59, 46]. Therefore, the leakage of information on one model to adversaries can escalate a black-box attack into a white-box attack. Such attacks are effective not only against deep neural networks but also against linear models such as logistic regression models and support vector machines [45], and they do not even necessarily require access to the data samples; instead, it may be possible to add one adversarial perturbation to the data-recording device, which may universally flip the predictions of most data samples [42].
We consider a concrete example in which an adversary hacks the models of interest without requiring any individual’s data. This example involves fooling a face-validation system. An adversary Jack from company A hacks the face-validation model of company B through the joint learning of models related to both companies. Jack then creates a mask containing an adversarial perturbation and pastes it onto the face-validation camera of company B to cause company B’s model to generate incorrect predictions for most employees of B, thereby achieving a successful white-box attack. Furthermore, we assume that Jack may even have an opportunity to act as a user of company B’s model. Given the model of company B, Jack can easily disguise himself to cause the model to predict that he is a valid employee of B, thereby achieving another successful white-box attack. However, if Jack has no information on the model, such attacks will be much more difficult. Similar white-box attacks can be generalized to many other applications.
As another example, we consider personalized predictive modeling [43, 61], which has become a fundamental methodology in health informatics. In this type of modeling, a custom-made model is built for each person. In modern health informatics, such a model may contain sensitive information, such as how and why a specific patient did or did not develop a specific disease. Such information constitutes a different type of privacy from that of any single data record, and it also requires protection. However, the joint training of personalized models may allow such causes of disease to be leaked to an adversary whose model is also participating in the joint training.
Because of the concerns discussed above, it is necessary to develop a secure training strategy for MTL approaches to prevent information on each model from leaking to other models.
However, few privacy-preserving MTL approaches have been proposed to date [40, 9, 47, 50, 29]. Moreover, such approaches protect only the security of data instances rather than that of models/predictors. A typical focus of research is distributed learning [40, 9], in which the datasets for different tasks are distributively located. The local task models are trained independently using their own datasets before being aggregated and injected with useful knowledge shared across tasks. Such a procedure mitigates the privacy problem by updating each local model independently. However, these methods do not provide theoretical guarantees of privacy.
Pathak et al. [47] proposed a privacy-preserving distributed learning scheme with privacy guarantees using tools offered by differential privacy [20], which provides a strong, cryptographically motivated definition of privacy based on rigorous mathematical theory and has recently received significant research attention due to its robustness to known attacks [15, 24]. This scheme is useful when one wishes to prevent potential attackers from acquiring information on any element of the input dataset based on a change in the output distribution. Pathak et al. [47] first trained local models distributively and then averaged the decision vectors of the tasks before adding noise based on the output perturbation method of [19]. However, because only averaging is performed in this method, it has a limited ability to cover more complicated relations among tasks, such as low-rank [4], group-sparse [54], clustered [27] or graph-based [63] task structures. Gupta et al. [29] proposed a scheme for transforming multitask relationship learning [63] into a differentially private version, in which the output perturbation method is also adopted. However, their method requires a closed-form solution (obtained by minimizing the least-squares loss function) to achieve a theoretical privacy guarantee; thus, it cannot guarantee privacy for methods such as logistic regression, which require iterative optimization procedures. Moreover, on their synthetic datasets, their privacy-preserving MTL methods underperformed even compared with non-private STL methods (which can guarantee optimal privacy against the leakage of information across tasks), which suggests that there is no reason to use their proposed methods. In addition, they did not study the additional privacy leakage due to the iterative nature of their algorithm (see Kairouz et al. [35]), and they did not study the utility bound for their method. The approaches of both Pathak et al. [47] and Gupta et al. [29] protect a single data instance instead of the predictor/model of each task, and they involve adding noise directly to the predictors, which is not necessary for avoiding the leakage of information across tasks and may induce excessive noise and thus jeopardize the utility of the algorithms.
To overcome these shortcomings, in this paper, we propose model-protected multitask learning (MPMTL), which enables the joint learning of multiple related tasks while simultaneously preventing leakage of the models for each task.
Without loss of generality, our MPMTL method is designed based on linear multitask models [32, 39], in which we assume that the model parameters are learned by minimizing an objective that combines an average empirical prediction loss and a regularization term. The regularization term captures the commonalities among the different tasks and couples their model parameters. The solution process for this type of MTL method can be viewed as a recursive two-step procedure. The first step is a decoupled learning procedure in which the model parameters of each task are estimated independently using some precomputed shared information among tasks. The second step is a centralized transfer procedure in which some of the information shared among tasks is extracted for distribution to each task for the decoupled learning procedure in the next step. Our MPMTL mechanism protects the models by adding perturbations during this second step. Note that we assume a curator that collects models for joint learning but never needs to collect task data. We develop a rigorous mathematical definition of the MPMTL problem and propose an algorithmic framework to obtain the solution. We add perturbations to the covariance matrix of the parameter matrix because the tasks’ covariance matrix is widely used as a fundamental ingredient from which to extract useful knowledge to share among tasks [39, 63, 32, 5, 64, 52]. Consequently, our technique can cover a wide range of MTL algorithms. Fig. 1 illustrates the key ideas of the main framework.
Several methods have been proposed for the private release of the covariance matrix [34, 21, 10]. Considering an additive noise matrix, according to our utility analysis, the overall utility of the MTL algorithm depends on the spectral norm of the noise matrix. A list of the bounds on the spectral norms of additive noise matrices can be found in Jiang et al. [34]. We choose to add Wishart noise [34] to the covariance matrix for four reasons: (1) For a given privacy budget, this type of noise matrix has a better spectral-norm bound than that of a Laplace noise matrix [34]. (2) Unlike a Gaussian noise matrix, which enables an $(\epsilon, \delta)$-private method with a positive $\delta$, this approach enables an $(\epsilon, 0)$-private method and can be used to build an iterative method that is entirely $(\epsilon, 0)$-private, which provides a stronger privacy guarantee. (3) Unlike Gaussian and Laplace noise matrices, a Wishart noise matrix is positive definite. Thus, it can be guaranteed that our method will not underperform compared with STL methods under high noise levels, which means that participation in the joint learning process will have no negative effect on the training of any task model. (4) This approach allows arbitrary changes to any task, unlike the method of Blocki et al. [10]. The usage of the perturbed covariance matrix depends on the specific MTL method applied.
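Reason (3) can be illustrated with a minimal sketch that draws a Wishart-type matrix as $ZZ^{\mathrm{T}}$ with Gaussian $Z$ and checks that its quadratic forms stay nonnegative. The function names, the degrees of freedom and the scale below are illustrative assumptions; they are not the calibration needed to meet a given privacy budget.

```python
import random

def sample_wishart(d, dof, scale, rng):
    """Draw E = Z Z^T, where Z is d x dof with i.i.d. N(0, scale) entries.

    E then follows a Wishart distribution W_d(dof, scale * I). The values
    of dof and scale are placeholders, not a privacy calibration.
    """
    std = scale ** 0.5
    z = [[rng.gauss(0.0, std) for _ in range(dof)] for _ in range(d)]
    return [[sum(z[i][k] * z[j][k] for k in range(dof)) for j in range(d)]
            for i in range(d)]

def quadratic_form(mat, x):
    """Return x^T mat x for a square matrix given as nested lists."""
    n = len(x)
    return sum(x[i] * mat[i][j] * x[j] for i in range(n) for j in range(n))

rng = random.Random(0)
noise = sample_wishart(d=4, dof=5, scale=2.0, rng=rng)

# Probe the noise matrix with random directions: x^T E x = ||Z^T x||^2 >= 0.
probes = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(100)]
min_form = min(quadratic_form(noise, x) for x in probes)
```

Because the noise has no negative directions, adding it to a covariance matrix can only inflate the covariance, which is the property the argument above relies on.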
We further develop two concrete approaches as instantiations of our framework, each of which transforms an existing MTL algorithm into a private version. Specifically, we consider two popular types of basic MTL models: 1) a model that learns a low-rank subspace by means of a trace norm penalty [32] and 2) a model that performs shared feature selection by means of a group $\ell_1$ ($\ell_{2,1}$ norm) penalty [39]. In both cases, the covariance matrix is used to build a linear transform matrix with which to project the models into new feature subspaces, among which the most useful subspaces are selected. Privacy guarantees are provided. Utility analyses are also presented for both convex and strongly convex prediction loss functions and for both the basic and accelerated proximal-gradient methods. To the best of our knowledge, we are the first to present a utility analysis for differentially private MTL algorithms that allow heterogeneous tasks. By contrast, the distributed tasks studied by Pathak et al. [47] are homogeneous; i.e., the coding procedures for both features and targets are the same for different tasks. Furthermore, heterogeneous privacy budgets are considered for different iterations of our algorithms, and a utility analysis for budget allocation is presented. The utility results under the best budget allocation strategies are summarized in Table I (the notations used in the table are defined in the next section). In addition, we analyze the difference between our MPMTL scheme and the existing privacy-preserving MTL strategies, which ensure the security of data instances. We also validate the effectiveness of our approach on benchmark and real-world datasets. Since existing privacy-preserving MTL methods only protect single data instances, they are transformed into methods that can protect task models using our theoretical results. The experimental results demonstrate that our algorithms outperform existing privacy-preserving MTL methods on the proposed model-protection problem.
Table I: Utility results under the best budget allocation strategies.
Low rank: No Acceleration (Convex, Strong convex); Use Acceleration (Convex, Strong convex)
Group sparse: No Acceleration (Convex, Strong convex); Use Acceleration (Convex, Strong convex)
The contributions of this paper are highlighted as follows.

We are the first to propose and address the model-protection problem in the MTL setting.

We develop a general algorithmic framework to solve the MPMTL problem to obtain secure estimates of the model parameters. We derive concrete instantiations of our algorithmic framework for two popular types of MTL models, namely, models that learn the low-rank and group-sparse patterns of the model matrix. It can be guaranteed that our algorithms will not underperform in comparison with STL methods under high noise levels.

Privacy guarantees are provided. To the best of our knowledge, we are the first to provide privacy guarantees for differentially private MTL algorithms that allow heterogeneous tasks and that minimize the logistic loss and other loss functions that do not have closed-form solutions.

Utility analyses are also presented for both convex and strongly convex prediction loss functions and for both the basic and accelerated proximal-gradient methods. To the best of our knowledge, we are the first to present utility bounds for differentially private MTL algorithms that allow heterogeneous tasks. Heterogeneous privacy budgets are considered, and a utility analysis for budget allocation is presented.

Existing privacy-preserving MTL methods that protect data instances are transformed into methods that can protect task models. Experiments demonstrate that our algorithms significantly outperform them on the proposed model-protection problem.
The remainder of this paper is organized as follows. Section 2 introduces the background on MTL problems and the definition of the proposed model-protection problem. The algorithmic framework and concrete instantiations of the proposed MPMTL method are presented in Section 3, along with the analyses of utility and privacy-budget allocation. Section 4 presents an empirical evaluation of the proposed approaches, followed by the conclusions in Section 5.
2 Preliminaries and the Proposed Problem
In this section, we first introduce the background on MTL problems and then introduce the definition of model protection.
The notations and symbols that will be used throughout the paper are summarized in Table II.
Table II: Notations and symbols
$[k]$: the index set $\{1, 2, \ldots, k\}$
$[-i]$: the index set with index $i$ removed
$\|\cdot\|_*$: the trace norm of a matrix (sum of the singular values of the matrix)
$\|\cdot\|_{2,1}$: the $\ell_{2,1}$ norm of a matrix (sum of the $\ell_2$ norms of the row vectors of the matrix)
$\mathrm{tr}(\cdot)$: the trace of a matrix (sum of the diagonal elements of the matrix)
$\sigma_j(\cdot)$: the $j$th largest singular value of a matrix
Extensive MTL studies have been conducted on linear models using regularized approaches. The basic MTL algorithm that we will consider is as follows:
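$$\min_{\mathbf{W}} \sum_{i=1}^{m} \mathcal{L}_i(\mathbf{X}_i \mathbf{w}_i, \mathbf{y}_i) + \lambda g(\mathbf{W}), \qquad (1)$$
where $m$ is the number of tasks. The datasets for the tasks are denoted by $\mathcal{D}^m = \{\mathcal{D}_1, \ldots, \mathcal{D}_m\}$, where for each $i \in [m]$, $\mathcal{D}_i = (\mathbf{X}_i, \mathbf{y}_i)$, where $\mathbf{X}_i \in \mathbb{R}^{n_i \times d}$ and $\mathbf{y}_i \in \mathbb{R}^{n_i}$ denote the data matrix and target vector of the $i$th task with $n_i$ samples and dimensionality $d$, respectively. $\mathcal{L}_i$ is the prediction loss function for the $i$th task. In this paper, we will focus on linear MTL models, where $\mathbf{w}_i \in \mathbb{R}^{d}$ denotes the predictor/decision vector for task $i$ and $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_m] \in \mathbb{R}^{d \times m}$ is the model parameter matrix. $g(\cdot)$ is a regularization term that represents the structure of the information shared among the tasks, for which $\lambda$ is the prefixed hyperparameter. As a special case, STL can be described by (1) with $\lambda = 0$.
The key to multitask learning is to relate the tasks via a shared representation, which, in turn, benefits the tasks to be learned. Each possible shared representation encodes certain assumptions regarding the task relatedness.
A typical assumption is that the tasks share a latent low-dimensional subspace [4, 16, 58]. The formulation also leads to a low-rank structure of the model matrix. Because optimization problems involving rank functions are intractable, a trace norm regularization method is typically used [3, 32, 48]:
$$\min_{\mathbf{W}} \sum_{i=1}^{m} \mathcal{L}_i(\mathbf{X}_i \mathbf{w}_i, \mathbf{y}_i) + \lambda \|\mathbf{W}\|_*. \qquad (2)$$
Another typical assumption is that all tasks share a subset of important features. Such task relatedness can be captured by imposing a group-sparse penalty on the predictor matrix to select shared features across tasks [54, 57, 39]. One commonly used group-sparse penalty is the group $\ell_1$ penalty [39, 44]:
$$\min_{\mathbf{W}} \sum_{i=1}^{m} \mathcal{L}_i(\mathbf{X}_i \mathbf{w}_i, \mathbf{y}_i) + \lambda \|\mathbf{W}\|_{2,1}. \qquad (3)$$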
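As a concrete reading of the group-sparse objective (3), the following sketch evaluates the sum of per-task losses plus the $\ell_{2,1}$ penalty on a toy problem. The choice of an averaged squared loss and all function names are assumptions made for illustration; the formulation itself leaves the loss generic.

```python
import math

def squared_loss(X, w, y):
    """Average squared prediction error of a linear model on one task."""
    preds = [sum(xj * wj for xj, wj in zip(row, w)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (W is d x m)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)

def mtl_objective(tasks, W, lam):
    """Evaluate sum_i L_i(X_i w_i, y_i) + lam * ||W||_{2,1}.

    tasks is a list of (X_i, y_i); column i of W is the predictor w_i.
    """
    loss = 0.0
    for i, (X, y) in enumerate(tasks):
        w_i = [row[i] for row in W]
        loss += squared_loss(X, w_i, y)
    return loss + lam * l21_norm(W)

# Two tasks, two features; only the first feature is used by both tasks.
tasks = [([[1.0, 0.0], [2.0, 0.0]], [1.0, 2.0]),
         ([[1.0, 1.0]], [1.0])]
W = [[1.0, 1.0],   # row for feature 1 (shared by both tasks)
     [0.0, 0.0]]   # row for feature 2 (unused, contributes no penalty)

stl_value = mtl_objective(tasks, W, lam=0.0)   # no coupling between tasks
mtl_value = mtl_objective(tasks, W, lam=0.5)   # penalty couples the rows
```

Setting `lam=0.0` removes the only term that couples the columns of `W`, which is the STL special case noted after (1).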
Now, we will present a compact definition of the model-protection problem in the context of MTL and discuss the general approach without differential privacy. As can be seen from (1), as a result of the joint learning process, the learned predictor $\hat{\mathbf{w}}_j$ may contain some information on $\hat{\mathbf{w}}_i$, for $i, j \in [m]$ and $i \neq j$. Then, it is possible for the owner of task $j$ to use such information to attack task $i$. Thus, we define the model-protection problem as follows.
Definition 1 (Model-Protection Problem for MTL).
The model-protection problem for MTL has three objectives:
1) minimizing the information on $\hat{\mathbf{w}}_i$ that can be inferred from $\hat{\mathbf{w}}_{[-i]}$, for all $i \in [m]$;
2) maximizing the prediction performance of $\hat{\mathbf{w}}_i$, for all $i \in [m]$; and
3) sharing useful predictive information among tasks.
Now, we consider two settings:
1) A noniterative setting, in which a trusted curator collects independently trained models, denoted by , for all tasks without their associated data to be used as input. After the joint learning procedure, the curator outputs the updated models, denoted by , and sends each updated model to each task privately. In this setting, the model collection process and the joint learning process are each performed only once.
2) An iterative setting, in which the model collection and joint learning processes are performed iteratively. Such a setting is more general.
One may observe that we assume the use of a trusted curator that collects the task models; this assumption will raise privacy concerns in settings in which the curator is untrusted. Such concerns are related to the demand for secure multiparty computation (SMC) [47, 25], the purpose of which is to avoid the leakage of data instances to the curator. Our extended framework considering SMC is presented in the supplementary material.
We note that the two types of problems represented by (2) and (3) are unified in the multitask feature learning framework, which is based on the covariance matrix of the tasks’ predictors [6, 22, 7]. Many other MTL methods also fall under this framework, such as the learning of clustered structures among tasks [27, 64] and the inference of task relations [63, 23, 11]. As such, we note that the tasks’ covariance matrix constitutes a major source of shared knowledge in MTL methods, and hence, it is regarded as the primary target for model protection.
Therefore, we address the model-protection problem by first rephrasing the first objective in Definition 1 as follows: minimizing the changes in $\hat{\mathbf{w}}_{[-i]}$ and the tasks’ covariance matrix ($\hat{\mathbf{W}}\hat{\mathbf{W}}^{\mathrm{T}}$ or $\hat{\mathbf{W}}^{\mathrm{T}}\hat{\mathbf{W}}$) when task $i$ participates in the joint learning process, for all $i \in [m]$. Thus, the model of this new task is protected.
Then, we find that the concept of differential privacy (minimizing the change in the output distribution) can be adopted to further rephrase this objective as follows: minimizing the changes in the distribution of $\hat{\mathbf{w}}_{[-i]}$ and the tasks’ covariance matrix when task $i$ participates in the joint learning process, for all $i \in [m]$.
In differential privacy, algorithms are randomized by introducing some kind of perturbation.
Definition 2 (Randomized Algorithm).
A randomized algorithm $\mathcal{A}$ is associated with some mapping $\mathcal{D} \rightarrow \theta \in \mathcal{C}$. Algorithm $\mathcal{A}$ outputs $\mathcal{A}(\mathcal{D}) = \theta$ with a density $p(\mathcal{A}(\mathcal{D}) = \theta)$ for each $\theta \in \mathcal{C}$. The probability space is over some kind of perturbation introduced into algorithm $\mathcal{A}$.
In this paper, $\mathcal{A}$ denotes some randomized machine learning estimator, and $\theta$ denotes the model parameters that we wish to estimate. Perturbations can be introduced into the original learning system via the (1) input data [37, 8], (2) model parameters [13, 33], (3) objective function [14, 60], or (4) optimization process [51, 56].
The formal definition of differential privacy is as follows.
Definition 3 (Dwork et al. [20]).
A randomized algorithm $\mathcal{A}$ provides $(\epsilon, \delta)$-differential privacy if, for any two adjacent datasets $\mathcal{D}$ and $\mathcal{D}'$ that differ by a single entry and for any set $\mathcal{S}$,
$$\mathbb{P}(\mathcal{A}(\mathcal{D}) \in \mathcal{S}) \leq \exp(\epsilon)\, \mathbb{P}(\mathcal{A}(\mathcal{D}') \in \mathcal{S}) + \delta,$$
where $\mathcal{A}(\mathcal{D})$ and $\mathcal{A}(\mathcal{D}')$ are the outputs of $\mathcal{A}$ on the inputs $\mathcal{D}$ and $\mathcal{D}'$, respectively.
The privacy loss pair $(\epsilon, \delta)$ is referred to as the privacy budget/loss and quantifies the privacy risk of algorithm $\mathcal{A}$. The intuition is that it is difficult for a potential attacker to infer whether a certain data point has been changed (or added to the dataset $\mathcal{D}$) based on a change in the output distribution. Consequently, the information of any single data point is protected.
Furthermore, note that differential privacy is defined in terms of application-specific adjacent input databases. In our setting, the pair consisting of each task’s model and dataset is treated as the “single entry” in Definition 3.
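The inequality in Definition 3 can be checked exhaustively on a very small mechanism. The sketch below uses randomized response on a single bit, a standard textbook mechanism that is not part of the proposed method, and verifies the bound for every output set; the function names are hypothetical.

```python
import math

def randomized_response_probs(true_bit, epsilon):
    """Output distribution of randomized response on one bit.

    The true bit is reported with probability e^eps / (1 + e^eps)
    and flipped otherwise.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}

def satisfies_dp(mechanism_eps, claimed_eps, delta=0.0):
    """Check P(A(D) in S) <= e^eps * P(A(D') in S) + delta for all S.

    The two adjacent "datasets" are the two possible values of the bit.
    """
    p0 = randomized_response_probs(0, mechanism_eps)
    p1 = randomized_response_probs(1, mechanism_eps)
    subsets = [(), (0,), (1,), (0, 1)]
    bound = math.exp(claimed_eps)
    for a, b in ((p0, p1), (p1, p0)):
        for s in subsets:
            pa = sum(a[o] for o in s)
            pb = sum(b[o] for o in s)
            if pa > bound * pb + delta + 1e-12:  # small float tolerance
                return False
    return True
```

For instance, `satisfies_dp(1.0, 1.0)` holds, while `satisfies_dp(1.0, 0.5)` fails unless a positive `delta` absorbs the gap, which mirrors the roles of $\epsilon$ and $\delta$ in the definition.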
Several mechanisms exist for introducing a specific type of perturbation. A typical type is calibrated to the sensitivity of the original “unrandomized” machine learning estimator $f$. The sensitivity of an estimator is defined as the maximum change in its output due to a replacement of any single data instance.
Definition 4 (Dwork et al. [20]).
The sensitivity $S(f)$ of a function $f$ is defined as
$$S(f) = \max_{\mathcal{D}, \mathcal{D}'} \| f(\mathcal{D}) - f(\mathcal{D}') \|$$
for all datasets $\mathcal{D}$ and $\mathcal{D}'$ that differ by at most one instance, where $\|\cdot\|$ can be any norm, e.g., the $\ell_1$ or $\ell_2$ norm.
The use of additive noise, such as Laplace noise [20], Gaussian noise [21] or Wishart noise [34], with a standard deviation proportional to $S(f)$ is a common practice for guaranteeing private learning.
Because machine learning schemes are usually presented as sequential paradigms with multiple iterations and usually output multiple variables simultaneously, several properties of differential privacy are particularly useful for ensuring privacy in machine learning, such as its post-processing immunity, group privacy, combination properties and adaptive composition. The details of these properties are introduced in the supplementary material.
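To make the link between sensitivity and noise scale concrete, here is a minimal sketch of the Laplace mechanism applied to a bounded mean query. The query, the data bounds and the function names are assumptions chosen for illustration; the mechanism used later in the paper perturbs a covariance matrix instead.

```python
import random

def mean_query(data):
    """The unrandomized estimator f: the sample mean."""
    return sum(data) / len(data)

def mean_sensitivity(n, lower, upper):
    """L1 sensitivity of the mean of n values in [lower, upper]
    when a single value is replaced."""
    return (upper - lower) / n

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(data, lower, upper, epsilon, rng):
    """Release the mean with Laplace noise of scale S(f) / epsilon."""
    scale = mean_sensitivity(len(data), lower, upper) / epsilon
    return mean_query(data) + laplace_noise(scale, rng)

rng = random.Random(0)
data = [0.2, 0.4, 0.9, 0.5, 0.1, 0.7, 0.3, 0.8, 0.6, 0.5]
released = private_mean(data, lower=0.0, upper=1.0, epsilon=1.0, rng=rng)
```

Halving `epsilon` doubles the noise scale, which is the usual trade-off between the privacy budget and the accuracy of the released value.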
3 Methodology
We present our methodology in this section: the modeling of and rationale for our MPMTL framework, two instantiations and utility analyses. Regarding the theoretical results, we present only the main results; the detailed derivations are deferred to the provided supplementary material.
3.1 The General Framework for MPMTL
We first present our main ideas for modeling our MPMTL framework in the noniterative setting before extending it to the general iterative setting.
3.1.1 Noniterative Setting
Formally, we define a noniterative MPMTL algorithm as follows.
Definition 5 (Noniterative MPMTL).
For a randomized MTL algorithm that uses and the datasets as input and outputs , is an noniterative MPMTL algorithm if for all , for neighboring input pairs and that differ only by the th task, such that or , the following holds for some constants and for any set :
(4) 
Remark 1.
Note that the curator requires only the models as input, not the datasets. The datasets are held privately by the owners of their respective tasks and are only theoretically considered as inputs for the entire MTL algorithm.
As such, for the th task (), we assume that potential adversaries may acquire , and we wish to protect both the input model and the dataset . Note that although we ultimately wish to protect , i.e., the output model for the th task, all information related to the th task in comes from the input model and the dataset , and the latter are the only objects that any algorithm can directly control because they are the inputs.
For the sake of intuition, we set and in Definition 5 to and the empty set, respectively; then, the goal of MPMTL is to ensure that when the model for a new task is provided to the curator, an adversary, including another existing task, cannot discover significant anomalous information from the distribution of its view: . Thus, the information on the new task’s model is protected. Note that for different tasks, the views of an adversary are different.
We can redefine the noniterative MPMTL algorithm in the form of differential privacy.
Let be an augmented dataset; i.e., let be treated as the th “data instance” of the augmented dataset , for all . Thus, the datasets and models associated with the tasks are transformed into a single dataset with “data instances”. Then, we define outputs such that for all , denotes the view of an adversary for the th task, which includes . Then, an noniterative MPMTL algorithm should satisfy inequalities: for all , for all neighboring datasets and that differ by the th “data instance”, and for any set , we have
(5) 
STL can be easily shown to be optimal for avoiding information leakage across tasks because the individual task models are learned independently.
Lemma 1.
For any STL algorithm that uses and datasets as input, outputs and learns each task independently, is a noniterative MPMTL algorithm.
We learn from this lemma that if there is no information sharing across tasks, then no leakage across tasks occurs.
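The intuition behind Lemma 1 can be checked on a toy example: when each task is fitted on its own data only, replacing one task leaves the models of all other tasks bit-for-bit unchanged. The one-dimensional ridge estimator and the data below are illustrative assumptions.

```python
def fit_ridge_1d(xs, ys, reg=1e-3):
    """Closed-form ridge fit of y ~ w * x for a single task."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + reg
    return num / den

def stl(tasks):
    """Learn every task independently; nothing is shared across tasks."""
    return [fit_ridge_1d(xs, ys) for xs, ys in tasks]

tasks = [([1.0, 2.0, 3.0], [2.0, 4.1, 5.9]),
         ([1.0, 2.0, 3.0], [-1.0, -2.2, -2.9]),
         ([0.5, 1.5, 2.5], [0.4, 1.6, 2.4])]

# Neighboring input: only task 0 is replaced by arbitrary data.
neighbor = list(tasks)
neighbor[0] = ([1.0, 2.0, 3.0], [50.0, -30.0, 10.0])

w, w_neighbor = stl(tasks), stl(neighbor)

# The adversary's view (all models except task 0) is identical.
view_unchanged = w[1:] == w_neighbor[1:]
```

Since the view of the other tasks does not depend on task 0 at all, its distribution cannot change either, which is why no noise is needed in the STL case.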
3.1.2 Iterative Setting
Consider an iterative MPMTL algorithm with a number of iterations . For , a trusted curator collects the models, denoted by , for all tasks. Then, modelprotected MTL is performed, and the updated models are output and sent back to their respective tasks.
In such a setting, for all , for the th task, we wish to protect dataset and the entire input sequence of its model (denoted by for short). For the th task, the output sequence is the view of a potential adversary.
We formally define an iterative MPMTL algorithm as follows.
Definition 6 (Iterative MPMTL).
Let be a randomized iterative MTL algorithm with a number of iterations . In the first iteration, performs the mapping , where includes . For , in the th iteration, performs the mapping , where includes . is an iterative MPMTL algorithm if for all , for all , and for neighboring input pairs and that differ only by the th task, such that or , the following holds for some constants and for any set :
(6) 
where for all , denotes the input for the th iteration and
. 
We can also redefine the iterative MPMTL algorithm in the form of differential privacy by considering an augmented dataset , i.e., by treating as the th “data instance” of the data set , for all , in the th iteration, for all . The details are similar to those in the noniterative setting and are omitted here.
Obviously, any noniterative MPMTL algorithm is an iterative MPMTL algorithm with a single iteration ($T = 1$).
Our MPMTL framework is elaborated in Algorithm 1, which is iterative in nature. As mentioned in Section 2, we choose to protect the tasks’ covariance matrix, which is denoted by $\mathbf{W}\mathbf{W}^{\mathrm{T}}$ or $\mathbf{W}^{\mathrm{T}}\mathbf{W}$, depending on the chosen MTL method. As previously stated, Wishart noise [34] is added. Fig. 1 illustrates the key concepts of the framework. This framework is generally applicable for many optimization schemes, such as proximal gradient methods [32, 39], alternating methods [5] and Frank-Wolfe methods [30].
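The two-step loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 1: the loss is an averaged squared loss, the shared information is a per-feature shrinkage factor obtained from the diagonal of the perturbed $\mathbf{W}\mathbf{W}^{\mathrm{T}}$ (whose $j$th diagonal entry is the squared norm of row $j$), and `noise_scale` is a placeholder rather than a privacy calibration. All names are hypothetical.

```python
import random

def wishart_noise(d, dof, scale, rng):
    """E = Z Z^T with Z a d x dof matrix of i.i.d. N(0, scale) entries."""
    z = [[rng.gauss(0.0, scale ** 0.5) for _ in range(dof)] for _ in range(d)]
    return [[sum(z[i][k] * z[j][k] for k in range(dof)) for j in range(d)]
            for i in range(d)]

def curator_step(W, threshold, noise_scale, rng):
    """Centralized transfer: perturb W W^T, then derive shrinkage factors.

    The curator sees only the d x m model matrix W, never any task data.
    """
    d, m = len(W), len(W[0])
    cov = [[sum(W[i][t] * W[j][t] for t in range(m)) for j in range(d)]
           for i in range(d)]
    noise = wishart_noise(d, d + 1, noise_scale, rng)
    noisy_diag = [cov[j][j] + noise[j][j] for j in range(d)]
    # Shrinkage factors are computed only from the perturbed covariance.
    return [max(0.0, 1.0 - threshold / (v ** 0.5)) if v > 0 else 0.0
            for v in noisy_diag]

def local_step(w, X, y, step):
    """Decoupled learning: one gradient step on a task's own squared loss."""
    n = len(y)
    resid = [sum(a * b for a, b in zip(row, w)) - t for row, t in zip(X, y)]
    grad = [2.0 / n * sum(r * row[j] for r, row in zip(resid, X))
            for j in range(len(w))]
    return [wj - step * g for wj, g in zip(w, grad)]

def run(tasks, d, iters, step, threshold, noise_scale, seed=0):
    """Alternate local updates and the perturbed centralized transfer."""
    rng = random.Random(seed)
    models = [[0.0] * d for _ in tasks]
    for _ in range(iters):
        models = [local_step(w, X, y, step) for w, (X, y) in zip(models, tasks)]
        W = [[w[j] for w in models] for j in range(d)]  # d x m model matrix
        shrink = curator_step(W, threshold, noise_scale, rng)
        models = [[s * wj for s, wj in zip(shrink, w)] for w in models]
    return models
```

With `threshold=0.0` every shrinkage factor equals one, so the loop reduces to independent gradient descent on each task, which is the STL behavior in this sketch.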
Remark 2.
In Algorithm 1, the datasets are only used in STL algorithms that can be performed locally.
(7) 
or the singular value decomposition of .
3.2 Instantiations of the MPMTL Framework
In this section, we introduce two instantiations of our MPMTL framework described in Algorithm 1. These two instantiations are related to the MTL problems represented by (2) and (3). We focus on the proximal gradient descent methods presented by Ji and Ye [32] and Liu et al. [39] for Problems (2) and (3), respectively.
First, we instantiate the MPMTL framework for Problem (2), the lowrank case, as shown in Algorithm 2. Generally speaking, the algorithm uses an accelerated proximal gradient method. Steps 5 to 9 approximate the following proximal operator:
(8) 
Fig. 2 provides a running example for model leakage and model protection under different settings of Algorithm 2.
Second, we instantiate the MPMTL framework for Problem (3), the groupsparse case, as shown in Algorithm 3. Steps 5 to 8 approximate the following proximal operator:
(9) 
The approximation error bounds for both the proximal operators are provided in Section 3.4.
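For reference, the exact, non-private proximal operator of the $\ell_{2,1}$ penalty has a well-known closed form, row-wise group soft-thresholding, sketched below. This is the standard operator only; it does not include the perturbation that Algorithm 3 introduces, and the function name and the threshold `tau` (which would play the role of the step size times $\lambda$) are assumptions.

```python
import math

def prox_l21(W, tau):
    """Row-wise group soft-thresholding.

    Returns argmin_V 0.5 * ||V - W||_F^2 + tau * ||V||_{2,1}:
    each row is scaled by max(0, 1 - tau / ||row||_2).
    """
    out = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        factor = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out.append([factor * v for v in row])
    return out

W = [[3.0, 4.0],    # strong feature: row norm 5, shrunk by a factor of 0.8
     [0.3, 0.4]]    # weak feature: row norm 0.5 < tau, zeroed for all tasks
shrunk = prox_l21(W, tau=1.0)
```

Rows whose norm falls below `tau` are removed for every task at once, which is how the penalty selects features shared across tasks.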
We use the following result to show that under high noise levels, our algorithms degrade to STL methods and therefore do not underperform compared with STL methods.
Proposition 1.
Algorithm 2 degrades to an STL algorithm with no random perturbation if the smallest singular value of satisfies for sufficiently large .
Algorithm 3 degrades to an STL algorithm with no random perturbation if the smallest diagonal element of satisfies for sufficiently large .
We also consider other complex MTL frameworks for instantiation. For example, Gong et al. [26], Chen et al. [17], Jalali et al. [31] and Chen et al. [18] considered a decomposed parameter/model matrix for handling heterogeneities among tasks, e.g., detecting entry-wise outliers in the parameter matrix [31, 18] and detecting anomalous tasks [26, 17]. These detection procedures are claimed to be beneficial for the knowledge sharing process in the case of heterogeneous tasks. Our MPMTL framework can be naturally extended to such a model-decomposed setting because the additional procedures are still STL algorithms, and hence, the privacy loss will not increase; see the supplementary material for additional details.
Fig. 2: Running example for model leakage and model protection under different settings of Algorithm 2. The input model matrix is first generated from an i.i.d. uniform distribution, and its last column is then multiplied by a constant factor. We then run Algorithm 2. The noise matrix is not added in (b) but is added in (c). (a) shows the input model matrix; (b) and (c) show the output model matrix under their respective settings. The columns shown have been divided by their respective norms. In (b), the 10th task significantly influences the parameters of the other models, especially the first and the last features. In (c), the influences from the 10th task are not significant. Meanwhile, the second and the fifth features are shared by most tasks, as they should be.
3.3 Privacy Guarantees
Theorem 1.
Algorithm 1 is an iterative MPMTL algorithm.
3.4 Utility Analyses
Building upon the utility analysis of the Wishart mechanism presented by Jiang et al. [34], the convergence analysis of inexact proximal-gradient descent presented by Schmidt et al. [49] and the optimal solutions for proximal operators presented by Ji and Ye [32] and Liu et al. [39], we study the utility bounds for Algorithm 2 and Algorithm 3.
We define the following parameter space for a constant :
We consider . We assume that is convex and has an $L$-Lipschitz-continuous gradient (as defined in Schmidt et al. [49]). Let , where for Algorithm 2 and for Algorithm 3. Without loss of generality, we assume that and . We adopt the notation .
We have studied the utility bounds for three cases of in (7). Here, we report the results for the following case.
Such a case assumes and is suitable for small privacy budgets, such as . The results for the other two cases of can be found in the supplementary material.
The number of tasks is assumed to be sufficiently large, as follows.
Assumption 1.
For Algorithm 2, assume that for sufficiently large ,
For Algorithm 3, assume that for sufficiently large ,
We first present approximation error bounds for proximal operators with respect to an arbitrary noise matrix .
Lemma 2.
Consider Algorithm 2. For , in the th iteration, let . Let be the rank of . Suppose that there exists an index such that and . Assume that for . Then, for any random matrix , the following holds:
(10)
Lemma 3.
Consider Algorithm 3. For , in the th iteration, let . Let the indices of nonzero rows of be denoted by , and let . Let . Suppose that there exists an integer such that where is the indicator function. Then, for any random matrix , the following holds:
(11) 
We find that the approximation error bounds for both algorithms depend on (note that ).
Now, we present guarantees regarding both utility and run time. In the following, is assumed to be the Wishart random matrix defined in each algorithm. We consider heterogeneous privacy budgets and set for and for the convex case.
Theorem 2 (Low rank - Convexity).
Consider Algorithm 2. For an index that satisfies the conditions given in Lemma 2 for all , , and , assume that for .
No acceleration: If we set for , then if we also set for , we have, with high probability,
(12) 
where
(13) 
Use acceleration: If we set for , then if we also set for , we have, with high probability,