Multi-task learning (MTL) refers to the paradigm of learning multiple related tasks together. By contrast, single-task learning (STL) refers to the paradigm of learning each individual task independently. MTL often leads to better trained models because the commonalities among related tasks may assist in the learning process for each specific task. For example, the capability of an infant to recognize a cat might help in the development of his/her capability to recognize a dog. In recent years, MTL has received considerable interest in a broad range of application areas, including computer vision [38, 62] and health informatics [52, 55].
Because MTL approaches explore and leverage the commonalities among related tasks within the learning process, either explicitly or implicitly, they pose a potential security risk. Specifically, an adversary may participate in the MTL process through a participating task, thereby acquiring the model information for another task. Consider the example of adversarial attacks [53, 41, 36], which have recently received significant research attention. In such an attack, adversarial perturbations are added to data samples to fool the models into making incorrect predictions. Knowledge of the architecture and parameters of the attacked model makes the execution of a successful attack much easier; such a scenario is termed a white-box attack. A black-box attack, in which at least the parameter values are not known to the adversary, is much more complicated and requires more conditions to be satisfied [59, 46]. Therefore, the leakage of information on one model to adversaries can escalate a black-box attack into a white-box attack. Such attacks are effective not only against deep neural networks but also against linear models such as logistic regression models and support vector machines, and they do not even necessarily require access to the data samples; instead, it may be possible to add one adversarial perturbation to the data-recording device, which may universally flip the predictions of most data samples.
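To make the white-box case concrete for a linear model, the following minimal sketch assumes a logistic regression model whose weights are known to the attacker. The weights, the sample and the step size `eta` are hypothetical values chosen for illustration; the perturbation rule (a step along the sign of the weights) is one simple choice, not a method taken from the cited works.

```python
import math

def predict_proba(w, b, x):
    """Logistic-regression probability of the positive class."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def white_box_perturb(w, x, label, eta):
    """Shift every feature by eta in the direction that hurts the true label.

    For a linear model the gradient of the score with respect to x is w,
    so an attacker who knows w can read off the damaging direction directly.
    """
    direction = -1.0 if label == 1 else 1.0
    sign = lambda v: (v > 0) - (v < 0)
    return [xi + direction * eta * sign(wi) for wi, xi in zip(w, x)]

# Hypothetical leaked parameters and a correctly classified sample.
w, b = [2.0, -1.0, 0.5], 0.0
x, label = [0.4, 0.1, 0.2], 1

x_adv = white_box_perturb(w, x, label, eta=0.3)
clean_pred = predict_proba(w, b, x) > 0.5      # prediction on the clean sample
adv_pred = predict_proba(w, b, x_adv) > 0.5    # prediction after perturbation
```

The score drops by `eta` times the l1 norm of `w`, which is why knowing `w` lets the attacker pick the smallest step that flips the prediction.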
We consider a concrete example in which an adversary hacks the models of interest without requiring any individual’s data. This example involves fooling a face-validation system. An adversary Jack from company A hacks the face-validation model of company B through the joint learning of models related to both companies. Jack then creates a mask containing an adversarial perturbation and pastes it onto the face-validation camera of company B to cause company B’s model to generate incorrect predictions for most employees of B, thereby achieving a successful white-box attack. Furthermore, we assume that Jack may even have an opportunity to act as a user of company B’s model. Given the model of company B, Jack can easily disguise himself to cause the model to predict that he is a valid employee of B, thereby achieving another successful white-box attack. However, if Jack has no information on the model, such attacks will be much more difficult. Similar white-box attacks can be generalized to many other applications.
As another example, we consider personalized predictive modeling [43, 61], which has become a fundamental methodology in health informatics. In this type of modeling, a custom-made model is built for each person. In modern health informatics, such a model may contain sensitive information, such as how and why a specific patient did or did not develop a specific disease. Such information constitutes a different type of privacy from that of any single data record, and it also requires protection. However, the joint training of personalized models may allow such causes of disease to be leaked to an adversary whose model is also participating in the joint training.
Because of the concerns discussed above, it is necessary to develop a secure training strategy for MTL approaches to prevent information on each model from leaking to other models.
However, few privacy-preserving MTL approaches have been proposed to date [40, 9, 47, 50, 29]. Moreover, such approaches protect only the security of data instances rather than that of models/predictors. A typical focus of research is distributed learning [40, 9], in which the datasets for different tasks are distributively located. The local task models are trained independently using their own datasets before being aggregated and injected with useful knowledge shared across tasks. Such a procedure mitigates the privacy problem by updating each local model independently. However, these methods do not provide theoretical guarantees of privacy.
Pathak et al. proposed a privacy-preserving distributed learning scheme with privacy guarantees using tools offered by differential privacy, which provides a strong, cryptographically motivated definition of privacy based on rigorous mathematical theory and has recently received significant research attention due to its robustness to known attacks [15, 24]. This scheme is useful when one wishes to prevent potential attackers from acquiring information on any element of the input dataset based on a change in the output distribution. Pathak et al. first trained local models distributively and then averaged the decision vectors of the tasks before adding noise based on the output perturbation method. However, because only averaging is performed in this method, it has a limited ability to cover more complicated relations among tasks, such as low-rank, group-sparse, clustered or graph-based task structures. Gupta et al. proposed a scheme for transforming multi-task relationship learning into a differentially private version, in which the output perturbation method is also adopted. However, their method requires a closed-form solution (obtained by minimizing the least-square loss function) to achieve a theoretical privacy guarantee; thus, it cannot guarantee privacy for methods such as logistic regression, which require iterative optimization procedures. Moreover, on their synthetic datasets, their privacy-preserving MTL methods underperformed even compared with non-private STL methods (which can guarantee optimal privacy against the leakage of information across tasks), which suggests that there is no reason to use their proposed methods. In addition, they did not study the additional privacy leakage due to the iterative nature of their algorithm (see Kairouz et al.), and they did not study the utility bound for their method. The approaches of both Pathak et al. and Gupta et al. protect a single data instance instead of the predictor/model of each task, and they involve adding noise directly to the predictors, which is not necessary for avoiding the leakage of information across tasks and may induce excessive noise and thus jeopardize the utility of the algorithms.
To overcome these shortcomings, in this paper, we propose model-protected multi-task learning (MP-MTL), which enables the joint learning of multiple related tasks while simultaneously preventing leakage of the models for each task.
Without loss of generality, our MP-MTL method is designed based on linear multi-task models [32, 39], in which we assume that the model parameters are learned by minimizing an objective that combines an average empirical prediction loss and a regularization term. The regularization term captures the commonalities among the different tasks and couples their model parameters. The solution process for this type of MTL method can be viewed as a recursive two-step procedure. The first step is a decoupled learning procedure in which the model parameters of each task are estimated independently using some pre-computed shared information among tasks. The second step is a centralized transfer procedure in which some of the information shared among tasks is extracted for distribution to each task for the decoupled learning procedure in the next step. Our MP-MTL mechanism protects the models by adding perturbations during this second step. Note that we assume a curator that collects models for joint learning but never needs to collect task data. We develop a rigorous mathematical definition of the MP-MTL problem and propose an algorithmic framework to obtain the solution. We add perturbations to the covariance matrix of the parameter matrix because the tasks’ covariance matrix is widely used as a fundamental ingredient from which to extract useful knowledge to share among tasks [39, 63, 32, 5, 64, 52]. Consequently, our technique can cover a wide range of MTL algorithms. Fig. 1 illustrates the key ideas of the main framework.
Several methods have been proposed for the private release of the covariance matrix [34, 21, 10]. Considering an additive noise matrix, according to our utility analysis, the overall utility of the MTL algorithm depends on the spectral norm of the noise matrix. A list of the bounds on the spectral norms of additive noise matrices can be found in Jiang et al. We choose to add Wishart noise to the covariance matrix for four reasons: (1) For a given privacy budget, this type of noise matrix has a better spectral-norm bound than that of a Laplace noise matrix. (2) Unlike a Gaussian noise matrix, which enables an $(\epsilon, \delta)$-private method with a positive $\delta$, this approach enables an $(\epsilon, 0)$-private method and can be used to build an iterative method that is entirely $(\epsilon, 0)$-private, which provides a stronger privacy guarantee. (3) Unlike Gaussian and Laplace noise matrices, a Wishart noise matrix is positive definite. Thus, it can be guaranteed that our method will not underperform compared with STL methods under high noise levels, which means that participation in the joint learning process will have no negative effect on the training of any task model. (4) This approach allows arbitrary changes to any task, unlike the method of Blocki et al. The usage of the perturbed covariance matrix depends on the specific MTL method applied.
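The positive-definiteness argument in reason (3) can be checked numerically. The sketch below draws a Wishart-type matrix as a sum of outer products of Gaussian vectors and adds it to a covariance matrix; the degrees of freedom (`d + 1`) and the scale (`0.5`) are illustrative assumptions and are not calibrated to any privacy budget.

```python
import random

def wishart_noise(dim, dof, scale, rng):
    """Sample E = sum_k g_k g_k^T with g_k ~ N(0, scale * I), k = 1..dof."""
    std = scale ** 0.5
    E = [[0.0] * dim for _ in range(dim)]
    for _ in range(dof):
        g = [rng.gauss(0.0, std) for _ in range(dim)]
        for a in range(dim):
            for b in range(dim):
                E[a][b] += g[a] * g[b]
    return E

def quad_form(M, v):
    """Return v^T M v."""
    n = len(v)
    return sum(v[a] * M[a][b] * v[b] for a in range(n) for b in range(n))

rng = random.Random(0)
d, m = 3, 4                                   # features, tasks (toy sizes)
W = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(d)]
# Covariance of the models across features: Sigma = W W^T (d x d).
Sigma = [[sum(W[a][k] * W[b][k] for k in range(m)) for b in range(d)]
         for a in range(d)]
E = wishart_noise(d, dof=d + 1, scale=0.5, rng=rng)
Sigma_noisy = [[Sigma[a][b] + E[a][b] for b in range(d)] for a in range(d)]
```

Because every quadratic form of `E` is a sum of squares, the perturbed matrix dominates the original one in every direction, which additive Gaussian or Laplace noise cannot guarantee.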
We further develop two concrete approaches as instantiations of our framework, each of which transforms an existing MTL algorithm into a private version. Specifically, we consider two popular types of basic MTL models: 1) a model that learns a low-rank subspace by means of a trace norm penalty and 2) a model that performs shared feature selection by means of a group $\ell_1$ ($\ell_{2,1}$ norm) penalty. In both cases, the covariance matrix is used to build a linear transform matrix with which to project the models into new feature subspaces, among which the most useful subspaces are selected. Privacy guarantees are provided. Utility analyses are also presented for both convex and strongly convex prediction loss functions and for both the basic and accelerated proximal-gradient methods. To the best of our knowledge, we are the first to present a utility analysis for differentially private MTL algorithms that allow heterogeneous tasks. By contrast, the distributed tasks studied by Pathak et al. are homogeneous; i.e., the coding procedures for both features and targets are the same for different tasks. Furthermore, heterogeneous privacy budgets are considered for different iterations of our algorithms, and a utility analysis for budget allocation is presented. The utility results under the best budget allocation strategies are summarized in Table I (the notations used in the table are defined in the next section). In addition, we analyze the difference between our MP-MTL scheme and the existing privacy-preserving MTL strategies, which ensure the security of data instances. We also validate the effectiveness of our approach on benchmark and real-world datasets. Since existing privacy-preserving MTL methods only protect single data instances, they are transformed into methods that can protect task models using our theoretical results. The experimental results demonstrate that our algorithms outperform existing privacy-preserving MTL methods on the proposed model-protection problem.
[Table I: utility results under the best budget allocation strategies; columns: Low rank, Group sparse]
The contributions of this paper are highlighted as follows.
We are the first to propose and address the model-protection problem in the MTL setting.
We develop a general algorithmic framework to solve the MP-MTL problem to obtain secure estimates of the model parameters. We derive concrete instantiations of our algorithmic framework for two popular types of MTL models, namely, models that learn the low-rank and group-sparse patterns of the model matrix. It can be guaranteed that our algorithms will not underperform in comparison with STL methods under high noise levels.
Privacy guarantees are provided. To the best of our knowledge, we are the first to provide privacy guarantees for differentially private MTL algorithms that allow heterogeneous tasks and that minimize the logistic loss and other loss functions that do not have closed-form solutions.
Utility analyses are also presented for both convex and strongly convex prediction loss functions and for both the basic and accelerated proximal-gradient methods. To the best of our knowledge, we are the first to present utility bounds for differentially private MTL algorithms that allow heterogeneous tasks. Heterogeneous privacy budgets are considered, and a utility analysis for budget allocation is presented.
Existing privacy-preserving MTL methods that protect data instances are transformed into methods that can protect task models. Experiments demonstrate that our algorithms significantly outperform them on the proposed model-protection problem.
The remainder of this paper is organized as follows. Section 2 introduces the background on MTL problems and the definition of the proposed model-protection problem. The algorithmic framework and concrete instantiations of the proposed MP-MTL method are presented in Section 3, along with the analyses of utility and privacy-budget allocation. Section 4 presents an empirical evaluation of the proposed approaches, followed by the conclusions in Section 5.
2 Preliminaries and the Proposed Problem
In this section, we first introduce the background on MTL problems and then introduce the definition of model protection.
The notations and symbols that will be used throughout the paper are summarized in Table II.
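|Notation|Meaning|
|$[k]$|the index set $\{1, 2, \ldots, k\}$|
|$[-i]$|the index set with index $i$ removed|
|$\lVert \cdot \rVert_*$|the trace norm of a matrix (sum of the singular values of the matrix)|
|$\lVert \cdot \rVert_{2,1}$|the $\ell_{2,1}$ norm of a matrix (sum of the $\ell_2$ norms of the row vectors of the matrix)|
|$\mathrm{tr}(\cdot)$|the trace of a matrix (sum of the diagonal elements of the matrix)|
|$\sigma_j(\cdot)$|the $j$-th largest singular value of a matrix|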
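Extensive MTL studies have been conducted on linear models using regularized approaches. The basic MTL algorithm that we will consider is as follows:
$$\widehat{W} = \arg\min_{W} \sum_{i=1}^{m} \mathcal{L}_i(X^i w_i, y_i) + \lambda g(W), \tag{1}$$
where $m$ is the number of tasks. The datasets for the tasks are denoted by $\mathcal{D}^m = \{\mathcal{D}_1, \ldots, \mathcal{D}_m\}$, where for each $i \in [m]$, $\mathcal{D}_i = (X^i, y_i)$, where $X^i \in \mathbb{R}^{n_i \times d}$ and $y_i \in \mathbb{R}^{n_i}$ denote the data matrix and target vector of the $i$-th task with $n_i$ samples and dimensionality $d$, respectively. $\mathcal{L}_i$ is the prediction loss function for the $i$-th task. In this paper, we will focus on linear MTL models, where $w_i \in \mathbb{R}^{d}$ denotes the predictor/decision vector for task $i$ and $W = [w_1, w_2, \ldots, w_m] \in \mathbb{R}^{d \times m}$ is the model parameter matrix. $g(\cdot)$ is a regularization term that represents the structure of the information shared among the tasks, for which $\lambda$ is the pre-fixed hyper-parameter. As a special case, STL can be described by (1) with $\lambda = 0$.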
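As an illustration of how objective (1) couples the tasks only through the regularizer, the following sketch evaluates it for a toy problem. The averaged squared loss, the choice of the $\ell_{2,1}$ norm as the regularizer and all data values are assumptions made for the example.

```python
import math

def task_loss(X, w, y):
    """Average squared loss of a linear predictor w on one task's data."""
    preds = [sum(xj * wj for xj, wj in zip(row, w)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def l21_regularizer(models):
    """Sum over features of the l2 norm of that feature's weights across tasks."""
    d = len(models[0])
    return sum(math.sqrt(sum(w[j] ** 2 for w in models)) for j in range(d))

def mtl_objective(datasets, models, lam, regularizer):
    """Sum of per-task losses plus lam * g(W); models[i] is the vector w_i."""
    total = sum(task_loss(X, w, y) for (X, y), w in zip(datasets, models))
    return total + lam * regularizer(models)

datasets = [
    ([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0]),   # task 1: (X^1, y_1)
    ([[1.0, 1.0], [2.0, 0.0]], [3.0, 2.0]),   # task 2: (X^2, y_2)
]
models = [[1.0, 2.0], [1.0, 1.0]]             # w_1 and w_2

stl_value = mtl_objective(datasets, models, lam=0.0, regularizer=l21_regularizer)
mtl_value = mtl_objective(datasets, models, lam=0.1, regularizer=l21_regularizer)
```

With `lam=0.0` the objective is a plain sum of independent per-task losses, which is the STL special case; any positive `lam` adds a term that depends on all tasks jointly.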
The key to multi-task learning is to relate the tasks via a shared representation, which, in turn, benefits the tasks to be learned. Each possible shared representation encodes certain assumptions regarding the task relatedness.
A typical assumption is that the tasks share a latent low-dimensional subspace [4, 16, 58]. This formulation also leads to a low-rank structure of the model matrix. Because optimization problems involving rank functions are intractable, a trace norm regularization method is typically used [3, 32, 48]:
$$\min_{W} \sum_{i=1}^{m} \mathcal{L}_i(X^i w_i, y_i) + \lambda \lVert W \rVert_*. \tag{2}$$
Another typical assumption is that all tasks share a subset of important features. Such task relatedness can be captured by imposing a group-sparse penalty on the predictor matrix to select shared features across tasks [54, 57, 39]. One commonly used group-sparse penalty is the group $\ell_1$ penalty [39, 44]:
$$\min_{W} \sum_{i=1}^{m} \mathcal{L}_i(X^i w_i, y_i) + \lambda \lVert W \rVert_{2,1}. \tag{3}$$
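The group-sparse penalty selects features jointly because its proximal operator shrinks whole rows of the model matrix. A minimal sketch of the standard row-wise soft-thresholding operator is given below; the matrix values and the threshold `tau` are illustrative assumptions.

```python
import math

def l21_norm(W):
    """Sum of the l2 norms of the rows of W (rows = features, columns = tasks)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)

def prox_l21(W, tau):
    """Row-wise soft-thresholding: the proximal operator of tau * ||W||_{2,1}."""
    out = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out

W = [[3.0, 4.0],    # feature used strongly by both tasks (row norm 5)
     [0.3, 0.4]]    # weak feature (row norm 0.5)
W_shrunk = prox_l21(W, tau=1.0)
```

The strong row is scaled by 0.8 while the weak row is set to zero for every task at once, which is the shared feature selection described above.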
Now, we will present a compact definition of the model-protection problem in the context of MTL and discuss the general approach without differential privacy. As can be seen from (1), as a result of the joint learning process, $\hat{w}_j$ may contain some information on $\hat{w}_i$, for $i, j \in [m]$ and $i \neq j$. Then, it is possible for the owner of task $j$ to use such information to attack task $i$. Thus, we define the model-protection problem as follows.
Definition 1 (Model-Protection Problem for MTL).
The model-protection problem for MTL has three objectives:
1) minimizing the information on $\hat{w}_i$ that can be inferred from $\hat{w}_{[-i]}$, for all $i \in [m]$;
2) maximizing the prediction performance of $\hat{w}_i$, for all $i \in [m]$; and
3) sharing useful predictive information among tasks.
Now, we consider two settings:
1) A non-iterative setting, in which a trusted curator collects independently trained models, denoted by $w_1, \ldots, w_m$, for all tasks without their associated data to be used as input. After the joint learning procedure, the curator outputs the updated models, denoted by $\hat{w}_1, \ldots, \hat{w}_m$, and sends each updated model to its task privately. In this setting, the model collection process and the joint learning process are each performed only once.
2) An iterative setting, in which the model collection and joint learning processes are performed iteratively. Such a setting is more general.
One may observe that we assume the use of a trusted curator that collects the task models; this assumption will raise privacy concerns in settings in which the curator is untrusted. Such concerns are related to the demand for secure multi-party computation (SMC) [47, 25], the purpose of which is to avoid the leakage of data instances to the curator. Our extended framework considering SMC is presented in the supplementary material.
We note that the two types of problems represented by (2) and (3) are unified in the multi-task feature learning framework, which is based on the covariance matrix of the tasks’ predictors [6, 22, 7]. Many other MTL methods also fall under this framework, such as the learning of clustered structures among tasks [27, 64] and the inference of task relations [63, 23, 11]. As such, we note that the tasks’ covariance matrix constitutes a major source of shared knowledge in MTL methods, and hence, it is regarded as the primary target for model protection.
Therefore, we address the model-protection problem by first rephrasing the first objective in Definition 1 as follows: minimizing the changes in $\hat{w}_{[-i]}$ and the tasks’ covariance matrix ($\widehat{W}\widehat{W}^{\mathrm{T}}$ or $\widehat{W}^{\mathrm{T}}\widehat{W}$) when task $i$ participates in the joint learning process, for all $i \in [m]$. Thus, the model of this new task is protected.
Then, we find that the concept of differential privacy (minimizing the change in the output distribution) can be adopted to further rephrase this objective as follows: minimizing the changes in the distribution of $\hat{w}_{[-i]}$ and the tasks’ covariance matrix when task $i$ participates in the joint learning process, for all $i \in [m]$.
In differential privacy, algorithms are randomized by introducing some kind of perturbation.
Definition 2 (Randomized Algorithm).
A randomized algorithm $\mathcal{A}$ is associated with some mapping $\mathcal{D} \to \theta \in \mathcal{C}$. Algorithm $\mathcal{A}$ outputs $\mathcal{A}(\mathcal{D}) = \theta$ with a density $p(\mathcal{A}(\mathcal{D}) = \theta)$ for each $\theta \in \mathcal{C}$. The probability space is over some kind of perturbation introduced into algorithm $\mathcal{A}$.
In this paper, $\mathcal{A}$ denotes some randomized machine learning estimator, and $\theta$ denotes the model parameters that we wish to estimate. Perturbations can be introduced into the original learning system via the (1) input data [37, 8], (2) model parameters [13, 33], (3) objective function [14, 60], or (4) optimization process [51, 56].
The formal definition of differential privacy is as follows.
Definition 3 (Dwork et al. ).
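A randomized algorithm $\mathcal{A}$ provides $(\epsilon, \delta)$-differential privacy if, for any two adjacent datasets $\mathcal{D}$ and $\mathcal{D}'$ that differ by a single entry and for any set $\mathcal{S}$,
$$\Pr(\mathcal{A}(\mathcal{D}) \in \mathcal{S}) \leq \exp(\epsilon) \Pr(\mathcal{A}(\mathcal{D}') \in \mathcal{S}) + \delta,$$
where $\mathcal{A}(\mathcal{D})$ and $\mathcal{A}(\mathcal{D}')$ are the outputs of $\mathcal{A}$ on the inputs $\mathcal{D}$ and $\mathcal{D}'$, respectively.
The privacy loss pair $(\epsilon, \delta)$ is referred to as the privacy budget/loss and quantifies the privacy risk of algorithm $\mathcal{A}$. The intuition is that it is difficult for a potential attacker to infer whether a certain data point has been changed (or added to the dataset $\mathcal{D}$) based on a change in the output distribution. Consequently, the information of any single data point is protected.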
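The inequality in Definition 3 can be verified by hand for a very small mechanism. The sketch below uses randomized response on a single bit, a textbook mechanism that is not part of the proposed method, and checks that the worst-case ratio of output probabilities under the two adjacent inputs equals $e^{\epsilon}$ (so the bound holds with $\delta = 0$). The value of `eps` is an arbitrary illustration.

```python
import math

def randomized_response_probs(bit, eps):
    """Return (P[output=0], P[output=1]) when the true bit is reported
    truthfully with probability e^eps / (1 + e^eps)."""
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    if bit == 0:
        return (p_truth, 1.0 - p_truth)
    return (1.0 - p_truth, p_truth)

eps = 0.5
p_given_0 = randomized_response_probs(0, eps)   # output distribution, input 0
p_given_1 = randomized_response_probs(1, eps)   # output distribution, input 1

# Worst-case probability ratio over both outputs and both input orderings.
worst_ratio = max(max(a / b, b / a) for a, b in zip(p_given_0, p_given_1))
```

An observer of the output therefore gains at most a factor of $e^{\epsilon}$ in the odds of one input over the other, which is the intuition stated above.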
Furthermore, note that differential privacy is defined in terms of application-specific adjacent input databases. In our setting, the pair consisting of each task’s model and dataset is treated as the “single entry” in Definition 3.
Several mechanisms exist for introducing a specific type of perturbation. A typical type is calibrated to the sensitivity of the original “unrandomized” machine learning estimator $f$. The sensitivity of an estimator is defined as the maximum change in its output due to a replacement of any single data instance.
Definition 4 (Dwork et al. ).
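The sensitivity $S(f)$ of a function $f$ is defined as
$$S(f) = \max_{\mathcal{D}, \mathcal{D}'} \lVert f(\mathcal{D}) - f(\mathcal{D}') \rVert$$
for all datasets $\mathcal{D}$ and $\mathcal{D}'$ that differ by at most one instance, where $\lVert \cdot \rVert$ can be any norm, e.g., the $\ell_1$ or $\ell_2$ norm.
Adding noise with a standard deviation proportional to $S(f)$ is a common practice for guaranteeing private learning.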
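As a concrete instance of sensitivity-calibrated noise, the sketch below releases a clipped mean with Laplace noise. The clipping bounds, the data values and the helper names are hypothetical; the Laplace mechanism is a standard construction and differs from the Wishart mechanism adopted in this paper.

```python
import random

def laplace_sample(scale, rng):
    """Laplace(0, scale), drawn as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def mean_sensitivity(n, lower, upper):
    """Replacing one record moves a mean over [lower, upper] by at most this."""
    return (upper - lower) / n

def private_mean(values, lower, upper, eps, rng):
    """Release the clipped mean plus Laplace noise of scale sensitivity / eps."""
    clipped = [min(max(v, lower), upper) for v in values]
    scale = mean_sensitivity(len(clipped), lower, upper) / eps
    return sum(clipped) / len(clipped) + laplace_sample(scale, rng)

rng = random.Random(0)
values = [0.2, 0.9, 0.4, 0.7]
release = private_mean(values, lower=0.0, upper=1.0, eps=1.0, rng=rng)
```

A smaller `eps` or a larger sensitivity both increase the noise scale, which is the trade-off between privacy and utility that the later analysis quantifies for the Wishart case.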
Because machine learning schemes are usually presented as sequential paradigms with multiple iterations and usually output multiple variables simultaneously, several properties of differential privacy are particularly useful for ensuring privacy in machine learning, such as its post-processing immunity, group privacy, combination properties and adaptive composition. The details of these properties are introduced in the supplementary material.
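For iterative algorithms, the simplest of these properties is basic sequential composition, under which the per-iteration budgets add up. The sketch below illustrates only this basic rule with hypothetical heterogeneous budgets; tighter adaptive composition results exist and are not reproduced here.

```python
def compose_basic(budgets):
    """Basic sequential composition: (eps, delta) pairs add across iterations."""
    total_eps = sum(eps for eps, _ in budgets)
    total_delta = sum(delta for _, delta in budgets)
    return total_eps, total_delta

# Hypothetical per-iteration budgets that grow over the iterations.
per_iteration = [(0.1, 0.0), (0.2, 0.0), (0.4, 0.0)]
total_eps, total_delta = compose_basic(per_iteration)
```

Post-processing immunity complements this rule: any computation applied to an already released private output consumes no additional budget.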
We present our methodology in this section: the modeling of and rationale for our MP-MTL framework, two instantiations and utility analyses. Regarding the theoretical results, we present only the main results; the detailed derivations are deferred to the provided supplementary material.
3.1 The General Framework for MP-MTL
We first present our main ideas for modeling our MP-MTL framework in the non-iterative setting before extending it to the general iterative setting.
3.1.1 Non-iterative Setting
Formally, we define a non-iterative MP-MTL algorithm as follows.
Definition 5 (Non-iterative MP-MTL).
For a randomized MTL algorithm that uses and the datasets as input and outputs , is an non-iterative MP-MTL algorithm if for all , for neighboring input pairs and that differ only by the -th task, such that or , the following holds for some constants and for any set :
Note that the curator requires only the models as input, not the datasets. The datasets are held privately by the owners of their respective tasks and are only theoretically considered as inputs for the entire MTL algorithm.
As such, for the -th task (), we assume that potential adversaries may acquire , and we wish to protect both the input model and the dataset . Note that although we ultimately wish to protect , i.e., the output model for the -th task, all information related to the -th task in comes from the input model and the dataset , and the latter are the only objects that any algorithm can directly control because they are the inputs.
For the sake of intuition, we set and in Definition 5 to and the empty set, respectively; then, the goal of MP-MTL is to ensure that when the model for a new task is provided to the curator, an adversary, including another existing task, cannot discover significant anomalous information from the distribution of its view: . Thus, the information on the new task’s model is protected. Note that for different tasks, the views of an adversary are different.
We can redefine the non-iterative MP-MTL algorithm in the form of differential privacy.
Let be an augmented dataset; i.e., let be treated as the -th “data instance” of the augmented dataset , for all . Thus, the datasets and models associated with the tasks are transformed into a single dataset with “data instances”. Then, we define outputs such that for all , denotes the view of an adversary for the -th task, which includes . Then, an non-iterative MP-MTL algorithm should satisfy inequalities: for all , for all neighboring datasets and that differ by the -th “data instance”, and for any set , we have
STL can be easily shown to be optimal for avoiding information leakage across tasks because the individual task models are learned independently.
For any STL algorithm that uses and datasets as input, outputs and learns each task independently, is a non-iterative MP-MTL algorithm.
We learn from this lemma that if there is no information sharing across tasks, then no leakage across tasks occurs.
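This observation can be checked on a toy example. In the sketch below, each task is fitted by a hypothetical one-dimensional ridge regression on its own data only; replacing the data of the first task leaves every other task's output exactly unchanged, so the view of any other task carries no information about the change.

```python
def fit_ridge_1d(xs, ys, reg=1e-3):
    """Closed-form 1-D ridge regression fitted on one task's data only."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg)

def stl(datasets):
    """Single-task learning: every task is fitted independently."""
    return [fit_ridge_1d(xs, ys) for xs, ys in datasets]

datasets = [
    ([1.0, 2.0], [2.0, 4.1]),
    ([1.0, 3.0], [0.9, 3.2]),
    ([2.0, 2.5], [1.0, 1.2]),
]
neighbor = list(datasets)
neighbor[0] = ([5.0, -1.0], [0.0, 7.0])   # arbitrary change to the first task

before, after = stl(datasets), stl(neighbor)
```

Here `before[1:]` and `after[1:]` coincide exactly, whereas the first entry changes; a joint learning step would generally break this equality unless it is perturbed.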
3.1.2 Iterative Setting
Consider an iterative MP-MTL algorithm with a number of iterations . For , a trusted curator collects the models, denoted by , for all tasks. Then, model-protected MTL is performed, and the updated models are output and sent back to their respective tasks.
In such a setting, for all , for the -th task, we wish to protect dataset and the entire input sequence of its model (denoted by for short). For the -th task, the output sequence is the view of a potential adversary.
We formally define an iterative MP-MTL algorithm as follows.
Definition 6 (Iterative MP-MTL).
Let be a randomized iterative MTL algorithm with a number of iterations . In the first iteration, performs the mapping , where includes . For , in the -th iteration, performs the mapping , where includes . is an iterative MP-MTL algorithm if for all , for all , and for neighboring input pairs and that differ only by the -th task, such that or , the following holds for some constants and for any set :
where for all , denotes the input for the -th iteration and
We can also redefine the iterative MP-MTL algorithm in the form of differential privacy by considering an augmented dataset , i.e., by treating as the -th “data instance” of the data set , for all , in the -th iteration, for all . The details are similar to those in the non-iterative setting and are omitted here.
Obviously, any non-iterative MP-MTL algorithm is an iterative MP-MTL algorithm with $T = 1$.
Our MP-MTL framework is elaborated in Algorithm 1, which is iterative in nature. As mentioned in Section 2, we choose to protect the tasks’ covariance matrix, which is denoted by $WW^{\mathrm{T}}$ or $W^{\mathrm{T}}W$, depending on the chosen MTL method. As previously stated, Wishart noise is added. Fig. 1 illustrates the key concepts of the framework. This framework is generally applicable for many optimization schemes, such as proximal gradient methods [32, 39], alternating methods and Frank-Wolfe methods.
In Algorithm 1, the datasets are only used in STL algorithms that can be performed locally.
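The division of labor between local STL steps and the curator can be sketched as follows. This is a simplified illustration and not Algorithm 1 itself: it assumes a squared loss, a group-sparse style shrinkage built from the noisy diagonal of the covariance matrix, and arbitrary values for `tau` and `noise_scale`, and it makes no claim about the resulting privacy level.

```python
import math
import random

def local_gradient_step(w, X, y, lr):
    """Decoupled step: one squared-loss gradient step on the task's own data."""
    n = len(y)
    resid = [sum(a * b for a, b in zip(row, w)) - t for row, t in zip(X, y)]
    grad = [2.0 / n * sum(r * row[j] for r, row in zip(resid, X))
            for j in range(len(w))]
    return [wj - lr * gj for wj, gj in zip(w, grad)]

def curator_step(models, tau, noise_scale, rng):
    """Centralized step: perturb the diagonal of the models' covariance and
    turn it into one shrinkage factor per feature. Only models are used."""
    d = len(models[0])
    factors = []
    for j in range(d):
        sigma_jj = sum(w[j] ** 2 for w in models)
        # Diagonal entry of a Wishart-type matrix: a sum of squared Gaussians.
        noise = sum(rng.gauss(0.0, math.sqrt(noise_scale)) ** 2
                    for _ in range(d + 1))
        noisy = sigma_jj + noise
        factors.append(max(0.0, 1.0 - tau / math.sqrt(noisy)) if noisy > 0 else 0.0)
    return factors

def mp_mtl_sketch(datasets, d, iters, lr, tau, noise_scale, seed=0):
    rng = random.Random(seed)
    models = [[0.0] * d for _ in datasets]
    for _ in range(iters):
        # Each task updates its model locally; data never leave the task.
        models = [local_gradient_step(w, X, y, lr)
                  for w, (X, y) in zip(models, datasets)]
        factors = curator_step(models, tau, noise_scale, rng)
        models = [[f * wj for f, wj in zip(factors, w)] for w in models]
    return models
```

With `tau=0` every factor equals one and the loop reduces to independent gradient descent per task, which mirrors the remark that the datasets are touched only by STL steps.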
or the singular value decomposition of.
3.2 Instantiations of the MP-MTL Framework
In this section, we introduce two instantiations of our MP-MTL framework described in Algorithm 1. These two instantiations are related to the MTL problems represented by (2) and (3). We focus on the proximal gradient descent methods presented by Ji and Ye  and Liu et al.  for Problems (2) and (3), respectively.
First, we instantiate the MP-MTL framework for Problem (2), the low-rank case, as shown in Algorithm 2. Generally speaking, the algorithm uses an accelerated proximal gradient method. Steps 5 to 9 approximate the following proximal operator:
The approximation error bounds for both the proximal operators are provided in Section 3.4.
We use the following result to show that under a high noise level, our algorithms degrade to STL methods and therefore do not underperform compared with STL methods.
Algorithm 2 degrades to an STL algorithm with no random perturbation if the smallest singular value of satisfies for sufficiently large .
Algorithm 3 degrades to an STL algorithm with no random perturbation if the smallest diagonal element of satisfies for sufficiently large .
Several existing methods considered a decomposed parameter/model matrix for handling heterogeneities among tasks, e.g., detecting entry-wise outliers in the parameter matrix [31, 18] and detecting anomalous tasks [26, 17]. These detection procedures are claimed to be beneficial for the knowledge sharing process in the case of heterogeneous tasks. Our MP-MTL framework can be naturally extended to such a model-decomposed setting because the additional procedures are still STL algorithms, and hence, the privacy loss will not increase; see the supplementary material for additional details.
is first generated from an i.i.d. uniform distribution. Then the last column is multiplied by . We then run Algorithm 2, taking and . The noise matrix is not added in (b) but is added in (c). (a) shows ; (b) and (c) show under their respective settings. The columns shown have been divided by their respective norms. In (b), the 10-th task significantly influences the parameters of the other models, especially the first and the last features. In (c), the influence of the 10-th task is not significant. Meanwhile, the second and the fifth features are shared by most tasks, as they should be.
3.3 Privacy Guarantees
Algorithm 1 is an iterative MP-MTL algorithm.
3.4 Utility Analyses
Building upon the utility analysis of the Wishart mechanism presented by Jiang et al. , the convergence analysis of inexact proximal-gradient descent presented by Schmidt et al.  and the optimal solutions for proximal operators presented by Ji and Ye  and Liu et al. , we study the utility bounds for Algorithm 2 and Algorithm 3.
We define the following parameter space for a constant :
We consider . We assume that is convex and has an -Lipschitz-continuous gradient (as defined in Schmidt et al. ). Let , where for Algorithm 2 and for Algorithm 3. Without loss of generality, we assume that and . We adopt the notation .
We have studied the utility bounds for three cases of in (7). Here, we report the results for the following case.
Such a case assumes and is suitable for small privacy budgets, such as . The results for the other two cases of can be found in the supplementary material.
The number of tasks is assumed to be sufficiently large, as follows.
For Algorithm 2, assume that for sufficiently large ,
For Algorithm 3, assume that for sufficiently large ,
We first present approximation error bounds for proximal operators with respect to an arbitrary noise matrix .
Consider Algorithm 3. For , in the -th iteration, let . Let the indices of non-zero rows of be denoted by , and let . Let . Suppose that there exists an integer such that where is the indicator function. Then, for any random matrix , the following holds:
We find that the approximation error bounds for both algorithms depend on (note that ).
Now, we present guarantees regarding both utility and run time. In the following,
is assumed to be the Wishart random matrix defined in each algorithm. We consider heterogeneous privacy budgets and setfor and for the convex case.
Theorem 2 (Low rank - Convexity).
No acceleration: If we set for , then if we also set
for , we have, with high probability,
Use acceleration: If we set for , then if we also set
for , we have, with high probability,