## 1 Introduction

Machine learning techniques usually require a large number of training samples to learn an accurate model. For example, deep learning models, which build on neural networks, usually need millions of labeled samples to train networks with tens or even hundreds of layers that contain a huge number of model parameters. However, in some applications such as medical image analysis, this requirement cannot be fulfilled since (labeled) samples are hard to collect. In this case, limited training samples are not enough to learn shallow models, let alone deep models. For this data insufficiency problem, Multi-Task Learning (MTL) [1] is a good solution when there are multiple related tasks, each of which has limited training samples.

In MTL, there are multiple learning tasks, each of which can be a general learning task such as a supervised task (e.g., a classification or regression problem), an unsupervised task (e.g., a clustering problem), a semi-supervised task, a reinforcement learning task, a multi-view learning task or a graphical model. All of these learning tasks, or at least a subset of them, are assumed to be related to each other. In this case, it is found that learning these tasks jointly can lead to much better performance than learning them individually. This observation led to the birth of MTL. Hence MTL aims to improve the generalization performance of multiple tasks when they are related.

MTL is inspired by human learning activities where people often apply the knowledge learned from previous tasks to help learn a new task. For example, for a person who learns to ride the bicycle and tricycle together, the experience in learning to ride a bicycle can be utilized in riding a tricycle and vice versa. Similar to human learning, it is useful for multiple learning tasks to be learned jointly since the knowledge contained in a task can be leveraged by other tasks.

The setting of MTL is similar to that of transfer learning [2], but the two also differ significantly. In MTL, there is no distinction among different tasks and the objective is to improve the performance of all the tasks. In transfer learning, however, the goal is to improve the performance of a target task with the help of source tasks, so the target task plays a more important role than the source tasks. Hence, MTL treats all the tasks equally, but in transfer learning the target task attracts the most attention among all the tasks. In [3, 4, 5], a new MTL setting called asymmetric multi-task learning is investigated. This setting considers a different scenario where a new task arrives after multiple tasks have been learned jointly via some MTL method. A simple solution is to learn the old and new tasks together from scratch, but this is computationally demanding. Instead, asymmetric multi-task learning only learns the new task with the help of the old tasks, and hence the core problem is how to transfer the knowledge contained in the old tasks to the new task. In this sense, this setting is more similar to transfer learning than to MTL.

In this paper, we give a survey on MTL. After giving a definition for MTL, we classify different MTL algorithms into several categories: the feature learning approach, which can be further categorized into feature transformation and feature selection approaches, the low-rank approach, the task clustering approach, the task relation learning approach, and the decomposition approach. We discuss the characteristics of each approach. MTL can be combined with other learning paradigms to further improve the performance of learning tasks, and hence we discuss the combinations of MTL with other learning paradigms including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning and graphical models. When the number of tasks is large, the amount of training data in all the tasks can be very large, which makes the online and parallel computation of MTL models necessary. In this case, the training data of different tasks could be located on different machines and hence distributed MTL models are a good solution. Moreover, dimensionality reduction and feature hashing are vital tools to reduce the data dimension when facing high-dimensional data in MTL. Hence, we review those techniques that are helpful when handling big data in multiple tasks. As a general learning paradigm, MTL has many applications in various areas and here we briefly review its applications in computer vision, bioinformatics, health informatics, speech, natural language processing, web applications and ubiquitous computing. Besides algorithmic development and real-world applications of MTL, we review theoretical analyses and discuss several future directions for MTL.

The remainder of this paper is organized as follows. Section 2 introduces several categories of MTL models. In Section 3, the combinations of MTL with other learning paradigms are reviewed. Section 4 overviews online, parallel, and distributed MTL models as well as dimensionality reduction and feature hashing. Section 5 presents the applications of MTL in various areas. Section 6 gives an overview of theoretical analyses, and finally we conclude in Section 7 with some discussion of future directions in MTL. (For an introduction to MTL without technical details, please refer to [6].)

## 2 MTL Models

In order to fully characterize MTL, we first give the definition of MTL.

###### Definition 1

(Multi-Task Learning) Given $m$ learning tasks $\{\mathcal{T}_i\}_{i=1}^{m}$ where all the tasks or a subset of them are related, *multi-task learning* aims to help improve the learning of a model for $\mathcal{T}_i$ by using the knowledge contained in all or some of the $m$ tasks.

Based on the definition of MTL, we focus on supervised learning tasks in this section since most MTL studies fall in this setting. Other types of tasks are reviewed in the next section. In the setting of supervised learning tasks, a task $\mathcal{T}_i$ is usually accompanied by a training dataset $\mathcal{D}_i$ consisting of $n_i$ training samples, i.e., $\mathcal{D}_i=\{\mathbf{x}^i_j,y^i_j\}_{j=1}^{n_i}$, where $\mathbf{x}^i_j\in\mathbb{R}^{d_i}$ is the $j$th training instance in $\mathcal{T}_i$ and $y^i_j$ is its label. We denote by $\mathbf{X}^i$ the training data matrix for $\mathcal{T}_i$, i.e., $\mathbf{X}^i=(\mathbf{x}^i_1,\ldots,\mathbf{x}^i_{n_i})$. When different tasks share the same training data samples, i.e., $\mathbf{X}^i=\mathbf{X}^j$ for $i\neq j$, MTL reduces to multi-label learning or multi-output regression. Here we consider a general setting for MTL in which at least two of the $\mathbf{X}^i$'s are different, or an even more general setting in which all the $\mathbf{X}^i$'s are different from each other.

When different tasks lie in the same feature space, implying that $d_i$ equals $d_j$ for any $i\neq j$, the setting is homogeneous-feature MTL; otherwise it is heterogeneous-feature MTL. Unless stated otherwise, the default MTL setting is homogeneous-feature MTL. Here we need to distinguish heterogeneous-feature MTL from heterogeneous MTL. In [7], heterogeneous MTL is considered to consist of different types of supervised tasks including classification and regression problems, and here we generalize it to a setting in which heterogeneous MTL consists of tasks of different types including supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, multi-view learning and graphical models. The opposite of heterogeneous MTL is homogeneous MTL, which consists of tasks of only one type. In short, homogeneous and heterogeneous MTL differ in the type of learning tasks, while homogeneous-feature MTL differs from heterogeneous-feature MTL in terms of the original feature representations. Similarly, unless stated otherwise, the default MTL setting is homogeneous MTL.
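To make the setting concrete, the following is a minimal Python sketch of how a collection of supervised tasks could be stored and checked. The names `Task`, `is_homogeneous_feature` and `shares_training_inputs` are hypothetical helpers introduced here for illustration, and equality of feature dimensions is used only as a simple proxy for "lying in the same feature space".

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Task:
    """One supervised task: instances stored row-wise, one label per instance."""
    X: List[Sequence[float]]  # n_i instances, each of dimension d_i
    y: List[float]            # n_i labels

    @property
    def dim(self) -> int:
        return len(self.X[0])


def is_homogeneous_feature(tasks: List[Task]) -> bool:
    """True when all tasks have the same feature dimension (proxy for a shared feature space)."""
    return len({t.dim for t in tasks}) == 1


def shares_training_inputs(tasks: List[Task]) -> bool:
    """True when all tasks have identical instances (multi-label / multi-output case)."""
    first = [tuple(x) for x in tasks[0].X]
    return all([tuple(x) for x in t.X] == first for t in tasks[1:])


# Two tasks with the same dimension but different training instances.
tasks = [
    Task(X=[[0.0, 1.0], [1.0, 1.0]], y=[0.0, 1.0]),
    Task(X=[[2.0, 0.5], [0.3, 0.7], [1.5, 1.5]], y=[1.2, 0.4, 0.9]),
]
```

In this toy example the tasks are homogeneous-feature but do not share training inputs, which corresponds to the general setting considered in this section.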

In order to characterize the relatedness in the definition of MTL, there are three issues to be addressed: when to share, what to share, and how to share.

The ‘when to share’ issue is to choose between single-task and multi-task models for a multi-task problem. Currently such a decision is made by human experts and few learning approaches have been proposed to study it. A simple computational solution is to formulate the decision as a model selection problem and then use model selection techniques, e.g., cross validation, to decide, but this solution is usually computationally heavy and may require much more training data. Another solution is to use multi-task models which can degenerate to their single-task counterparts, for example, problem (34), where the learning of different tasks can be decoupled when $\mathbf{\Sigma}$ becomes diagonal. In this case, we can let the training data determine the form of $\mathbf{\Sigma}$ to make an implicit choice.

‘What to share’ needs to determine the form through which knowledge sharing among all the tasks could occur. Usually, there are three forms for ‘what to share’: feature, instance and parameter. Feature-based MTL aims to learn common features among different tasks as a way to share knowledge. Instance-based MTL aims to identify data instances in a task that are useful for other tasks and then shares knowledge via the identified instances. Parameter-based MTL uses model parameters (e.g., coefficients in linear models) in a task to help learn model parameters in other tasks in some way, for example, via regularization. Existing MTL studies mainly focus on feature-based and parameter-based methods, and few works belong to the instance-based category. A representative instance-based method is the multi-task distribution matching method proposed in [8], which first estimates density ratios between the probabilities that each instance, together with its label, belongs to its own task and to a mixture of all the tasks, and then uses the training data from all the tasks, weighted by the estimated density ratios, to learn model parameters for each task. Since the studies on instance-based MTL are few, we mainly review feature-based and parameter-based MTL models.

After determining ‘what to share’, ‘how to share’ specifies concrete ways to share knowledge among tasks. In feature-based MTL, there is a primary approach: feature learning approach. The feature learning approach focuses on learning common feature representations for multiple tasks based on shallow or deep models, where the learned common feature representation can be a subset or a transformation of the original feature representation.

In parameter-based MTL, there are four main approaches: low-rank approach, task clustering approach, task relation learning approach, and decomposition approach. The low-rank approach interprets the relatedness of multiple tasks as the low rankness of the parameter matrix of these tasks. The task clustering approach assumes that all the tasks form a few clusters where tasks in a cluster are related to each other. The task relation learning approach aims to learn quantitative relations between tasks from data automatically. The decomposition approach decomposes the model parameters of all the tasks into two or more components, which are penalized by different regularizers.

In summary, there are mainly five approaches in the feature-based and parameter-based MTL. In the following sections, we review these approaches in a chronological order to reveal the relations and evolutions among different models in them.

### 2.1 Feature Learning Approach

Since tasks are related, it is intuitive to assume that different tasks share a common feature representation based on the original features. One reason to learn common feature representations instead of directly using the original ones is that the original representation may not have enough expressive power for multiple tasks. With the training data in all the tasks, a more powerful representation can be learned for all the tasks, and this representation can improve the performance.

Based on the relationship between the original feature representation and the learned one, we can further classify this category into two sub-categories. The first sub-category is the feature transformation approach where the learned representation is a linear or nonlinear transformation of the original representation and in this approach, each feature in the learned representation is different from original features. Different from this approach, the feature selection approach, the second sub-category, selects a subset of the original features as the learned representation and hence the learned representation is similar to the original one by eliminating useless features based on different criteria. In the following, we introduce these two approaches.

#### 2.1.1 Feature Transformation Approach

The multi-layer feedforward neural network [1], which belongs to the feature transformation approach, is one of the earliest models for multi-task learning. To see how the multi-layer feedforward neural network is constructed for MTL, in Figure 1 we show an example with an input layer, a hidden layer, and an output layer. The input layer receives training instances from all the tasks and the output layer has $m$ output units, one for each of the $m$ tasks. Here the outputs of the hidden layer can be viewed as the common feature representation learned for the $m$ tasks, and the transformation from the original representation to the learned one depends on the weights connecting the input and hidden layers as well as the activation function adopted in the hidden units. Hence, if the activation function in the hidden layer is linear, then the transformation is a linear function; otherwise it is nonlinear. Compared with multi-layer feedforward neural networks used for single-task learning, the difference in the network architecture lies in the output layer: in single-task learning there is only one output unit, while in MTL there are $m$ of them. In [9], the radial basis function network, which has only one hidden layer, is extended to MTL by greedily determining the structure of the hidden layer. Different from these neural network models, Silver et al. [10] propose a context-sensitive multi-task neural network which has only one output unit shared by different tasks but has a task-specific context as an additional input.

Different from multi-layer feedforward neural networks, which are connectionist models, the multi-task feature learning (MTFL) method [11, 12] is formulated under the regularization framework with the objective function as

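$$\min_{\mathbf{A},\mathbf{U},\mathbf{b}}\ \sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i} l\big(y^i_j,(\mathbf{a}^i)^T\mathbf{U}^T\mathbf{x}^i_j+b_i\big)+\lambda\|\mathbf{A}\|_{2,1}^2\quad \mathrm{s.t.}\ \mathbf{U}\mathbf{U}^T=\mathbf{I}, \tag{1}$$

where $l(\cdot,\cdot)$ denotes a loss function such as the hinge loss or square loss, $\mathbf{b}=(b_1,\ldots,b_m)^T$ is a vector of offsets in all the tasks, $\mathbf{U}\in\mathbb{R}^{d\times d}$ is a square transformation matrix, $\mathbf{a}^i$, the $i$th column in $\mathbf{A}$, contains model parameters for the $i$th task after the transformation, the $\ell_{2,1}$ norm of a matrix $\mathbf{A}$ denoted by $\|\mathbf{A}\|_{2,1}$ equals the sum of the $\ell_2$ norms of the rows in $\mathbf{A}$, $\mathbf{I}$ denotes an identity matrix with an appropriate size, and $\lambda$ is a positive regularization parameter. The first term in the objective function of problem (1) measures the empirical loss on the training sets of all the tasks, the second one enforces $\mathbf{A}$ to be row-sparse via the $\ell_{2,1}$ norm, which is equivalent to selecting features after the transformation, and the constraint enforces $\mathbf{U}$ to be orthogonal. Different from the multi-layer feedforward neural network, whose hidden representations may be redundant, the orthogonality of $\mathbf{U}$ prevents the MTFL method from learning redundant representations. It is interesting to find out that problem (1) is equivalent to

$$\min_{\mathbf{W},\mathbf{D},\mathbf{b}}\ L(\mathbf{W},\mathbf{b})+\lambda\,\mathrm{tr}(\mathbf{W}^T\mathbf{D}^{-1}\mathbf{W})\quad \mathrm{s.t.}\ \mathbf{D}\succeq\mathbf{0},\ \mathrm{tr}(\mathbf{D})\le 1, \tag{2}$$

where $L(\mathbf{W},\mathbf{b})=\sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i} l\big(y^i_j,(\mathbf{w}^i)^T\mathbf{x}^i_j+b_i\big)$ denotes the total training loss, $\mathrm{tr}(\cdot)$ denotes the trace of a square matrix, $\mathbf{w}^i=\mathbf{U}\mathbf{a}^i$ is the model parameter for $\mathcal{T}_i$, $\mathbf{W}=(\mathbf{w}^1,\ldots,\mathbf{w}^m)$, $\mathbf{0}$ denotes a zero vector or matrix with an appropriate size, $\mathbf{M}^{-1}$ for any square matrix $\mathbf{M}$ denotes its inverse when it is nonsingular or otherwise its pseudo inverse, and $\mathbf{B}\succeq\mathbf{C}$ means that $\mathbf{B}-\mathbf{C}$ is positive semidefinite. Based on this formulation, we can see that the MTFL method learns the feature covariance $\mathbf{D}$ for all the tasks, which will be interpreted in Section 2.8 from a probabilistic perspective. Given $\mathbf{D}$, the learning of different tasks can be decoupled, which facilitates parallel computing. Given $\mathbf{W}$, $\mathbf{D}$ has an analytical solution as $\mathbf{D}=(\mathbf{W}\mathbf{W}^T)^{\frac{1}{2}}/\mathrm{tr}\big((\mathbf{W}\mathbf{W}^T)^{\frac{1}{2}}\big)$, and by plugging this solution into problem (2), we can see that the regularizer on $\mathbf{W}$ is the squared trace norm. Then Argyriou et al. [13] extend problem (2) to a general formulation where the second term in the objective function becomes $\lambda\,\mathrm{tr}(\mathbf{W}^T f(\mathbf{D})\mathbf{W})$ with $f(\mathbf{D})$ operating on the spectrum of $\mathbf{D}$, and discuss the condition on $f(\cdot)$ that makes the whole problem convex.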
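The step from the analytical solution to the squared trace norm can be checked with a short calculation. The following is a sketch under stated assumptions: the regularizer in problem (2) is taken to be $\lambda\,\mathrm{tr}(\mathbf{W}^T\mathbf{D}^{-1}\mathbf{W})$, $\mathbf{W}\mathbf{W}^T$ is assumed nonsingular (otherwise the inverse is read as a pseudo inverse), and $\mathbf{C}$ and $\mathbf{D}^\star$ are shorthand introduced here.

```latex
\begin{align*}
\mathbf{C} &= \mathbf{W}\mathbf{W}^T, \qquad
\mathbf{D}^\star = \frac{\mathbf{C}^{1/2}}{\mathrm{tr}(\mathbf{C}^{1/2})},\\
\mathrm{tr}\big(\mathbf{W}^T(\mathbf{D}^\star)^{-1}\mathbf{W}\big)
 &= \mathrm{tr}\big((\mathbf{D}^\star)^{-1}\mathbf{C}\big)
  = \mathrm{tr}(\mathbf{C}^{1/2})\,\mathrm{tr}\big(\mathbf{C}^{-1/2}\mathbf{C}\big)
  = \big(\mathrm{tr}(\mathbf{C}^{1/2})\big)^2 .
\end{align*}
```

Since the eigenvalues of $(\mathbf{W}\mathbf{W}^T)^{1/2}$ are the singular values of $\mathbf{W}$, $\mathrm{tr}(\mathbf{C}^{1/2})$ is the sum of the singular values of $\mathbf{W}$, i.e., its trace norm, so the resulting regularizer is the squared trace norm.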

Similar to the MTFL method, the multi-task sparse coding method [14] learns a linear transformation on features with the objective function formulated as

$$\min_{\mathbf{A},\mathbf{U},\mathbf{b}}\ L(\mathbf{U}\mathbf{A},\mathbf{b})\quad \mathrm{s.t.}\ \|\mathbf{a}^i\|_1\le\lambda\ \ \forall i\in[m],\quad \|\mathbf{u}^j\|_2\le 1\ \ \forall j\in[D], \tag{3}$$

where $\mathbf{u}^j$ is the $j$th column in $\mathbf{U}\in\mathbb{R}^{d\times D}$, $[a]$ for an integer $a$ denotes the set of integers from 1 to $a$, $\|\cdot\|_1$ denotes the $\ell_1$ norm of a vector or matrix and equals the sum of the absolute values of its entries, and $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector. Here the transformation $\mathbf{U}$ is also called the dictionary in sparse coding and is shared by all the tasks. Compared with the MTFL method, where $\mathbf{U}$ in problem (1) is a $d\times d$ orthogonal matrix, $\mathbf{U}$ in problem (3) is overcomplete, which implies that $D$ is larger than $d$, with each column having a bounded $\ell_2$ norm. Another difference is that $\mathbf{A}$ in problem (1) is enforced to be row-sparse, but in problem (3) it is only sparse via the first constraint. With a similar idea to the multi-task sparse coding method, Zhu et al. [15] propose a multi-task infinite support vector machine via the Indian buffet process; the difference is that in [15] the dictionary is sparse and the model parameters are non-sparse. In [16], the spike and slab prior is used to learn sparse model parameters for multi-output regression problems where transformed features are induced by Gaussian processes and shared by different outputs.

Recently deep learning has become popular due to its capacity to learn nonlinear features in many applications, and deep models have been used as basic models in MTL. Different from the aforementioned models, which are shallow since there is only one level of feature transformation, some deep MTL models can have many layers of feature transformations. For example, similar to the multi-task neural network shown in Fig. 1, many deep MTL methods [17, 18, 19, 20, 21] assume that different tasks share the first several hidden layers and then have task-specific parameters in the subsequent layers. Different from these deep MTL methods, a more advanced way is to use a learning-based approach to determine the inputs of hidden layers in different tasks, e.g., the cross-stitch network proposed in [22], which learns task relations in terms of the hidden feature representation and is somewhat similar to the task relation learning approach introduced later. Specifically, given two tasks $A$ and $B$ with an identical network architecture, $x^A_{i,j}$ ($x^B_{i,j}$) denotes the hidden feature output by the $j$th unit of the $i$th hidden layer for task $A$ ($B$). Then we can define the cross-stitch operation on $x^A_{i,j}$ and $x^B_{i,j}$ as

$$\begin{pmatrix}\tilde{x}^A_{i,j}\\ \tilde{x}^B_{i,j}\end{pmatrix}=\begin{pmatrix}\alpha_{11}&\alpha_{12}\\ \alpha_{21}&\alpha_{22}\end{pmatrix}\begin{pmatrix}x^A_{i,j}\\ x^B_{i,j}\end{pmatrix},$$

where $\tilde{x}^A_{i,j}$ and $\tilde{x}^B_{i,j}$ are new hidden features after learning the two tasks jointly. When both $\alpha_{12}$ and $\alpha_{21}$ equal 0, training the two networks jointly is equivalent to training them independently. The network architecture of the cross-stitch network is shown in Fig. 2. Here the matrix $\boldsymbol{\alpha}=(\alpha_{11},\alpha_{12};\alpha_{21},\alpha_{22})$ encodes the task relations between the two tasks and it can be learned via the backpropagation method. Different from the task relation learning approach, whose task relations are defined based on the model parameters, $\boldsymbol{\alpha}$ is based on hidden features. Moreover, adversarial learning is applied to learn common features for MTL in [23].

#### 2.1.2 Feature Selection Approach

One way to do feature selection in MTL is to use group sparsity based on the $\ell_{p,q}$ norm denoted by $\|\mathbf{W}\|_{p,q}$, which is equal to $\|(\|\mathbf{w}_1\|_p,\ldots,\|\mathbf{w}_d\|_p)\|_q$, where $\mathbf{w}_i$ denotes the $i$th row of $\mathbf{W}$ and $\|\cdot\|_p$ denotes the $\ell_p$ norm of a vector. Obozinski et al. [24, 25] are among the first to study the multi-task feature selection (MTFS) problem based on the $\ell_{2,1}$ norm with the objective function formulated as

$$\min_{\mathbf{W},\mathbf{b}}\ L(\mathbf{W},\mathbf{b})+\lambda\|\mathbf{W}\|_{2,1}. \tag{4}$$

The regularizer on $\mathbf{W}$ in problem (4) enforces $\mathbf{W}$ to be row-sparse, which in turn helps select important features. In [24, 25], a path-following algorithm is proposed to solve problem (4), and then Liu et al. [26] employ an optimal first-order optimization method to solve it. Compared with problem (1), we can see that problem (4) is similar to the MTFL method without learning the transformation $\mathbf{U}$. Lee et al. [27] propose a weighted $\ell_{2,1}$ norm for multi-task feature selection where the weights can be learned as well, and problem (4) is extended in [28] to a general case where feature groups can overlap with each other. In order to make problem (4) more robust to outliers, a square-root loss function is investigated in [29]. Moreover, in order to speed up the optimization, a safe screening method is proposed in [30] to filter out useless features corresponding to zero rows in $\mathbf{W}$ before optimizing problem (4).

Liu et al. [31] propose to use the $\ell_{\infty,1}$ norm to select features with the objective function formulated as

$$\min_{\mathbf{W},\mathbf{b}}\ L(\mathbf{W},\mathbf{b})+\lambda\|\mathbf{W}\|_{\infty,1}. \tag{5}$$

A block coordinate descent method is proposed to solve problem (5). In general, we can use the $\ell_{p,q}$ norm to select features for MTL and the objective function is formulated as

$$\min_{\mathbf{W},\mathbf{b}}\ L(\mathbf{W},\mathbf{b})+\lambda\|\mathbf{W}\|_{p,q}. \tag{6}$$

In order to keep the convexity of problem (6), it is required that $p\ge 1$ and $q\ge 1$. For the optimization of problem (6), Vogt and Roth [32] propose an active set algorithm to solve the $\ell_{p,1}$ norm regularization efficiently for arbitrary $p$.
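The row-wise norms used as regularizers in this subsection are easy to compute directly. The following is a minimal Python sketch, assuming the $\ell_{p,q}$ norm is the $\ell_q$ norm of the vector of row-wise $\ell_p$ norms and that rows of the parameter matrix index features while columns index tasks; the helper names `row_norm` and `lpq_norm` are introduced here for illustration only.

```python
import math
from typing import List


def row_norm(row: List[float], p: float) -> float:
    """l_p norm of a vector; p may be math.inf for the max norm."""
    if p == math.inf:
        return max(abs(v) for v in row)
    return sum(abs(v) ** p for v in row) ** (1.0 / p)


def lpq_norm(W: List[List[float]], p: float, q: float) -> float:
    """l_{p,q} norm: l_q norm of the vector of row-wise l_p norms."""
    return row_norm([row_norm(r, p) for r in W], q)


# d = 3 features (rows), m = 2 tasks (columns).
W = [[3.0, 4.0],
     [0.0, 0.0],   # a zero row: this feature is unused by every task
     [1.0, -2.0]]

l21 = lpq_norm(W, 2, 1)             # l_{2,1} norm: 5 + 0 + sqrt(5)
linf1 = lpq_norm(W, math.inf, 1)    # l_{inf,1} norm: 4 + 0 + 2
```

A zero row contributes nothing to either norm, which is why penalizing these norms pushes whole rows, and hence whole features, to zero across all tasks.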

In order to attain a sparser subset of features, Gong et al. [33] propose a capped-$\ell_{p,1}$ penalty for multi-task feature selection where $p=1$ or 2 and the objective function is formulated as

$$\min_{\mathbf{W},\mathbf{b}}\ L(\mathbf{W},\mathbf{b})+\lambda\sum_{i=1}^{d}\min\big(\|\mathbf{w}_i\|_p,\theta\big), \tag{7}$$

where $\mathbf{w}_i$ denotes the $i$th row of $\mathbf{W}$. With the given threshold $\theta$, the capped-$\ell_{p,1}$ penalty (i.e., the second term in problem (7)) focuses on rows with $\ell_p$ norms smaller than $\theta$, which are more likely to be sparse. When $\theta$ becomes large enough, the second term in problem (7) becomes $\lambda\|\mathbf{W}\|_{p,1}$ and hence problem (7) degenerates to problem (4) or (5) when $p$ equals 2 or $\infty$.
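The effect of the threshold can be seen numerically. The following is a small Python sketch, assuming the capped penalty has the form $\sum_i \min(\|\mathbf{w}_i\|_p,\theta)$ over the rows of the parameter matrix; the function name `capped_lp1` and the example values are hypothetical.

```python
import math
from typing import List


def capped_lp1(W: List[List[float]], p: float, theta: float) -> float:
    """Sum over rows of min(l_p norm of the row, theta); rows index features."""
    def row_norm(row: List[float]) -> float:
        if p == math.inf:
            return max(abs(v) for v in row)
        return sum(abs(v) ** p for v in row) ** (1.0 / p)

    return sum(min(row_norm(r), theta) for r in W)


W = [[3.0, 4.0],   # row norm 5: a clearly useful feature
     [0.1, 0.0],   # row norm 0.1: a nearly irrelevant feature
     [0.0, 0.0]]

small_cap = capped_lp1(W, p=2, theta=1.0)     # 1.0 + 0.1 + 0.0
large_cap = capped_lp1(W, p=2, theta=100.0)   # cap inactive: 5.0 + 0.1 + 0.0
```

With a small threshold, the large row contributes only the constant $\theta$, so only rows with small norms are effectively shrunk; with a large threshold the penalty coincides with the uncapped row-wise norm.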

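Lozano and Swirszcz [34] propose a multi-level Lasso for MTL where the $(j,i)$th entry in the parameter matrix $\mathbf{W}$ is defined as $w_{ji}=\theta_j\hat{w}_{ji}$. When $\theta_j$ is equal to 0, $w_{ji}$ becomes 0 for all $i\in[m]$ and hence the $j$th feature is not selected by the model. In this sense, $\theta_j$ controls the global sparsity for the $j$th feature among the $m$ tasks. Moreover, when $\hat{w}_{ji}$ becomes 0, $w_{ji}$ is also 0 for task $i$ only, implying that the $j$th feature is not useful for task $\mathcal{T}_i$, and so $\hat{w}_{ji}$ is a local indicator for the sparsity in task $\mathcal{T}_i$. Based on these observations, $\boldsymbol{\theta}$ and $\hat{\mathbf{W}}$ are expected to be sparse, leading to the objective function formulated as

$$\min_{\boldsymbol{\theta},\hat{\mathbf{W}},\mathbf{b}}\ L(\mathbf{W},\mathbf{b})+\lambda_1\|\boldsymbol{\theta}\|_1+\lambda_2\|\hat{\mathbf{W}}\|_1\quad \mathrm{s.t.}\ w_{ji}=\theta_j\hat{w}_{ji},\ \theta_j\ge 0, \tag{8}$$

where $\boldsymbol{\theta}=(\theta_1,\ldots,\theta_d)^T$, $\hat{\mathbf{W}}=(\hat{\mathbf{w}}^1,\ldots,\hat{\mathbf{w}}^m)$, and the nonnegative constraint on $\boldsymbol{\theta}$ keeps the model identifiable. It has been proven in [34] that problem (8) leads to a regularizer $\sum_{j=1}^{d}\sqrt{\|\mathbf{w}_j\|_1}$, the square root of the $\ell_{1,\frac{1}{2}}$ norm regularization. Moreover, Wang et al. [35] extend problem (8) to a general situation with a more general form of this regularizer. By utilizing a priori information describing the task relations in a hierarchical structure, Han et al. [36] propose a multi-component product based decomposition for $w_{ji}$ where the number of components in the decomposition can be arbitrary instead of only 2 as in [34, 35]. Similar to [34], Jebara [37, 38] proposes to learn a binary indicator vector to do multi-task feature selection based on the maximum entropy discrimination formalism.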
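The global and local roles of the two factors can be illustrated with a short Python sketch. It assumes the product decomposition $w_{ji}=\theta_j\hat{w}_{ji}$ with features as rows and tasks as columns, and an $\ell_1$ penalty on each factor; the names `compose_parameters` and `penalty` and the numerical values are hypothetical.

```python
from typing import List


def compose_parameters(theta: List[float], W_hat: List[List[float]]) -> List[List[float]]:
    """Return W with w_ji = theta_j * w_hat_ji (rows = features j, columns = tasks i)."""
    return [[theta[j] * w for w in W_hat[j]] for j in range(len(theta))]


def penalty(theta: List[float], W_hat: List[List[float]], lam1: float, lam2: float) -> float:
    """lam1 * ||theta||_1 + lam2 * ||W_hat||_1 (entry-wise l_1 norms)."""
    return lam1 * sum(abs(t) for t in theta) + lam2 * sum(abs(w) for row in W_hat for w in row)


theta = [1.5, 0.0, 2.0]      # second entry is 0: that feature is dropped for every task
W_hat = [[1.0, -1.0],
         [0.7, 0.3],
         [0.0, 0.5]]         # a zero entry drops the third feature for the first task only

W = compose_parameters(theta, W_hat)   # [[1.5, -1.5], [0.0, 0.0], [0.0, 1.0]]
reg = penalty(theta, W_hat, lam1=1.0, lam2=1.0)
```

The second row of `W` is zero for both tasks because of the global factor, while the third row is zero only in the first column because of the local factor.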

Similar to [36], where a priori information is given to describe task relations in a hierarchical/tree structure, Kim and Xing [39] utilize the given tree structure to design a regularizer on $\mathbf{W}$ as $\sum_{i=1}^{d}\sum_{v\in V}\lambda_v\|\mathbf{w}_{i,G_v}\|_2$, where $V$ denotes the set of nodes in the given tree structure, $\lambda_v$ is a weight for node $v$, $G_v$ denotes the set of leaf nodes (i.e., tasks) in the sub-tree rooted at node $v$, and $\mathbf{w}_{i,G_v}$ denotes the subvector of the $i$th row of $\mathbf{W}$ indexed by $G_v$. This regularizer not only enforces each row of $\mathbf{W}$ to be sparse, as the $\ell_{2,1}$ norm does in problem (4), but also induces sparsity in subsets of each row of $\mathbf{W}$ based on the tree structure.
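A tree-guided group penalty of this kind can be sketched in a few lines of Python. The sketch assumes the regularizer is a weighted sum, over features and tree nodes, of the $\ell_2$ norm of the row entries belonging to the tasks under each node; the dictionary layout of `groups`, the unit node weights and the function name `tree_penalty` are illustrative assumptions rather than the formulation of [39].

```python
import math
from typing import Dict, List, Tuple


def tree_penalty(W: List[List[float]], groups: Dict[str, Tuple[float, List[int]]]) -> float:
    """Sum over rows i and nodes v of weight_v * l_2 norm of row i restricted to leaves of v."""
    total = 0.0
    for row in W:
        for weight, leaves in groups.values():
            total += weight * math.sqrt(sum(row[t] ** 2 for t in leaves))
    return total


# A tree over m = 3 tasks: root -> {internal -> {task 0, task 1}, task 2}.
# Each node maps to (node weight, indices of the tasks below it).
groups = {
    "root": (1.0, [0, 1, 2]),
    "internal": (1.0, [0, 1]),
    "leaf0": (1.0, [0]),
    "leaf1": (1.0, [1]),
    "leaf2": (1.0, [2]),
}

W = [[3.0, 4.0, 0.0],   # feature used by tasks 0 and 1 only
     [0.0, 0.0, 0.0]]   # feature unused by all tasks

value = tree_penalty(W, groups)   # 5 (root) + 5 (internal) + 3 + 4 + 0 = 17
```

Groups at the root couple all tasks, while groups at internal nodes couple only the tasks in the corresponding sub-tree, so a feature can be switched off for one branch of the tree without being switched off everywhere.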

Different from conventional multi-task feature selection methods, which assume that different tasks share a set of original features, Zhou et al. [40] consider a different scenario where useful features in different tasks do not overlap. In order to achieve this, an exclusive Lasso model is proposed with the objective function formulated as $\min_{\mathbf{W},\mathbf{b}} L(\mathbf{W},\mathbf{b})+\lambda\|\mathbf{W}\|_{1,2}^2$, where the regularizer is the squared $\ell_{1,2}$ norm on $\mathbf{W}$.

Another way to select common features for MTL is to use sparse priors to design probabilistic or Bayesian models. For $\ell_{p,1}$-regularized multi-task feature selection, Zhang et al. [41] propose a probabilistic interpretation where the $\ell_{p,1}$ regularizer corresponds to a prior: $w_{ji}\sim\mathcal{GN}(0,\rho_j,p)$, where $\mathcal{GN}(\cdot,\cdot,\cdot)$ denotes the generalized normal distribution. Based on this interpretation, Zhang et al. [41] further propose a probabilistic framework for multi-task feature selection, in which task relations and outlier tasks can be identified, based on the matrix-variate generalized normal prior.

In [42], a generalized horseshoe prior is proposed to do feature selection for MTL. The prior is built from univariate and multivariate normal distributions together with several hyperparameters, and it includes a feature correlation matrix that is shared by all the tasks and learned from data, which encodes an assumption that different tasks share identical feature correlations. When this matrix becomes an identity matrix, which means that features are independent, the prior degenerates to the horseshoe prior, which can induce sparse estimates.

Hernández-Lobato et al. [43] propose a probabilistic model based on the horseshoe prior as:

(9)

Eq. (9) combines a probability mass function at zero with a density function of non-zero coefficients. Binary indicator variables in Eq. (9) specify whether each feature is an outlier and whether each task is an outlier. Two further indicators specify whether a feature is relevant for the prediction in a given task, and another indicates whether a non-outlier feature is relevant for the prediction in all non-outlier tasks. Based on these definitions, the three terms on the right-hand side of Eq. (9) specify probability density functions of the model parameters for the different situations of features and tasks. So this model can also handle outlier tasks, but in a way different from [41].

#### 2.1.3 Comparison between Two Sub-categories

The two sub-categories have different characteristics: the feature transformation approach learns a transformation of the original features as the new representation, whereas the feature selection approach selects a subset of the original features as the new representation for all the tasks. Based on these characteristics, the feature selection approach can be viewed as a special case of the feature transformation approach in which the transformation matrix is diagonal and the diagonal entries with value 1 correspond to the selected features. From this perspective, the feature transformation approach can usually fit the training data better than the feature selection approach since it has more capacity, and hence, if there is no overfitting, its generalization performance is likely to be better than that of the feature selection approach. On the other hand, by selecting a subset of the original features as the new representation, the feature selection approach has better interpretability. In short, if an application needs better performance, the feature transformation approach is preferred, and if the application needs some decision support, the feature selection approach may be the first choice.
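The special-case relation can be made explicit with a tiny Python sketch: a diagonal 0/1 matrix applied as a linear transformation keeps the selected features and zeroes out the rest. The helper names `apply_transformation` and `selection_matrix` are hypothetical and introduced only for this illustration.

```python
from typing import List, Set


def apply_transformation(U: List[List[float]], x: List[float]) -> List[float]:
    """Linear feature transformation z = U x."""
    return [sum(U[r][c] * x[c] for c in range(len(x))) for r in range(len(U))]


def selection_matrix(selected: Set[int], d: int) -> List[List[float]]:
    """Diagonal d x d matrix with 1 on the diagonal for selected features, 0 elsewhere."""
    return [[1.0 if (r == c and r in selected) else 0.0 for c in range(d)] for r in range(d)]


x = [0.2, -1.3, 4.0]
S = selection_matrix({0, 2}, d=3)
z = apply_transformation(S, x)   # keeps features 0 and 2, zeroes feature 1
```

A general transformation matrix has $d^2$ free entries, whereas the selection matrix has only $d$ binary ones, which is one way to see why the transformation approach has more capacity and the selection approach is easier to interpret.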

### 2.2 Low-Rank Approach

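The relatedness among multiple tasks can imply that the parameter matrix $\mathbf{W}$ is of low rank, leading to the low-rank approach.

Ando and Zhang [44] assume that the model parameters of different tasks partly share a low-rank subspace. More specifically, $\mathbf{w}^i$ takes the form

$$\mathbf{w}^i=\mathbf{u}^i+\mathbf{\Theta}^T\mathbf{v}^i. \tag{10}$$

Here $\mathbf{\Theta}\in\mathbb{R}^{h\times d}$ is the low-rank subspace shared by multiple tasks, where $h<d$. Then we can write $\mathbf{W}$ in matrix form as $\mathbf{W}=\mathbf{U}+\mathbf{\Theta}^T\mathbf{V}$. Based on this form of $\mathbf{W}$, the objective function proposed in [44] is formulated as

$$\min_{\mathbf{U},\mathbf{V},\mathbf{\Theta},\mathbf{b}}\ L(\mathbf{U}+\mathbf{\Theta}^T\mathbf{V},\mathbf{b})+\lambda\|\mathbf{U}\|_F^2\quad \mathrm{s.t.}\ \mathbf{\Theta}\mathbf{\Theta}^T=\mathbf{I}, \tag{11}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. The orthonormal constraint on $\mathbf{\Theta}$ in problem (11) makes the subspace non-redundant. When $\lambda$ is large enough, the optimal $\mathbf{U}$ can become a zero matrix and hence problem (11) is very similar to problem (1), except that there is no regularization on $\mathbf{V}$ in problem (11) and that $\mathbf{\Theta}$ has a smaller number of rows than columns. Chen et al. [45] generalize problem (11) as

$$\min_{\mathbf{U},\mathbf{V},\mathbf{\Theta},\mathbf{b}}\ L(\mathbf{U}+\mathbf{\Theta}^T\mathbf{V},\mathbf{b})+\lambda_1\|\mathbf{U}\|_F^2+\lambda_2\|\mathbf{W}\|_F^2\quad \mathrm{s.t.}\ \mathbf{\Theta}\mathbf{\Theta}^T=\mathbf{I}. \tag{12}$$

When setting $\lambda_2$ to be 0, problem (12) reduces to problem (11). Even though problem (12) is non-convex, with some convex relaxation technique it can be relaxed to the following convex problem:

$$\min_{\mathbf{W},\mathbf{b},\mathbf{M}}\ L(\mathbf{W},\mathbf{b})+\lambda\,\mathrm{tr}\big(\mathbf{W}^T(\mathbf{M}+\eta\mathbf{I})^{-1}\mathbf{W}\big)\quad \mathrm{s.t.}\ \mathrm{tr}(\mathbf{M})=h,\ \mathbf{0}\preceq\mathbf{M}\preceq\mathbf{I}, \tag{13}$$

where $\eta=\lambda_2/\lambda_1$ and $\lambda=\lambda_1\eta(1+\eta)$. One advantage of problem (13) over problem (12) is that the global optimum of the convex problem (13) is much easier to obtain than that of the non-convex problem (12). Compared with the alternative objective function (2) in the MTFL method, problem (13) has a similar formulation where $\mathbf{M}$ models the feature covariance for all the tasks. Problem (11) is extended in [46] to a general case where different $\mathbf{w}^i$'s lie in a manifold instead of a subspace. Moreover, in [47, 48], a latent variable model is proposed for $\mathbf{W}$ with the same decomposition as Eq. (10), and it can provide a framework for MTL by modeling more cases than problem (11), such as task clustering, sharing sparse representation, duplicate tasks and evolving tasks.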
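The shared-subspace decomposition can be illustrated with a small Python sketch. It assumes the form $\mathbf{W}=\mathbf{U}+\mathbf{\Theta}^T\mathbf{V}$ with a task-specific part $\mathbf{U}$ ($d\times m$), a subspace $\mathbf{\Theta}$ ($h\times d$) with orthonormal rows and coefficients $\mathbf{V}$ ($h\times m$); the function names and the numerical values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def compose(U: Matrix, Theta: Matrix, V: Matrix) -> Matrix:
    """W = U + Theta^T V, where U is d x m, Theta is h x d and V is h x m."""
    d, m, h = len(U), len(U[0]), len(Theta)
    return [[U[r][i] + sum(Theta[k][r] * V[k][i] for k in range(h)) for i in range(m)]
            for r in range(d)]


def has_orthonormal_rows(Theta: Matrix, tol: float = 1e-9) -> bool:
    """Check Theta Theta^T = I up to a tolerance."""
    for a, ra in enumerate(Theta):
        for b, rb in enumerate(Theta):
            dot = sum(x * y for x, y in zip(ra, rb))
            if abs(dot - (1.0 if a == b else 0.0)) > tol:
                return False
    return True


# d = 3 features, m = 2 tasks, h = 1 shared direction.
Theta = [[0.6, 0.8, 0.0]]
V = [[1.0, -2.0]]
U_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

W_shared = compose(U_zero, Theta, V)   # every column is a multiple of (0.6, 0.8, 0)
```

When the task-specific part is zero, all task parameter vectors lie in the $h$-dimensional subspace spanned by the rows of $\mathbf{\Theta}$, so the rank of the parameter matrix is at most $h$; a nonzero $\mathbf{U}$ lets individual tasks deviate from that subspace.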

It is well known that using the trace norm as a regularizer can make a matrix have low rank and hence this regularization is suitable for MTL. Specifically, an objective function with the trace norm regularization is proposed in [49] as

$$\min_{\mathbf{W},\mathbf{b}}\ L(\mathbf{W},\mathbf{b})+\lambda\|\mathbf{W}\|_{S(1)}, \tag{14}$$

where $\mu_i(\mathbf{W})$ denotes the $i$th smallest singular value of $\mathbf{W}$ and $\|\mathbf{W}\|_{S(1)}=\sum_{i=1}^{\min(m,d)}\mu_i(\mathbf{W})$ denotes the trace norm of matrix $\mathbf{W}$. Based on the trace norm, Han and Zhang [50] propose a variant called the capped trace regularizer with the objective function formulated as

$$\min_{\mathbf{W},\mathbf{b}}\ L(\mathbf{W},\mathbf{b})+\lambda\sum_{i=1}^{\min(m,d)}\min\big(\mu_i(\mathbf{W}),\theta\big). \tag{15}$$

With the use of the threshold $\theta$, the capped trace regularizer only penalizes small singular values of $\mathbf{W}$, which is related to the determination of the rank of $\mathbf{W}$. When $\theta$ is large enough, the capped trace regularizer becomes the trace norm and hence in this situation problem (15) reduces to problem (14). Moreover, a spectral $k$-support norm is proposed in [51] as an improvement over the trace norm regularization.
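The difference between the trace norm and its capped variant can be seen on a list of singular values. The following Python sketch assumes the singular values of the parameter matrix have already been computed by some SVD routine and are passed in directly; the function names and the example spectrum are hypothetical.

```python
from typing import List


def trace_norm(singular_values: List[float]) -> float:
    """Trace norm: the sum of the singular values."""
    return sum(singular_values)


def capped_trace(singular_values: List[float], theta: float) -> float:
    """Capped trace regularizer: the sum of min(singular value, theta)."""
    return sum(min(s, theta) for s in singular_values)


spectrum = [5.0, 2.0, 0.01]        # two dominant directions and one tiny one
shrunk_large = [4.0, 2.0, 0.01]    # largest singular value reduced
shrunk_small = [5.0, 2.0, 0.0]     # smallest singular value removed (rank drops)

theta = 1.0
base = capped_trace(spectrum, theta)
```

With `theta = 1.0`, reducing the largest singular value leaves the capped regularizer unchanged, while removing the smallest one lowers it, so the penalty acts only on the small singular values that decide the rank. With a threshold above the largest singular value, `capped_trace` equals `trace_norm`.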

The trace norm regularization has been extended to regularize model parameters in deep learning models. Specifically, the weights in the last several fully connected layers of deep feedforward neural networks can be viewed as the parameters of different learners of all the tasks. In this view, the weights connecting two consecutive layers for one task can be organized in a matrix and hence the weights of all the tasks can form a tensor. Based on such tensor representations, several tensor trace norms that are based on the matrix trace norm [49] are used in [52] as regularizers to identify the low-rank structure of the parameter tensor.

### 2.3 Task Clustering Approach

The task clustering approach assumes that different tasks form several clusters, each of which consists of similar tasks. As indicated by its name, this approach has a close connection to clustering algorithms and it can be viewed as an extension of clustering algorithms to the task level, while conventional clustering algorithms operate on the data level.

Thrun and Sullivan [53] propose the first task clustering algorithm by using a weighted nearest neighbor classifier for each task, where the initial weights that define the weighted Euclidean distance are learned by minimizing pairwise within-class distances and maximizing pairwise between-class distances simultaneously within each task. Then they define a task transfer matrix $\mathbf{A}$ whose $(i,j)$th entry $a_{ij}$ records the generalization accuracy obtained for task $\mathcal{T}_i$ by using task $\mathcal{T}_j$'s distance metric via cross validation. Based on $\mathbf{A}$, the $m$ tasks can be grouped into $r$ clusters $\{\mathcal{C}_t\}_{t=1}^{r}$ by maximizing $\sum_{t=1}^{r}\frac{1}{|\mathcal{C}_t|}\sum_{i,j\in\mathcal{C}_t}a_{ij}$, where $|\cdot|$ denotes the cardinality of a set. After obtaining the cluster structure among all the tasks, the training data of tasks in a cluster are pooled together to learn the final weighted nearest neighbor classifier. This approach has been extended to an iterative process in [54] in a way similar to $k$-means clustering.
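The clustering criterion based on a task transfer matrix can be illustrated with an exhaustive search over small problems. This Python sketch assumes the score of a partition is the sum over clusters of the within-cluster transfer accuracies divided by the cluster size; the matrix values are made up, and brute force is used only for illustration and is not the procedure of [53].

```python
from itertools import product
from typing import List, Tuple


def clustering_score(A: List[List[float]], clusters: List[List[int]]) -> float:
    """Sum over clusters C of (1 / |C|) * sum of a_ij for i, j in C."""
    return sum(sum(A[i][j] for i in C for j in C) / len(C) for C in clusters)


def best_partition(A: List[List[float]], r: int) -> Tuple[List[List[int]], float]:
    """Exhaustively try all assignments of tasks to at most r clusters (tiny m only)."""
    m = len(A)
    best, best_score = [], float("-inf")
    for labels in product(range(r), repeat=m):
        clusters = [[i for i in range(m) if labels[i] == k] for k in range(r)]
        clusters = [C for C in clusters if C]   # drop empty clusters
        score = clustering_score(A, clusters)
        if score > best_score:
            best, best_score = clusters, score
    return best, best_score


# Hypothetical transfer accuracies: tasks 0, 1 help each other, as do tasks 2, 3.
A = [[0.9, 0.8, 0.3, 0.2],
     [0.8, 0.9, 0.2, 0.3],
     [0.3, 0.2, 0.9, 0.8],
     [0.2, 0.3, 0.8, 0.9]]

clusters, score = best_partition(A, r=2)
```

On this toy matrix the search groups tasks 0 and 1 together and tasks 2 and 3 together, since pooling tasks with low mutual transfer accuracy lowers the average within-cluster score.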

Bakker and Heskes [55] propose a multi-task Bayesian neural network model with a network structure similar to Fig. 1, where the input-to-hidden weights are shared by all the tasks but the hidden-to-output weights are task-specific. By defining $\mathbf{w}^i$ as the vector of hidden-to-output weights for task $\mathcal{T}_i$, the multi-task Bayesian neural network assigns a mixture-of-Gaussians prior to it: $\mathbf{w}^i\sim\sum_{j=1}^{r}\pi_j\,\mathcal{N}(\mathbf{m}_j,\boldsymbol{\Sigma}_j)$, where $\pi_j$, $\mathbf{m}_j$ and $\boldsymbol{\Sigma}_j$ specify the prior, the mean and the covariance in the $j$th cluster. Tasks in the same cluster share a Gaussian distribution. When $r$ equals 1, this model degenerates to a case where the model parameters of different tasks share a prior, which is similar to several Bayesian MTL models such as [56, 57, 58] that are based on Gaussian processes and $t$ processes.

Xue et al. [3] deploy the Dirichlet process to perform clustering at the task level. Specifically, the prior on $\mathbf{w}^i$ is defined as

$$
\mathbf{w}^i\sim G,\quad G\sim\mathrm{DP}(\alpha,G_0),\quad i=1,\ldots,m,
$$

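where $\mathrm{DP}(\alpha,G_0)$ denotes a Dirichlet process with $\alpha$ as a positive scaling parameter and $G_0$ a base distribution. To see the clustering effect, by integrating out $G$, the conditional distribution of $\mathbf{w}^i$, given the model parameters of the other tasks $\mathbf{W}_{-i}=\{\mathbf{w}^j\}_{j\neq i}$, is

$$
P(\mathbf{w}^i\mid\mathbf{W}_{-i},\alpha,G_0)=\frac{\alpha}{m-1+\alpha}\,G_0+\frac{1}{m-1+\alpha}\sum_{j=1,\,j\neq i}^{m}\delta_{\mathbf{w}^j},
$$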

where $\delta_{\mathbf{w}^j}$ denotes the distribution concentrated at a single point $\mathbf{w}^j$. So $\mathbf{w}^i$ can be equal to either $\mathbf{w}^j$ ($j\neq i$) with probability $\frac{1}{m-1+\alpha}$, which corresponds to the case that those two tasks lie in the same cluster, or a new sample from $G_0$ with probability $\frac{\alpha}{m-1+\alpha}$, which is the case that task $\mathcal{T}_i$ forms a new task cluster. When $\alpha$ is large, the chance of forming a new task cluster is large, and so $\alpha$ affects the number of task clusters. This model is extended in [59, 60] to a case where different tasks in a task cluster share useful features via a matrix stick-breaking process and a beta-Bernoulli hierarchical prior, respectively, and in [61] to a case where each task is a compressive sensing task. Moreover, a nested Dirichlet process is proposed in [62, 63] to use Dirichlet processes to learn both task clusters and the state structure of an infinite hidden Markov model, which handles sequential data in each task. In [64], $\mathbf{w}^i$ is decomposed as $\mathbf{w}^i=\mathbf{u}^i+\boldsymbol{\Theta}_i^T\mathbf{v}^i$ similar to Eq. (10), where $\mathbf{u}^i$ and $\boldsymbol{\Theta}_i$ are sampled according to a Dirichlet process.

Different from [55, 3], Jacob et al. [65] aim to learn task clusters under the regularization framework by considering three orthogonal aspects: a global penalty that measures how large the parameters are on average, a measure of between-cluster variance that quantifies the distance among different clusters, and a measure of within-cluster variance that quantifies the compactness of task clusters. By combining these three aspects and adopting a convex relaxation technique, the convex objective function is formulated as

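$$
\min_{\mathbf{W},\mathbf{b},\boldsymbol{\Sigma}}\ L(\mathbf{W},\mathbf{b})+\lambda\,\mathrm{tr}\left(\mathbf{W}\mathbf{1}\mathbf{1}^T\mathbf{W}^T\right)+\mathrm{tr}\left(\tilde{\mathbf{W}}\boldsymbol{\Sigma}^{-1}\tilde{\mathbf{W}}^T\right)\quad\text{s.t. }\tilde{\mathbf{W}}=\mathbf{W}\boldsymbol{\Pi},\ \alpha\mathbf{I}\preceq\boldsymbol{\Sigma}\preceq\beta\mathbf{I},\ \mathrm{tr}(\boldsymbol{\Sigma})=\gamma, \tag{16}
$$

where $\boldsymbol{\Pi}$ denotes the centering matrix, $\mathbf{1}$ denotes a column vector of all ones with its size depending on the context, and $\lambda$, $\alpha$, $\beta$, $\gamma$ are hyperparameters.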
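The three aspects combined by Jacob et al. [65] can be made concrete for a fixed, known cluster assignment. The sketch below assumes the usual sum-of-squares definitions (a penalty on the average parameter vector, a size-weighted spread of cluster centroids around that average, and the spread of tasks around their centroid); the function name `cluster_penalties` and the random data are hypothetical. It does not implement the convex relaxation, only the quantities that the relaxation combines.

```python
import numpy as np


def cluster_penalties(W, labels):
    """Return (mean penalty, between-cluster variance, within-cluster variance).

    W is d x m with one column per task; labels[i] is the cluster of task i.
    """
    m = W.shape[1]
    w_bar = W.mean(axis=1, keepdims=True)             # average parameter vector
    mean_penalty = m * float(np.sum(w_bar ** 2))
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        Wc = W[:, labels == c]
        centroid = Wc.mean(axis=1, keepdims=True)
        between += Wc.shape[1] * float(np.sum((centroid - w_bar) ** 2))
        within += float(np.sum((Wc - centroid) ** 2))
    return mean_penalty, between, within


rng = np.random.default_rng(0)
W = rng.normal(size=(5, 6))
labels = np.array([0, 0, 1, 1, 2, 2])
mean_pen, between, within = cluster_penalties(W, labels)

# Centering matrix: W @ Pi subtracts the average column from every column.
Pi = np.eye(6) - np.ones((6, 6)) / 6
total_centered = np.linalg.norm(W @ Pi, "fro") ** 2
```

The three quantities are orthogonal in the sense that they add up exactly: the squared Frobenius norm of the centered matrix equals the between-cluster plus the within-cluster term, and adding the mean penalty recovers the squared Frobenius norm of `W` itself.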

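Kang et al. [66] extend the MTFL method [11, 12], which treats all the tasks as a single cluster, to the case with multiple task clusters and aim to minimize the squared trace norm in each cluster. A diagonal matrix, $\mathbf{Q}_q\in\mathbb{R}^{m\times m}$, is defined as a cluster indicator matrix for the $q$th cluster. The $i$th diagonal entry of $\mathbf{Q}_q$ is equal to 1 if task $\mathcal{T}_i$ lies in the $q$th cluster and 0 otherwise. Since each task can belong to only one cluster, it is easy to see that $\sum_{q=1}^{r}\mathbf{Q}_q=\mathbf{I}$. Based on these considerations, the objective function is formulated as

$$
\min_{\mathbf{W},\mathbf{b},\{\mathbf{Q}_q\}}\ L(\mathbf{W},\mathbf{b})+\lambda\sum_{q=1}^{r}\|\mathbf{W}\mathbf{Q}_q\|_{S(1)}^2\quad\text{s.t. }\mathbf{Q}_q\in\{0,1\}^{m\times m}\ \forall q\in[r],\ \sum_{q=1}^{r}\mathbf{Q}_q=\mathbf{I}.
$$

When $r$ equals 1, this method reduces to the MTFL method.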

Han and Zhang [67] devise a structurally sparse regularizer to cluster tasks with the objective function as

$$
\min_{\mathbf{W},\mathbf{b}}\ L(\mathbf{W},\mathbf{b})+\lambda\sum_{j>i}\|\mathbf{w}^i-\mathbf{w}^j\|_2. \tag{17}
$$

Problem (17) is a special case of the method proposed in [67] with only one level of task clusters. The regularizer on $\mathbf{W}$ gives any pair of columns in $\mathbf{W}$ a chance to be identical, and after solving problem (17), the cluster structure can be discovered by comparing the columns of $\mathbf{W}$. One advantage of this structurally sparse regularizer is that the convex problem (17) can automatically determine the number of task clusters.

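Barzilai and Crammer [68] propose a task clustering method by defining $\mathbf{W}$ as $\mathbf{W}=\mathbf{F}\mathbf{G}$ where $\mathbf{F}\in\mathbb{R}^{d\times r}$ and $\mathbf{G}\in\{0,1\}^{r\times m}$. With an assumption that each task belongs to only one cluster, the objective function is formulated as

$$
\min_{\mathbf{W},\mathbf{b},\mathbf{F},\mathbf{G}}\ L(\mathbf{W},\mathbf{b})+\lambda\|\mathbf{F}\|_F^2\quad\text{s.t. }\mathbf{W}=\mathbf{F}\mathbf{G},\ \mathbf{G}\in\{0,1\}^{r\times m},\ \|\mathbf{g}^i\|_2=1\ \forall i\in[m],
$$

where $\mathbf{g}^i$ denotes the $i$th column of $\mathbf{G}$. When using the hinge loss or logistic loss, this non-convex problem can be relaxed to a min-max problem, which has a global optimum, by utilizing the dual problem with respect to $\mathbf{W}$ and $\mathbf{b}$ and discarding some non-convex constraints.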

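Zhou and Zhao [69] aim to cluster tasks by identifying representative tasks, which are a subset of the given tasks. If task $\mathcal{T}_i$ is selected by task $\mathcal{T}_j$ as a representative task, then it is expected that the model parameters for $\mathcal{T}_j$ are similar to those of $\mathcal{T}_i$. $z_{ij}$ is defined as the probability that task $\mathcal{T}_j$ selects task $\mathcal{T}_i$ as its representative task. Then based on $\mathbf{Z}=(z_{ij})$, the objective function is formulated as

$$
\min_{\mathbf{W},\mathbf{b},\mathbf{Z}}\ L(\mathbf{W},\mathbf{b})+\frac{\lambda_1}{2}\|\mathbf{W}\|_F^2+\lambda_2\sum_{i,j}z_{ij}\|\mathbf{w}^i-\mathbf{w}^j\|_2^2+\lambda_3\|\mathbf{Z}\|_{2,1}\quad\text{s.t. }\mathbf{Z}\ge\mathbf{0},\ \mathbf{Z}^T\mathbf{1}=\mathbf{1}. \tag{18}
$$

The third term in the objective function of problem (18) enforces the closeness of each pair of tasks based on $\mathbf{Z}$, and the last term employs the $\ell_{2,1}$ norm to enforce the row sparsity of $\mathbf{Z}$, which implies that the number of representative tasks is limited. The constraints in problem (18) guarantee that the entries in $\mathbf{Z}$ define valid probabilities. Problem (18) is related to problem (17) since the regularizer in problem (17) can be reformulated in a variational form over an auxiliary nonnegative matrix that weights the squared pairwise distances $\|\mathbf{w}^i-\mathbf{w}^j\|_2^2$, where both the regularizer and the constraint on that matrix are different from those on $\mathbf{Z}$ in problem (18).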

The studies above assume that each task can belong to only one task cluster, and this assumption seems too restrictive. In [70], the GO-MTL method relaxes this assumption by allowing a task to belong to more than one cluster and defines a decomposition of $\mathbf{W}$ similar to [68] as $\mathbf{W}=\mathbf{L}\mathbf{S}$, where $\mathbf{L}\in\mathbb{R}^{d\times r}$ denotes the latent basis with $r<m$ and $\mathbf{S}\in\mathbb{R}^{r\times m}$ contains the linear combination coefficients for all the tasks. $\mathbf{S}$ is assumed to be sparse since each task is generated from only a few columns in the latent basis or, equivalently, belongs to a small number of clusters. The objective function is formulated as

$$
\min_{\mathbf{L},\mathbf{S},\mathbf{b}}\ L(\mathbf{L}\mathbf{S},\mathbf{b})+\lambda_1\|\mathbf{S}\|_1+\lambda_2\|\mathbf{L}\|_F^2. \tag{19}
$$

Compared with the objective function of multi-task sparse coding, i.e., problem (3), we can see that when the regularization parameters take appropriate values, these two problems are almost equivalent. The difference is that in multi-task sparse coding the dictionary is overcomplete, implying that it has more columns than rows, while here $\mathbf{L}$ has fewer columns than rows. This method has been extended in [71] to decompose the parameter tensor in the fully connected layers of deep neural networks.

Among the aforementioned methods, the method in [53] first identifies the cluster structure and then learns the model parameters of all the tasks separately. This is not preferred since the learned cluster structure may be suboptimal for the model parameters, and hence follow-up works learn the model parameters and the cluster structure together. An important problem in clustering is determining the number of clusters, and this is also important for the task clustering approach. Of the above methods, only those in [3, 67] can automatically determine the number of task clusters: the method in [3] depends on the capacity of the Dirichlet process, while the method in [67] relies on a structurally sparse regularizer. Some of the aforementioned methods belong to Bayesian learning, i.e., [55, 3], while the rest are regularized models. Among these regularized methods, only the objective function proposed in [67] is convex while the others are originally non-convex.

The task clustering approach is related to the low-rank approach. To see this, suppose that there are $r$ task clusters with $r<m$ and that all the tasks in a cluster share the same model parameters, which makes the parameter matrix $\mathbf{W}$ low-rank with rank at most $r$. From the perspective of modeling, by setting $\mathbf{u}^i$ to be a zero vector in Eq. (10), we can see that the decomposition of $\mathbf{W}$ in [44] becomes similar to those in [70, 68], which to some extent demonstrates the relation between these two approaches. Moreover, the equivalence between problems (13) and (16), two typical methods in the low-rank and task clustering approaches, has been proved in [72]. The task clustering approach can visualize the cluster structure, which is an advantage over the low-rank approach.
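The rank argument can be checked numerically. The sketch below is a toy illustration with hypothetical sizes and random values: it builds a parameter matrix in which every task copies the parameter vector of its cluster, written as a product of a per-cluster parameter matrix and a one-hot assignment matrix, and then inspects the rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, r = 10, 6, 2                       # feature dimension, tasks, clusters

F = rng.normal(size=(d, r))              # one parameter vector per cluster
labels = np.array([0, 0, 0, 1, 1, 1])    # hard cluster assignment of each task
G = np.zeros((r, m))
G[labels, np.arange(m)] = 1.0            # one-hot columns: one cluster per task

W = F @ G                                # column i is the vector of task i's cluster
rank_W = np.linalg.matrix_rank(W)
```

Because `W` has only `r` distinct columns, its rank cannot exceed `r`, no matter how many tasks there are.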

### 2.4 Task Relation Learning Approach

In MTL, tasks are related and the task relatedness can be quantified via task similarity, task correlation, task covariance and so on. Here we use the term task relations to cover all such quantitative measures of relatedness.

In earlier studies on MTL, the task relations are assumed to be known a priori. In [73, 74], each task is assumed to be similar to any other task, and so the model parameters of each task are enforced to approach the average model parameters of all the tasks. In [75, 76, 77], task similarities for each pair of tasks are given, and these studies utilize the task similarities to design regularizers that guide the learning of multiple tasks on the principle that the more similar two tasks are, the closer the corresponding model parameters are expected to be. Moreover, given a tree structure describing relations among tasks in [78], the model parameters of a task corresponding to a node in the tree are enforced to be similar to those of its parent node.

However, in most applications, task relations are not available. In this case, learning task relations from data automatically is a good option. Bonilla et al. [79] propose a multi-task Gaussian process (MTGP) by defining a prior on $f^i_j$, the functional value for $\mathbf{x}^i_j$, as $\mathbf{f}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$, where $\mathbf{f}=(f^1_1,\ldots,f^m_{n_m})^T$ contains the functional values for all the training data. $\boldsymbol{\Sigma}$, the covariance matrix, defines the covariance between $f^i_j$ and $f^p_q$ as $\sigma(f^i_j,f^p_q)=\omega_{ip}\,k(\mathbf{x}^i_j,\mathbf{x}^p_q)$, where $k(\cdot,\cdot)$ denotes a kernel function and $\omega_{ip}$ describes the covariance between tasks $\mathcal{T}_i$ and $\mathcal{T}_p$. In order to keep $\boldsymbol{\Sigma}$ positive definite, a matrix $\boldsymbol{\Omega}$ containing $\omega_{ip}$ as its $(i,p)$th entry is also required to be positive definite, which makes $\boldsymbol{\Omega}$ the task covariance that describes the similarities between tasks. Then, based on the Gaussian likelihood for the labels given $\mathbf{f}$, the analytical marginal likelihood obtained by integrating out $\mathbf{f}$ can be used to learn $\boldsymbol{\Omega}$ from data. In [80], the learning curve and generalization bound of the multi-task Gaussian process are studied. Since $\boldsymbol{\Omega}$ in MTGP is a point estimate, which may lead to overfitting, Zhang and Yeung [81], based on a proposed weight-space view of MTGP, propose a multi-task generalized $t$ process by placing an inverse-Wishart prior on $\boldsymbol{\Omega}$ as $\boldsymbol{\Omega}\sim\mathcal{IW}(\nu,\boldsymbol{\Psi})$, where $\nu$ denotes the degree of freedom and $\boldsymbol{\Psi}$ is the base covariance for generating $\boldsymbol{\Omega}$. Since $\boldsymbol{\Psi}$ models the covariance between pairs of tasks, it can be determined based on the maximum mean discrepancy (MMD) [82].

Different from [79, 81], which are Bayesian models, Zhang and Yeung [4, 5] propose a regularized multi-task model called multi-task relationship learning (MTRL) by placing a matrix-variate normal prior on $\mathbf{W}$ as

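$$
\mathbf{W}\sim\mathcal{MN}(\mathbf{0},\mathbf{I},\boldsymbol{\Omega}), \tag{20}
$$

where $\mathcal{MN}(\mathbf{M},\mathbf{A},\mathbf{B})$ denotes a matrix-variate normal distribution with $\mathbf{M}$ as the mean, $\mathbf{A}$ the row covariance, and $\mathbf{B}$ the column covariance. Based on this prior as well as some likelihood function, the objective function for the maximum a posteriori solution is formulated as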

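$$
\min_{\mathbf{W},\mathbf{b},\boldsymbol{\Omega}}\ L(\mathbf{W},\mathbf{b})+\frac{\lambda_1}{2}\|\mathbf{W}\|_F^2+\frac{\lambda_2}{2}\,\mathrm{tr}\left(\mathbf{W}\boldsymbol{\Omega}^{-1}\mathbf{W}^T\right)\quad\text{s.t. }\boldsymbol{\Omega}\succeq\mathbf{0},\ \mathrm{tr}(\boldsymbol{\Omega})\le 1, \tag{21}
$$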

where the second term in the objective function penalizes the complexity of $\mathbf{W}$, the last term is due to the matrix-variate normal prior, and the constraints control the complexity of the positive definite covariance matrix $\boldsymbol{\Omega}$. It has been proved in [4, 5] that problem (21) is jointly convex with respect to $\mathbf{W}$, $\mathbf{b}$, and $\boldsymbol{\Omega}$. Problem (21) has been extended to multi-task boosting [83] and to multi-label learning [84] by learning label correlations. Problem (21) can also be interpreted from the perspective of reproducing kernel Hilbert spaces for vector-valued functions [85, 86, 87, 88]. Moreover, problem (21) is extended in [89] to learn sparse task relations via the $\ell_1$ regularization on $\boldsymbol{\Omega}$ when the number of tasks is large. A model similar to problem (21) is proposed in [90] via a matrix-variate normal prior on $\mathbf{W}$: $\mathbf{W}\sim\mathcal{MN}(\mathbf{0},\boldsymbol{\Omega}_1,\boldsymbol{\Omega}_2)$, where the inverses of $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$ are assumed to be sparse. The MTRL model is extended in [91] to use the symmetric matrix-variate generalized hyperbolic distribution to learn block sparse structure in $\mathbf{W}$ and in [92] to use the matrix generalized inverse Gaussian prior to learn low-rank $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$. Moreover, the MTRL model is generalized to the multi-task feature selection problem [41] by learning task relations via the matrix-variate generalized normal distribution. Since the prior defined in Eq. (20) implies that $\mathbf{W}^T\mathbf{W}$ follows $\mathcal{W}(d,\boldsymbol{\Omega})$, where $\mathcal{W}(\cdot,\cdot)$ denotes a Wishart distribution, Zhang and Yeung [93] generalize it as

$$
(\mathbf{W}^T\mathbf{W})^t\sim\mathcal{W}(d,\boldsymbol{\Omega}), \tag{22}
$$

where $t$ is a positive integer used to model high-order task relationships. Eq. (22) induces a new prior on $\mathbf{W}$, which is a generalization of the matrix-variate normal distribution, and based on this new prior, a new regularized method is devised in [93] to learn the task relations. As a special case of MTL, multi-output regression problems, where each output is treated as a task and all the tasks share the training data, are investigated in [92, 94, 95, 96] to not only learn the relations among different outputs/tasks in a way similar to problem (21) but also model the structure contained in the noise via some matrix-variate priors. The MTRL method has been extended to deep neural networks in [97] by placing a tensor normal distribution as a prior on the parameter tensor in the fully connected layers.
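To make the role of the task covariance in MTRL more tangible, the sketch below evaluates a regularizer of the form $\mathrm{tr}(\mathbf{W}\boldsymbol{\Omega}^{-1}\mathbf{W}^T)$ and computes the covariance that minimizes it under a unit-trace constraint for a fixed $\mathbf{W}$. The closed form $\boldsymbol{\Omega}=(\mathbf{W}^T\mathbf{W})^{1/2}/\mathrm{tr}((\mathbf{W}^T\mathbf{W})^{1/2})$ is a standard result that is not stated in the text above and is used here as an assumption; the function names and the random `W` are hypothetical, and the example assumes $\mathbf{W}^T\mathbf{W}$ is nonsingular.

```python
import numpy as np


def mtrl_regularizer(W, Omega):
    """tr(W Omega^{-1} W^T) for a d x m matrix W and an m x m covariance Omega."""
    return float(np.trace(W @ np.linalg.solve(Omega, W.T)))


def optimal_task_covariance(W):
    """Matrix square root of W^T W, rescaled to unit trace."""
    evals, evecs = np.linalg.eigh(W.T @ W)
    sqrt_evals = np.sqrt(np.clip(evals, 0.0, None))
    root = (evecs * sqrt_evals) @ evecs.T
    return root / np.trace(root)


rng = np.random.default_rng(0)
d, m = 6, 3
W = rng.normal(size=(d, m))              # hypothetical parameters, one column per task

Omega_opt = optimal_task_covariance(W)
reg_opt = mtrl_regularizer(W, Omega_opt)
reg_uniform = mtrl_regularizer(W, np.eye(m) / m)   # unit-trace baseline: unrelated tasks
```

At the minimizer the regularizer equals the squared trace norm of `W`, which is never larger than its value under the uniform baseline covariance. The off-diagonal entries of `Omega_opt` can then be read as learned task covariances.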

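Different from the aforementioned methods, which investigate the use of global learning models in MTL, Zhang [98] aims to learn the task relations in local learning methods such as the $k$-nearest-neighbor ($k$NN) classifier by defining the learning function as a weighted voting of neighbors:

$$
f(\mathbf{x}^i_j)=\sum_{(p,q)\in N_k(i,j)}\sigma_{ip}\,s(i,j;p,q)\,y^p_q, \tag{23}
$$

where $N_k(i,j)$ denotes the set of task indices and instance indices for the $k$ nearest neighbors of $\mathbf{x}^i_j$, i.e., $(p,q)\in N_k(i,j)$ means that $\mathbf{x}^p_q$ is one of the $k$ nearest neighbors of $\mathbf{x}^i_j$, $s(i,j;p,q)$ defines the similarity between $\mathbf{x}^p_q$ and $\mathbf{x}^i_j$, and $\sigma_{ip}$ represents the contribution of task $\mathcal{T}_p$ to $\mathcal{T}_i$ when $\mathcal{T}_p$ has some data points that are neighbors of a data point in $\mathcal{T}_i$. $\sigma_{ip}$ can be viewed as the similarity from $\mathcal{T}_p$ to $\mathcal{T}_i$. When $\sigma_{ip}=1$ for all $i$ and $p$, Eq. (23) reduces to the decision function of the $k$NN classifier for all the tasks. Then the objective function to learn $\boldsymbol{\Sigma}$, which is an $m\times m$ matrix with $\sigma_{ip}$ as its $(i,p)$th entry, can be formulated as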

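$$
\min_{\boldsymbol{\Sigma}}\ \sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i}l\left(y^i_j,f(\mathbf{x}^i_j)\right)+\frac{\lambda_1}{4}\|\boldsymbol{\Sigma}-\boldsymbol{\Sigma}^T\|_F^2+\frac{\lambda_2}{2}\|\boldsymbol{\Sigma}\|_F^2\quad\text{s.t. }\sigma_{ii}\ge 0\ \forall i\in[m],\ -\sigma_{ii}\le\sigma_{ij}\le\sigma_{ii}\ \forall i\neq j. \tag{24}
$$

The first regularizer in problem (24) enforces $\boldsymbol{\Sigma}$ to be nearly symmetric, to a degree that depends on $\lambda_1$, and the second one penalizes the complexity of $\boldsymbol{\Sigma}$. The constraints in problem (24) make sure that the similarity from one task to itself is positive and also the largest. Similarly, a multi-task kernel regression is proposed in [98] for regression problems.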

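While the task relations in the aforementioned methods are symmetric, except for [98], Lee et al. [99] focus on learning asymmetric task relations. Since different tasks are assumed to be related, $\mathbf{w}^i$ can lie in the space spanned by the columns of $\mathbf{W}$, i.e., $\mathbf{w}^i\approx\mathbf{W}\mathbf{a}^i$, and hence we have $\mathbf{W}\approx\mathbf{W}\mathbf{A}$. Here the matrix $\mathbf{A}$ can be viewed as asymmetric task relations between pairs of tasks. By assuming that $\mathbf{A}$ is sparse, the objective function is formulated as

$$
\min_{\mathbf{W},\mathbf{b},\mathbf{A}\ge\mathbf{0}}\ \sum_{i=1}^{m}\left(1+\lambda_1\|\hat{\mathbf{a}}_i\|_1\right)\frac{1}{n_i}\sum_{j=1}^{n_i}l\left(y^i_j,(\mathbf{w}^i)^T\mathbf{x}^i_j+b_i\right)+\lambda_2\|\mathbf{W}-\mathbf{W}\mathbf{A}\|_F^2, \tag{25}
$$

where $\hat{\mathbf{a}}_i$ denotes the $i$th row of $\mathbf{A}$ with $a_{ii}$ deleted. The term before the training loss of each task, i.e., $1+\lambda_1\|\hat{\mathbf{a}}_i\|_1$, not only enforces $\mathbf{A}$ to be sparse but also allows asymmetric information transfer from easier tasks to difficult ones. The regularizer in problem (25) makes $\mathbf{W}$ approach $\mathbf{W}\mathbf{A}$ with the closeness depending on $\lambda_2$. To see the connection between problems (25) and (21), we rewrite the regularizer in problem (25) as $\lambda_2\|\mathbf{W}-\mathbf{W}\mathbf{A}\|_F^2=\lambda_2\,\mathrm{tr}\left(\mathbf{W}(\mathbf{I}-\mathbf{A})(\mathbf{I}-\mathbf{A})^T\mathbf{W}^T\right)$. Based on this reformulation, the regularizer in problem (25) is a special case of that in problem (21) by assuming $\boldsymbol{\Omega}^{-1}=(\mathbf{I}-\mathbf{A})(\mathbf{I}-\mathbf{A})^T$, where $\mathbf{A}$ is a nonnegative matrix. Even though $\mathbf{A}$ is asymmetric, from the perspective of the regularizer, the task relations here are symmetric and act as a task precision matrix which has a restrictive form.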
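The rewriting of the regularizer as a trace with a symmetric precision-like matrix can be verified numerically. The sketch below assumes the regularizer is the squared Frobenius distance between the parameter matrix and its product with a nonnegative relation matrix; the random matrices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 4
W = rng.normal(size=(d, m))              # one column of parameters per task
A = np.abs(rng.normal(size=(m, m)))      # nonnegative, asymmetric relation matrix
np.fill_diagonal(A, 0.0)                 # no self-transfer

frobenius_form = np.linalg.norm(W - W @ A, "fro") ** 2

I = np.eye(m)
precision = (I - A) @ (I - A).T          # symmetric even though A is not
trace_form = float(np.trace(W @ precision @ W.T))
```

Both forms give the same value, and `precision` is symmetric positive semidefinite by construction, which is why the relations seen by the regularizer are symmetric even when `A` itself is not.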

### 2.5 Decomposition Approach

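The decomposition approach assumes that the parameter matrix $\mathbf{W}$ can be decomposed into two or more component matrices $\{\mathbf{W}_k\}_{k=1}^{h}$ where $h\ge 2$, i.e., $\mathbf{W}=\sum_{k=1}^{h}\mathbf{W}_k$. The objective functions of most methods in this approach can be unified as

$$
\min_{\{\mathbf{W}_k\}\in C_{\mathbf{W}},\,\mathbf{b}}\ L\left(\sum_{k=1}^{h}\mathbf{W}_k,\mathbf{b}\right)+\sum_{k=1}^{h}g_k(\mathbf{W}_k), \tag{26}
$$

where the regularizer is decomposable with respect to the $\mathbf{W}_k$’s and $C_{\mathbf{W}}$ denotes a set of constraints for the component matrices. To help understand problem (26), we introduce several instantiations.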

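In [100], where $h$ equals 2 and $C_{\mathbf{W}}$ is an empty set, $g_1(\cdot)$ and $g_2(\cdot)$ are defined as

$$
g_1(\mathbf{W}_1)=\lambda_1\|\mathbf{W}_1\|_{\infty,1},\quad g_2(\mathbf{W}_2)=\lambda_2\|\mathbf{W}_2\|_1,
$$

where $\lambda_1$ and $\lambda_2$ are positive regularization parameters. Similar to problem (5), each row of $\mathbf{W}_1$ is likely to be a zero row and hence $g_1(\mathbf{W}_1)$ can help select important features. Due to the $\ell_1$ norm regularization, $g_2(\mathbf{W}_2)$ makes $\mathbf{W}_2$ sparse. Because of the characteristics of the two regularizers, the parameter matrix $\mathbf{W}$ can eliminate unimportant features for all the tasks when the corresponding rows in both $\mathbf{W}_1$ and $\mathbf{W}_2$ are zero. Moreover, $\mathbf{W}_2$ can identify features for tasks which have their own useful features and may be outliers for the other tasks. Hence this model can be viewed as a ‘robust’ version of problem (5).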
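A small numerical example may help to see how a row-sparse component and an entry-wise sparse component divide the work. The sketch below assumes the first regularizer is the $\ell_{\infty,1}$ norm (the sum over rows of the largest absolute entry) and the second is the entry-wise $\ell_1$ norm; the matrices, the regularization weights and the function names are hypothetical.

```python
import numpy as np


def linf1_norm(W):
    """Sum over rows (features) of the largest absolute entry in each row."""
    return float(np.abs(W).max(axis=1).sum())


def l1_norm(W):
    """Sum of the absolute values of all entries."""
    return float(np.abs(W).sum())


def dirty_penalty(W1, W2, lam1, lam2):
    """Decomposable regularizer g_1(W1) + g_2(W2) for a two-component model."""
    return lam1 * linf1_norm(W1) + lam2 * l1_norm(W2)


# Rows are features, columns are tasks.
W1 = np.array([[1.0, -2.0, 0.5],         # feature 0 shared by all tasks
               [0.0, 0.0, 0.0],          # feature 1 unused by the shared part
               [3.0, 1.0, -1.0]])        # feature 2 shared by all tasks
W2 = np.array([[0.0, 0.0, 0.0],
               [0.0, 4.0, 0.0],          # feature 1 is useful only for task 1
               [0.0, 0.0, 0.0]])
W = W1 + W2
penalty = dirty_penalty(W1, W2, lam1=0.1, lam2=0.2)
```

Here the shared component pays once per used feature regardless of how many tasks use it, while the sparse component pays for each task-specific entry, so a feature needed by a single task is cheaper to place in `W2`.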

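With two component matrices, Chen et al. [101] define

$$
g_1(\mathbf{W}_1)=\lambda_1\|\mathbf{W}_1\|_1,\quad g_2(\mathbf{W}_2)=0, \tag{27}
$$

where $C_{\mathbf{W}}=\{\mathbf{W}_2\mid\|\mathbf{W}_2\|_{S(1)}\le\lambda_2\}$. Similar to problem (14), $C_{\mathbf{W}}$ makes $\mathbf{W}_2$ low-rank. With a sparse regularizer $g_1(\mathbf{W}_1)$, $\mathbf{W}_1$ makes the entire model matrix $\mathbf{W}$ more robust to outlier tasks in a way similar to [100]. When $\lambda_1$ is large enough, $\mathbf{W}_1$ will become a zero matrix and hence problem (27) will act similarly to problem (14).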

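The $g_k(\cdot)$’s in [102] with $C_{\mathbf{W}}=\emptyset$ are defined as

$$
g_1(\mathbf{W}_1)=\lambda_1\|\mathbf{W}_1\|_{S(1)},\quad g_2(\mathbf{W}_2)=\lambda_2\|\mathbf{W}_2^T\|_{2,1}. \tag{28}
$$

Different from the above two models, which assume that $\mathbf{W}_2$ is sparse, here $g_2(\mathbf{W}_2)$ enforces $\mathbf{W}_2$ to be column-sparse. For related tasks, their columns in $\mathbf{W}_1$ are correlated via the trace norm regularization and the corresponding columns in $\mathbf{W}_2$ are zero. For outlier tasks, which are unrelated to the other tasks, the corresponding columns in $\mathbf{W}_2$ can take arbitrary values and hence the model parameters for them in $\mathbf{W}$ have no low-rank structure even though those in $\mathbf{W}_1$ may have.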

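In [103], these functions are defined as

$$
g_1(\mathbf{W}_1)=\lambda_1\|\mathbf{W}_1\|_{2,1},\quad g_2(\mathbf{W}_2)=\lambda_2\|\mathbf{W}_2^T\|_{2,1},\quad C_{\mathbf{W}}=\emptyset. \tag{29}
$$

Similar to problem (4), $g_1(\mathbf{W}_1)$ makes $\mathbf{W}_1$ row-sparse. Here $g_2(\mathbf{W}_2)$ is identical to that in [102] and it makes $\mathbf{W}_2$ column-sparse. Hence $\mathbf{W}_1$ helps select useful features while the non-zero columns in $\mathbf{W}_2$ capture outlier tasks.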

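With $h=2$, Zhong and Kwok [104] define

$$
g_1(\mathbf{W}_1)=\lambda_1\|\mathbf{W}_1\|_{clus}+\lambda_2\|\mathbf{W}_1\|_F^2,\quad g_2(\mathbf{W}_2)=\lambda_3\|\mathbf{W}_2\|_F^2,\quad C_{\mathbf{W}}=\emptyset,
$$

where $\|\mathbf{W}\|_{clus}=\sum_{i=1}^{d}\sum_{k>j}|w_{ij}-w_{ik}|$ with $w_{ij}$ as the $(i,j)$th entry in a matrix $\mathbf{W}$. Due to the sparse nature of the $\ell_1$ norm, $\|\mathbf{W}_1\|_{clus}$ enforces corresponding entries in different columns of $\mathbf{W}_1$ to be identical, which is equivalent to clustering tasks in terms of individual model parameters. The squared Frobenius norm regularizations in $g_1(\mathbf{W}_1)$ and $g_2(\mathbf{W}_2)$ penalize the complexities of $\mathbf{W}_1$ and $\mathbf{W}_2$, respectively. The use of $\mathbf{W}_2$ improves the model flexibility when not all the tasks exhibit a clear cluster structure.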

Different from the aforementioned methods, which have only two component matrices, an arbitrary number of component matrices are considered in [105] with

$$
g_k(\mathbf{W}_k)=\frac{\lambda}{h-1}\left[(h-k)\|\mathbf{W}_k\|_{2,1}+(k-1)\|\mathbf{W}_k\|_1\right], \tag{30}
$$

where $C_{\mathbf{W}}=\emptyset$. According to Eq. (30), $\mathbf{W}_k$ is assumed to be both sparse and row-sparse for all $k\in[h]$. Based on the different regularization parameters in the regularizer of $\mathbf{W}_k$, we can see that when $k$ increases, $\mathbf{W}_k$ is more likely to be sparse than to be row-sparse. Even though each $\mathbf{W}_k$ is sparse or row-sparse, the entire parameter matrix $\mathbf{W}$ can be non-sparse and hence this model can discover the latent sparse structure among multiple tasks.

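In the above methods, different component matrices have no direct connection. When there is a dependency among the component matrices, problem (26) can model more complex structure. For example, Han and Zhang [106] define

$$
g_k(\mathbf{W}_k)=\frac{\lambda}{\eta^{k-1}}\sum_{j>i}\|\mathbf{w}_k^i-\mathbf{w}_k^j\|_2,\quad C_{\mathbf{W}}=\left\{\{\mathbf{W}_k\}\ \middle|\ |\mathbf{w}_{k-1}^i-\mathbf{w}_{k-1}^j|\ge|\mathbf{w}_k^i-\mathbf{w}_k^j|\ \ \forall k\ge 2,\ \forall j>i\right\},
$$

where $\mathbf{w}_k^i$ denotes the $i$th column of $\mathbf{W}_k$. Note that the constraint set $C_{\mathbf{W}}$ relates the component matrices and the regularizer $g_k(\mathbf{W}_k)$ gives each pair of $\mathbf{w}_k^i$ and $\mathbf{w}_k^j$ a chance to become identical. Once this happens for some $i$, $j$, $k$, then based on the constraint set $C_{\mathbf{W}}$, $\mathbf{w}_{k'}^i$ and $\mathbf{w}_{k'}^j$ will always have the same value for $k'\ge k$. This corresponds to sharing all the ancestor nodes for two internal nodes in a tree and hence this method can learn a hierarchical structure to characterize task relations. When the constraints are removed, this method reduces to the multi-level task clustering method [67], which is a generalization of problem (17).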

Another way to relate different component matrices is to use a non-decomposable regularizer as in [107], which is slightly different from problem (26) in terms of the regularizer. Specifically, given $m$ tasks, there are $2^m-1$ possible non-empty task clusters. All the task clusters can be organized in a tree, where the root node represents a dummy node, the nodes in the second level represent groups with a single task, and the parent-child relations are the ‘subset of’ relation. In total, there are $h=2^m$ component matrices, each of which corresponds to a node in the tree, and hence a single index is used to denote both a level and the corresponding node in the tree. The objective function is formulated as

(31) |

where takes a value between 1 and 2, denotes the set of all the descendants of