1 Introduction
There have been two main lines of work in multitask learning: first, learning a shared feature representation across all tasks, leveraging low-dimensional subspaces in the feature space [1, 9, 17, 23]; second, learning the relationships between the tasks to improve the performance of related tasks [29, 28, 19, 26]. Pairwise task relationships provide very useful information for characterizing tasks and transferring information between similar ones.
Despite the expressive power of these two research directions, the learning space is restricted to a single kernel (per task), chosen by the user, that corresponds to a fixed feature space. Multiple Kernel Learning (MKL), on the other hand, allows the user to specify a family of base kernels related to an application and to use the training data to automatically learn their optimal combination. The weights of the base kernels are learned along with the model parameters in a single joint optimization problem. A large body of recent work addresses several aspects of this problem, such as efficient learning of the kernel weights, fast optimization, and better theoretical guarantees [22, 11, 12, 15, 4, 2, 18].
Recent work in multiple kernel learning in a multitask framework focuses on sharing common representations and assumes that the tasks are all related [10]. The motivation for this approach stems from multitask feature learning, which learns a joint feature representation shared across multiple tasks [1, 23]. Unfortunately, the assumption that all tasks are related and share a common feature representation is too restrictive for many real-world applications. Similarly, based on previous work [29], one can extend traditional multitask relationship learning (MTRL) with multiple task-specific base kernels. There are two main problems with such a naive approach. First, the unknown variables (task model parameters, kernel weights, and task relationship matrix) are intertwined in the optimization problem, making it difficult to learn for large-scale applications. Second, the task relationship matrix is learned in the original feature space rather than in the kernel spaces. We show in this paper that learning the relationships between the kernel spaces empirically performs better than learning relations among the original feature spaces.
There have been a few attempts to impose higher-order relationships between kernel spaces via the kernel weights. Kloft et al. [12] propose a non-isotropic norm on the kernel weights to induce relationships between the base kernels in Reproducing Kernel Hilbert Spaces (RKHS). For example, in neuroimaging, a set of base kernels is derived from several medical imaging modalities such as MRI, PET, etc., or from image processing methods such as morphometric or anatomical modeling. Since some of the kernel functions share similar parameters, such as patient information or disease progression stage, we can expect these base kernels to be correlated based on how they were constructed. Such information can be obtained from medical domain experts as part of the disease prognosis and then used as prior knowledge. Previous work either assumes the kernel interaction matrix is diagonal or requires prior knowledge from experts on the interaction of kernels [12, 8]. Unfortunately, such prior knowledge is not easily available in many applications, because eliciting it is time-consuming or expensive [13]. In such applications, we want to induce this relationship matrix from the data along with the kernel weights and model parameters.
This paper addresses these problems with a novel regularization-based approach to the multitask multiple kernel learning framework, called multitask multiple kernel relationship learning (MKMTRL), which models the task relationship matrix from the weights learned over the latent feature spaces of the task-specific base kernels. The idea is to automatically infer task relationships in reproducing kernel Hilbert spaces (RKHS) from their base kernels. We first propose an alternating minimization algorithm to learn the model parameters, kernel weights, and task relationship matrix. The method uses a wrapper approach that efficiently leverages any off-the-shelf SVM solver (or any kernel machine) to learn the task model parameters. However, like previous work, the proposed iterative algorithm suffers from scalability challenges: its runtime grows with the number of tasks and the number of base kernels per task, as it needs all the base kernels in memory to learn the kernel weights and the task relationship matrix.
For large-scale applications such as object detection, we introduce a novel two-stage online learning algorithm, based on recent work [14], that learns the kernel weights independently from the model parameters. The first stage learns a good combination of base kernels in an online setting, and the second stage uses the learned weights to estimate a linear combination of the base kernels, which can be readily used with a standard kernel method such as SVM or kernel ridge regression [6, 5]. We provide strong empirical evidence that learning the task relationship matrix in RKHS is beneficial for many applications such as stock market prediction and visual object categorization. On all these applications, our proposed approach outperforms several state-of-the-art multitask learning baselines. It is worth noting that the proposed multitask multiple kernel relationship learning can be readily applied to heterogeneous and multiview data with no modification to the framework [7, 27]. The rest of the paper is organized as follows: we provide a brief overview of multitask multiple kernel learning in the next section. In Section 3, we discuss the proposed model MKMTRL, followed by our two-stage online learning approach in Section 4. We then show comprehensive evaluations of the proposed model against six baselines on several benchmark datasets in Section 6.
2 Preliminaries
Before introducing our approach, we briefly review the multitask multiple kernel learning framework. Suppose there are $T$ learning tasks with training sets $\{(x_t^i, y_t^i)\}_{i=1}^{n_t}$, $t = 1, \ldots, T$, where $x_t^i$ is the $i$-th sample from task $t$ and $y_t^i$ its corresponding output. Let $\{k_t^1, \ldots, k_t^K\}$ be a set of task-specific base kernels, induced by the kernel mapping functions $\{\phi_t^1, \ldots, \phi_t^K\}$ on the task data. The objective of the multitask multiple kernel learning problem is to learn a good linear combination of the task-specific base kernels using the relationships between the tasks.
In addition to the nonnegativity constraints on the kernel weights $\beta_t$, we need to impose an additional constraint or penalty to ensure that the units in which the margins are measured are meaningful (assuming that the base kernels are properly normalized). Recent work in MKL employs an $\ell_p$-norm constraint on the kernel weights. A direct extension of $\ell_p$-regularized MKL to the multitask framework is given as follows:¹

¹For clarity, we use binary classification tasks to explain the preliminaries and the proposed approach. They can easily be applied to multiclass tasks, and to regression tasks via kernel ridge regression.
$$\min_{\beta_t \ge 0,\; \|\beta_t\|_p^2 \le 1}\;\; \min_{w_t, b_t, \xi_t}\; \sum_{t=1}^{T} \Big[ \frac{1}{2} \sum_{k=1}^{K} \frac{\|w_t^k\|_{\mathcal{H}_k}^2}{\beta_t^k} + C \sum_{i=1}^{n_t} \xi_t^i \Big] \qquad (1)$$
$$\text{s.t.}\quad y_t^i \Big( \sum_{k=1}^{K} \langle w_t^k, \phi_t^k(x_t^i) \rangle + b_t \Big) \ge 1 - \xi_t^i,\quad \xi_t^i \ge 0 \quad \forall t, i$$
where $\mathcal{H}_k$ is the reproducing kernel Hilbert space associated with the $k$-th kernel function and $\|\cdot\|_{\mathcal{H}_k}^2$ is the squared RKHS norm.
Similarly, we can use a general $\ell_p$-norm constraint with $p \ge 1$ on the kernel weights $\beta_t$. This can be thought of as a simple extension of MKL to the multitask setting [11]. Without any additional structural constraints on $\beta_t$, the kernel weights are learned independently for each task, and thus the relationships between the tasks are not exploited. Hence, we call the model in equation (1) Independent Multiple Kernel Learning (IMKL).
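As a concrete illustration of the IMKL building block, the sketch below combines one task's precomputed base kernels under nonnegative, $\ell_p$-normalized weights. This is our own minimal sketch of the standard operation, not the authors' code; the function name and shapes are assumptions.

```python
import numpy as np

def combine_kernels(base_kernels, beta, p=2.0):
    """Combine one task's base kernels with nonnegative, l_p-normalized weights.

    base_kernels: array of shape (K, n, n), K precomputed Gram matrices.
    beta: weight vector of shape (K,).
    Returns the combined (n, n) kernel and the projected weights.
    Illustrative sketch only; names and shapes are our assumptions.
    """
    beta = np.maximum(beta, 0.0)                 # enforce beta >= 0
    norm = np.sum(beta ** p) ** (1.0 / p)
    if norm > 0:
        beta = beta / norm                       # scale onto the l_p sphere
    combined = np.einsum('k,kij->ij', beta, base_kernels)
    return combined, beta
```

With $p = 2$ and weights $(3, 4)$, the projected weights become $(0.6, 0.8)$, and the combined kernel is the corresponding weighted sum of the Gram matrices.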
Jawanpuria and Nath [10] proposed Multitask Multiple Kernel Feature Learning (MKMTFL), which employs a mixed-norm regularizer over the RKHS norms of the feature loadings corresponding to the tasks and the base kernels. The mixed-norm regularization promotes a shared feature representation for combining the given set of task-specific base kernels: the $\ell_q$-norm regularizer across tasks learns an unequal weighting of the tasks, whereas the $\ell_1$-norm over the base kernels leads to a kernel shared among the tasks.
The objective function for MKMTFL is given as follows:
$$\min_{w_t, b_t, \xi_t}\; \frac{1}{2} \sum_{k=1}^{K} \Big( \sum_{t=1}^{T} \|w_t^k\|_{\mathcal{H}_k}^{2q} \Big)^{\frac{1}{q}} + C \sum_{t=1}^{T} \sum_{i=1}^{n_t} \xi_t^i \qquad (2)$$
$$\text{s.t.}\quad y_t^i \Big( \sum_{k=1}^{K} \langle w_t^k, \phi_t^k(x_t^i) \rangle + b_t \Big) \ge 1 - \xi_t^i,\quad \xi_t^i \ge 0 \quad \forall t, i$$
Note that the above objective function employs an $\ell_1$-norm across the base kernels and an $\ell_q$-norm across the tasks. The above optimization problem can be equivalently written in the dual space as follows:
$$\min_{\lambda \in \Delta_K}\; \min_{\gamma_t \in \Delta_K}\; \max_{\alpha_t \in S_t}\; \sum_{t=1}^{T} \Big[ \mathbf{1}^\top \alpha_t - \frac{1}{2}\, \alpha_t^\top Y_t \Big( \sum_{k=1}^{K} \lambda_k \gamma_t^k K_t^k \Big) Y_t \alpha_t \Big] \qquad (3)$$

where $\Delta_K$ denotes the $K$-dimensional probability simplex and $S_t = \{\alpha_t : \alpha_t^\top y_t = 0,\; 0 \le \alpha_t \le C\}$. Here $\alpha_t$ is the vector of Lagrangian multipliers for the $t$-th task, and $S_t$ corresponds to the constraints on the $t$-th task's data. $Y_t$ is a diagonal matrix with entries $y_t^i$, and $K_t^k$ is the Gram matrix of the $t$-th task's data w.r.t. the $k$-th kernel function. More specifically, $\lambda$ selects the base kernels that are important for all the tasks, whereas $\gamma_t$ selects the base kernels that are specific to individual tasks. With this representation, MKMTFL can be seen as a multiple kernel generalization of the multilevel multitask learning proposed by Lozano and Swirszcz (2012) [23].

3 Multitask Multiple Kernel Relationship Learning (MKMTRL)
This section presents the details of the proposed model MKMTRL. Since multitask learning seeks to improve the performance of each task with the help of other related tasks, it is desirable for multiple kernel learning in the multitask framework to place a structural constraint on the task kernel weights that promotes sharing of information between related tasks. Note that the proposed approach is significantly different from traditional MTRL, as explained in the introduction.
When prior knowledge on task relationships is available, a multiple kernel multitask learning model should incorporate this information when simultaneously learning several related tasks. Neither IMKL nor MKMTFL considers pairwise task relationships such as positive task correlation, negative task correlation, and task independence when learning the kernel weights for combining the base kernels. Based on the assumption that similar tasks are likely to give similar importance to their base kernels (and thereby their respective RKHS), we consider a regularization $\mathrm{tr}(\beta^\top \Omega^{-1} \beta)$ on the task kernel weights, where, for notational convenience, we write $\beta = [\beta_1, \ldots, \beta_T]^\top \in \mathbb{R}^{T \times K}$ and $\Omega \in \mathbb{R}^{T \times T}$ denotes the task relationship matrix. Mathematically, the proposed MKMTRL formulation is written as follows:
$$\min_{\substack{\Omega \succeq 0 \\ \mathrm{tr}(\Omega) = 1}}\; \min_{\beta_t \ge 0}\; \min_{w_t, b_t, \xi_t}\; \sum_{t=1}^{T} \Big[ \frac{1}{2} \sum_{k=1}^{K} \frac{\|w_t^k\|_{\mathcal{H}_k}^2}{\beta_t^k} + C \sum_{i=1}^{n_t} \xi_t^i \Big] + \frac{\lambda}{2}\, \mathrm{tr}(\beta^\top \Omega^{-1} \beta) \qquad (4)$$
$$\text{s.t.}\quad y_t^i \Big( \sum_{k=1}^{K} \langle w_t^k, \phi_t^k(x_t^i) \rangle + b_t \Big) \ge 1 - \xi_t^i,\quad \xi_t^i \ge 0 \quad \forall t, i$$
The key difference from the IMKL model is that the standard (squared) $\ell_2$-norm on $\beta_t$ is replaced with a more meaningful structural penalty that incorporates the task relationships. Unlike in MKMTFL, the information shared among the tasks is separated from the core problem (the per-task SVMs). Here, $\Omega$ encodes the task relationships such that similar tasks are forced to have similar kernel weights. It is easy to see that when $\Omega = I_T$, the above problem reduces to equation (1).
3.1 MKMTRL in Dual Space
In this section, we consider the proposed approach in the dual space. By writing the above objective function in Lagrangian form and introducing Lagrangian multipliers $\alpha_t$ for the constraints, we obtain the corresponding dual problem:
$$\min_{\substack{\Omega \succeq 0 \\ \mathrm{tr}(\Omega) = 1}}\; \min_{\beta_t \ge 0}\; \max_{\alpha_t \in S_t}\; \sum_{t=1}^{T} \Big[ \mathbf{1}^\top \alpha_t - \frac{1}{2} \sum_{k=1}^{K} \beta_t^k\, \alpha_t^\top Y_t K_t^k Y_t \alpha_t \Big] + \frac{\lambda}{2}\, \mathrm{tr}(\beta^\top \Omega^{-1} \beta) \qquad (5)$$

where $S_t = \{\alpha_t : \alpha_t^\top y_t = 0,\; 0 \le \alpha_t \le C\}$.
The above objective function is a biconvex optimization problem. Note that we can further reduce the problem by eliminating $\beta$; the dual problem then becomes:
$$\min_{\substack{\Omega \succeq 0 \\ \mathrm{tr}(\Omega) = 1}}\; \max_{\alpha_t \in S_t}\; \sum_{t=1}^{T} \mathbf{1}^\top \alpha_t - \frac{1}{8\lambda}\, \mathrm{tr}(Q^\top \Omega Q) \qquad (6)$$

where $Q \in \mathbb{R}^{T \times K}$ with $Q_{tk} = \alpha_t^\top Y_t K_t^k Y_t \alpha_t$, which corresponds to $\|w_t^k\|_{\mathcal{H}_k}^2 / (\beta_t^k)^2$ in the primal space. We will use this representation to derive a closed-form solution for the task kernel weights.
3.2 Optimization
We use an alternating minimization procedure to learn the kernel weights and the model parameters iteratively, implemented as the two-layer wrapper approach commonly used in MKL solvers. The wrapper alternates between minimizing the primal problem (4) w.r.t. $\beta$ and $\Omega$ via simple analytical update steps and minimizing over the remaining variables in terms of the dual variables from equation (5).
When $\beta$ and $\Omega$ are fixed, MKMTRL (equation (5)) reduces to $T$ independent subproblems. One can use any conventional SVM solver (or any kernel method) to optimize each $\alpha_t$ independently. We focus on optimizing the kernel coefficients $\beta$ and the task relationship matrix $\Omega$ next.
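To make the per-task step concrete, the sketch below solves one task's subproblem with kernel ridge regression on the combined kernel, which the paper notes is interchangeable with an SVM solver here. This is a stand-in we wrote for illustration, not the paper's solver; the regularizer `lam` and function names are assumptions.

```python
import numpy as np

def fit_krr(K, y, lam=1e-6):
    """One task's kernel-machine step: solve (K + lam*I) alpha = y.

    K: combined (n, n) Gram matrix for this task; y: targets of shape (n,).
    A kernel ridge regression stand-in for the per-task SVM call;
    any off-the-shelf kernel machine could be used instead.
    """
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict_krr(K_test_train, alpha):
    """Predict with dual variables: K(test, train) @ alpha."""
    return K_test_train @ alpha
```

In the wrapper loop, this call is made once per task per outer iteration, with the combined kernel rebuilt from the current weights $\beta_t$.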
Optimizing w.r.t. $\beta$ when $\alpha$ and $\Omega$ are fixed

Given $\alpha$ and $\Omega$, we find $\beta$ by setting the gradient of equation (4) w.r.t. $\beta$ to zero, which gives:

$$\beta = \frac{1}{2\lambda}\, \Omega Q \qquad (7)$$

where $Q_{tk} = (\alpha_t \circ y_t)^\top K_t^k (\alpha_t \circ y_t)$ and $\circ$ is an elementwise product operation.
By incorporating the last term of equation (4) into the constraint set, we can eliminate the regularization parameter $\lambda$ and obtain an analytical solution for $\beta$. Because $Q \ge 0$ and $\Omega \succeq 0$, the constraint $\mathrm{tr}(\beta^\top \Omega^{-1} \beta) \le 1$ must be active at optimality. We can now use the above equation to solve for $\beta$:

$$\beta = \frac{\Omega Q}{\sqrt{\mathrm{tr}(Q^\top \Omega Q)}} \qquad (8)$$
Since the task relationship matrix $\Omega$ is independent of the number of base kernels $K$, one may use the above closed-form solution when the number of tasks is small. For some applications, it may be desirable to employ an iterative approach instead, such as a first-order method (FISTA) or a second-order method (Newton's). The parameter $\lambda$ can be easily learned by cross-validation.
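The closed-form weight update, as we reconstruct equation (8) above, can be sketched in a few lines; the scaling makes the relationship penalty exactly tight, i.e. $\mathrm{tr}(\beta^\top \Omega^{-1} \beta) = 1$. Names and the exact form are our reconstruction, not the authors' code.

```python
import numpy as np

def update_beta(Q, Omega):
    """Closed-form kernel-weight update (our reconstruction of eq. (8)).

    Q: (T, K) matrix with Q[t, k] = (alpha_t * y_t)^T K_t^k (alpha_t * y_t).
    Omega: (T, T) positive-definite task relationship matrix.
    Rescales Omega @ Q so that tr(beta^T Omega^{-1} beta) = 1.
    """
    B = Omega @ Q
    scale = np.sqrt(np.trace(Q.T @ Omega @ Q))
    return B / scale
```

Since the scaling only needs $\mathrm{tr}(Q^\top \Omega Q)$, no matrix inverse is required in this step, which keeps it cheap relative to the per-task SVM calls.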
Optimizing w.r.t. $\Omega$ when $\alpha$ and $\beta$ are fixed

In the final step of the optimization, we fix $\alpha$ and $\beta$ and solve the problem w.r.t. $\Omega$. By taking the partial derivative of the objective function with respect to $\Omega$ and setting it to zero, we get an analytical solution for $\Omega$ [29]:

$$\Omega = \frac{(\beta \beta^\top)^{\frac{1}{2}}}{\mathrm{tr}\big((\beta \beta^\top)^{\frac{1}{2}}\big)} \qquad (9)$$
Substituting the above solution into equation (4), we can see that the objective function of MKMTRL is related to trace-norm regularization. Instead of $\ell_p$-norm regularization (as in MKL) or mixed-norm regularization (as in MKMTFL), our model seeks a low-rank $\beta$, via $\Omega$, such that similar base kernels are selected among similar tasks.
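The relationship update in equation (9) amounts to a trace-normalized matrix square root, which can be computed via an eigendecomposition of the symmetric PSD matrix $\beta\beta^\top$. The sketch below is our illustration of that computation, assuming the update as reconstructed above; it is not the authors' implementation.

```python
import numpy as np

def update_omega(beta):
    """Task-relationship update (eq. (9)): (beta beta^T)^{1/2} / tr(...).

    beta: (T, K) matrix of task kernel weights, assumed nonzero.
    Uses eigendecomposition to take the PSD matrix square root.
    """
    M = beta @ beta.T
    w, V = np.linalg.eigh(M)             # M is symmetric PSD
    w = np.clip(w, 0.0, None)            # guard tiny negative eigenvalues
    root = (V * np.sqrt(w)) @ V.T        # matrix square root of M
    return root / np.trace(root)
```

By construction the result is symmetric, positive semidefinite, and has unit trace, matching the constraint set on $\Omega$.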
4 Two-Stage Multitask Multiple Kernel Relationship Learning
The optimization procedure in the previous section involves $T$ independent SVM (or kernel ridge regression) calls, followed by two closed-form expressions, for jointly learning the kernel weights $\beta$, the task relationship matrix $\Omega$, and the task parameters. Even though this approach is simple and easy to implement, it requires the precomputed kernel matrices to be loaded into memory for learning the kernel weights, which adds a serious computational burden, especially when the number of tasks is large [25].
In this section, we consider an alternative approach to address this problem, inspired by [6, 5]. It follows a two-stage procedure: first, we independently learn the weights of the given task-specific base kernels using the training data; then, we use the weighted sum of these base kernels in a standard kernel machine, such as SVM or kernel ridge regression, to obtain a classifier. This approach significantly reduces the computational overhead of traditional multiple kernel learning algorithms, which estimate the kernel weights and the classifier by solving a joint optimization problem.
We propose an efficient binary classification framework for learning the weights of these task-specific base kernels, based on kernel-target alignment [6]. The framework formulates the kernel learning problem as linear classification in the kernel space (the so-called $\mathcal{K}$-space classifier). In this space, any task classifier with weight parameters $\beta_t$ directly corresponds to the task kernel weights.
For a given set of base kernels ($K$ base kernels per task), we define a binary classification framework over a new instance space (the so-called $\mathcal{K}$-space) defined as follows:

$$z_t^{ij} = \big(k_t^1(x_t^i, x_t^j), \ldots, k_t^K(x_t^i, x_t^j)\big), \qquad \tilde{y}_t^{ij} = y_t^i y_t^j \qquad (16)$$

Any hypothesis $h_t$ for a task induces a similarity function between instances $x_t^i$ and $x_t^j$ in the original space: $\tilde{k}_t(x_t^i, x_t^j) = h_t(z_t^{ij})$.
Suppose we consider a linear function $h_t(z) = \langle \beta_t, z \rangle$ with nonnegativity constraints $\beta_t \ge 0$ for our task hypothesis; then the resulting induced kernel is also positive semidefinite. The key idea behind this two-stage approach is that if $h_t$ is a good classifier in the $\mathcal{K}$-space, then the induced kernel will likely be positive when $x_t^i$ and $x_t^j$ belong to the same class and negative otherwise. Thus the problem of learning a good combination of base kernels can be framed as the problem of learning a good $\mathcal{K}$-space classifier.
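The construction of the $\mathcal{K}$-space training set can be sketched directly: every pair $(i, j)$ becomes one $K$-dimensional example whose features are the base-kernel values and whose label is $y_i y_j$. This is our own illustration of the construction; the function name and array layout are assumptions.

```python
import numpy as np

def kspace_examples(base_kernels, y):
    """Build the kernel-space classification problem for one task.

    base_kernels: (K, n, n) Gram matrices; y: labels in {-1, +1} of shape (n,).
    Each pair (i, j) yields one example
        z_ij = (k^1(x_i, x_j), ..., k^K(x_i, x_j))  with label y_i * y_j.
    Returns Z of shape (n*n, K) and labels of shape (n*n,),
    where row m = i*n + j corresponds to the pair (i, j).
    """
    K, n, _ = base_kernels.shape
    Z = base_kernels.reshape(K, n * n).T        # one row per (i, j) pair
    labels = np.outer(y, y).reshape(n * n)      # y_i * y_j for every pair
    return Z, labels
```

A linear classifier with nonnegative weights trained on `(Z, labels)` then yields the kernel weights $\beta_t$ directly, as described above.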
With this framework, the optimization problem for learning $\beta_t$ for each task can be formulated as follows:
$$\min_{\beta_t \ge 0}\; \sum_{i,j} \ell\big(y_t^i y_t^j,\; \langle \beta_t, z_t^{ij} \rangle\big) + \lambda\, r(\beta_t) \qquad (17)$$
where $\ell$ is a convex loss function and $r$ is a regularization function on the kernel weights $\beta_t$. Since we are interested in learning task relationships using the task kernel weights, we can directly extend the above formulation to incorporate the relationship regularization from MKMTRL:
$$\min_{\substack{\Omega \succeq 0 \\ \mathrm{tr}(\Omega) = 1}}\; \min_{\beta_t \ge 0}\; \sum_{t=1}^{T} \sum_{i,j} \ell\big(y_t^i y_t^j,\; \langle \beta_t, z_t^{ij} \rangle\big) + \frac{\lambda}{2}\, \mathrm{tr}(\beta^\top \Omega^{-1} \beta) \qquad (18)$$
Since the above objective function depends on every pair of observations, we adopt an online learning procedure for faster computation, which learns the kernel weights and the task relationship matrix sequentially. Due to space limitations, we present the online version of our algorithms in the supplementary material. Note that with the above formulation, one can also easily extend the existing approach to jointly learn both feature and task relationship matrices using a matrix-normal penalty [28].
5 Algorithms
Algorithm LABEL:alg:mkmtrl shows the pseudocode for MKMTRL. It outlines the update steps explained in Section 3. The algorithm alternates between learning the model parameters, kernel weights, and task relationship matrix until it reaches the maximum number of iterations, or until there are minimal changes in subsequent iterates.
The two-stage, online learning of MKMTRL is given in Algorithm LABEL:alg:2gmkmtrl. The online learning of $\beta$ and $\Omega$ is based on the recent work by Saha et al. [20]. We cap the maximum number of rounds. Since we construct the examples in the kernel space on the fly, there is no need to keep the base kernel matrices in memory, which significantly reduces the computational burden of the kernel weight updates.
We use LIBSVM to solve the individual SVM subproblems. All the base kernels are normalized to unit trace. Note that the update for $\Omega$ (equation (9)) requires computing a Singular Value Decomposition (SVD); one may use an efficient decomposition algorithm such as randomized SVD to speed up the learning process [16].

6 Experiments
Table 1: Average mean-squared prediction error on the asset return dataset.

Stock          | OLS  | Lasso | MRCE | FES  | STL  | IKL  | IMKL | MKMTFL | MKMTRL
Walmart        | 0.98 | 0.42  | 0.41 | 0.40 | 0.44 | 0.43 | 0.45 | 0.44   | 0.44
Exxon          | 0.39 | 0.31  | 0.31 | 0.29 | 0.34 | 0.32 | 0.33 | 0.32   | 0.32
GM             | 1.68 | 0.71  | 0.71 | 0.62 | 0.82 | 0.62 | 0.60 | 0.61   | 0.56
Ford           | 2.15 | 0.77  | 0.77 | 0.69 | 0.91 | 0.56 | 0.53 | 0.55   | 0.49
GE             | 0.58 | 0.45  | 0.45 | 0.41 | 0.43 | 0.41 | 0.40 | 0.40   | 0.39
ConocoPhillips | 0.98 | 0.79  | 0.79 | 0.79 | 0.84 | 0.81 | 0.83 | 0.80   | 0.80
Citigroup      | 0.65 | 0.66  | 0.62 | 0.59 | 0.64 | 0.66 | 0.62 | 0.62   | 0.60
IBM            | 0.62 | 0.49  | 0.49 | 0.51 | 0.48 | 0.47 | 0.45 | 0.45   | 0.43
AIG            | 1.93 | 1.88  | 1.88 | 1.74 | 1.91 | 1.94 | 1.88 | 1.89   | 1.83
AVG            | 1.11 | 0.72  | 0.71 | 0.67 | 0.76 | 0.69 | 0.68 | 0.68   | 0.65
We evaluate the performance of our proposed model on several benchmark datasets, comparing against state-of-the-art baselines in multitask learning and in multitask multiple kernel learning. All reported results in this section are averaged over random runs of the training data. Unless otherwise specified, all model parameters are chosen via 5-fold cross-validation. The best model and models with statistically comparable results are shown in bold.
6.1 Compared Models
We compare the following models for our evaluation.

Single-Task Learning (STL) learns the tasks independently, using either SVM (for binary classification tasks) or kernel ridge regression (for regression tasks) to learn the individual models.

Multitask Feature Learning (MTFL [1]) learns a shared feature representation across all tasks using $\ell_{2,1}$ regularization, alternating between learning the shared representation and the task model parameters (source code: http://ttic.uchicago.edu/~argyriou/code/mtl_feat/mtl_feat.tar).

Multitask Relationship Learning (MTRL [29]) learns a task relationship matrix under a regularization framework. This model can be viewed as a multitask generalization of regularized single-task learning. It learns the task relationship matrix and the task parameters in an iterative fashion (source code: https://www.cse.ust.hk/~zhangyu/codes/MTRL.zip).

Single-task Multiple Kernel Learning (IMKL) learns an independent MKL model for each task. This baseline does not share any information between the tasks. We use $\ell_p$-MKL for each task and tune the value of $p$ via cross-validation.

Multitask Multiple Kernel Feature Learning (MKMTFL [10]) learns a shared kernel-based feature representation from all tasks. This is a multiple kernel generalization of the multitask feature learning problem. Again, we tune the value of $q$ via cross-validation (source code: http://www.cse.iitb.ac.in/saketh/research/MTFL.tgz).
Unless otherwise specified, the kernels for STL, MTFL, and MTRL are chosen (via cross-validation) from Gaussian RBF kernels with different bandwidths and a linear kernel for each dataset. The regularization parameters $C$ and $\lambda$ are likewise tuned by cross-validation. We use Newton's method to learn the task kernel weight matrix in the alternating minimization algorithm. We compare our models on several applications: asset return prediction, landmine detection, and object recognition (see the supplementary material for additional experiments). It is worth noting that different applications require different types of base kernels; there is no common set of kernel functions that works for all applications, so we choose the base kernels based on the application and the type of data.
6.2 Asset Return Prediction
We begin our experiments with the asset return prediction data used in [19] (http://cran.r-project.org/web/packages/MRCE/index.html). It consists of weekly log returns of nine stocks over a single year, and has previously been used for linear multivariate regression with output covariance estimation [19]. We consider first-order vector autoregressive models of the form $y_{t+1} = B^\top y_t + \epsilon$, where $y_t$ is the vector of weekly log returns of the nine companies shown in Table 1. The dataset is split evenly, with the first half of the weeks used as the training set and the second half as the test set. Following [21], we use univariate Gaussian kernels with varying bandwidths, generated from each feature, as base kernels.
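The base-kernel generation described above, one Gaussian kernel per (feature, bandwidth) pair, can be sketched as follows. This is our illustration of the standard construction; the function name and shapes are assumptions.

```python
import numpy as np

def univariate_gaussian_kernels(X, bandwidths):
    """Generate base kernels: one Gaussian kernel per (feature, bandwidth) pair.

    X: data matrix of shape (n, d); bandwidths: iterable of positive floats.
    Returns an array of shape (d * len(bandwidths), n, n).
    """
    kernels = []
    for j in range(X.shape[1]):
        # pairwise squared distances on feature j alone
        diff2 = (X[:, j:j+1] - X[:, j:j+1].T) ** 2
        for s in bandwidths:
            kernels.append(np.exp(-diff2 / (2.0 * s ** 2)))
    return np.stack(kernels)
```

With $d$ features and $B$ bandwidths this yields $d \times B$ base kernels per task, each with a unit diagonal before trace normalization.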
Performance is measured by the average mean-squared prediction error over the test set for each task. The experimental setup for this dataset follows [19] exactly. We compare the results of our proposed and baseline models with the results of Ordinary Least Squares (OLS), Lasso, Multivariate Regression with Covariance Estimation (MRCE), and Factor Estimation and Selection (FES) reported in [19] (see [19] for details on these models). In addition to the standard baselines, we include Input Kernel Learning (IKL), which learns a vector of kernel weights shared by all tasks [24]. After running MKMTRL on these base kernels, the model sets most of the kernel weights to zero, except for the base kernels corresponding to a few selected bandwidths; these bandwidth selections capture the long-term and short-term dependencies common in temporal data. We reran the model with the selected nonzero bandwidths and report the results for these base kernels. We can see that the proposed model MKMTRL performs better than all the baselines.
Table 2: Average AUC on the landmine detection dataset with varying training set sizes. The table reports the mean and standard errors over 10 random runs.

Model  | 30 samples      | 50 samples      | 80 samples
STL    | 0.6315 ± 0.032  | 0.6540 ± 0.026  | 0.6542 ± 0.027
MTFL   | 0.6387 ± 0.037  | 0.6968 ± 0.015  | 0.7051 ± 0.020
MTRL   | 0.6555 ± 0.034  | 0.6933 ± 0.023  | 0.7074 ± 0.024
IMKL   | 0.6857 ± 0.024  | 0.7138 ± 0.011  | 0.7278 ± 0.011
MKMTFL | 0.6866 ± 0.018  | 0.7145 ± 0.009  | 0.7305 ± 0.009
MKMTRL | 0.6870 ± 0.033  | 0.7242 ± 0.011  | 0.7405 ± 0.014
6.3 Landmine Detection
This dataset (http://www.ee.duke.edu/~lcarin/LandmineData.zip) consists of 29 tasks collected from different landmine fields. Each task is a binary classification problem (landmines vs. clutter), and each example consists of nine features extracted from radar images: four moment-based features, three correlation-based features, one energy-ratio feature, and one spatial-variance feature. The landmine data are collected from two different terrains: tasks 1–15 are from highly foliated regions and tasks 16–29 are from desert regions, so the tasks naturally form two clusters. Any hypothesis learned from a task should be able to utilize the information available from other tasks belonging to the same cluster.
We choose a small number of examples per task for this dataset; we intentionally keep the training sets small to drive the need for learning from other tasks, which diminishes as the training sets per task become large. We use polynomial kernels of varying degree to generate our base kernels. Due to the class imbalance (few positive examples compared to negatives), we use the average Area Under the ROC Curve (AUC) as the performance measure. This dataset has previously been used for jointly learning feature correlations and task correlations [28], which makes it an ideal dataset for evaluating all the models.
Table 2 reports the results of the experiment. We can see that MKMTRL performs better in almost all cases. When the number of training examples is small, MKMTRL has difficulty learning the task relationship matrix, since it depends on the kernel weights; MKMTFL performs equally well there, as it shares the feature representation among the tasks, which is especially useful when the amount of training data is relatively low. As we get more training data, MKMTRL performs significantly better than all the other baselines.
6.4 Object Recognition
In this section, we evaluate our two proposed algorithms for MKMTRL on two computer vision datasets, Caltech101 (http://www.vision.ee.ethz.ch/~pgehler/projects/iccv09) and Oxford Flowers (http://www.robots.ox.ac.uk/~vgg/data/flowers/17/datasplits.mat), in terms of accuracy and training time. Caltech101 consists of images from 101 categories of objects such as faces, watches, animals, etc. The Caltech101 base kernels for each task are generated from feature descriptors such as geometric blur, PHOW gray/color, self-similarity, etc. For each class, we select a fixed number of examples and split them into training and testing folds, which ensures matching training and testing distributions. Oxford Flowers consists of 17 varieties of flowers with 80 images per category; its base kernels for each task are generated from a subset of feature values. Each one-vs-all binary classification problem is considered a single task, yielding one task per class with several base kernels per task. In addition to the baselines used before, we compare our algorithms with Multiple Kernel Learning by Stochastic Approximation (MKLSA) [3]. MKLSA has a formulation similar to that of MKMTFL, except that it learns a single shared set of kernel weights (fixing the task-specific weights in equation (3)). At each time step, it samples one task according to a multinomial distribution to update its model parameters, making it suitable for multitask learning with a large number of tasks.
The results for Caltech101 and Oxford are shown in Figure 1. The left plots show how the mean accuracy varies with the training set size; the right plots show the average training time taken by each model. From the plots, we can see that MKMTRL outperforms all the other state-of-the-art baselines on both Caltech101 and Oxford. However, the runtime of MKMTFL and MKMTRL grows steeply in the number of samples per class; similar results are observed when we increase the number of tasks or the number of base kernels per task.
Since both MKMTFL and MKMTRL require the base kernels in memory to iteratively learn the kernel weights and the task relationship matrix, they incur a serious computational burden, which motivates our efficient learning algorithm for multitask multiple kernel learning problems. We include MKMTRL with the two-stage, online procedure as one of the compared methods. On both Caltech101 and Oxford, the two-stage procedure yields performance comparable to that of MKMTRL.
The runtime of two-stage, online MKMTRL is significantly better than almost all the baselines. Since AVG simply averages the task-specific base kernels, it has the lowest computational time. It is interesting to see that two-stage, online MKMTRL performs better than MKLSA in both accuracy and running time. We believe this is because MKLSA updates the kernel weights after learning a single task's model parameters, so it takes more iterations to converge (in terms of both the model parameters and the kernel weights).
Table 3: Per-task nMSE on the SARCOS dataset for each degree of freedom (1st–7th DOF) and their average (AVG), for STL, IMKL, and MKMTRL.
6.5 Robot Inverse Dynamics
We consider the problem of learning the inverse dynamics of a 7-DOF SARCOS anthropomorphic robot arm (http://www.gaussianprocess.org/gpml/data/). The dataset has 28 dimensions, of which the first 21 are used as features and the last 7 as outputs; we add an additional feature to account for the bias. The feature set includes seven joint positions, seven joint velocities, and seven joint accelerations, which are used to predict the seven joint torques, one for each degree of freedom (DOF). This gives 7 regression tasks, and we use kernel ridge regression to learn the task parameters and kernel weights. We randomly sample examples and split them into training and test sets. This dataset has previously been shown to exhibit positive correlation, negative correlation, and task unrelatedness, making it a challenging problem for baselines that do not learn the task correlations.
Following [29], we use the normalized mean squared error (nMSE), which is the mean squared error divided by the variance of the ground truth. We generate base kernels from multivariate Gaussian kernels with varying bandwidths (based on the range of the data) and feature-wise linear kernels on each dimension; we use a linear kernel for single-task learning. The results for different training set sizes are reported in Figure 3. We can see that MKMTRL performs better than all the baselines. Contrary to the results reported in [10], MKMTFL performs the worst: as it sees more data, it even performs worse than single-task learning.
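The nMSE metric used above is simple enough to state in code; the sketch below is our own, with the function name as an assumption.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: MSE divided by the variance of the
    ground truth, as used for the SARCOS evaluation."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```

Note that predicting the ground-truth mean for every example gives an nMSE of exactly 1, so values below 1 indicate a model that explains some of the target variance.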
Moreover, we report the individual nMSE for each DOF in Table 3, which shows that MKMTRL consistently outperforms the baselines on all tasks. Comparing with the results reported in [29], MKMTRL also achieves a better average nMSE than MTFL and MTRL.
6.6 Exam Score Prediction
For completeness, we include results on a benchmark dataset for multitask regression (http://ttic.uchicago.edu/~argyriou/code/mtl_feat/school_splits.tar). The school dataset consists of examination scores of students from 139 schools in London. Each school is considered a task, and the feature set includes the year of the examination, four school-specific attributes, and three student-specific attributes. We replace each categorical attribute with one binary variable per possible attribute value, as in [1], and add an attribute to account for the bias term. We generate univariate Gaussian kernels with 13 varying bandwidths from each of the attributes as our base kernels. Training and test sets are obtained by splitting the examples of each task. We use explained variance, as in [1], defined as one minus the nMSE. We can see that MKMTRL is better than both IMKL and MKMTFL.

7 Conclusion
We proposed a novel multiple kernel multitask learning algorithm that uses inter-task relationships to efficiently learn the kernel weights. The key idea is that related tasks should have similar weights for their task-specific base kernels. We proposed an iterative algorithm to jointly learn the task relationship matrix, the kernel weights, and the task model parameters, and, for large-scale datasets, a novel two-stage online learning algorithm that learns the kernel weights efficiently. The effectiveness of our algorithms is empirically verified on several benchmark datasets: combining multiple kernel learning with task relationship learning significantly boosts performance on multitask problems.
Acknowledgements
We thank the anonymous reviewers for their helpful comments.
References
 [1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multitask feature learning. Machine Learning, 73(3):243–272, 2008.
 [2] Francis R Bach, Gert RG Lanckriet, and Michael I Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the twenty-first international conference on Machine learning, page 6. ACM, 2004.
 [3] Serhat Bucak, Rong Jin, and Anil K Jain. Multi-label multiple kernel learning by stochastic approximation: Application to visual object recognition. In Advances in Neural Information Processing Systems, pages 325–333, 2010.
 [4] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 247–254, 2010.
 [5] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 239–246, 2010.
 [6] N Cristianini. On kernel-target alignment. Advances in Neural Information Processing Systems, 2002.
 [7] Jingrui He and Rick Lawrence. A graph-based framework for multitask multi-view learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 25–32, 2011.
 [8] Chris Hinrichs, Vikas Singh, Jiming Peng, and Sterling Johnson. Q-MKL: Matrix-induced regularization in multi-kernel learning with applications to neuroimaging. In Advances in neural information processing systems, pages 1421–1429, 2012.
 [9] Ali Jalali, Sujay Sanghavi, Chao Ruan, and Pradeep K Ravikumar. A dirty model for multitask learning. In Advances in Neural Information Processing Systems, pages 964–972, 2010.
 [10] Pratik Jawanpuria and J Saketha Nath. Multitask multiple kernel learning. In Proceedings of the SIAM International Conference on Data Mining, page 828. Society for Industrial and Applied Mathematics, 2011.
 [11] Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, and Sören Sonnenburg. Efficient and accurate lp-norm multiple kernel learning. In Advances in neural information processing systems, pages 997–1005, 2009.
 [12] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Lp-norm multiple kernel learning. The Journal of Machine Learning Research, 12:953–997, 2011.
 [13] Meghana Kshirsagar, Jaime Carbonell, and Judith Klein-Seetharaman. Multitask learning for host–pathogen protein interactions. Bioinformatics, 29(13):i217–i226, 2013.
 [14] Abhishek Kumar, Alexandru Niculescu-Mizil, Koray Kavukcuoglu, and Hal Daume III. A binary classification framework for two-stage multiple kernel learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012.
 [15] Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27–72, 2004.
 [16] Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.
 [17] Jun Liu, Shuiwang Ji, and Jieping Ye. Multitask feature learning via efficient l2,1-norm minimization. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 339–348. AUAI Press, 2009.
 [18] Alain Rakotomamonjy, Francis Bach, Stéphane Canu, and Yves Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
 [19] Adam J Rothman, Elizaveta Levina, and Ji Zhu. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962, 2010.
 [20] Avishek Saha, Piyush Rai, Suresh Venkatasubramanian, and Hal Daume. Online learning of multiple tasks and their relationships. In International Conference on Artificial Intelligence and Statistics, pages 643–651, 2011.
 [21] Vikas Sindhwani, Minh Ha Quang, and Aurélie C Lozano. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and Granger causality. In Proceedings of the twenty-ninth conference on uncertainty in artificial intelligence. AUAI Press, 2013.
 [22] Zhaonan Sun, Nawanol Ampornpunt, Manik Varma, and S.V.N. Vishwanathan. Multiple kernel learning and the SMO algorithm. In Advances in neural information processing systems, pages 2361–2369, 2010.
 [23] Grzegorz Swirszcz and Aurelie C Lozano. Multi-level lasso for sparse multitask regression. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 361–368, 2012.
 [24] Lei Tang, Jianhui Chen, and Jieping Ye. On multiple kernel learning with multiple labels. In IJCAI, pages 1255–1260, 2009.
 [25] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
 [26] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multitask learning for classification with Dirichlet process priors. The Journal of Machine Learning Research, 8:35–63, 2007.
 [27] Jintao Zhang and Jun Huan. Inductive multitask learning with multiple view data. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 543–551. ACM, 2012.
 [28] Yi Zhang and Jeff G Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems, pages 2550–2558, 2010.
 [29] Yu Zhang and Dit-Yan Yeung. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.