1 Introduction
Multitask learning leverages shared structure among tasks to jointly build a better model for each task. Most existing work in multitask learning focuses on how to take advantage of task similarities, either by learning the relationships between the tasks via cross-task regularization techniques (Zhang & Yeung, 2014; Zhang & Schneider, 2010; Rothman et al., 2010; Xue et al., 2007) or by learning a shared feature representation across all the tasks, leveraging low-dimensional subspaces in the feature space (Argyriou et al., 2008; Jalali et al., 2010; Liu et al., 2009; Swirszcz & Lozano, 2012). Learning task relationships has been shown to be beneficial for (positive and negative) transfer of knowledge from information-rich tasks to information-poor tasks (Zhang & Yeung, 2014), whereas a shared feature representation has been shown to perform well when each task has a limited number of training instances (observations) compared to the total number across all tasks (Argyriou et al., 2008). Existing research in multitask learning follows either the first approach and learns a task relationship matrix in addition to the task parameters, or relies on the latter approach and learns a shared latent feature representation from the task parameters. To the best of our knowledge, no prior work utilizes both principles jointly for multitask learning. In this paper, we propose a new approach that learns a shared feature representation and the task relationship matrix jointly, combining the advantages of both principles into a general multitask learning framework.
Early work on latent shared representations includes (Zhang et al., 2005), which proposes a model based on Independent Component Analysis (ICA) for learning multiple related tasks, in which the task parameters are assumed to be generated from independent sources.
(Argyriou et al., 2008) consider sparse representations common across many learning tasks. Similar in spirit to PCA for unsupervised tasks, their approach learns a low-dimensional representation of the observations (Ding & He, 2004). More recently, (Kumar & Daume, 2012) assume that the relationships among tasks are sparse, enforcing that each observed task is obtained from only a few of the latent features, and from there learn the overlapping group structure among the tasks. (Crammer & Mansour, 2012) propose a K-means-like procedure that simultaneously clusters different tasks and learns a small pool of shared models. Specifically, each task is free to choose the model from the pool that best classifies its own data, and each model is learned by pooling together all the training data belonging to the same cluster.
(Barzilai & Crammer, 2015) propose a similar approach that clusters the tasks into task clusters with hard assignments. These methods compute a factorization of the task weight matrix to learn the shared feature representation and the task structure. This matrix factorization induces a simultaneous clustering of both the tasks and the features in the latent subspace (Li & Ding, 2006). One of the major disadvantages of this assumption is that it restricts the model by requiring the tasks and the features to have the same number of clusters. Consider, for example, sentiment analysis, where each task belongs to a certain domain or product category such as books, automobiles, etc., and each feature is simply a word from the vocabulary of the product reviews. Clearly, assuming that the features and the tasks have the same number of clusters is unjustified: the number of feature clusters is typically larger than the number of task clusters, but the latter grows faster than the former as new products are introduced. Such a restrictive assumption may (and often does) hurt the performance of the model.
Unlike previous work, our proposed approach provides a flexible way to cluster both the tasks and the features. We introduce an additional degree of freedom that allows the number of task clusters to differ from the number of feature clusters (Ding et al., 2006; Wang et al., 2011). In addition, our proposed models learn both the task relationship matrix and the feature relationship matrix along with the co-clustering of both the tasks and the features (Gu & Zhou, 2009; Sindhwani et al., 2009). Our approach is closely related to Output Kernel Learning (OKL), which learns the kernel between the components of the output vector for problems such as multi-output learning and multitask learning (Dinuzzo et al., 2011; Sindhwani et al., 2013). The key disadvantage of OKL is that it requires computing the kernel matrix between every pair of instances from all the tasks, which becomes a scalability bottleneck when the number of tasks/features is large (Weinberger et al., 2009). Our proposed models achieve a similar effect by learning a shared feature representation common across the tasks.

A key challenge in factoring with the extra degree of freedom is optimizing the resulting objective function. Previous work on co-clustering for multitask learning requires strong assumptions on the task parameters (Zhong & Kwok, 2012) or is not scalable to large-scale applications (Xu et al., 2015). We propose an efficient algorithm that scales well to large-scale multitask learning and utilizes the structure of the objective function to learn the factorized task parameters. We formulate the learning of the latent variables in terms of a generalized Sylvester equation, which can be solved efficiently with the conjugate gradient algorithm. We give the mathematical background and motivate our approach in Section 2. We then introduce our proposed models and their learning procedures in Section 3. Section 4 reports the empirical analysis of our proposed models and shows that learning both the task clusters and the feature clusters along with the task parameters gives significant improvements over state-of-the-art baselines in multitask learning.
2 Background
Suppose we have tasks and is the training set for each task . Let represent the weight vector for the task indexed by . These task weight vectors are stacked as the columns of a matrix of size , with being the feature dimension. Traditional multitask learning imposes additional assumptions on , such as low-rank structure or norm penalties, to leverage the shared characteristics among the tasks. In this paper, we consider a similar assumption based on a factorization of the task weight matrix .
In factored models, we decompose the weight matrix as , where can be interpreted as a feature cluster matrix of size with feature clusters and, similarly, as a task cluster matrix of size with task clusters. If we consider squared error losses for all the tasks, then the objective function for learning and can be given as follows:
(1) 
In the above objective function, the latent feature representation is captured by the matrix and the grouping structure on the tasks is determined by the matrix . The predictor for task can then be computed from , where is row of matrix . In the above objective function, is a regularization term that penalizes the unknown matrix with regularization parameter . Similarly, is a regularization term that penalizes the unknown matrix with regularization parameter . and are their corresponding constraint spaces. Without these additional constraints on and , the objective function reduces to solving each task independently, since any task weight matrix from and can also be attained by .
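To make the factored objective concrete, here is a minimal numerical sketch of the decomposition and the squared-error objective. All sizes, the synthetic data, and the regularization weights are illustrative assumptions, and the penalties are taken to be simple Frobenius norms rather than the general regularizers of the framework:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, k = 10, 5, 3      # feature dimension, number of tasks, latent dimension (assumed)
n = 20                  # instances per task (assumed)

X = [rng.standard_normal((n, d)) for _ in range(T)]   # hypothetical per-task inputs
F = rng.standard_normal((d, k))                       # feature cluster matrix (d x k)
G = rng.standard_normal((k, T))                       # task cluster matrix (k x T)
W = F @ G                                             # task weight matrix; column t is task t
y = [X[t] @ W[:, t] + 0.1 * rng.standard_normal(n) for t in range(T)]

def objective(F, G, lam_F=0.1, lam_G=0.1):
    """Sum of per-task squared-error losses plus Frobenius penalties on both factors."""
    W = F @ G
    fit = sum(np.sum((X[t] @ W[:, t] - y[t]) ** 2) for t in range(T))
    return fit + lam_F * np.sum(F ** 2) + lam_G * np.sum(G ** 2)
```

The predictor for task t only ever sees the product F @ G[:, t], which is why extra constraints on the factors are needed to avoid reducing to independent per-task learning.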
Several assumptions can be enforced on these unknown factors and . Below we discuss some previous models that make well-known assumptions on and and can be written in terms of the above objective function.
(1) Factored Multitask Learning (FMTL) (Amit et al., 2007) considers a squared Frobenius norm on both and .
(2) 
It can be shown that the above problem can equivalently be written as multitask learning with a trace norm constraint on the task weight matrix .
(2) Multitask Feature Learning (MTFL) (Argyriou et al., 2008) assumes that the matrix learns sparse representations common across many tasks. Similar in spirit to PCA for unsupervised tasks, MTFL learns a low-dimensional representation of the observations for each task, using such that .
(3) 
where is usually set to . It imposes an norm on to force all the tasks to have a similar sparsity pattern, such that the tasks select the same latent features (columns of ). It is worth noting that Equation 3 can be equivalently written as follows:
(4) 
which then can be rewritten as multitask learning with a trace norm constraint on the task weight matrix as before.
(3) Group Overlap MTL (GOMTL) (Kumar & Daume, 2012) assumes that the matrix is sparse to enforce that each observed task is obtained from only a few of the latent features, indexed by the nonzero pattern of the corresponding rows of the matrix .
(5) 
The above objective function can be compared to dictionary learning where each column of is considered as a dictionary atom and each row of is considered as their corresponding sparse codes (Maurer et al., 2013).
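The dictionary-learning view can be illustrated with a small sparse-coding step. The sketch below recovers per-task sparse codes over a fixed "dictionary" of latent features using plain iterative soft-thresholding (ISTA); the sizes, the penalty weight, and the choice of ISTA are our own illustrative assumptions, not the optimization procedure used in GOMTL:

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, T = 10, 4, 6            # feature dim, latent features ("atoms"), tasks (assumed)
L = rng.standard_normal((d, k))                       # dictionary: columns are atoms
codes = rng.standard_normal((k, T)) * (rng.random((k, T)) < 0.5)
W = L @ codes                                         # each task mixes only a few atoms
lam = 0.05                                            # l1 penalty weight (assumed)

def soft(z, t):
    # soft-thresholding: the proximal operator of the l1 norm
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# ISTA: recover sparse codes S so that L @ S approximates W
S = np.zeros((k, T))
step = 1.0 / np.linalg.norm(L.T @ L, 2)               # inverse Lipschitz constant
for _ in range(500):
    S = soft(S - step * (L.T @ (L @ S - W)), lam * step)
```

Each column of S plays the role of a task's sparse code over the shared atoms, mirroring the dictionary-learning analogy in the text.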
(4) Multitask Learning by Clustering (CMTL) (Barzilai & Crammer, 2015) assumes that the tasks can be clustered into taskclusters with hard assignment. For example, if the th element of is one, and all other elements of are zero, we say that task is associated with cluster .
(6) 
The constraints ensure that is a proper clustering matrix. Since the above problem is computationally expensive, as it involves solving a combinatorial problem, the constraint on is relaxed to .
These four methods require the number of task clusters to be the same as the number of feature clusters, which, as mentioned earlier, is a restrictive assumption that may (and often does) hurt performance. In addition, these methods do not leverage the inherent relationships between the features (via ) and between the tasks (via ). Note that these objective functions are biconvex: the optimization is convex in when is fixed, and vice versa. A globally optimal solution cannot be guaranteed, but one can show that the algorithm reaches a locally optimal solution in a finite number of iterations.
3 Proposed Approach
3.1 BiFactor MTL
Existing models do not take into consideration both the relationships between the tasks and the relationships between the features. Here we consider a more general formulation in which, in addition to estimating the parameters and , we learn the task relationship matrix and the feature relationship matrix . We call this framework BiFactor multitask learning, following the factorization of the task parameters into two low-rank matrices and .
(7)
In the above objective function, we use and to learn the task relationship matrix and the feature relationship matrix . The motivation for these regularization terms comes from (Argyriou et al., 2008; Zhang & Yeung, 2014), which considered either the task relationship matrix or the feature relationship matrix separately. Note that the value of is typically set to a value less than .
It is easy to see that by setting to the identity matrix of size (assuming that the rank is set to ), our objective function reduces to the multitask feature learning (MTFL) model discussed in the previous section. Similarly, by setting to the identity matrix of size (assuming that the rank is set to ), our objective function reduces to multitask relationship learning (Zhang & Yeung, 2014). If we set both and to identity matrices, we obtain the factored multitask learning setting (FMTL) defined in Equation 2. Hence the prior art can be cast as special cases of our more general formulation by imposing certain limiting restrictions.
3.2 Optimization for BiFactor Mtl
We propose an efficient learning algorithm for solving the objective function of BiFactor MTL. Consider an alternating minimization algorithm in which we learn the shared representation while fixing the task structure , and learn the task structure while fixing the shared representation . We repeat these steps until we converge to a locally optimal solution.
Optimizing w.r.t. yields a generalized Sylvester equation of the form for the unknown . We show in Section 3.5 how to solve such linear equations efficiently. From the objective function, we have:
(8) 
Optimizing w.r.t. for the squared error loss results in a similar linear equation:
(9) 
Optimizing w.r.t. and : the optimization of the above function w.r.t. and , while fixing the other unknowns, admits closed-form solutions (Zhang & Yeung, 2014).
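Ignoring the relationship matrices for clarity, the alternating scheme can be sketched as a pair of ridge-regularized least-squares updates. The sizes, the regularizer, and the synthetic data below are illustrative assumptions; the full BiFactor updates additionally involve the learned relationship matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, k, n = 8, 4, 2, 30                  # assumed sizes
X = [rng.standard_normal((n, d)) for _ in range(T)]
W_true = rng.standard_normal((d, k)) @ rng.standard_normal((k, T))  # exactly low-rank
y = [X[t] @ W_true[:, t] for t in range(T)]

F = rng.standard_normal((d, k))
G = rng.standard_normal((k, T))
lam = 1e-3                                # small ridge weight (assumed)

def loss(F, G):
    return sum(np.sum((X[t] @ (F @ G[:, t]) - y[t]) ** 2) for t in range(T))

history = [loss(F, G)]
for _ in range(25):
    # G-step: with F fixed, each task is an independent ridge regression in g_t
    for t in range(T):
        Z = X[t] @ F
        G[:, t] = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y[t])
    # F-step: with G fixed, the stacked problem is linear in vec(F) (column-major)
    A = sum(np.kron(np.outer(G[:, t], G[:, t]), X[t].T @ X[t]) for t in range(T))
    b = sum(np.kron(G[:, t], X[t].T @ y[t]) for t in range(T))
    F = np.linalg.solve(A + lam * np.eye(d * k), b).reshape(k, d).T
    history.append(loss(F, G))
```

Each step exactly minimizes the (regularized) objective in one factor, so the objective is non-increasing across iterations.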
3.3 TriFactor MTL
As mentioned earlier, one of the restrictions in BiFactor MTL and factored models is that the number of feature clusters and the number of task clusters must both be set to . This poses a serious model restriction, as it assumes that the latent task and feature representations live in the same subspace. Such an assumption can significantly hinder the flexibility of the model search space, and we address this problem with a modification to our previous framework.
Following previous work on matrix tri-factorization, we introduce an additional factor and write as , where is a feature cluster matrix of size with feature clusters, is a task cluster matrix of size with task clusters, and is the matrix that maps feature clusters to task clusters. With this representation, the latent features lie in a dimensional subspace and the latent tasks lie in a dimensional subspace.
(10)  
The cluster mapping matrix introduces an additional degree of freedom into the factored models and accommodates the realistic settings encountered in many applications. Note that we do not consider any regularization on in this paper, but one may impose additional constraints on , such as a sparse penalty, a ridge penalty, or nonnegativity constraints, to further improve performance.
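As a quick shape check, the tri-factorization composes three factors whose inner dimensions (number of feature clusters vs. number of task clusters) need not match; the sizes below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 12, 6
k1, k2 = 4, 2                       # feature clusters and task clusters may differ
F = rng.standard_normal((d, k1))    # feature cluster matrix (d x k1)
S = rng.standard_normal((k1, k2))   # cluster mapping matrix (k1 x k2)
G = rng.standard_normal((k2, T))    # task cluster matrix (k2 x T)
W = F @ S @ G                       # task weight matrix (d x T)
```

Note that the rank of W is at most min(k1, k2), so the extra factor decouples the two cluster counts without increasing the effective rank beyond the smaller one.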
3.4 Optimization for TriFactor Mtl
We introduce an efficient learning algorithm for solving TriFactor MTL, similar to the optimization procedure for BiFactor MTL. As before, we consider an alternating minimization algorithm in which we learn the shared representation while fixing and , learn the task structure while fixing and , and learn the cluster mapping matrix while fixing and . We repeat these steps until we converge to a locally optimal solution.
Optimizing w.r.t gives a generalized Sylvester equation as before.
(11) 
Optimizing w.r.t gives the following linear equation:
(12) 
for all .
Optimizing w.r.t : Solving for results in the following equation:
(13) 
Optimizing w.r.t. and : the optimization of the above function w.r.t. and , while fixing the other unknowns, proceeds as in BiFactor MTL. Note that one may impose regularization on and to learn sparse relationships between the tasks and the features (Zhang & Schneider, 2010).
3.5 Solving the Generalized Sylvester Equations
We give some details on how to solve the generalized Sylvester equations (8, 9, 11, 12, 13) encountered in the BiFactor and TriFactor MTL optimization steps. A generalized Sylvester equation of the form has a unique solution under certain regularity conditions, and can be solved exactly by an extended version of the classical Bartels-Stewart method, whose complexity is for the matrix variable , compared to the naive approach via matrix inversion, which requires .
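For the classical special case AX + XB = Q, the Bartels-Stewart approach is available directly in SciPy; the matrices below are illustrative (diagonally shifted so the spectra of A and -B do not overlap, which guarantees a unique solution):

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(3)
d, k = 50, 6
A = rng.standard_normal((d, d)) + d * np.eye(d)   # shift keeps spec(A) away from spec(-B)
B = rng.standard_normal((k, k)) + k * np.eye(k)
Q = rng.standard_normal((d, k))

# Bartels-Stewart via SciPy: solves A X + X B = Q with cubic (not sixth-order) cost
X = solve_sylvester(A, B, Q)
```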
Alternatively, one can solve the linear equation using the Kronecker-product identity vec(AXB) = (B^T ⊗ A) vec(X), where ⊗ denotes the Kronecker product and vec(·) vectorizes a matrix in a column-oriented way. Below, we show the alternative form of the TriFactor MTL equations:
(14)  
(15)  
(16) 
We can do the same for BiFactor MTL, enabling us to use the conjugate gradient (CG) method to learn our unknown factors, with complexity depending on the condition number of the system matrix. To optimize , and , we iteratively run conjugate gradient for each factor, while fixing the other unknowns, until a convergence condition (tolerance ) is met. In addition, CG can exploit the solution from the previous iteration, the low-rank structure in the equation, and the fact that the matrix-vector products can be computed relatively efficiently. From our experiments, we find that our algorithm converges quickly, i.e., in a few iterations.
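A matrix-free sketch of this idea: the Kronecker-structured system for a generalized Sylvester equation A X B + C X D = E can be handed to conjugate gradient through a LinearOperator whose matvec applies the equation directly, so the large Kronecker matrix is never formed. The SPD coefficient matrices below are illustrative stand-ins for the actual BiFactor/TriFactor coefficients:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(4)
d, k = 30, 5

def spd(n):
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

# A X B + C X D = E with SPD coefficients, so the vectorized system
# (B ⊗ A + D ⊗ C) vec(X) = vec(E) is symmetric positive definite.
A, C = spd(d), spd(d)
B, D = spd(k), spd(k)
X_true = rng.standard_normal((d, k))
E = A @ X_true @ B + C @ X_true @ D

def matvec(v):
    # apply the operator without forming the (d*k) x (d*k) Kronecker matrix
    X = v.reshape(d, k, order="F")
    return (A @ X @ B + C @ X @ D).reshape(-1, order="F")

op = LinearOperator((d * k, d * k), matvec=matvec)
x, info = cg(op, E.reshape(-1, order="F"), maxiter=2000)
X = x.reshape(d, k, order="F")
```

Each CG iteration costs only a few dense matrix products in the original (d x k) shape, which is what makes the approach scale.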
4 Experiments
In this section, we report experiments on both synthetic datasets and three real-world datasets to evaluate the effectiveness of our proposed MTL methods. We compare both our models with several state-of-the-art baselines discussed in Section 2. We include results for Shared Multitask Learning (SHAMO) (Crammer & Mansour, 2012), which uses a K-means-like procedure that simultaneously clusters different tasks using a small pool of shared models. Following (Barzilai & Crammer, 2015), we use a gradient-projection algorithm to optimize the dual of the objective function (Equation 6). In addition, we compare our results with Single-task learning (STL), which learns a single model by pooling together the data from all the tasks, and Independent task learning (ITL), which learns each task independently.
Model  syn1  syn2  syn3  syn4  syn5 
STL  4.79 (0.04)  5.71 (0.05)  5.5 (0.04)  4.02 (0.02)  5.72 (0.06) 
ITL  1.98 (0.08)  2.10 (0.09)  2.01 (0.06)  1.95 (0.06)  2.14 (0.07) 
SHAMO  3.63 (0.22)  4.37 (0.27)  3.56 (0.23)  2.76 (0.13)  4.27 (0.31) 
MTFL  1.91 (0.08)  1.95 (0.07)  1.64 (0.06)  1.47 (0.05)  1.91 (0.07) 
GOMTL  1.84 (0.10)  1.90 (0.08)  1.72 (0.04)  1.63 (0.06)  1.84 (0.06) 
BiFactorMTL  1.85 (0.08)  1.85 (0.06)  1.68 (0.08)  1.37 (0.08)  1.74 (0.07) 
TriFactorMTL  1.78 (0.08)  1.83 (0.07)  1.46 (0.05)  1.31 (0.02)  1.68 (0.10) 
Performance results (RMSE) on synthetic datasets. The table reports the means and standard errors over random runs. The best model and the statistically competitive models (by paired t-test) are shown in boldface.

The parameters for the proposed formulations and the state-of-the-art baselines are chosen by cross-validation. We fix the value of to in order to reduce the search space. The value of is chosen from the search grid . The values of , and are chosen from . We evaluate the models using the Root Mean Squared Error (RMSE) for the regression tasks and the measure for the classification tasks. For our experiments, we consider the squared error loss for each task. We repeat all our experiments times to account for statistical variability. The best model and the statistically competitive models (by paired t-test) are shown in boldface.
4.1 Synthetic Data
We evaluate our models on five synthetic datasets based on the assumptions considered in both the baselines and the proposed methods. We generate examples from with for each task . All the datasets consist of tasks with training examples per task. Each task is constructed using . The task parameters for each synthetic dataset are generated as follows:

syn1 dataset consists of groups of tasks with tasks in each group without any overlap. We generate latent features from and each is constructed from linearly combining latent features from .

syn2 dataset is generated with overlapping groups of tasks. As before, we generate latent features from but tasks in group are constructed from features , tasks in group are constructed from features and the tasks in group are constructed from features .

syn3 dataset simulates the BiFactor MTL. We randomly generate task covariance matrix and feature covariance matrix . We sample and and compute .

syn4 dataset simulates the TriFactor MTL. We randomly generate task covariance matrix and feature covariance matrix . We sample , and . We compute the task weight matrix by .

syn5 dataset simulates the experiment with the task weight matrix drawn from a matrix normal distribution (Zhang & Schneider, 2010). We randomly generate task covariance matrix and feature covariance matrix . We sample .
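As an illustration of the syn1-style construction, the sketch below builds disjoint task groups whose weight vectors share a per-group latent feature. All sizes, the noise level, and the one-latent-feature-per-group simplification are our own assumptions, not the exact generator used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 20, 15                      # feature dimension and examples per task (assumed)
groups, per_group = 3, 5           # disjoint task groups, tasks per group (assumed)
T = groups * per_group

L = rng.standard_normal((d, groups))        # one latent feature per group (assumed)
W = np.zeros((d, T))
for g in range(groups):
    for j in range(per_group):
        # each task's weight vector is a random scaling of its group's latent feature
        W[:, g * per_group + j] = rng.standard_normal() * L[:, g]

X = [rng.standard_normal((n, d)) for _ in range(T)]
y = [X[t] @ W[:, t] + 0.5 * rng.standard_normal(n) for t in range(T)]
```

By construction the task weight matrix is low-rank with a block (group) structure, which is exactly what the clustering-based baselines and the proposed models are designed to exploit.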
Models  STL  ITL  SHAMO  MTFL  GOMTL  BiFactor  TriFactor 
20%  12.19 (0.03)  12.00 (0.04)  11.91 (0.05)  11.25 (0.05)  11.15 (0.05)  10.68 (0.08)  10.54 (0.09) 
30%  12.09 (0.07)  12.01 (0.05)  10.92 (0.05)  10.85 (0.02)  10.53 (0.10)  10.38 (0.11)  10.22 (0.08) 
40%  12.00 (0.10)  11.88 (0.06)  11.82 (0.06)  10.61 (0.06)  10.31 (0.14)  10.20 (0.13)  10.12 (0.10) 
We compare the proposed methods, BiFactor MTL and TriFactor MTL, against the baselines. Table 1 shows that BiFactor and TriFactor MTL outperform all the baselines on all the synthetic datasets. STL performs the worst, since it pools the data from all the tasks. SHAMO performs better than STL but worse than ITL, which shows that learning these tasks separately is more beneficial than combining them into a few shared models.

As mentioned earlier, since MTFL is similar to FMTL in Equation 2, we can see how the results of BiFactor MTL improve when it learns both the task relationship matrix and the feature relationship matrix. Note that the syn1 and syn2 datasets are based on the assumptions in GOMTL; hence it performs better than the other baselines. BiFactor MTL and TriFactor MTL are competitive with GOMTL on these datasets, which shows that our proposed methods can easily adapt to these assumptions. The synthetic datasets syn3, syn4 and syn5 are generated with both a task covariance matrix and a feature covariance matrix. Since BiFactor MTL and TriFactor MTL learn the task and feature relationship matrices along with the task weight parameters, they perform significantly better than the other baselines.
4.2 Exam Score Prediction
We evaluate the proposed methods on examination score prediction data, a benchmark dataset in multitask regression reported in several previous articles (Argyriou et al., 2008; Kumar & Daume, 2012; Zhang & Yeung, 2014) (http://ttic.uchicago.edu/~argyriou/code/mtl_feat/school_splits.tar). The school dataset consists of examination scores of students from schools in London. Each school is considered a task, and we need to predict the exam scores of students from these schools. The feature set includes the year of the examination, four school-specific attributes and three student-specific attributes. We replace each categorical attribute with one binary variable for each possible attribute value, as suggested in (Argyriou et al., 2008). This results in attributes, with an additional attribute to account for the bias term.

Clearly, the dataset has school- and student-specific feature clusters that can help in learning the shared feature representation better than the other factored baselines. In addition, there are likely several task clusters in the data to account for the differences among the schools. The training and test sets are obtained by dividing the examples of each task into many small datasets, varying the size of the training data with , and , in order to evaluate the proposed methods on many tasks with limited numbers of examples.
Table 2 shows the experimental results for the school data. All the factorized MTL methods outperform STL and ITL. Both TriFactor MTL and BiFactor MTL significantly outperform the other baselines. It is interesting to see that TriFactor MTL performs considerably well even when the tasks have limited numbers of examples. With more training data, the advantage of TriFactor MTL over the strongest baseline, GOMTL, is reduced.
4.3 Sentiment Analysis
We follow the experimental setup in (Crammer & Mansour, 2012; Barzilai & Crammer, 2015) and evaluate our algorithm on product reviews from Amazon (http://www.cs.jhu.edu/~mdredze/datasets/sentiment). The dataset contains product reviews from domains such as books, dvd, etc. We consider each domain as a binary classification task. The reviews are stemmed and stopwords are removed from the review text. We represent each review as a bag of unigrams/bigrams with TF-IDF scores. We choose reviews from each domain, and each review is associated with a rating from . Reviews with rating are not included in this experiment, as such sentiments are ambiguous and cannot be reliably predicted.
We ran several experiments on this dataset to test the importance of learning a shared feature representation and the co-clustering of tasks and features. In Experiment I, we construct classification tasks, with reviews labeled positive when the rating is and negative when the rating is . We use training examples for each task and the rest as the test set. Since all the tasks are essentially the same, ITL performs better than all the other models (with an measure of ) by combining data from all the other tasks. The results for our proposed methods, BiFactor MTL () and TriFactor MTL (), are comparable to that of ITL. See the supplementary material for the results of Experiment I.
For Experiment II, we split each domain into two equal sets, from which we create two prediction tasks based on two different thresholds: whether the rating of a review is or not, and whether the rating is or not. Obviously, combining all the tasks together will not help in this setting. Experiments III and IV are similar to Experiment II, except that each task is further divided into or subtasks.

Experiment V splits each domain into three equal sets to construct three prediction tasks based on three different thresholds: whether the rating of a review is or not, whether it is or not, and whether it is or not. This setting captures reviews with different levels of sentiment. As before, we build the datasets for Experiments VI and VII by further dividing the three prediction tasks from Experiment V into or subtasks.
The results from these experiments are reported in Table 3. The first four rows of the table show the number of tasks in each experiment, the number of thresholds considered for the ratings, the number of splits constructed from each domain, and the number of training examples in each task. The general trend is that the factorized models perform significantly better than the other baselines. Since MTFL, BiFactorMTL and TriFactorMTL learn the feature relationship matrix in addition to the task parameters, they achieve better results than CMTL, which considers only the task clusters.
We notice that as we increase the number of tasks, the gap between the performance of TriFactorMTL and that of BiFactorMTL (and GOMTL) widens, since the assumption that the number of feature clusters and task clusters must be the same is clearly violated. TriFactorMTL, on the other hand, learns with different numbers of feature and task clusters, and hence achieves better performance than all the other methods considered in these experiments.
Data  II  III  IV  V  VI  VII
Tasks  28  56  84  42  86  126
Thresholds (Splits)  2 (2)  2 (4)  2 (6)  3 (3)  3 (6)  3 (9)
Train Size  120  60  40  80  40  26
STL  0.429 (0.002)  0.432 (0.001)  0.429 (0.002)  0.400 (0.002)  0.399 (0.003)  0.397 (0.001)  
ITL  0.433 (0.001)  0.440 (0.002)  0.431 (0.001)  0.499 (0.001)  0.486 (0.002)  0.479 (0.001)  
SHAMO  0.423 (0.002)  0.437 (0.006)  0.429 (0.002)  0.498 (0.006)  0.460 (0.002)  0.496 (0.013)  
CMTL  0.557 (0.016)  0.436 (0.007)  0.429 (0.004)  0.508 (0.002)  0.486 (0.002)  0.476 (0.002)  
MTFL  0.482 (0.004)  0.473 (0.002)  0.432 (0.007)  0.522 (0.002)  0.487 (0.003)  0.481 (0.002)  
GOMTL  0.582 (0.012)  0.526 (0.013)  0.516 (0.007)  0.587 (0.004)  0.540 (0.005)  0.539 (0.008)  
BiFactor  0.611 (0.018)  0.561 (0.013)  0.598 (0.002)  0.643 (0.013)  0.578 (0.020)  0.574 (0.052)  
TriFactor  0.627 (0.008)  0.588 (0.006)  0.603 (0.012)  0.655 (0.013)  0.606 (0.020)  0.632 (0.029) 
Models  Task 1  Task 2  Task 3  Task 4  Task 5 
GOMTL  0.42 (0.09)  0.57 (0.06)  0.42 (0.04)  0.47 (0.06)  0.40 (0.03) 
BiFactorMTL  0.42 (0.09)  0.60 (0.05)  0.41 (0.04)  0.49 (0.03)  0.36 (0.01) 
TriFactorMTL  0.49 (0.03)  0.63 (0.02)  0.54 (0.02)  0.54 (0.02)  0.51 (0.02) 
4.4 Transfer Learning
Finally, we evaluate our proposed models on the 20 Newsgroups dataset for transfer learning (http://qwone.com/~jason/20Newsgroups/). The dataset contains postings from 20 Usenet newsgroups. As before, the postings are stemmed and stopwords are removed from the text. We represent each posting as a bag of unigrams/bigrams with TF-IDF scores. We construct tasks from the postings of the newsgroups, randomly selecting a pair of newsgroup classes to build each one-vs-one classification task. We follow the holdout experiment suggested by (Raina et al., 2006) for the transfer learning setup. For each of the tasks (the target task), we learn ( and in the case of TriFactorMTL) from the remaining tasks (the source tasks). With ( and ) known from the source tasks, we select of the data from the target task to learn . This experiment shows how well the latent feature representation learned from the source tasks in a dimensional subspace (dimensional subspace for TriFactorMTL) adapts to a new task. We evaluate our results on the remaining data from the target task. We select GOMTL as the baseline for comparison. Since CMTL does not explicitly learn , we did not include it in this experiment.
Table 4 shows the results for this experiment. We report the first tasks here; see the supplementary material for the performance results on all the tasks. We see that GOMTL and BiFactorMTL perform almost the same, since both learn the latent feature representation in a dimensional space. As is evident from the table, TriFactorMTL outperforms both GOMTL and BiFactorMTL, which shows that learning both the factors and improves information transfer from the source tasks to the target task.
5 Conclusions
In this paper, we proposed a novel framework for multitask learning that factors the task parameters into a shared feature representation and a task structure to learn from multiple related tasks. We formulated two approaches, motivated by recent work in multitask latent feature learning. The first, BiFactor MTL, decomposes the task parameters into two low-rank matrices: a latent feature representation and a task structure . As this approach is restrictive in the number of clusters in the latent feature and task spaces, we proposed a second method, TriFactor MTL, which introduces an additional degree of freedom to permit different clusterings in each. We developed a highly scalable and efficient learning algorithm using conjugate gradient descent and generalized Sylvester equations. Extensive empirical analysis on both synthetic and real datasets shows that TriFactor multitask learning outperforms state-of-the-art multitask baselines, demonstrating the effectiveness of the proposed approach.
References

Amit et al. (2007) Amit, Yonatan, Fink, Michael, Srebro, Nathan, and Ullman, Shimon. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pp. 17–24. ACM, 2007.
Argyriou et al. (2008) Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
Barzilai & Crammer (2015) Barzilai, Aviad and Crammer, Koby. Convex multi-task learning by clustering. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS-15), 2015.
Crammer & Mansour (2012) Crammer, Koby and Mansour, Yishay. Learning multiple tasks using shared hypotheses. In Advances in Neural Information Processing Systems, pp. 1475–1483, 2012.
Ding & He (2004) Ding, Chris and He, Xiaofeng. K-means clustering via principal component analysis. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 29. ACM, 2004.
Ding et al. (2006) Ding, Chris, Li, Tao, Peng, Wei, and Park, Haesun. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 126–135. ACM, 2006.
Dinuzzo et al. (2011) Dinuzzo, Francesco, Ong, Cheng S, Pillonetto, Gianluigi, and Gehler, Peter V. Learning output kernels with block coordinate descent. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 49–56, 2011.
Gu & Zhou (2009) Gu, Quanquan and Zhou, Jie. Co-clustering on manifolds. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 359–368. ACM, 2009.
Jalali et al. (2010) Jalali, Ali, Sanghavi, Sujay, Ruan, Chao, and Ravikumar, Pradeep K. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems, pp. 964–972, 2010.
Kumar & Daume (2012) Kumar, Abhishek and Daume, Hal. Learning task grouping and overlap in multitask learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1383–1390, 2012.
Li & Ding (2006) Li, Tao and Ding, Chris. The relationships among various nonnegative matrix factorization methods for clustering. In Sixth International Conference on Data Mining (ICDM'06), pp. 362–371. IEEE, 2006.
Liu et al. (2009) Liu, Jun, Ji, Shuiwang, and Ye, Jieping. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 339–348. AUAI Press, 2009.
Maurer et al. (2013) Maurer, Andreas, Pontil, Massimiliano, and Romera-Paredes, Bernardino. Sparse coding for multitask and transfer learning. In ICML (2), pp. 343–351, 2013.
Raina et al. (2006) Raina, Rajat, Ng, Andrew Y, and Koller, Daphne. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 713–720. ACM, 2006.
Rothman et al. (2010) Rothman, Adam J, Levina, Elizaveta, and Zhu, Ji. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962, 2010.
Sindhwani et al. (2009) Sindhwani, Vikas, Hu, Jianying, and Mojsilovic, Aleksandra. Regularized co-clustering with dual supervision. In Advances in Neural Information Processing Systems, pp. 1505–1512, 2009.
Sindhwani et al. (2013) Sindhwani, Vikas, Minh, Ha Quang, and Lozano, Aurélie C. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and Granger causality. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 586–595. AUAI Press, 2013.
Swirszcz & Lozano (2012) Swirszcz, Grzegorz and Lozano, Aurelie C. Multi-level lasso for sparse multi-task regression. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 361–368, 2012.
Wang et al. (2011) Wang, Hua, Nie, Feiping, Huang, Heng, and Makedon, Fillia. Fast nonnegative matrix tri-factorization for large-scale data co-clustering. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pp. 1553, 2011.
Weinberger et al. (2009) Weinberger, Kilian, Dasgupta, Anirban, Langford, John, Smola, Alex, and Attenberg, Josh. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1113–1120. ACM, 2009.
Xu et al. (2015) Xu, Linli, Huang, Aiqing, Chen, Jianhui, and Chen, Enhong. Exploiting task-feature co-clusters in multi-task learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 1931–1937. AAAI Press, 2015.
Xue et al. (2007) Xue, Ya, Liao, Xuejun, Carin, Lawrence, and Krishnapuram, Balaji. Multi-task learning for classification with Dirichlet process priors. The Journal of Machine Learning Research, 8:35–63, 2007.
Zhang et al. (2005) Zhang, Jian, Ghahramani, Zoubin, and Yang, Yiming. Learning multiple related tasks using latent independent component analysis. In Advances in Neural Information Processing Systems, pp. 1585–1592, 2005.
Zhang & Schneider (2010) Zhang, Yi and Schneider, Jeff G. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems, pp. 2550–2558, 2010.
Zhang & Yeung (2014) Zhang, Yu and Yeung, Dit-Yan. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.
Zhong & Kwok (2012) Zhong, Wenliang and Kwok, James T. Convex multitask learning with flexible task clusters. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 49–56, 2012.
Sensitivity Analysis
Figure 1 shows the hyperparameter sensitivity analysis for GOMTL, BiFactorMTL and TriFactorMTL. As before, we fix the remaining parameters. GOMTL and BiFactorMTL each have two hyperparameters to tune, while TriFactorMTL has three. The plots show that our proposed models yield stable results as their regularization parameters vary. In contrast, GOMTL is sensitive to the value of the regularization parameter for the sparsity penalty.
Additional Results
Tables 5 and 6 show the complete results for the sentiment analysis and transfer learning experiments.
List of one-vs-one classification tasks used in Table 6
Task 1: comp.windows.x vs. comp.os.ms-windows.misc
Task 2: soc.religion.christian vs. rec.sport.hockey
Task 3: misc.forsale vs. talk.politics.guns
Task 4: sci.med vs. rec.autos
Task 5: comp.sys.mac.hardware vs. talk.politics.misc
Task 6: sci.space vs. alt.atheism
Task 7: comp.graphics vs. comp.sys.ibm.pc.hardware
Task 8: talk.politics.mideast vs. sci.electronics
Task 9: rec.motorcycles vs. talk.religion.misc
Task 10: rec.sport.baseball vs. sci.crypt
Data  (i)  (ii)  (iii)  (iv)  (v)  (vi)  (vii)
Tasks  14  28  56  84  42  86  126
—  1 (1)  2 (2)  2 (4)  2 (6)  3 (3)  3 (6)  3 (9)
Train Size  240  120  60  40  80  40  26
STL  0.749 (0.003)  0.429 (0.002)  0.432 (0.001)  0.429 (0.002)  0.400 (0.002)  0.399 (0.003)  0.397 (0.001)  
ITL  0.713 (0.002)  0.433 (0.001)  0.440 (0.002)  0.431 (0.001)  0.499 (0.001)  0.486 (0.002)  0.479 (0.001)  
SHAMO  0.721 (0.005)  0.423 (0.002)  0.437 (0.006)  0.429 (0.002)  0.498 (0.006)  0.460 (0.002)  0.496 (0.013)  
CMTL  0.713 (0.002)  0.557 (0.016)  0.436 (0.007)  0.429 (0.004)  0.508 (0.002)  0.486 (0.002)  0.476 (0.002)  
MTFL  0.711 (0.002)  0.482 (0.004)  0.473 (0.002)  0.432 (0.007)  0.522 (0.002)  0.487 (0.003)  0.481 (0.002)  
GOMTL  0.638 (0.006)  0.582 (0.012)  0.526 (0.013)  0.516 (0.007)  0.587 (0.004)  0.540 (0.005)  0.539 (0.008)  
BiFactorMTL  0.722 (0.006)  0.611 (0.018)  0.561 (0.013)  0.598 (0.002)  0.643 (0.013)  0.578 (0.020)  0.574 (0.052)  
TriFactorMTL  0.733 (0.006)  0.627 (0.008)  0.588 (0.006)  0.603 (0.012)  0.655 (0.013)  0.606 (0.020)  0.632 (0.029) 
Models  Task 1  Task 2  Task 3  Task 4  Task 5  Task 6  Task 7  Task 8  Task 9  Task 10 
GOMTL  0.42 (0.09)  0.57 (0.06)  0.42 (0.04)  0.47 (0.06)  0.40 (0.03)  0.37 (0.02)  0.35 (0.02)  0.70 (0.01)  0.38 (0.00)  0.42 (0.05) 
BiFactorMTL  0.42 (0.09)  0.60 (0.05)  0.41 (0.04)  0.49 (0.03)  0.36 (0.01)  0.42 (0.02)  0.37 (0.02)  0.64 (0.02)  0.38 (0.00)  0.46 (0.04) 
TriFactorMTL  0.49 (0.03)  0.63 (0.02)  0.54 (0.02)  0.54 (0.02)  0.51 (0.02)  0.67 (0.01)  0.47 (0.02)  0.66 (0.01)  0.59 (0.03)  0.62 (0.01) 