Introduction
Multitask learning (MTL) is a branch of machine learning that aims at exploiting the correlation among tasks. To achieve this, the learning of different tasks is performed jointly. It has been shown that learning task relationships can transfer knowledge from information-rich tasks to information-poor tasks
[Zhang and Yeung2014], so that the overall generalization error can be reduced. With this characteristic, MTL has been successfully applied in use cases ranging from transportation [Deng et al.2017] to biomedicine [Li, He, and Borgwardt2018]. Various multitask learning algorithms have been proposed in the literature; Zhang and Yang [Zhang and Yang2017] wrote a comprehensive survey of state-of-the-art (SoA) methods. For instance, the feature learning approach [Argyriou, Evgeniou, and Pontil2007] and the low-rank approach [Ji and Ye2009, Chen, Zhou, and Ye2011] assume that all the tasks are related, which may not hold in real-world applications. Task clustering approaches [Bakker and Heskes2003, Jacob, Vert, and Bach2009, Kumar and Daume III2012] can deal with the situation where different tasks form clusters. However, despite being accurate, these latter methods are computationally expensive for problems with a large number of tasks.
In real-world regression applications, the number of tasks can be tremendous. For instance, in the retail market, shop owners would like to forecast the sales amount of all products based on historical sales information and external factors. A typical shop carries thousands of products, and each product can be modeled as a separate regression task. In addition, if we consider large retail chains, where the number of shops in a given region is in the order of hundreds, the total number of learning tasks easily grows to hundreds of thousands. A similar scenario can also be found in other applications, e.g., demand prediction in transportation, where each task is the demand at one public transport station in a city at a certain time for a given line. Here, at least tens of thousands of tasks are expected.
In all these scenarios, MTL approaches that exploit relationships among tasks are appropriate. Unfortunately, most existing SoA multitask learning methods cannot be applied, because they are either unable to cope with task heterogeneity or too computationally expensive, scaling super-linearly in the number of tasks.
To tackle these issues, this paper introduces a novel algorithm, named Convex Clustering Multi-Task regression Learning (CCMTL). It integrates the objective of convex clustering [Hocking et al.2011] into the multitask learning framework. CCMTL is efficient, with linear runtime in the number of tasks, yet provides accurate predictions. The detailed contributions of this paper are fourfold:


Model: A new model for multitask regression that integrates convex clustering on the nearest neighbor graph of the prediction models;

Optimization Method: A new optimization method for the proposed problem that is proved to converge to the global optimum;

Accurate Prediction: Accurate predictions on both synthetic and real-world datasets from the retail and transportation domains, outperforming eight SoA multitask learning methods;

Efficient and Scalable Implementation: An efficient and linearly scalable implementation of the algorithm. On a real-world retail dataset with 23,812 tasks, the algorithm requires only around 35 seconds to terminate, whereas SoA methods typically need up to hours or even days.
In the remainder of this paper, we introduce the proposed algorithm, present the mathematical proofs, and describe a comprehensive experimental setup that benchmarks CCMTL against the SoA on multiple datasets. Lastly, we discuss related work and conclude the paper.
The Proposed Method
Let us consider a dataset with T regression tasks {(X_t, y_t)}, t = 1, ..., T. X_t ∈ R^{n_t × d} consists of n_t samples with d features, while y_t ∈ R^{n_t} is the target vector of task t. There are N samples in total, where N = Σ_{t=1}^{T} n_t. We consider linear models in this paper. Let W = [w_1, ..., w_T] ∈ R^{d × T}, where w_t ∈ R^d represents the weight vector of task t.
Model
Task clustering MTL methods learn the task relationships by integrating a k-means objective [Jacob, Vert, and Bach2009, Zhou, Chen, and Ye2011b] or a matrix decomposition [Kumar and Daume III2012]. Unfortunately, these methods are expensive to train, making them impractical for problems with a massive number of tasks.
Recently, convex clustering [Hocking et al.2011] has attracted much attention. It casts clustering of data as a regularized reconstruction problem:
min_U  (1/2) Σ_{i=1}^{n} ||x_i − u_i||_2^2 + λ Σ_{(i,j)∈G} ||u_i − u_j||_2,    (1)
where u_i is the new representation of the i-th sample x_i, and G is the nearest neighbor graph on the samples. It is critical to note that the edge terms involve the ℓ2 norm, not the squared ℓ2 norm, which is essential to achieve clustering. Further, it has been shown that convex clustering is efficient, with linear scalability [Chi and Lange2015, Shah and Koltun2017, He and MoreiraMatias2018].
Suppose we had a noisy observation Ŵ of the true prediction models; we could then learn less noisy models W by convex clustering, with W and Ŵ taking the roles of U and X in Problem (1), respectively. Since such an observation is not available in practice and the target of MTL is prediction, we use the prediction error instead of the reconstruction error and formulate the problem as follows:
min_W  (1/2) Σ_{t=1}^{T} ||y_t − X_t w_t||_2^2 + λ Σ_{(i,j)∈G} ||w_i − w_j||_2,    (2)
where λ is a regularization parameter and G is the nearest neighbor graph on the prediction models learned independently for each task.
Note that if we use the ℓ1 norm as the regularizer, Problem (2) is equivalent to Fused Multitask Learning (FuseMTL) [Zhou et al.2012, Chen et al.2010]. However, it has been shown that the ℓ2 norm works better for clustering in most cases [Hocking et al.2011]. This improvement is also confirmed for MTL in the experimental section.
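For concreteness, the following is a minimal sketch (not the authors' code) of how the objective of Problem (2) can be evaluated for a given set of per-task weights; all function and variable names are illustrative.

```python
# Hedged sketch: evaluates the objective of Problem (2) for per-task
# weights W, data {X_t, y_t}, a nearest-neighbor edge list G, and
# regularization strength lam. Names are illustrative assumptions.
import numpy as np

def ccmtl_objective(W, Xs, ys, edges, lam):
    """W: (T, d) weight matrix; Xs, ys: lists of per-task data;
    edges: iterable of (i, j) pairs from the graph on the models."""
    fit = 0.5 * sum(np.sum((X @ w - y) ** 2)
                    for X, y, w in zip(Xs, ys, W))
    # l2 (not squared) norm on model differences -> clustering behaviour
    fuse = sum(np.linalg.norm(W[i] - W[j]) for i, j in edges)
    return fit + lam * fuse
```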
Optimization
Problem (2) falls within the general framework of Network Lasso [Hallac, Leskovec, and Boyd2015], and hence it could be solved by the Network Lasso solver, which is based on the general alternating direction method of multipliers (ADMM). However, Network Lasso is not specifically designed for the multitask regression problem. Further, in our experiments, we found its single-threaded performance to be rather slow, and its convergence is affected by its hyperparameters.
Chi and Lange [Chi and Lange2015] propose an alternating minimization algorithm (AMA) for convex clustering, which has been shown to be faster than the ADMM-based method. Although this approach can also be applied to Problem (2), its convergence is only guaranteed under certain conditions. Later, Wang et al. [Wang et al.2016] proposed a variation of the AMA method for convex clustering. We tested this method on Problem (2), but convergence was not always achieved empirically.
Shah and Koltun [Shah and Koltun2017] propose an efficient method for continuous clustering with a non-convex regularization function. Inspired by this method, we propose a new efficient algorithm for Problem (2) that iteratively solves a structured linear system.
First, we introduce an auxiliary variable θ_{i,j} for each connection between nodes i and j in the graph G, and form the following new problem:
min_{W, θ>0}  (1/2) Σ_{t=1}^{T} ||y_t − X_t w_t||_2^2 + λ Σ_{(i,j)∈G} ( θ_{i,j} ||w_i − w_j||_2^2 + 1/(4 θ_{i,j}) ).    (3)
Minimizing Problem (3) over θ_{i,j} with W fixed recovers exactly the ℓ2 fusion penalty of Problem (2); hence Problems (2) and (3) are equivalent and share the same optimal W.
Intuitively, Problem (3) learns the weights of the squared ℓ2 norm regularizer used in Graph regularized Multitask Learning (SRMTL) [Zhou, Chen, and Ye2011b]. In the presence of noisy edges, the squared ℓ2 norm forces uncorrelated tasks to be close, while the ℓ2 norm is more robust; this is confirmed for MTL in the experimental section.
To solve Problem (3), we optimize W and θ alternately. When W is fixed, we set the derivative of Problem (3) with respect to θ_{i,j} to zero and obtain the update rule
θ_{i,j} = 1 / (2 ||w_i − w_j||_2).    (4)
When θ is fixed, Problem (3) is equivalent to
min_W  (1/2) Σ_{t=1}^{T} ||y_t − X_t w_t||_2^2 + λ Σ_{(i,j)∈G} θ_{i,j} ||w_i − w_j||_2^2.    (5)
In order to solve Problem (5), let us define X̄ = diag(X_1, ..., X_T) ∈ R^{N × Td} as a block diagonal matrix, ȳ = [y_1; ...; y_T] ∈ R^{N} as the stacked target vector, and w̄ = [w_1; ...; w_T] ∈ R^{Td} as the stacked weight vector.
Then, Problem (5) can be rewritten as:
min_{w̄}  (1/2) ||ȳ − X̄ w̄||_2^2 + λ Σ_{(i,j)∈G} θ_{i,j} ||((e_i − e_j)^T ⊗ I_d) w̄||_2^2,    (6)
where e_i is an indicator vector with the i-th element set to 1 and the others to 0, and I_d is an identity matrix of size d. Setting the derivative of Problem (6) with respect to w̄ to zero, the optimal solution of w̄ can be obtained by solving the following linear system:
(M + Q) w̄ = b,    (7)
where M = X̄^T X̄, Q = 2λ (L ⊗ I_d) with L = Σ_{(i,j)∈G} θ_{i,j} (e_i − e_j)(e_i − e_j)^T, b = X̄^T ȳ, and W is derived by reshaping w̄.
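As a concrete illustration of the alternating scheme, the sketch below assembles the sparse system of Eq. (7) and performs one outer iteration. It follows the reconstruction of Problems (3)-(7) given in this section; the eps safeguard and all names are illustrative assumptions and not the authors' implementation (which is in MATLAB).

```python
# Hedged sketch of one outer iteration: theta is updated in closed form
# with W fixed (Eq. (4)), then W is obtained by solving the sparse linear
# system (7) with conjugate gradients, warm-started at the previous W.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def ccmtl_iteration(W, Xs, ys, edges, lam, eps=1e-8):
    T, d = W.shape
    # theta update: theta_ij = 1 / (2 ||w_i - w_j||), eps avoids division by zero
    theta = {(i, j): 1.0 / (2.0 * np.linalg.norm(W[i] - W[j]) + eps)
             for i, j in edges}
    # M = blockdiag(X_t^T X_t) and b = stack(X_t^T y_t)
    M = sp.block_diag([X.T @ X for X in Xs], format="csr")
    b = np.concatenate([X.T @ y for X, y in zip(Xs, ys)])
    # Weighted graph Laplacian L = sum theta_ij (e_i - e_j)(e_i - e_j)^T
    rows, cols, vals = [], [], []
    for (i, j), t in theta.items():
        rows += [i, j, i, j]; cols += [i, j, j, i]; vals += [t, t, -t, -t]
    L = sp.coo_matrix((vals, (rows, cols)), shape=(T, T)).tocsr()
    Q = 2.0 * lam * sp.kron(L, sp.identity(d), format="csr")
    # Solve (M + Q) w_bar = b, warm-started at the previous solution
    w_bar, _ = cg(M + Q, b, x0=W.reshape(-1))
    return w_bar.reshape(T, d), theta
```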
Convergence Analysis
Theorem 2. The alternating updates of Eq. (4) and Eq. (7) converge to the global optimum of Problem (2).
Proof.
Problem (3) is biconvex in W and θ. Therefore, alternately applying the updates in Eq. (4) and Eq. (7) converges to a local minimum [Beck2015]. Here we prove it in a different way. We show that the combined update is a contracting map Φ with respect to the objective gap. Let g denote the objective function defined in Problem (3), and define the gap d(W, θ) = g(W, θ) − g(W^*, θ^*), with d(W, θ) ≥ 0. (The superscripts *, − and + indicate the optimal values, the values before the update, and the values after the update of the variables W and θ, respectively.) The gap is zero only at (W^*, θ^*), because (W^*, θ^*) is a stationary point. Our mapping Φ is composed of two steps: the update of θ by Eq. (4), followed by the update of W by Eq. (7). We show that the first step does not increase the objective, that the same holds for the second step, and therefore for their composition. The first step updates θ with W^- fixed and returns the exact minimizer of g(W^-, ·), thus
g(W^-, θ^+) ≤ g(W^-, θ^-).
The second step updates W with θ^+ fixed such that
g(W^+, θ^+) ≤ g(W^-, θ^+),
where the reduction can be obtained either by a gradient descent step or by the direct solution of Eq. (7), since g is convex in each of the two variables separately. Composing the two steps, we obtain
d(W^+, θ^+) = g(W^+, θ^+) − g(W^*, θ^*) ≤ g(W^-, θ^-) − g(W^*, θ^*) = d(W^-, θ^-),
which proves the convergence of the method.
Efficient and Scalable Implementation
CCMTL is summarized in Algorithm 1. First, it initializes the weight matrix W by performing linear regression on each task separately. Then, it constructs the nearest neighbor graph G on the rows of W based on the Euclidean distance. Finally, the optimization problem is solved by iteratively updating θ (Eq. (4)) and W (Eq. (7)) until convergence.
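A hedged sketch of the two initialization steps summarized in Algorithm 1 follows: per-task least squares, then a nearest neighbor graph on the resulting weight vectors. The value of k, the ridge safeguard, and the use of scikit-learn are illustrative assumptions (the paper's implementation is in MATLAB).

```python
# Hedged sketch of the initialization in Algorithm 1.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def init_models(Xs, ys, ridge=1e-6):
    """Independent least-squares model per task; ridge term only for stability."""
    d = Xs[0].shape[1]
    return np.stack([np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
                     for X, y in zip(Xs, ys)])

def knn_edges(W, k=5):
    """k-nearest-neighbor graph (Euclidean) on the rows of W."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(W)   # +1: a point is its own neighbor
    _, idx = nn.kneighbors(W)
    return {(min(i, j), max(i, j))
            for i, row in enumerate(idx) for j in row[1:] if j != i}
```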
Solving the linear system in Eq. (7) involves inverting a matrix of size Td × Td, where d is the number of features and T is the number of tasks. Direct inversion would lead to cubic computational complexity, which does not scale to a large number of tasks. However, certain properties of M and Q in Eq. (7) can be used to derive an efficient implementation.
Theorem 3.
The matrices M and Q in Eq. (7) are both symmetric and positive semidefinite, and Q is a Laplacian matrix.
Proof.
Clearly, M and Q are symmetric. For each edge (i, j) in G, the matrix (e_i − e_j)(e_i − e_j)^T is a Laplacian matrix by definition. Since the sum of Laplacian matrices is Laplacian, and multiplying a Laplacian matrix by a positive value keeps it Laplacian, L = Σ_{(i,j)∈G} θ_{i,j} (e_i − e_j)(e_i − e_j)^T is a Laplacian matrix because θ_{i,j} > 0. The Kronecker product of a Laplacian matrix with an identity matrix is, up to a simultaneous permutation of rows and columns, a block diagonal matrix whose diagonal blocks are Laplacian matrices, and is therefore itself Laplacian. Hence Q = 2λ (L ⊗ I_d) is a Laplacian matrix (λ > 0) and thus positive semidefinite. M = X̄^T X̄ is a Gram matrix of dot products, therefore it is positive semidefinite as well. ∎
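The properties stated in Theorem 3 can also be checked numerically; the sketch below is an illustrative sanity check (not part of the paper) for sparse matrices M and Q assembled as in Eq. (7), testing symmetry, zero row sums of the Laplacian part, and positive semidefiniteness on random directions.

```python
# Hedged sketch: numerical check of the Theorem 3 properties for given
# sparse (or dense) matrices M and Q. Tolerances are illustrative.
import numpy as np

def check_theorem3(M, Q, trials=100):
    assert abs(M - M.T).max() < 1e-8 and abs(Q - Q.T).max() < 1e-8   # symmetry
    assert np.allclose(np.asarray(Q.sum(axis=1)).ravel(), 0.0)       # Laplacian row sums
    rng = np.random.default_rng(0)
    for A in (M, Q):
        for _ in range(trials):
            v = rng.standard_normal(A.shape[0])
            assert v @ (A @ v) >= -1e-8                               # PSD on random directions
```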
Based on Theorem 3, Eq. (7) can be solved efficiently by the Conjugate Gradient (CG) method, which requires the input matrix to be symmetric and positive semidefinite. CG is faster than direct inversion since M and Q are sparse matrices. In addition, the solution w̄ from the previous outer iteration can be used to initialize CG in the next iteration, which further speeds up its convergence.
Further, recent studies [Cohen et al.2014, Kelner et al.2013] show that linear systems with sparse Laplacian matrices can be solved in near-linear time. Among them, [Ioannis, Miller, and Tolliver2011] proposed the Combinatorial Multigrid (CMG) algorithm, a CG-based method that exploits the hierarchical structure of the Laplacian matrix. CMG empirically scales linearly w.r.t. the number of nonzero entries in the Laplacian matrix of the linear system.
Based on Theorem 3, Q is a sparse Laplacian matrix. Therefore, we adopt CMG to solve Eq. (7) in CCMTL. Although M + Q is no longer a Laplacian matrix, empirically we find that CMG still converges fast and scales linearly in the number of tasks. We conjecture that this is due to the fact that M is a structured block diagonal matrix.
The runtime of CCMTL consists of three parts: 1) the initialization of W, 2) the nearest neighbor graph construction, and 3) the optimization of Problem (3). Clearly, initializing W by linear regression is efficient and scales linearly with the number of tasks T. Nearest neighbor graph construction naively scales quadratically in T; however, it is not the bottleneck when using MATLAB's pdist2 function, even for a synthetic dataset with 160k tasks. CCMTL empirically converges within a small number of iterations. The majority of the runtime is spent solving the linear system in Eq. (7). The adopted CMG method empirically scales linearly in the number of nonzero entries of M and Q in Eq. (7), which is linear in T.
Experiments
Comparison Methods
We compare our method CCMTL with several SoA methods. As baselines, we compare with Single-task learning (STL), which learns a single model by pooling together the data from all the tasks, and Independent task learning (ITL), which learns each task independently. These baselines represent the two extreme hypotheses, full independence of the tasks (ITL) and complete correlation of all tasks (STL). MTL methods should find the right balance between grouping tasks and isolating groups to achieve learning generalization. We further compare to the multitask feature learning method Joint Feature Learning (L21) [Argyriou, Evgeniou, and Pontil2007] and to the low-rank methods Trace-norm Regularized Learning (Trace) [Ji and Ye2009] and Robust Multitask Learning (RMTL) [Chen, Zhou, and Ye2011]. We also compare to five clustering approaches: CMTL [Jacob, Vert, and Bach2009], FuseMTL [Zhou et al.2012], SRMTL [Zhou, Chen, and Ye2011b], and two recently proposed models, BiFactorMTL and TriFactorMTL [Murugesan, Carbonell, and Yang2017].
All methods are implemented in MATLAB and evaluated on a single thread. We implement CCMTL (code available at ccmtlaaai.neclab.eu), STL, and ITL. We use the implementations of L21, Trace, RMTL, SRMTL, and FuseMTL from the MALSAR package [Zhou, Chen, and Ye2011b], and obtain CMTL, BiFactorMTL, and TriFactorMTL from the authors' websites. The number of nearest neighbors used to build the initial graph is kept fixed, and the same graph, generated as in Algorithm 1, is used for FuseMTL, SRMTL, and CCMTL. CCMTL, STL, ITL, L21, Trace, and FuseMTL need one hyperparameter, selected from a predefined grid. RMTL, CMTL, BiFactorMTL, and TriFactorMTL need two hyperparameters, selected from the same grid. CMTL and BiFactorMTL further need one, and TriFactorMTL two, hyperparameters for the number of clusters, chosen from a predefined set. All these hyperparameters are selected by an internal cross-validation grid search on the training data.
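As an illustration of this selection protocol, the sketch below runs an internal cross-validation grid search over λ; the candidate grid, the number of folds, and the fit_fn/rmse_fn hooks are placeholders, not the values or code used in the paper.

```python
# Hedged sketch of internal cross-validation grid search over lambda.
import numpy as np
from sklearn.model_selection import KFold

def select_lambda(Xs, ys, fit_fn, rmse_fn,
                  grid=(1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0), folds=3):
    """fit_fn(Xs_tr, ys_tr, lam) -> W; rmse_fn(W, Xs_va, ys_va) -> float.
    Every task is split independently, so tasks may have different sizes."""
    splits = [list(KFold(folds, shuffle=True, random_state=0).split(X)) for X in Xs]
    scores = {}
    for lam in grid:
        errs = []
        for f in range(folds):
            Xs_tr = [X[s[f][0]] for X, s in zip(Xs, splits)]
            ys_tr = [y[s[f][0]] for y, s in zip(ys, splits)]
            Xs_va = [X[s[f][1]] for X, s in zip(Xs, splits)]
            ys_va = [y[s[f][1]] for y, s in zip(ys, splits)]
            errs.append(rmse_fn(fit_fn(Xs_tr, ys_tr, lam), Xs_va, ys_va))
        scores[lam] = float(np.mean(errs))
    return min(scores, key=scores.get)
```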
Table 1: Statistics of the datasets.
Name  Samples  Features  Num Tasks
Syn  3000  15  30 
ScaleSyn  [500k,16000k]  10  [50k,160k] 
School  15362  28  139 
Sales  34062  5  811 
TaFeng  2619320  5  23812 
Alighting  33945  5  1926 
Boarding  33945  5  1926 
Table 2: Final objective value and runtime (s) on the Syn dataset for increasing values of the regularization parameter λ (left to right).
  Obj  Time(s)  Obj  Time(s)  Obj  Time(s)  Obj  Time(s)  Obj  Time(s)
ADMM  1314  8  1329  8  1474  9  2320  49  7055  180
The proposed  1314  0.5  1329  0.5  1472  0.5  2320  0.5  6454  0.5
Table 3: Final objective value and runtime (s) on the School dataset for increasing values of the regularization parameter λ (left to right).
  Obj  Time(s)  Obj  Time(s)  Obj  Time(s)  Obj  Time(s)  Obj  Time(s)
ADMM  664653  605  665611  583  674374  780  726016  4446  776236  5760
The proposed  664642  0.7  665572  0.8  674229  0.9  725027  1.5  764844  1.9
Training ratio (%)  20  30  40
STL  2.905 (0.031)  2.877 (0.025)  2.873 (0.036) 
ITL  1.732 (0.077)  1.424 (0.049)  1.284 (0.024) 
L21  1.702 (0.033)  1.388 (0.014)  1.282 (0.011) 
Trace  1.302 (0.042)  1.222 (0.028)  1.168 (0.023) 
RMTL  1.407 (0.028)  1.295 (0.024)  1.234 (0.039) 
CMTL  1.263 (0.038)  1.184 (0.007)  1.152 (0.017) 
FuseMTL  2.264 (0.351)  1.466 (0.025)  1.297 (0.048) 
SRMTL  1.362 (0.018)  1.195 (0.014)  1.152 (0.012) 
BiFactor  1.219 (0.025)  1.150 (0.020)  1.125 (0.013) 
TriFactor  1.331 (0.239)  1.255 (0.236)  1.126 (0.010) 
CCMTL  1.192 (0.018)  1.161 (0.018)  1.136 (0.015) 
Table 4: RMSE on the Syn dataset. The table reports the mean and standard error over repeated random runs. The best model and the statistically competitive models (by paired t-test) are shown in bold.
Datasets
We employ both synthetic and real-world datasets. Table 1 shows their statistics. Further details are provided below.
Accuracy Synthetic.
The Syn dataset aims at showing the ability of MTL methods to capture task structure. It consists of three groups of 10 tasks each. We generate 15 features drawn from a Gaussian distribution. Tasks in the first group are constructed from the first block of features plus random features; similarly, tasks in the second and third groups are constructed from the second and third blocks of features, respectively. 100 samples are generated for each task.
Scaling Synthetic.
The ScaleSyn datasets aim at showing the computational performance of MTL methods. They have a fixed feature size (d = 10) but an exponentially growing number of tasks (up to 160k). Tasks are generated in groups of fixed size. The latent weight vectors of tasks in the same group are sampled around a common group center, which is itself sampled from a Gaussian distribution. The same number of samples is generated for each task as well.
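A hedged sketch of a group-structured generator in the spirit of Syn and ScaleSyn follows: tasks in the same group share a latent weight center, so methods that recover the groups should outperform both STL and ITL. All sizes, noise levels, and distributions below are illustrative assumptions, not the exact generation protocol of the paper.

```python
# Hedged sketch of a group-structured synthetic task generator.
import numpy as np

def make_grouped_tasks(n_groups=3, tasks_per_group=10, d=15, n=100,
                       noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_groups, d))        # one latent center per group
    Xs, ys, labels = [], [], []
    for g in range(n_groups):
        for _ in range(tasks_per_group):
            w = centers[g] + 0.1 * rng.normal(size=d)   # task-specific perturbation
            X = rng.normal(size=(n, d))
            Xs.append(X)
            ys.append(X @ w + noise * rng.normal(size=n))
            labels.append(g)
    return Xs, ys, np.array(labels)
```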
Exam Score Prediction.
School is a classical benchmark dataset for multitask regression, reported in the literature [Argyriou, Evgeniou, and Pontil2007], [Kumar and Daume III2012], [Zhang and Yeung2014]. It consists of the examination scores of 15,362 students from 139 schools in London. Each school is considered as a task, and the aim is to predict the exam scores of all the students. We use the version of the dataset from the MALSAR package [Zhou, Chen, and Ye2011b].
Retail.
Sales is a dataset containing the weekly purchase quantities of 811 products over 52 weeks [Tan and San Lau2014]. We acquired the dataset from the UCI repository [Dheeru and Karra Taniskidou2017]. We build the regression tasks by using the sales quantities of previous weeks of each product to predict the sales of the current week, resulting in 34,062 samples in total. TaFeng is another large grocery shopping dataset that consists of transaction data for 23,812 products over four months. We build the data in a similar fashion, obtaining 2,619,320 samples in total.
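A hedged sketch of how such per-product regression tasks can be constructed from a weekly sales matrix follows: the previous lag weeks predict the current week. The lag value of 5 merely matches the feature count in Table 1 and is an assumption about the exact construction.

```python
# Hedged sketch: build one lagged regression task per product.
import numpy as np

def make_lag_tasks(sales, lag=5):
    """sales: array of shape (n_products, n_weeks). Returns one (X, y) per product."""
    Xs, ys = [], []
    for series in sales:
        X = np.stack([series[t - lag:t] for t in range(lag, len(series))])
        Xs.append(X)
        ys.append(series[lag:])
    return Xs, ys
```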
Transportation.
Demand prediction is an important task for Intelligent Transportation Systems (ITS). We used a confidential real dataset consisting of bus arrival times and passenger counting information at each station for two lines of a major European city, in both directions, with four trips each. A task (1,926 in total) consists of predicting the passenger demand at a stop, given the arrival time at the stop and the numbers of alighting and boarding passengers at the previous two stops. The Alighting and Boarding datasets each contain 33,945 samples and 5 features.
Table 5: RMSE on the School dataset for different training ratios.
Training ratio (%)  20  30  40
STL  10.245 (0.026)  10.219 (0.034)  10.241 (0.068) 
ITL  11.427 (0.149)  10.925 (0.085)  10.683 (0.045) 
L21  11.175 (0.079)  11.804 (0.134)  11.442 (0.137) 
Trace  11.117 (0.054)  11.877 (0.542)  11.655 (0.058) 
RMTL  11.095 (0.066)  10.764 (0.068)  10.544 (0.061) 
CMTL  10.219 (0.056)  10.109 (0.069)  10.116 (0.053) 
FuseMTL  10.372 (0.108)  10.407 (0.269)  10.217 (0.085) 
SRMTL  10.258 (0.022)  10.212 (0.039)  10.128 (0.021) 
BiFactor  10.445 (0.135)  10.201 (0.067)  10.116 (0.051) 
TriFactor  10.551 (0.080)  10.224 (0.070)  10.129 (0.020) 
CCMTL  10.170 (0.029)  10.036 (0.046)  10.020 (0.021) 
Results and Discussion
Comparison with ADMMbased Solver.
First, we compare our solver with an ADMM-based solver for Problem (2). The ADMM-based solver is implemented using the SnapVX Python package from NetworkLasso [Hallac, Leskovec, and Boyd2015]. Both solvers are evaluated on the Syn and School benchmark datasets. The nearest neighbor graph G is generated as described in Algorithm 1 and is used to test both solvers with different values of the regularization parameter λ. Tables 2 and 3 show the final objective values and runtimes on the Syn and School datasets, respectively. For small values of λ, both solvers achieve similar objective values for Problem (2). As λ grows, the objective values of both methods increase monotonically, reflecting the increasing weight of the regularization term, but with a smaller slope for our solver than for the ADMM-based one. In addition to reaching lower objective values, our solver is clearly more computationally efficient: it is stable in runtime, taking at most two seconds for all tested values of λ, compared to runtimes ranging from hundreds to thousands of seconds for the ADMM-based solver on the School data.
Comparison with SoA MTL methods.
CCMTL is compared with the state-of-the-art methods in terms of Root Mean Squared Error (RMSE). All experiments are repeated several times with different random shuffles of the data. In all result tables, we compare the best performing method with the remaining ones using a paired t-test. The best method and the methods that it cannot statistically outperform are shown in boldface.
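A hedged sketch of this evaluation protocol follows: per-run RMSE over all tasks and a paired t-test between the best method and each competitor across the repeated runs; the significance level is an illustrative assumption, since the paper's value is not shown here.

```python
# Hedged sketch of the evaluation protocol (RMSE + paired t-test).
import numpy as np
from scipy.stats import ttest_rel

def rmse(W, Xs, ys):
    res = np.concatenate([X @ w - y for X, y, w in zip(Xs, ys, W)])
    return float(np.sqrt(np.mean(res ** 2)))

def competitive_with_best(rmse_per_method, alpha=0.05):
    """rmse_per_method: dict name -> array of RMSEs over the repeated runs."""
    best = min(rmse_per_method, key=lambda m: np.mean(rmse_per_method[m]))
    keep = {best}
    for name, vals in rmse_per_method.items():
        if name != best and ttest_rel(vals, rmse_per_method[best]).pvalue > alpha:
            keep.add(name)   # cannot be statistically outperformed by the best
    return best, keep
```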
Table 4 presents the prediction error (RMSE) on the Syn dataset, with the ratio of training samples ranging from 20% to 40%. Tasks in the Syn dataset are generated to be heterogeneous and well partitioned; therefore, STL performs the worst, since it trains only a single model on all tasks. Similarly, the baseline ITL is outperformed by the remaining MTL methods. Our approach, CCMTL, is statistically better than all SoA methods, except for BiFactor, which performs as well as CCMTL on the Syn dataset.
The results on the School dataset are reported in Table 5, with the ratio of training samples again ranging from 20% to 40%. Unlike the Syn data, the School data has tasks that are rather homogeneous; therefore, ITL performs the worst and STL outperforms many of the MTL methods (L21, Trace, RMTL, FuseMTL, and SRMTL). BiFactor and TriFactor outperform STL only when the training ratio is larger than 20% and 30%, respectively. CCMTL again performs better than all competing methods on all training ratios; CCMTL is also statistically the best performing method, except for CMTL at one training ratio, where the null hypothesis could not be rejected.
Table 6: RMSE and training time (s) on the retail datasets.
  Sales  TaFeng
RMSE  Time(s)  RMSE  Time(s)  
STL  2.861 (0.02)  0.1  0.791 (0.01)  0.2 
ITL  3.115 (0.02)  0.1  0.818 (0.01)  0.4 
L21  3.301 (0.01)  11.8  0.863 (0.01)  831.2 
Trace  3.285 (0.21)  10.4  0.863 (0.01)  582.3 
RMTL  3.111 (0.01)  3.4  0.833 (0.01)  181.5 
CMTL  3.088 (0.01)  43.4  –  –
FuseMTL  2.898 (0.01)  4.3  0.764 (0.01)  8483.3 
SRMTL  2.854 (0.02)  10.3  –  –
BiFactor  2.882 (0.01)  55.7  –  –
TriFactor  2.857 (0.04)  499.1  –  –
CCMTL  2.793 (0.01)  1.8  0.767 (0.01)  35.3 
Table 6 reports the results on the two retail datasets, Sales and TaFeng, together with the time (in seconds) required for training with the best found parametrization of each method. A fixed ratio of the samples is used for training. The best runtime among the MTL methods is shown in boldface. Tasks in these two datasets are again rather homogeneous; therefore, the baseline STL is competitive and outperforms many MTL methods. STL outperforms ITL, L21, Trace (the hyperparameter search range for L21 and Trace is shifted for the TaFeng dataset to obtain reasonable results), RMTL, CMTL, FuseMTL, SRMTL, BiFactor, and TriFactor on the Sales dataset, and outperforms ITL, L21, Trace, and RMTL on the TaFeng data (we set a timeout; CMTL, SRMTL, BiFactor, and TriFactor did not return a result within it on the TaFeng dataset). CCMTL is the only method that performs better (also statistically better) than STL on both datasets; it also outperforms all MTL methods (with statistical significance) on both datasets, except for FuseMTL, which performs slightly better than CCMTL only on the TaFeng data. CCMTL also requires the smallest runtime among the MTL competitors: on the TaFeng dataset it needs only around 35 seconds, while the SoA methods need up to hours or even days.
Table 7: RMSE and training time (s) on the transportation datasets.
  Alighting  Boarding
RMSE  Time(s)  RMSE  Time(s)  
STL  3.073 (0.02)  0.1  3.236 (0.03)  0.1 
ITL  2.894 (0.02)  0.1  3.002 (0.03)  0.1 
L21  2.865 (0.04)  14.6  2.983 (0.03)  16.7 
Trace  2.835 (0.01)  19.1  2.997 (0.05)  17.5 
RMTL  2.985 (0.03)  6.7  3.156 (0.04)  7.1 
CMTL  2.970 (0.02)  82.6  3.105 (0.03)  91.8 
FuseMTL  3.080 (0.02)  11.1  3.243 (0.03)  11.3 
SRMTL  2.793 (0.02)  12.3  2.926 (0.02)  14.2 
BiFactor  3.010 (0.02)  152.1  3.133 (0.03)  99.7 
TriFactor  2.913 (0.02)  282.3  3.014 (0.03)  359.1 
CCMTL  2.795 (0.02)  4.8  2.928 (0.03)  4.1 
Table 7 reports the results on the transportation datasets, using two different target attributes (Alighting and Boarding); again, the runtime is reported for the best found parametrization, and the best runtime achieved by an MTL method is shown in boldface. The results on these datasets are interesting, especially because neither baseline is competitive, unlike in the previous datasets. This suggests that the tasks belong to latent groups that are homogeneous within groups and heterogeneous across groups. All MTL methods (except FuseMTL) outperform at least one of the baselines (STL and ITL) on both datasets. Our approach, CCMTL, appears to reach the right balance between task independence (ITL) and complete correlation (STL), as confirmed by the results: it achieves statistically the lowest RMSE against the baselines and all other MTL methods (except SRMTL), and it is at least 40% faster than the fastest competing MTL method (RMTL).
Scalability.
In the scalability analysis, we use the ScaleSyn datasets. We search the hyperparameters of all methods on the smallest dataset and evaluate the runtime of the best found hyperparameters on all the others. Figure 1 shows the recorded runtime in seconds, with the number of tasks on a log scale. CMTL, BiFactor, and TriFactor were not able to process 40k tasks in less than 24 hours and were therefore stopped (the extrapolation of the minimum required runtime is shown as a dashed line). FuseMTL, SRMTL, L21, and Trace show a super-linear growth of the required runtime on the log scale. Both CCMTL and RMTL remain fast even for the largest number of tasks, and CCMTL is the fastest method among all the MTL competitors. A regression analysis of the runtime curves is presented in the supplementary material; it shows that only CCMTL and RMTL scale linearly, whereas the other methods scale quadratically.
Related Work
Zhang and Yang [Zhang and Yang2017], in their survey, classify multitask learning methods into different categories: feature learning, low-rank approaches, and task clustering approaches, among others. These categories are characterized mainly by how information is shared between the different tasks and which information is subject to sharing.
One line of MTL work performs joint feature learning (L21), which assumes that all tasks share a common set of features and enforces this with an ℓ2,1-norm regularization [Argyriou, Evgeniou, and Pontil2007, Argyriou, Evgeniou, and Pontil2008, Liu, Ji, and Ye2009]. Another way to capture the task relationship is to constrain the models of different tasks to share a low-dimensional subspace, i.e., W is of low rank (Trace) [Ji and Ye2009]. Both L21 and Trace assume that all the tasks are relevant, which is usually not true in real-world applications. Chen et al. [Chen, Zhou, and Ye2011] propose robust multitask learning (RMTL), which identifies irrelevant tasks by integrating low-rank and group-sparse structures. These methods are relatively fast but cannot capture the task relationship when tasks belong to different latent groups.
The task clustering approaches aim to solve this issue by letting different tasks form clusters of similar tasks. Jacob et al. [Jacob, Vert, and Bach2009] propose to integrate a k-means objective into the learning framework and to solve a relaxed convex problem. Zhou et al. [Zhou, Chen, and Ye2011a] use a similar idea for task clustering, but with a different optimization method. Zhou and Zhao [Zhou and Zhao2016] propose to cluster tasks by identifying representative tasks. Another way of performing task clustering is through the decomposition of the weight matrix [Kumar and Daume III2012, Barzilai and Crammer2015]. Later, a similar idea was applied with co-clustering of the features and the tasks [Murugesan, Carbonell, and Yang2017]. Despite being effective, these methods are expensive to train and require the number of clusters as a hyperparameter, which makes model tuning even more difficult.
Fused Multitask Learning (FuseMTL) [Zhou et al.2012, Chen et al.2010] and Graph regularized Multitask Learning (SRMTL) [Zhou, Chen, and Ye2011b] are the most closely related works, where the ℓ1 norm and the squared ℓ2 norm of model differences are used as regularizers, respectively. As shown in the experiments, with the ℓ2 norm CCMTL outperforms FuseMTL and SRMTL in most cases. In addition, the underlying optimization method is different, and the proposed CCMTL runs much faster. Another closely related work is a multi-level clustering method [Han and Zhang2015], whose objective function also uses the ℓ2 norm. It resembles the proposed objective if only one layer is considered [Zhang and Yang2017]. However, that method scales quadratically in the number of tasks, since the penalty is imposed on all pairs of tasks, making it unsuitable for the problem studied in this paper, with a massive number of tasks.
Conclusion
In this paper, we study the multitask learning problem with a massive number of tasks. We integrate convex clustering into the multitask regression learning problem, capturing the tasks' relationships on the nearest neighbor graph of the prediction models. Further, we present CCMTL, an approach that solves this problem efficiently and is guaranteed to converge to the global optimum. Extensive experiments show that CCMTL makes more accurate predictions and runs faster than SoA competitors on both synthetic and real-world datasets, and that it scales linearly in the number of tasks. CCMTL can serve as a method for a wide range of large-scale regression applications where the number of tasks is tremendous. In the future, we will explore the use of the proposed method for online learning and high-dimensional problems.
Acknowledgments
The authors would like to thank Luca Franceschi for the discussion of the paper and the anonymous reviewers for their helpful comments and Prof. Yiannis Koutis for sharing the implementation of CMG.
References
 [Argyriou, Evgeniou, and Pontil2007] Argyriou, A.; Evgeniou, T.; and Pontil, M. 2007. Multitask feature learning. In NIPS, 41–48.
 [Argyriou, Evgeniou, and Pontil2008] Argyriou, A.; Evgeniou, T.; and Pontil, M. 2008. Convex multitask feature learning. Machine Learning 73(3):243–272.
 [Bakker and Heskes2003] Bakker, B., and Heskes, T. 2003. Task clustering and gating for bayesian multitask learning. Journal of Machine Learning Research 4(May):83–99.
 [Barzilai and Crammer2015] Barzilai, A., and Crammer, K. 2015. Convex multitask learning by clustering. In Artificial Intelligence and Statistics, 65–73.
 [Beck2015] Beck, A. 2015. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization 25(1):185–209.
 [Chen et al.2010] Chen, X.; Kim, S.; Lin, Q.; Carbonell, J. G.; and Xing, E. P. 2010. Graphstructured multitask regression and an efficient optimization method for general fused lasso. arXiv preprint arXiv:1005.3579.
 [Chen, Zhou, and Ye2011] Chen, J.; Zhou, J.; and Ye, J. 2011. Integrating lowrank and groupsparse structures for robust multitask learning. In KDD, 42–50. ACM.
 [Chi and Lange2015] Chi, E. C., and Lange, K. 2015. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics 24(4):994–1013.
 [Cohen et al.2014] Cohen, M.; Kyng, R.; Miller, G.; Pachocki, J.; Peng, R.; Rao, A.; and Xu, S. 2014. Solving sdd linear systems in nearly m log 1/2 n time. In STOC, 343–352. ACM.
 [Deng et al.2017] Deng, D.; Shahabi, C.; Demiryurek, U.; and Zhu, L. 2017. Situation aware multitask learning for traffic prediction. In ICDM, 81–90. IEEE.
 [Dheeru and Karra Taniskidou2017] Dheeru, D., and Karra Taniskidou, E. 2017. UCI machine learning repository.
 [Hallac, Leskovec, and Boyd2015] Hallac, D.; Leskovec, J.; and Boyd, S. 2015. Network lasso: Clustering and optimization in large graphs. In KDD, 387–396. ACM.
 [Han and Zhang2015] Han, L., and Zhang, Y. 2015. Learning multilevel task groups in multitask learning. In AAAI, volume 15, 2638–2644.
 [He and MoreiraMatias2018] He, X., and MoreiraMatias, L. 2018. Robust continuous coclustering. arXiv preprint arXiv:1802.05036.
 [Hocking et al.2011] Hocking, T.; Joulin, A.; Bach, F.; and Vert, J. 2011. Clusterpath an algorithm for clustering using convex fusion penalties. In ICML.

 [Ioannis, Miller, and Tolliver2011] Ioannis, K.; Miller, G.; and Tolliver, D. 2011. Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing. Computer Vision and Image Understanding 115(12):1638–1646.
 [Jacob, Vert, and Bach2009] Jacob, L.; Vert, J.-P.; and Bach, F. R. 2009. Clustered multitask learning: A convex formulation. In NIPS, 745–752.
 [Ji and Ye2009] Ji, S., and Ye, J. 2009. An accelerated gradient method for trace norm minimization. In ICML, 457–464. ACM.
 [Kelner et al.2013] Kelner, J.; Orecchia, L.; Sidford, A.; and Zhu, Z. 2013. A simple, combinatorial algorithm for solving sdd systems in nearlylinear time. In STOC, 911–920. ACM.
 [Kumar and Daume III2012] Kumar, A., and Daume III, H. 2012. Learning task grouping and overlap in multitask learning. ICML.
 [Li, He, and Borgwardt2018] Li, L.; He, X.; and Borgwardt, K. 2018. Multitarget drug repositioning by bipartite blockwise sparse multitask learning. BMC systems biology 12(4):55.
 [Liu, Ji, and Ye2009] Liu, J.; Ji, S.; and Ye, J. 2009. Multitask feature learning via efficient l 2, 1norm minimization. In UAI, 339–348. AUAI Press.
 [Murugesan, Carbonell, and Yang2017] Murugesan, K.; Carbonell, J.; and Yang, Y. 2017. Coclustering for multitask learning. ICML.
 [Shah and Koltun2017] Shah, S. A., and Koltun, V. 2017. Robust continuous clustering. Proceedings of the National Academy of Sciences 114(37):9814–9819.
 [Tan and San Lau2014] Tan, S. C., and San Lau, J. P. 2014. Time series clustering: A superior alternative for market basket analysis. In Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng2013), 241–248. Springer, Singapore.

[Wang et al.2016]
Wang, Q.; Gong, P.; Chang, S.; Huang, T. S.; and Zhou, J.
2016.
Robust convex clustering analysis.
In ICDM, 1263–1268. IEEE.  [Zhang and Yang2017] Zhang, Y., and Yang, Q. 2017. A survey on multitask learning. arXiv preprint arXiv:1707.08114v2.
 [Zhang and Yeung2014] Zhang, Y., and Yeung, D.Y. 2014. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD) 8(3):12.
 [Zhou and Zhao2016] Zhou, Q., and Zhao, Q. 2016. Flexible clustered multitask learning by learning representative tasks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2):266–278.
 [Zhou et al.2012] Zhou, J.; Liu, J.; Narayan, V. A.; and Ye, J. 2012. Modeling disease progression via fused sparse group lasso. In KDD, 1095–1103. ACM.
 [Zhou, Chen, and Ye2011a] Zhou, J.; Chen, J.; and Ye, J. 2011a. Clustered multitask learning via alternating structure optimization. In NIPS, 702–710.
 [Zhou, Chen, and Ye2011b] Zhou, J.; Chen, J.; and Ye, J. 2011b. Malsar: Multitask learning via structural regularization. Arizona State University 21.