1 Introduction
Multi-task learning (MTL) is an area of machine learning that aims at exploiting relationships among tasks to improve the collective generalization performance of all tasks. In MTL, different tasks are learned jointly, so that knowledge is transferred from information-rich tasks through the task relationships [45] and the overall generalization error is reduced. MTL has been successfully applied in various domains ranging from transportation [8] to biomedicine [29]. The improvement over learning each task independently is most significant when each task has only a limited amount of training data [2].
Various multi-task learning algorithms have been proposed in the literature (see Zhang and Yang [43] for a comprehensive survey of state-of-the-art methods). Feature learning approaches [1] and low-rank approaches [6] assume that all tasks are related, which may not hold in real-world applications. Task clustering approaches [27] can deal with the situation where different tasks form clusters. He et al. [19] propose an MTL method that is both accurate and efficient, and thus applicable in the presence of a large number of tasks, as in the retail sector. However, despite being accurate and scalable, these methods lack interpretability when it comes to the task relationships.
Trustworthy Artificial Intelligence is an EU initiative to capture the main requirements of ethical AI. Transparency and human oversight are among the seven key requirements developed by the AI Expert Group [23]. Even if MTL improves performance with respect to individual models, predictions made by black-box methods cannot be used as a basis for decisions unless justified by interpretable models [17, 30].

Interpretability can be defined locally. LIME [36] and its generalizations [32] are local methods that extract, for each test sample, the features that contribute most to the prediction, and aim at finding a sparse model that describes the decision boundary. These methods are applied downstream of an independent black-box machine learning method that produces the prediction. Interpretability can also be achieved globally. For example, linear regression, logistic regression and decision trees
[17, 30] are considered interpretable^{1} (^{1}Interpretability also depends on the application; for example, it may be associated with weights being integer, and can thus be defined differently.), since their parameters are directly interpretable as weights on the input features. Global interpretability, on the other hand, can reduce accuracy. In general, MTL methods [37, 31] are not directly interpretable, unless the single tasks are learned as linear models, which are considered interpretable due to their simplicity [38, 14, 33]. This property, however, is no longer guaranteed when tasks and their relations are learned simultaneously, mainly because the relative importance of the task relationships is not revealed. Since natural phenomena are often characterized by sparse structures, we explore the interpretability that results from imposing sparsity on the relationships among tasks.

To fill the gap of interpretability in MTL, this paper introduces a novel algorithm, named Graph Guided Multi-Task regression Learning (GGMTL). It integrates the objective of joint interpretable (i.e., sparse) structure learning with multi-task model learning. GGMTL enjoys a closed-form hypergradient computation on the edge cost; it also provides a way to learn the graph's structure by exploiting the linear nature of the regression tasks, without excessively sacrificing the accuracy of the learned models. The contributions of this paper are manifold:

Bilevel MTL Model: A new model for the joint learning of sparse graph structures and multitask regression that employs graph smoothing on the prediction models (Sec.3.1);

Closed-form hypergradient: a closed-form solution for the hypergradient of the graph-smoothing multi-task problem (Sec.3.4);

Interpretable Graph: the learning of interpretable graph structures;

Accurate Prediction: accurate predictions on both synthetic and real-world datasets despite the improved interpretability (Sec.4);

Efficient computation: an efficient computation of the hypergradient of the proposed bilevel problem;

Veracity measures: measures to evaluate the fidelity of the learned MTL graph structure (Sec.4.1.1).
2 Related work
2.1 Multitask structure learning
Substantial efforts have been devoted to estimating the model parameters of each task and the mutual relationships (or dependencies) between tasks. Usually, such relationships are characterized by a dense task covariance matrix or a task precision matrix (i.e., the inverse of the covariance matrix). Early methods (e.g., [34]) assume that all tasks are related to each other. However, this assumption is over-optimistic and may be inappropriate for applications in which different tasks exhibit different degrees of relatedness. To tackle this problem, more elaborate approaches, such as clustering of tasks (e.g., [24]) or hierarchically structured tasks (e.g., [18]), have been proposed in recent years.

The joint convex learning of multiple tasks and a task covariance matrix was initiated in Multi-Task Relationship Learning (MTRL) [44]. Later, Bayesian Multitask with Structure Learning (BMSL) [13] improved MTRL by introducing sparsity constraints on the inverse of the task covariance matrix under a Bayesian optimization framework. On the other hand, the more recently proposed Multi-task Sparse Structure Learning (MSSL) [15] directly optimizes the precision matrix using a regularized Gaussian graphical model. One should note that, although the learned matrix carries the partial dependencies between pairs of tasks, there is no guarantee that the learned task covariance or precision matrix can be transformed into a valid graph Laplacian [9]. From this perspective, the task structures learned by these works suffer from poor interpretability.
2.2 Bilevel optimization in machine learning
Bilevel problems [7] arise when one problem (the outer problem) contains another optimization problem (the inner problem) as a constraint. Intuitively, the outer problem (master) defines its solution by predicting the behaviour of the inner problem (follower). In machine learning, hyperparameter optimization seeks the predictive model's parameters $w(\lambda)$, as a function of the hyperparameter vector $\lambda$, that minimize the validation error. This can be mathematically formulated as the bilevel problem

$$\min_{\lambda} \; E\big(w(\lambda), \lambda\big) \tag{1a}$$
$$\text{s.t.} \quad w(\lambda) \in \arg\min_{w} \; L(w, \lambda) \tag{1b}$$

The outer objective $E$ is the generalization error, evaluated with the hyperparameters $\lambda$ on the validation data, whereas the inner objective $L$ is the regularized empirical error on the training data; see [11]. The bilevel formulation has the advantage of optimizing two different cost functions (in the inner and outer problems) on different data (training/validation), thus alleviating overfitting and implementing an implicit cross-validation procedure.
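As a concrete illustration, the sketch below instantiates problem (1) for ridge regression, where the inner problem has a closed-form solution and the hypergradient follows from implicit differentiation. The data, sizes, and the single hyperparameter are toy assumptions of ours, not part of the paper's method.

```python
import numpy as np

# Toy instance of the bilevel problem (1a)-(1b):
# inner problem  = ridge regression on training data,
# outer problem  = squared error on validation data,
# hyperparameter = the regularization strength lam.
rng = np.random.default_rng(0)
Xtr, ytr = rng.normal(size=(30, 5)), rng.normal(size=30)
Xva, yva = rng.normal(size=(20, 5)), rng.normal(size=20)

def inner_solution(lam):
    """Closed-form minimizer of the inner (training) problem."""
    A = Xtr.T @ Xtr + lam * np.eye(5)
    return np.linalg.solve(A, Xtr.T @ ytr)

def outer_error(lam):
    """Outer objective: validation error of the inner minimizer."""
    r = Xva @ inner_solution(lam) - yva
    return 0.5 * r @ r

def hypergradient(lam):
    """dE/dlam via implicit differentiation: dw/dlam = -A^{-1} w."""
    A = Xtr.T @ Xtr + lam * np.eye(5)
    w = np.linalg.solve(A, Xtr.T @ ytr)
    dw = -np.linalg.solve(A, w)
    return (Xva @ w - yva) @ (Xva @ dw)

# Sanity check against a central finite difference.
eps = 1e-5
fd = (outer_error(1.0 + eps) - outer_error(1.0 - eps)) / (2 * eps)
assert abs(hypergradient(1.0) - fd) < 1e-6
```

The same implicit-differentiation pattern, with the inner linear system replaced by the graph-smoothed one, underlies the hypergradient of Sec.3.4.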
In the context of machine learning, bilevel optimization has been adopted mainly as a surrogate for time-consuming cross-validation, which otherwise requires a grid search in a high-dimensional space. For example, [25] formulates cross-validation as a bilevel optimization problem to train deep neural networks for improved generalization and reduced test error. [12] follows the same idea and applies bilevel optimization to the group Lasso [42] in order to determine the optimal group partition among a huge number of options.

Given the flexibility of bilevel optimization, it is natural to cast multi-task learning into this framework. Indeed, [28, Chapter 5] first presents such a formulation, by making each of the individual hyperplanes (of each task) less susceptible to variations within its respective training set; however, no solid examples or discussions are provided. This initial idea was significantly improved in [10], in which the outer problem optimizes a proxy of the generalization error over all tasks with respect to a task similarity matrix, and the inner problem estimates the parameters of each task assuming the task similarity matrix is known.

3 Graph guided MTL
3.1 Bilevel multitasking linear regression with graph smoothing
We consider the problem of finding regression models for $T$ tasks, with input/output data $(X_t, y_t)$ for $t \in [T]$, where $X_t \in \mathbb{R}^{n_t \times d}$, $y_t \in \mathbb{R}^{n_t}$, $d$ is the feature size, and $n_t$ is the number of samples of the $t$-th task^{2} (^{2}In the following we assume, for simplicity, the same number of samples for all tasks, but the results extend straightforwardly.). We split the data into validation and training sets and formulate the problem as a bilevel program:
$$\min_{e \ge 0} \;\; \sum_{t \in [T]} \big\| X_t^{\mathrm{val}} w_t(e) - y_t^{\mathrm{val}} \big\|_2^2 \;+\; \lambda_1 \|e\|_2^2 \;+\; \lambda_2 \|e\|_1 \;+\; \lambda_3 H(e) \tag{2a}$$
$$\text{s.t.} \quad W(e) \in \arg\min_{W} \; \sum_{t \in [T]} \big\| X_t^{\mathrm{tr}} w_t - y_t^{\mathrm{tr}} \big\|_2^2 \;+\; \rho \, \mathrm{tr}\big( W L(e) W^\top \big) \tag{2b}$$
where $W = [w_1, \ldots, w_T] \in \mathbb{R}^{d \times T}$ collects the models' vectors, $L(e) = B \,\mathrm{diag}(e)\, B^\top$ is the Laplacian matrix defined using the incidence matrix $B$, $e$ is the edge weight vector of the adjacency matrix, and $\mathbb{1}_i$ is the discrete indicator vector which is zero everywhere except at the $i$-th entry. We use $[n]$ for the set $\{1, \ldots, n\}$. The regularization term in the inner problem is the Dirichlet energy [5]

$$\mathrm{tr}\big( W L(e) W^\top \big) \;=\; \sum_{(i,j) \in E} e_{ij} \, \| w_i - w_j \|_2^2 \tag{3}$$

where $G = (V, E)$ is the graph whose Laplacian matrix is $L(e)$. $H(e) = -\sum_{k} e_k \ln e_k$ is the unnormalized entropy of the edge values.
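To make the smoothing term concrete, the following sketch (shapes and helper names are our own) builds the Laplacian $B\,\mathrm{diag}(e)\,B^\top$ from an edge list and checks that the trace form of the Dirichlet energy matches the pairwise sum.

```python
import numpy as np

# Build L = B diag(e) B^T from an edge list and weights e, and evaluate
# the Dirichlet energy sum_ij e_ij ||w_i - w_j||^2 = tr(W^T L W).
def incidence(num_nodes, edges):
    B = np.zeros((num_nodes, len(edges)))
    for k, (i, j) in enumerate(edges):
        B[i, k], B[j, k] = 1.0, -1.0
    return B

def dirichlet_energy(W, edges, e):
    """W: (tasks x features) stacked model vectors, e: edge weights."""
    B = incidence(W.shape[0], edges)
    L = B @ np.diag(e) @ B.T
    return np.trace(W.T @ L @ W)

edges = [(0, 1), (1, 2)]
W = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
e = np.array([1.0, 0.5])
# The pairwise form agrees with the trace form.
direct = sum(ek * np.sum((W[i] - W[j]) ** 2) for (i, j), ek in zip(edges, e))
assert np.isclose(dirichlet_energy(W, edges, e), direct)
```

Note how tasks 0 and 1 share the same model, so only the (1, 2) edge contributes to the energy.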
The inner problem (model learning) aims at finding the optimal models for a given structure (i.e., graph), while the outer problem (structure learning) minimizes a cost function that includes two parts: (1) the learned models' accuracy on the validation data, and (2) the sparseness of the graph. We capture the sparseness of the graph with three terms: (a) the $\ell_2$ norm of the edge values, measuring the energy of the graph, (b) the $\ell_1$ norm, measuring the sparseness of the edges, and (c) the entropy $H(e)$ of the edges. In the experiments, we limit the edges to values in the interval $[0,1]$, which can be interpreted as a relaxation of the mixed-integer nonlinear programming problem obtained when the edges are binary, as defined in Eq.2. The advantage of formulating MTL as a bilevel program (Eq.2) is the ability to derive a closed-form solution for the hypergradient (see Thm.3.1). Moreover, for a proper choice of the regularization parameters, all edge weights have a closed-form solution (see Thm.3.2). For the general case, we propose a gradient descent algorithm (Alg.1). The entropy regularization term has superior sparsification performance to the $\ell_1$ norm regularization [21]; thus, the latter can be ignored during hyperparameter search to reduce the search space, at the expense of some flexibility.

For simplicity, we define the functions:
$$E(e, W) \;=\; \sum_{t \in [T]} \big\| X_t^{\mathrm{val}} w_t - y_t^{\mathrm{val}} \big\|_2^2 \;+\; \lambda_1 \|e\|_2^2 \;+\; \lambda_2 \|e\|_1 \;+\; \lambda_3 H(e) \tag{4a}$$
$$L_{\mathrm{tr}}(W, e) \;=\; \sum_{t \in [T]} \big\| X_t^{\mathrm{tr}} w_t - y_t^{\mathrm{tr}} \big\|_2^2 \;+\; \rho \, \mathrm{tr}\big( W L(e) W^\top \big) \tag{4b}$$

which allow us to write the bilevel problem in the compact form:

$$\min_{e \ge 0} \; E\big(e, W(e)\big) \quad \text{s.t.} \quad W(e) \in \arg\min_{W} L_{\mathrm{tr}}(W, e) \tag{5}$$
The proposed formulation selects the sparsest graph among tasks that yields the best generalization performance on the validation dataset.
3.2 The squared-norm regularization GGMTL algorithm
We propose an iterative approach that computes the hypergradient of Eq.4a with respect to the graph edges (the hyperparameters); this hypergradient is then used to update the hyperparameters by gradient descent, i.e.,

$$e \;\leftarrow\; e - \eta \, \nabla_e E\big(e, W(e)\big) \tag{6}$$

where $\nabla_e E$ is the hypergradient and $\eta$ is the learning rate. Alg.1 depicts the structure of the GGMTL learning method. The stopping criterion is evaluated on the convergence of the validation and training errors. As a final step, the tasks' models are relearned on all training and validation data using the last discovered edge values.
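The outer loop of this scheme can be sketched as a projected gradient descent on the edge weights, keeping them in the interval $[0,1]$ discussed above; the stopping rule and the toy quadratic surrogate below are illustrative assumptions of ours, not the paper's exact Alg.1.

```python
import numpy as np

# Schematic outer loop (names hypothetical): gradient descent on the
# edge weights e with projection onto [0, 1], as in the update of Eq. (6).
def outer_descent(hypergrad, e0, lr=0.1, n_iter=200, tol=1e-8):
    e = np.clip(np.asarray(e0, dtype=float), 0.0, 1.0)
    for _ in range(n_iter):
        g = hypergrad(e)
        e_new = np.clip(e - lr * g, 0.0, 1.0)  # keep edges in [0, 1]
        if np.linalg.norm(e_new - e) < tol:    # simple stopping criterion
            return e_new
        e = e_new
    return e

# Toy quadratic surrogate with a known minimizer inside the box.
target = np.array([0.2, 0.8])
e_star = outer_descent(lambda e: e - target, np.array([1.0, 0.0]))
assert np.allclose(e_star, target, atol=1e-3)
```

In GGMTL the `hypergrad` callback would be the closed-form hypergradient of Thm.3.1, recomputed after solving the inner problem at the current edge weights.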
3.3 The non-squared norm regularization GGMTL algorithm
The energy smoothing term Eq.3 in the inner problem of Eq.2 is quadratic. Consequently, if two models are unrelated but connected by an erroneous edge, this term grows quadratically and dominates the loss. To reduce this undesirable effect, a penalty proportional to the distance (rather than its square) can be obtained with the non-squared norm. We therefore extend the inner problem of Eq.(2) to become:

$$W(e) \in \arg\min_{W} \; \sum_{t \in [T]} \big\| X_t^{\mathrm{tr}} w_t - y_t^{\mathrm{tr}} \big\|_2^2 \;+\; \rho \sum_{(i,j) \in E} e_{ij} \, \| w_i - w_j \|_2 \tag{7}$$

where the regularization term in the inner problem is the non-squared $\ell_2$ norm. This can be efficiently solved using alternating optimization [19], by defining the vector $\delta$ of edges' multiplicative weights such that:

$$\delta_{ij} \;=\; \frac{1}{2 \, \| w_i - w_j \|_2} \tag{8}$$
We can now formulate a new optimization problem equivalent to Eq.7:

$$W(e) \in \arg\min_{W} \; \sum_{t \in [T]} \big\| X_t^{\mathrm{tr}} w_t - y_t^{\mathrm{tr}} \big\|_2^2 \;+\; \rho \, \mathrm{tr}\big( W L(e \circ \delta) W^\top \big) \;+\; \frac{\rho}{4} \, e^\top \delta^{-1} \tag{9}$$

where $\circ$ is the element-wise product, $L(e \circ \delta)$ is the Laplacian matrix whose edge values are the element-wise product of $e$ and $\delta$, and $\delta^{-1}$ is short notation for the element-wise inverse of $\delta$. Having fixed $\delta$, the last term of Eq.9 can be ignored while optimizing the inner problem w.r.t. $W$. The modified algorithm (also given in the supplementary material, Alg.3) alternates between the closed-form solution of Eq.8 and the solution of Eq.9 over $W$.
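The reweighting behind Eq.8 is the classical iteratively-reweighted least-squares device for non-squared norms. The sketch below applies the very same update, $\delta = 1/(2\|\cdot\|_2)$, to the geometric-median problem rather than to the paper's inner problem, purely to illustrate the alternating scheme and its robustness to outliers.

```python
import numpy as np

# Illustration (not the paper's exact solver): replace each non-squared
# norm ||w - a_k|| by delta_k ||w - a_k||^2 with delta_k = 1/(2||w - a_k||),
# then alternate with a weighted least-squares step.
def geometric_median(points, n_iter=100, eps=1e-12):
    w = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(points - w, axis=1)
        delta = 1.0 / (2.0 * np.maximum(d, eps))            # Eq. (8)-style weights
        w = (delta[:, None] * points).sum(0) / delta.sum()  # weighted LS step
    return w

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [100.0, 100.0]])
med = geometric_median(pts)
# The median resists the outlier far better than the mean does.
assert np.linalg.norm(med - pts.mean(0)) > 10
```

The same alternation in GGMTL keeps edges that link genuinely similar models influential while preventing a single erroneous edge from dominating the loss.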
3.4 Hypergradient
The proposed method (Alg.1) is based on the computation of the hypergradient of Eq.2. This hypergradient has a closed form, as given by Thm.3.1, and can be computed efficiently.
Theorem 3.1
The hypergradient of the problem of Eq.2a is

(10)

where the reduced incidence matrix is built using only the non-zero edges (i.e., $e_k \neq 0$). The element-wise logarithm of the edge vector and the Hadamard product $\circ$ enter through the entropy term, and the element-wise sign function through the $\ell_1$ term; the remaining auxiliary variables are defined in the proof (Annex 0.A.1).
3.5 Closed-form hyperedges
As an alternative to applying gradient descent with hypergradient updates, the optimal edge values can also be computed directly. We compute the edge vector as the solution of the stationarity condition, since the optimal solution has zero hypergradient. When the squared norm is the only regularization term with non-zero weight in Eq.2a, the edge vector has a closed-form solution, as proven in Thm.3.2.
3.6 Complexity analysis
The GGMTL algorithm first computes a kNN graph over the tasks' models, whose computational complexity can be reduced to near-linear in the number of tasks [3]^{3} (^{3}or, using Approximate Nearest Neighbour (ANN) methods, reduced even further [22]); each iteration of GGMTL then computes the hypergradient. A naive implementation of this step requires inverting a linear system, whose complexity is cubic in its dimension. It may thus come as a surprise that the actual computational complexity of the GGMTL method is much lower: the dominant terms follow from Thm.3.3, together with a matrix-vector product that can be performed in parallel.

Theorem 3.3

The computational complexity of solving the hypergradient of Thm.3.1 is near-linear in the size of the system, since the system is sparse, symmetric and diagonally dominant (Thm.0.A.1) and can be handled by fast SDD solvers [39, 40].
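Because the linear system is SDD (Thm.0.A.1), iterative solvers apply; a minimal conjugate-gradient sketch follows, with an illustrative system and sizes of our choosing (production code would use a dedicated SDD or preconditioned solver).

```python
import numpy as np

# The dominant cost in the hypergradient is a sparse SDD solve, which
# iterative methods handle without forming a dense inverse.
def conjugate_gradient(A_mv, b, n_iter=200, tol=1e-10):
    """Solve A x = b given only the matrix-vector product A_mv."""
    x = np.zeros_like(b)
    r = b - A_mv(x)
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        Ap = A_mv(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# SDD example: graph Laplacian of a path plus a positive diagonal.
L = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
A = L + np.eye(3)
b = np.array([1.0, 0.0, -1.0])
x = conjugate_gradient(lambda v: A @ v, b)
assert np.allclose(A @ x, b, atol=1e-8)
```

Only matrix-vector products with the sparse system are needed, which is what makes the per-iteration cost of the hypergradient modest in practice.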
4 Experimental results
We evaluate the performance of GGMTL against four state-of-the-art multi-task learning methods (namely MTRL [44], MSSL [15], BMSL [13], and CCMTL [19]) on both synthetic data and real-world applications. Among the four competitors, MTRL learns a task covariance matrix, while MSSL and BMSL directly learn a precision matrix which can be interpreted as a graph Laplacian. By contrast, CCMTL does not learn the task relationships, but uses a fixed kNN graph before learning the model parameters^{4} (^{4}We performed a grid search over the hyperparameters of all methods.).
4.1 Measures
4.1.1 Synthetic dataset measure for veracity
To evaluate the performance of the proposed method on the synthetic datasets, we propose a reformulation of the measures accuracy, recall and precision based on the Łukasiewicz fuzzy T-norm and T-conorm [26], with truth values taken from the interval $[0,1]$. Given the ground-truth graph and the predicted graph (on the same tasks) with proper adjacency matrices $A$ and $\hat{A}$ (i.e., entries in $[0,1]$), recall, precision and accuracy are redefined by replacing the crisp logical operations with their Łukasiewicz counterparts: the fuzzy conjunction (T-norm) of $\hat{A}$ and $A$ replaces the count of common edges, and the fuzzy XOR [4] replaces the crisp disagreement count in the accuracy. The definition of the F1 score remains unchanged as the harmonic mean of precision and recall. These measures quantify the overlap between a predicted (weighted) graph and a ground-truth sparse structure, analogously to imbalanced classification. An alternative, less informative approach would be to compute the Hamming distance between the two adjacency matrices (ground truth and induced), provided they are both binary.
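A sketch of these measures with the Łukasiewicz operators follows. Since the exact aggregation formulas are not recoverable from this version of the text, the recall/precision/accuracy below are one natural instantiation over the off-diagonal entries, not necessarily the paper's exact definitions.

```python
import numpy as np

# Lukasiewicz T-norm, T-conorm, and the induced fuzzy XOR.
def t_norm(a, b):
    return np.maximum(0.0, a + b - 1.0)

def t_conorm(a, b):
    return np.minimum(1.0, a + b)

def fuzzy_xor(a, b):  # S(T(a, 1-b), T(1-a, b)) = |a - b| for Lukasiewicz
    return t_conorm(t_norm(a, 1.0 - b), t_norm(1.0 - a, b))

def veracity(A_true, A_pred):
    """One plausible fuzzy recall/precision/accuracy/F1 on weighted graphs."""
    iu = np.triu_indices_from(A_true, k=1)   # upper-triangular edge entries
    t, p = A_true[iu], A_pred[iu]
    recall = t_norm(p, t).sum() / t.sum()
    precision = t_norm(p, t).sum() / p.sum()
    accuracy = 1.0 - fuzzy_xor(p, t).mean()
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, accuracy, f1

A = np.array([[0.0, 1.0], [1.0, 0.0]])
r, p, a, f = veracity(A, A)
assert r == p == a == f == 1.0              # perfect prediction scores 1 everywhere
assert np.allclose(fuzzy_xor(0.3, 0.8), 0.5)
```

Note that for the Łukasiewicz operators the fuzzy XOR reduces to the absolute difference of truth values, which makes the fuzzy accuracy a graded version of the Hamming-based one.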
4.1.2 Regression performance
The generalization performance is measured in terms of the Root Mean Square Error (RMSE) averaged over tasks.
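Concretely, we compute the RMSE of each task separately and then average over tasks (macro-averaging); a minimal helper of our own making is:

```python
import numpy as np

def avg_rmse(preds, trues):
    """RMSE per task, then an unweighted average over tasks."""
    return float(np.mean([np.sqrt(np.mean((p - t) ** 2))
                          for p, t in zip(preds, trues)]))

# One task, errors of +1 and -1 -> RMSE 1.0.
assert avg_rmse([np.array([1.0, 1.0])], [np.array([0.0, 2.0])]) == 1.0
```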
Table 1: Veracity measures (accuracy, recall, precision, F1) and RMSE on the Line, Tree and Star synthetic datasets for BMSL, MSSL, CCMTL and GGMTL. (The numeric entries were lost in extraction.)
4.2 Synthetic data
In order to evaluate the veracity of the proposed method, we generate three synthetic datasets in which the underlying structure of the relationships among tasks is known. Each task is a linear regression task whose output is controlled by its weight vector; each input variable is generated from an isotropic multivariate Gaussian distribution, and the output is obtained from the corresponding noisy linear model.

The first dataset, Line, mimics the structure of a line, where each task is generated with an overlap with its predecessor task. The coefficient vector of each task is the point-wise product of a random vector with uniformly distributed elements and a binary vector whose elements are Bernoulli distributed.

The tasks of the second dataset, Tree, are created in a hierarchical manner simulating a tree structure: each task's coefficient vector is a perturbed copy of that of its parent task, and the root task's vector is drawn at random. In order to create a proper binary tree, we generate enough tasks to fill a tree of five levels.

The distribution of the third dataset's tasks takes a star-shaped structure, hence the name Star. The coefficient vector of each peripheral task is created randomly, and that of the center task is a mixture of them.

We evaluate the performance of our method in comparison to the other methods on two aspects: (i) the ability to learn graphs that recover hidden sparse structures, and (ii) the generalization performance, for which we use the Root Mean Square Error (RMSE).
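A hedged sketch of the Line construction follows; the paper's exact dimensions, probabilities and noise level were lost in extraction, so the constants below are placeholders.

```python
import numpy as np

# Placeholder reconstruction of the Line dataset: each task's coefficient
# vector shares most of its support with its predecessor, so neighbouring
# tasks on the line are strongly related.
def make_line_tasks(n_tasks=10, dim=20, n_samples=50, keep_prob=0.9, seed=0):
    rng = np.random.default_rng(seed)
    c = rng.uniform(-1.0, 1.0, size=dim)            # shared base vector
    mask = rng.random(dim) < keep_prob              # Bernoulli support mask
    tasks, coefs = [], []
    for _ in range(n_tasks):
        w = np.where(mask, c, 0.0)                  # point-wise product
        X = rng.normal(size=(n_samples, dim))       # isotropic Gaussian inputs
        y = X @ w + 0.1 * rng.normal(size=n_samples)
        tasks.append((X, y))
        coefs.append(w)
        # Flip a few support entries so the next task overlaps its predecessor.
        flip = rng.random(dim) < (1 - keep_prob)
        mask = np.where(flip, ~mask, mask)
    return tasks, np.array(coefs)

tasks, W = make_line_tasks()
assert W.shape == (10, 20) and tasks[0][0].shape == (50, 20)
```

Tree and Star follow the same recipe, replacing the chain of support flips with parent-to-child perturbations or with a central mixture task, respectively.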
Tab.1 compares the graphs learned by GGMTL with those of CCMTL, and with the covariance matrices of MSSL and BMSL treated as adjacency matrices after a few adjustments^{5} (^{5}A negative correlation is considered a missing edge between tasks; hence, negative entries are set to zero. Moreover, after setting the diagonal to zero, we normalize each matrix by dividing by its largest entry.). GGMTL achieves the best accuracy everywhere except on the Star dataset, where MSSL performs best; this occurs only because MSSL predicts an extremely sparse matrix, leading to poor recall, precision and F1 score. Moreover, GGMTL always has the best F1 score, achieved by correctly predicting the right balance between edges (second-best recall) and sparseness (always the best precision), thus correctly interpreting and revealing the latent structure of the task relations. Beyond the quantitative measures, interpretability is also confirmed qualitatively in Fig.1, where the discovered edges reveal the ground-truth structures to a large extent. The figure also plots the kNN graph next to that of GGMTL; the GGMTL graphs are a refinement of the kNN graphs, removing misplaced edges while maintaining the relevant ones among tasks. Finally, Tab.1 also shows that GGMTL commits the smallest generalization error in terms of RMSE, with a large margin over BMSL and MSSL.
Table 2: RMSE on the Parkinson dataset for different training-set ratios (split), for MTRL, MSSL, BMSL, CCMTL and GGMTL. (The numeric entries were lost in extraction.)
4.3 Realworld applications
4.3.1 Parkinson’s disease assessment.
Parkinson is a benchmark multi-task regression dataset^{6} (^{6}https://archive.ics.uci.edu/ml/datasets/parkinsons+telemonitoring) comprising a range of biomedical voice measurements taken from patients with early-stage Parkinson's disease. For each patient, the goal is to predict the motor Unified Parkinson's Disease Rating Scale (UPDRS) score from a record comprising age, gender, and jitter and shimmer voice measurements. We treat UPDRS prediction for each patient as a separate task. We compare the generalization performance of GGMTL with that of the other baselines when different ratios of the data are used for training. The results depicted in Tab.2 show that the performance of GGMTL is close to that of CCMTL, while outperforming MSSL and BMSL. However, when plotting the learned graphs (see Fig.2(a) and Fig.2(b)), GGMTL clearly manages to separate patients into a few distinct groups, unlike the heavily connected kNN graph used by CCMTL. Interestingly, these groups are easily distinguished by Markov Clustering [41] when applied to the learned graph; the very same procedure fails to find reasonable clusters when applied to the kNN graph (35 clusters were discovered with only one task in each, and one cluster with five tasks).
4.3.2 Exam score prediction
Table 3: RMSE on the School dataset for MTRL, MSSL, BMSL, CCMTL and GGMTL. (The numeric entries were lost in extraction.)
School is a classical benchmark dataset in multi-task regression [1, 27, 45]; it consists of the examination scores of students from schools in London. Each school is treated as a task, and the aim is to predict the exam scores of all students. The School dataset is available in the MALSAR package [46].
Tab.3 reports the RMSE on the School dataset. Both GGMTL and CCMTL perform similarly, and both dominate the other methods. As with the Parkinson data, Fig.2(c) and Fig.2(d) compare the graphs induced by GGMTL and CCMTL (kNN), together with the result of applying Markov clustering to their nodes. These figures show again that the graphs induced by GGMTL are easier to interpret and lead to well-separated clusters with only a few inter-cluster edges.
4.3.3 Temperature forecasting in U.S.
Table 4: RMSE on the Temperature dataset for forecasting horizons of 8 and 16 hours, for MTRL, MSSL, BMSL, CCMTL and GGMTL; NA indicates a method that did not complete. (The remaining numeric entries were lost in extraction.)
The Temperature dataset^{7} (^{7}https://www.ncdc.noaa.gov/dataaccess/landbasedstationdata/landbaseddatasets/climatenormals/19812010normalsdata) contains hourly temperature data for major cities in the United States. The data is cleaned and manipulated as described in [20]. Temperature forecasting with a horizon of 8 or 16 hours in advance at each station is modeled as a task. We select the first observations (roughly a few weeks of data) for training and leave the remaining observations for testing, using the previous hours of temperature as input to the model. Tab.4 reports the RMSE of the methods. Learning the graph structure with GGMTL does not hurt performance in terms of regression error. Fig.2(e,f) shows the node clustering on the graph learned on the Temperature dataset, where the number of edges is markedly reduced by GGMTL.
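The per-station task construction can be sketched as a standard sliding-window transform; the lag and horizon values below are illustrative, not the paper's exact settings.

```python
import numpy as np

# Build one forecasting task from a single station's hourly series:
# features = the previous `lag` hours, target = temperature `horizon`
# hours ahead.
def make_forecast_task(series, lag=6, horizon=8):
    X, y = [], []
    for t in range(lag, len(series) - horizon):
        X.append(series[t - lag:t])    # previous `lag` hours as features
        y.append(series[t + horizon])  # value `horizon` hours ahead as target
    return np.array(X), np.array(y)

series = np.sin(np.arange(100) * 2 * np.pi / 24)  # synthetic daily cycle
X, y = make_forecast_task(series)
assert X.shape == (86, 6) and y.shape == (86,)
```

Repeating this transform per station yields the collection of related regression tasks on which the graph is learned.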
5 Conclusions
In this work, we present a novel formulation of joint multi-task and graph-structure learning as a bilevel problem, and propose an efficient method for solving it based on a closed form of the hypergradient. We also show the interpretability of the proposed method on synthetic and real-world datasets, and we analyze its computational complexity.
References
 [1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multitask feature learning. In NIPS, pages 41–48, 2007.
 [2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multitask feature learning. Machine Learning, 73(3):243–272, 2008.
 [3] Sunil Arya, David M Mount, Nathan S Netanyahu, Ruth Silverman, and Angela Y Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.
 [4] Benjamín C Bedregal, Renata HS Reiser, and Graçaliz P Dimuro. Xorimplications and eimplications: classes of fuzzy implications based on fuzzy xor. Electronic notes in theoretical computer science, 247:5–18, 2009.
 [5] Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In Advances in Neural Information Processing Systems 14, pages 585–591. MIT Press, 2002.
 [6] Jianhui Chen, Jiayu Zhou, and Jieping Ye. Integrating lowrank and groupsparse structures for robust multitask learning. In KDD, pages 42–50. ACM, 2011.
 [7] Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007.
 [8] Dingxiong Deng, Cyrus Shahabi, Ugur Demiryurek, and Linhong Zhu. Situation aware multitask learning for traffic prediction. In ICDM, pages 81–90. IEEE, 2017.
 [9] Xiaowen Dong, Dorina Thanou, Pascal Frossard, and Pierre Vandergheynst. Learning laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing, 64(23):6160–6173, 2016.

 [10] Rémi Flamary, Alain Rakotomamonjy, and Gilles Gasso. Learning constrained task similarities in graph-regularized multi-task learning. In Regularization, Optimization, Kernels, and Support Vector Machines, page 103, 2014.
 [11] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimilano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv:1806.04910 [cs, stat], June 2018.
 [12] Jordan Frecon, Saverio Salzo, and Massimiliano Pontil. Bilevel learning of the group lasso structure. In Advances in Neural Information Processing Systems, pages 8301–8311, 2018.
 [13] Andre Goncalves, Priyadip Ray, Braden Soper, David Widemann, Mari Nygård, Jan F Nygård, and Ana Paula Sales. Bayesian multitask learning regression for heterogeneous patient cohorts. Journal of Biomedical Informatics: X, 4:100059, 2019.
 [14] Andre R. Goncalves, Puja Das, Soumyadeep Chatterjee, Vidyashankar Sivakumar, Fernando J. Von Zuben, and Arindam Banerjee. Multitask Sparse Structure Learning. Proceedings of the 23rd ACM CIKM ’14, 2014.
 [15] André R Gonçalves, Fernando J Von Zuben, and Arindam Banerjee. Multitask sparse structure learning with gaussian copula models. The Journal of Machine Learning Research, 17(1):1205–1234, 2016.
 [16] Andre R Goncalves, Fernando J Von Zuben, and Arindam Banerjee. Multitask Sparse Structure Learning with Gaussian Copula Models. Journal of Machine Learning Research 17 (2016), page 30, 2016.
 [17] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A Survey of Methods for Explaining Black Box Models. ACM Computing Surveys, 51(5):1–42, August 2018.
 [18] Lei Han and Yu Zhang. Learning multilevel task groups in multitask learning. In AAAI, volume 15, pages 2638–2644, 2015.
 [19] Xiao He, Francesco Alesiani, and Ammar Shaker. Efficient and Scalable Multitask Regression on Massive Number of Tasks. In The ThirtyThird AAAI Conference on Artificial Intelligence (AAAI19), 2019.
 [20] Fei Hua, Roula Nassif, Cédric Richard, Haiyan Wang, and Ali H Sayed. Online distributed learning over graphs with multitask graphfilter models. IEEE Transactions on Signal and Information Processing over Networks, 6:63–77, 2020.
 [21] Shuai Huang and Trac D. Tran. Sparse Signal Recovery via Generalized Entropy Functions Minimization. IEEE Transactions on Signal Processing, 67(5):1322–1337, March 2019.
 [22] Ville Hyvönen, Teemu Pitkänen, Sotiris Tasoulis, Elias Jääsaari, Risto Tuomainen, Liang Wang, Jukka Corander, and Teemu Roos. Fast knn search. arXiv:1509.06957, 2015.
 [23] HighLevel Expert Group on Artificial Intelligence. Policy and investment recommendations for trustworthy AI. June 2019. Publisher: European Commission Type: Article; Article/Report.
 [24] Laurent Jacob, Jeanphilippe Vert, and Francis R Bach. Clustered multitask learning: A convex formulation. In NIPS, pages 745–752, 2009.

 [25] Simon Jenni and Paolo Favaro. Deep bilevel learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 618–633, 2018.
 [26] Erich Peter Klement, Radko Mesiar, and Endre Pap. Triangular Norms. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2000.
 [27] Abhishek Kumar and Hal Daume III. Learning task grouping and overlap in multitask learning. ICML, 2012.
 [28] Gautam Kunapuli. A bilevel optimization approach to machine learning. PhD thesis, PhD thesis, Rensselaer Polytechnic Institute, 2008.
 [29] Limin Li, Xiao He, and Karsten Borgwardt. Multitarget drug repositioning by bipartite blockwise sparse multitask learning. BMC systems biology, 12, 2018.
 [30] Zachary C. Lipton. The mythos of model interpretability. Communications of the ACM, 61(10):36–43, September 2018.
 [31] Pengfei Liu, Jie Fu, Yue Dong, Xipeng Qiu, and Jackie Chi Kit Cheung. Multitask Learning over Graph Structures. arXiv:1811.10211 [cs], November 2018.
 [32] Scott Lundberg and SuIn Lee. A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874 [cs, stat], November 2017. arXiv: 1705.07874.
 [33] Keerthiram Murugesan, Jaime Carbonell, Hanxiao Liu, and Yiming Yang. Adaptive Smoothed Online MultiTask Learning. page 11, 2016.
 [34] Guillaume Obozinski, Ben Taskar, and Michael I Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.
 [35] Kaare Brandt Petersen and Michael Syskind Pedersen. The Matrix Cookbook. February 2008.
 [36] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. ”Why Should I Trust You?”: Explaining the Predictions of Any Classifier. arXiv:1602.04938 [cs, stat], August 2016.
 [37] Sebastian Ruder. An overview of multitask learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
 [38] Avishek Saha, Piyush Rai, Hal Daume Iii, and Suresh Venkatasubramanian. Online Learning of Multiple Tasks and Their Relationships. page 9, 2011.
 [39] Daniel A. Spielman and ShangHua Teng. Solving Sparse, Symmetric, DiagonallyDominant Linear Systems in Time $O (m^{1.31})$. arXiv:cs/0310036, March 2004.
 [40] Daniel A. Spielman and ShangHua Teng. NearlyLinear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems. arXiv:cs/0607105, September 2012.
 [41] Stijn Marinus Van Dongen. Graph clustering by flow simulation. PhD thesis, 2000.
 [42] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
 [43] Yu Zhang and Qiang Yang. A survey on multitask learning. arXiv preprint arXiv:1707.08114v2, 2017.
 [44] Yu Zhang and DitYan Yeung. A convex formulation for learning task relationships in multitask learning. In Proceedings of the TwentySixth Conference on UAI, pages 733–742, 2010.
 [45] Yu Zhang and DitYan Yeung. A regularization approach to learning task relationships in multitask learning. ACM Transactions on TKDD, 8, 2014.
 [46] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Malsar: Multitask learning via structural regularization. Arizona State University, 21, 2011.
Appendix 0.A Supplementary material
This Supplementary material contains:

The proofs of the Theorems stated in the main part (Annex 0.A.1);

the two versions of the algorithm (Annex 0.A.2);

additional experiments on North America climate data, which also demonstrate the computational efficiency of the method on moderately large datasets (Annex 0.A.3).
0.A.1 Proof of theorems
Proof (Theorem 3.1)
Stacking all task models into a single vector, and defining the corresponding block matrix of inputs and the stacked output vector, the solution of the lower-level problem is
or
(12) 
where we define the auxiliary matrix
Using the Sherman–Morrison formula, and with the auxiliary definitions above, we can write the increment on an edge in terms of the difference of the models,
(13)  
(14)  
(15) 
if then
(16)  
since (see also Thm.0.A.2). Thus we obtain the gradient of the parameters with respect to the hyperparameters
(17)  
(18) 
where , , and .
We have that, for all models,
The general expression of the hypergradient (i.e., the total derivative) is given by
(19) 
where the former denotes the partial derivative and the latter the total derivative, taken with respect to the parameters and hyperparameters respectively. Noting that the entropy gradient can be expressed with the Hadamard product, we can now write the hypergradient for our problem
(21)  
where the auxiliary variables are as defined previously.
Theorem 0.A.1
The matrix of Thm. 3.1 defines a Sparse, Symmetric, DiagonallyDominant (SDD) linear system.
Proof (Theorem 0.A.1)
The property follows from inspecting the components of the system matrix. The first term is a block diagonal matrix whose blocks are copies of the Laplacian matrix of the graph; this matrix is symmetric and sparse. The second term is a block diagonal matrix whose blocks are symmetric matrices. Thus the system is an SDD linear system.
Theorem 0.A.2
The directional derivative of along is .
Proof (Theorem 0.A.2)
According to [35] (page 19, Ch. 3.4, Eq. 167), the gradient of the Euclidean norm is the normalized argument whenever the argument is non-zero. Considering the directional derivative of the norm along the given direction and taking the limit, it follows that the directional derivative is as stated.
Proof (Theorem 3.2)
We notice that by setting the hypergradient to zero we can compute the optimal edge vector in closed form. Under the stated assumptions, we have
where . Suppose that and , we have
(22)  
(23)  
(24)  
(25)  