1 Introduction
In machine learning the goal is often to train predictive models for one or more tasks of interest. Making accurate predictions relies heavily on the availability of labeled data for the desired tasks. However, in real-world problems data is often hard to acquire (e.g., medical domains) or expensive to label (e.g., image segmentation), which makes it impractical or impossible to collect large volumes of labeled data for many tasks. Multitask learning is a learning paradigm that aims to improve generalization performance on a learning task by learning models for multiple related tasks simultaneously. It has received considerable interest in the past decades (Caruana, 1997; Evgeniou and Pontil, 2004; Argyriou et al., 2007, 2008; Kato et al., 2008; Liu et al., 2009; Jacob et al., 2009; Zhang and Yeung, 2010a; Zhang and Schneider, 2010; Chen et al., 2011, 2012; Li et al., 2015; Jawanpuria et al., 2015; Adel et al., 2017). One of the underlying assumptions behind many multitask learning algorithms is that the tasks are related to each other. Hence, a key question is how to define the notion of task relatedness and how to capture it in the learning formulation. A common assumption is that tasks can be described by weight vectors, and that they are sampled from a shared prior distribution over their space
(Liu et al., 2009; Zhang and Yeung, 2010a,b). Another strand of work assumes common feature representations to be shared among multiple tasks, where the goal is to learn the shared representation as well as task-specific parameters simultaneously (Thrun, 1996; Caruana, 1997; Evgeniou and Pontil, 2007; Argyriou et al., 2008). Moreover, when structure about multiple tasks is available, e.g., task-specific descriptors (Bonilla et al., 2007) or a task similarity graph (Evgeniou and Pontil, 2004), regularizers can often be incorporated into the learning formulation to penalize hypotheses that are not consistent with the given structure. There have been several attempts to improve predictions by either learning the relationships between different tasks (Zhang and Yeung, 2010a) or by exploiting the relationships between different features (Argyriou et al., 2008). In this paper we consider a multi-convex framework for multitask learning that improves predictions over tabula rasa learning by assuming that all the task vectors are sampled from a common matrix-variate normal prior. The framework, known as MTFRL (Zhang and Schneider, 2010), learns the relationships both between tasks and between features simultaneously via two covariance matrices: the feature covariance matrix and the task covariance matrix. In this context, learning multiple tasks corresponds to estimating a matrix of model parameters, and learning feature/task relationships corresponds to estimating the row/column covariance matrices of the model parameters, respectively. This property is favorable for applications where we not only aim for better generalization, but also seek a clear understanding of the relationships among different tasks.
The goal of MTFRL is to optimize over both the task vectors and the two covariance matrices in the prior. When the loss function is convex, the regularized MTFRL problem is multi-convex. Previous approaches (Zhang and Schneider, 2010; Zhang, 2011; Long et al., 2017) for solving this problem hinge on the classic flip-flop algorithm (Dutilleul, 1999) to estimate the two covariance matrices. However, as we point out in Sec. 3, the flip-flop algorithm cannot be directly applied, since the maximum likelihood estimation (MLE) formulation of the multitask learning problem under this setting is ill-posed. As a result, in practice, heuristics have to be invented and applied in the algorithm to ensure the positive-definiteness of both covariance matrices, and it is not clear whether the algorithm with such fixes still converges.
Our Contributions
In this paper we propose a variant of the MTFRL framework that is well-defined, and design a block coordinate-wise minimization algorithm to solve it. We term our new formulation FEaTure and Relation learning (FETR). By design, FETR is free of the non-positive-definite problem in MTFRL. To solve FETR, we propose efficient and analytic solutions for each of the subproblems, which allows us to dispense with the expensive iterative procedures used to optimize the covariance matrices. Specifically, we achieve this by reducing an underlying matrix optimization problem with positive definite constraints to a minimum-weight perfect matching problem on a complete bipartite graph, which we are able to solve analytically using combinatorial techniques. To solve the weight learning subproblem, we propose three different strategies: a closed-form solution, a gradient descent method with a linear convergence guarantee when the instances are not shared by multiple tasks, and a numerical solution based on the Sylvester equation when instances are shared. We demonstrate the efficiency of the proposed optimization algorithm by comparing it with an off-the-shelf projected gradient descent algorithm and the classic flip-flop algorithm, on both synthetic and real-world data. Experiments show that the proposed optimization method is orders of magnitude faster than its competitors, and that it often converges to better solutions. Lastly, we extend FETR to the nonlinear setting by combining its regularization scheme with rich nonlinear transformations using neural networks. This combined approach achieves significantly better generalization than existing methods on real-world datasets.
2 Preliminary
We start by introducing the notation used throughout the paper and briefly reviewing the MTFRL framework (Zhang and Schneider, 2010). Readers are referred to Zhang and Yang (2017) for a recent survey of multitask learning algorithms.
2.1 Notation and Setup
We use lowercase letters, such as , to represent scalars, and lowercase bold letters, such as , to denote vectors. Capital letters are reserved for matrices. We use and to denote the -dimensional symmetric positive semidefinite cone and the -dimensional symmetric positive definite cone, respectively. We write for the trace of a matrix , and for the multivariate normal distribution with mean and covariance matrix . Finally, is a weighted bipartite graph with vertex sets and , edge set , and weight function ; denotes the set . For a matrix , we use to denote its vectorization.

We consider the following setup. Suppose we are given learning tasks , where for each learning task we have access to a training set with data instances . For simplicity of discussion we focus on the regression setting, where and . The extension to the classification setting is straightforward. Let be our model with parameter . In what follows, we assume the model for each task to be a linear regression, i.e., .

2.2 Matrix-Variate Normal Distribution
A matrix-variate normal distribution (Gupta and Nagar, 1999) with mean , row covariance matrix , and column covariance matrix can be understood as a multivariate normal distribution with . One advantage of the matrix-variate normal distribution over its equivalent multivariate counterpart is that, by imposing structure on the row and column covariance matrices, the former admits a much more compact representation than the latter ( versus ). The MLE of the matrix-variate normal distribution has been well studied in the literature (Dutilleul, 1999). Specifically, given an i.i.d. sample from , the MLE of is the sample mean . The MLEs of and are solutions to the following system:
(1) 
The above system of equations does not have a closed-form solution, as the two covariance estimates depend on each other. Hence, their estimates must be computed in an iterative fashion until convergence, which is known as the "flip-flop" algorithm (Dutilleul, 1999; Glanz and Carvalho, 2013). Furthermore, Dutilleul (1999) showed that the flip-flop algorithm is guaranteed to converge to positive definite covariance matrices iff . More properties of the MLE of the matrix-variate normal distribution can be found in (Roś et al., 2016).
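As a concrete reference, the flip-flop iteration can be sketched in a few lines of NumPy. This is a minimal sketch of Dutilleul's procedure under the standard MLE system above; the function and variable names are ours, not the paper's:

```python
import numpy as np

def flip_flop_mle(X, n_iter=50):
    """Flip-flop MLE for a matrix-variate normal (sketch after Dutilleul, 1999).

    X has shape (n, p, q): n i.i.d. p-by-q samples. Returns the mean M,
    the row covariance Sigma (p x p) and the column covariance Psi (q x q).
    """
    n, p, q = X.shape
    M = X.mean(axis=0)                 # MLE of the mean is the sample mean
    R = X - M                          # centered samples
    Sigma, Psi = np.eye(p), np.eye(q)  # any feasible (PD) initialization
    for _ in range(n_iter):
        # Alternate the two coupled estimating equations until convergence.
        Psi_inv = np.linalg.inv(Psi)
        Sigma = sum(r @ Psi_inv @ r.T for r in R) / (n * q)
        Sigma_inv = np.linalg.inv(Sigma)
        Psi = sum(r.T @ Sigma_inv @ r for r in R) / (n * p)
    return M, Sigma, Psi
```

With enough samples the iteration converges to positive definite estimates; the point of Sec. 3 is precisely that with a single sample, as in the multitask setting, it cannot.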
2.3 Multitask Feature and Relationship Learning
In linear regression, the likelihood function for task is given by . Let be the model parameters for the different tasks, drawn from the matrix-variate normal distribution . By maximizing the joint distribution and optimizing over both the model parameters and the two covariance matrices in the prior, we arrive at the following optimization problem:
subject to  (2) 
where are the row and column precision matrices of the matrix normal prior distribution, respectively, and is a constant that does not depend on the optimization variables. It is not hard to see that the optimization problem in (2) is not convex, due to the coupling between and in the trace term. On the other hand, since is concave on the positive definite cone (Boyd and Vandenberghe, 2004) and the trace is linear in its components, it follows that (2) is multi-convex. Zhang and Schneider (2010) proposed to use the flip-flop algorithm to solve the matrix subproblem in (2), and this approach has been widely applied in follow-up work on multitask learning (Zhang, 2011; Li et al., 2014; Long et al., 2017).
3 Ill-posed Optimization
In this section we first point out an important issue in the literature on the application of the flip-flop algorithm to the matrix subproblem in (2). We then propose a well-defined variant of (2) that fixes the problem. Interestingly, the variant we propose admits a closed-form solution for each block variable that can be computed efficiently without any iterative procedure, as we describe and derive in more detail in Sec. 4.
As proved by Dutilleul (1999), one sufficient and necessary condition for the flip-flop algorithm to converge to positive definite matrices is that the number of samples from the matrix-variate normal distribution satisfy . However, in the context of multitask learning, we are essentially dealing with an inference problem, where the goal is to estimate the value of , which is assumed to be an unknown but unique model parameter drawn from the prior. This means that in this case we have , hence the condition for the convergence of the algorithm is violated. Technically, for any with , following the iterative update formula of the flip-flop algorithm in (1), for any feasible initialization of and , we will have . Now since , we know that . As a result, after one iteration, we will have
i.e., at least one of and is going to be rank deficient, and in the next iteration the inverse operation is not well-defined on at least one of them. As a fix, Zhang and Schneider (2010) proposed to use an artificial fudge factor to ensure that both covariance matrices stay positive definite after each update:
where is a fixed, small constant. However, since the fudge factor is a fixed constant that does not decrease to 0 in the limit, it also introduces extra bias into the estimation, and it is not clear whether the modified algorithm converges.
Perhaps more surprising is that (2) is not even well-defined as an optimization problem. As a counterexample, we can fix and let , with , so that both and are feasible. Now let , and it is easy to verify that in this case the objective function diverges to −∞. Although we only provide one counterexample, there is no reason to believe that it is the only case that breaks (2). In fact, Roś et al. (2016) have recently shown that the MLE of does not exist if , and the only nontrivial sufficient condition known so far to guarantee the existence of the MLE is . However, in the context of multitask learning, the unknown model parameter is unique, and hence we have .
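The divergence can be checked numerically. The sketch below uses a representative instance of the matrix subproblem of (2), a trace coupling term minus the two log-determinants, ignoring the exact constants, and scales the two precision matrices in opposite directions: the trace term stays fixed while the objective decreases without bound whenever the two dimensions differ.

```python
import numpy as np

d, m = 3, 2                       # unequal dimensions drive the divergence
rng = np.random.default_rng(0)
W = rng.standard_normal((d, m))

def objective(t):
    # Scale the two precisions in opposite directions: Om1 = t*I, Om2 = I/t.
    Om1, Om2 = t * np.eye(d), np.eye(m) / t
    trace_term = np.trace(Om1 @ W @ Om2 @ W.T)    # invariant in t
    logdets = np.linalg.slogdet(Om1)[1] + np.linalg.slogdet(Om2)[1]
    return trace_term - logdets                   # = const - (d - m)*log(t)

vals = [objective(10.0 ** k) for k in range(5)]   # strictly decreasing
```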
Given the wide application of the above multitask learning framework in the literature, as well as of the flip-flop algorithm in this setting, we believe it is important to resolve the ill-posedness and non-positive-definiteness issues described above. To this end, for some positive constants , we propose the following variant of (2):
subject to  (3) 
The bound constraints make the feasible set compact. Since the objective function is continuous, by the extreme value theorem the matrix subproblem of (3) becomes well-defined and achieves finite lower and upper bounds within the feasible set. Alternatively, one can understand this constraint as specifying a truncated matrix normal prior over the compact set. As we will see shortly, the bound constraints also allow us to develop an optimization procedure for with a linear convergence rate, which is an exponential acceleration over the unbounded case.
4 Multi-convex Optimization
In this section we propose a block coordinate-wise minimization algorithm to optimize the objective in (3). In each iteration, we alternately minimize over with and fixed, then minimize over with and fixed, and lastly minimize over with and fixed. The whole procedure is repeated until a stationary point is found. Due to space limits, we defer all proofs and derivations to the appendix. To simplify the notation, we assume . Let be the target matrix and the feature matrix shared by all the tasks. Using this notation, the objective can be equivalently expressed in matrix form as:
subject to  (4) 
4.1 Optimization of
In order to minimize over when both and are fixed, we solve the following subproblem:
(5) 
As shown in the last section, this is an unconstrained convex optimization problem. We present three different algorithms to find its optimal solution. The first is guaranteed to find an exact solution in closed form in time. The second performs gradient descent with a fixed step size to iteratively refine the solution, and we show that in our case a linear convergence rate can be guaranteed. The third finds the optimal solution by solving the Sylvester equation (Bartels and Stewart, 1972) characterized by the first-order optimality condition, after a proper transformation.
A closed-form solution. It is worth noting that it is not obvious how to obtain a closed-form solution directly from the formulation in (5). Applying the first-order optimality condition to (5) leads to: . Hence, except for the special case where with a constant, the above equation does not admit an easy closed-form solution in its matrix representation. The workaround is based on the fact that the -dimensional matrix space is isomorphic to the -dimensional vector space, with the operator implementing the isomorphism from to . Using this property, we have:
Proposition 4.1.
(5) can be solved in closed form in time; the optimal solution is: .
The computational bottleneck in the above procedure is in solving an system of equations, which scales as if no further sparsity structure is available.
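A hedged sketch of this vectorization route, assuming the subproblem takes the form min_W ||XW − Y||_F^2 + tr(Om1 W Om2 W^T) with the two precision matrices fixed (names and constants are ours, not the paper's exact ones):

```python
import numpy as np

def solve_W_closed_form(X, Y, Om1, Om2):
    """Closed-form solution via vectorization (sketch).

    Assumed optimality condition: X'X W + Om1 W Om2 = X'Y, which has no
    simple matrix-form inverse but vectorizes to a (dm x dm) linear system
    using vec(A W B) = (B' kron A) vec(W) with column-major vec.
    """
    d, m = X.shape[1], Y.shape[1]
    A = np.kron(np.eye(m), X.T @ X) + np.kron(Om2.T, Om1)
    b = (X.T @ Y).reshape(-1, order="F")     # column-major vectorization
    w = np.linalg.solve(A, b)                # the O((dm)^3) bottleneck
    return w.reshape(d, m, order="F")
```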
Gradient descent. The closed-form solution shown above scales cubically in both and , and requires us to explicitly form a matrix of size . This can be intractable even for moderate and . In such cases, instead of computing an exact solution to (5), we can use gradient descent with a fixed step size to obtain an approximate solution. The objective function in (5) is differentiable, and its gradient can be obtained in time as . Note that we can compute both and in advance in time and cache them, so that we do not need to recompute them in each gradient update step. Let denote the th largest eigenvalue of a real symmetric matrix . Adapting Nesterov (2013), we provide a linear convergence guarantee for the gradient method in the following proposition:

Proposition 4.2.
Let and . Choose . For all , gradient descent with step size converges to the optimum within steps.
The computational complexity of obtaining an approximate solution using gradient descent is . Compared with the complexity of the exact solution, the gradient descent algorithm scales much better provided the condition number is not too large. As a side note, when the condition number is large, we can effectively reduce it to by using the conjugate gradient method (Shewchuk et al., 1994).
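The gradient descent variant can be sketched as follows, caching X^T X and X^T Y once as described above. Same assumed subproblem and names as before; the fixed step size uses a Lipschitz constant in the spirit of Prop. 4.2:

```python
import numpy as np

def solve_W_gd(X, Y, Om1, Om2, n_iter=5000):
    """Fixed-step gradient descent for the assumed subproblem
    min_W 0.5*||X W - Y||_F^2 + 0.5*tr(Om1 @ W @ Om2 @ W.T).
    Names and constants are illustrative, not the paper's exact ones."""
    XtX, XtY = X.T @ X, X.T @ Y            # computed once and cached
    # Lipschitz constant of the gradient: lam_max(X'X) + lam_max(Om1)*lam_max(Om2)
    L = (np.linalg.eigvalsh(XtX)[-1]
         + np.linalg.eigvalsh(Om1)[-1] * np.linalg.eigvalsh(Om2)[-1])
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = XtX @ W - XtY + Om1 @ W @ Om2
        W -= grad / L                      # fixed step size 1/L
    return W
```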
Sylvester equation. In the field of control theory, a Sylvester equation (Bhatia and Rosenthal, 1997) is a matrix equation of the form , where the goal is to find a solution matrix given and . For this problem there are efficient numerical algorithms with highly optimized implementations that obtain a solution in cubic time. For example, the Bartels-Stewart algorithm (Bartels and Stewart, 1972) solves the Sylvester equation by first transforming and into Schur form via QR factorization, and then solving the resulting triangular system by back-substitution. Our third approach is based on the observation that we can transform the first-order optimality equation into an equivalent Sylvester equation by multiplying both sides by : . As a result, finding the optimal solution of the subproblem amounts to solving this Sylvester equation, whose solution can be obtained using the Bartels-Stewart algorithm in .
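A sketch of this route using SciPy's Bartels-Stewart-based solver, again assuming the optimality system X^T X W + Om1 W Om2 = X^T Y (our names; the paper's exact constants may differ):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def solve_W_sylvester(X, Y, Om1, Om2):
    """Solve the assumed optimality system X'X W + Om1 W Om2 = X'Y.

    Left-multiplying by inv(Om1) gives the Sylvester form A W + W B = Q,
    which scipy.linalg.solve_sylvester handles via Bartels-Stewart."""
    Om1_inv = np.linalg.inv(Om1)
    A = Om1_inv @ (X.T @ X)
    B = Om2
    Q = Om1_inv @ (X.T @ Y)
    return solve_sylvester(A, B, Q)
```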
Remark. Both gradient descent and the Bartels-Stewart algorithm find the optimal solution in cubic time. However, gradient descent is more widely applicable: the Bartels-Stewart algorithm only applies when all the tasks share the same instances, so that we can write down the matrix equation explicitly, while gradient descent can be applied when each task has a different number of inputs and those inputs are not shared among tasks. On the other hand, as we will see shortly in the experiments, in practice the Bartels-Stewart algorithm is faster than gradient descent and provides a more numerically stable solution.
4.2 Optimization of and
Before delving into the detailed analysis, we first list the final algorithms used to optimize and in Alg. 1 and Alg. 2, respectively. The hard-thresholding function used in Line 2 of Alg. 1 and Alg. 2 is defined as follows:
(6) 
The hard-thresholding function keeps the value of its argument if , and otherwise truncates it to if , respectively. Both algorithms are remarkably simple: each involves only one SVD, one truncation, and two matrix multiplications. The computational complexities of Alg. 1 and Alg. 2 are bounded by and , respectively.
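The structure of Alg. 1 / Alg. 2 can be sketched as follows. The weighting constant c and the bound constants lo, hi are placeholders, and the subproblem is assumed to be a trace term minus a scaled log-determinant under the eigenvalue bounds of (3); the closed form itself is derived in the remainder of this section:

```python
import numpy as np

def hard_threshold(x, lo, hi):
    """T_[lo,hi] from (6): keep x if it lies in [lo, hi], truncate otherwise."""
    return np.clip(x, lo, hi)

def update_precision(M, c, lo, hi):
    """One covariance-matrix update in the style of Alg. 1 / Alg. 2 (sketch).

    Assumed subproblem: min tr(Om @ M) - c*log det(Om) over lo*I <= Om <= hi*I,
    with M symmetric PSD (e.g. M = W @ Om2 @ W.T when updating Om1). One
    eigendecomposition, one truncation and two matrix multiplications,
    matching the cost stated above."""
    sig, U = np.linalg.eigh(M)                        # eigendecomposition of M
    omega = hard_threshold(c / np.maximum(sig, 1e-300), lo, hi)
    return (U * omega) @ U.T                          # U @ diag(omega) @ U.T
```

For an unconstrained interior solution this recovers the familiar Om = c * inv(M); the thresholding simply projects each eigenvalue onto the box.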
In this section we focus on analyzing the optimization w.r.t. . A symmetric analysis can be applied to solve as well. In order to minimize over when and are fixed, we solve the following subproblem:
(7) 
Although (7) is a convex optimization problem, it is computationally expensive to solve using off-the-shelf algorithms, e.g., the interior point method, because of the constraints as well as the nonlinearity of the objective function. However, as we show shortly, we can find a closed-form optimal solution to this problem using tools from the theory of doubly stochastic matrices (Dufossé and Uçar, 2016) and perfect bipartite graph matching. Due to space limits, we defer the detailed derivation and proof to the appendix, and only sketch the argument below.
Without loss of generality, for any feasible , using spectral decomposition, we can reparametrize as
(8) 
where . Similarly, we can represent
(9) 
where . Let and . Set and define to be the Hadamard product of with itself, i.e., . Since both and are orthonormal matrices, it immediately follows that is also an orthonormal matrix. As a result, the following two equations hold:

which implies that is a doubly stochastic matrix. Given that is an orthonormal matrix, we have . On the other hand, it can be readily verified that the following equality holds:

(10)
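The doubly stochastic step is easy to verify numerically: the entrywise square of any orthonormal matrix has nonnegative entries whose rows and columns each sum to one.

```python
import numpy as np

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # a random orthonormal matrix
M = U * U                                         # Hadamard (entrywise) square
row_sums, col_sums = M.sum(axis=1), M.sum(axis=0)
# M has nonnegative entries and unit row/column sums: doubly stochastic,
# exactly as used in the reduction above.
```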
Combining the transformations in (8), (9) and (10) and plugging them into (7), we obtain the following equivalent optimization problem:
(11) 
where denotes a vector of all ones with dimension . To solve (11), we make the following key observations:


The minimization is decomposable in terms of and . Furthermore, the optimization over is a linear program (LP).

For any bounded LP, there exists at least one extreme point that achieves the optimal solution.

The set of doubly stochastic matrices, denoted as , forms a convex polytope, known as the Birkhoff polytope.

By the Birkhoffvon Neumann theorem, is the convex hull of the set of permutation matrices, i.e., every extreme point of is a permutation matrix.
Combining all the analysis above, it is clear that the optimal solution must be a permutation matrix. This motivates us to reduce (11) to a minimum-weight perfect matching problem on a weighted complete bipartite graph: for any , we construct a weighted bipartite graph as follows:


For each , construct a vertex , .

For each , construct a vertex , .

For each pair , construct an edge with weight .
The following theorem relates the solution of the minimum weight matching to the partial solution of (11) w.r.t. :
Theorem 4.1.
Let and with and . The minimum-weight perfect matching on is , with minimum weight . Furthermore, it equals .
Proof sketch. The full proof of Thm. 4.1 is deferred to the appendix; here we only sketch the high-level idea. Given a matching in the graph as shown in Fig. 1, if there is an inverse pair (a cross) in the matching, then we can improve the matching by removing the inverse pair. Since there are at most finitely many inverse pairs, an inductive argument shows that the optimal matching is achieved when there is no inverse pair, i.e., when is matched to .
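The no-crossing argument can be checked by brute force on a small instance. Assuming product-form edge weights with one sequence ascending and the other descending (an illustrative stand-in for the eigenvalue-based weights of Thm. 4.1), the identity matching attains the minimum over all permutations, by the rearrangement inequality:

```python
import numpy as np
from itertools import permutations

# Edge weights w(u_i, v_j) = a_i * b_j, with a ascending and b descending.
a = np.array([0.5, 1.0, 2.0, 3.0])
b = np.array([4.0, 2.5, 1.0, 0.2])
identity_cost = float(np.dot(a, b))               # matching u_i -> v_i
best_cost = min(sum(a[i] * b[p[i]] for i in range(len(a)))
                for p in permutations(range(len(a))))
# No crossing can improve the identity matching: identity_cost == best_cost.
```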
The optimal matching in Thm. 4.1 suggests that the optimal doubly stochastic matrix is given by , which also implies and . Plugging back into (11), the optimization w.r.t. decomposes into independent scalar problems, which can be easily solved. Using the hard-thresholding function defined in (6), we can express the optimal solution as . Combining all the analysis above, we obtain the algorithms listed at the beginning of this section for optimizing and . Interestingly, they are closely connected to the proximal methods proposed in the literature for matrix completion (Cai et al., 2010) and for Euclidean projection under a trace norm constraint (Chen et al., 2011, 2012). To the best of our knowledge, this is the first algorithm that efficiently solves the minimization of a linear function over matrices with negative log-determinant regularization (e.g., (7)).
5 Nonlinear Extension
So far we have discussed our FETR framework under the linear regression model, but it readily extends to nonlinear regression/classification settings. One straightforward way to do so is to apply the (orthogonal) random Fourier transformation (Rahimi and Recht, 2008; Felix et al., 2016) to generate high-dimensional random features, so that linear FETR in the transformed space corresponds to a nonlinear model in the original feature space. However, depending on the dimension of the random features, this approach may lead to a huge covariance matrix that is expensive to optimize.

A more natural and expressive approach is to combine our regularization scheme and optimization method with parametrized nonlinear feature transformations, such as neural networks. More specifically, let be a neural network with learnable parameters that defines a nonlinear transformation of the input features from to . Essentially, we can replace the feature matrix in (4) with to create a regularized multitask neural network (Caruana, 1997), where we add one more layer, defined by the matrix , on top of the nonlinear mapping given by . To train the model, we use backpropagation to optimize , and our proposed approach to optimize the two covariance matrices. We further explore this nonlinear extension in Sec. 6 to demonstrate its power in statistical modeling.

6 Experiments
6.1 Convergence Analysis and Computational Efficiency
We first investigate the efficiency and scalability of the three different algorithms for minimizing w.r.t. on synthetic datasets. For each experiment, we generate a synthetic dataset consisting of instances that are shared among all the tasks. All instances are sampled uniformly at random from . We gradually increase the dimension of the features, , and the number of tasks, , to test scalability. The first algorithm implements the closed-form solution by explicitly computing the matrix product and then solving the linear system. The second is the proposed gradient descent, and the last uses the Bartels-Stewart algorithm to solve the equivalent Sylvester equation to compute . We use the open-source toolkit SciPy, whose backend implementation uses highly optimized Fortran code. For all synthetic experiments we set and , which corresponds to a condition number of . We fix the coefficients . We repeat each experiment 10 times to report both the mean and the variance. The experimental results are shown in Fig. 1(a). As expected, the closed-form solution does not scale to problems of even moderate size due to its large memory requirement. In practice, the Bartels-Stewart algorithm is about one order of magnitude faster than the gradient descent method when either or is large. It is also worth pointing out that, based on our observations, the Bartels-Stewart algorithm is the most numerically stable of the three.

SARCOS  School
Method  1st  2nd  3rd  4th  5th  6th  7th  NMSE 
STL  31.40  22.90  9.13  10.30  0.14  0.84  0.46  0.9882 0.0196 
MTFL  31.41  22.91  9.13  10.33  0.14  0.83  0.45  0.8891 0.0380 
MTRL  31.09  22.69  9.08  9.74  0.14  0.83  0.44  0.9007 0.0407 
MTFRL  31.13  22.60  9.10  9.74  0.13  0.83  0.45  0.8451 0.0197 
FETR  31.08  22.68  9.08  9.73  0.13  0.83  0.43  0.8134 0.0253 
STN  24.81  17.20  8.97  8.36  0.13  0.72  0.34  
MTN  12.01  10.54  5.02  7.15  0.09  0.70  0.27  
MTN-FETR  10.77  9.34  4.95  7.01  0.08  0.59  0.24
We compare our proposed coordinate minimization algorithm with an off-the-shelf projected gradient method and the flip-flop algorithm for solving the optimization problem (4). Specifically, the projected gradient method updates and in each iteration and then projects and onto the corresponding feasible regions. The flip-flop algorithm is implemented as suggested by Zhang and Schneider (2010), and we use a fudge factor of to avoid the non-positive-definite problem. In each iteration, both covariance matrices are projected onto the feasible region as well. In the SARCOS dataset all instances are shared among all the tasks, so the Sylvester solver is used to optimize in coordinate minimization. We repeat the experiments 10 times and report the mean and standard deviation of the function value versus the time used by all three algorithms (Fig. 1(b)). It is clear from Fig. 1(b) that our proposed algorithm not only converges much faster than the other two competitors, but also achieves better results. In fact, as we observe in our experiments, the proposed algorithm usually converges in fewer than 10 iterations.

6.2 Datasets
Robot Inverse Dynamics
This data relates to an inverse dynamics problem for a seven degree-of-freedom (DOF) SARCOS anthropomorphic robot arm (Vijayakumar and Schaal, 2000). The goal is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. Hence there are 7 tasks, and the inputs are shared among all the tasks. The training set and test set contain 44,484 and 4,449 examples, respectively.

School Data This dataset consists of the examination scores of 15,362 students from 139 secondary schools (Goldstein, 1991). It has 27 input features and contains 139 tasks. In this dataset, instances are not shared among different tasks, so we use our gradient descent solver for instead of the Sylvester equation solver. We use 10-fold cross-validation to generate training and test sets, and for each partition we compute the mean of the normalized mean squared error (NMSE) over the 139 tasks. The NMSE is defined as the ratio of the MSE to the variance on a task. We report the mean NMSE and its standard deviation across the 10 cross-validation folds. We compare FETR with multitask feature learning (Evgeniou and Pontil, 2007) (MTFL), multitask relationship learning (Zhang and Yeung, 2010a) (MTRL), and the MTFRL framework on both the School and SARCOS datasets. We use ridge regression as our baseline model, denoted single task learning (STL).
6.3 Results and Analysis
For the SARCOS dataset, we partition the training set into a development set and a validation set, containing 31,138 and 13,346 instances, respectively. To show the power of the nonlinear extension, we also include a single-task neural network (STN) and a multitask neural network (MTN) in the experiment, based on which we propose MTN-FETR, which incorporates our regularization scheme into the last layer of MTN. STN is a model where we use a separate network for each task, while in MTN all layers except the last output layer are shared among the different tasks. In the experiment, STN, MTN and MTN-FETR share exactly the same network structure: an input layer with 21 dimensions, followed by two hidden layers with 256 and 100 hidden units. The output of the network in MTN and MTN-FETR is a multitask layer containing 7 output units, while in STN the output contains a single unit. All methods share the same experimental setting, including model selection. In all experiments we fix and . The hyperparameters range over , and we use the validation set for model selection. Note that because instances are not shared between different tasks in the School dataset, MTN and MTN-FETR cannot be directly applied there. For each method, the best model on the validation set is selected. The results are summarized in Table 1 (smaller is better). Among all the methods, FETR consistently achieves lower test set MSEs. More importantly, we observe a significant improvement of both MTN and MTN-FETR over all the linear baselines and STN, and MTN-FETR further improves over MTN on all the tasks. The experimental results confirm that multitask learning usually improves over single task learning when the dataset is small and the tasks are related. Furthermore, among all the competitors, nonlinear models combined with our FETR framework give the overall best results, demonstrating the effectiveness of the proposed approach in both linear and nonlinear settings.

One byproduct of the FETR method is that we also gain access to the estimated row and column covariance matrices. In Fig. 3 we plot the feature and task covariance matrices, where we can clearly observe a block-diagonal structure: the first 4 tasks are negatively correlated with the remaining 3, and the 5th and 6th tasks are positively correlated. Intuitively, these correlations are consistent with the SARCOS dataset, in which several joints move up and down jointly.
7 Conclusion
In this paper we point out a common misconception in the existing multitask feature and relationship learning framework, and propose a constrained variant to fix it. Our framework admits a multi-convex formulation, which allows us to design an efficient block coordinate-wise algorithm for its optimization. To solve the weight learning subproblem, we propose three different strategies that can be used regardless of whether instances are shared by multiple tasks. To learn the covariance matrices, we reduce the underlying matrix optimization subproblem to a minimum-weight perfect matching problem and solve it exactly in closed form; to the best of our knowledge, all previous methods resort to expensive iterative procedures for this problem. Finally, we discuss several extensions of the proposed framework to nonlinear settings. Experimental evidence shows that our method is orders of magnitude faster than its competitors, and that it achieves significantly improved statistical performance on two real-world datasets.
References
 Adel et al. (2017) T. Adel, H. Zhao, and A. Wong. Unsupervised domain adaptation with a relaxed covariate shift assumption. In AAAI, pages 1691–1697, 2017.
 Argyriou et al. (2007) A. Argyriou, M. Pontil, Y. Ying, and C. A. Micchelli. A spectral regularization framework for multitask structure learning. In Advances in neural information processing systems, 2007.
 Argyriou et al. (2008) A. Argyriou, T. Evgeniou, and M. Pontil. Convex multitask feature learning. Machine Learning, 73(3):243–272, 2008.
 Bartels and Stewart (1972) R. H. Bartels and G. Stewart. Solution of the matrix equation AX + XB = C [F4]. Communications of the ACM, 15(9):820–826, 1972.
 Bertsimas and Tsitsiklis (1997) D. Bertsimas and J. N. Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific Belmont, MA, 1997.
 Bhatia and Rosenthal (1997) R. Bhatia and P. Rosenthal. How and why to solve the operator equation AX − XB = Y. Bulletin of the London Mathematical Society, 29(1):1–21, 1997.
 Bonilla et al. (2007) E. V. Bonilla, F. V. Agakov, and C. Williams. Kernel multi-task learning using task-specific features. In International Conference on Artificial Intelligence and Statistics, pages 43–50, 2007.
 Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
 Cai et al. (2010) J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
 Caruana (1997) R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
 Chen et al. (2011) J. Chen, J. Zhou, and J. Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 42–50. ACM, 2011.
 Chen et al. (2012) J. Chen, J. Liu, and J. Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(4):22, 2012.
 Dufossé and Uçar (2016) F. Dufossé and B. Uçar. Notes on Birkhoff–von Neumann decomposition of doubly stochastic matrices. Linear Algebra and its Applications, 497:108–115, 2016.
 Dutilleul (1999) P. Dutilleul. The MLE algorithm for the matrix normal distribution. Journal of statistical computation and simulation, 64(2):105–123, 1999.
 Evgeniou and Pontil (2007) A. Evgeniou and M. Pontil. Multitask feature learning. Advances in neural information processing systems, 19:41, 2007.
 Evgeniou and Pontil (2004) T. Evgeniou and M. Pontil. Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.
 Felix et al. (2016) X. Y. Felix, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.
 Glanz and Carvalho (2013) H. Glanz and L. Carvalho. An expectationmaximization algorithm for the matrix normal distribution. arXiv preprint arXiv:1309.6609, 2013.
 Goldstein (1991) H. Goldstein. Multilevel modelling of survey data. Journal of the Royal Statistical Society. Series D (The Statistician), 40(2):235–244, 1991.
 Gupta and Nagar (1999) A. K. Gupta and D. K. Nagar. Matrix variate distributions, volume 104. CRC Press, 1999.
 Jacob et al. (2009) L. Jacob, J.-P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Advances in neural information processing systems, pages 745–752, 2009.
 Jawanpuria et al. (2015) P. Jawanpuria, M. Lapin, M. Hein, and B. Schiele. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems, pages 1189–1197, 2015.
 Kato et al. (2008) T. Kato, H. Kashima, M. Sugiyama, and K. Asai. Multitask learning via conic programming. In Advances in Neural Information Processing Systems, pages 737–744, 2008.
 Li et al. (2014) C. Li, J. Zhu, and J. Chen. Bayesian max-margin multi-task learning with data augmentation. In International Conference on Machine Learning, pages 415–423, 2014.
 Li et al. (2015) Y. Li, X. Tian, T. Liu, and D. Tao. Multitask model and feature joint learning. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3643–3649. AAAI Press, 2015.
 Liu et al. (2009) J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 339–348. AUAI Press, 2009.
 Long et al. (2017) M. Long, Z. Cao, J. Wang, and S. Y. Philip. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems, pages 1593–1602, 2017.
 Nesterov (2013) Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
 Rahimi and Recht (2008) A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
 Roś et al. (2016) B. Roś, F. Bijma, J. C. de Munck, and M. C. de Gunst. Existence and uniqueness of the maximum likelihood estimator for models with a Kronecker product covariance structure. Journal of Multivariate Analysis, 143:345–361, 2016.
 Shewchuk et al. (1994) J. R. Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
 Thrun (1996) S. Thrun. Explanation-based neural network learning. In Explanation-Based Neural Network Learning, pages 19–48. Springer, 1996.
 Vijayakumar and Schaal (2000) S. Vijayakumar and S. Schaal. Locally weighted projection regression: Incremental real time learning in high dimensional space. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 1079–1086. Morgan Kaufmann Publishers Inc., 2000.
 Zhang (2011) Y. Zhang. Supervision reduction by encoding extra information about models, features and labels. 2011.
 Zhang and Schneider (2010) Y. Zhang and J. G. Schneider. Learning multiple tasks with a sparse matrixnormal penalty. In Advances in Neural Information Processing Systems, pages 2550–2558, 2010.
 Zhang and Yang (2017) Y. Zhang and Q. Yang. A survey on multitask learning. arXiv preprint arXiv:1707.08114, 2017.
 Zhang and Yeung (2010a) Y. Zhang and D.-Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In UAI, 2010a.
 Zhang and Yeung (2010b) Y. Zhang and D.Y. Yeung. Multitask learning using generalized t process. In AISTATS, pages 964–971, 2010b.
Appendix A Proofs
A.1 Proof of Proposition 4.1
Fact A.1.
Let be a matrix. Then .
Fact A.2.
Let , and . Then .
Fact A.3.
Let and . Then .
Fact A.4.
Let and . Let be the spectrum of and be the spectrum of . Then the spectrum of is .
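Fact A.4 is easy to verify numerically; the following sketch (our own check, for symmetric matrices) confirms that the eigenvalues of a Kronecker product are exactly the pairwise products of the factors' eigenvalues:

```python
import numpy as np

# For symmetric A (3x3) and B (4x4), the spectrum of A ⊗ B is the set of
# all pairwise products of the eigenvalues of A and the eigenvalues of B.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = A + A.T
B = rng.standard_normal((4, 4)); B = B + B.T

spec_kron = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
spec_prod = np.sort(np.outer(np.linalg.eigvalsh(A),
                             np.linalg.eigvalsh(B)).ravel())
assert np.allclose(spec_kron, spec_prod)
```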
We can show the following result by transforming into its isomorphic counterpart:
Proof.
The last equation above is a quadratic function of , from which we can read off that the optimal solution should satisfy:
(12) 
can then be obtained simply by reformatting into a matrix. The computational bottleneck in the above procedure is in solving an system of equations, which scales as if no further structure is available. The overall computational complexity is . ∎
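The vectorization route used in the proof can be made concrete with a generic example (our own sketch, not the paper's implementation): to solve a linear matrix equation AXB = C, apply the identity vec(AXB) = (Bᵀ ⊗ A) vec(X) and solve the resulting dense system, whose cost is cubic in the number of entries of X when no structure is exploited:

```python
import numpy as np

def solve_axb(A, B, C):
    """Solve A X B = C for X via vec(A X B) = (B^T ⊗ A) vec(X).

    Uses column-major (Fortran-order) vec to match the identity. Forming
    and solving the dense Kronecker system costs O((d*m)^3) for X in
    R^{d x m} when no further structure is available.
    """
    d, m = A.shape[1], B.shape[0]
    K = np.kron(B.T, A)                          # (d*m) x (d*m) system matrix
    x = np.linalg.solve(K, C.flatten(order="F"))
    return x.reshape(d, m, order="F")            # un-vec back into a matrix

rng = np.random.default_rng(1)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
X_true = rng.standard_normal((4, 3))
assert np.allclose(solve_axb(A, B, A @ X_true @ B), X_true)
```

In practice, structured solvers such as Bartels–Stewart (for Sylvester-type equations) avoid forming the Kronecker system explicitly.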
A.2 Proof of Proposition 4.2
To analyze the convergence rate of gradient descent in this case, we start by bounding the smallest and largest eigenvalue of the quadratic system.
Lemma A.1 (Weyl’s inequality).
Let and be by Hermitian matrices, and . Let , and be the eigenvalues of and respectively. Then the following inequalities hold for , :
Let be the th largest eigenvalue of matrix .
Lemma A.2.
Proof.
We first introduce the following two lemmas, adapted from (Nesterov, 2013), which rely on the fact that the spectral norm of the Hessian matrix is bounded.
Lemma A.3.
Let be a twice differentiable function with . is a constant. The minimum value of can be achieved. Let , then
Lemma A.4.
Let be a convex, twice differentiable function with . is a constant, then :
Proof.
Define function as follows:
Since we have already bounded that , it follows that is a convex function and furthermore . Applying Lemma A.4 to , , we have:
Plugging into the above inequality and performing some algebraic manipulation, we have:
(13) 
Let . Within each iteration of the algorithm, the update formula is , so we can bound as follows:
(By inequality 13)  
Applying the above inequality recursively for iterations, we have
where . For , we have
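The linear rate derived above can be observed directly. The sketch below (our own illustration on a synthetic strongly convex quadratic) runs gradient descent with step size 1/L and checks that the suboptimality contracts by at least a factor of (1 − μ/L) per iteration, where μ and L are the smallest and largest eigenvalues of the Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
Q = M @ M.T + np.eye(5)                  # positive definite Hessian
mu, L = np.linalg.eigvalsh(Q)[[0, -1]]   # strong convexity / smoothness constants

f = lambda x: 0.5 * x @ Q @ x            # minimum value 0, attained at x = 0
x = rng.standard_normal(5)
vals = [f(x)]
for _ in range(100):
    x = x - (1.0 / L) * (Q @ x)          # gradient step with step size 1/L
    vals.append(f(x))

# Per-iteration contraction of the suboptimality by at least (1 - mu/L).
assert all(v2 <= (1 - mu / L) * v1 + 1e-12 for v1, v2 in zip(vals, vals[1:]))
```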