Efficient Multi-task Feature and Relationship Learning

02/14/2017 · by Han Zhao et al.

In this paper we propose a multi-convex framework for multi-task learning that improves predictions by learning relationships both between tasks and between features. Our framework is a generalization of related methods in multi-task learning that either learn task relationships or feature relationships, but not both. We start with a hierarchical Bayesian model, and use the empirical Bayes method to transform the underlying inference problem into a multi-convex optimization problem. We propose a coordinate-wise minimization algorithm that has a closed-form solution for each block subproblem. Naively, these solutions would be expensive to compute, but using the theory of doubly stochastic matrices we reduce the underlying matrix optimization subproblem to a minimum-weight perfect matching problem on a complete bipartite graph, and solve it analytically and efficiently. To solve the weight learning subproblem, we propose three different strategies, including a gradient descent method with a linear convergence guarantee when instances are not shared by multiple tasks, and a numerical solution based on the Sylvester equation when instances are shared. We demonstrate the efficiency of our method on both synthetic and real-world datasets. Experiments show that the proposed optimization method is orders of magnitude faster than an off-the-shelf projected gradient method, and that our model is able to exploit the correlation structures among multiple tasks and features.





1 Introduction

In machine learning the goal is often to train predictive models for one or more tasks of interest. Making accurate predictions relies heavily on the existence of labeled data for the desired tasks. However, in real-world problems data is often hard to acquire (e.g., medical domains) or expensive to label (e.g., image segmentation). For many tasks, this makes it impractical or impossible to collect large volumes of labeled data. Multitask learning is a learning paradigm that aims to improve generalization performance in a learning task by learning models for multiple related tasks simultaneously. It has received considerable interest over the past decades (Caruana, 1997; Evgeniou and Pontil, 2004; Argyriou et al., 2007, 2008; Kato et al., 2008; Liu et al., 2009; Jacob et al., 2009; Zhang and Yeung, 2010a; Zhang and Schneider, 2010; Chen et al., 2011, 2012; Li et al., 2015; Jawanpuria et al., 2015; Adel et al., 2017).

One of the underlying assumptions behind many multitask learning algorithms is that the tasks are related to each other. Hence, a key question is how to define the notion of task relatedness, and how to capture it in the learning formulation. A common assumption is that tasks can be described by weight vectors, and that they are sampled from a shared prior distribution over their space (Liu et al., 2009; Zhang and Yeung, 2010a, b). Another strand of work assumes common feature representations to be shared among multiple tasks, where the goal is to learn the shared representation as well as task-specific parameters simultaneously (Thrun, 1996; Caruana, 1997; Evgeniou and Pontil, 2007; Argyriou et al., 2008). Moreover, when structure about multiple tasks is available, e.g., task-specific descriptors (Bonilla et al., 2007) or a task similarity graph (Evgeniou and Pontil, 2004), regularizers can often be incorporated into the learning formulation to penalize hypotheses that are not consistent with the given structure.

There have been several attempts to improve predictions by either learning the relationships between different tasks (Zhang and Yeung, 2010a), or by exploiting the relationships between different features (Argyriou et al., 2008). In this paper we consider a multiconvex framework for multitask learning that improves predictions over tabula rasa learning by assuming that all the task vectors are sampled from a common matrix-variate normal prior. The framework, known as MTFRL (Zhang and Schneider, 2010), learns the relationships both between tasks and between features simultaneously via two covariance matrices, i.e., the feature covariance matrix and the task covariance matrix. In this context, learning multiple tasks corresponds to estimating a matrix of model parameters, and learning feature/task relationships corresponds to estimating the row/column covariance matrices of the model parameters, respectively. This property is favorable for applications where we not only aim for better generalization, but also seek a clear understanding of the relationships among different tasks.

The goal of MTFRL is to optimize over both the task vectors and the two covariance matrices in the prior. When the loss function is convex, the regularized MTFRL problem is multiconvex. Previous approaches (Zhang and Schneider, 2010; Zhang, 2011; Long et al., 2017) for solving this problem hinge on the classic flip-flop algorithm (Dutilleul, 1999) to estimate the two covariance matrices. However, as we point out in Sec. 3, the flip-flop algorithm cannot be directly applied, because the maximum likelihood estimation (MLE) formulation of the multitask learning problem in this setting is ill-posed. As a result, in practice, heuristics have to be introduced to ensure the positive definiteness of both covariance matrices, and it is not clear whether the resulting patched algorithm still converges.

Our Contributions

In this paper we propose a variant of the MTFRL framework that is well-defined, and design a block coordinate-wise minimization algorithm to solve it. We term our new formulation FEaTure and Relation learning (FETR). By design, FETR is free of the nonpositive-definite problem in MTFRL. To solve FETR, we propose efficient, analytic solutions for each of the subproblems, which lets us avoid the expensive iterative procedure otherwise needed to optimize the covariance matrices. Specifically, we achieve this by reducing an underlying matrix optimization problem with positive definite constraints to a minimum-weight perfect matching problem on a complete bipartite graph, which we solve analytically using combinatorial techniques. To solve the weight learning subproblem, we propose three different strategies: a closed-form solution, a gradient descent method with a linear convergence guarantee when instances are not shared by multiple tasks, and a numerical solution based on the Sylvester equation when instances are shared. We demonstrate the efficiency of the proposed optimization algorithm by comparing it with an off-the-shelf projected gradient descent algorithm and the classic flip-flop algorithm, on both synthetic and real-world data. Experiments show that the proposed optimization method is orders of magnitude faster than its competitors, and it often converges to better solutions. Lastly, we extend FETR to the nonlinear setting by combining its regularization scheme with rich nonlinear transformations using neural networks. This combined approach achieves significantly better generalization than existing methods on real-world datasets.

2 Preliminary

We start by introducing notations used throughout the paper and briefly discussing the MTFRL framework (Zhang and Schneider, 2010). Readers are referred to Zhang and Yang (2017) for a recent survey on various multitask learning algorithms.

2.1 Notation and Setup

We use lowercase letters, such as , to represent scalars, and lowercase bold letters, such as , to denote vectors. Capital letters are reserved for matrices. We use and to denote the -dimensional symmetric positive semidefinite cone and the -dimensional symmetric positive definite cone, respectively. We write for the trace of a matrix , and

for the multivariate normal distribution with mean

and covariance matrix . Finally, is a weighted bipartite graph with vertex sets , , edge set and weight function . denotes the set . For a matrix , we use to denote its vectorization.

We consider the following setup. Suppose we are given learning tasks , where for each learning task we have access to a training set with data instances . For simplicity of discussion, we focus on the regression setting where and ; the extension to the classification setting is straightforward. Let be our model with parameter . In what follows, we will assume our model for each task to be a linear regression, i.e.,


2.2 Matrix-Variate Normal Distribution

A matrix-variate normal distribution (Gupta and Nagar, 1999) with mean , row covariance matrix and column covariance matrix can be understood as a multivariate normal distribution with . (Its probability density is .) One advantage of the matrix-variate normal distribution over its equivalent multivariate counterpart is that, by imposing structure on the row and column covariance matrices, the former admits a much more compact representation than the latter ( versus ). The MLE of the matrix-variate normal distribution has been well studied in the literature (Dutilleul, 1999). Specifically, given an i.i.d. sample from , the MLE of is . The MLEs of and are solutions to the following system:
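To make the equivalence concrete, a draw from the matrix-variate normal can be generated as X = M + A Z Bᵀ, where A and B are Cholesky factors of the row and column covariances and Z has i.i.d. standard normal entries; the covariance of the column-major vec(X) is then the Kronecker product of the column and row covariances. The sketch below assumes this standard construction (variable names are ours, not the paper's):

```python
import numpy as np

def sample_matrix_normal(M, Sigma_r, Sigma_c, rng):
    """One draw from MN(M, Sigma_r, Sigma_c): X = M + A Z B^T with
    Sigma_r = A A^T, Sigma_c = B B^T, and Z i.i.d. standard normal."""
    A = np.linalg.cholesky(Sigma_r)
    B = np.linalg.cholesky(Sigma_c)
    Z = rng.standard_normal(M.shape)
    return M + A @ Z @ B.T

# The compact (p x p) + (q x q) parametrization expands to the full
# (pq x pq) covariance Sigma_c kron Sigma_r of vec(X).
```

This makes the compactness claim explicit: the pair of small covariance matrices encodes the same Gaussian as one covariance of quadratic size.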


The above system of equations does not have a closed form solution, as the two covariance estimates depend on each other. Hence, their estimates must be computed in an iterative fashion until convergence, which is known as the “flip-flop” algorithm (Dutilleul, 1999; Glanz and Carvalho, 2013). Furthermore, Dutilleul (1999) showed that the flip-flop algorithm is guaranteed to converge to positive definite covariance matrices iff . More properties of the MLE of the matrix-variate normal distribution can be found in Roś et al. (2016).
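As an illustration, the flip-flop updates can be sketched as follows (a generic implementation of the alternating MLE updates from Dutilleul (1999); variable names are ours). Note that only the Kronecker product of the two covariances is identifiable (the pair is determined only up to a scalar), and that with a single sample the updates produce rank-deficient matrices, which is exactly the failure mode discussed in Sec. 3:

```python
import numpy as np

def flip_flop_mle(X, iters=50):
    """Flip-flop MLE for a matrix-variate normal, given X of shape (N, p, q):
    alternately update the row and column covariances until convergence."""
    N, p, q = X.shape
    M = X.mean(axis=0)                    # MLE of the mean
    D = X - M
    Sigma_r, Sigma_c = np.eye(p), np.eye(q)
    for _ in range(iters):
        inv_c = np.linalg.inv(Sigma_c)
        Sigma_r = sum(Di @ inv_c @ Di.T for Di in D) / (N * q)
        inv_r = np.linalg.inv(Sigma_r)
        Sigma_c = sum(Di.T @ inv_r @ Di for Di in D) / (N * p)
    return M, Sigma_r, Sigma_c
```

Because of the scale ambiguity, estimates should be compared through their Kronecker product rather than individually.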

2.3 Multitask Feature and Relationship Learning

In linear regression, the likelihood function for task is given by: . Let be the model parameters for the different tasks, drawn from the matrix-variate normal distribution . By maximizing the joint distribution and optimizing over both the model parameters and the two covariance matrices in the prior, we arrive at the following optimization problem:

subject to (2)

where are the row and column precision matrices of the matrix normal prior distribution, respectively, and is a constant that does not depend on the optimization variables. It is not hard to see that the optimization problem in (2) is not convex, due to the coupling between and in the trace term. On the other hand, since the function is concave on the positive definite cone (Boyd and Vandenberghe, 2004), and the trace is linear in its components, it follows that (2) is multiconvex. Zhang and Schneider (2010) proposed to use the flip-flop algorithm to solve the matrix subproblem in (2), and this approach has been widely applied in subsequent work on multitask learning (Zhang, 2011; Li et al., 2014; Long et al., 2017).

3 Ill-posed Optimization

In this section we first point out an important issue in the literature on the application of the flip-flop algorithm to solve the matrix subproblem in (2). We then proceed to propose a well-defined variant of (2) to fix the problem. Interestingly, the variant we propose admits a closed form solution for each block variable that can be computed efficiently without any iterative procedure, which we will describe and derive in more detail in Sec. 4.

As proved by Dutilleul (1999), a necessary and sufficient condition for the flip-flop algorithm to converge to positive definite matrices is that the number of samples from the matrix-variate normal distribution satisfy . However, in the context of multitask learning, we are essentially dealing with an inference problem, where the goal is to estimate the value of , which is assumed to be an unknown but unique model parameter from the prior. This means that in this case we have , hence the condition for the convergence of the algorithm is violated. Technically, for any where , following the iterative update formula of the flip-flop algorithm in (1), for any feasible initialization of and , we will have . Now since , we know that . As a result, after one iteration, we will have

i.e., at least one of and is going to be rank deficient, and in the next iteration the inverse operation is not well-defined on at least one of them. As a fix, Zhang and Schneider (2010) proposed to use an artificial fudge factor to ensure that both covariance matrices stay positive definite after each update:

where is a fixed, small constant. However, since the fudge factor is a fixed constant that does not decrease to 0 in the limit, it introduces extra bias into the estimation, and it is not clear whether the patched algorithm converges.

Perhaps more surprising is that (2) is not even well-defined as an optimization problem. As a counterexample, we can fix and let , with , so that both and are feasible. Now let , and it is easy to verify that in this case the objective function goes to . Although we only provide one counterexample, there is no reason to believe that it is the only case for which (2) fails. In fact, Roś et al. (2016) recently showed that the MLE of does not exist if , and the only nontrivial sufficient condition known so far to guarantee the existence of the MLE is . However, in the context of multitask learning, the unknown model parameter is unique, and hence we have .

Given the wide application of the above multitask learning framework in the literature, as well as of the flip-flop algorithm in this setting, we believe it is important to fix the above ill-posedness and loss of positive definiteness. To this end, for some positive constants , we propose the following variant of (2):

subject to (3)

The bound constraints make the feasible set compact. Since the objective function is continuous, by the extreme value theorem, the matrix subproblem of (3) becomes well-defined and achieves finite lower and upper bounds within the feasible set. Alternatively, one can understand this constraint as specifying a truncated matrix normal prior over the compact set. As we will see shortly, the bound constraints also allow us to develop an optimization procedure for with a linear convergence rate, which is an exponential acceleration over the unbounded case.

4 Multiconvex Optimization

In this section we propose a block coordinate-wise minimization algorithm to optimize the objective in (3). In each iteration, we alternately minimize over with and fixed, then minimize over with and fixed, and lastly minimize over with and fixed. The whole procedure is repeated until a stationary point is found. Due to space limits, we defer all proofs and derivations to the appendix. To simplify the notation, we assume . Let be the target matrix and be the feature matrix shared by all the tasks. Using this notation, the objective can be equivalently expressed in matrix form as:

subject to (4)

4.1 Optimization of

In order to minimize over when both and are fixed, we solve the following subproblem:


As shown in the last section, this is an unconstrained convex optimization problem. We present three different algorithms for finding its optimal solution. The first is guaranteed to find an exact solution in closed form in time. The second performs gradient descent with a fixed step size to iteratively refine the solution, and we show that in our case a linear convergence rate can be guaranteed. The third finds the optimal solution by solving the Sylvester equation (Bartels and Stewart, 1972) characterized by the first-order optimality condition, after a proper transformation.

A closed form solution. It is worth noting that it is not obvious how to obtain a closed form solution directly from the formulation in (5). An application of the first order optimality condition to (5) will lead to: . Hence except for the special case where with a constant, the above equation does not admit an easy closed form solution in its matrix representation. The workaround is based on the fact that the dimensional matrix space is isomorphic to the dimensional vector space, with the operator implementing the isomorphism from to . Using this property, we have:

Proposition 4.1.

(5) can be solved in closed form in time; the optimal solution is: .

The computational bottleneck in the above procedure is in solving an system of equations, which scales as if no further sparsity structure is available.
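To make the vectorization trick concrete, here is a minimal sketch. The paper's exact constants are elided above, so we assume the representative subproblem min over W of the squared loss plus tr(Ω₁WΩ₂Wᵀ); its first-order condition XᵀXW + Ω₁WΩ₂ = XᵀY becomes a linear system in vec(W) via the identity vec(AWB) = (Bᵀ ⊗ A)vec(W):

```python
import numpy as np

def solve_w_closed_form(X, Y, Omega1, Omega2):
    """Solve X^T X W + Omega1 W Omega2 = X^T Y by vectorization:
    (I_m kron X^T X + Omega2 kron Omega1) vec(W) = vec(X^T Y),
    with column-major (Fortran-order) vec."""
    d, m = X.shape[1], Y.shape[1]
    A = np.kron(np.eye(m), X.T @ X) + np.kron(Omega2, Omega1)  # (dm x dm) system
    w = np.linalg.solve(A, (X.T @ Y).flatten(order="F"))
    return w.reshape(d, m, order="F")
```

Forming and solving the dm-by-dm system is exactly the cubic-in-dm bottleneck the text refers to.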

Gradient descent. The closed form solution shown above scales cubically in both and , and requires us to explicitly form a matrix of size . This can be intractable even for moderate and . In such cases, instead of computing an exact solution to (5), we can use gradient descent with fixed step size to obtain an approximate solution. The objective function in (5) is differentiable and its gradient can be obtained in time as . Note that we can compute in advance both and in time, and cache them so that we do not need to recompute them in each gradient update step. Let be the

th largest eigenvalue of a real symmetric matrix

. Adapted from Nesterov (2013), we provide a linear convergence guarantee for the gradient method in the following proposition:

Proposition 4.2.

Let and . Choose . For all , gradient descent with step size converges to the optima within steps.

The computational complexity to achieve an approximate solution using gradient descent is . Compared with the complexity of the exact solution, the gradient descent algorithm scales much better, provided the condition number is not too large. As a side note, when the condition number is large, we can effectively reduce it to by using the conjugate gradient method (Shewchuk et al., 1994).
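A sketch of the fixed-step variant, under the same assumed objective (squared loss plus tr(Ω₁WΩ₂Wᵀ); the paper's constants are elided above). The Hessian of the vectorized problem is 2(I ⊗ XᵀX + Ω₂ ⊗ Ω₁), whose extreme eigenvalues are bounded using those of XᵀX, Ω₁ and Ω₂, giving a valid fixed step size:

```python
import numpy as np

def solve_w_gd(X, Y, Omega1, Omega2, iters=800):
    """Fixed-step gradient descent for min_W ||Y - XW||_F^2 + tr(O1 W O2 W^T)."""
    XtX, XtY = X.T @ X, X.T @ Y           # cached once, reused every step
    ev = np.linalg.eigvalsh
    L = 2 * (ev(XtX)[-1] + ev(Omega1)[-1] * ev(Omega2)[-1])  # smoothness bound
    mu = 2 * (ev(XtX)[0] + ev(Omega1)[0] * ev(Omega2)[0])    # strong convexity bound
    eta = 2.0 / (L + mu)                  # classic fixed step, linear convergence
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(iters):
        grad = 2 * (XtX @ W - XtY) + 2 * Omega1 @ W @ Omega2
        W -= eta * grad
    return W
```

The per-iteration cost is a few small matrix products, avoiding the dm-by-dm Kronecker system entirely.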

Sylvester equation. In the field of control theory, a Sylvester equation (Bhatia and Rosenthal, 1997) is a matrix equation of the form , where the goal is to find a solution matrix given and . For this problem, there are efficient numerical algorithms with highly optimized implementations that can obtain a solution within cubic time. For example, the Bartels-Stewart algorithm (Bartels and Stewart, 1972) solves the Sylvester equation by first transforming and into Schur forms by QR factorization, and then solves the resulting triangular system via back-substitution. Our third approach is based on the observation that we can equivalently transform the first-order optimality equation into a Sylvester equation by multiplying both sides of the equation by : . As a result, finding the optimal solution of the subproblem amounts to solving the above Sylvester equation. Specifically, the solution to the above equation can be obtained using the Bartels-Stewart algorithm in .
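Under the same assumed first-order condition XᵀXW + Ω₁WΩ₂ = XᵀY (the exact constants are elided above), left-multiplying by the inverse of Ω₁ yields the Sylvester form, and SciPy exposes a Bartels-Stewart solver directly. This is a sketch, not the paper's code:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def solve_w_sylvester(X, Y, Omega1, Omega2):
    """Solve Omega1^{-1} X^T X W + W Omega2 = Omega1^{-1} X^T Y
    via the Bartels-Stewart algorithm (scipy.linalg.solve_sylvester)."""
    A = np.linalg.solve(Omega1, X.T @ X)   # Omega1^{-1} X^T X
    Q = np.linalg.solve(Omega1, X.T @ Y)   # Omega1^{-1} X^T Y
    return solve_sylvester(A, Omega2, Q)   # solves A W + W B = Q
```

Note that this formulation requires a single shared design matrix X, which is why it only applies when all tasks share the same instances.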

Remark. Both gradient descent and the Bartels-Stewart algorithm find the optimal solution in cubic time. However, gradient descent is more widely applicable: the Bartels-Stewart algorithm only applies when all the tasks share the same instances, so that we can write down the matrix equation explicitly, while gradient descent can be applied when each task has a different number of inputs and those inputs are not shared among tasks. On the other hand, as we will see shortly in the experiments, in practice the Bartels-Stewart algorithm is faster than gradient descent and provides a more numerically stable solution.

4.2 Optimization of and

0:  , and .
1:   SVD.
2:  .
3:  .
Algorithm 1 Minimize
0:  , and .
1:   SVD.
2:  .
3:  .
Algorithm 2 Minimize

Before we delve into the detailed analysis below, we first list the final algorithms used to optimize and in Alg. 1 and Alg. 2, respectively. The hard-thresholding function used in Line 2 of Alg. 1 and Alg. 2 is defined as follows:


The hard-thresholding function keeps the value of its argument if , and otherwise truncates it to or , respectively. Both algorithms are remarkably simple: each involves only one SVD, one truncation, and two matrix multiplications. The computational complexities of Alg. 1 and Alg. 2 are bounded by and , respectively.
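A plausible reading of Alg. 1 in code (hedged: the constants elided above are taken as generic a, b, c). Each covariance subproblem reduces to minimizing tr(ΩA) − c·log det Ω over aI ⪯ Ω ⪯ bI for a symmetric PSD matrix A, and the analysis of Sec. 4.2 shows the optimum is obtained by eigendecomposing A and hard-thresholding c/σᵢ into [a, b]:

```python
import numpy as np

def minimize_omega(A, c, a, b):
    """argmin tr(Omega @ A) - c*logdet(Omega) over a*I <= Omega <= b*I,
    for symmetric PSD A: one eigendecomposition, one hard-threshold,
    two matrix multiplications (mirroring Alg. 1 / Alg. 2)."""
    sigma, U = np.linalg.eigh(A)
    # Unconstrained scalar optimum is c/sigma_i; clip into [a, b].
    omega = np.clip(c / np.maximum(sigma, 1e-300), a, b)
    return U @ np.diag(omega) @ U.T
```

The optimum shares eigenvectors with A, so the whole constrained matrix problem collapses to independent clipped scalar problems.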

In this section we focus on analyzing the optimization w.r.t. . A symmetric analysis can be applied to solve as well. In order to minimize over when and are fixed, we solve the following subproblem:


Although (7) is a convex optimization problem, it is computationally expensive to solve with off-the-shelf algorithms, e.g., interior point methods, because of the constraints and the nonlinearity of the objective function. However, as we will show shortly, we can find a closed form optimal solution to this problem using tools from the theory of doubly stochastic matrices (Dufossé and Uçar, 2016) and perfect bipartite graph matching. Due to space limits, we defer the detailed derivation and proof to the appendix, and only give a sketch below.

Without loss of generality, for any feasible , using spectral decomposition, we can reparametrize as


where . Similarly, we can represent


where . Let and . Set and define to be the Hadamard product of , i.e., . Since both and are orthonormal matrices, it immediately follows that is also an orthonormal matrix. As a result, we have the following two equations hold:

which implies that is a doubly stochastic matrix. Given that is an orthonormal matrix, we have . On the other hand, it can be readily verified that the following equality holds:


Combining the transformations in (8), (9) and (10) and plugging them into (7), we obtain the following equivalent optimization problem:


where denotes a vector of all ones with dimension . To solve (11), we make the following key observations:

  1. The minimization is decomposable in terms of and . Furthermore, the optimization over is a linear program (LP).

  2. For any bounded LP, there exists at least one extreme point that achieves the optimal solution.

  3. The set of doubly stochastic matrices, denoted as , forms a convex polytope, known as the Birkhoff polytope.

  4. By the Birkhoff-von Neumann theorem, is the convex hull of the set of permutation matrices, i.e., every extreme point of is a permutation matrix.

Combining the observations above, it is clear that the optimal solution must be a permutation matrix. This motivates us to reduce (11) to a minimum-weight perfect matching problem on a weighted complete bipartite graph: for any , we construct a weighted bipartite graph as follows:

  • For each , construct a vertex , .

  • For each , construct a vertex , .

  • For each pair , construct an edge with weight .

The following theorem relates the solution of the minimum weight matching to the partial solution of (11) w.r.t. :

Theorem 4.1.

Let and with and . The minimum-weight perfect matching on is with the minimum weight . Furthermore, it equals .

Proof sketch. The full proof of Thm. 4.1 is deferred to the appendix; here we only sketch the high-level idea. Given a matching in the graph, as shown in Fig. 1, if there is an inverse pair (a cross) in the matching, then we can improve the matching by removing the inverse pair. Since there are only finitely many inverse pairs, an inductive argument shows that the optimal matching is achieved when there is no inverse pair, i.e., is matched to .

Figure 1: The inductive proof works by recursively removing inverse pairs from the right side of the graph to the left side. The process stops when there is no inverse pair left in the matching. Red is used to highlight edges in the perfect matching.
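The "no inverse pair" structure of the optimal matching can be checked numerically. As an illustration (the exact edge weights are elided above), assume weights of the product form w(u_i, v_j) = p_i·q_j with both sequences sorted ascending; by the rearrangement inequality, the minimum-weight perfect matching pairs the largest p with the smallest q, leaving no crossings in the opposite ordering:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
p = np.sort(rng.uniform(0.1, 5.0, size=6))   # ascending
q = np.sort(rng.uniform(0.1, 5.0, size=6))   # ascending
W = np.outer(p, q)                           # w(u_i, v_j) = p_i * q_j
rows, cols = linear_sum_assignment(W)        # min-weight perfect matching
# The optimal matching reverses the order: u_i is matched to v_{n-1-i},
# so the minimum weight is the ascending-descending inner product.
assert np.array_equal(cols, np.arange(5, -1, -1))
assert np.isclose(W[rows, cols].sum(), np.dot(p, q[::-1]))
```

This mirrors the closed-form matching in Thm. 4.1 without invoking a general matching solver in the algorithm itself.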

The optimal matching in Thm. 4.1 suggests that the optimal doubly stochastic matrix is given by , which also implies and . Plugging back into (11), the optimization w.r.t.  decomposes into independent scalar optimization problems, which can be easily solved. Using the hard-thresholding function defined in (6), we can express the optimal solution as . Combining all of the analysis above, we obtain the algorithms listed at the beginning of this section for optimizing and . Interestingly, they have a close connection to the proximal methods proposed in the literature to solve matrix completion (Cai et al., 2010), or Euclidean projection under a trace norm constraint (Chen et al., 2011, 2012). To the best of our knowledge, this is the first algorithm that efficiently solves the minimization of a linear function over matrices with negative log-determinant regularization (e.g., (7)).

5 Nonlinear Extension

So far we have discussed our FETR framework under the linear regression model, but it can readily be extended to nonlinear regression/classification settings. One straightforward way to do so is to apply the (orthogonal) random Fourier transformation (Rahimi and Recht, 2008; Felix et al., 2016) to generate high-dimensional random features, so that linear FETR in the transformed space corresponds to nonlinear models in the original feature space. However, depending on the dimension of the random features, this approach might lead to a huge covariance matrix that is expensive to optimize.

A more natural and expressive approach is to combine our regularization scheme and optimization method with parametrized nonlinear feature transformations, such as neural networks. More specifically, let be a neural network with learnable parameter that defines a nonlinear transformation of the input features from to . Essentially, we can replace the feature matrix in (4) with to create a regularized multitask neural network (Caruana, 1997), where we add one more layer, defined by the matrix , on top of the nonlinear mapping given by . To train the model, we use backpropagation to optimize , and our proposed approach to optimize the two covariance matrices. We further explore this nonlinear extension in Sec. 6 to demonstrate its power in statistical modeling.
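A minimal sketch of how the pieces compose (all shapes, names, and the layer structure here are illustrative assumptions, not the authors' implementation): a feed-forward map φ replaces the raw features, the task layer W sits on top, and the matrix-normal regularizer tr(Ω₁WΩ₂Wᵀ) couples W's rows (transformed features) and columns (tasks):

```python
import numpy as np

def mtn_fetr_loss(X, Y, params, W, Omega1, Omega2, rho=1.0):
    """Sketch of the regularized multitask objective: an MLP phi maps inputs
    from R^d to R^k, a task layer W (k x m) sits on top, and the matrix-normal
    regularizer tr(Omega1 W Omega2 W^T) couples rows and columns of W.
    `params` = [(A1, b1), (A2, b2), ...] are hypothetical MLP weights."""
    H = X
    for A, b in params:
        H = np.maximum(H @ A + b, 0.0)         # ReLU hidden layers: phi(X)
    resid = Y - H @ W                          # fit term uses phi(X) in place of X
    reg = np.trace(Omega1 @ W @ Omega2 @ W.T)  # matrix-normal prior term (>= 0 for PD Omegas)
    return np.sum(resid ** 2) + rho * reg
```

In training, backpropagation would update `params` and W through this loss, while the two covariance (precision) matrices are updated by the closed-form steps of Alg. 1 and Alg. 2.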

(a) The mean run time (seconds) under each experimental configuration. The closed form solution does not scale when .
(b) The convergence speed of coordinate minimization versus projected gradient descent and the flip-flop algorithm on the SARCOS dataset. All experiments are repeated 10 times.
Figure 2: Experimental results of the convergence analysis.

6 Experiments

6.1 Convergence Analysis and Computational Efficiency

We first investigate the efficiency and scalability of the three different algorithms for minimizing w.r.t.  on synthetic datasets. For each experiment, we generate a synthetic dataset consisting of instances that are shared among all the tasks. All the instances are sampled uniformly at random from . We gradually increase the dimension of features, , and the number of tasks, , to test scalability. The first algorithm implements the closed form solution by explicitly computing the matrix product and then solving the linear system. The second is the proposed gradient descent, and the last uses the Bartels-Stewart algorithm to solve the equivalent Sylvester equation to compute . We use the open-source toolkit SciPy, whose backend implementation uses highly optimized Fortran code. For all the synthetic experiments we set and , which corresponds to a condition number of . We fix the coefficients . We repeat each experiment 10 times to report both the mean and the variance. The experimental results are shown in Fig. 2(a). As expected, the closed form solution does not scale to problems of even moderate size, due to its large memory requirement. In practice the Bartels-Stewart algorithm is about one order of magnitude faster than the gradient descent method when either or is large. It is also worth pointing out that, based on our observations, the Bartels-Stewart algorithm is the most numerically stable of the three.

Method     1st     2nd     3rd    4th     5th    6th    7th    NMSE
STL        31.40   22.90   9.13   10.30   0.14   0.84   0.46   0.9882 ± 0.0196
MTFL       31.41   22.91   9.13   10.33   0.14   0.83   0.45   0.8891 ± 0.0380
MTRL       31.09   22.69   9.08   9.74    0.14   0.83   0.44   0.9007 ± 0.0407
MTFRL      31.13   22.60   9.10   9.74    0.13   0.83   0.45   0.8451 ± 0.0197
FETR       31.08   22.68   9.08   9.73    0.13   0.83   0.43   0.8134 ± 0.0253
STN        24.81   17.20   8.97   8.36    0.13   0.72   0.34   n/a
MTN        12.01   10.54   5.02   7.15    0.09   0.70   0.27   n/a
MTN-FETR   10.77   9.34    4.95   7.01    0.08   0.59   0.24   n/a
Table 1: Mean squared error on the SARCOS data (columns 1st through 7th, one per joint torque) and the mean ± standard deviation of the normalized mean squared error (NMSE) on the school dataset across 10-fold cross-validation.

We compare our proposed coordinate minimization algorithm with an off-the-shelf projected gradient method and the flip-flop algorithm for solving the optimization problem (4). Specifically, the projected gradient method updates and in each iteration and then projects and onto the corresponding feasible regions. The flip-flop algorithm is implemented as suggested by Zhang and Schneider (2010), and we use a fudge factor of to avoid the nonpositive definite problem. In each iteration, both covariance matrices are projected onto the feasible region as well. In the SARCOS dataset all the instances are shared among the tasks, so the Sylvester solver is used to optimize in coordinate minimization. We repeat the experiments 10 times and report the mean and standard deviation of the function values versus the time used by all three algorithms (Fig. 2(b)). It is clear from Fig. 2(b) that our proposed algorithm not only converges much faster than the other two competitors, but also achieves better results. In fact, as we observed in our experiments, the proposed algorithm usually converges in fewer than 10 iterations.

6.2 Datasets

Robot Inverse Dynamics

This dataset relates to an inverse dynamics problem for a seven degree-of-freedom (DOF) SARCOS anthropomorphic robot arm (Vijayakumar and Schaal, 2000). The goal is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. Hence there are 7 tasks, and the inputs are shared among all the tasks. The training set and test set contain 44,484 and 4,449 examples, respectively.

School Data This dataset consists of the examination scores of 15,362 students from 139 secondary schools (Goldstein, 1991). It has 27 input features and contains 139 tasks. In this dataset, instances are not shared among different tasks, hence we use our gradient descent solver for instead of the Sylvester equation solver. We use 10-fold cross-validation to generate training and test datasets, and for each partition we compute the mean of the normalized mean squared error (NMSE) over the 139 tasks. The NMSE is defined as the ratio of the MSE to the variance of a task. We report the mean NMSE and its standard deviation across the 10 cross-validation folds. We compare FETR with multitask feature learning (MTFL) (Evgeniou and Pontil, 2007), multitask relationship learning (MTRL) (Zhang and Yeung, 2010a), and the MTFRL framework on both the School and the SARCOS datasets. We use ridge regression as our baseline model, denoted single task learning (STL).

6.3 Results and Analysis

For the SARCOS dataset, we partition the training set into a development set and a validation set, containing 31,138 and 13,346 instances respectively. To show the power of the nonlinear extension, we also include a single-task neural network (STN) and a multitask neural network (MTN) in the experiment, based on which we propose MTN-FETR, which incorporates our regularization scheme into the last layer of MTN. STN uses a separate network for each task, while in MTN all layers except the last output layer are shared among the different tasks. In the experiment, STN, MTN and MTN-FETR share exactly the same network structure: an input layer with 21 dimensions, followed by two hidden layers with 256 and 100 hidden units. The output of the network in MTN and MTN-FETR is a multitask layer that contains 7 output units, while in STN the output contains a single unit. All the methods share the same experimental setting, including model selection: in all the experiments some hyperparameters are fixed, and the remaining hyperparameters are selected from a range of candidate values using the validation set. Note that because instances are not shared between different tasks in the School dataset, MTN and MTN-FETR cannot be directly applied to it. For each method, the best model on the validation set is selected. The results are summarized in Table 1 (smaller is better). Among all the methods, FETR consistently achieves lower test-set MSEs. More importantly, we observe a significant improvement of both MTN and MTN-FETR over all the linear baselines and STN, and MTN-FETR further improves over MTN on all the tasks. The experimental results confirm that multitask learning usually improves over single-task learning when the dataset is small and the tasks are related. Furthermore, among all the competitors, nonlinear models combined with our FETR framework give the overall best results, demonstrating the effectiveness of the proposed approach in both linear and nonlinear settings.
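The hard-parameter-sharing architecture described above can be sketched as a pair of numpy forward passes. The layer sizes (21 inputs, hidden layers of 256 and 100 units, 7 task outputs) follow the text; the ReLU activation and the random weights are placeholder assumptions for illustration, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    """A dense layer with small random weights and zero biases (placeholder)."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

# Shared trunk: 21 -> 256 -> 100, shared across all tasks in MTN.
W1, b1 = layer(21, 256)
W2, b2 = layer(256, 100)
# MTN: one multitask output layer with 7 units (one prediction per task).
W_out, b_out = layer(100, 7)

def mtn_forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU is an assumption; the text
    h = np.maximum(h @ W2 + b2, 0.0)   # does not name the activation
    return h @ W_out + b_out           # 7 task predictions per instance
```

In STN the same trunk would end in a single output unit and be replicated once per task; MTN-FETR keeps the MTN structure and additionally regularizes `W_out` with the FETR covariance penalty.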

(a) Feature covariance matrix.
(b) Task covariance matrix.
Figure 3: Estimated feature and task covariance matrices on the SARCOS dataset.

One by-product of the FETR method is that we also have access to the estimated row and column covariance matrices. In Fig. 3 we plot the estimated feature and task covariance matrices, where we can clearly observe a block-diagonal structure: the first 4 tasks are negatively correlated with the remaining 3, and the 5th and 6th tasks are positively correlated. Intuitively, these correlations are consistent with the SARCOS dataset, where several joints move up and down jointly.
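To read off sign structure of the kind discussed above, it helps to rescale an estimated covariance matrix to a correlation matrix. A minimal sketch, using a toy 3-task covariance (not the SARCOS estimate):

```python
import numpy as np

def cov_to_corr(S):
    """Rescale a covariance matrix to a correlation matrix so that
    positive/negative task relationships can be read off directly."""
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

# Toy example: tasks 0 and 1 move together, task 2 opposes both.
S = np.array([[ 2.0,  1.0, -1.0],
              [ 1.0,  2.0, -1.5],
              [-1.0, -1.5,  3.0]])
C = cov_to_corr(S)
```

The diagonal of `C` is identically 1, and off-diagonal signs match the covariance signs, which is what makes block structure visually apparent in plots like Fig. 3.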

7 Conclusion

In this paper we point out a common misconception in the existing multitask feature and relationship learning framework, and propose a constrained variant to fix it. Our framework admits a multiconvex formulation, which allows us to design an efficient block coordinate-wise minimization algorithm. To solve the weight learning subproblem, we propose three different strategies that cover both the case where instances are shared by multiple tasks and the case where they are not. To learn the covariance matrices, we reduce the underlying matrix optimization subproblem to a minimum weight perfect matching problem, and solve it exactly in closed form. To the best of our knowledge, all previous methods have to resort to expensive iterative procedures to solve this subproblem. Finally, we discuss several possible extensions of the proposed framework to nonlinear settings. Experimental evidence shows that our method is orders of magnitude faster than its competitors, and that it delivers significantly improved statistical performance on two real-world datasets.


  • Adel et al. (2017) T. Adel, H. Zhao, and A. Wong. Unsupervised domain adaptation with a relaxed covariate shift assumption. In AAAI, pages 1691–1697, 2017.
  • Argyriou et al. (2007) A. Argyriou, M. Pontil, Y. Ying, and C. A. Micchelli. A spectral regularization framework for multi-task structure learning. In Advances in neural information processing systems, 2007.
  • Argyriou et al. (2008) A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
  • Bartels and Stewart (1972) R. H. Bartels and G. Stewart. Solution of the matrix equation AX + XB = C [F4]. Communications of the ACM, 15(9):820–826, 1972.
  • Bertsimas and Tsitsiklis (1997) D. Bertsimas and J. N. Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific Belmont, MA, 1997.
  • Bhatia and Rosenthal (1997) R. Bhatia and P. Rosenthal. How and why to solve the operator equation AX − XB = Y. Bulletin of the London Mathematical Society, 29(1):1–21, 1997.
  • Bonilla et al. (2007) E. V. Bonilla, F. V. Agakov, and C. Williams. Kernel multi-task learning using task-specific features. In International Conference on Artificial Intelligence and Statistics, pages 43–50, 2007.
  • Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • Cai et al. (2010) J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
  • Caruana (1997) R. Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
  • Chen et al. (2011) J. Chen, J. Zhou, and J. Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 42–50. ACM, 2011.
  • Chen et al. (2012) J. Chen, J. Liu, and J. Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(4):22, 2012.
  • Dufossé and Uçar (2016) F. Dufossé and B. Uçar. Notes on Birkhoff–von Neumann decomposition of doubly stochastic matrices. Linear Algebra and its Applications, 497:108–115, 2016.
  • Dutilleul (1999) P. Dutilleul. The MLE algorithm for the matrix normal distribution. Journal of statistical computation and simulation, 64(2):105–123, 1999.
  • Evgeniou and Pontil (2007) A. Evgeniou and M. Pontil. Multi-task feature learning. Advances in neural information processing systems, 19:41, 2007.
  • Evgeniou and Pontil (2004) T. Evgeniou and M. Pontil. Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.
  • Felix et al. (2016) X. Y. Felix, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.
  • Glanz and Carvalho (2013) H. Glanz and L. Carvalho. An expectation-maximization algorithm for the matrix normal distribution. arXiv preprint arXiv:1309.6609, 2013.
  • Goldstein (1991) H. Goldstein. Multilevel modelling of survey data. Journal of the Royal Statistical Society. Series D (The Statistician), 40(2):235–244, 1991.
  • Gupta and Nagar (1999) A. K. Gupta and D. K. Nagar. Matrix variate distributions, volume 104. CRC Press, 1999.
  • Jacob et al. (2009) L. Jacob, J.-p. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Advances in neural information processing systems, pages 745–752, 2009.
  • Jawanpuria et al. (2015) P. Jawanpuria, M. Lapin, M. Hein, and B. Schiele. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems, pages 1189–1197, 2015.
  • Kato et al. (2008) T. Kato, H. Kashima, M. Sugiyama, and K. Asai. Multi-task learning via conic programming. In Advances in Neural Information Processing Systems, pages 737–744, 2008.
  • Li et al. (2014) C. Li, J. Zhu, and J. Chen. Bayesian max-margin multi-task learning with data augmentation. In International Conference on Machine Learning, pages 415–423, 2014.
  • Li et al. (2015) Y. Li, X. Tian, T. Liu, and D. Tao. Multi-task model and feature joint learning. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3643–3649. AAAI Press, 2015.
  • Liu et al. (2009) J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 339–348. AUAI Press, 2009.
  • Long et al. (2017) M. Long, Z. Cao, J. Wang, and S. Y. Philip. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems, pages 1593–1602, 2017.
  • Nesterov (2013) Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
  • Rahimi and Recht (2008) A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
  • Roś et al. (2016) B. Roś, F. Bijma, J. C. de Munck, and M. C. de Gunst. Existence and uniqueness of the maximum likelihood estimator for models with a Kronecker product covariance structure. Journal of Multivariate Analysis, 143:345–361, 2016.
  • Shewchuk et al. (1994) J. R. Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
  • Thrun (1996) S. Thrun. Explanation-based neural network learning. In Explanation-Based Neural Network Learning, pages 19–48. Springer, 1996.
  • Vijayakumar and Schaal (2000) S. Vijayakumar and S. Schaal. Locally weighted projection regression: Incremental real time learning in high dimensional space. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 1079–1086. Morgan Kaufmann Publishers Inc., 2000.
  • Zhang (2011) Y. Zhang. Supervision reduction by encoding extra information about models, features and labels. 2011.
  • Zhang and Schneider (2010) Y. Zhang and J. G. Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems, pages 2550–2558, 2010.
  • Zhang and Yang (2017) Y. Zhang and Q. Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
  • Zhang and Yeung (2010a) Y. Zhang and D.-Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, 2010a.
  • Zhang and Yeung (2010b) Y. Zhang and D.-Y. Yeung. Multi-task learning using generalized t process. In AISTATS, pages 964–971, 2010b.

Appendix A Proofs

A.1 Proof of Proposition 4.1

We restate Proposition 4.1 and then give its proof.

To prove this claim, we need the following standard facts about the Kronecker (tensor) product:

Fact A.1.

Let $A \in \mathbb{R}^{m \times n}$ be a matrix. Then $\|A\|_F^2 = \mathrm{vec}(A)^\top \mathrm{vec}(A)$.

Fact A.2.

Let $A \in \mathbb{R}^{m \times n}$, $X \in \mathbb{R}^{n \times p}$, and $B \in \mathbb{R}^{p \times q}$. Then $\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X)$.

Fact A.3.

Let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$. Then $(A \otimes B)^\top = A^\top \otimes B^\top$.

Fact A.4.

Let $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{n \times n}$. Let $\{\lambda_i\}_{i=1}^m$ be the spectrum of $A$ and $\{\mu_j\}_{j=1}^n$ be the spectrum of $B$. Then the spectrum of $A \otimes B$ is $\{\lambda_i \mu_j : 1 \le i \le m,\ 1 \le j \le n\}$.
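The facts above can be checked numerically. The following sketch verifies the vectorization identity vec(AXB) = (Bᵀ ⊗ A) vec(X) and the Kronecker-spectrum fact on random matrices; the column-major `vec` convention is an assumption matching the standard statement of these identities.

```python
import numpy as np

rng = np.random.default_rng(1)

def vec(M):
    # Column-stacking (column-major) vectorization.
    return M.reshape(-1, order="F")

# Vectorization identity: vec(AXB) = (B^T kron A) vec(X).
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 2))
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)

# Spectrum of a Kronecker product: all pairwise eigenvalue products.
A2 = rng.normal(size=(3, 3)); A2 = A2 + A2.T   # symmetrize for eigvalsh
B2 = rng.normal(size=(2, 2)); B2 = B2 + B2.T
kron_spec = np.sort(np.linalg.eigvalsh(np.kron(A2, B2)))
pair_prods = np.sort(np.outer(np.linalg.eigvalsh(A2),
                              np.linalg.eigvalsh(B2)).ravel())
```

Both checks hold to machine precision, which is why the proof can move freely between matrix and vectorized forms.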

We can show the claimed result by transforming the objective into its isomorphic vectorized counterpart: applying Fact A.1 to rewrite the loss in terms of $\mathrm{vec}(W)$, Fact A.2 to vectorize the matrix products, and Fact A.3 to collect the transposed Kronecker terms, the objective becomes a quadratic function of $\mathrm{vec}(W)$. Setting the gradient of this quadratic to zero, we can read off that the optimal solution must satisfy a linear system of the form

$H\,\mathrm{vec}(W^*) = b,$

where the Hessian $H$ is a sum of Kronecker products of the data and covariance matrices and $b$ is the vectorized linear term. $W^*$ can then be obtained simply by reformatting $\mathrm{vec}(W^*)$ into a $d \times m$ matrix. The computational bottleneck in the above procedure is solving a $dm \times dm$ system of equations, which scales as $O(d^3 m^3)$ if no further structure is available, and this dominates the overall computational complexity. ∎
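The vectorize-and-solve procedure can be sketched concretely. The objective form below (squared loss plus a matrix-normal-style penalty tr(Ω₁⁻¹ W Ω₂⁻¹ Wᵀ) with weight `eta`) and all symbols `Omega1`, `Omega2`, `eta` are illustrative assumptions, not necessarily the paper's exact formulation; the point is the Kronecker structure of the normal equations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m, eta = 20, 4, 3, 0.5
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, m))
Omega1 = 2.0 * np.eye(d)   # feature covariance surrogate (assumed feasible)
Omega2 = 1.5 * np.eye(m)   # task covariance surrogate (assumed feasible)

# Normal equations of  min_W ||Y - XW||_F^2 + eta * tr(Om1^{-1} W Om2^{-1} W^T):
#   (I_m kron X^T X + eta * Om2^{-1} kron Om1^{-1}) vec(W) = vec(X^T Y)
H = np.kron(np.eye(m), X.T @ X) + eta * np.kron(np.linalg.inv(Omega2),
                                                np.linalg.inv(Omega1))
b = (X.T @ Y).reshape(-1, order="F")          # column-major vec(X^T Y)
W = np.linalg.solve(H, b).reshape(d, m, order="F")

# Stationarity check: the matrix-form gradient vanishes at the solution.
grad = 2 * (X.T @ (X @ W - Y)) \
     + 2 * eta * np.linalg.inv(Omega1) @ W @ np.linalg.inv(Omega2)
```

Solving the dense `dm × dm` system costs O(d³m³), which is exactly the bottleneck the proposition identifies; the Sylvester-equation solver avoids forming `H` explicitly.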

A.2 Proof of Proposition 4.2

To analyze the convergence rate of gradient descent in this case, we start by bounding the smallest and largest eigenvalues of the Hessian of the quadratic objective.

Lemma A.1 (Weyl’s inequality).

Let and be -by- Hermitian matrices, and . Let , and be the eigenvalues of and respectively. Then the following inequalities hold for , :

Let be the -th largest eigenvalue of matrix .

Lemma A.2.

If $\Omega_1$ and $\Omega_2$ are feasible in (4), then the eigenvalues of the Hessian $H$ of the quadratic weight-learning objective are bounded: $l \le \lambda_{\min}(H)$ and $\lambda_{\max}(H) \le L$, for constants $0 < l \le L$ determined by the spectral constraints in (4).

By Weyl's inequality, setting $i = j = 1$, we have $\lambda_1(A + B) \le \lambda_1(A) + \lambda_1(B)$. Setting $i = j = n$, we have $\lambda_n(A + B) \ge \lambda_n(A) + \lambda_n(B)$. Writing the Hessian as a sum of the data term and the Kronecker-product regularization term, $H = I \otimes X^\top X + \eta\,\Omega_2^{-1} \otimes \Omega_1^{-1}$, we can bound its largest and smallest eigenvalues as follows:

$\lambda_{\max}(H) \le \lambda_{\max}(I \otimes X^\top X) + \eta\,\lambda_{\max}(\Omega_2^{-1} \otimes \Omega_1^{-1})$ (by Weyl's inequality)
$= \lambda_{\max}(X^\top X) + \eta\,\lambda_{\max}(\Omega_2^{-1})\,\lambda_{\max}(\Omega_1^{-1})$ (by Fact A.4)
$\le L$ (by the feasibility assumption),

and

$\lambda_{\min}(H) \ge \lambda_{\min}(I \otimes X^\top X) + \eta\,\lambda_{\min}(\Omega_2^{-1} \otimes \Omega_1^{-1})$ (by Weyl's inequality)
$= \lambda_{\min}(X^\top X) + \eta\,\lambda_{\min}(\Omega_2^{-1})\,\lambda_{\min}(\Omega_1^{-1})$ (by Fact A.4)
$\ge l > 0$ (by the feasibility assumption).
We will first introduce the following two lemmas, adapted from [Nesterov, 2013], using the fact that the spectral norm of the Hessian matrix is bounded.

Lemma A.3.

Let $f$ be a twice differentiable function with $\nabla^2 f(x) \preceq L I$ for all $x$, where $L$ is a constant, and suppose the minimum value $f^*$ of $f$ can be achieved. Let $x^+ = x - \frac{1}{L}\nabla f(x)$; then

$f(x^+) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|_2^2.$

Lemma A.4.

Let $f$ be a convex, twice differentiable function with $l I \preceq \nabla^2 f(x)$ for all $x$, where $l$ is a constant. Then for all $x, y$:

$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{l}{2}\|y - x\|_2^2.$

We can now proceed to show Proposition 4.2, which we restate here for convenience.

Define $f$ to be the quadratic objective of the weight-learning subproblem, viewed as a function of $\mathrm{vec}(W)$. Since we have already bounded $l I \preceq \nabla^2 f \preceq L I$ (Lemma A.2), it follows that $f$ is a convex function, and furthermore Lemmas A.3 and A.4 both apply to it. Applying Lemma A.4 to $f$, for all $x, y$ we have:

$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{l}{2}\|y - x\|_2^2.$

Plugging $y = x - \frac{1}{l}\nabla f(x)$, which minimizes the right-hand side over $y$, into the above inequality and after some algebraic manipulations, we have:

$f^* \ge f(x) - \frac{1}{2l}\|\nabla f(x)\|_2^2, \qquad \text{i.e.,} \qquad \|\nabla f(x)\|_2^2 \ge 2l\,(f(x) - f^*). \tag{13}$

Let $\Delta_t = f(x_t) - f^*$. Within each iteration of the algorithm, we have the update formula $x_{t+1} = x_t - \frac{1}{L}\nabla f(x_t)$, so by Lemma A.3 we can bound $\Delta_{t+1}$ as follows:

$\Delta_{t+1} \le \Delta_t - \frac{1}{2L}\|\nabla f(x_t)\|_2^2 \le \left(1 - \frac{l}{L}\right)\Delta_t$ (by inequality 13).

Applying the above inequality recursively for $t$ times, we have

$\Delta_t \le (1 - \kappa^{-1})^t\,\Delta_0,$

where $\kappa = L / l$ is the condition number. For $\Delta_t \le \epsilon$, it suffices to run $t \ge \kappa \log(\Delta_0 / \epsilon)$ iterations, since $(1 - \kappa^{-1})^t \le e^{-t/\kappa}$. ∎
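The linear convergence argument can be demonstrated end to end on a small strongly convex quadratic; the matrix below is an arbitrary illustration, not the paper's weight-learning Hessian.

```python
import numpy as np

rng = np.random.default_rng(4)
# Strongly convex quadratic f(x) = 0.5 x^T H x - b^T x with l I <= H <= L I.
Q = rng.normal(size=(6, 6))
H = Q @ Q.T + 0.5 * np.eye(6)   # positive definite by construction
b = rng.normal(size=6)
evals = np.linalg.eigvalsh(H)
l, L = evals[0], evals[-1]

x_star = np.linalg.solve(H, b)           # unique minimizer
f = lambda x: 0.5 * x @ H @ x - b @ x
f_star = f(x_star)

x = np.zeros(6)
deltas = []                               # Delta_t = f(x_t) - f*
for _ in range(50):
    deltas.append(f(x) - f_star)
    x = x - (1.0 / L) * (H @ x - b)       # gradient step with size 1/L

rate = 1.0 - l / L                        # per-iteration contraction factor
```

Each iteration contracts the suboptimality gap by at least the factor `1 - l/L`, matching the recursion in the proof.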