Variable Selection and Task Grouping for Multi-Task Learning
We consider multi-task learning, which simultaneously learns related prediction tasks to improve generalization performance. We factorize a coefficient matrix as the product of two matrices based on a low-rank assumption. These matrices have sparsities that simultaneously perform variable selection and learn an overlapping group structure among the tasks. The resulting bi-convex objective function is minimized by alternating optimization, where the sub-problems are solved using the alternating direction method of multipliers and accelerated proximal gradient descent. Moreover, we provide a performance bound for the proposed method. The effectiveness of the proposed method is validated on both synthetic and real-world datasets.
Multi-task learning (MTL) refers to simultaneously learning multiple related prediction tasks rather than learning each task independently (Caruana, 1997; Zhang and Yang, 2017). Simultaneous learning enables us to share common information among related tasks and works as an inductive bias to improve generalization performance. MTL is based on the premise that humans can learn a new task easily when they already have knowledge from similar tasks.
The major challenges in MTL are how to share common information among related tasks and how to prevent unrelated tasks from sharing information. Previous studies achieved this by performing variable selection (Jalali et al., 2010; Wang et al., 2016), assuming a low-rank structure (Ando and Zhang, 2005; Argyriou et al., 2008; Kumar and Daumé III, 2012), or learning structures among tasks (Jacob et al., 2008; Zhou and Zhao, 2016; Lee et al., 2016; Han and Zhang, 2015).
The variable selection approach selects a subset of variables for related tasks (Liu et al., 2009; Wang et al., 2016). Traditional studies are based on a strict assumption that selected variables are shared among all tasks (Liu et al., 2009; Obozinski et al., 2010). Recent studies have suggested more flexible approaches that select variables by decomposing a coefficient matrix into a shared part and an individual part (Jalali et al., 2010; Hernandez-Lobato et al., 2015) or by factorizing a coefficient matrix into a variable-specific part and a task-specific part (Wang et al., 2016). Although the variable selection approach provides better interpretability than the other approaches, it has limited ability to share common information among related tasks.
The low-rank approach assumes that coefficient vectors lie within a low-dimensional latent space (Ando and Zhang, 2005; Argyriou et al., 2008); it is a form of representation learning that transforms input variables into low-dimensional features and learns coefficient vectors in the feature space (Maurer et al., 2016). The low-rank approach has also been widely studied in multi-output regression, where all tasks have real-valued outputs and share the same training set (Chen and Huang, 2012). A low-rank structure can be achieved by imposing a trace norm constraint (Argyriou et al., 2008), encouraging sparsity on the singular values of a coefficient matrix (Richard et al., 2012, 2013; McDonald et al., 2016; Han and Zhang, 2016), or factorizing a coefficient matrix as the product of a variable-latent matrix and a latent-task matrix (Ando and Zhang, 2005; Argyriou et al., 2008; Kang et al., 2011a; Kumar and Daumé III, 2012; Maurer et al., 2016). Several studies have shown that the low-rank approach is equivalent to an approach that assumes a group structure among tasks (McDonald et al., 2016; Zhou et al., 2011). Thus, recent studies on the low-rank approach have focused on improving the ability of models to learn group structures among tasks (Kang et al., 2011b; Kumar and Daumé III, 2012; Barzilai and Crammer, 2015). The low-rank approach provides a flexible way to share common information among related tasks and reduces the effective number of parameters.
In this paper, we attempt to combine the variable selection approach and the learning of group structures among tasks, especially based on the low-rank approach. This combination learns sparse representations to provide better interpretability and shares common information among related tasks in a group to improve generalization performance. Previous studies have either partially achieved this goal or have limitations. For example, Chen and Huang (Chen and Huang, 2012) factorized a coefficient matrix and imposed sparsity between the rows of a variable-latent matrix to perform variable selection; however, they solved multi-output regression and did not explicitly learn a group structure among tasks. Kumar and Daumé III (Kumar and Daumé III, 2012) also factorized a coefficient matrix and imposed sparsity within the column vectors of a latent-task matrix to learn overlapping group structures among tasks, but they did not perform variable selection. Richard et al. (Richard et al., 2012, 2013) penalized both a trace norm and an ℓ1 norm to simultaneously perform variable selection and impose a low-rank structure. However, a trace norm penalty requires extensive assumptions to ensure a low-rank structure (Mishra et al., 2013) and a singular value decomposition at each iteration of the optimization. Han and Zhang (Han and Zhang, 2015) learned overlapping group structures among tasks by decomposing a coefficient matrix into component matrices, but they could not remove irrelevant variables. Wang et al. (Wang et al., 2016) factorized a coefficient matrix as the product of full-rank matrices to perform variable selection, but did not explicitly learn a group structure among tasks.
This paper proposes variable selection and task grouping for multi-task learning (VSTG-MTL), which simultaneously performs variable selection and learns an overlapping group structure among tasks based on the low-rank approach. Our main ideas are to express a coefficient matrix as the product of a variable-latent matrix and a latent-task matrix and to impose sparsities on these matrices. The sparsities between and within the rows of the variable-latent matrix help the model select relevant variables while remaining flexible. We also encourage sparsity within the columns of the latent-task matrix to learn an overlapping group structure among tasks, and note that learning the latent-task matrix is equivalent to learning task coefficient vectors in a feature space whose features can be highly correlated. This correlation is considered in the model by applying a k-support norm (McDonald et al., 2016). The resulting bi-convex problem is minimized by alternating optimization, where sub-problems are solved by applying the alternating direction method of multipliers (ADMM) and accelerated proximal gradient descent. We provide an upper bound on the excess risk of the proposed method to guarantee its performance. Experiments conducted on four synthetic datasets and five real-world datasets show that the proposed VSTG-MTL approach outperforms several benchmark MTL methods and that the k-support norm is effective in handling the possible correlation.
We summarize our contributions as follows:
To the best of our knowledge, this is the first work that simultaneously performs variable selection and learns an overlapping group structure among tasks using the low-rank approach.
We focus on the possible correlation arising from representation learning and apply a k-support norm to improve generalization performance.
We present an upper bound on the excess risk of the proposed method.
In this section, we explain multi-task learning, low-rank structures, and k-support norms.
Suppose that we are given d variables and T supervised learning tasks, where the t-th task has an input matrix X_t ∈ R^{n_t×d}, with n_t observations, and an output vector y_t ∈ R^{n_t}. Next, we focus on a linear relation between input and output,

g(E[y_t]) = X_t w_t,  t = 1, …, T,  (1)

where g is an identity function for a regression problem or a logit function for a binary classification problem, and w_t ∈ R^d represents the coefficient vector for the t-th task. Then, we can describe the matrix W = [w_1, …, w_T] ∈ R^{d×T} as a coefficient matrix.
We then impose a low-rank structure on the coefficient matrix W to share common information among related tasks (Ando and Zhang, 2005; Kumar and Daumé III, 2012). The low-rank structure assumes that the coefficient vectors w_1, …, w_T lie within a low-dimensional latent space and are expressed by linear combinations of latent bases. The coefficient matrix W can then be factorized as the product of two low-rank matrices, W = UV, where U ∈ R^{d×K} is the variable-latent matrix, V ∈ R^{K×T} is the latent-task matrix, and K ≤ min(d, T) is the number of latent bases. Then, we can express the coefficient w_{jt} of the j-th variable for the t-th task and the coefficient vector w_t for the t-th task as follows:

w_{jt} = u^j v_t,  (2)

w_t = U v_t = Σ_{l=1}^{K} v_{lt} u_l,  (3)

where u^j and u_l are the j-th row vector and l-th column vector of the variable-latent matrix U, respectively, and v_t is the t-th column vector of the latent-task matrix V. The above equations reveal the roles of the two matrices. The j-th row vector u^j can be regarded as the importance of the j-th variable, and the l-th column vector u_l as the l-th latent basis; the t-th column vector v_t can be regarded as the weighting vector for the t-th task.
The transpose of the variable-latent matrix, Uᵀ, represents a linear map from the variable space to a feature space: an input x ∈ R^d is mapped to the feature vector Uᵀx, and the t-th weighting vector v_t is the coefficient vector of the t-th task in this feature space. We note that unless the latent bases u_1, …, u_K are orthogonal, the features can be highly correlated.
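As a quick numerical illustration of this factorization and of the induced feature space (the dimensions below are arbitrary, not those of the paper), the equivalence between predicting with w_t = U v_t and predicting in the feature space Uᵀx can be checked in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 6, 2, 4                  # variables, latent bases, tasks (illustrative)
U = rng.normal(size=(d, K))        # variable-latent matrix: columns u_l are latent bases
V = rng.normal(size=(K, T))        # latent-task matrix: columns v_t are weighting vectors

W = U @ V                          # coefficient matrix, rank at most K
x = rng.normal(size=d)             # a single input observation
t = 1

# Predicting with the t-th coefficient vector w_t = U v_t equals mapping x
# into the K-dimensional feature space U^T x and applying v_t there.
pred_direct = x @ (U @ V[:, t])
pred_feature = (U.T @ x) @ V[:, t]
assert np.isclose(pred_direct, pred_feature)
```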
We commonly use an ℓ1 norm as a convex approximation to an ℓ0 norm in regularized regression. When features are correlated and form several groups, the ℓ1 norm penalty tends to select only a few features from the groups, whereas we could improve generalization performance by selecting all correlated features (Argyriou et al., 2012). In this case, a possible alternative to the ℓ1 norm is the k-support norm ‖·‖_k^sp, i.e., the tightest convex relaxation of k-sparsity within a Euclidean ball (Argyriou et al., 2012). The k-support norm is defined for each w ∈ R^d as follows:

‖w‖_k^sp = min { Σ_{g∈G_k} ‖z_g‖_2 : supp(z_g) ⊆ g, Σ_{g∈G_k} z_g = w },  (4)

where G_k denotes all subsets of {1, …, d} of cardinality at most k. Moreover, ‖w‖_1^sp = ‖w‖_1 and ‖w‖_d^sp = ‖w‖_2. Thus, the k-support norm is a trade-off between an ℓ1 norm and an ℓ2 norm. This property can be examined by inspecting the following proposition:
(Proposition 2.1 (Argyriou et al., 2012)) For every w ∈ R^d,

‖w‖_k^sp = ( Σ_{i=1}^{k−r−1} (|w|↓_i)² + (1/(r+1)) ( Σ_{i=k−r}^{d} |w|↓_i )² )^{1/2},

where |w|↓_i is the i-th largest element of the absolute values of w, letting |w|↓_0 denote +∞, and r is the unique integer in {0, …, k−1} satisfying

|w|↓_{k−r−1} > (1/(r+1)) Σ_{i=k−r}^{d} |w|↓_i ≥ |w|↓_{k−r}.
The above proposition shows that the k-support norm imposes both the uniform shrinkage of an ℓ2 norm on the largest components and the sparse shrinkage of an ℓ1 norm on the smallest components. Thus, in a similar way to the elastic net (Zou and Hastie, 2005), the k-support norm penalty encourages the selection of a few groups of correlated features and imposes the uniform shrinkage of the ℓ2 norm on the selected groups.
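The proposition translates directly into a short routine. The sketch below (function name ours) computes the k-support norm by searching for the unique r and verifies the two limiting cases ‖w‖_1^sp = ‖w‖_1 and ‖w‖_d^sp = ‖w‖_2:

```python
import numpy as np

def k_support_norm(w, k):
    """k-support norm of w via the decomposition in Proposition 2.1:
    an l2-style term on the k-r-1 largest magnitudes plus an l1-style
    term on the rest, for the unique r in {0, ..., k-1}."""
    z = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]   # |w|_(1) >= ... >= |w|_(d)
    for r in range(k):
        tail = z[k - r - 1:].sum() / (r + 1)                # averaged l1-style tail
        head = np.inf if k - r - 1 == 0 else z[k - r - 2]   # convention |w|_(0) = +inf
        if head > tail >= z[k - r - 1]:                     # sandwich condition on r
            return np.sqrt(np.sum(z[:k - r - 1] ** 2) + (r + 1) * tail ** 2)
    raise ValueError("no valid r found")

w = np.array([3.0, -1.0, 0.5])
assert np.isclose(k_support_norm(w, 1), np.abs(w).sum())    # k = 1: the l1 norm
assert np.isclose(k_support_norm(w, 3), np.linalg.norm(w))  # k = d: the l2 norm
```

For intermediate k the value interpolates between the two norms, which is the trade-off behavior described above.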
We aim to simultaneously learn an overlapping group structure among tasks and select relevant variables. To achieve these goals, we employ the low-rank assumption shown in Section 3.1 and impose sparsities on the variable-latent matrix U and the latent-task matrix V. Fig. 1 shows an example of VSTG-MTL, where the gray and white entries express non-zero and zero values, respectively. Each row of the variable-latent matrix U and of the coefficient matrix W represents a variable. Similarly, each column of the latent-task matrix V and of the coefficient matrix W represents a task, and each column of the variable-latent matrix U and each row of the latent-task matrix V represent a latent basis or feature. The variable-latent matrix U in Fig. 1(a) shows the sparsities between and within its rows, while the latent-task matrix V in Fig. 1(b) shows the sparsity within its columns. The coefficient matrix W in Fig. 1(c) expresses the product of these matrices.
The sparsity between the variable importance vectors u^1, …, u^d induces a model that can select relevant variables (Chen and Huang, 2012). If the j-th variable importance vector u^j is set to 0, then the corresponding variable is removed from the model in accordance with Eq. (2). For example, in Fig. 1(a), the 2nd, 6th, and 7th variables are excluded from the model, whereas the 1st, 3rd, 4th, and 5th variables are selected. Simultaneously, the sparsity within the variable importance vectors improves the flexibility of the model: a latent basis does not necessarily depend on all selected variables. Instead, it can have non-zero values on only a subset of the selected variables.
The sparsities within the weighting vectors v_1, …, v_T learn an overlapping group structure among tasks (Kumar and Daumé III, 2012). The group structure among tasks is decided by the sparsity patterns of the weighting vectors. Tasks with the same sparsity pattern belong to the same group, whereas those with orthogonal patterns belong to disjoint groups. Two groups are regarded as overlapping if their sparsity patterns are not orthogonal, i.e., they partially share latent bases. For example, in Fig. 1(b), the 1st and 2nd tasks belong to the same group and share the 2nd latent basis with the 3rd and 4th tasks. However, they do not share any latent basis with the 5th task. As mentioned in Sec. 2.1, learning the t-th weighting vector v_t is equivalent to learning the coefficient vector of the t-th task in the feature space induced by the transpose of the variable-latent matrix, Uᵀ. These features can be highly correlated unless the latent bases are orthogonal. Thus, instead of the ℓ1 norm, the k-support norm is appropriate for encouraging the sparsity within the weighting vectors. The k-support norm induces less sparse weighting vectors than the ℓ1 norm and thereby enhances the overlaps among the task groups.
We formulate the following problem:

min_{U,V} Σ_{t=1}^{T} (1/n_t) L(y_t, X_t U v_t)
s.t. ‖U‖_{2,1} ≤ α1, ‖U‖_1 ≤ α2, (‖v_t‖_k^sp)² ≤ α3 for t = 1, …, T,  (5)

where L is the empirical loss function, which becomes a squared loss for a regression problem and a logistic loss for a binary classification problem; ‖·‖_{2,1} is the ℓ2,1 norm (the sum of the ℓ2 norms of the rows); ‖·‖_1 is the ℓ1 norm; ‖·‖_k^sp is the k-support norm; and α1, α2, and α3 are the constraint parameters. The ℓ2,1 norm and the ℓ1 norm constraints encourage the sparsities between and within the variable importance vectors u^1, …, u^d, respectively. The squared k-support norm constraint encourages the sparsity within the weighting vectors v_1, …, v_T while considering possible correlations among the features.
The optimization problem (5) is bi-convex in the variable-latent matrix U and the latent-task matrix V: for a given U, it is convex in V, and vice versa. We transform the above constrained problem into the following regularized objective function:

min_{U,V} Σ_{t=1}^{T} (1/n_t) L(y_t, X_t U v_t) + μ1 ‖U‖_{2,1} + μ2 ‖U‖_1 + μ3 Σ_{t=1}^{T} (‖v_t‖_k^sp)²,  (6)

where μ1, μ2, and μ3 are the regularization parameters. Then, we apply alternating optimization to obtain a partial minimum of the objective function (6).
Initial estimates of the matrices U and V are crucial to generalization performance, considering that the objective function (6) is non-convex and alternating optimization converges only to a partial minimum. We first estimate an initial coefficient vector for each task independently by ridge regression:

w_t^{init} = argmin_{w} (1/n_t) ‖y_t − X_t w‖₂² + λ ‖w‖₂²,  t = 1, …, T.  (7)

We then define an initial coefficient matrix that stacks the ridge coefficients as column vectors:

W^{init} = [w_1^{init}, …, w_T^{init}].  (8)
Then, we compute the top-K left singular vectors P, the top-K right singular vectors Q, and the top-K singular value matrix S of the initial coefficient matrix W^{init}. The initial estimates U^{init} and V^{init} are given by P S^{1/2} and S^{1/2} Qᵀ, respectively.
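The initialization can be sketched as follows; the helper's name and the ridge penalty `lam` are our own illustrative choices, not fixed by the paper:

```python
import numpy as np

def initialize_uv(Xs, ys, K, lam=1.0):
    """Initialize U and V from per-task ridge coefficients (Eqs. (7)-(8))
    followed by a rank-K truncated SVD. `lam` is an assumed ridge penalty."""
    d = Xs[0].shape[1]
    W_init = np.column_stack([
        np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # ridge solution per task
        for X, y in zip(Xs, ys)
    ])
    P, s, Qt = np.linalg.svd(W_init, full_matrices=False)
    S_half = np.diag(np.sqrt(s[:K]))            # square roots of the top-K singular values
    return P[:, :K] @ S_half, S_half @ Qt[:K, :]  # U_init, V_init
```

The product of the two initial estimates is then the best rank-K approximation of the stacked ridge coefficient matrix.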
For a fixed latent-task matrix V, the objective function for the variable-latent matrix U becomes:

min_{U} Σ_{t=1}^{T} (1/n_t) L(y_t, X_t U v_t) + μ1 ‖U‖_{2,1} + μ2 ‖U‖_1.  (9)
This problem is solved by applying ADMM (Boyd et al., 2011). First, we introduce auxiliary variables A_1, A_2, and A_3 and reformulate the above problem as follows:

min_{U, A_1, A_2, A_3} Σ_{t=1}^{T} (1/n_t) L(y_t, X_t A_1 v_t) + μ2 ‖A_2‖_1 + μ1 ‖A_3‖_{2,1}
s.t. A_m = U, m = 1, 2, 3.  (10)
Let B_m be the scaled Lagrangian multiplier for the m-th auxiliary variable A_m, m = 1, 2, 3. Then, the variable-latent matrix U is updated as follows:

U^{(i+1)} = (1/3) Σ_{m=1}^{3} ( A_m^{(i)} + B_m^{(i)} ),  (11)

where i denotes the iteration and ρ is the ADMM penalty parameter used in the updates below.
The auxiliary variables A_1, A_2, and A_3 are updated by solving the following problem:

A_m^{(i+1)} = argmin_{A_m} f_m(A_m) + (ρ/2) ‖A_m − U^{(i+1)} + B_m^{(i)}‖_F²,  m = 1, 2, 3,  (12)

where f_1, f_2, and f_3 denote the loss term, the ℓ1 term, and the ℓ2,1 term in (10), respectively.
In detail, the first auxiliary variable A_1 is updated as follows:

A_1^{(i+1)} = argmin_{A_1} Σ_{t=1}^{T} (1/n_t) L(y_t, X_t A_1 v_t) + (ρ/2) ‖A_1 − U^{(i+1)} + B_1^{(i)}‖_F².  (13)
For regression problems with a squared loss, we can compute the closed-form updating equation by equating the gradient of the optimization problem (13) to zero as follows:

( Σ_{t=1}^{T} (2/n_t) (v_t v_tᵀ) ⊗ (X_tᵀ X_t) + ρ I ) vec(A_1^{(i+1)}) = vec( Σ_{t=1}^{T} (2/n_t) X_tᵀ y_t v_tᵀ + ρ (U^{(i+1)} − B_1^{(i)}) ),

where vec(·) is the vectorization operator and ⊗ is the Kronecker product. The above linear system of equations is solved by using the Cholesky or LU decomposition.
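The vectorization step rests on the identity vec(ABC) = (Cᵀ ⊗ A) vec(B) with column-major (Fortran-order) vec, which can be checked numerically (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 5, 4, 3
X = rng.normal(size=(n, d))     # one task's input matrix
A = rng.normal(size=(d, K))     # the matrix variable being solved for
v = rng.normal(size=(K, 1))     # that task's weighting vector

# vec(X A v) = (v^T kron X) vec(A), with column-major (Fortran-order) vec.
lhs = (X @ A @ v).ravel()
rhs = np.kron(v.T, X) @ A.ravel(order="F")
assert np.allclose(lhs, rhs)
```

Collecting such Kronecker terms over all tasks yields a positive definite linear system, which a Cholesky factorization (e.g. `scipy.linalg.cho_factor`/`cho_solve`) or an LU solver handles directly.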
For binary classification problems with a logistic loss, problem (13) is solved by using L-BFGS (Nocedal and Wright, 2006), where the gradient is given as follows:

∇ = Σ_{t=1}^{T} (1/n_t) X_tᵀ ( σ(X_t A_1 v_t) − y_t ) v_tᵀ + ρ ( A_1 − U^{(i+1)} + B_1^{(i)} ),

where σ(·) is the element-wise logistic function.
The other auxiliary variables A_2 and A_3 are updated as follows:

A_2^{(i+1)} = prox_{(μ2/ρ) ‖·‖_1} ( U^{(i+1)} − B_2^{(i)} ),  (14)

A_3^{(i+1)} = prox_{(μ1/ρ) ‖·‖_{2,1}} ( U^{(i+1)} − B_3^{(i)} ),  (15)

where prox_{(μ2/ρ) ‖·‖_1} and prox_{(μ1/ρ) ‖·‖_{2,1}} are the proximal operators of an ℓ1 norm and an ℓ2,1 norm, respectively, which are shown in (Parikh and Boyd, 2014).
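Both proximal operators have simple closed forms: entrywise soft-thresholding for the ℓ1 norm and row-wise group soft-thresholding for the sum of row ℓ2 norms. A sketch (function names ours):

```python
import numpy as np

def prox_l1(M, tau):
    """Proximal operator of tau * ||.||_1: entrywise soft-thresholding."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def prox_l21(M, tau):
    """Proximal operator of tau * (sum of row l2 norms): rows with norm
    below tau are zeroed, the remaining rows are shrunk toward zero."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0) * M

M = np.array([[3.0, -0.5],
              [0.1,  0.2]])
A2 = prox_l1(M, 1.0)    # zeroes the small entries, shrinks the large one
A3 = prox_l21(M, 1.0)   # zeroes the low-norm second row as a whole
```

The exact zeros produced by these operators are what make the auxiliary variables sparse at convergence.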
The scaled Lagrangian multipliers B_1, B_2, and B_3 are updated as follows:

B_m^{(i+1)} = B_m^{(i)} + A_m^{(i+1)} − U^{(i+1)},  m = 1, 2, 3.  (16)
Then, the primal and dual residuals r and s are given by

r^{(i+1)} = ( A_1^{(i+1)} − U^{(i+1)}, A_2^{(i+1)} − U^{(i+1)}, A_3^{(i+1)} − U^{(i+1)} ),  (17)

s^{(i+1)} = −ρ ( U^{(i+1)} − U^{(i)} ).  (18)
Note that the updating equation for the variable-latent matrix U in Eq. (11) does not guarantee sparsity. Thus, after convergence, the final variable-latent matrix U is given by the second auxiliary variable A_2, which is guaranteed to be sparse due to the proximal operator of the ℓ1 norm.
Algorithm 1. Alternating optimization for VSTG-MTL

Input: training data X_t and y_t for each task t; number of latent bases K; regularization parameters μ1, μ2, μ3; parameter k for the k-support norm; parameter ρ for ADMM.
Output: variable-latent matrix U; latent-task matrix V; coefficient matrix W = UV.

1. Estimate an initial coefficient matrix W^{init} by using Eqs. (7) and (8).
2. Compute the top-K left singular vectors P, the top-K right singular vectors Q, and the top-K singular value matrix S of W^{init}.
3. Set the initial estimates U = P S^{1/2} and V = S^{1/2} Qᵀ.
4. repeat (steps 5 to 13)
5.   repeat (steps 6 to 8)
6.     Update the variable-latent matrix U by using Eq. (11).
7.     Update the auxiliary variables A_1, A_2, and A_3 by solving Eqs. (13), (14), and (15).
8.     Update the scaled Lagrangian multipliers B_1, B_2, and B_3 by using Eq. (16).
9.   until the Frobenius norms of r and s in Eqs. (17) and (18) are sufficiently small
10.  Set the variable-latent matrix U equal to the second auxiliary variable A_2.
11.  for t = 1, …, T do
12.    Update the weighting vector v_t by solving Eq. (19).
13.  end for
14. until the objective function in Eq. (6) converges
For a fixed variable-latent matrix U, the problem for the latent-task matrix V is separable into its column vectors as follows:

min_{v_t} (1/n_t) L(y_t, X_t U v_t) + μ3 (‖v_t‖_k^sp)²,  t = 1, …, T.  (19)
The t-th weighting vector v_t is updated by solving a k-support norm regularized regression or logistic regression, where the input matrix becomes X_t U. The above problem is solved using accelerated proximal gradient descent (Beck and Teboulle, 2009), where the proximal operator for the squared k-support norm is given in (Argyriou et al., 2012).
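The accelerated scheme can be sketched generically as FISTA with a pluggable proximal operator. For brevity, the demo below plugs in the plain ℓ1 prox as a stand-in penalty; the proximal operator of the squared k-support norm from Argyriou et al. (2012) would slot into the same `prox` argument. All names and sizes are illustrative:

```python
import numpy as np

def fista(grad_f, prox, L, x0, n_iter=200):
    """Accelerated proximal gradient descent (FISTA, Beck & Teboulle, 2009).
    grad_f: gradient of the smooth loss; prox(v, t): proximal operator of
    the penalty with step t; L: Lipschitz constant of grad_f."""
    x, y, theta = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x_new = prox(y - grad_f(y) / L, 1.0 / L)                  # proximal gradient step
        theta_new = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        y = x_new + (theta - 1.0) / theta_new * (x_new - x)       # momentum extrapolation
        x, theta = x_new, theta_new
    return x

# Example: a regularized regression for one task's weighting vector,
# with input matrix Z standing in for X_t U.
rng = np.random.default_rng(2)
Z = rng.normal(size=(30, 5))
y = Z @ np.array([1.0, 0.0, 0.0, -2.0, 0.0]) + 0.01 * rng.normal(size=30)
mu = 0.1
grad = lambda v: Z.T @ (Z @ v - y)
prox = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - mu * t, 0.0)
L = np.linalg.norm(Z, 2) ** 2          # Lipschitz constant of the gradient
v_hat = fista(grad, prox, L, np.zeros(5))
```

Only the `prox` callable changes between penalties, which is why the same routine covers all T column-wise sub-problems.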
Algorithm 1 summarizes the procedure used to optimize the objective function (6). We set the ADMM parameter ρ to 2 and consider that the ADMM for updating U has converged when the Frobenius norms of the primal and dual residuals fall below a small tolerance.
In this section, we provide an upper bound on the excess error of the proposed method based on the previous work of Maurer et al. (Maurer et al., 2016).
Let μ_1, …, μ_T be probability measures on the input-output space R^d × R. Then, the input matrix X_t and the output vector y_t are drawn from the probability measure μ_t. For simplicity of exposition, we assume that every task has the same number of observations n.
The optimization problem (5) is reformulated as follows:

min_{U∈U, V∈V} (1/T) Σ_{t=1}^{T} (1/n) Σ_{i=1}^{n} ℓ(y_{ti}, x_{ti}ᵀ U v_t),  (20)

where U = {U : ‖U‖_{2,1} ≤ α1, ‖U‖_1 ≤ α2}, V = {V : (‖v_t‖_k^sp)² ≤ α3, t = 1, …, T}, and ℓ is the loss function scaled to lie in [0, 1]. We are interested in the expected error given by

E(U, V) = (1/T) Σ_{t=1}^{T} E_{(x,y)∼μ_t} [ ℓ(y, xᵀ U v_t) ].
Let Û and V̂ be an optimal solution of the optimization problem (20); then we have the following theorem.

(Upper bound on the excess error.) With probability at least 1 − δ in the draw of the training data, the excess error E(Û, V̂) − min_{U∈U, V∈V} E(U, V) is bounded by a quantity that depends on the constraint parameters, on the largest eigenvalue of the empirical covariance Σ̂_t = (1/n) X_tᵀ X_t of the input data for the t-th task, and on universal constants c_1 and c_2.
The above theorem shows the roles of the hyper-parameters. The constraint parameters α1, α2, and α3 should be small enough to produce a tighter bound. Equivalently, the corresponding regularization parameters μ1, μ2, and μ3 should be large enough to tighten the bound.
In this section, we present experiments conducted to evaluate the effectiveness of our proposed method. We compare our proposed method with the following benchmark methods:
LASSO method: This single-task learning method learns a sparse prediction model for each task independently.
L1+trace norm (Richard et al., 2012): This MTL method simultaneously achieves a low-rank structure and variable selection by penalizing both the nuclear norm and the ℓ1 norm of the coefficient matrix.
Multiplicative multi-task feature learning (MMTFL) (Wang et al., 2016): This MTL method factorizes a coefficient matrix as the product of full-rank matrices to select the relevant input variables. In this paper, we fix its two parameters in advance rather than tuning them.
Group overlap multi-task learning (GO-MTL) (Kumar and Daumé III, 2012): This MTL method factorizes a coefficient matrix as the product of low-rank matrices and learns an overlapping group structure among tasks by imposing sparsity on the weighting vectors.
The hyper-parameters of all methods are selected by minimizing the error from an inner 10-fold cross-validation step or a validation set. To reduce the computational complexity of the proposed method, we set the third regularization parameter μ3 to be equal to the first regularization parameter μ1. The regularization parameters of all methods are selected from a predefined search grid. For GO-MTL and VSTG-MTL, the number of latent bases K is selected from a small search grid. For the synthetic datasets, the value of k is set to 1 (VSTG-MTL k=1), which makes the squared k-support norm equivalent to the squared ℓ1 norm, or to 3 (VSTG-MTL k=3), to identify the effectiveness of the k-support norm for correlated features. For the real datasets, k is selected by validation (VSTG-MTL k=opt). The Matlab implementation of the proposed method is available at the following URL: https://github.com/JunYongJeong/VSTG-MTL.
The evaluation measures used are the root mean squared error (RMSE) for a regression problem and the error rate (ER) for a classification problem. For the synthetic datasets, we also compute the relative estimation error (REE), ‖W* − Ŵ‖_F / ‖W*‖_F, where W* is the true coefficient matrix and Ŵ is the estimated one. We repeat the experiments 10 times and compute the mean and standard deviation of each evaluation measure. We also perform a Wilcoxon signed-rank test, a non-parametric counterpart of the paired t-test, to find the statistically best model. The statistically best models are highlighted in bold in Tables 1 and 2.
We generate the following four synthetic datasets. We use 25-dimensional variables (d = 25) and 20 tasks (T = 20). For the t-th task, we generate 50 training observations and 100 test observations, drawing the inputs from a multivariate normal distribution and the outputs from the linear model with additive Gaussian noise. A true coefficient matrix W* has a low-rank structure and is constructed as W* = UV, where U ∈ R^{25×5} and V ∈ R^{5×20}. Each synthetic dataset differs in the structure of the two matrices U and V.
Syn1. Orthogonal features and disjoint task groups: For l = 1, …, 5, the latent basis u_l only has non-zero values on its own block of four consecutive components, the (4l−3)-th to the (4l)-th. The non-zero values are generated from a normal distribution with mean 1.0 and standard deviation 0.25. Similarly, each weighting vector v_t only has a non-zero value on a single component, so that four consecutive tasks share one latent basis. The non-zero values are generated from a uniform distribution on [1, 1.5]. Thus, the last five variables are irrelevant. The latent bases u_1, …, u_5, as well as the corresponding features, are orthogonal to each other. Each latent basis forms a disjoint group, where each group consists of four variables and four tasks.
Syn2. Orthogonal features and overlapping task groups: The variable-latent matrix U is generated by the same procedure as in Syn1. For the first sixteen tasks, the weighting vectors v_t have non-zero values on two consecutive components, and the last four weighting vectors share a component with the first group, so that adjacent task groups overlap in one latent basis. The non-zero values are generated from the same uniform distribution as in Syn1. Then, the last five variables are irrelevant and the features are still orthogonal. The tasks form overlapping groups, where each group consists of four variables and five tasks.
Syn3. Correlated features and disjoint task groups: For each l, the latent basis u_l only has non-zero values on six consecutive components, with consecutive bases overlapping in some components. The non-zero values are generated from the same normal distribution as in Syn1. The latent-task matrix V is generated by the same procedure as in Syn1. The last seven variables are irrelevant, and the latent bases are not orthogonal, resulting in correlation among the features. The tasks form disjoint groups, where each group consists of six variables and four tasks.
Syn4. Correlated features and overlapping task groups: The variable-latent matrix U is generated by the same procedure as in Syn3. The latent-task matrix V is generated by the same procedure as in Syn2. The last seven input variables are irrelevant. Thus, the features are correlated and the tasks form overlapping groups, where each group consists of six variables and five tasks.
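The generation of Syn1, for instance, can be sketched as follows (block positions and the noise model follow the description above; exact conventions in the original may differ slightly):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, K = 25, 20, 5                        # variables, tasks, latent bases

# Variable-latent matrix: basis l is supported on its own block of four
# consecutive variables; the last five variables stay irrelevant.
U = np.zeros((d, K))
for l in range(K):
    U[4 * l:4 * (l + 1), l] = rng.normal(1.0, 0.25, size=4)

# Latent-task matrix: each task weights a single basis, giving disjoint
# groups of four tasks each.
V = np.zeros((K, T))
for t in range(T):
    V[t // 4, t] = rng.uniform(1.0, 1.5)

W = U @ V                                  # true coefficient matrix
X = rng.normal(size=(50, d))               # 50 training observations per task
noise = rng.normal(size=(50, T))           # assumed unit-variance Gaussian noise
Y = X @ W + noise                          # one output column per task

assert np.all(W[20:, :] == 0)              # the last five variables are irrelevant
```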
Table 1. RMSE and REE (mean ± standard deviation) on the four synthetic datasets.

| Dataset | Measure | LASSO | L1+trace | MMTFL | GO-MTL | VSTG-MTL (k=1) | VSTG-MTL (k=3) |
|---|---|---|---|---|---|---|---|
| Syn1 | RMSE | 1.4625 ± 0.1349 | 1.1585 ± 0.0180 | 1.1384 ± 0.0257 | 1.0935 ± 0.0185 | 1.0456 ± 0.0228 | 1.0766 ± 0.0176 |
| | REE | 0.4155 ± 0.0595 | 0.2249 ± 0.0200 | 0.2089 ± 0.0169 | 0.1737 ± 0.0165 | 0.1226 ± 0.0149 | 0.1536 ± 0.0128 |
| Syn2 | RMSE | 1.6811 ± 0.1146 | 1.2639 ± 0.0418 | 1.2377 ± 0.0401 | 1.1509 ± 0.0267 | 1.1294 ± 0.0332 | 1.1314 ± 0.0275 |
| | REE | 0.3703 ± 0.0441 | 0.2040 ± 0.0169 | 0.1921 ± 0.0152 | 0.1488 ± 0.0122 | 0.1365 ± 0.0135 | 0.1376 ± 0.0091 |
| Syn3 | RMSE | 1.5303 ± 0.0483 | 1.2244 ± 0.0320 | 1.1797 ± 0.0287 | 1.1129 ± 0.0250 | 1.1086 ± 0.0192 | 1.1020 ± 0.0226 |
| | REE | 0.3801 ± 0.0328 | 0.2262 ± 0.0211 | 0.2001 ± 0.0168 | 0.1565 ± 0.0148 | 0.1538 ± 0.0123 | 0.1486 ± 0.0121 |
| Syn4 | RMSE | 1.7380 ± 0.1032 | 1.2673 ± 0.0312 | 1.2271 ± 0.0309 | 1.1278 ± 0.0235 | 1.1139 ± 0.0236 | 1.1085 ± 0.0214 |
| | REE | 0.2729 ± 0.0365 | 0.1419 ± 0.0125 | 0.1302 ± 0.0111 | 0.0945 ± 0.0087 | 0.0895 ± 0.0094 | 0.0868 ± 0.0083 |
Table 1 summarizes the experimental results for the four synthetic datasets in terms of RMSE and REE. For all the synthetic datasets, the MTL methods outperform the single-task learning method LASSO, and VSTG-MTL exhibits the best performance. Moreover, we can identify the effect of the k-support norm on correlated features. On Syn1 and Syn2, where the latent bases are orthogonal, VSTG-MTL k=1 outperforms VSTG-MTL k=3. This result indicates that the squared ℓ1 norm penalty performs better than the squared k-support norm penalty with k=3 when the features are orthogonal. In contrast, on Syn3 and Syn4, where the latent bases are not orthogonal, VSTG-MTL k=3 outperforms VSTG-MTL k=1. These results confirm our premise that the k-support norm penalty can improve generalization performance over the ℓ1 norm penalty when correlation exists.
The true coefficient matrix and the matrix estimated by the proposed method are shown in Fig. 2, where dark and white entries indicate large and zero values, respectively. VSTG-MTL can recover the group structure among tasks and exclude irrelevant variables.
We also evaluate the performance of VSTG-MTL on the following five real datasets. After splitting each dataset into a training set and a test set, we scale the continuous input variables of the training set into [−1, 1] by dividing them by the maximums of their absolute values. Then, we divide the continuous input variables in the test set by the same values as those used for the training set.
Table 2. RMSE and ER (mean ± standard deviation) on the five real datasets.

| Dataset | Measure | LASSO | L1+trace | MMTFL | GO-MTL | VSTG-MTL (k=opt) |
|---|---|---|---|---|---|---|
| School exam | RMSE | 12.0483 ± 0.1738 | 10.5041 ± 0.1432 | 10.1303 ± 0.1291 | 10.1924 ± 0.1331 | 9.9475 ± 0.1189 |
| Parkinson | RMSE | 2.9177 ± 0.0960 | 1.0481 ± 0.0243 | 1.1079 ± 0.0182 | 1.0231 ± 0.0285 | 1.0076 ± 0.0188 |
| Computer survey | RMSE | 2.3119 ± 0.3997 | 4.9493 ± 2.1592 | 1.7525 ± 0.1237 | 1.9067 ± 0.1864 | 1.6866 ± 0.1463 |
| MNIST | ER | 13.0200 ± 0.7084 | 17.9800 ± 1.7574 | 12.6000 ± 0.8641 | 12.8400 ± 1.2989 | 11.7000 ± 1.4461 |
| USPS | ER | 12.8800 ± 1.5061 | 16.0200 ± 1.2874 | 11.3600 ± 1.1462 | 12.9000 ± 1.0842 | 12.1800 ± 1.3547 |
School exam dataset (http://ttic.uchicago.edu/~argyriou/code/index.html) (Goldstein, 1991): This multi-task regression dataset is obtained from the Inner London Education Authority. It consists of examination records of 15,362 students from 139 secondary schools in London during a three-year period, 1985-1987. We have 139 tasks and 15,362 observations, where each task and each observation correspond to the prediction of the exam scores of a school and to a student, respectively. Each observation is represented by 3 continuous and 23 binary variables, including school- and student-specific attributes. We follow the split procedure shown in (Argyriou et al., 2008), resulting in a training set of 75% of the observations and a test set of the remaining 25%.
Parkinson's disease dataset (http://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring) (Tsanas et al., 2010): This multi-task regression dataset is obtained from biomedical voice measurements taken from 42 people with early-stage Parkinson's disease. We have 42 tasks and 5,875 observations, where each task and each observation correspond to the prediction of the symptom score (motor UPDRS) for a patient and to a record of a patient, respectively. Each observation is represented by 19 continuous variables, including age, gender, time interval, and voice measurements. We use 75% of the observations as a training set and the remaining 25% as a test set.
Computer survey dataset (https://github.com/probml/pmtk3/tree/master/data/conjointAnalysisComputerBuyers) (Lenk et al., 1996): This multi-output regression dataset is obtained from a survey of 190 people who rated their likelihood of purchasing each of 20 different personal computers. We have 190 tasks and 20 observations shared by all tasks, where each task and each observation correspond to the prediction of the integer ratings of a person on a scale of 0 to 10 and to a computer, respectively. Each observation is represented by 13 binary variables describing its specification. We insert an additional variable to account for the bias term and use 75% of the observations as a training set and the remaining 25% as a test set.
MNIST dataset (http://yann.lecun.com/exdb/mnist/) (Lecun et al., 1998): This multi-class classification dataset is obtained from 10 handwritten digits. We have 10 tasks, 60,000 training observations, and 10,000 test observations, where each task and each observation correspond to the prediction of a digit and to an image, respectively. Each observation is represented by 28 × 28 = 784 variables and reduced to 64 dimensions using PCA. Training, validation, and test sets are generated by randomly selecting 1,000 observations from the original training set and two sets of 500 observations from the original test set, similar to the procedure of Kang et al. (Kang et al., 2011a).
USPS dataset (http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html) (Hull, 1994): This multi-class classification dataset is also obtained from the 10 handwritten digits. We have 10 tasks, 7,291 training observations, and 2,007 test observations, where each task and each observation correspond to the prediction of a digit and to an image, respectively. Each observation is represented by 16 × 16 = 256 variables and reduced to 87 dimensions using PCA. We follow the same procedure as that used for the MNIST dataset to generate the training, validation, and test sets, resulting in 1,000, 500, and 500 observations, respectively.
Table 2 summarizes the results for the five real datasets over 10 repetitions. VSTG-MTL k=opt outperforms the benchmark methods on all datasets except USPS. This is especially true for the school exam dataset, the computer survey dataset, and the MNIST dataset, where the proposed method shows statistically significant improvements over the benchmark methods.
This paper proposes a novel algorithm, VSTG-MTL, which simultaneously performs variable selection and learns an overlapping group structure among tasks. VSTG-MTL factorizes a coefficient matrix into the product of low-rank matrices and imposes sparsities on them while considering possible correlations. The resulting bi-convex constrained problem is transformed into a regularized problem that is solved by alternating optimization. We provide an upper bound on the excess risk of the proposed method. The experimental results show that the proposed VSTG-MTL method outperforms the benchmark methods on synthetic as well as real datasets.
Journal of Machine Learning Research 6 (Dec. 2005), 1817–1853.
Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS'15), Guy Lebanon and S. V. N. Vishwanathan (Eds.), Vol. 38. PMLR, San Diego, California, USA, 65–73.
A Probabilistic Model for Dirty Multi-task Feature Selection. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), Francis Bach and David Blei (Eds.), Vol. 37. PMLR, Lille, France, 1073–1082.