Framework for Multi-task Multiple Kernel Learning and Applications in Genome Analysis

06/30/2015
by   Christian Widmer, et al.

We present a general regularization-based framework for Multi-task learning (MTL), in which the similarity between tasks can be learned or refined using ℓ_p-norm Multiple Kernel learning (MKL). Based on this very general formulation (including a general loss function), we derive the corresponding dual formulation using Fenchel duality applied to Hermitian matrices. We show that numerous established MTL methods can be derived as special cases from both the primal and the dual of our formulation. Furthermore, we derive a modern dual-coordinate descent optimization strategy for the hinge-loss variant of our formulation and provide convergence bounds for our algorithm. As a special case, we implement in C++ a fast LibLinear-style solver for ℓ_p-norm MKL. In the experimental section, we analyze various aspects of our algorithm, such as predictive performance and the ability to reconstruct task relationships, on biologically inspired synthetic data, where we have full control over the underlying ground truth. We also experiment on a new dataset from the domain of computational biology that we collected for the purpose of this paper. It concerns the prediction of transcription start sites (TSS) over nine organisms, which is a crucial task in gene finding. Our solvers, including all discussed special cases, are made available as open-source software as part of the SHOGUN machine learning toolbox (available at <http://shogun.ml>).


1 Introduction

One of the key challenges in computational biology is to build effective and efficient statistical models that learn from data to predict, analyze, and ultimately understand biological systems. Regardless of the problem at hand, however, be it the recognition of sequence signals such as splice sites, the prediction of protein-protein interactions, or the modeling of metabolic networks, we frequently have access to data sets for multiple organisms, tissues or cell-lines. Can we develop methods that optimally combine such multi-domain data?

While the field of Transfer or Multitask Learning has enjoyed growing interest in the Machine Learning community in recent years, it can be traced back to ideas from the mid 90's. During that time Thrun (1996) asked the provocative question "Is Learning the n-th Thing any Easier Than Learning the First?", effectively laying the ground for the field of Transfer Learning. This work was motivated by findings in human psychology, where humans were found to be capable of learning based on as few as a single example (Ahn and Brewer, 1993). The key insight was that humans build upon previously learned related concepts when learning new tasks, something Thrun (1996) calls lifelong learning. Around the same time, Caruana (1993, 1997) coined the term Multitask Learning. Rather than formalizing the idea of learning a sequence of tasks, this line of work proposes machinery to learn multiple related tasks in parallel.

While most of the early work on Multitask Learning was carried out in the context of learning a shared representation for neural networks (Caruana, 1997; Baxter, 2000), Evgeniou and Pontil (2004) adapted this concept to the context of kernel machines. At first, they assumed that the models of all tasks are close to each other (Evgeniou and Pontil, 2004) and later generalized their framework to non-uniform relations, allowing one to couple some tasks more strongly than others (Evgeniou et al., 2005), according to some externally defined task structure. In recent years, there has been an increased interest in learning the structure potentially underlying the tasks. Ando and Zhang (2005) proposed a non-convex method based on Alternating Structure Optimization (ASO) for identifying the task structure. A convex relaxation of their approach was developed by Chen et al. (2009). Zhou et al. (2011) showed the equivalence between ASO and Clustered Multitask Learning (Jacob et al., 2008; Obozinski et al., 2010) and their convex relaxations. While the structure between tasks is defined by assigning tasks to clusters in the above approaches, Zhang and Yeung (2010) propose to learn a constrained task covariance matrix directly and show the relationship to Multitask Feature Learning (Argyriou et al., 2007, 2008a, 2008b; Liu et al., 2009). Here, the basic idea is to use a LASSO-inspired (Tibshirani, 1996) sparsity-inducing norm to identify a subset of features that is relevant to all tasks.

A challenge remains to find an adequate task similarity measure to compare the multiple domains and tasks. While existing parameter-free approaches such as Romera-Paredes et al. (2013) ignore biological background knowledge about the relatedness of the tasks, in this paper we present a parametric framework for regularization-based multitask learning that subsumes several approaches and automatically learns the task similarity from a set of candidate measures using ℓ_p-norm Multiple Kernel learning (MKL); see, for instance, Kloft et al. (2011). We thus provide a middle ground between assuming known task relationships and learning the entire task structure from scratch. We propose a general unifying framework of MT-MKL, including a thorough dualization analysis using Fenchel duality, based on which we derive an efficient linear solver that combines our general framework with advances in linear SVM solvers, and we evaluate our approach on several datasets from Computational Biology.

This paper is based on material from several conference papers and workshop contributions (Widmer et al., 2010a, c, b, 2012; Widmer and Rätsch, 2012), which contained preliminary aspects of the framework presented here. This version additionally includes a unifying framework with a Fenchel duality analysis, more complete derivations and theoretical analysis, as well as a comparative study in multitask learning and genomics, for which we brought together genomic data for a wide range of biological organisms in a multitask learning setting. This dataset will be made freely available and may serve as a benchmark in the domain of multitask learning. Our experiments show that combining data via multitask learning can outperform learning each task independently. In particular, we find that it can be crucial to further refine a given task similarity measure using multitask multiple kernel learning.

The paper is structured as follows: In Section 2 we introduce a unifying view of multitask multiple kernel learning that covers a wide range of loss functions and regularizers. We give a general Fenchel dual representation and a representer theorem, and show that the formulation contains several existing formulations as special cases. In Section 3 we propose two optimization strategies: one that can be applied out of the box with any custom set of kernels, and another that is specifically tailored to linear kernels as well as string kernels. Both algorithms were implemented in the SHOGUN machine learning toolbox. In Section 4 we present results of empirical experiments on artificial data as well as on a large biological multi-organism dataset curated for the purpose of this paper.

2 A Unifying View of Regularized Multi-Task Learning

In this section, we present a novel multi-task framework that comprises many existing formulations, allowing us to view prevalent approaches from a unifying perspective and yielding new insights. We can also derive new learning machines as special instantiations of the general model. Our approach is embedded into the general framework of regularization-based supervised learning methods, where we minimize a functional

which consists of a loss-term measuring the training error and a regularizer penalizing the complexity of the model . The positive constant controls the trade-off of the criterion. The formulation can easily be generalized to the multi-task setting, where we are interested in obtaining several models parametrized by , where is the number of tasks.

In the past, this has been achieved by employing a joint regularization term that penalizes the discrepancy between the individual models (Evgeniou et al., 2005; Agarwal et al., 2010),

A common approach is, for example, to set where is a task similarity matrix. In this paper, we develop a novel, general framework for multi-task learning of the form

where , . This approach has the additional flexibility of allowing us to incorporate multiple task similarity matrices into the learning problem, each equipped with a weighting factor. Instead of specifying the weighting factor a priori, we will automatically determine optimal weights from the data as part of the learning problem. We show that the above formulation comprises many existing lines of research in the area; this not only includes very recent lines but also seemingly different ones. The unifying framework allows us to analyze a large variety of MTL methods jointly, as exemplified by deriving a general dual representation of the criterion, without making assumptions on the employed norms and losses, besides the latter being convex. This delivers insights into connections between existing MTL formulations and, even more importantly, can be used to derive novel MTL formulations as special cases of our framework, as done in a later section of this paper.

2.1 Problem Setting and Notation

Let

be a set of training pattern/label pairs. In multitask learning, each training example

is associated with a task . Furthermore, we assume that for each the instances associated with task

are independently drawn from a probability distribution

over a measurable space . We denote the set of indices of training points of the th task by . The goal is to find, for each task , a prediction function . In this paper, we consider composite functions of the form , , where , , are mappings into reproducing Hilbert spaces , encoding multiple views of the multi-task learning problem via kernels , and ,

are parameter vectors of the prediction function.

For simplicity of notation, we concentrate on binary prediction, i.e., , and encode the loss of the prediction problem as a loss term , where is a loss function, assumed to be closed convex, lower bounded and finite at . To consider sophisticated couplings between the tasks, we introduce so-called task-similarity matrices with ,   and consider the regularizer (setting , ) with , where with adjoint and

denotes the trace class operator of the tensor Hilbert space

. Note that also the direct sum is a Hilbert space, which will allow us to view as an element in a Hilbert space. The parameters , , are adaptive weights of the views, where denotes the -norm. Here denotes , .

Using the above specification of the regularizer and the loss term, we study the following unifying primal optimization problem.

Problem 1 (Primal problem).

Solve

where

2.2 Dualization

Dual representations of optimization problems deliver insight into the problem, which can be used in practice, for example, to develop optimization algorithms (as done in Section 3 of this paper). In this section, we derive a dual representation of our unifying primal optimization problem, i.e., Problem 1. Our dualization approach is based on Fenchel-Rockafellar duality theory, whose basic results for Hilbert spaces are reviewed in Appendix A. We present two dual optimization problems: one that is dualized with respect to only (i.e., considering as being fixed) and one that completely removes the dependency on .

2.2.1 Computation of Conjugates and Adjoint Map

To apply Fenchel’s duality theorem, we need to compute the adjoint map of the linear map , , as well as the convex conjugates of and . See Appendix A for a review of the definitions of the convex conjugate and the adjoint map. First, we notice that, by the basic identities for convex conjugates of Prop. 10 in Appendix A, we have that

Next, we define by  . Recall that the mapping between tasks and examples may be expressed in one of two ways. We may use index set to retrieve the indices of training examples associated with task . Alternatively, we may use task indicator to obtain the task index associated with th training example. Using this notation, we verify that, for any and , it holds

Thus, as defined above is indeed the adjoint map. Finally, we compute the conjugate of with respect to , where we consider as a constant (recall that are given). We write and note that, by Prop. 10,

Furthermore,

(1)

The supremum is attained when so that in the optimum . Resubstitution into (1) gives , so that we have

2.2.2 Dual Optimization Problems

We may now apply Fenchel’s duality theorem (cf. Theorem 9 in Appendix A), which gives the following dual MTL problem:

Problem 2 (Dual problem—partially dualized minimax formulation).

Solve

(2)

where

(3)

The above problem involves minimization with respect to (the primal variable) and maximization with respect to (the dual variable) . The optimization algorithm presented later in this paper is based on this minimax formulation. However, we may completely remove the dependency on , which sheds further insight into the problem and will later be exploited for optimization, i.e., to control the duality gap of the computed solutions.

To remove the dependency on , we first note that Problem 2 is convex (even affine) in and concave in and thus, by Sion’s minimax theorem, we may exchange the order of minimization and maximization:

where the last step is by the definition of the dual norm, i.e., and denotes the conjugated exponent. We thus have the following alternative dual problem.

Problem 3 (Dual problem—completely dualized formulation).

Solve

where

2.3 Representer Theorem

Fenchel's duality theorem (Theorem 9 in Appendix A) yields a useful optimality condition, that is,

under the minimal assumption that is differentiable in . The above requirement can be thought of as an analog of the KKT stationarity condition in Lagrangian duality. Note that we can rewrite the above equation by inserting the definitions of and from the previous subsection; this gives, for any ,

which we may rewrite as

(4)

The above equation gives us a representer theorem (Argyriou et al., 2009) for the optimal , which we will exploit later in this paper for deriving an efficient optimization algorithm to solve Problem 1.

2.4 Relation to Multiple Kernel Learning

Evgeniou et al. (2005) introduce the notion of a multi-task kernel. We can generalize this framework by defining multiple multi-task kernels

(5)

To see this, first note that the term can alternatively be written as

(6)

so it follows

and thus Problem 2 becomes

(7)

which is an -regularized multiple-kernel-learning problem over the kernels (Kloft et al., 2008b, 2011).
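Abstracting from the formulas above, the essential construction is a multi-task Gram matrix that combines a base kernel between examples with a coupling between tasks. The following is a minimal sketch of this idea; the helper name `multitask_gram` and the toy data are our own illustrations, and the exact form of the task-coupling matrix in the paper's formulation may differ:

```python
import numpy as np

def multitask_gram(K, tasks, B):
    """Multi-task Gram matrix: entry (i, j) couples examples i and j
    through their tasks via B[tasks[i], tasks[j]] * K[i, j]."""
    return B[np.ix_(tasks, tasks)] * K   # elementwise (Hadamard) product

# toy data: 4 examples, 2 tasks, linear base kernel
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
K = X @ X.T
tasks = np.array([0, 0, 1, 1])
B = np.array([[1.0, 0.5],
              [0.5, 1.0]])              # positive semi-definite task coupling
K_mt = multitask_gram(K, tasks, B)
```

By the Schur product theorem, the Hadamard product of two positive semi-definite matrices is again positive semi-definite, so `K_mt` is a valid kernel matrix.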

loss dual loss
hinge loss
logistic loss
Table 1: Examples of loss functions and corresponding conjugate functions. See Appendix B.

2.5 Specific Instantiations of the Framework

In this section, we show that several regularization-based multi-task learning machines are subsumed by the generalized primal and dual formulations of Problems 1 and 2. As a first step, we specialize our general framework to the hinge loss and show its primal and dual form. Based on this, we then instantiate our framework further to known methods in order of increasing complexity, starting with single-task learning (the standard SVM) and working towards graph-regularized multitask learning and its relation to multitask kernels. Finally, we derive several novel methods from our general framework.

2.5.1 Hinge Loss

Many existing multi-task learning machines utilize the hinge loss . Employing the hinge loss in Problem 1 yields the loss term

Furthermore, as shown in Table 1, the conjugate of the hinge loss is , if , and otherwise, which is readily verified by elementary calculus. Thus, we have

(8)

provided that ; otherwise we have . Hence, for the hinge-loss, we obtain the following pair of primal and dual problem.

Primal:

(9)

Dual:

(10)
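The stated conjugate of the hinge loss can be checked numerically: approximating the supremum defining the Fenchel conjugate of ℓ(t) = max(0, 1 − t) on a large grid yields the identity on the interval [−1, 0] and diverges outside it. A small sketch (the function names are ours):

```python
import numpy as np

def hinge(t):
    # hinge loss l(t) = max(0, 1 - t)
    return np.maximum(0.0, 1.0 - t)

def conjugate_numeric(u, grid):
    # Fenchel conjugate l*(u) = sup_t (u * t - l(t)), approximated on a grid
    return float(np.max(u * grid - hinge(grid)))

grid = np.linspace(-50.0, 50.0, 200001)
# inside the domain, l*(u) should equal u itself
vals = {u: conjugate_numeric(u, grid) for u in (-1.0, -0.5, -0.25, 0.0)}
```

Outside [−1, 0] the supremum is unbounded, which the grid approximation reflects by growing with the grid's extent.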

2.5.2 Single Task Learning

Starting from the simplest special case, we briefly show how single-task learning methods may be recovered from our general framework. By mapping well understood single-task methods onto our framework, we hope to achieve two things. First, we believe this will greatly facilitate understanding for the reader who is familiar with standard methods like the SVM. Second, we pave the way for applying efficient training algorithms developed in Section 3 to these single-task formulations, for example yielding a new linear solver for non-sparse Multiple Kernel Learning as a corollary.

Support Vector Machine

In the case of the single-task (, ), single-kernel SVM (), the primal from Equation (9) and the dual from Equation (10) can be greatly simplified:

which corresponds to the well-established linear SVM formulation (without bias). Similarly, the dual is readily obtained from Equation (10) and is given by

MKL

ℓ_p-norm MKL (Kloft et al., 2011) is obtained as a special case of our framework. This case is of particular interest, as it allows us to obtain a linear solver for ℓ_p-norm MKL as a corollary. By restricting the number of tasks to one (i.e., ), becomes and . Equation (9) reduces to:

In agreement with Kloft et al. (2009a), we recover the dual formulation from Equation (10).

2.5.3 Multitask Learning

Here, we first derive the primal and dual formulations of regularization-based multitask learning as a special case of our framework and then give an overview of existing variants that can be mapped onto this formulation as a precursor to novel instantiations in Section 2.6. In this setting, we deal with multiple tasks , but only a single kernel or task similarity measure (i.e., ). The primal thus becomes:

(11)

with corresponding dual

(12)

where the definition of is given in Equation 5. As we will see in the following, the above formulation captures several existing MTL approaches, which can be expressed by choosing different encodings for task similarity.

Frustratingly Easy Domain Adaptation

In a publication titled Frustratingly Easy Domain Adaptation, Daumé (2007) presents a simple, yet appealing special case of graph-regularized MTL, considering the setting of only two tasks (a source task and a target task) with a fixed task relationship (i.e., the influence of the two tasks on each other was not determined by their actual similarity). The frustratingly easy idea is to assign a higher base-similarity to pairs of examples from the same task than to pairs of examples from different tasks. This may be expressed by the following multitask kernel:

From the above, we can readily read off the corresponding (and compute ).

Given the above, we can express this special case in terms of Equations (11) and (12). With some elementary algebra, this method can be viewed as pulling the weight vectors of source and target towards a common mean vector by means of a regularization term. If we generalize this idea to allow for multiple cluster centers, we arrive at task clustering, which is described in the following.
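One way to make the "higher same-task similarity" concrete is Daumé's feature-augmentation view: each example is copied into a shared block and a task-specific block, so that same-task pairs receive twice the base similarity of cross-task pairs. A minimal sketch (the helper `augment` and the block layout are illustrative):

```python
import numpy as np

def augment(x, task):
    """Feature augmentation: a shared copy of x plus one task-specific copy."""
    d = len(x)
    z = np.zeros(3 * d)
    z[:d] = x                   # shared block (seen by both tasks)
    if task == 0:
        z[d:2 * d] = x          # source-only block
    else:
        z[2 * d:] = x           # target-only block
    return z

rng = np.random.default_rng(1)
x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
base = float(x1 @ x2)
same = float(augment(x1, 0) @ augment(x2, 0))   # same-task pair: 2 * base
diff = float(augment(x1, 0) @ augment(x2, 1))   # cross-task pair: base
```

The induced kernel thus doubles the similarity of same-task pairs relative to cross-task pairs, matching the multitask-kernel view above.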

Task Clustering Regularization

Here, tasks are grouped into clusters, and the parameter vectors of tasks within each cluster are pulled towards the respective cluster center , where is the number of tasks in cluster (Evgeniou et al., 2005). To understand what and correspond to in terms of Equations (11) and (12), consider the definition of the multitask regularizer for task clustering.

(13)
(14)
(15)

where is the number of clusters, encodes assignment of task to cluster , controls regularization of cluster centers and are given by

If every task is assigned to at least one cluster (i.e., ), is positive definite (Evgeniou et al., 2005), and we can express the above in terms of our primal formulation in Equation (11) as and the corresponding dual as , even for . We note that the formulation given in Section 2.5.3 may be expressed via task clustering regularization: by choosing only one cluster (i.e., ) and setting , and , we get , which equals the task similarity matrix from the previous section.
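The clustering penalty, i.e., the sum of squared distances of task parameter vectors to their cluster centers, can be written as a quadratic form in a task-coupling matrix. A numerical sketch of this identity, using our own notation M = I − P with P the within-cluster averaging operator:

```python
import numpy as np

T, d = 5, 3
clusters = np.array([0, 0, 1, 1, 1])      # task -> cluster assignment
rng = np.random.default_rng(2)
W = rng.standard_normal((T, d))           # row t = parameter vector of task t

# direct computation: squared distance of each task vector to its cluster center
penalty = 0.0
for c in np.unique(clusters):
    idx = np.where(clusters == c)[0]
    center = W[idx].mean(axis=0)
    penalty += ((W[idx] - center) ** 2).sum()

# the same penalty as a quadratic form with M = I - P,
# where P averages parameter vectors within each cluster
P = np.zeros((T, T))
for c in np.unique(clusters):
    idx = np.where(clusters == c)[0]
    P[np.ix_(idx, idx)] = 1.0 / len(idx)
quad = float(np.trace(W.T @ (np.eye(T) - P) @ W))
```

Since I − P is an orthogonal projection, both expressions agree exactly; the coupling matrix plays the role of the task similarity matrix in the primal formulation.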

Graph-regularized MTL

Graph-regularized MTL was established by Evgeniou et al. (2005) and constitutes one of the most influential MTL approaches to date. Their method is based on the following multi-task regularizer, which also forms one of the main inspirations for our framework:

(16)
(17)
(18)

where is a given graph adjacency matrix encoding the pairwise similarities of the tasks, denotes the corresponding graph Laplacian, where , and is a identity matrix. Note that the number of zero eigenvalues of the graph Laplacian corresponds to the number of connected components. We may view graph-regularized MTL as an instantiation of our general primal problem, Problem 1, where we have only one task similarity measure (i.e., ). As the graph Laplacian is not invertible in general, we use its pseudo-inverse to express the dual formulation of the above MTL regularizer.

(19)

where is the rank of , are the eigenvalues of and

is the orthogonal matrix of eigenvectors.
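The objects used above (the graph Laplacian, its zero eigenvalues, and the pseudo-inverse entering the dual) are easy to compute; a small sketch for a toy task graph with two connected components:

```python
import numpy as np

# adjacency matrix of a 4-task graph with two connected components {0,1}, {2,3}
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A            # graph Laplacian L = D - A

eigvals = np.linalg.eigvalsh(L)
n_zero = int(np.sum(np.abs(eigvals) < 1e-10))   # = number of connected components
L_pinv = np.linalg.pinv(L)                      # pseudo-inverse used in the dual
```

As stated in the text, the multiplicity of the zero eigenvalue equals the number of connected components (here two), which is exactly why the pseudo-inverse, rather than the inverse, must be used.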

Multi-task Kernels

In contrast to graph-regularized MTL, where task relations are captured by an adjacency matrix or graph Laplacian as discussed in the previous paragraph, task relationships may directly be expressed in terms of a kernel on tasks . This relationship has been illuminated in Section 2.4, where we have seen that the kernel on tasks corresponds to in our dual MTL formulation. A formulation involving a combination of several MTL kernels with a fix weighting was explored by Jacob and Vert (2008) in the context of Bioinformatics. In its most basic form, the authors considered a multitask kernel of the form

Furthermore, the authors considered a sum of different multi-task kernels, among them the corner cases (independent tasks) and the uniform kernel (uniformly related tasks). In general, their dual formulation is given by

The above is a very interesting special case and can easily be expressed within our general framework. For this, consider the dual formulation given in Equation (10) for and . In other words, the above also constitutes a form of multitask multiple kernel learning, however without actually learning the kernel weights . Nevertheless, the choice and discussion of different multitask kernels in Jacob and Vert (2008) is of high relevance to the family of methods explored in this work.

2.6 Proposing Novel Instances of Multi-task Learning Machines

We now move ahead and derive novel instantiations from our general framework. Most importantly, we go beyond previous formulations by learning or refining task similarities from data using MKL as an engine.

(a) Multigraph MT-MKL
(b) Hierarchical MT-MKL
(c) Smooth MT-MKL
Figure 1: Learning additive transformations of task similarities: (a) Multigraph MT-MKL where one combines similarities from multiple independent graphs (which includes the approaches proposed in Widmer et al. (2010c); Jacob and Vert (2008)); (b) Hierarchical MT-MKL where one uses a tree to generate specific similarity matrices (as proposed in Widmer et al. (2010a, c); Görnitz et al. (2011); Widmer et al. (2012)); and (c) Smooth MT-MKL where one uses multiple transformations of an existing similarity matrix for linear combination.

2.6.1 Multi-graph MT-MKL

One of the most popular MTL approaches is graph-regularized MTL by Evgeniou and Pontil (2004). We have seen in Section 2.5.3 that such a graph is expressed as an adjacency matrix and may alternatively be expressed in terms of its graph Laplacian . Our extension readily deals with multiple graphs encoding task similarity , which is of interest in cases where, as in multiple kernel learning, we have access to alternative sources of task similarity and it is unclear which one is best suited. This concept gives rise to the multi-graph MTL regularizer

where denotes the graph Laplacian corresponding to . As before, we learn a weighting of the given graphs, therefore determining which measures are best suited to maximize prediction accuracy.
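The main ingredient of the multi-graph regularizer can be sketched as a nonnegative combination of graph Laplacians, one per candidate task-similarity graph, which stays symmetric positive semi-definite; the toy graphs and weights below are illustrative:

```python
import numpy as np

def laplacian(A):
    # graph Laplacian from an adjacency matrix
    return np.diag(A.sum(axis=1)) - A

# two candidate task-similarity graphs over three tasks
A1 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)  # edge 0-1
A2 = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)  # star at task 2
beta = np.array([0.7, 0.3])                                    # nonnegative weights
L_combined = beta[0] * laplacian(A1) + beta[1] * laplacian(A2)
```

In the full method, the weights are not fixed as here but learned by MKL, which selects the graphs that best explain the data.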

2.6.2 Hierarchical MT-MKL

Recall that in task clustering, the parameter vectors of tasks within the same cluster are coupled (Equation 13). The strength of that coupling, however, has to be chosen in advance and remains fixed throughout the learning procedure. We extend the formulation of task clustering by introducing a weighting for each task cluster and tuning this weighting using our framework. We decompose over clusters and arrive at the following MTL regularizer

(20)
(21)

where is given by

Note that, if not all tasks belong to the same cluster, will not be invertible. Therefore, we need to express the mapping onto the dual of our general framework from Equation (10) in terms of the pseudo-inverse (see Equation 19) of : .

An important special case of the above is given by a scenario where task relationships are described by a hierarchical structure (see Figure 1(b)), such as a tree or a directed acyclic graph. Assuming hierarchical relations between tasks is particularly relevant to Computational Biology, where different tasks often correspond to different organisms. In this context, we expect that the longer the common evolutionary history between two organisms, the more beneficial it is to share information between them in an MTL setting. The tasks correspond to the leaves or terminal nodes, and each inner node defines a cluster by grouping the tasks of all terminal nodes that are descendants of the current node . As before, these task clusters can be used in the way discussed in the previous section.
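The cluster-generating step can be sketched directly: each inner node of the tree yields a task-similarity matrix that couples exactly the tasks below it. The toy tree and names below are our own:

```python
import numpy as np

# toy binary tree over 4 task leaves: root -> (a, b), a -> {0, 1}, b -> {2, 3};
# each inner node defines a cluster containing the leaves below it
inner_nodes = {"a": [0, 1], "b": [2, 3], "root": [0, 1, 2, 3]}

T = 4
similarity = {}
for node, leaves in inner_nodes.items():
    M = np.zeros((T, T))
    M[np.ix_(leaves, leaves)] = 1.0   # couple tasks that share this ancestor
    similarity[node] = M
```

MT-MKL then learns one weight per inner node, effectively choosing the depth in the hierarchy at which information is shared.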

2.6.3 Smooth hierarchical MT-MKL

Finally, we present a variant that may be regarded as a smooth version of the hierarchical MT-MKL approach presented above. Here, however, we require access to a given task similarity matrix, which is then transformed by squared exponentials with different length scales, for instance, . We use MT-MKL to learn a weighting of the kernels associated with the different length scales, which corresponds to finding the right level in the hierarchy at which to trade off information between tasks. As an example, consider Figure 1(c), where we show the original task similarity matrix and the transformed matrices at different length scales.
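A sketch of the length-scale construction, assuming the transformation acts entrywise on a task dissimilarity matrix D via exp(−D²/(2ℓ²)); the matrix D and the length scales below are illustrative:

```python
import numpy as np

# a given task dissimilarity matrix D (e.g., evolutionary distances)
D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 3.0],
              [4.0, 3.0, 0.0]])

length_scales = [0.5, 2.0, 8.0]
# one candidate task-similarity matrix per length scale
kernels = [np.exp(-D ** 2 / (2.0 * ell ** 2)) for ell in length_scales]
```

Small length scales couple only nearly identical tasks, while large length scales couple all tasks almost uniformly; learning the mixing weights interpolates between these regimes.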

3 Algorithms

In this section, we present efficient optimization algorithms to solve the primal and dual problems, i.e., Problems 1 and 2, respectively. We distinguish the cases of linear and non-linear kernel matrices. For non-linear kernels, we can simply use existing MKL implementations, while, for linear kernels, we develop a specifically tailored large-scale algorithm that allows us to train on problems with a large number of data points and dimensions, as demonstrated on several data sets. We can even employ this algorithm for non-linear kernels, if the kernel admits a sparse, efficiently computable feature representation. For example, this is the case for certain string kernels and polynomial kernels of degree 2 or 3. Our algorithms are embedded into the COFFIN framework (Sonnenburg and Franc, 2010) and integrated into the SHOGUN large-scale machine learning toolbox (Sonnenburg et al., 2010).

3.1 General Algorithms for Non-linear Kernels

A very convenient way to numerically solve the proposed framework is to simply exploit existing MKL implementations. To see this, recall from Section 2.4 that if we use the multi-task kernels as defined in (5) as the set of multiple kernels, the completely dualized MKL formulation (see Problem 3) is given by,

An efficient optimization approach is that of Vishwanathan et al. (2010), who optimize the completely dualized MKL formulation. This implementation comes without a separate -step, but the computations within the -steps are more costly than in the case of vanilla (MT-)SVMs.

Further, combining the partially dualized formulation in Problem 2 with the definition of multi-task kernels from (5), we arrive at an equivalent problem to (7), that is,

which is exactly the optimization problem of ℓ_p-norm multiple kernel learning as described in Kloft et al. (2011). We may thus build on existing research in the field of MKL and use one of the prevalent efficient implementations to solve ℓ_p-norm MKL. Most of the ℓ_p-norm MKL solvers are specifically tailored to the hinge loss. A proven implementation is, for example, the interleaved optimization method of Kloft et al. (2011), which is directly integrated into the SVMLight module (Joachims, 1999) of the SHOGUN toolbox, such that the -step is performed after each decomposition step, i.e., after solving the small QP occurring in SVMLight; this allows very fast convergence (Sonnenburg et al., 2006).

For an overview of MKL algorithms and their implementations, see the survey paper by Gönen and Alpaydin (2011).

3.2 A Large-scale Algorithm for Linear or String Kernels and Beyond

For specific kernels such as linear kernels and string kernels—and, more generally, any kernel admitting an efficient feature space representation—, we can derive a specifically tailored large-scale algorithm. This requires considerably more work than the algorithm presented in the previous subsection.

3.2.1 Overview

From a top-level view, the upcoming algorithm builds on the core idea of alternating between the following two steps:

  1. the step, where the kernel weights are improved

  2. the step, where the remaining primal variables are improved.
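For the first of these steps, a closed-form update is available in the hinge-loss ℓ_p-norm MKL setting; the sketch below follows the analytic solution given by Kloft et al. (2011), normalizing the kernel weights to unit ℓ_p-norm (the function name and input norms are illustrative):

```python
import numpy as np

def beta_step(w_norms, p):
    """Closed-form kernel-weight update for l_p-norm MKL
    (following Kloft et al., 2011): beta_m is proportional to
    ||w_m||^(2/(p+1)), normalized to unit l_p-norm."""
    w_norms = np.asarray(w_norms, dtype=float)
    num = w_norms ** (2.0 / (p + 1.0))
    denom = (w_norms ** (2.0 * p / (p + 1.0))).sum() ** (1.0 / p)
    return num / denom

beta = beta_step([1.0, 2.0, 0.5], p=2.0)   # larger ||w_m|| -> larger weight
```

In the alternating scheme, this update is interleaved with improving the remaining primal variables for the current fixed weights.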

1:  input: data