One of the key challenges in computational biology is to build effective and efficient statistical models that learn from data to predict, analyze, and ultimately understand biological systems. Regardless of the problem at hand, however, be it the recognition of sequence signals such as splice sites, the prediction of protein-protein interactions, or the modeling of metabolic networks, we frequently have access to data sets for multiple organisms, tissues or cell-lines. Can we develop methods that optimally combine such multi-domain data?
While Transfer and Multitask Learning have attracted growing interest in the Machine Learning community in recent years, the field can be traced back to ideas from the mid-1990s. During that time, Thrun (1996) asked the provocative question “Is Learning the n-th Thing any Easier Than Learning the First?”, effectively laying the groundwork for the field of Transfer Learning. This work was motivated by findings in human psychology, where humans were found to be capable of learning from as few as a single example (Ahn and Brewer, 1993). The key insight was that humans build upon previously learned related concepts when learning new tasks, something Thrun (1996) calls lifelong learning. Around the same time, Caruana (1993, 1997) coined the term Multitask Learning. Rather than formalizing the idea of learning a sequence of tasks, Caruana proposed machinery to learn multiple related tasks in parallel.
While most of the early work on Multitask Learning was carried out in the context of learning a shared representation for neural networks (Caruana, 1997; Baxter, 2000), Evgeniou and Pontil (2004) adapted this concept in the context of kernel machines. At first, they assumed that the models of all tasks are close to each other (Evgeniou and Pontil, 2004) and later generalized their framework to non-uniform relations, allowing some tasks to be coupled more strongly than others (Evgeniou et al., 2005), according to some externally defined task structure. In recent years, there has been an increased interest in learning the structure potentially underlying the tasks. Ando and Zhang (2005) proposed a non-convex method based on Alternating Structure Optimization (ASO) for identifying the task structure. A convex relaxation of their approach was developed by Chen et al. (2009). Zhou et al. (2011) showed the equivalence between ASO and Clustered Multitask Learning (Jacob et al., 2008; Obozinski et al., 2010) and their convex relaxations. While the structure between tasks is defined by assigning tasks to clusters in the above approaches, Zhang and Yeung (2010) propose to learn a constrained task covariance matrix directly and show the relationship to Multitask Feature Learning (Argyriou et al., 2007, 2008a, 2008b; Liu et al., 2009). Here, the basic idea is to use a LASSO-inspired (Tibshirani, 1996) $\ell_{2,1}$-norm to identify a subset of features that is relevant to all tasks.
A challenge remains to find an adequate task similarity measure to compare the multiple domains and tasks. While existing parameter-free approaches such as Romera-Paredes et al. (2013) ignore biological background knowledge about the relatedness of the tasks, in this paper we present a parametric framework for regularization-based multitask learning that subsumes several approaches and automatically learns the task similarity from a set of candidate measures using $\ell_p$-norm Multiple Kernel Learning (MKL; see, for instance, Kloft et al., 2011). We thus provide a middle ground between assuming known task relationships and learning the entire task structure from scratch. We propose a general unifying framework of multitask multiple kernel learning (MT-MKL), including a thorough dualization analysis using Fenchel duality. Based on this analysis, we derive an efficient linear solver that combines our general framework with advances in linear SVM solvers, and we evaluate our approach on several datasets from Computational Biology.
This paper is based on preliminary material presented in several conference papers and workshop contributions (Widmer et al., 2010a, b, c, 2012; Widmer and Rätsch, 2012). This version additionally includes a unifying framework with a Fenchel duality analysis, more complete derivations and theoretical analysis, and a comparative study in multitask learning and genomics, in which we bring together genomic data for a wide range of biological organisms in a multitask learning setting. This dataset will be made freely available and may serve as a benchmark in the domain of multitask learning. Our experiments show that combining data via multitask learning can outperform learning each task independently. In particular, we find that it can be crucial to further refine a given task similarity measure using multitask multiple kernel learning.
The paper is structured as follows: In Section 2, we introduce a unifying view of multitask multiple kernel learning that covers a wide range of loss functions and regularizers. We give a general Fenchel dual representation and a representer theorem, and show that the formulation contains several existing formulations as special cases. In Section 3, we propose two optimization strategies: one that can be applied out of the box with any custom set of kernels and another that is specifically tailored to linear kernels as well as string kernels. Both algorithms were implemented in the Shogun machine learning toolbox. In Section 4, we present results of empirical experiments on artificial data as well as on a large biological multi-organism dataset curated for the purpose of this paper.
2 A Unifying View of Regularized Multi-Task Learning
In this section, we present a novel multi-task framework comprising many existing formulations, allowing us to view prevalent approaches from a unifying perspective, yielding new insights. We can also derive new learning machines as special instantiations of the general model. Our approach is embedded into the general framework of regularization-based supervised learning methods, where we minimize a functional
which consists of a loss-term measuring the training error and a regularizer penalizing the complexity of the model . The positive constant controls the trade-off of the criterion. The formulation can easily be generalized to the multi-task setting, where we are interested in obtaining several models parametrized by , where is the number of tasks.
A common approach is, for example, to set where is a task similarity matrix. In this paper, we develop a novel, general framework for multi-task learning of the form
where , . This approach has the additional flexibility of allowing us to incorporate multiple task similarity matrices into the learning problem, each equipped with a weighting factor. Instead of specifying the weighting factor a priori, we will automatically determine optimal weights from the data as part of the learning problem. We show that the above formulation comprises many existing lines of research in the area; this not only includes very recent lines but also seemingly different ones. The unifying framework allows us to analyze a large variety of MTL methods jointly, as exemplified by deriving a general dual representation of the criterion, without making assumptions on the employed norms and losses, besides the latter being convex. This delivers insights into connections between existing MTL formulations and, even more importantly, can be used to derive novel MTL formulations as special cases of our framework, as done in a later section of this paper.
2.1 Problem Setting and Notation
be a set of training pattern/label pairs. In multitask learning, each training example is associated with a task . Furthermore, we assume that for each the instances associated with task
are independently drawn from a probability distribution over a measurable space . We denote the set of indices of training points of the th task by . The goal is to find, for each task , a prediction function . In this paper, we consider composite functions of the form , , where , , are mappings into reproducing kernel Hilbert spaces , encoding multiple views of the multi-task learning problem via kernels , and ,
are parameter vectors of the prediction function.
For simplicity of notation, we concentrate on binary prediction, i.e., , and encode the loss of the prediction problem as a loss term , where is a loss function, assumed to be closed convex, lower bounded and finite at . To consider sophisticated couplings between the tasks, we introduce so-called task-similarity matrices with , and consider the regularizer (setting , ) with , where with adjoint and
denotes the trace class operator of the tensor Hilbert space. Note that also the direct sum is a Hilbert space, which will allow us to view as an element in a Hilbert space. The parameters , , are adaptive weights of the views, where denotes the -norm. Here denotes , .
Using the above specification of the regularizer and the loss term, we study the following unifying primal optimization problem.
Problem 1 (Primal problem).
Dual representations of optimization problems deliver insight into the problem, which can be used in practice to, for example, develop optimization algorithms (as done in Section 3 of this paper). In this section, we derive a dual representation of our unifying primal optimization problem, i.e., Problem 1. Our dualization approach is based on Fenchel-Rockafellar duality theory. The basic results of Fenchel-Rockafellar duality theory for Hilbert spaces are reviewed in Appendix A. We present two dual optimization problems: one that is dualized with respect to the model parameters only (i.e., considering the kernel weights as fixed) and one that completely removes the dependency on the kernel weights.
2.2.1 Computation of Conjugates and Adjoint Map
To apply Fenchel’s duality theorem, we need to compute the adjoint map of the linear map , , as well as the convex conjugates of and . See Appendix A for a review of the definitions of the convex conjugate and the adjoint map. First, we notice that, by the basic identities for convex conjugates of Prop. 10 in Appendix A, we have that
Next, we define by . Recall that the mapping between tasks and examples may be expressed in one of two ways. We may use index set to retrieve the indices of training examples associated with task . Alternatively, we may use task indicator to obtain the task index associated with th training example. Using this notation, we verify that, for any and , it holds
Thus, as defined above is indeed the adjoint map. Finally, we compute the conjugate of with respect to , where we consider as a constant (be reminded that are given). We write and note that, by Prop. 10,
The supremum is attained when so that in the optimum . Resubstitution into (1) gives , so that we have
2.2.2 Dual Optimization Problems
Problem 2 (Dual problem—partially dualized minimax formulation).
The above problem involves minimization with respect to the kernel weights (the primal variable) and maximization with respect to the dual variable. The optimization algorithm presented later in this paper is based on this minimax formulation. However, we may completely remove the dependency on the kernel weights, which yields further insight into the problem and will later be exploited for optimization, i.e., to control the duality gap of the computed solutions.
To remove the dependency on the kernel weights, we first note that Problem 2 is convex (even affine) in the kernel weights and concave in the dual variable and thus, by Sion’s minimax theorem, we may exchange the order of minimization and maximization:
where the last step is by the definition of the dual norm, i.e., and denotes the conjugated exponent. We thus have the following alternative dual problem.
Problem 3 (Dual problem—completely dualized formulation).
2.3 Representer Theorem
under the minimal assumption that is differentiable in . The above requirement can be thought of as an analog to the KKT condition stationarity in Lagrangian duality. Note that we can rewrite the above equation by inserting the definitions of and from the previous subsection; this gives, for any ,
which we may rewrite as
The above equation gives us a representer theorem (Argyriou et al., 2009) for the optimal , which we will exploit later in this paper for deriving an efficient optimization algorithm to solve Problem 1.
2.4 Relation to Multiple Kernel Learning
Evgeniou et al. (2005) introduce the notion of a multi-task kernel. We can generalize this framework by defining multiple multi-task kernels
To see this, first note that the term can alternatively be written as
so it follows
and thus Problem 2 becomes
2.5 Specific Instantiations of the Framework
In this section, we show that several regularization-based multi-task learning machines are subsumed by the generalized primal and dual formulations of Problems 1–2. As a first step, we will specialize our general framework to the hinge-loss, and show its primal and dual form. Based on this, we then instantiate our framework further to known methods in increasing complexity, starting with single-task learning (standard SVM) and working towards graph-regularized multitask learning and its relation to multitask kernels. Finally, we derive several novel methods from our general framework.
2.5.1 Hinge Loss
Many existing multi-task learning machines utilize the hinge loss $\ell(a) = \max(0, 1 - a)$. Employing the hinge loss in Problem 1 yields the loss term
Furthermore, as shown in Table 1, the conjugate of the hinge loss is $\ell^*(a) = a$ if $-1 \le a \le 0$ and $\infty$ elsewise, which is readily verified by elementary calculus. Thus, we have
provided that ; otherwise we have .
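The conjugate of the hinge loss can also be checked numerically. The following is a minimal sketch, assuming the standard hinge loss max(0, 1 - a) and the usual definition of the convex conjugate as a supremum; the helper names `hinge` and `conjugate_on_grid` and the grid bounds are illustrative choices, not part of the framework.

```python
import numpy as np

def hinge(a):
    """Standard hinge loss max(0, 1 - a), applied elementwise."""
    return np.maximum(0.0, 1.0 - a)

def conjugate_on_grid(loss, u, grid):
    """Approximate loss*(u) = sup_a (u * a - loss(a)) by a maximum over a finite grid."""
    return np.max(u * grid - loss(grid))

# Finite grid standing in for the real line (an approximation).
grid = np.linspace(-50.0, 50.0, 100001)

# Inside [-1, 0] the approximation should agree with the identity.
for u in (-1.0, -0.5, 0.0):
    print(u, conjugate_on_grid(hinge, u, grid))

# Outside [-1, 0] the supremum is unbounded; on a finite grid it merely
# grows with the grid width instead of being infinite.
print(conjugate_on_grid(hinge, 0.5, grid), conjugate_on_grid(hinge, -2.0, grid))
```

On a finite grid the unbounded cases can only show up as large values, which is why the check is a plausibility test rather than a proof.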
Hence, for the hinge loss, we obtain the following pair of primal and dual problems.
2.5.2 Single Task Learning
Starting from the simplest special case, we briefly show how single-task learning methods may be recovered from our general framework. By mapping well understood single-task methods onto our framework, we hope to achieve two things. First, we believe this will greatly facilitate understanding for the reader who is familiar with standard methods like the SVM. Second, we pave the way for applying efficient training algorithms developed in Section 3 to these single-task formulations, for example yielding a new linear solver for non-sparse Multiple Kernel Learning as a corollary.
Support Vector Machine
$\ell_p$-norm MKL (Kloft et al., 2011) is obtained as a special case of our framework. This case is of particular interest, as it allows us to obtain a linear solver for $\ell_p$-norm MKL as a corollary. By restricting the number of tasks to one (i.e., ), becomes and . Equation (9) reduces to:
2.5.3 Multitask Learning
Here, we first derive the primal and dual formulations of regularization-based multitask learning as a special case of our framework and then give an overview of existing variants that can be mapped onto this formulation as a precursor to novel instantiations in Section 2.6. In this setting, we deal with multiple tasks , but only a single kernel or task similarity measure (i.e., ). The primal thus becomes:
with corresponding dual
where the definition of is given in Equation 5. As we will see in the following, the above formulation captures several existing MTL approaches, which can be expressed by choosing different encodings for task similarity.
Frustratingly Easy Domain Adaptation
In a publication titled Frustratingly Easy Domain Adaptation, Daumé (2007) presents a simple, yet appealing special case of graph-regularized MTL. The setting comprises only two tasks (source task and target task) with a fixed task relationship (i.e., the influence of the two tasks on each other is not determined by their actual similarity). The idea is to assign a higher base similarity to pairs of examples from the same task than to pairs of examples from different tasks. This may be expressed by the following multitask kernel:
From the above, we can readily read off the corresponding (and compute ).
Given the above, we can express this special case in terms of Equation (11) and (12). With some elementary algebra, this method can be viewed as pulling weight vectors of source and target towards a common mean vector by means of a regularization term. If we generalize this idea to allow for multiple cluster centers, we arrive at task clustering, which is described in the following.
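To make the two-task construction concrete, the sketch below assumes the feature augmentation of Daumé (2007), in which every example is copied into a shared block and a task-specific block. Under that assumption, and with a linear base kernel, the induced kernel weights same-task pairs by 2 and cross-task pairs by 1. The function name `augment` and the random toy data are hypothetical.

```python
import numpy as np

def augment(x, task):
    """Map x to (shared copy, source copy, target copy); task 0 = source, 1 = target."""
    zeros = np.zeros_like(x)
    blocks = [x, x, zeros] if task == 0 else [x, zeros, x]
    return np.concatenate(blocks)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))          # toy examples (hypothetical data)
tasks = np.array([0, 0, 1, 1, 1])    # task indicator per example

# Kernel computed explicitly in the augmented feature space.
Phi = np.stack([augment(x, t) for x, t in zip(X, tasks)])
K_augmented = Phi @ Phi.T

# Same kernel written as (task similarity) * (base kernel).
same_task = (tasks[:, None] == tasks[None, :]).astype(float)
K_multitask = (1.0 + same_task) * (X @ X.T)

print(np.allclose(K_augmented, K_multitask))
```

The 2-by-2 task similarity matrix implied here has 2 on the diagonal and 1 off the diagonal, which is one way to read the fixed task relationship described above.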
Task Clustering Regularization
Here, tasks are grouped into clusters, whereas parameter vectors of tasks within each cluster are pulled towards the respective cluster center where is the number of tasks in cluster (Evgeniou et al., 2005). To understand what and correspond to in terms of Equations 11 and 12, consider the definition of the multitask regularizer for task clustering.
where is the number of clusters, encodes assignment of task to cluster , controls regularization of cluster centers and are given by
If every task is assigned to at least one cluster (i.e., ) is positive definite (Evgeniou et al., 2005) and we can express the above in terms of our primal formulation in Equation 11 as and the corresponding dual as , even for . We note that the formulation given in Section 2.5.3 may be expressed via task clustering regularization: by choosing only one cluster (i.e., ) and setting , and , we get , equating to the task similarity matrix from the previous section.
Graph-regularized MTL was established by Evgeniou et al. (2005) and constitutes one of the most influential MTL approaches to date. Their method is based on the following multi-task regularizer, which also forms one of the main inspirations for our framework:
where is a given graph adjacency matrix encoding the pairwise similarities of the tasks, denotes the corresponding graph Laplacian, where , and is a identity matrix. Note that the number of zero eigenvalues of the graph Laplacian corresponds to the number of connected components. We may view graph-regularized MTL as an instantiation of our general primal problem, Problem 1, where we have only one task similarity measure (i.e., ). As the graph Laplacian is not invertible in general, we use its pseudo-inverse to express the dual formulation of the above MTL regularizer.
where is the rank of , are the eigenvalues of and
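The role of the pseudo-inverse can be illustrated on a small graph. This is a sketch under the assumption that the Laplacian is the standard unnormalized one, L = D - A with D the diagonal degree matrix; the 4-task adjacency matrix and the tolerance are made up for illustration.

```python
import numpy as np

# Hypothetical adjacency matrix for 4 tasks forming two connected components.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # unnormalized graph Laplacian (assumed definition)

# Eigendecomposition of the symmetric Laplacian.
eigvals, eigvecs = np.linalg.eigh(L)
tol = 1e-10
nonzero = eigvals > tol

# Pseudo-inverse: invert only the non-zero eigenvalues.
V = eigvecs[:, nonzero]
L_pinv = (V / eigvals[nonzero]) @ V.T

# Zero eigenvalues count the connected components of the task graph.
num_components = int(np.sum(~nonzero))
print(num_components, np.allclose(L_pinv, np.linalg.pinv(L)))
```

Because each connected component contributes one zero eigenvalue, the Laplacian is singular for every graph, which is why the dual is expressed through the pseudo-inverse rather than an ordinary inverse.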
In contrast to graph-regularized MTL, where task relations are captured by an adjacency matrix or graph Laplacian as discussed in the previous paragraph, task relationships may directly be expressed in terms of a kernel on tasks . This relationship has been illuminated in Section 2.4, where we have seen that the kernel on tasks corresponds to in our dual MTL formulation. A formulation involving a combination of several MTL kernels with a fixed weighting was explored by Jacob and Vert (2008) in the context of Bioinformatics. In its most basic form, the authors considered a multitask kernel of the form
Furthermore, the authors considered a sum of different multi-task kernels, among them the corner cases (independent tasks) and the uniform kernel (uniformly related tasks). In general, their dual formulation is given by
The above is a very interesting special case and can easily be expressed within our general framework. For this, consider the dual formulation given in Equation 2.5.1 for and . In other words, the above also constitutes a form of multitask multiple kernel learning, however, without actually learning the kernel weights . Nevertheless, the choice and discussion of different multitask kernels in Jacob and Vert (2008) is of high relevance with respect to the family of methods explored in this work.
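A fixed-weight combination of multitask kernels can be sketched as follows. The sketch assumes that each multitask kernel is the product of a kernel on tasks and a base kernel on examples, and it uses the two corner cases mentioned above (identity for independent tasks, all-ones for uniformly related tasks). The function names, the toy data, and the weights 0.5/0.5 are hypothetical; in MT-MKL the weights would be learned rather than fixed.

```python
import numpy as np

def multitask_kernel(base_kernel, task_kernel, task_of):
    """Entry (i, j) equals task_kernel[t(i), t(j)] * base_kernel[i, j]."""
    task_of = np.asarray(task_of)
    return task_kernel[np.ix_(task_of, task_of)] * base_kernel

def combined_kernel(base_kernels, task_kernels, task_of, weights):
    """Weighted sum of multitask kernels with fixed (not learned) weights."""
    return sum(w * multitask_kernel(k, q, task_of)
               for w, k, q in zip(weights, base_kernels, task_kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # toy examples (hypothetical data)
task_of = [0, 0, 1, 1, 2, 2]         # task indicator per example
linear = X @ X.T                     # linear base kernel

independent = np.eye(3)              # corner case: tasks do not share information
uniform = np.ones((3, 3))            # corner case: all tasks equally related

K = combined_kernel([linear, linear], [independent, uniform],
                    task_of, weights=[0.5, 0.5])
print(K.shape)
```

With these weights, same-task pairs keep the full base kernel value while cross-task pairs are down-weighted by one half, so the weighting directly controls how much information is shared across tasks.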
2.6 Proposing Novel Instances of Multi-task Learning Machines
We now move ahead and derive novel instantiations from our general framework. Most importantly, we go beyond previous formulations by learning or refining task similarities from data using MKL as an engine.
2.6.1 Multi-graph MT-MKL
One of the most popular MTL approaches is graph-regularized MTL by Evgeniou and Pontil (2004). We have seen in Section 2.5.3 that such a graph is expressed as an adjacency matrix and may alternatively be expressed in terms of its graph Laplacian. Our extension readily deals with multiple graphs encoding task similarity, which is of interest in cases where, as in multiple kernel learning, we have access to alternative sources of task similarity and it is unclear which one is best suited. This concept gives rise to the multi-graph MTL regularizer
where denotes the graph Laplacian corresponding to . As before, we learn a weighting of the given graphs, therefore determining which measures are best suited to maximize prediction accuracy.
2.6.2 Hierarchical MT-MKL
Recall that in task clustering, parameter vectors of tasks within the same cluster are coupled (Equation 13). The strength of that coupling, however, has to be chosen in advance and remains fixed throughout the learning procedure. We extend the formulation of task clustering by introducing a weighting to task cluster and tuning this weighting using our framework. We decompose over clusters and arrive at the following MTL regularizer
where is given by
Note that, if not all tasks belong to the same cluster, will not be invertible. Therefore, we need to express the mapping onto the dual of our general framework from Equation 2.5.1 in terms of the pseudo-inverse (see Equation 19) of : .
An important special case of the above is given by a scenario where task relationships are described by a hierarchical structure (see Figure 1(b)), such as a tree or a directed acyclic graph. Assuming hierarchical relations between tasks is particularly relevant to Computational Biology where often different tasks correspond to different organisms. In this context, we expect that the longer the common evolutionary history between two organisms, the more beneficial it is to share information between these organisms in a MTL setting. The tasks correspond to the leaves or terminal nodes and each inner node defines a cluster , by grouping tasks of all terminal nodes that are descendants of the current node . As before, task clusters can be used in the way discussed in the previous section.
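The mapping from a hierarchy to task clusters can be sketched in a few lines. The example assumes the hierarchy is a tree stored as a dictionary of child lists; the function name `leaf_clusters` and the organism names are hypothetical and serve only as an illustration.

```python
def leaf_clusters(tree, root):
    """Return {inner_node: sorted leaves below it} for a tree given as child lists."""
    clusters = {}

    def leaves_below(node):
        children = tree.get(node, [])
        if not children:
            # Terminal node: corresponds to a single task.
            return [node]
        leaves = [leaf for child in children for leaf in leaves_below(child)]
        # Inner node: defines one cluster containing all descendant tasks.
        clusters[node] = sorted(leaves)
        return leaves

    leaves_below(root)
    return clusters

# Hypothetical taxonomy with three tasks (leaves) and two inner nodes.
taxonomy = {"root": ["mammals", "fly"], "mammals": ["human", "mouse"]}
clusters = leaf_clusters(taxonomy, "root")
print(clusters)
```

Each resulting cluster would then receive its own weight, so that the learned weights indicate at which depth of the hierarchy sharing information is most useful.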
2.6.3 Smooth hierarchical MT-MKL
Finally, we present a variant that may be regarded as a smooth version of the hierarchical MT-MKL approach presented above. Here, however, we require access to a given task similarity matrix, which is then subsequently transformed by squared exponentials with different length scales, for instance, . We use MT-MKL to learn a weighting of the kernels associated with the different length scales, which corresponds to finding the right level in the hierarchy to trade off information between tasks. As an example, consider Figure 1(c), where we show the original task similarity matrix and the transformed matrices at different length scales.
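A family of smoothed task kernels at several length scales might be generated as in the sketch below. It assumes that task relatedness is available as a symmetric dissimilarity matrix and that the transformation is exp(-d^2 / (2 s^2)) for length scale s; both the exact functional form and the example values are assumptions made for illustration.

```python
import numpy as np

def squared_exponential_family(distances, length_scales):
    """Return one transformed task matrix per length scale (assumed functional form)."""
    distances = np.asarray(distances, dtype=float)
    return [np.exp(-(distances ** 2) / (2.0 * s ** 2)) for s in length_scales]

# Hypothetical task dissimilarities: tasks 0 and 1 are close, task 2 is distant.
task_distances = np.array([[0.0, 1.0, 4.0],
                           [1.0, 0.0, 4.0],
                           [4.0, 4.0, 0.0]])

length_scales = [0.5, 2.0, 8.0]
task_kernels = squared_exponential_family(task_distances, length_scales)

for s, k in zip(length_scales, task_kernels):
    print(s, np.round(k, 3))
```

Small length scales yield nearly independent tasks, while large ones approach uniform coupling, so a learned weighting over the family interpolates between these extremes.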
In this section, we present efficient optimization algorithms to solve the primal and dual problems, i.e., Problems 1 and 2, respectively. We distinguish the cases of linear and non-linear kernel matrices. For non-linear kernels, we can simply use existing MKL implementations, while, for linear kernels, we develop a specifically tailored large-scale algorithm that allows us to train on problems with a large number of data points and dimensions, as demonstrated on several data sets. We can even employ this algorithm for non-linear kernels, if the kernel admits a sparse, efficiently computable feature representation. For example, this is the case for certain string kernels and polynomial kernels of degree 2 or 3. Our algorithms are embedded into the COFFIN framework (Sonnenburg and Franc, 2010) and integrated into the SHOGUN large-scale machine learning toolbox (Sonnenburg et al., 2010).
3.1 General Algorithms for Non-linear Kernels
A very convenient way to numerically solve the proposed framework is to simply exploit existing MKL implementations. To see this, recall from Section 2.4 that if we use the multi-task kernels as defined in (5) as the set of multiple kernels, the completely dualized MKL formulation (see Problem 3) is given by,
which is exactly the optimization problem of $\ell_p$-norm multiple kernel learning as described in Kloft et al. (2011). We may thus build on existing research in the field of MKL and use one of the prevalent efficient implementations to solve $\ell_p$-norm MKL. Most of the $\ell_p$-norm MKL solvers are specifically tailored to the hinge loss. Proven implementations are, for example, the interleaved optimization method of Kloft et al. (2011), which is directly integrated into the SVMLight module (Joachims, 1999) of the SHOGUN toolbox such that the kernel-weight update (the $\theta$-step) is performed after each decomposition step, i.e., after solving the small QP occurring in SVMLight, which allows very fast convergence (Sonnenburg et al., 2006).
Another efficient optimization approach is that of Vishwanathan et al. (2010), who optimize the completely dualized MKL formulation. This implementation works without a separate kernel-weight update, but each of its remaining update steps is more costly than in the case of vanilla (MT-)SVMs.
For an overview of MKL algorithms and their implementations, see the survey paper by Gönen and Alpaydin (2011).
3.2 A Large-scale Algorithm for Linear or String Kernels and Beyond
For specific kernels such as linear kernels and string kernels (and, more generally, any kernel admitting an efficient feature space representation), we can derive a specifically tailored large-scale algorithm. This requires considerably more work than the algorithm presented in the previous subsection.
From a top-level view, the upcoming algorithm is based on the core idea of alternating the following two steps:
the $\theta$-step, where the kernel weights are improved
the $W$-step, where the remaining primal variables are improved.
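The kernel-weight update in such an alternating scheme can be sketched as follows. The sketch assumes that the analytical update known from $\ell_p$-norm MKL (Kloft et al., 2011) carries over, i.e., that for fixed model parameters the optimal weights depend only on the per-kernel block norms. The function names and the example norms are hypothetical, and the complementary update of the model parameters is deliberately left out.

```python
import numpy as np

def theta_step(block_norms, p):
    """Closed-form kernel-weight update given per-kernel block norms (assumed form)."""
    block_norms = np.asarray(block_norms, dtype=float)
    numerator = block_norms ** (2.0 / (p + 1.0))
    denominator = np.sum(block_norms ** (2.0 * p / (p + 1.0))) ** (1.0 / p)
    return numerator / denominator

def regularizer(theta, block_norms):
    """Weighted regularizer 0.5 * sum_m norm_m^2 / theta_m."""
    return 0.5 * np.sum(np.asarray(block_norms, dtype=float) ** 2 / theta)

p = 2.0
norms = np.array([3.0, 1.0, 0.5])   # hypothetical block norms from a previous step
theta = theta_step(norms, p)

# The update lands on the boundary of the l_p unit ball ...
print(np.linalg.norm(theta, ord=p))

# ... and does not do worse than uniform weights of the same l_p norm.
uniform = np.ones_like(norms) / len(norms) ** (1.0 / p)
print(regularizer(theta, norms) <= regularizer(uniform, norms))
```

Kernels whose blocks carry larger norms receive larger weights, which is how the alternation shifts mass toward the more useful task similarity measures.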