# Efficient Output Kernel Learning for Multiple Tasks

The paradigm of multi-task learning is that one can achieve better generalization by learning tasks jointly and thus exploiting the similarity between the tasks rather than learning them independently of each other. While previously the relationship between tasks had to be user-defined in the form of an output kernel, recent approaches jointly learn the tasks and the output kernel. As the output kernel is a positive semidefinite matrix, the resulting optimization problems are not scalable in the number of tasks as an eigendecomposition is required in each step. Using the theory of positive semidefinite kernels we show in this paper that for a certain class of regularizers on the output kernel, the constraint of being positive semidefinite can be dropped as it is automatically satisfied for the relaxed problem. This leads to an unconstrained dual problem which can be solved efficiently. Experiments on several multi-task and multi-class data sets illustrate the efficacy of our approach in terms of computational efficiency as well as generalization performance.


## 1 Introduction

Multi-task learning (MTL) advocates sharing relevant information among several related tasks during the training stage. The advantage of MTL over learning tasks independently has been shown theoretically as well as empirically [1, 2, 3, 4, 5, 6, 7].

The focus of this paper is the question of how the task relationships can be inferred from the data. It has been noted that naively grouping all the tasks together may be detrimental [8, 9, 10, 11]. The approaches of [10, 12] aim to learn groups of closely related tasks. The information is then shared only within these clusters of tasks. This corresponds to learning the task covariance matrix, which we denote as the output kernel in this paper. Most of these approaches lead to non-convex problems.

In this work, we focus on the problem of directly learning the output kernel in the multi-task learning framework. The multi-task kernel on input and output is assumed to be decoupled as the product of a scalar kernel and the output kernel, which is a positive semidefinite matrix [1, 13, 14, 15]. In classical multi-task learning algorithms [1, 16], the degree of relatedness between distinct tasks is set to a constant and is optimized as a hyperparameter. However, constant similarity between tasks is a strong assumption and is unlikely to hold in practice. Thus recent approaches have tackled the problem of directly learning the output kernel.

[17] solves a multi-task formulation in the framework of vector-valued reproducing kernel Hilbert spaces with squared loss, penalizing the Frobenius norm of the output kernel as a regularizer. They formulate an invex optimization problem that they solve optimally. In comparison, [18] recently proposed an efficient barrier method to optimize a generic convex output kernel learning formulation. On the other hand, [9] proposes a convex formulation to learn a low-rank output kernel matrix by enforcing a trace constraint. The above approaches [9, 17, 18] solve the resulting optimization problem via alternating minimization between task parameters and the output kernel. Each step of the alternating minimization requires an eigenvalue decomposition of a matrix whose size is the number of tasks, as well as the solution of a problem corresponding to learning all tasks independently.

In this paper we study a formulation similar to that of [17]. However, we allow arbitrary convex loss functions and employ general $p$-norms for $1 < p \le 2$ (including the Frobenius norm) as regularizer for the output kernel. Our problem is jointly convex over the task parameters and the output kernel. Small $p$ leads to sparse output kernels, which allows for an easier interpretation of the learned task relationships in the output kernel. Under certain conditions on $p$ we show that one can drop the constraint that the output kernel should be positive definite as it is automatically satisfied for the unconstrained problem. This significantly simplifies the optimization and our result could also be of interest in other areas where one optimizes over the cone of positive definite matrices. The resulting unconstrained dual problem is amenable to efficient optimization methods such as stochastic dual coordinate ascent [19], which scale well to large data sets. Overall we do not require any eigenvalue decomposition operation at any stage of our algorithm and no alternating minimization is necessary, leading to a highly efficient methodology. Furthermore, we show that this trick not only applies to $p$-norms but also applies to a large class of regularizers for which we provide a characterization.

Our contributions are as follows: (a) we propose a generic $p$-norm regularized output kernel matrix learning formulation, which can be extended to a large class of regularizers; (b) we show that the constraint on the output kernel to be positive definite can be dropped as it is automatically satisfied, leading to an unconstrained dual problem; (c) we propose an efficient stochastic dual coordinate ascent based method for solving the dual formulation; (d) we empirically demonstrate the superiority of our approach in terms of generalization performance as well as a significant reduction in training time compared to other methods learning the output kernel.

The paper is organized as follows. We introduce our formulation in Section 2. Our main technical result is discussed in Section 3. The proposed optimization algorithm is described in Section 4. In Section 5, we report the empirical results.

## 2 The Output Kernel Learning Formulation

We first introduce the setting considered in this paper. We denote the number of tasks by $T$. We assume that all tasks have a common input space $\mathcal{X}$ and a common positive definite kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We denote by $\psi(\cdot)$ the feature map and by $H_k$ the reproducing kernel Hilbert space (RKHS) [20] associated with $k$. The training data is $(x_i, y_i, t_i)_{i=1}^n$, where $x_i \in \mathcal{X}$, $t_i$ is the task the $i$-th instance belongs to and $y_i$ is the corresponding label. Moreover, we have a positive semidefinite matrix $\Theta \in S^T_+$ on the set of tasks $\{1,\ldots,T\}$, where $S^T_+$ is the set of $T \times T$ symmetric and positive semidefinite (p.s.d.) matrices.

If one arranges the predictions of all tasks in a vector one can see multi-task learning as learning a vector-valued function in a RKHS [see 1, 13, 14, 15, 18, and references therein]. However, in this paper we use the one-to-one correspondence between real-valued and matrix-valued kernels, see [21], in order to limit the technical overhead. In this framework we define the joint kernel of input space and the set of tasks as

$$
M\big((x,s),(z,t)\big) = k(x,z)\,\Theta(s,t). \tag{1}
$$
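As an illustration of the product structure in (1), the following minimal sketch builds the Gram matrix of the joint kernel on a toy sample and checks numerically that it is positive semidefinite. The Gaussian kernel, the helper names `gaussian_kernel` and `joint_gram`, and the random data are assumptions made for this example only.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=0.5):
    # k(x, z) = exp(-gamma * ||x - z||^2), one standard positive definite kernel
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def joint_gram(X, tasks, Theta, gamma=0.5):
    # Gram matrix of M((x_i, t_i), (x_j, t_j)) = k(x_i, x_j) * Theta[t_i, t_j]
    K = gaussian_kernel(X, X, gamma)
    return K * Theta[np.ix_(tasks, tasks)]

rng = np.random.default_rng(0)
n, d, T = 12, 3, 4
X = rng.normal(size=(n, d))
tasks = rng.integers(0, T, size=n)   # zero-based task index t_i
A = rng.normal(size=(T, T))
Theta = A @ A.T                      # symmetric p.s.d. output kernel
G = joint_gram(X, tasks, Theta)
min_eig = np.linalg.eigvalsh(G).min()
print("smallest eigenvalue of the joint Gram matrix:", min_eig)
```

The elementwise product of two p.s.d. matrices is again p.s.d., which is why the smallest eigenvalue is nonnegative up to rounding error.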

We denote the corresponding RKHS of functions on $\mathcal{X} \times \{1,\ldots,T\}$ as $H_M$ and by $\|\cdot\|_{H_M}$ the corresponding norm. We formulate the output kernel learning problem for multiple tasks as

$$
\min_{\Theta \in S^T_+,\; F \in H_M} \; C \sum_{i=1}^n L\big(y_i, F(x_i,t_i)\big) + \frac{1}{2}\|F\|^2_{H_M} + \lambda V(\Theta), \tag{2}
$$

where $L$ is the convex loss function (convex in the second argument), $V(\Theta)$ is a convex regularizer penalizing the complexity of the output kernel $\Theta$ and $\lambda$ is the regularization parameter. Note that $\|F\|^2_{H_M}$ implicitly depends also on $\Theta$. In the following we show that (2) can be reformulated into a jointly convex problem in the parameters of the prediction function and the output kernel $\Theta$. In order to see this we first need the following representer theorem for fixed output kernel $\Theta$.

###### Lemma 1

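The optimal solution of the optimization problem

$$
\min_{F \in H_M} \; C \sum_{i=1}^n L\big(y_i, F(x_i,t_i)\big) + \frac{1}{2}\|F\|^2_{H_M} \tag{3}
$$

admits a representation of the form

$$
F(x,t) = \sum_{s=1}^T \sum_{i=1}^n \gamma_{is}\, M\big((x_i,s),(x,t)\big) = \sum_{s=1}^T \sum_{i=1}^n \gamma_{is}\, k(x_i,x)\,\Theta(s,t),
$$

where $F(x,t)$ is the prediction for instance $x$ belonging to task $t$ and $\gamma \in \mathbb{R}^{n \times T}$.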

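Proof: The proof is analogous to the standard representer theorem [20]. We denote by $U = \mathrm{span}\{M((x_i,s),(\cdot,\cdot)) \mid i=1,\ldots,n,\; s=1,\ldots,T\}$ the subspace in $H_M$ spanned by the training data. This induces the orthogonal decomposition $H_M = U \oplus U^\perp$, where $U^\perp$ is the orthogonal subspace of $U$. Every function $F \in H_M$ can correspondingly be decomposed into $F = F^\parallel + F^\perp$, where $F^\parallel \in U$ and $F^\perp \in U^\perp$. Then $\|F\|^2_{H_M} = \|F^\parallel\|^2_{H_M} + \|F^\perp\|^2_{H_M}$. Moreover, by the reproducing property,

$$
F(x_i,t_i) = \big\langle F, M((x_i,t_i),(\cdot,\cdot)) \big\rangle = \big\langle F^\parallel, M((x_i,t_i),(\cdot,\cdot)) \big\rangle = F^\parallel(x_i,t_i). \tag{4}
$$

As the loss only depends on $F^\parallel$, we minimize the objective by having $\|F^\perp\|_{H_M} = 0$. This yields the result.

With the explicit form of the prediction function one can rewrite the main problem (2) as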

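$$
\min_{\Theta \in S^T_+,\; \gamma \in \mathbb{R}^{n \times T}} \; C \sum_{i=1}^n L\Big(y_i, \sum_{s=1}^T \sum_{j=1}^n \gamma_{js}\, k_{ji}\, \Theta_{s t_i}\Big) + \frac{1}{2} \sum_{r,s=1}^T \sum_{i,j=1}^n \gamma_{ir}\gamma_{js}\, k_{ij}\, \Theta_{rs} + \lambda V(\Theta), \tag{5}
$$

where $\Theta_{rs} = \Theta(r,s)$ and $k_{ij} = k(x_i,x_j)$. Unfortunately, problem (5) is not jointly convex in $\Theta$ and $\gamma$ due to the product in the second term. A similar problem has been analyzed in [17]. They could show that for the squared loss and $V(\Theta) = \|\Theta\|_F^2$ the corresponding optimization problem is invex and directly optimize it. For an invex function every stationary point is globally optimal [22].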

We follow a different path which leads to a formulation similar to the one of [2] used for learning an input mapping (see also [9]). Our formulation for the output kernel learning problem is jointly convex in the task kernel and the task parameters. We present a derivation for the general RKHS $H_k$, analogous to the linear case presented in [2, 9]. We use the following variable transformation,

$$
\beta_{it} = \sum_{s=1}^T \Theta_{ts}\,\gamma_{is}, \quad i=1,\ldots,n,\; t=1,\ldots,T, \qquad \text{resp.} \qquad \gamma_{is} = \sum_{t=1}^T (\Theta^{-1})_{st}\,\beta_{it}.
$$

In the last expression $\Theta^{-1}$ has to be understood as the pseudo-inverse if $\Theta$ is not invertible. Note that this causes no problems: in case $\Theta$ is not invertible, we can without loss of generality restrict $\gamma$ in (5) to the range of $\Theta$. The transformation leads to our final problem formulation, where the prediction function and its squared norm can be written as

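$$
F(x,t) = \sum_{i=1}^n \beta_{it}\, k(x_i,x), \qquad \|F\|^2_{H_M} = \sum_{r,s=1}^T \sum_{i,j=1}^n (\Theta^{-1})_{sr}\, \beta_{is}\beta_{jr}\, k(x_i,x_j). \tag{6}
$$

This can be seen as follows:

\begin{align}
\|F\|^2_{H_M} &= \sum_{r,s=1}^T \sum_{i,j=1}^n \gamma_{ir}\gamma_{js}\, k(x_i,x_j)\, \Theta_{rs} \tag{7} \\
&= \sum_{t,u=1}^T \sum_{r,s=1}^T \sum_{i,j=1}^n \beta_{it}\beta_{ju}\, (\Theta^{-1})_{tr} (\Theta^{-1})_{us}\, k(x_i,x_j)\, \Theta_{rs} \tag{8} \\
&= \sum_{t,u=1}^T \sum_{i,j=1}^n (\Theta^{-1})_{tu}\, \beta_{it}\beta_{ju}\, k(x_i,x_j). \tag{9}
\end{align}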
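The chain (7) to (9) can be checked numerically. The sketch below is a sanity check under stated assumptions: a random p.s.d. matrix stands in for the kernel matrix $k(x_i,x_j)$, and $\Theta$ is made invertible by adding a small multiple of the identity so that the plain inverse can be used.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 6, 3
B = rng.normal(size=(n, n))
K = B @ B.T                          # stand-in for the kernel matrix k(x_i, x_j)
A = rng.normal(size=(T, T))
Theta = A @ A.T + 0.1 * np.eye(T)    # invertible output kernel (assumption)
gamma = rng.normal(size=(n, T))

# Variable transformation: beta_it = sum_s Theta_ts * gamma_is
beta = gamma @ Theta

# (7): sum_{r,s} sum_{i,j} gamma_ir gamma_js k_ij Theta_rs
norm_gamma = np.einsum("ir,js,ij,rs->", gamma, gamma, K, Theta)
# (9): sum_{t,u} sum_{i,j} (Theta^{-1})_tu beta_it beta_ju k_ij
norm_beta = np.einsum("tu,it,ju,ij->", np.linalg.inv(Theta), beta, beta, K)

print(norm_gamma, norm_beta)
```

Both expressions evaluate the same squared norm, once in the $\gamma$ parametrization and once in the $\beta$ parametrization.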

We get our final primal optimization problem

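$$
\min_{\Theta \in S^T_+,\; \beta \in \mathbb{R}^{n \times T}} \; C \sum_{i=1}^n L\Big(y_i, \sum_{j=1}^n \beta_{j t_i}\, k_{ji}\Big) + \frac{1}{2} \sum_{r,s=1}^T \sum_{i,j=1}^n (\Theta^{-1})_{sr}\, \beta_{is}\beta_{jr}\, k_{ij} + \lambda V(\Theta). \tag{10}
$$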

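Before we analyze the convexity of this problem, we want to illustrate the connection to the formulations in [9, 17]. With the task weight vectors $w_t = \sum_{j=1}^n \beta_{jt}\, \psi(x_j) \in H_k$ we get predictions as $F(x,t) = \langle w_t, \psi(x) \rangle$ and one can rewrite

$$
\|F\|^2_{H_M} = \sum_{r,s=1}^T \sum_{i,j=1}^n (\Theta^{-1})_{sr}\, \beta_{is}\beta_{jr}\, k(x_i,x_j) = \sum_{r,s=1}^T (\Theta^{-1})_{sr}\, \langle w_s, w_r \rangle.
$$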
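For a linear kernel the weight vectors are explicit, so the identity above can be verified directly. This is a minimal sketch assuming $k(x,z) = \langle x, z \rangle$ with the identity feature map; the scalar `tau` used for the special case of a multiple of the identity matrix is a hypothetical name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, T = 7, 4, 3
X = rng.normal(size=(n, d))          # rows are the inputs x_i
K = X @ X.T                          # linear kernel k(x_i, x_j) = <x_i, x_j>
beta = rng.normal(size=(n, T))
A = rng.normal(size=(T, T))
Theta = A @ A.T + 0.1 * np.eye(T)    # invertible output kernel (assumption)
Theta_inv = np.linalg.inv(Theta)

W = X.T @ beta                       # column t is w_t = sum_j beta_jt x_j
norm_beta = np.einsum("sr,is,jr,ij->", Theta_inv, beta, beta, K)
norm_w = np.sum(Theta_inv * (W.T @ W))   # sum_{r,s} (Theta^{-1})_sr <w_s, w_r>

# Special case Theta = tau * I: the norm decouples into per-task terms
tau = 2.5
norm_indep = np.einsum("sr,is,jr,ij->", np.eye(T) / tau, beta, beta, K)
sum_sq = np.sum(W ** 2) / tau

print(norm_beta, norm_w, norm_indep, sum_sq)
```

In the decoupled case no cross-task terms $\langle w_s, w_r \rangle$ with $s \neq r$ remain, which is the sense in which the tasks are learned independently.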

This identity is known for vector-valued RKHS, see [15] and references therein. When $\Theta$ is a positive multiple of the identity matrix, then $\|F\|^2_{H_M}$ is proportional to $\sum_{t=1}^T \|w_t\|^2$ and thus (2) is learning the tasks independently. As mentioned before, the convexity of the expression for $\|F\|^2_{H_M}$ is crucial for the convexity of the full problem (10). The following result has been shown in [2] (see also [9]).

###### Lemma 2

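Let $R(\Theta)$ denote the range of $\Theta \in S^T_+$ and let $\Theta^\dagger$ be the pseudoinverse. The extended function $f: S^T_+ \times \mathbb{R}^{n \times T} \to \mathbb{R} \cup \{\infty\}$ defined as

$$
f(\Theta,\beta) = \begin{cases} \sum_{r,s=1}^T \sum_{i,j=1}^n (\Theta^\dagger)_{sr}\, \beta_{is}\beta_{jr}\, k(x_i,x_j), & \text{if } \beta_{i\cdot} \in R(\Theta),\; \forall\, i=1,\ldots,n, \\ \infty & \text{else}, \end{cases}
$$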

is jointly convex.
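A numerical midpoint test can make the joint convexity claim concrete. The sketch below is not a proof: it samples random pairs $(\Theta, \beta)$ with positive definite $\Theta$ (an assumption, so that the pseudoinverse is the plain inverse) and counts violations of $f\big(\tfrac{1}{2}(\Theta_1+\Theta_2), \tfrac{1}{2}(\beta_1+\beta_2)\big) \le \tfrac{1}{2}\big(f(\Theta_1,\beta_1) + f(\Theta_2,\beta_2)\big)$.

```python
import numpy as np

def f(Theta, beta, K):
    # sum_{r,s} sum_{i,j} (Theta^{-1})_sr beta_is beta_jr k_ij
    #   = trace(Theta^{-1} beta^T K beta)
    return np.trace(np.linalg.solve(Theta, beta.T @ K @ beta))

def random_pd(rng, T):
    # Random positive definite matrix (hypothetical helper for this check)
    A = rng.normal(size=(T, T))
    return A @ A.T + 0.1 * np.eye(T)

rng = np.random.default_rng(2)
n, T = 5, 3
B = rng.normal(size=(n, n))
K = B @ B.T                          # stand-in p.s.d. kernel matrix

violations = 0
for _ in range(200):
    Th1, Th2 = random_pd(rng, T), random_pd(rng, T)
    b1, b2 = rng.normal(size=(n, T)), rng.normal(size=(n, T))
    mid = f(0.5 * (Th1 + Th2), 0.5 * (b1 + b2), K)
    avg = 0.5 * (f(Th1, b1, K) + f(Th2, b2, K))
    if mid > avg + 1e-9 * max(1.0, abs(avg)):
        violations += 1

print("midpoint convexity violations:", violations)
```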

Proof: It has been shown in [2] and [23][p. 223] that the function $(x, A) \mapsto \langle x, A^\dagger x \rangle$ is jointly convex on the set of pairs with $A \in S^T_+$ and $x \in R(A)$, where $R(A)$ is the range of $A$ and $A^\dagger$ is the pseudoinverse of $A$. As the kernel matrix $\big(k(x_i,x_j)\big)_{i,j=1}^n$ is positive semidefinite we can compute its eigendecomposition as

$$
k(x_i,x_j) = \sum_{l=1}^n \lambda_l\, u_{li}\, u_{lj},
$$

where $\lambda_l \ge 0$, $l=1,\ldots,n$, are the eigenvalues and $u_l \in \mathbb{R}^n$ the eigenvectors. Using this we get

$$
\sum_{r,s=1}^T \sum_{i,j=1}^n (\Theta^{-1})_{sr}\, \beta_{is}\beta_{jr}\, k(x_i,x_j) = \sum_{l=1}^n \lambda_l \sum_{r,s=1}^T \Big(\sum_{i=1}^n \beta_{is}\, u_{li}\Big) \Big(\sum_{j=1}^n \beta_{jr}\, u_{lj}\Big) (\Theta^{-1})_{rs} \tag{11}
$$

and thus we can write the function as a positive combination of convex functions, where the arguments are composed with linear mappings, which preserves convexity [24].

The formulation in (10) is similar to [9, 17, 18]. [9] uses a trace constraint on $\Theta$ instead of a regularizer $V(\Theta)$, enforcing low rank of the output kernel. On the other hand, [17] employs the squared Frobenius norm for $V(\Theta)$ with the squared loss function. [18] proposed an efficient algorithm for convex $V(\Theta)$. Instead we think that sparsity of $\Theta$ is better suited to avoid the emergence of spurious relations between tasks and also leads to output kernels which are easier to interpret. Thus we propose to use the following regularization functional for the output kernel $\Theta$:

$$
V(\Theta) = \sum_{t,t'=1}^T |\Theta_{tt'}|^p = \|\Theta\|_p^p,
$$

for $p \in [1,2]$. Several approaches [9, 17, 18] employ an alternating minimization scheme, involving costly eigendecompositions of a $T \times T$ matrix per iteration (as $\Theta \in S^T_+$). In the next section we show that for a certain set of values of $p$ one can derive an unconstrained dual optimization problem which thus avoids the explicit minimization over the $S^T_+$ cone. The resulting unconstrained dual problem can then be easily optimized by stochastic coordinate ascent. Having explicit expressions of the primal variables $\Theta$ and $\beta$ in terms of the dual variables allows us to get back to the original problem.

## 3 Unconstrained Dual Problem Avoiding Optimization over $S^T_+$

The primal formulation (10) is a convex multi-task output kernel learning problem. The next lemma derives the Fenchel dual function of (10). This still involves the optimization over the primal variable $\Theta \in S^T_+$. A main contribution of this paper is to show that this optimization problem over the $S^T_+$ cone can be solved with an analytical solution for a certain class of regularizers $V(\Theta)$. In the following we denote by $\alpha^r := \{\alpha_i \mid t_i = r\}$ the dual variables corresponding to task $r$ and by $K_{rs}$ the kernel matrix $\big(k(x_i,x_j) \mid t_i = r,\, t_j = s\big)$ corresponding to the dual variables of tasks $r$ and $s$.

###### Lemma 3

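Let $L_i^*$ be the conjugate function of the loss $L_i: \mathbb{R} \to \mathbb{R}$, $u \mapsto L(y_i,u)$, then

$$
q: \mathbb{R}^n \to \mathbb{R}, \qquad q(\alpha) = -C \sum_{i=1}^n L_i^*\Big(-\frac{\alpha_i}{C}\Big) - \lambda \max_{\Theta \in S^T_+} \Big( \frac{1}{2\lambda} \sum_{r,s=1}^T \Theta_{rs}\, \langle \alpha^r, K_{rs}\alpha^s \rangle - V(\Theta) \Big) \tag{12}
$$

is the dual function of (10), where $\alpha \in \mathbb{R}^n$ are the dual variables. The primal variable $\beta \in \mathbb{R}^{n \times T}$ in (10) and the prediction function $F$ can be expressed in terms of $\Theta$ and $\alpha$ as $\beta_{is} = \alpha_i\, \Theta_{s t_i}$ and $F(x,s) = \sum_{j=1}^n \alpha_j\, \Theta_{s t_j}\, k(x_j,x)$ respectively, where $t_j$ is the task of the $j$-th training example.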
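The primal recovery stated in the lemma is straightforward to express in code. The following sketch assumes zero-based task indices and uses hypothetical helper names (`primal_from_dual`, `predict`); it checks that predicting through $\beta$ as in (6) agrees with predicting directly from $\alpha$ and $\Theta$.

```python
import numpy as np

def primal_from_dual(alpha, tasks, Theta):
    # beta_is = alpha_i * Theta[s, t_i]; returns an array of shape (n, T)
    return alpha[:, None] * Theta[tasks, :]

def predict(alpha, tasks, Theta, k_new, s):
    # F(x, s) = sum_j alpha_j * Theta[s, t_j] * k(x_j, x), with k_new[j] = k(x_j, x)
    return float(np.sum(alpha * Theta[s, tasks] * k_new))

rng = np.random.default_rng(4)
n, T = 8, 3
tasks = rng.integers(0, T, size=n)   # t_i, zero-based task index
alpha = rng.normal(size=n)           # stand-in dual variables
A = rng.normal(size=(T, T))
Theta = A @ A.T                      # symmetric p.s.d. output kernel
k_new = rng.normal(size=n)           # stand-in for the values k(x_j, x)

beta = primal_from_dual(alpha, tasks, Theta)
via_beta = [float(beta[:, s] @ k_new) for s in range(T)]    # F(x, s) as in (6)
via_alpha = [predict(alpha, tasks, Theta, k_new, s) for s in range(T)]
print(via_beta, via_alpha)
```

Note that a prediction for task $s$ mixes the dual variables of all tasks, weighted by the learned similarities $\Theta_{s t_j}$.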

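Proof: We derive the Fenchel dual function of (10). For this purpose we introduce auxiliary variables $z_i$, $i=1,\ldots,n$, which satisfy the constraint

$$
z_i = \sum_{j=1}^n \beta_{j t_i}\, k(x_j,x_i) = F(x_i,t_i).
$$

The Lagrangian of the resulting problem (10) is given as

\begin{align}
\mathcal{L}(\beta,\Theta,z,\alpha) ={}& C \sum_{i=1}^n L(y_i,z_i) + \frac{1}{2} \sum_{r,s=1}^T \sum_{i,j=1}^n (\Theta^{-1})_{sr}\, \beta_{is}\beta_{jr}\, k(x_i,x_j) \tag{13} \\
&+ \sum_{i=1}^n \alpha_i \Big( z_i - \sum_{j=1}^n \beta_{j t_i}\, k(x_j,x_i) \Big) + i_{S^T_+}(\Theta) + \lambda V(\Theta), \notag
\end{align}

where $i_{S^T_+}$ is the indicator function of the set $S^T_+$. The dual function is defined as

$$
q(\alpha) = \min_{\beta \in \mathbb{R}^{n \times T},\; \Theta \in S^T_+,\; z \in \mathbb{R}^n} \mathcal{L}(\beta,\Theta,z,\alpha). \tag{14}
$$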

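Using the definition of the conjugate function [24], we get

\begin{align}
\min_{z_i \in \mathbb{R}} C\, L(y_i,z_i) + \alpha_i z_i &= C \min_{z_i \in \mathbb{R}} \Big( L(y_i,z_i) + \frac{\alpha_i}{C} z_i \Big) = -C \max_{z_i \in \mathbb{R}} \Big( -\frac{\alpha_i}{C} z_i - L(y_i,z_i) \Big) \tag{15} \\
&= -C\, L_i^*\Big(-\frac{\alpha_i}{C}\Big), \tag{16}
\end{align}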

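where $L_i^*$ is the conjugate function of $L_i$. Moreover, we compute the minimizer with respect to $\beta$ via

\begin{align}
&\frac{\partial}{\partial \beta_{lu}} \Big( \frac{1}{2} \sum_{r,s=1}^T \sum_{i,j=1}^n (\Theta^{-1})_{sr}\, \beta_{is}\beta_{jr}\, k(x_i,x_j) - \sum_{i=1}^n \alpha_i \sum_{j=1}^n \beta_{j t_i}\, k(x_j,x_i) \Big) \tag{17} \\
&\qquad = \sum_{r=1}^T \sum_{j=1}^n \beta_{jr}\, (\Theta^{-1})_{ur}\, k(x_l,x_j) - \sum_{i=1}^n \alpha_i\, \delta_{u t_i}\, k(x_l,x_i), \notag
\end{align}

where $\delta_{ab}$ is the Kronecker symbol, that is $\delta_{ab} = 1$ if $a = b$ and $0$ otherwise. Solving for the global minimizer yields

$$
\beta^*_{jr} = \alpha_j\, \Theta_{r t_j}. \tag{18}
$$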
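Both steps of the derivation so far can be checked on a small example. The sketch below makes two assumptions that are not part of the text: it uses the squared loss $L(y,z) = \tfrac{1}{2}(z-y)^2$, whose conjugate is $L_i^*(u) = u\,y_i + \tfrac{1}{2}u^2$, to illustrate (15) and (16), and it uses random stand-ins for the kernel matrix and the dual variables to confirm that the gradient (17) vanishes at (18).

```python
import numpy as np

def sq_loss(y, z):
    # Squared loss, used here only as an illustrative convex loss
    return 0.5 * (z - y) ** 2

def sq_loss_conj(y, u):
    # Conjugate in the second argument: sup_z (u*z - 0.5*(z - y)^2) = u*y + 0.5*u^2
    return u * y + 0.5 * u ** 2

# Conjugate step (15)-(16) for a single example, checked on a fine grid
C, y_i, alpha_i = 2.0, 0.7, -1.3
z = np.linspace(-10.0, 10.0, 200001)
lhs = np.min(C * sq_loss(y_i, z) + alpha_i * z)
rhs = -C * sq_loss_conj(y_i, -alpha_i / C)

# Stationarity in beta: gradient (17) at beta*_jr = alpha_j * Theta[r, t_j]
rng = np.random.default_rng(5)
n, T = 8, 3
B = rng.normal(size=(n, n))
K = B @ B.T                                    # stand-in kernel matrix
A = rng.normal(size=(T, T))
Theta = A @ A.T + 0.1 * np.eye(T)              # invertible output kernel (assumption)
tasks = rng.integers(0, T, size=n)
alpha = rng.normal(size=n)

beta_star = alpha[:, None] * Theta[tasks, :]
Z = np.zeros((n, T))
Z[np.arange(n), tasks] = alpha                 # Z[i, u] = alpha_i * delta_{u, t_i}
grad = K @ beta_star @ np.linalg.inv(Theta) - K @ Z
max_abs_grad = np.abs(grad).max()

print(lhs, rhs, max_abs_grad)
```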

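Plugging $\beta^*$ back into the above expressions yields

\begin{align}
\sum_{r,s=1}^T \sum_{i,j=1}^n (\Theta^{-1})_{sr}\, \beta^*_{is}\beta^*_{jr}\, k(x_i,x_j) &= \sum_{r,s=1}^T \sum_{i,j=1}^n (\Theta^{-1})_{sr}\, \Theta_{s t_i} \Theta_{r t_j}\, \alpha_i \alpha_j\, k(x_i,x_j) \notag \\
&= \sum_{i,j=1}^n \Theta_{t_i t_j}\, \alpha_i \alpha_j\, k(x_i,x_j), \tag{19} \\
\sum_{i,j=1}^n \alpha_i\, \beta^*_{j t_i}\, k(x_j,x_i) &= \sum_{i,j=1}^n \alpha_i \alpha_j\, \Theta_{t_j t_i}\, k(x_j,x_i). \tag{20}
\end{align}

Using the task-wise dual variables $\alpha^r$ and kernel blocks $K_{rs}$, and gathering the terms corresponding to the individual tasks, we get

$$
\sum_{i,j=1}^n \alpha_i \alpha_j\, \Theta_{t_j t_i}\, k(x_j,x_i) = \sum_{r,s=1}^T \Theta_{rs}\, \langle \alpha^r, K_{rs}\alpha^s \rangle.
$$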
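The regrouping of the double sum by tasks, in the $\Theta$-weighted form in which it enters (21), can be sketched as follows. The helper `task_gram` is a hypothetical name; it computes the $T \times T$ matrix with entries $\langle \alpha^r, K_{rs}\alpha^s \rangle$ in one product, using an $n \times T$ matrix that places $\alpha_i$ in the column of its task.

```python
import numpy as np

def task_gram(alpha, tasks, K, T):
    # Q[r, s] = <alpha^r, K_rs alpha^s>, computed as Z^T K Z with Z[i, t_i] = alpha_i
    Z = np.zeros((len(alpha), T))
    Z[np.arange(len(alpha)), tasks] = alpha
    return Z.T @ K @ Z

rng = np.random.default_rng(7)
n, T = 10, 3
B = rng.normal(size=(n, n))
K = B @ B.T                          # stand-in kernel matrix
A = rng.normal(size=(T, T))
Theta = A @ A.T
tasks = rng.integers(0, T, size=n)   # zero-based task index t_i
alpha = rng.normal(size=n)

# Instance-wise double sum: sum_{i,j} alpha_i alpha_j Theta[t_i, t_j] k_ij
lhs = np.einsum("i,j,ij,ij->", alpha, alpha, Theta[np.ix_(tasks, tasks)], K)
# Task-wise form: sum_{r,s} Theta_rs <alpha^r, K_rs alpha^s>
Q = task_gram(alpha, tasks, K, T)
rhs = np.sum(Theta * Q)

# One block computed explicitly from the definition of alpha^r and K_rs
r, s = 0, 1
ir, js = np.where(tasks == r)[0], np.where(tasks == s)[0]
block = alpha[ir] @ K[np.ix_(ir, js)] @ alpha[js]

print(lhs, rhs, block, Q[r, s])
```

Because `Q` has the form $Z^\top K Z$ with p.s.d. $K$, it is itself p.s.d., which is the Gram matrix property used later for $\rho$.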

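Plugging all the expressions back into (14), we get the dual function as

\begin{align}
q(\alpha) &= -C \sum_{i=1}^n L_i^*\Big(-\frac{\alpha_i}{C}\Big) + \min_{\Theta \in S^T_+} \Big( \lambda V(\Theta) - \frac{1}{2} \sum_{r,s=1}^T \Theta_{rs}\, \langle \alpha^r, K_{rs}\alpha^s \rangle \Big) \tag{21} \\
&= -C \sum_{i=1}^n L_i^*\Big(-\frac{\alpha_i}{C}\Big) + \lambda \min_{\Theta \in S^T_+} \Big( V(\Theta) - \langle \rho, \Theta \rangle \Big) \tag{22} \\
&= -C \sum_{i=1}^n L_i^*\Big(-\frac{\alpha_i}{C}\Big) - \lambda \max_{\Theta \in S^T_+} \Big( \langle \rho, \Theta \rangle - V(\Theta) \Big), \tag{23}
\end{align}

where we have introduced in the second step $\rho \in \mathbb{R}^{T \times T}$ with

$$
\rho_{rs} = \frac{1}{2\lambda} \langle \alpha^r, K_{rs}\alpha^s \rangle.
$$

Note that $\rho$ is a Gram matrix and thus positive semidefinite. The expression for the prediction function is obtained by plugging (18) into (6).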

We now focus on the remaining maximization problem in the dual function in (12),

$$
\max_{\Theta \in S^T_+} \; \frac{1}{2\lambda} \sum_{r,s=1}^T \Theta_{rs}\, \langle \alpha^r, K_{rs}\alpha^s \rangle - V(\Theta). \tag{24}
$$

This is a semidefinite program which is computationally expensive to solve and thus prevents scaling the output kernel learning problem to a large number of tasks. However, we show in the following that this problem has an analytical solution for a subset of the regularizers $V(\Theta) = \frac{1}{2} \sum_{r,s=1}^T |\Theta_{rs}|^p$. For better readability we defer a more general result towards the end of the section. The basic idea is to relax the constraint on $\Theta$ in (24) to $\Theta \in \mathbb{R}^{T \times T}$, so that the problem is equivalent to the computation of the conjugate $V^*$ of $V$. If the maximizer of the relaxed problem is positive semidefinite, one has found the solution of the original problem.

###### Theorem 4

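Let $k \in \mathbb{N}$ and $p = \frac{2k}{2k-1}$, then with $\rho_{rs} = \frac{1}{2\lambda} \langle \alpha^r, K_{rs}\alpha^s \rangle$ we have

$$
\max_{\Theta \in S^T_+} \; \sum_{r,s=1}^T \Theta_{rs}\, \rho_{rs} - \frac{1}{2} \sum_{r,s=1}^T |\Theta_{rs}|^p = \frac{1}{4k-2} \Big( \frac{2k-1}{2k\lambda} \Big)^{2k} \sum_{r,s=1}^T \langle \alpha^r, K_{rs}\alpha^s \rangle^{2k}, \tag{25}
$$

and the maximizer is given by the positive semidefinite matrix

$$
\Theta^*_{rs} = \Big( \frac{2k-1}{2k\lambda} \Big)^{2k-1} \langle \alpha^r, K_{rs}\alpha^s \rangle^{2k-1}, \qquad r,s = 1,\ldots,T. \tag{26}
$$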

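Proof: We relax the constraints and solve

$$
\max_{\Theta \in \mathbb{R}^{T \times T}} \; \sum_{r,s=1}^T \Theta_{rs}\, \rho_{rs} - \frac{1}{2} \sum_{r,s=1}^T |\Theta_{rs}|^p.
$$

Note that the problem is separable and thus we can solve for each component separately,

$$
\max_{\Theta_{rs} \in \mathbb{R}} \; \frac{1}{2\lambda}\, \Theta_{rs}\, \langle \alpha^r, K_{rs}\alpha^s \rangle - \frac{1}{2} |\Theta_{rs}|^p.
$$

The optimality condition for $\Theta^*_{rs}$ becomes, with $\rho_{rs} = \frac{1}{2\lambda} \langle \alpha^r, K_{rs}\alpha^s \rangle$,

$$
0 = \rho_{rs} - \frac{p}{2}\, \mathrm{sign}(\Theta^*_{rs})\, |\Theta^*_{rs}|^{p-1} \;\Longrightarrow\; \Theta^*_{rs} = \Big( \frac{2}{p} \Big)^{\frac{1}{p-1}} \mathrm{sign}(\rho_{rs})\, |\rho_{rs}|^{\frac{1}{p-1}}.
$$

The solution of the relaxed problem is the solution of the original constrained problem if we can show that the corresponding maximizer is positive semidefinite. Note that $\rho$ is a positive semidefinite (p.s.d.) matrix as it is a Gram matrix. The factor $(2/p)^{\frac{1}{p-1}}$ is positive and thus the resulting matrix $\Theta^*$ is p.s.d. if the matrix with entries $\mathrm{sign}(\rho_{rs})\, |\rho_{rs}|^{\frac{1}{p-1}}$ is p.s.d.

It has been shown [25] that the elementwise power $A^{\circ l} = (a_{ij}^l)_{i,j}$ of a positive semidefinite matrix $A$ is positive semidefinite for all $A \in S^T_+$ and all $T \in \mathbb{N}$ if and only if $l$ is a positive integer. Note that we have an elementwise integer power of $\rho$ if $\frac{1}{p-1}$ is an odd positive integer (the case of an even integer is ruled out by Theorem 5), that is for $p = \frac{2k}{2k-1}$ with $k \in \mathbb{N}$, as in this case we have

$$
\Theta^*_{rs} = \Big( \frac{2}{p} \Big)^{2k-1} \mathrm{sign}(\rho_{rs})\, |\rho_{rs}|^{2k-1} = \Big( \frac{2}{p} \Big)^{2k-1} \rho_{rs}^{2k-1} = \Big( \frac{2k-1}{2k\lambda} \Big)^{2k-1} \langle \alpha^r, K_{rs}\alpha^s \rangle^{2k-1}.
$$

We get the admissible values of $p$ as $p = \frac{2k}{2k-1}$, $k \in \mathbb{N}$ (resp. $\frac{1}{p-1} = 2k-1$). We compute the optimal objective value as

\begin{align}
\sum_{r,s=1}^T \rho_{rs}^{2k} \Big( \Big( \frac{2}{p} \Big)^{2k-1} - \frac{1}{2} \Big( \frac{2}{p} \Big)^{2k} \Big) &= (p-1)\, \frac{1}{2} \Big( \frac{2}{p} \Big)^{2k} \sum_{r,s=1}^T \rho_{rs}^{2k} = \frac{1}{4k-2} \Big( \frac{2k-1}{k} \Big)^{2k} \sum_{r,s=1}^T \rho_{rs}^{2k} \tag{27} \\
&= \frac{1}{4k-2} \Big( \frac{2k-1}{2\lambda k} \Big)^{2k} \sum_{r,s=1}^T \langle \alpha^r, K_{rs}\alpha^s \rangle^{2k}. \tag{28}
\end{align}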
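The statement of the theorem can be probed numerically. The sketch below is a sanity check under stated assumptions: a random p.s.d. matrix stands in for $\rho$, and the helper names `theta_star` and `objective` are hypothetical. For $k = 1, 2, 3$ it checks that the closed-form maximizer is p.s.d., that its objective value matches (27), and that random symmetric candidates do not achieve a larger value.

```python
import numpy as np

def theta_star(rho, k):
    # Maximizer from the proof: Theta*_rs = (2/p)^(2k-1) * rho_rs^(2k-1), p = 2k/(2k-1)
    p = 2.0 * k / (2 * k - 1)
    return ((2.0 / p) * rho) ** (2 * k - 1)   # odd integer power keeps the sign

def objective(Theta, rho, p):
    # sum_rs Theta_rs rho_rs - 0.5 * sum_rs |Theta_rs|^p, the left-hand side of (25)
    return np.sum(Theta * rho) - 0.5 * np.sum(np.abs(Theta) ** p)

rng = np.random.default_rng(6)
T = 5
A = rng.normal(size=(T, T))
rho = A @ A.T / T                      # p.s.d. stand-in for the Gram matrix rho

results = []
for k in (1, 2, 3):
    p = 2.0 * k / (2 * k - 1)
    Th = theta_star(rho, k)
    best = objective(Th, rho, p)
    closed = (p - 1) / 2 * (2 / p) ** (2 * k) * np.sum(rho ** (2 * k))
    eigs = np.linalg.eigvalsh(Th)
    is_psd = bool(eigs.min() >= -1e-9 * max(1.0, eigs.max()))
    # The objective is concave, so no candidate should beat the stationary point
    beaten = False
    for _ in range(100):
        S = rng.normal(size=(T, T))
        cand = Th + 0.5 * (S + S.T)
        if objective(cand, rho, p) > best + 1e-9 * max(1.0, abs(best)):
            beaten = True
    results.append((k, is_psd, bool(np.isclose(best, closed)), beaten))

print(results)
```

No eigendecomposition is needed to compute the maximizer itself; the call to `eigvalsh` appears only to verify the p.s.d. property.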

Plugging the result of the previous theorem into the dual function of Lemma 3 we get, for $k \in \mathbb{N}$ and $p = \frac{2k}{2k-1}$ with $V(\Theta) = \frac{1}{2} \sum_{r,s=1}^T |\Theta_{rs}|^p$, the following unconstrained dual of our main problem (10):

$$
\max_{\alpha \in \mathbb{R}^n} \; -C \sum_{i=1}^n L_i^*\Big(-\frac{\alpha_i}{C}\Big) - \frac{\lambda}{4k-2} \Big( \frac{2k-1}{2k\lambda} \Big)^{2k} \sum_{r,s=1}^T \langle \alpha^r, K_{rs}\alpha^s \rangle^{2k}. \tag{29}
$$

Note that by doing the variable transformation $\kappa_i := \frac{\alpha_i}{C}$ we effectively have only one hyper-parameter in (29). This allows us to cross-validate more efficiently. The range of admissible values for $p$ in Theorem 4 lies in the interval $(1,2]$, where we get for $k=1$ the value $p=2$ and as $k \to \infty$ we have $p \to 1$. The regularizer for $p=2$ together with the squared loss has been considered in the primal in [17, 18]. Our analytical expression of the dual is novel and allows us to employ stochastic dual coordinate ascent to solve the involved primal optimization problem. Please also note that by optimizing the dual, we have access to the duality gap and thus a well-defined stopping criterion. This is in contrast to the alternating scheme of [17, 18] for the primal problem, which involves costly matrix operations. Our runtime experiments show that our solver for (29) outperforms the solvers of [17, 18]. Finally, note that even for suboptimal dual variables $\alpha$, the corresponding matrix $\Theta(\alpha)$ in (26) is positive semidefinite. Thus we always get a feasible set of primal variables.
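To make the dual (29) and the remark on hyper-parameters concrete, here is a minimal sketch that evaluates the dual objective. It rests on assumptions not fixed by the text: the squared loss $L(y,z) = \tfrac{1}{2}(z-y)^2$ with conjugate $L_i^*(u) = u\,y_i + \tfrac{1}{2}u^2$, the rescaling $\alpha = C\kappa$, and the hypothetical helper names `task_gram` and `dual_objective`. Since $\langle \alpha^r, K_{rs}\alpha^s \rangle$ is quadratic in $\alpha$, a short calculation under these assumptions shows that the objective divided by $C$ depends on $C$ and $\lambda$ only through $C^{4k-1}/\lambda^{2k-1}$, which the code checks for two parameter pairs.

```python
import numpy as np

def task_gram(alpha, tasks, K, T):
    # Q[r, s] = <alpha^r, K_rs alpha^s>, computed as Z^T K Z with Z[i, t_i] = alpha_i
    Z = np.zeros((len(alpha), T))
    Z[np.arange(len(alpha)), tasks] = alpha
    return Z.T @ K @ Z

def dual_objective(alpha, y, tasks, K, T, C, lam, k):
    # Dual (29) with the squared loss (an assumption): L_i^*(u) = u*y_i + 0.5*u^2
    u = -alpha / C
    loss_term = -C * np.sum(u * y + 0.5 * u ** 2)
    Q = task_gram(alpha, tasks, K, T)
    coef = lam / (4 * k - 2) * ((2 * k - 1) / (2 * k * lam)) ** (2 * k)
    return loss_term - coef * np.sum(Q ** (2 * k))

rng = np.random.default_rng(9)
n, T, k = 10, 3, 2
B = rng.normal(size=(n, n))
K = B @ B.T / n                      # stand-in kernel matrix
tasks = rng.integers(0, T, size=n)
y = rng.normal(size=n)
kappa = 0.3 * rng.normal(size=n)     # rescaled dual variables kappa_i = alpha_i / C

# Two (C, lambda) pairs with the same value of C^(4k-1) / lambda^(2k-1)
C1, lam1 = 1.0, 2.0
C2 = 3.0
lam2 = lam1 * (C2 / C1) ** ((4 * k - 1) / (2 * k - 1))

q1 = dual_objective(C1 * kappa, y, tasks, K, T, C1, lam1, k) / C1
q2 = dual_objective(C2 * kappa, y, tasks, K, T, C2, lam2, k) / C2
print(q1, q2)
```

A practical solver would maximize this objective over $\alpha$, for instance coordinate-wise; that part is omitted here.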

#### Characterizing the set of convex regularizers V which allow an analytic expression for the dual function

The previous theorem raises the question for which class of convex, separable regularizers we can get an analytical expression of the dual function by explicitly solving the optimization problem (24) over the positive semidefinite cone. A key element in the proof of the previous theorem is the characterization of functions $f: \mathbb{R} \to \mathbb{R}$ which, when applied elementwise to a positive semidefinite matrix, result in a p.s.d. matrix, that is $f(A) := \big(f(a_{ij})\big)_{i,j=1}^T \in S^T_+$ for all $A \in S^T_+$. This set of functions has been characterized by Hiai [26].

###### Theorem 5 ([26])

Let $f: \mathbb{R} \to \mathbb{R}$ and $A \in S^T_+$. We denote by $f(A) = \big(f(a_{ij})\big)_{i,j=1}^T$ the elementwise application of $f$ to $A$. It holds $\forall\, T \ge 2$, $A \in S^T_+ \Longrightarrow f(A) \in S^T_+$ if and only if $f$ is analytic and $f(x) = \sum_{k=0}^\infty a_k x^k$ with $a_k \ge 0$ for all $k \ge 0$.

Note that in the previous theorem the condition on $f$ is only necessary when we require the implication to hold for all $T$. If $T$ is fixed, the set of functions is larger and includes even (large) fractional powers, see [25]. We use the stronger formulation as we want the result to hold without any restriction on the number of tasks $T$. Theorem 5 is the key element used in our following characterization of separable regularizers of $\Theta$ which allow an analytical expression of the dual function.
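The sufficiency direction of this characterization is easy to observe empirically. The following sketch applies three elementwise maps with nonnegative power series coefficients to random p.s.d. matrices of varying size and records the smallest (scale-normalized) eigenvalue seen. The particular maps and the sampling scheme are choices made for this illustration, and a finite random test does not establish the theorem.

```python
import numpy as np

def min_eig(M):
    # Smallest eigenvalue of a symmetric matrix
    return np.linalg.eigvalsh(M).min()

# Elementwise maps whose power series have only nonnegative coefficients
maps = {
    "x^3": lambda A: A ** 3,
    "x + 0.5 x^2": lambda A: A + 0.5 * A ** 2,
    "exp(x)": np.exp,
}

rng = np.random.default_rng(8)
worst = {name: np.inf for name in maps}
for _ in range(200):
    T = int(rng.integers(2, 7))
    B = rng.normal(size=(T, T))
    A = B @ B.T / T                  # random p.s.d. matrix
    for name, f in maps.items():
        fA = f(A)
        scale = max(1.0, np.abs(fA).max())
        worst[name] = min(worst[name], min_eig(fA) / scale)

print(worst)
```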

###### Theorem 6

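Let $\phi: \mathbb{R} \to \mathbb{R}$ be analytic on $\mathbb{R}$ and given as $\phi(z) = \sum_{k=0}^\infty \frac{a_k}{k+1} z^{k+1}$ where $a_k \ge 0$ for all $k \ge 0$. If $\phi$ is convex, then $V(\Theta) := \sum_{r,s=1}^T \phi^*(\Theta_{rs})$ is a convex function $V: \mathbb{R}^{T \times T} \to \mathbb{R}$ and

$$
\max_{\Theta \in \mathbb{R}^{T \times T}} \; \langle \rho, \Theta \rangle - V(\Theta) = V^*(\rho) = \sum_{r,s=1}^T \phi(\rho_{rs}), \tag{30}
$$

where the global maximizer fulfills $\Theta^* \in S^T_+$ if $\rho \in S^T_+$ and $\Theta^*_{rs} = \sum_{k=0}^\infty a_k\, \rho_{rs}^k$.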

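Proof: Note that $\phi$ is analytic on $\mathbb{R}$ and thus infinitely differentiable on $\mathbb{R}$. As $\phi$ is additionally convex, it is a proper, lower semi-continuous convex function and thus $(\phi^*)^* = \phi$ [27, Corollary 1.3.6]. As $\phi^*$ is convex, $V(\Theta) := \sum_{r,s=1}^T \phi^*(\Theta_{rs})$ is a convex function and using $(\phi^*)^* = \phi$ we get

$$
\max_{\Theta \in \mathbb{R}^{T \times T}} \; \langle \rho, \Theta \rangle - V(\Theta) = V^*(\rho) = \sum_{r,s=1}^T (\phi^*)^*(\rho_{rs}) = \sum_{r,s=1}^T \phi(\rho_{rs}). \tag{31}
$$

Finally, we show that the global maximizer has the given form. Note that as $\phi$ is a proper, lower semi-continuous convex function it holds [27, Corollary 1.4.4]

$$
\Theta_{rs} \in \partial \phi^*(\rho_{rs}) \iff \rho_{rs} \in \partial \phi(\Theta_{rs}).
$$

Note that the maximizer $\Theta^*$ of problem (31) fulfills $\rho_{rs} \in \partial \phi^*(\Theta^*_{rs})$ and thus, since $(\phi^*)^* = \phi$, also $\Theta^*_{rs} \in \partial \phi(\rho_{rs})$, where we have used that $\phi$ is infinitely differentiable. These conditions allow us to express the maximizer of (30) in terms of $\rho$. As $\phi$ is continuously differentiable, we get

$$
\Theta^*_{rs} = \frac{\partial \phi}{\partial \rho_{rs}}(\rho_{rs}) = \sum_{k=0}^\infty a_k\, \rho_{rs}^k.
$$

Note that the series has infinite convergence radius and $a_k \ge 0$ for all $k$, and thus it is of the form provided in Theorem 5. Thus $\Theta^* \in S^T_+$ if $\rho \in S^T_+$.

Table 1 summarizes examples of functions $\phi$, the corresponding $V(\Theta)$ and the maximizer $\Theta^*$ in (30).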

#### Examples

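• First we recover the results of Theorem 4. We use $\phi(z) = \frac{1}{2k} z^{2k}$ for $k \in \mathbb{N}$, which is convex. We compute

$$
\phi^*(y) = \sup_{x \in \mathbb{R}} \; xy - \phi(x) = \sup_{x \in \mathbb{R}} \; xy - \frac{1}{2k} x^{2k} = \frac{2k-1}{2k}\, |y|^{\frac{2k}{2k-1}},
$$

where we have used that the supremum is attained at $x = \mathrm{sign}(y)\, |y|^{\frac{1}{2k-1}}$. We recover

$$
V(\Theta) = \sum_{r,s=1}^T \phi^*(\Theta_{rs}) = \frac{2k-1}{2k} \sum_{r,s=1}^T |\Theta_{rs}|^{\frac{2k}{2k-1}},
$$

which with $p = \frac{2k}{2k-1}$ yields up to a positive factor the family of regularizers employed in Theorem 4 together with

$$
\Theta^*_{rs} = \rho_{rs}^{2k-1}.
$$

• In the second example we use $\phi(z) = e^z = \sum_{k=0}^\infty \frac{z^k}{k!}$, which is convex and whose series has infinite convergence radius. The conjugate is given as

$$
\phi^*(y) = \sup_{x \in \mathbb{R}} \; xy - e^x = \begin{cases} y \log(y) - y & \text{if } y > 0, \\ \infty & \text{else}, \end{cases}
$$

so that the regularizer is given by

$$
V(\Theta) = \sum_{r,s=1}^T \phi^*(\Theta_{rs}) = \begin{cases} \sum_{r,s=1}^T \Theta_{rs} \log(\Theta_{rs}) - \Theta_{rs} & \text{if } \Theta_{rs} > 0 \;\; \forall\, r,s = 1,\ldots,T, \\ \infty & \text{else}. \end{cases}
$$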

This can be seen as a generalized KL-divergence between and , where is the matrix of all ones

 V(Θ)=T∑r,s=1ϕ∗(Θrs)=