1 Introduction
Nonparametric kernel methods, cornerstones of machine learning today, can be seen from different angles: as regularised risk minimisation in function spaces
(Schölkopf and Smola, 2002), or as probabilistic Gaussian process methods (Rasmussen and Williams, 2006). In these techniques, the kernel (or equivalently covariance) function encodes interpolation characteristics from observed to unseen points, and two basic statistical problems have to be mastered. First, a latent function must be predicted which fits the data well, yet is as smooth as possible given the fixed kernel. Second, the kernel function parameters have to be learned as well, to best support the predictions which are of primary interest. While the first problem is simpler and has been addressed much more frequently so far, the central role of learning the covariance function is well acknowledged, and a substantial number of methods for “learning the kernel”, “multiple kernel learning”, or “evidence maximisation” are available now. However, much of this work has firmly been associated with one of the “camps” (referred to as
regularised risk and probabilistic in the sequel) with surprisingly little crosstalk or acknowledgment of prior work across this boundary. In this paper, we clarify the relationship between major regularised risk and probabilistic kernel learning techniques precisely, pointing out advantages and pitfalls of either, as well as algorithmic similarities leading to novel powerful algorithms. We develop a common analytical and algorithmic framework encompassing approaches from both camps and provide clear insights into the optimisation structure. Even though most of the optimisation problems are non-convex, we show how to operate a provably convergent “almost Newton” method nevertheless. Each step is not much more expensive than a gradient-based approach. Also, we do not require any foreign optimisation code to be available. Our framework unifies kernel learning for regression, robust regression and classification.
The paper is structured as follows: In section 2, we introduce the regularised risk and the probabilistic view of kernel learning. In increasing order of generality, we explain multiple kernel learning (MKL, section 2.1), maximum a posteriori estimation (MAP, section 2.2) and marginal likelihood maximisation (MLM, section 2.3). A taxonomy of the mutual relations between the approaches and important special cases is given in section 2.4. Section 3 introduces a general optimisation scheme and section 4 draws a conclusion.

2 Kernel Methods and Kernel Learning
Kernel-based algorithms come in many shapes, however the primary goal is – based on training data $(x_i, y_i)$, $i = 1, \dots, n$, and a parametrised kernel function $k_{\boldsymbol{\theta}}(x, x')$ – to predict the output $y_*$ for unseen inputs $x_*$. Often, linear parametrisations $k_{\boldsymbol{\theta}} = \sum_{j=1}^m \theta_j k_j$ are used, where the $k_j$ are fixed positive definite functions, and $\boldsymbol{\theta} \succeq \mathbf{0}$. Learning the kernel means finding $\boldsymbol{\theta}$ to best support this goal. In general, kernel methods employ a postulated latent function $f$ whose smoothness is controlled via the function space squared norm $\|f\|_{k_{\boldsymbol{\theta}}}^2$. Most often, smoothness is traded against data fit, either enforced by a loss function $\ell(y_i, f(x_i))$ or modeled by a likelihood $P(y_i \mid f(x_i))$. Let us define kernel matrices $K = [k_{\boldsymbol{\theta}}(x_i, x_{i'})]$ and $K_j = [k_j(x_i, x_{i'})]$ in $\mathbb{R}^{n \times n}$, and the vectors $\mathbf{y} = [y_i]$, $\mathbf{u} = [f(x_i)]$, collecting outputs and latent function values, respectively.

The regularised risk route to kernel prediction, which is followed by any support vector machine (SVM) or ridge regression technique, yields

$$\ell(\mathbf{u}) + \tfrac{1}{2}\|f\|_{k_{\boldsymbol{\theta}}}^2, \qquad \ell(\mathbf{u}) := \sum_{i=1}^n \ell(y_i, u_i),$$

as criterion, enforcing smoothness of $f$ as well as good data fit through the loss function $\ell$. By the representer theorem, the minimiser can be written as $f(\cdot) = \sum_i \alpha_i k_{\boldsymbol{\theta}}(\cdot, x_i)$, so that $\mathbf{u} = K\boldsymbol{\alpha}$ (Schölkopf and Smola, 2002). As $\|f\|_{k_{\boldsymbol{\theta}}}^2 = \boldsymbol{\alpha}^\top K \boldsymbol{\alpha} = \mathbf{u}^\top K^{-1}\mathbf{u}$, the regularised risk problem is given by

$$\min_{\mathbf{u}}\; \phi_{\mathrm{RR}}(\mathbf{u}, \boldsymbol{\theta}), \qquad \phi_{\mathrm{RR}} = \ell(\mathbf{u}) + \tfrac{1}{2}\mathbf{u}^\top K^{-1}\mathbf{u}. \tag{1}$$
A probabilistic viewpoint of the same setting is based on the notion of a Gaussian process (GP) (Rasmussen and Williams, 2006): a Gaussian random function $f$ with mean function $0$ and covariance function $k_{\boldsymbol{\theta}}(x, x')$. In practice, we only use finite-dimensional snapshots of the process $f$: for example, $P(\mathbf{u} \mid \boldsymbol{\theta}) = \mathcal{N}(\mathbf{u} \mid \mathbf{0}, K)$, a zero-mean joint Gaussian with covariance matrix $K$. We adopt this GP as prior distribution over $f$, estimating the latent function as maximum of the posterior process $P(f \mid \mathbf{y})$. Since the likelihood $P(\mathbf{y} \mid f)$ depends on $f$ only through the finite subset $\mathbf{u}$, the posterior process has a finite-dimensional representation: the mode is specified by the joint distribution $P(\mathbf{y} \mid \mathbf{u})\, P(\mathbf{u} \mid \boldsymbol{\theta})$, a probabilistic equivalent of the representer theorem. Kernel prediction amounts to maximum a posteriori (MAP) estimation

$$\min_{\mathbf{u}}\; \phi_{\mathrm{MAP}}(\mathbf{u}, \boldsymbol{\theta}), \qquad \phi_{\mathrm{MAP}} = \ell(\mathbf{u}) + \tfrac{1}{2}\mathbf{u}^\top K^{-1}\mathbf{u} + \tfrac{1}{2}\log|2\pi K|, \quad \ell(\mathbf{u}) := -\log P(\mathbf{y} \mid \mathbf{u}), \tag{2}$$
ignoring an additive constant. Minimising equations (1) and (2) for any fixed kernel matrix $K$ gives the same minimiser $\hat{\mathbf{u}}$ and prediction.
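This equivalence is easy to check numerically in the Gaussian case, where both routes reduce to kernel ridge regression. The following sketch is our own illustration (function names and the toy kernel are not from the paper):

```python
import numpy as np

def fit_predict(K, K_star, y, sigma2=0.1):
    # Quadratic loss / Gaussian likelihood: the regularised risk minimiser
    # and the GP posterior mode coincide, alpha = (K + sigma2*I)^{-1} y.
    # Predictions at test inputs are f_star = K_star^T alpha.
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(y)), y)
    return K_star.T @ alpha
```

For non-quadratic losses the minimiser is no longer available in closed form, but the same $\hat{\mathbf{u}}$ still solves both (1) and (2) for fixed $K$.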
The correspondence between likelihood and loss function bridges probabilistic and regularised risk techniques. More specifically, any likelihood induces a loss function via

$$\ell(y_i, u_i) = -\log P(y_i \mid u_i);$$

however, some loss functions cannot be interpreted as a negative log likelihood, as shown in table 2 and as discussed for the SVM by Sollich (2000). If the likelihood is a log-concave function of $u_i$, it corresponds to a convex loss function (Boyd and Vandenberghe, 2002, Sect. 3.5.1). Common loss functions and likelihoods for classification and regression are listed in table 2.
| Loss function | | Likelihood | |
|---|---|---|---|
| SVM hinge loss | $\max(0, 1 - y_i u_i)$ | – | – |
| Log loss | $\log(1 + e^{-y_i u_i})$ | Logistic | $1 / (1 + e^{-y_i u_i})$ |
| SVM $\epsilon$-insensitive loss | $\max(0, |y_i - u_i| - \epsilon)$ | – | – |
| Quadratic loss | $\tfrac{1}{2\sigma^2}(y_i - u_i)^2$ | Gaussian | $\mathcal{N}(y_i \mid u_i, \sigma^2)$ |
| Linear loss | $\tfrac{1}{b}\,|y_i - u_i|$ | Laplace | $\tfrac{1}{2b}\, e^{-|y_i - u_i| / b}$ |
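As a concrete illustration of the correspondence in the table (a sketch with our own function names, not code from the paper), two of the rows read:

```python
import numpy as np

def quadratic_loss(y, u, sigma2=1.0):
    # Negative log of the Gaussian likelihood N(y | u, sigma2); up to the
    # normalisation constant this is the quadratic loss.
    return 0.5 * (y - u) ** 2 / sigma2 + 0.5 * np.log(2 * np.pi * sigma2)

def log_loss(y, u):
    # y in {-1, +1}; negative log of the logistic likelihood
    # P(y | u) = 1 / (1 + exp(-y u)).
    return np.log1p(np.exp(-y * u))
```

The SVM losses in the table have no such likelihood counterpart, as discussed by Sollich (2000).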
In the following, we discuss several approaches to learn the kernel parameters $\boldsymbol{\theta}$ and show how all of them can be understood as instances of, or approximations to, Bayesian evidence maximisation. Although the expositions of MKL (section 2.1) and MAP (section 2.2) use a linear parametrisation $k_{\boldsymbol{\theta}} = \sum_j \theta_j k_j$, most of the results on MLM (section 2.3) and all of the aforementioned discussion remain applicable to non-linear parametrisations.
2.1 Multiple Kernel Learning
A widely adopted regularised risk principle, known as multiple kernel learning (MKL) (Cristianini et al., 2001; Lanckriet et al., 2004; Bach et al., 2004), is to minimise equation (1) w.r.t. the kernel parameters $\boldsymbol{\theta}$ as well. One obvious caveat is that for any fixed $\mathbf{u}$, equation (1) becomes ever smaller as $\theta_j \to \infty$: it cannot per se play a meaningful statistical role. In order to prevent this, researchers constrain the domain of $\boldsymbol{\theta}$ and obtain
$$\min_{\mathbf{u}, \boldsymbol{\theta}}\; \phi_{\mathrm{RR}}(\mathbf{u}, \boldsymbol{\theta}) \quad \text{subject to} \quad \boldsymbol{\theta} \succeq \mathbf{0},\; \|\boldsymbol{\theta}\|_1 \le 1 \;\text{ or }\; \|\boldsymbol{\theta}\|_2 \le 1$$

(Varma and Ray, 2007). Notably, these constraints are imposed independently of the statistical problem, the model and of the parametrisation $k_{\boldsymbol{\theta}}$. The Lagrangian form of the MKL problem with parameter $\lambda > 0$ and a general $p$-norm unit ball constraint, where $p \ge 1$ (Kloft et al., 2009), is given by

$$\min_{\mathbf{u}, \boldsymbol{\theta} \succeq \mathbf{0}}\; \phi_{\mathrm{MKL}}(\mathbf{u}, \boldsymbol{\theta}), \qquad \phi_{\mathrm{MKL}} = \ell(\mathbf{u}) + \tfrac{1}{2}\mathbf{u}^\top K^{-1}\mathbf{u} + \lambda \|\boldsymbol{\theta}\|_p. \tag{3}$$
Since the regulariser $\lambda\|\boldsymbol{\theta}\|_p$ is convex, the map $(\mathbf{u}, \boldsymbol{\theta}) \mapsto \mathbf{u}^\top K^{-1}\mathbf{u}$ is jointly convex for $\boldsymbol{\theta} \succeq \mathbf{0}$ (Boyd and Vandenberghe, 2002) and the parametrisation $K = \sum_j \theta_j K_j$ is linear, MKL is a jointly convex problem in $(\mathbf{u}, \boldsymbol{\theta})$ whenever the loss function is convex. Furthermore, there are efficient algorithms to solve equation (3) for large models (Sonnenburg et al., 2006).
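To make the objects concrete, here is a minimal sketch of evaluating the Lagrangian MKL criterion (3) for the linear parametrisation $K(\boldsymbol{\theta}) = \sum_j \theta_j K_j$; the quadratic loss stands in for a generic $\ell$, and all names are our own:

```python
import numpy as np

def mkl_objective(u, theta, Ks, y, lam=1.0, p=1):
    # phi_MKL = loss(u) + 0.5 * u^T K(theta)^{-1} u + lam * ||theta||_p,
    # with the linear parametrisation K(theta) = sum_j theta_j * K_j.
    K = sum(t * Kj for t, Kj in zip(theta, Ks))
    loss = 0.5 * np.sum((y - u) ** 2)           # quadratic loss as example
    smooth = 0.5 * u @ np.linalg.solve(K, u)    # RKHS term u^T K^{-1} u
    return loss + smooth + lam * np.linalg.norm(theta, p)
```

Note that for fixed $\mathbf{u}$, inflating $\boldsymbol{\theta}$ only shrinks the smoothness term; the $p$-norm penalty is what keeps the problem well posed.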
2.2 Joint MAP Estimation
Adopting a probabilistic MAP viewpoint, we can minimise equation (2) w.r.t. $\mathbf{u}$ and $\boldsymbol{\theta}$:

$$\min_{\mathbf{u}, \boldsymbol{\theta} \succeq \mathbf{0}}\; \phi_{\mathrm{MAP}}(\mathbf{u}, \boldsymbol{\theta}), \qquad \phi_{\mathrm{MAP}} = \ell(\mathbf{u}) + \tfrac{1}{2}\mathbf{u}^\top K^{-1}\mathbf{u} + \tfrac{1}{2}\log|2\pi K|. \tag{4}$$
While equation (3) and equation (4) share the “inner solution” $\hat{\mathbf{u}}$ for fixed $\boldsymbol{\theta}$ – in case the loss corresponds to a likelihood – they are different when it comes to optimising $\boldsymbol{\theta}$. The joint MAP problem is not in general jointly convex in $(\mathbf{u}, \boldsymbol{\theta})$, since $\boldsymbol{\theta} \mapsto \log|K|$ is concave, see figure 2. However, it is always a well-posed statistical procedure, since $\phi_{\mathrm{MAP}} \to \infty$ as $\theta_j \to \infty$ for all $j$.
We show in the following how the regularisers of equation (3) can be related to the probabilistic term $\tfrac{1}{2}\log|2\pi K|$. In fact, the same reasoning can be applied to any concave non-decreasing function.
Since the function $g: \boldsymbol{\theta} \mapsto \log|K|$, $K = \sum_j \theta_j K_j$, is jointly concave, we can represent it by $g(\boldsymbol{\theta}) = \min_{\boldsymbol{\lambda}}\, \boldsymbol{\lambda}^\top\boldsymbol{\theta} - g^*(\boldsymbol{\lambda})$, where $g^*$ denotes the Fenchel dual of $g$. The mapping is indeed jointly concave due to the composition rule (Boyd and Vandenberghe, 2002, §3.2.4), because $\log|\cdot|$ is concave and non-decreasing in all components, as all matrices $K_j$ are positive (semi-)definite, which guarantees that the eigenvalues of $K$ increase as $\theta_j$ increases. Thus we can – similarly to Zhang (2010) – represent $\log|K|$ as a minimum over linear functions of $\boldsymbol{\theta}$. Choosing a particular value $\boldsymbol{\lambda}_0$, we obtain the bound $\log|K| \le \boldsymbol{\lambda}_0^\top\boldsymbol{\theta} - g^*(\boldsymbol{\lambda}_0)$. Figure 1 illustrates the bounds. The bottom line is that one can interpret the regularisers in equation (3) as corresponding to parametrised upper bounds on the $\tfrac{1}{2}\log|2\pi K|$ part in equation (4), hence $\phi_{\mathrm{MAP}} \le \phi_{\mathrm{MKL}} + \text{const}$ for a suitable choice of $\lambda$. Far from an ad hoc choice to keep $\boldsymbol{\theta}$ small, the term $\tfrac{1}{2}\log|2\pi K|$ embodies the Occam’s razor concept behind MAP estimation: overly large $\boldsymbol{\theta}$ are ruled out, since their explanation of the data is extremely unlikely under the prior $P(\mathbf{u} \mid \boldsymbol{\theta})$. The Occam’s razor effect depends crucially on the proper normalisation of the prior (MacKay, 1992). For example, if the loss function is convex, equation (4) is convex in $\mathbf{u}$ for any fixed $\boldsymbol{\theta}$, so the trade-off between data fit and prior can be learned by joint MAP. If kernel-regularised estimation, equation (1), is interpreted as MAP estimation under a GP prior, equation (2), the correct extension to kernel learning is joint MAP: the MKL criterion, equation (3), lacks the prior normalisation which renders MAP w.r.t. $\boldsymbol{\theta}$ meaningful in the first place. From a non-probabilistic viewpoint, the term $\tfrac{1}{2}\log|2\pi K|$ comes with a model and data dependent structure at least as complex as the rest of equation (3).
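The Occam effect of the log-determinant term can be seen directly in a small sketch of the joint MAP criterion (4) (our own code; the quadratic loss is a stand-in for a generic $\ell$): inflating $\boldsymbol{\theta}$ shrinks the quadratic term, but the log-determinant eventually dominates.

```python
import numpy as np

def joint_map_objective(u, theta, Ks, y):
    # phi_MAP = loss(u) + 0.5 * u^T K^{-1} u + 0.5 * log|2 pi K|;
    # the last term is the prior normalisation missing from MKL.
    K = sum(t * Kj for t, Kj in zip(theta, Ks))
    loss = 0.5 * np.sum((y - u) ** 2)
    quad = 0.5 * u @ np.linalg.solve(K, u)
    _, logdet = np.linalg.slogdet(2 * np.pi * K)
    return loss + quad + 0.5 * logdet
```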
While the MKL objective, equation (3), enjoys the benefit of being convex in the (linear) kernel parameters $\boldsymbol{\theta}$, this does not hold true for joint MAP estimation, equation (4), in general. We illustrate the differences in figure 2. More concretely, the MAP objective as a function of $\boldsymbol{\theta}$ is the sum of a nonnegative, jointly convex function that is strictly decreasing in every component $\theta_j$ and a concave function that is strictly increasing in every component $\theta_j$. Neither function alone has a stationary point, due to their monotonicity in $\boldsymbol{\theta}$. However, their sum can have (even multiple) stationary points, as shown in figure 2 on the left. We can show that the map is invex, i.e. every stationary point is a global minimum. Using the convexity of $(\mathbf{u}, \boldsymbol{\theta}) \mapsto \mathbf{u}^\top K^{-1}\mathbf{u}$ (Boyd and Vandenberghe, 2002) and the fact that the derivative of the concave part has full rank, we see by Mishra and Giorgi (2008, theorem 2.1) that the objective is indeed invex.
Often, the MKL objective for the case $p = 1$ is motivated by the fact that the optimal solution $\boldsymbol{\theta}$ is sparse (e.g. Sonnenburg et al., 2006), meaning that many components are zero. Figure 2 illustrates that joint MAP also yields sparse solutions; in fact it enforces even more sparsity. In MKL, the concave part is simply relaxed to a convex objective at the expense of having only a single, less sparse solution.
Intuition for the Gaussian Case
We can gain further intuition about the criteria $\phi_{\mathrm{MAP}}$ and $\phi_{\mathrm{MKL}}$ by asking which matrices $K$ minimise them. For simplicity, assume a quadratic loss with $\sigma^2 \to 0$, hence $\hat{\mathbf{u}} = \mathbf{y}$: the inner minimiser for both criteria. With $K$ unconstrained, we find for joint MAP that minimising $\mathbf{y}^\top K^{-1}\mathbf{y} + \log|K|$ results in $K = \mathbf{y}\mathbf{y}^\top$. While this “nonparametric” estimate requires smoothing to be useful in practice, closeness to $\mathbf{y}\mathbf{y}^\top$ is fundamental to covariance estimation and can be found in regularised risk kernel learning work (Cristianini et al., 2001). On the other hand, the MKL regulariser for $p = 1$ and unconstrained $K$, i.e. minimising $\mathbf{y}^\top K^{-1}\mathbf{y} + \lambda\,\mathrm{tr}\,K$, leads to $K \propto (\mathbf{y}\mathbf{y}^\top)^{1/2}$: an odd way of estimating covariance, not supported by any statistical literature we are aware of.
2.3 Marginal Likelihood Maximisation
While the joint MAP criterion uses a properly normalised prior distribution, it is still not probabilistically consistent. Kernel learning amounts to finding a value of $\boldsymbol{\theta}$ with high data likelihood, no matter what the latent function $\mathbf{u}$ is. The correct likelihood to be maximised is marginal: $P(\mathbf{y} \mid \boldsymbol{\theta}) = \int P(\mathbf{y} \mid \mathbf{u})\, P(\mathbf{u} \mid \boldsymbol{\theta})\, d\mathbf{u}$ (“max-sum”), while joint MAP employs the plug-in surrogate $\max_{\mathbf{u}} P(\mathbf{y} \mid \mathbf{u})\, P(\mathbf{u} \mid \boldsymbol{\theta})$ (“max-max”). Marginal likelihood maximisation (MLM) is also known as empirical Bayesian estimation, and it underlies the EM algorithm or maximum likelihood learning of conditional random fields just as well: complexity is controlled (and overfitting avoided) by averaging over unobserved variables (MacKay, 1992), rather than plugging in some point estimate:

$$\min_{\boldsymbol{\theta} \succeq \mathbf{0}}\; \phi_{\mathrm{MLM}}(\boldsymbol{\theta}), \qquad \phi_{\mathrm{MLM}} = -\log \int P(\mathbf{y} \mid \mathbf{u})\, P(\mathbf{u} \mid \boldsymbol{\theta})\, d\mathbf{u}. \tag{5}$$
The Gaussian Case
Before developing a general MLM approximation, we note an important analytically solvable exception: for the Gaussian likelihood $P(\mathbf{y} \mid \mathbf{u}) = \mathcal{N}(\mathbf{y} \mid \mathbf{u}, \sigma^2 I)$, we have $P(\mathbf{y} \mid \boldsymbol{\theta}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K + \sigma^2 I)$, and MLM becomes

$$\min_{\boldsymbol{\theta} \succeq \mathbf{0}}\; \mathbf{y}^\top (K + \sigma^2 I)^{-1}\mathbf{y} + \log|K + \sigma^2 I|. \tag{6}$$
Even if the primary purpose is classification, the Gaussian likelihood is used for its analytical simplicity (Kapoor et al., 2009). Only for the Gaussian case do joint MAP and MLM have an analytically closed form. From the product formula of Gaussians (Brookes, 2005, §5.1)

$$\mathcal{N}(\mathbf{y} \mid \mathbf{u}, \sigma^2 I)\, \mathcal{N}(\mathbf{u} \mid \mathbf{0}, K) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K + \sigma^2 I)\, \mathcal{N}(\mathbf{u} \mid \mathbf{m}, V),$$

where $V = (K^{-1} + \sigma^{-2} I)^{-1}$ and $\mathbf{m} = \sigma^{-2} V\mathbf{y}$, we can deduce that

$$\min_{\mathbf{u}} \left[ \tfrac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{u}\|^2 + \tfrac{1}{2}\mathbf{u}^\top K^{-1}\mathbf{u} \right] = \tfrac{1}{2}\mathbf{y}^\top (K + \sigma^2 I)^{-1}\mathbf{y}. \tag{7}$$

Using equation (7) and $\log|2\pi K| = \log|2\pi(K + \sigma^2 I)| - \log|I + \sigma^2 K^{-1}|$, we see that by

$$\phi_{\mathrm{MAP}} = \phi_{\mathrm{MLM}} - \tfrac{1}{2}\log|I + \sigma^2 K^{-1}| + \text{const}, \tag{8}$$

MLM and MAP are very similar for the Gaussian case.
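In practice, equation (6) is evaluated stably via a Cholesky factorisation; the following sketch (our own helper, standard GP bookkeeping) shows the computation:

```python
import numpy as np

def gaussian_neg_log_marginal(theta, Ks, y, sigma2=0.1):
    # -log N(y | 0, K(theta) + sigma2*I)
    #   = 0.5 * (y^T C^{-1} y + log|C| + n log 2 pi),  C = K + sigma2*I.
    K = sum(t * Kj for t, Kj in zip(theta, Ks))
    C = K + sigma2 * np.eye(len(y))
    L = np.linalg.cholesky(C)                    # C = L L^T, O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (y @ alpha + logdet + len(y) * np.log(2 * np.pi))
```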
The “ridge regression” approximation – dropping the log-determinant – is also used together with $p$-norm constraints on $\boldsymbol{\theta}$ instead of the $\log|K + \sigma^2 I|$ term (Cortes et al., 2009):

$$\min_{\boldsymbol{\theta} \succeq \mathbf{0},\, \|\boldsymbol{\theta}\|_p \le 1}\; \mathbf{y}^\top (K + \sigma^2 I)^{-1}\mathbf{y}. \tag{9}$$
Unfortunately, most GP methods to date work with a Gaussian likelihood for simplicity, a restriction which often proves short-sighted. Gaussian-linear models come with unrealistic properties, and benefits of MLM over joint MAP cannot be realised.
Kernel parameter learning has been an integral part of probabilistic GP methods from the very beginning. Williams and Rasmussen (1996) proposed MLM for Gaussian noise (equation 6) fifteen years ago. They treated sums of exponential and linear kernels as well as learning lengthscales (ARD), predating recent proposals such as “products of kernels” (Varma and Babu, 2009).
The General Case
In general, joint MAP always has the analytical form of equation (4), while $\phi_{\mathrm{MLM}}$ can only be approximated. For non-Gaussian likelihoods, numerous approximate inference methods have been proposed, specifically motivated by learning kernel parameters via MLM. The simplest such method is Laplace’s approximation, applied to GP binary and multi-way classification by Williams and Barber (1998): starting with convex joint MAP, $\phi_{\mathrm{MAP}}$ is expanded to second order around the posterior mode $\hat{\mathbf{u}}$. More recent approximations (Girolami and Rogers, 2005; Girolami and Zhong, 2006) can be much more accurate, yet come with non-convex problems and less robust algorithms (Nickisch and Rasmussen, 2008). In this paper, we concentrate on the variational lower bound relaxation (VB) by Jaakkola and Jordan (2000), which is convex for log-concave likelihoods (Nickisch and Seeger, 2009), providing a novel simple and efficient algorithm. While our VB approximation to MLM is more expensive to run than joint MAP for non-Gaussian likelihoods (even using Laplace’s approximation), the implementation complexity of our VB algorithm is comparable to what is required in the Gaussian noise case, equation (6).
More specifically, we exploit that super-Gaussian likelihoods can be lower bounded by scaled Gaussians of any width $\gamma_i > 0$:

$$P(y_i \mid u_i) \ge \exp\!\left( b_i u_i - \frac{u_i^2}{2\gamma_i} - \frac{h_i(\gamma_i)}{2} \right),$$

where the $b_i$ are constants, and $\gamma_i \mapsto h_i(\gamma_i)$ is convex (Nickisch and Seeger, 2009) whenever the likelihood is log-concave. If the posterior distribution is $P(\mathbf{u} \mid \mathbf{y}) \propto P(\mathbf{y} \mid \mathbf{u})\, \mathcal{N}(\mathbf{u} \mid \mathbf{0}, K)$, then by plugging in these bounds, $-\log P(\mathbf{y} \mid \boldsymbol{\theta}) \le \phi_{\mathrm{VB}}(\boldsymbol{\theta}, \boldsymbol{\gamma}) + \text{const}$, where

$$\phi_{\mathrm{VB}} = \min_{\mathbf{u}} \left[ \tfrac{1}{2}\mathbf{u}^\top (K^{-1} + \Gamma^{-1})\mathbf{u} - \mathbf{b}^\top\mathbf{u} \right] + \tfrac{1}{2} h(\boldsymbol{\gamma}) + \tfrac{1}{2}\log|I + K\Gamma^{-1}|, \tag{10}$$

with $\Gamma = \mathrm{diag}(\boldsymbol{\gamma})$ and $h(\boldsymbol{\gamma}) = \sum_i h_i(\gamma_i)$. The variational relaxation (generalisations to other super-Gaussian potentials – log-concave or not – or models including linear couplings and mixed potentials are given by Nickisch and Seeger, 2009) amounts to maximising the lower bound, which means that $P(\mathbf{u} \mid \mathbf{y})$ is fitted by the Gaussian approximation $Q(\mathbf{u} \mid \mathbf{y}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}, V)$ with covariance matrix $V = (K^{-1} + \Gamma^{-1})^{-1}$ (Nickisch and Seeger, 2009). Alternatively, we can interpret $\phi_{\mathrm{VB}}$ as an upper bound on the Kullback-Leibler divergence $\mathrm{KL}(Q \parallel P)$ (Nickisch, 2010, §2.5.9), a measure for the dissimilarity between the exact posterior $P(\mathbf{u} \mid \mathbf{y})$ and the parametrised Gaussian approximation $Q(\mathbf{u} \mid \mathbf{y})$. Finally, note that by equation (7), $\phi_{\mathrm{VB}}$ can also be written as

$$\phi_{\mathrm{VB}} = \tfrac{1}{2}\mathbf{s}^\top (K + \Gamma)^{-1}\mathbf{s} - \tfrac{1}{2}\mathbf{s}^\top \Gamma^{-1}\mathbf{s} + \tfrac{1}{2} h(\boldsymbol{\gamma}) + \tfrac{1}{2}\log|I + K\Gamma^{-1}|, \tag{11}$$

where $\mathbf{s} = \Gamma\mathbf{b}$. Using the concavity of $\boldsymbol{\gamma}^{-1} \mapsto \log|K^{-1} + \Gamma^{-1}|$ and Fenchel duality, $\log|K^{-1} + \Gamma^{-1}| = \min_{\mathbf{z} \succeq \mathbf{0}}\, \mathbf{z}^\top\boldsymbol{\gamma}^{-1} - g^*(\mathbf{z})$ with the optimal value $z_i = V_{ii}$, we can reformulate $\phi_{\mathrm{VB}}$, which allows to perform the minimisation w.r.t. $\boldsymbol{\gamma}$ in closed form (Nickisch, 2010, §3.5.6):

$$\phi_{\mathrm{VB}} = \min_{\mathbf{u}, \mathbf{z} \succeq \mathbf{0}} \left[ \ell_{\mathrm{VB}}(\mathbf{u}, \mathbf{z}) + \tfrac{1}{2}\mathbf{u}^\top K^{-1}\mathbf{u} + \tfrac{1}{2}\log|K| - \tfrac{1}{2} g^*(\mathbf{z}) \right], \quad \ell_{\mathrm{VB}}(\mathbf{u}, \mathbf{z}) = \sum_i \min_{\gamma_i > 0} \left[ \frac{u_i^2 + z_i}{2\gamma_i} - b_i u_i + \frac{h_i(\gamma_i)}{2} \right], \tag{12}$$

where we used $\log|I + K\Gamma^{-1}| = \log|K| + \log|K^{-1} + \Gamma^{-1}|$. Note that for $\mathbf{z} = \mathbf{0}$, we exactly recover joint MAP estimation, equation (4), up to an additive constant, as each summand of $\ell_{\mathrm{VB}}$ then reduces to the negative log likelihood, i.e. $\ell_{\mathrm{VB}}(\mathbf{u}, \mathbf{0}) = \ell(\mathbf{u})$. For fixed $\mathbf{u}, \boldsymbol{\theta}$, the optimal value $z_i = V_{ii}$ corresponds to the marginal variances of the Gaussian approximation $Q(\mathbf{u} \mid \mathbf{y})$: variational inference corresponds to variance-smoothed joint MAP estimation (Nickisch, 2010) with a loss function that explicitly depends on the kernel parameters $\boldsymbol{\theta}$. We thus have two equivalent representations of the loss, directly following from equations (11) and (12). Our VB problem is $\min_{\boldsymbol{\theta} \succeq \mathbf{0}} \phi_{\mathrm{VB}}$. The inner variables here are $\boldsymbol{\gamma}$ and $\mathbf{z}$, in addition to $\mathbf{u}$ in joint MAP. There are further similarities: since $(u_i, \gamma_i) \mapsto u_i^2 / \gamma_i$ is jointly convex for $\gamma_i > 0$, $\ell_{\mathrm{VB}}$ is jointly convex by this fact and Prékopa’s theorem (Boyd and Vandenberghe, 2002, §3.5.2): joint MAP and VB share the same convexity structure. In contrast, approximating $\phi_{\mathrm{MLM}}$ by other techniques like Expectation Propagation (Minka, 2001) or general Variational Bayes (Opper and Archambeau, 2009) does not even constitute convex problems for fixed $\boldsymbol{\theta}$.
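For the logistic likelihood, such a scaled-Gaussian lower bound can be written down explicitly; the classic Jaakkola-Jordan bound is one instance. The sketch below (our own function names, for intuition only) verifies the bound property numerically:

```python
import numpy as np

def log_sigmoid(x):
    # log P(y=1 | u) for the logistic likelihood, log(1 / (1 + e^{-x}))
    return -np.log1p(np.exp(-x))

def jj_bound(x, xi):
    # Jaakkola-Jordan (2000) lower bound on log sigmoid(x): for every
    # width parameter xi > 0 it is quadratic in x (a scaled Gaussian in
    # likelihood space) and touches log sigmoid(x) at x = +/- xi.
    lam = np.tanh(xi / 2.0) / (4.0 * xi)
    return log_sigmoid(xi) + (x - xi) / 2.0 - lam * (x ** 2 - xi ** 2)
```

Maximising over the width per site recovers the exact log likelihood; keeping it fixed yields the Gaussian-form surrogate that makes the integral in equation (5) tractable.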
2.4 Summary and Taxonomy
| Name | Objective function |
|---|---|
| Marginal Likelihood Maximisation | $\phi_{\mathrm{MLM}} = -\log \int P(\mathbf{y} \mid \mathbf{u})\, P(\mathbf{u} \mid \boldsymbol{\theta})\, d\mathbf{u}$, equation (5) |
| Variational Bounds | $\phi_{\mathrm{VB}} \ge \phi_{\mathrm{MLM}} + \text{const}$, equations (10)–(12) |
| Maximum A Posteriori | $\phi_{\mathrm{MAP}} = \ell(\mathbf{u}) + \tfrac{1}{2}\mathbf{u}^\top K^{-1}\mathbf{u} + \tfrac{1}{2}\log|2\pi K|$, equation (4) |
| Multiple Kernel Learning | $\phi_{\mathrm{MKL}} = \ell(\mathbf{u}) + \tfrac{1}{2}\mathbf{u}^\top K^{-1}\mathbf{u} + \lambda\|\boldsymbol{\theta}\|_p$, equation (3) |
The upper table visualises the relationship between several kernel learning objective functions for arbitrary likelihood/loss functions: marginal likelihood maximisation (MLM) can be bounded by variational bounds (VB), and maximum a posteriori estimation (MAP) is a special case thereof. Finally, multiple kernel learning (MKL) can be understood as an upper bound to the MAP estimation objective $\phi_{\mathrm{MAP}}$. The lower table complements the upper table by also covering the analytically important Gaussian case.

| Name | Objective function (Gaussian likelihood) |
|---|---|
| Marginal Likelihood Maximisation | $\mathbf{y}^\top (K + \sigma^2 I)^{-1}\mathbf{y} + \log|K + \sigma^2 I|$, equation (6) |
| Maximum A Posteriori | $\phi_{\mathrm{MLM}} - \tfrac{1}{2}\log|I + \sigma^2 K^{-1}| + \text{const}$, equation (8) |
| Ridge Regression | $\mathbf{y}^\top (K + \sigma^2 I)^{-1}\mathbf{y}$, equation (9) |
In the last paragraphs, we have detailed how a variety of kernel learning approaches can be obtained from Bayesian marginal likelihood maximisation in a sequence of nested upper bounding steps. Table 2.4 illustrates how the kernel learning objectives are related to each other – either by upper bounds or by Gaussianity assumptions. We can clearly see that $\phi_{\mathrm{VB}}$ – as an upper bound to the negative log marginal likelihood – can be seen as the mother function. For the special case $\mathbf{z} = \mathbf{0}$, we obtain joint maximum a posteriori estimation, where the loss function does not depend on the kernel parameters. Going further, a particular linear upper bound on the log-determinant yields the widely used multiple kernel learning objective, which becomes convex in the kernel parameters $\boldsymbol{\theta}$. In the following, we will concentrate on the optimisation and computational similarities between the approaches.
3 Algorithms
In this section, we derive a simple, provably convergent and efficient algorithm for MKL, joint MAP and VB. We use the Lagrangian form of equation (3) and the joint MAP objective, equation (4). Many previous algorithms use alternating minimisation, which is easy to implement but tends to converge slowly. Both $\phi_{\mathrm{MKL}}$ and $\phi_{\mathrm{MAP}}$ are jointly convex up to the concave $\tfrac{1}{2}\log|K|$ part. Since $\log|K| = \min_{\boldsymbol{\lambda}}\, \boldsymbol{\lambda}^\top\boldsymbol{\theta} - g^*(\boldsymbol{\lambda})$ (Legendre duality, Boyd and Vandenberghe, 2002), joint MAP becomes $\min_{\boldsymbol{\lambda}} \min_{\mathbf{u}, \boldsymbol{\theta}} \phi_{\boldsymbol{\lambda}}$ with $\phi_{\boldsymbol{\lambda}} = \ell(\mathbf{u}) + \tfrac{1}{2}\mathbf{u}^\top K^{-1}\mathbf{u} + \tfrac{1}{2}(\boldsymbol{\lambda}^\top\boldsymbol{\theta} - g^*(\boldsymbol{\lambda}))$, which is jointly convex in $(\mathbf{u}, \boldsymbol{\theta})$. Algorithm 1 iterates between refits of $\boldsymbol{\lambda}$ and joint Newton updates of $(\mathbf{u}, \boldsymbol{\theta})$.
The Newton direction costs $\mathcal{O}((n + m)^3)$, with $n$ the number of data points and $m$ the number of base kernels. All algorithms discussed in this paper require $\mathcal{O}(n^3)$ time, apart from the $\mathcal{O}(m n^2)$ memory requirement to store the base matrices $K_j$. The convergence proof hinges on the fact that $\log|K|$ and its linear upper bound are tangentially equal at the refit point (Nickisch and Seeger, 2009). Equivalently, the algorithm can be understood as Newton’s method, yet dropping the part of the Hessian corresponding to the $\tfrac{1}{2}\log|K|$ term ($\boldsymbol{\lambda}$ is kept fixed for the Newton direction computation). For MKL, the method amounts to exact Newton.
In practice, we add a small ridge $\varepsilon I$ to $K$ to avoid numerical problems when computing $K^{-1}\mathbf{u}$ and $\log|K|$. We also have to enforce $\boldsymbol{\theta} \succeq \mathbf{0}$ in algorithm 1, which is done by the barrier method (Boyd and Vandenberghe, 2002). We minimise $\phi - \tfrac{1}{t}\sum_j \log\theta_j$ instead of $\phi$, increasing $t$ every few outer loop iterations.
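The barrier handling can be sketched as follows; this is a simplified stand-in where plain gradient descent with numerical derivatives replaces the joint “almost Newton” step of algorithm 1, and `phi` is any smooth criterion in $\boldsymbol{\theta}$ (all names are our own):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    # Central finite-difference gradient (stand-in for analytic derivatives).
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def minimise_with_barrier(phi, theta0, t=1.0, mu=10.0, outer=5, inner=200, lr=0.05):
    # Enforce theta > 0 via a log-barrier: minimise
    #   phi_t(theta) = phi(theta) - (1/t) * sum(log(theta)),
    # increasing t (tightening the barrier) after every outer iteration.
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(outer):
        f = lambda th, t=t: phi(th) - np.sum(np.log(th)) / t
        for _ in range(inner):
            g = num_grad(f, theta)
            step = lr
            new = theta - step * g
            while np.any(new <= 0):   # backtrack to stay strictly feasible
                step /= 2.0
                new = theta - step * g
            theta = new
        t *= mu
    return theta
```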
A variant of algorithm 1 can be used to solve VB in a different parametrisation, which has the same convexity structure as joint MAP. Transforming equation (10) similarly to equation (6), we obtain

$$\phi_{\mathrm{VB}} = \tfrac{1}{2}\mathbf{s}^\top (K + \Gamma)^{-1}\mathbf{s} + \tfrac{1}{2}\log|K + \Gamma| - \tfrac{1}{2}\log|\Gamma| - \tfrac{1}{2}\mathbf{s}^\top\Gamma^{-1}\mathbf{s} + \tfrac{1}{2} h(\boldsymbol{\gamma}), \tag{13}$$

with the marginal variances $z_i$ computed using the Cholesky factorisation $K + \Gamma = L L^\top$. They cost $\mathcal{O}(n^3)$ to compute, which is more expensive than for joint MAP or MKL. Note that the cost is not specific to our particular relaxation or algorithm: e.g. the Laplace MLM approximation (Williams and Barber, 1998), solved using gradients w.r.t. $\boldsymbol{\theta}$ only, comes with the same complexity.
4 Conclusion
We presented a unifying probabilistic viewpoint to multiple kernel learning that derives regularised risk approaches as special cases of approximate Bayesian inference. We provided an efficient and provably convergent optimisation algorithm suitable for regression, robust regression and classification.
Our taxonomy of multiple kernel learning approaches connected many previously only loosely related ideas and provided insights into the common structure of the respective optimisation problems. Finally, we proposed an algorithm to solve the latter efficiently.
References
- Bach et al. (2004) Francis Bach, Gert Lanckriet, and Michael Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
- Boyd and Vandenberghe (2002) Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2002.
- Brookes (2005) Mike Brookes. The matrix reference manual, 2005. URL http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html.
- Cristianini et al. (2001) Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz Kandola. On kernel-target alignment. In NIPS, 2001.
- Cortes et al. (2009) Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. L2 regularization for learning kernels. In UAI, 2009.
- Girolami and Rogers (2005) Mark Girolami and Simon Rogers. Hierarchic Bayesian models for kernel learning. In ICML, 2005.
- Girolami and Zhong (2006) Mark Girolami and Mingjun Zhong. Data integration for classification problems employing Gaussian process. In NIPS, 2006.
- Jaakkola and Jordan (2000) Tomi Jaakkola and Michael Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.
- Kapoor et al. (2009) Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Gaussian processes for object categorization. IJCV, 2009. doi: 10.1007/s11263-009-0268-3.
- Kloft et al. (2009) Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Pavel Laskov, Klaus-Robert Müller, and Alexander Zien. Efficient and accurate lp-norm multiple kernel learning. In NIPS, 2009.
- Lanckriet et al. (2004) Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72, 2004.
- MacKay (1992) David MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
- Minka (2001) Tom Minka. Expectation propagation for approximate Bayesian inference. In UAI, 2001.
- Mishra and Giorgi (2008) Shashi Kant Mishra and Giorgio Giorgi. Invexity and optimization. Springer, 2008.
- Nickisch (2010) Hannes Nickisch. Bayesian Inference and Experimental Design for Large Generalised Linear Models. PhD thesis, TU Berlin, 2010.
- Nickisch and Rasmussen (2008) Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary Gaussian process classification. JMLR, 9:2035–2078, 2008.
- Nickisch and Seeger (2009) Hannes Nickisch and Matthias Seeger. Convex variational Bayesian inference for large scale generalized linear models. In ICML, 2009.
- Opper and Archambeau (2009) Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.
- Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
- Schölkopf and Smola (2002) Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, 1st edition, 2002.
- Sollich (2000) Peter Sollich. Probabilistic methods for support vector machines. In NIPS, 2000.
- Sonnenburg et al. (2006) Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. JMLR, 7:1531–1565, 2006.
- Varma and Babu (2009) Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learning. In ICML, 2009.
- Varma and Ray (2007) Manik Varma and Debajyoti Ray. Learning the discriminative power-invariance trade-off. In ICCV, 2007.
- Williams and Barber (1998) Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
- Williams and Rasmussen (1996) Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In NIPS, 1996.
- Zhang (2010) Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. JMLR, 11:1081–1107, 2010.