1 Introduction
The kernel trick is well known in machine learning and provides an elegant means of encapsulating a combined feature map (a nonlinear map $\varphi$ from input space into feature space) and inner product into a simple function that effectively hides the feature map: starting from a positive definite kernel $K$ one can prove that there exists an associated (implicit) feature map $\varphi$ such that $K(\mathbf{x}, \mathbf{x}') = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x}') \rangle$, all without knowing (or needing to know) what form $\varphi$ actually takes. This allows the (implicit) use of feature maps of essentially arbitrary complexity at little or no additional computational cost.
While elegant, the kernel trick is not magic. Typically the kernel is selected from a set of "standard" kernels to minimise cross-fold error, test-set error, log-likelihood or similar. In so doing one is essentially picking a feature map from a bag of "standard" feature maps. The result of this process is a feature map (not known but implicitly defined by a kernel) that may, barring an extremely unlikely perfect-match scenario, be viewed as the least-worst of the available maps. Techniques such as hyperparameter tuning, multi-kernel learning [10; 1] etc. aim to improve on this situation by fine-tuning the kernel (and hence the implicit feature map) or combining kernels (and hence combining the feature maps implicit in them). However one is still limited in the space of reachable feature maps, and there is no clear interpretation of such techniques from a feature-space perspective.
Our motivation is as follows: suppose the "best" kernel for a given dataset, found using standard techniques, has associated with it an (implicit) feature map $\varphi$. As the least-worst option, this map will have many individual features in it that are highly relevant to the problem at hand, which we would like to emphasise, but also some features that are either not relevant or actively misleading, which we would like to suppress or remove. This implies three questions: (a) how can we identify the (ir)relevant features, (b) how can we amplify (or suppress) features to obtain a better feature map, and (c) how can we do steps (a) and (b) without direct access to the feature map (whose existence we infer from the positive definiteness of the kernel but whose form we do not know)?
To address question (a) we use standard machine-learning techniques. If we apply, for example, a support vector machine (SVM) [2] method (or similar) equipped with kernel $K$ to learn either from dataset $\mathbb{D}$ or some related dataset, then the answer we obtain takes the form of the representation of a weight vector $\mathbf{w}$ in feature space. Assuming a reasonable "fit", the weights will be larger in magnitude for relevant features and smaller for irrelevant ones.
To address questions (b) and (c) we borrow concepts from reproducing kernel Banach space (RKBS) theory [4; 21; 6] and $\ell_p$-norm regularisation [16; 15], in particular $m$-kernels (tensor kernels [15], moment functions [4]), and prove that it is possible to adjust the magnitudes of individual features without explicitly knowing the feature map. We show that, if $K_4$ is a $4$-kernel with implied feature map $\varphi$, and $\boldsymbol{\alpha}$, $\mathbf{x}_1, \ldots, \mathbf{x}_N$ implicitly define a weight vector $\mathbf{w}$ in feature space (with larger $|w_k|$ implying greater relevance of feature $k$), then we may perform a kernel re-weighting operation:

$\tilde{K}(\mathbf{x}, \mathbf{x}') = \sum_{i,j} \alpha_i \alpha_j K_4(\mathbf{x}, \mathbf{x}', \mathbf{x}_i, \mathbf{x}_j)$

that converts the kernel (whose implied feature map contains both relevant and irrelevant features) into a kernel $\tilde{K}$ with implicit features that emphasise relevant features (larger $|w_k|$) and suppress irrelevant or misleading ones (smaller $|w_k|$), as shown in figure 1. That is, we may pre-tune (re-weight) our kernel to "fit" a particular problem, adjusting the implicit feature map in a principled manner without any explicit (direct) knowledge of the feature map itself.
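The effect of re-weighting can be previewed in a toy setting where, unlike in our method, the feature map is known explicitly. The sketch below is ours for illustration only: the roles of "informative" and "noise" features are assumed, and it simply shows how per-feature weights change the similarity a kernel reports:

```python
def weighted_kernel(x, xp, gamma):
    # K(x, x') = sum_k gamma_k * phi_k(x) * phi_k(x'), with phi the identity here.
    return sum(g * a * b for g, a, b in zip(gamma, x, xp))

# Feature 0 is informative, feature 1 is noise (assumed roles, for illustration).
x  = [1.0, 5.0]
xp = [1.0, -5.0]

gamma_flat = [1.0, 1.0]     # untuned kernel: the noise feature dominates
gamma_tuned = [4.0, 0.01]   # re-weighted: emphasise feature 0, suppress feature 1

k_flat = weighted_kernel(x, xp, gamma_flat)    # 1 - 25 = -24.0
k_tuned = weighted_kernel(x, xp, gamma_tuned)  # 4 - 0.25 = 3.75
```

With flat weights the two points look dissimilar because of the noise feature; after re-weighting, the informative feature dominates and the reported similarity becomes positive. Our method achieves exactly this rescaling, but implicitly, without ever evaluating the feature map.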
To achieve this, in section 2.1 we describe $m$-kernels (tensor kernels [15; 16], moment functions [4]) formally, and then in section 3 introduce a new concept, free kernels, which are families of kernels embodying the same underlying feature map over a range of orders $m$ (examples are given in table 1). We then formally prove in theorem 1 how kernels may be re-weighted to emphasise/suppress features, and in section 4 develop an algorithm (algorithm 1) that utilises kernel re-weighting to tune kernels by suppressing irrelevant features and emphasising important ones.
We demonstrate our method on accelerated Bayesian Optimisation (BO [18]). By pre-tuning the covariance function (kernel) of the Gaussian Process (GP [14]) using auxiliary data we show a speed-up in convergence due to better modelling of the function being optimised. We consider (1) new short polymer fibre design using microfluidic devices (where auxiliary data is generated by an older, slightly different device) and (2) design of a new hybrid aluminium alloy (where auxiliary data is based on 46 existing patents for aluminium 6000, 7000 and 2000 series). In both cases kernel pre-tuning results in superior performance.
Our main contributions are: the concept of free kernels (section 3); the kernel re-weighting result (theorem 1); and the kernel pre-tuning algorithm (algorithm 1), demonstrated on accelerated Bayesian optimisation (section 5).
1.1 Notation
Sets are written $\mathbb{A}, \mathbb{B}, \ldots$, with $\mathbb{N} = \{ 0, 1, 2, \ldots \}$ and $\mathbb{R}$ the natural and real numbers. Column vectors are bold lower case $\mathbf{a}$. Matrices are bold upper case $\mathbf{A}$. Element $i$ of vector $\mathbf{a}$ is $a_i$. Element $i, j$ of matrix $\mathbf{A}$ is $A_{ij}$. $\mathbf{a}^{\mathsf{T}}$ is the transpose, $\odot$ the elementwise product, $\mathbf{a}^{\odot k}$ the elementwise power, $|\mathbf{a}|$ the elementwise absolute value, $\operatorname{sgn}(\mathbf{a})$ the elementwise sign, and $\| \mathbf{a} \|_p$ the $\ell_p$-norm. $\mathbf{1}$ is a vector of $1$s and $\mathbf{0}$ a vector of $0$s. The Kronecker product is $\mathbf{a} \otimes \mathbf{b}$. The $m$-inner-product is $\langle \mathbf{a}_1, \ldots, \mathbf{a}_m \rangle_m = \sum_j a_{1j} a_{2j} \cdots a_{mj}$ [5; 15; 16].
2 Problem Statement and Background
The kernel trick is well known in machine learning. A (Mercer) kernel is a function $K : \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ for which there exists a corresponding feature map $\varphi : \mathbb{X} \to \mathbb{R}^d$ such that, $\forall \mathbf{x}, \mathbf{x}' \in \mathbb{X}$:

$K(\mathbf{x}, \mathbf{x}') = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x}') \rangle$ (1)

Geometrically, if $\theta$ is the angle between $\varphi(\mathbf{x})$ and $\varphi(\mathbf{x}')$ in feature space:

$K(\mathbf{x}, \mathbf{x}') = \| \varphi(\mathbf{x}) \| \, \| \varphi(\mathbf{x}') \| \cos \theta$ (2)

so $K(\mathbf{x}, \mathbf{x}')$ is a measure of the similarity of $\mathbf{x}$ and $\mathbf{x}'$ in terms of their alignment in feature space. For a normalised kernel such as the RBF kernel this simplifies to $K(\mathbf{x}, \mathbf{x}') = \cos \theta$.
In practice a kernel is usually selected from a set of well-known kernels (e.g. the polynomial kernel $K(\mathbf{x}, \mathbf{x}') = (1 + \langle \mathbf{x}, \mathbf{x}' \rangle)^q$ or an RBF kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\frac{1}{2\sigma^2} \| \mathbf{x} - \mathbf{x}' \|^2)$), possibly with additional hyperparameter tuning and/or consideration of kernel combinations (e.g. multi-kernel learning [10; 1]). However the resulting kernel (and the feature map implied by it) may still be viewed as a "least-worst fit" from a set of readily available feature maps.
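The kernel trick of (1) can be checked directly against an explicit feature map. This small illustration is ours (using the homogeneous quadratic kernel $K(\mathbf{x}, \mathbf{x}') = \langle \mathbf{x}, \mathbf{x}' \rangle^2$, whose feature map of degree-2 monomials is known in closed form):

```python
def poly2_kernel(x, xp):
    # Homogeneous quadratic kernel K(x, x') = <x, x'>^2.
    return sum(a * b for a, b in zip(x, xp)) ** 2

def poly2_features(x):
    # Explicit feature map: all degree-2 monomials x_i * x_j.
    return [x[i] * x[j] for i in range(len(x)) for j in range(len(x))]

x, xp = [1.0, 2.0, -1.0], [0.5, -1.0, 3.0]
lhs = poly2_kernel(x, xp)  # kernel trick: one cheap evaluation
rhs = sum(f * g for f, g in zip(poly2_features(x), poly2_features(xp)))
```

The two values agree, but the kernel evaluation never materialises the (here 9-dimensional, in general much larger) feature vector.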
In the present paper we show how the feature map implied by a given kernel may be tuned, in a principled manner, to make it better fit the data. We use techniques from reproducing kernel Banach spaces [4; 21; 6] and $\ell_p$-norm regularisation [16; 15] to show how kernels can be pre-trained or re-weighted by scaling the features of the feature map $\varphi$ embodied by a kernel $K$:

$K(\mathbf{x}, \mathbf{x}') = \sum_k \gamma_k \varphi_k(\mathbf{x}) \varphi_k(\mathbf{x}') \;\; \longrightarrow \;\; \tilde{K}(\mathbf{x}, \mathbf{x}') = \sum_k \tilde{\gamma}_k \varphi_k(\mathbf{x}) \varphi_k(\mathbf{x}')$ (3)
to emphasise important features (in feature space) over unimportant ones. The geometry of this operation is shown in figure 1: important features can be emphasised or amplified, while irrelevant or misleading features are de-emphasised.
2.1 Kernels
A number of generalisations of the basic concept of kernel functions arise in generalised-norm SVMs [12; 13], reproducing kernel Banach space (RKBS) theory [4; 21; 6] and $\ell_p$-norm regularisation [16; 15]. Of interest here is the $m$-kernel (tensor kernel [16], moment function [4]), which is a function $K : \mathbb{X}^m \to \mathbb{R}$ for which there exists an (unknown) feature map $\varphi : \mathbb{X} \to \mathbb{R}^d$ such that:

$K(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m) = \sum_k \varphi_k(\mathbf{x}_1) \varphi_k(\mathbf{x}_2) \cdots \varphi_k(\mathbf{x}_m)$ (4)

(so Mercer kernels are $2$-kernels). Discussion of the properties of $m$-kernels may be found in [15; 16; 4]. Examples of $m$-kernels include:

Inner-product $m$-kernels: by analogy with inner-product Mercer kernels it may be shown [15] that, given $f$ expandable as a Taylor series $f(\chi) = \sum_k f_k \chi^k$, the function:

$K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = f(\langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m)$ (5)

is an $m$-kernel if and only if all terms $f_k$ in the series are non-negative.

Direct-product $m$-kernels: similarly, for any Taylor-expandable $f(\chi) = \sum_k f_k \chi^k$ with non-negative terms $f_k \geq 0$, the function:

$K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \prod_j f(x_{1j} x_{2j} \cdots x_{mj})$ (6)

is an $m$-kernel (a special case of a Taylor kernel [15]).
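The expansion (4) can be checked numerically for such a kernel. The check below is our own (the choice $f = \sinh$ and the truncation depth are arbitrary): a direct-product $m$-kernel is matched term-by-term against its explicit feature map, whose features are monomials indexed by multi-indices $k$ with weights built from the Taylor coefficients $f_k$:

```python
import itertools
import math

TERMS = 12  # truncation depth for the Taylor series of sinh (arbitrary)

def direct_product_kernel(xs):
    # Direct-product m-kernel (6) with f = sinh:
    # K(x_1,...,x_m) = prod_j sinh(x_1j * x_2j * ... * x_mj).
    n = len(xs[0])
    return math.prod(math.sinh(math.prod(x[j] for x in xs)) for j in range(n))

def kernel_via_features(xs):
    # The same kernel written in the form (4): a sum over multi-indices k of
    # gamma_k * phi_k(x_1) * ... * phi_k(x_m), with phi_k(x) = prod_j x_j^{k_j}
    # and gamma_k = prod_j f_{k_j}, where f_k = 1/k! for odd k and 0 otherwise.
    n = len(xs[0])
    coef = [1 / math.factorial(k) if k % 2 == 1 else 0.0 for k in range(TERMS)]
    total = 0.0
    for k in itertools.product(range(TERMS), repeat=n):
        gamma = math.prod(coef[kj] for kj in k)
        if gamma == 0.0:
            continue
        total += gamma * math.prod(
            math.prod(x[j] ** k[j] for j in range(n)) for x in xs)
    return total

xs = ([0.3, 0.5], [0.2, -0.4], [0.6, 0.1])  # m = 3 points in R^2
exact = direct_product_kernel(xs)
series = kernel_via_features(xs)
```

Up to (rapidly vanishing) truncation error the two evaluations agree, confirming that an explicit, if combinatorially large, feature map underlies the compact kernel expression.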
A sample of $m$-kernels is presented in table 1. The generalised RBF kernel in this table is constructed from the exponential kernel, and reduces to the standard RBF kernel when $m = 2$. The implied feature map for this kernel is built from the feature map of the exponential kernel and, notably, is independent of $m$.
Table 1: sample $m$-kernels, each an inner-product $m$-kernel $K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = f(\langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m)$ for the stated generator $f$:
Linear: $f(\chi) = \chi$
Polynomial order $q$: $f(\chi) = (1 + \chi)^q$
Hyperbolic sine: $f(\chi) = \sinh(\chi)$
Exponential: $f(\chi) = \exp(\chi)$
Inverse Gudermannian: $f(\chi) = \operatorname{gd}^{-1}(\chi)$
Log ratio: $f(\chi) = \ln \frac{1 + \chi}{1 - \chi}$ (assuming $|\chi| < 1$)
RBF: the exponential kernel, normalised
2.2 $\ell_p$-Norm Support Vector Machines
A canonical application of $m$-kernels is the $\ell_p$-norm support vector machine (SVM) ($\ell_p$-SVM [16], max-margin moment classifier [4]). Let $\mathbb{D} = \{ (\mathbf{x}_i, y_i) : i = 1, 2, \ldots, N \}$ be a training set. Following [16] the aim is to find a trained machine, sparse in $\mathbf{w}$:

$g(\mathbf{x}) = \langle \mathbf{w}, \varphi(\mathbf{x}) \rangle + b$ (7)

(where $\varphi$ is implied by an $m$-kernel $K$) to fit the data. $\mathbf{w}$ and $b$ are found by solving the $\ell_p$-norm SVM training problem, where $p$ is dual to $m$ (i.e. $\frac{1}{p} + \frac{1}{m} = 1$):
$\min_{\mathbf{w}, b} \; r(\| \mathbf{w} \|_p) + C \sum_i E(g(\mathbf{x}_i), y_i)$ (8)

where $r$ is strictly monotonically increasing, $E$ is an arbitrary empirical risk function, and the use of $\ell_p$-norm regularisation with $p \in (1, 2]$ encourages sparsity in $\mathbf{w}$ in feature space. Following [16; 15]:

$w_k = \Big( \sum_i \alpha_i \varphi_k(\mathbf{x}_i) \Big)^{m-1}$ (9)

(representor theorem) and hence:

$g(\mathbf{x}) = \sum_{i_1, \ldots, i_{m-1}} \alpha_{i_1} \cdots \alpha_{i_{m-1}} K(\mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_{m-1}}, \mathbf{x}) + b$ (10)
where $K$ is an $m$-kernel with implied feature map $\varphi$ (the $m$-kernel trick). Moreover we may completely suppress $\mathbf{w}$ and construct a dual training problem entirely in terms of $\boldsymbol{\alpha}$ [15; 16]: e.g. if $m = 2$ and $E(g(\mathbf{x}_i), y_i) = \frac{1}{2} (g(\mathbf{x}_i) - y_i)^2$ (ridge regression), the dual training problem is:

$\min_{\boldsymbol{\alpha}} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) - \sum_i \alpha_i y_i + \frac{1}{2C} \sum_i \alpha_i^2$ (11)

where $g(\mathbf{x}) = \sum_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b$. Similar results, analogous to the "standard" SVMs (e.g. binary classification), may likewise be constructed [15; 16; 4].
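For $m = 2$ with the squared risk, the dual above is ordinary kernel ridge regression. The following is a minimal pure-Python sketch of ours under those assumptions (bias omitted for brevity; the data, kernel width and $C$ are illustrative, not from the paper):

```python
import math

def rbf2(x, xp, sigma=1.0):
    # Standard 2-kernel RBF: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-d2 / (2 * sigma ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting (small dense systems only).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

X = [[0.0], [0.5], [1.0]]   # training inputs
y = [0.0, 1.0, 0.0]         # training targets
C = 1e6                     # large C: near-interpolation

# Dual solution: alpha = (K + I/C)^{-1} y.
K = [[rbf2(a, b) for b in X] for a in X]
alpha = solve([[K[i][j] + (1.0 / C) * (i == j) for j in range(3)]
               for i in range(3)], y)

def g(x):
    # Trained machine g(x) = sum_i alpha_i K(x_i, x)  (bias omitted).
    return sum(a * rbf2(xi, x) for a, xi in zip(alpha, X))
```

With a large $C$ the trained machine nearly interpolates the data, matching the role of (10)-(11) in the $m = 2$ special case.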
3 Making Kernels from m-Kernels: Free Kernels and Kernel Re-Weighting
In the present context we wish to leverage the additional expressive power of $m$-kernels to directly tune the feature map to suit the problem (or problems) at hand. In particular we will demonstrate how kernels may be pre-tuned or learnt for a particular dataset and then transferred to other problems in related domains. We begin with the following definition:
Definition 1 (Free kernels)
Let $M \subseteq \{ 2, 3, \ldots \}$. A free kernel (of order $M$) is a family of functions $\{ K_m : m \in M \}$ indexed by $m \in M$ for which there exists an (unweighted) feature map $\varphi$ and feature weights $\boldsymbol{\gamma}$ ($\gamma_k > 0$), both independent of $m$, such that $\forall m \in M$:

$K_m(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m) = \sum_k \gamma_k \varphi_k(\mathbf{x}_1) \varphi_k(\mathbf{x}_2) \cdots \varphi_k(\mathbf{x}_m)$ (12)

For fixed $m \in M$ a free kernel of order $M$ defines (is) an $m$-kernel with implied feature map:

$\varphi_{m,k}(\mathbf{x}) = \gamma_k^{1/m} \varphi_k(\mathbf{x})$ (13)
We assume free kernels of order $M = \{ 2, 3, \ldots \}$ throughout unless otherwise specified. Note that the inner-product and direct-product $m$-kernels are free kernels (it is straightforward to show that their implied feature maps and feature weights, which follow from the Taylor coefficients $f_k$, are independent of $m$). It follows that all of the kernels in table 1 are free kernels.^{1} Given a free kernel of order $M$ we have the following key theorem that enables us to re-weight or tune the kernel:
^{1}The RBF kernel has unweighted feature map and weights derived from those of the exponential kernel from which it is constructed by normalisation.
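The "free" property can be checked numerically. The check below is our own (scalar inputs, truncated series): the exponential kernel of table 1 is reproduced for every order $m$ by the same weights $\gamma_k = 1/k!$ and features $\varphi_k(x) = x^k$:

```python
import math

TERMS = 40  # series truncation (arbitrary; the series converges rapidly here)

def exp_free_kernel(xs):
    # Exponential free kernel on scalars, written via its expansion (12):
    # K_m(x_1,...,x_m) = sum_k gamma_k phi_k(x_1)...phi_k(x_m)
    # with gamma_k = 1/k! and phi_k(x) = x^k, both independent of m.
    return sum(math.prod(x ** k for x in xs) / math.factorial(k)
               for k in range(TERMS))

k2 = exp_free_kernel([0.7, -0.3])        # m = 2: should equal exp(0.7 * -0.3)
k3 = exp_free_kernel([0.7, -0.3, 0.5])   # m = 3: should equal exp(0.7 * -0.3 * 0.5)
```

The same $(\boldsymbol{\gamma}, \varphi)$ pair recovers the closed-form kernel at every order, which is exactly what definition 1 requires.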
Theorem 1
Let $\{ K_m : m \in M \}$ be a free kernel of order $M$ with implied feature map $\varphi$ and feature weights $\boldsymbol{\gamma}$; and let $\boldsymbol{\alpha} \in \mathbb{R}^N$ and $\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathbb{X}$. Then the family $\{ \tilde{K}_m \}$ defined by:

$\tilde{K}_m(\mathbf{x}'_1, \ldots, \mathbf{x}'_m) = \sum_{i,j} \alpha_i \alpha_j K_{m+2}(\mathbf{x}'_1, \ldots, \mathbf{x}'_m, \mathbf{x}_i, \mathbf{x}_j)$ (14)

defines a free kernel of order $\{ m : m + 2 \in M \}$ with unweighted feature map $\varphi$ and weights $\tilde{\boldsymbol{\gamma}}$, where:

$\tilde{\gamma}_k = \gamma_k w_k^2, \qquad w_k = \sum_i \alpha_i \varphi_k(\mathbf{x}_i)$

and we note that $\mathbf{w}$ has the form of the representation (9) of the weight vector in an $\ell_p$-norm SVM, $m = 2$.
Proof: Substituting the expansion (12) of $K_{m+2}$ into (14) gives $\tilde{K}_m(\mathbf{x}'_1, \ldots, \mathbf{x}'_m) = \sum_k \gamma_k \big( \sum_i \alpha_i \varphi_k(\mathbf{x}_i) \big)^2 \varphi_k(\mathbf{x}'_1) \cdots \varphi_k(\mathbf{x}'_m)$, which has the form (12) with weights $\tilde{\gamma}_k = \gamma_k w_k^2 \geq 0$. $\square$
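The identity behind theorem 1 can be verified numerically. This is our own sketch, using the scalar exponential free kernel with a truncated series; $\boldsymbol{\alpha}$ and the anchor points are arbitrary. Contracting $K_4$ against $\boldsymbol{\alpha}$ matches the explicit re-weighted expansion with $\tilde{\gamma}_k = \gamma_k w_k^2$:

```python
import math

TERMS = 40  # truncation depth of the exponential series (arbitrary)

def K(xs):
    # Exponential free kernel of any order m, via its expansion:
    # K_m(x_1,...,x_m) = sum_k (x_1 ... x_m)^k / k!.
    return sum(math.prod(x ** k for x in xs) / math.factorial(k)
               for k in range(TERMS))

anchors = [0.4, -0.2, 0.1]   # points x_i defining the re-weighting
alpha = [1.0, 0.5, -0.8]     # arbitrary coefficients

def K_tilde(x, xp):
    # Re-weighted 2-kernel as in theorem 1: sum_{i,j} a_i a_j K_4(x, x', x_i, x_j).
    return sum(ai * aj * K([x, xp, xi, xj])
               for ai, xi in zip(alpha, anchors)
               for aj, xj in zip(alpha, anchors))

def K_tilde_explicit(x, xp):
    # The same kernel via the re-weighted feature weights gamma_k * w_k^2,
    # where w_k = sum_i alpha_i phi_k(x_i) and phi_k(x) = x^k.
    total = 0.0
    for k in range(TERMS):
        w_k = sum(ai * xi ** k for ai, xi in zip(alpha, anchors))
        total += (1.0 / math.factorial(k)) * w_k ** 2 * x ** k * xp ** k
    return total

lhs = K_tilde(0.6, -0.5)
rhs = K_tilde_explicit(0.6, -0.5)
```

Both routes produce the same value: the left-hand side only ever calls the kernel, while the right-hand side uses the (here known) feature map, illustrating that the re-weighting is performed entirely in kernel space.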
4 The Kernel PreTuning Algorithm
Having established our theoretical results we arrive at the core of our method: an algorithm for tuning kernels using re-weighting (theorem 1) to fit a dataset. Our algorithm is detailed in algorithm 1 and illustrated in figure 2. It is assumed that we are given a dataset $\mathbb{D}$ from which to infer feature relevance, and a free kernel $\{ K_m \}$. Then, assuming $m = 2$ for simplicity, we proceed as follows:

1. The free kernel defines a 2-kernel $K_2$ for $m = 2$, implying feature map $\varphi_2$ by (13).

2. Train an SVM using $K_2$ and $\mathbb{D}$ to obtain $\boldsymbol{\alpha}$, and hence $\mathbf{w}$ (with $w_k = \sum_i \alpha_i \varphi_k(\mathbf{x}_i)$), implying weights $\tilde{\gamma}_k = \gamma_k w_k^2$ in feature space by theorem 1; the re-weighted kernel $\tilde{K}_2(\mathbf{x}, \mathbf{x}') = \sum_{i,j} \alpha_i \alpha_j K_4(\mathbf{x}, \mathbf{x}', \mathbf{x}_i, \mathbf{x}_j)$ then emphasises the features the SVM found relevant.
In the more general case $m > 2$, an $\ell_p$-norm SVM ($p = \frac{m}{m-1}$) generates a sparse weight vector $\mathbf{w}$, but the concept is the same. Note that at no point in this process do we need to explicitly know the implied feature map or weights: all work is done entirely in kernel space.
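The whole procedure can be sketched end-to-end as follows. This is our own illustration, not the paper's implementation: ridge regression stands in for the SVM step, the exponential free kernel stands in for whichever free kernel is actually used, the data are placeholders, and we add a final normalisation so that $\tilde{K}(\mathbf{x}, \mathbf{x}) = 1$:

```python
import math

def K_m(xs):
    # Exponential free kernel, any order m: K_m(x_1,...,x_m) = exp(<x_1,...,x_m>_m).
    n = len(xs[0])
    return math.exp(sum(math.prod(x[j] for x in xs) for j in range(n)))

def solve(A, b):
    # Gaussian elimination with partial pivoting (small dense systems only).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Step 0: dataset from which to infer feature relevance (illustrative values).
X = [[0.1, 0.2], [0.4, -0.1], [-0.3, 0.5]]
y = [0.5, 1.0, -0.2]

# Step 1: the free kernel defines the 2-kernel K_2.
K2 = [[K_m([a, b]) for b in X] for a in X]

# Step 2: learn alpha on the dataset (ridge regression in place of an SVM).
lam = 1e-3
alpha = solve([[K2[i][j] + lam * (i == j) for j in range(3)]
               for i in range(3)], y)

# Step 3: re-weighted 2-kernel via theorem 1 (K_4 contracted against alpha),
# normalised so that K_tuned(x, x) = 1.
def K_rw(x, xp):
    return sum(ai * aj * K_m([x, xp, xi, xj])
               for ai, xi in zip(alpha, X) for aj, xj in zip(alpha, X))

def K_tuned(x, xp):
    return K_rw(x, xp) / math.sqrt(K_rw(x, x) * K_rw(xp, xp))

t_diag = K_tuned([0.2, 0.3], [0.2, 0.3])
t_ab = K_tuned([0.2, 0.3], [-0.1, 0.4])
t_ba = K_tuned([-0.1, 0.4], [0.2, 0.3])
```

The resulting `K_tuned` is an ordinary (Mercer) kernel and can be dropped into any kernel machine or, as in section 5, used as a GP covariance function.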
5 Application: Accelerated Bayesian Optimisation
In this section we present a practical example of the application of kernel pre-tuning via re-weighting, namely accelerated Bayesian optimisation.
Bayesian Optimisation (BO) [18] is a form of sequential model-based optimisation (SMBO) that aims to find the optimum of an expensive (to evaluate) function $f$ with the least number of evaluations. It is assumed that $f$ is a draw from a zero-mean Gaussian Process (GP) [14] with covariance function (kernel) $K$. As with kernel selection, $K$ is typically not given a priori but selected from a set of "known" covariance functions (kernels) using heuristics such as max-log-likelihood. Nevertheless the speed of convergence is critically dependent on having a good model for $f$, which requires selection of an appropriate covariance function.
In this experiment we consider the case where we have access to prior knowledge in the form of a dataset $\tilde{\mathbb{D}}$ that is related to, but not generated by, $f$. In general the function generating $\tilde{\mathbb{D}}$ differs from $f$, but we assume that both are influenced by the same (or very similar) features. We use $\tilde{\mathbb{D}}$ to tune our covariance function via algorithm 1 to fit $\tilde{\mathbb{D}}$ (and hence, by assumption, $f$), giving us a better fit for our covariance function and accelerated convergence. Our algorithm is presented in algorithm 2, which is a standard BO algorithm except for the covariance function pre-tuning step.
As noted previously we model $f$ as a draw from a zero-mean GP, allowing us to model the posterior at iteration $t$ with mean:

$\mu_t(\mathbf{x}) = \mathbf{k}_t(\mathbf{x})^{\mathsf{T}} (\mathbf{K}_t + \sigma^2 \mathbf{I})^{-1} \mathbf{y}_t$

and variance:

$\sigma_t^2(\mathbf{x}) = K(\mathbf{x}, \mathbf{x}) - \mathbf{k}_t(\mathbf{x})^{\mathsf{T}} (\mathbf{K}_t + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_t(\mathbf{x})$

in the usual manner [14], where $\mathbf{K}_t$ is the kernel matrix of the observations so far, $\mathbf{k}_t(\mathbf{x})$ the vector of kernel evaluations between $\mathbf{x}$ and those observations, and $\mathbf{y}_t$ the observed values. For the acquisition function we test expected improvement (EI) [8] and GP upper confidence bound (GP-UCB) [19], respectively (alternatives include probability of improvement (PI) [9] and predictive entropy search (PES) [7]).
5.0.1 Short Polymer Fibres
In this experiment we have tested our algorithm on the real-world application of optimising short polymer fibre (SPF) production to achieve a given (median) target length [11]. This process involves the injection of one polymer into another in a special device [20]. The process is controlled by geometric parameters (channel width (mm), constriction angle (degrees), device position (mm)) and flow factors (butanol speed (ml/hr), polymer concentration (cm/s)) that parametrise the experiment; see figure 3. Two devices (A and B) were used. Device A is driven by a gear pump and allows for three butanol speeds (86.42, 67.90 and 43.21). The newer device B has a lobe pump and allows butanol speeds of 98, 63 and 48. Our goal is to design a new short polymer fibre for device B that results in a (median) target fibre length of 500 μm.
We write the device parameters as $\mathbf{x}$ and the result of experiments (median fibre length) on devices A and B as $f_A(\mathbf{x})$ and $f_B(\mathbf{x})$, respectively. Device A has been characterised to give a dataset $\tilde{\mathbb{D}}$ of input/output pairs. We aim to minimise the deviation of $f_B(\mathbf{x})$ from the target length, noting that the objective differs from the function generating $\tilde{\mathbb{D}}$, although both relate to fibre length. Device B has been similarly characterised over a grid, and this grid forms our search space for Bayesian optimisation.
For this experiment we have used the free RBF kernel from table 1. An SVM was trained using $\tilde{\mathbb{D}}$ and this kernel (hyperparameters were selected to minimise leave-one-out mean-squared-error (LOO-MSE)) to obtain $\boldsymbol{\alpha}$. The re-weighted kernel obtained from this was normalised (to ensure good conditioning along the diagonal of the kernel matrix) and used in Bayesian optimisation as per algorithm 2. All data was normalised to $[0, 1]$ and all experiments were averaged over repeated runs.
We have tested both the EI and GP-UCB acquisition functions. Figure 4 shows the convergence of our proposed algorithm. Also shown for comparison are standard Bayesian optimisation (using a standard RBF kernel as our covariance function), and a variant of our algorithm where a kernel mixture model trained on $\tilde{\mathbb{D}}$ is used as our covariance function; specifically:
$K_{\mathrm{mix}}(\mathbf{x}, \mathbf{x}') = v_1 K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}') + v_2 K_{1/2}(\mathbf{x}, \mathbf{x}') + v_3 K_{3/2}(\mathbf{x}, \mathbf{x}')$ (15)
where $K_{\mathrm{RBF}}$ is an RBF kernel, $K_{1/2}$ a Matérn 1/2 kernel and $K_{3/2}$ a Matérn 3/2 kernel; the mixture weights $v_1, v_2, v_3 \geq 0$ and all relevant (kernel) hyperparameters are selected to minimise LOO-MSE on $\tilde{\mathbb{D}}$. Relevant hyperparameters in Bayesian optimisation were selected using max-log-likelihood at each iteration. As can be seen, our proposed approach outperforms the other methods with both GP-UCB and EI acquisition functions.
5.0.2 Aluminium Alloy Design using Thermo-Calc
This experiment considers optimising a new hybrid aluminium alloy for target yield strength. Designing an alloy is an expensive process: casting an alloy and then measuring its properties usually takes a long time. An alloy has certain phase structures that determine its material properties. For example, phases such as C14LAVES and ALSC3 are known to increase yield strength, whilst others such as AL3ZR_D023 and ALLI_B32 reduce the yield strength of the alloy. However a precise function relating the phases to yield strength does not exist. The simulation software Thermo-Calc takes a mixture of component elements as input and computes the phase composition of the resulting alloy; we use it here, considering 11 elements as potential constituents of the alloy and 24 phases.
A dataset of 46 closely related alloys filed as patents was collected. This dataset consists of information about the composition of the elements in each alloy and its yield strength. The phase compositions extracted from Thermo-Calc simulations for the various alloy compositions were used to estimate the positive or negative contribution of each phase to the yield strength of the alloy using linear regression. The weights retrieved for these phases were then used to formulate a utility function. Figure 6 shows the regression coefficients for the phases contributing to the yield strength.
The kernel selection and tuning procedure used here was the same as for the short polymer fibre experiment. We have tested both the EI and GP-UCB acquisition functions. Figure 6 shows the convergence of our proposed algorithm compared to standard Bayesian optimisation (using a standard RBF kernel as our covariance function). Relevant hyperparameters in Bayesian optimisation were selected using max-log-likelihood at each iteration. As can be seen from figure 6, our proposed approach outperforms standard Bayesian optimisation by a significant margin for both EI and GP-UCB.
5.0.3 Simulated Experiment
In this experiment we consider the use of kernel re-weighting to incorporate domain knowledge into a kernel design. We aim to minimise the function illustrated in figure 7. Noting that this function has rotational symmetry, we select an additional dataset $\tilde{\mathbb{D}}$ to exhibit this property, namely a set of vectors selected uniformly at random and labelled by a function of their norm alone. Thus $\tilde{\mathbb{D}}$ reflects the rotational symmetry of the target optimisation function but not its form. As for previous experiments, a free RBF kernel was chosen and re-weighted using algorithm 1 with hyperparameters selected to minimise LOO-MSE. However in this case we have not normalised the re-weighted kernel but rather used a composite kernel, which implies a two-layer feature map: the first layer being the re-weighted feature map implied by the tuned kernel, and the second layer being the standard feature map implied by the RBF kernel. All experiments were averaged over repeated runs.
We have tested both the EI and GP-UCB acquisition functions. Figure 7 shows the convergence of our proposed algorithm compared to standard Bayesian optimisation (using a standard RBF kernel) and standard Bayesian optimisation with a kernel mixture model, as per our short polymer fibre experiment (15), trained on $\tilde{\mathbb{D}}$ and used as the covariance function. Curiously, in this case, while our method combined with GP-UCB outperforms the alternatives, the results are less clear for our method combined with EI. The precise reason for this will be investigated in future work.
6 Conclusion
In this paper we have presented a novel approach to kernel tuning. We have based our method on $m$-kernel techniques from reproducing kernel Banach space theory and $\ell_p$-norm regression. We have defined free kernels, families of $m$-kernels whose implied feature map is independent of the order $m$, along with a means of constructing them (with examples), and shown how the properties of these may be utilised to tune kernels by (implicitly) re-weighting the features in feature space in a principled manner. As an application we have presented an accelerated Bayesian optimisation algorithm that pre-tunes the covariance function on auxiliary data to achieve accelerated convergence, demonstrating the efficacy of our proposal.
Acknowledgments
This research was partially funded by the Australian Government through the Australian Research Council (ARC) and the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning. Prof. Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).
References
 Bach et al. [2004] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning (ICML), Banff, 2004.
 Cortes and Vapnik [1995] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20(3):273–297, 1995.
 Cristianini and Shawe-Taylor [2005] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK, 2005.
 Der and Lee [2007] Ricky Der and Daniel Lee. Large-margin classification in Banach spaces. In Proceedings of the JMLR Workshop and Conference 2: AISTATS 2007, pages 91–98, 2007.
 Dragomir [2004] Sever S. Dragomir. Semi-inner Products and Applications. Hauppauge, New York, 2004.
 Fasshauer et al. [2015] Gregory E. Fasshauer, Fred J. Hickernell, and Qi Ye. Solving support vector machines in reproducing kernel Banach spaces with positive definite functions. Applied and Computational Harmonic Analysis, 38(1):115–139, 2015.
 Hernández-Lobato et al. [2014] José Miguel Hernández-Lobato, Matthew Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In NIPS, pages 918–926, 2014.
 Jones et al. [1998] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
 Kushner [1964] Harold J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.
 Lanckriet et al. [2004] Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, April 2004.
 Li et al. [2017] Cheng Li, David Rubín de Celis Leal, Santu Rana, Sunil Gupta, Alessandra Sutti, Stewart Greenhill, Teo Slezak, Murray Height, and Svetha Venkatesh. Rapid Bayesian optimisation for synthesis of short polymer fiber materials. Scientific Reports, 7(1):5683, 2017.
 Mangasarian [1997] Olvi L. Mangasarian. Arbitrarynorm separating plane. Technical Report 9707, Mathematical Programming, May 1997.
 Mangasarian [1999] Olvi L. Mangasarian. Arbitrarynorm separating plane. Operations Research Letters, 24:15–23, 1999.
 Rasmussen [2006] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.
 Salzo and Suykens [2016] Saverio Salzo and Johan A. K. Suykens. Generalized support vector regression: duality and tensor-kernel representation. arXiv preprint arXiv:1603.05876, 2016.

 Salzo et al. [2018] Saverio Salzo, Johan A. K. Suykens, and Lorenzo Rosasco. Solving ℓp-norm regularization with tensor kernels. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), April 2018.
 Schölkopf and Smola [2001] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, Massachusetts, 2001. ISBN 0262194759.
 Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of NIPS 2012, pages 2951–2959, 2012.
 Srinivas et al. [2012] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, May 2012.
 [20] Alessandra Sutti, Mark Kirkland, Paul Collins, and Ross John George. An apparatus for producing nanobodies: Patent WO 2014134668 A1.
 Zhang et al. [2009] Haizhang Zhang, Yuesheng Xu, and Jun Zhang. Reproducing kernel Banach spaces for machine learning. Journal of Machine Learning Research, 10:2741–2775, 2009.