Kernel Pre-Training in Feature Space via m-Kernels

05/21/2018 ∙ by Alistair Shilton, et al. ∙ Western Sydney University

This paper presents a novel approach to kernel tuning. The method presented borrows techniques from reproducing kernel Banach space (RKBS) theory and tensor kernels and leverages them to convert (re-weight in feature space) existing kernel functions into new, problem-specific kernels using auxiliary data. The proposed method is applied to accelerating Bayesian optimisation via covariance (kernel) function pre-tuning for short-polymer fibre manufacture and alloy design.


1 Introduction

The kernel trick [2; 17; 3] is well known in machine learning and provides an elegant means of encapsulating a combined feature map (a non-linear map $\boldsymbol{\varphi} : \mathbb{X} \to \mathbb{R}^{d_F}$ from input space into feature space) and inner product into a simple function that effectively hides the feature map: starting from a positive definite kernel $K$ one can prove that there exists an associated (implicit) feature map $\boldsymbol{\varphi}$ such that $K(\mathbf{x}, \mathbf{x}') = \langle \boldsymbol{\varphi}(\mathbf{x}), \boldsymbol{\varphi}(\mathbf{x}') \rangle$, all without knowing (or needing to know) what form $\boldsymbol{\varphi}$ actually takes. This allows the (implicit) use of feature maps of essentially arbitrary complexity at little or no additional computational cost.

While elegant, the kernel trick is not magic. Typically the kernel is selected from a set of “standard” kernels to minimise cross-fold error, test-set error, log-likelihood or similar. In so doing one is essentially picking a feature map from a bag of “standard” feature maps. The result of this process is a feature map (not known but implicitly defined by a kernel) that may, barring an extremely unlikely perfect match scenario, be viewed as the least-worst of the available maps. Techniques such as hyper-parameter tuning, multi-kernel learning [10; 1] etc. aim to improve on this situation by fine-tuning the kernel (and hence the implicit feature map) or combining kernels (and hence combining the feature maps implicit in them). However one is still limited in the space of reachable feature maps, and there is no clear interpretation of such techniques from a feature-space perspective.

Our motivation is as follows: suppose the “best” kernel $K$ for a given dataset $\mathbb{D}$, found using standard techniques, has associated with it an (implicit) feature map $\boldsymbol{\varphi}$. As the least-worst option, this map will have many individual features in it that are highly relevant to the problem at hand, which we would like to emphasise, but also some features that are either not relevant or actively misleading, which we would like to suppress or remove. This implies three questions: (a) how can we identify the (ir)relevant features, (b) how can we amplify (or suppress) features to obtain a better feature map and (c) how can we do steps (a) and (b) without direct access to the feature map (whose existence we infer from the positive definiteness of $K$ but whose form we do not know)?

Figure 1: Geometry of kernel re-weighting in feature space.

To address question (a) we use standard machine-learning techniques. If we apply, for example, a support vector machine (SVM) [2] method (or similar) equipped with kernel $K$ to learn either from dataset $\mathbb{D}$ or some related dataset $\tilde{\mathbb{D}}$, then the answer we obtain takes the form of the representation of a weight vector $\mathbf{w}$ in feature space. Assuming a reasonable “fit”, the weights will be larger in magnitude for relevant features and smaller for irrelevant ones.

To address questions (b) and (c) we borrow concepts from reproducing kernel Banach space (RKBS) theory [4; 21; 6] and $p$-norm regularisation [16; 15] - in particular $m$-kernels (tensor kernels [15], moment functions [4]) - and prove that it is possible to adjust the magnitudes of individual features without explicitly knowing the feature map. We show that, if $K$ is a kernel with implied feature map $\boldsymbol{\varphi}$, and a coefficient vector $\boldsymbol{\alpha}$ and points $\tilde{\mathbf{x}}_i$ implicitly define a weight vector $\mathbf{w}$ in feature space (with larger $|w_j|$ implying greater relevance of feature $j$), then we may perform a kernel re-weighting operation (theorem 1, equation (14)) that converts a kernel $K$ (whose implied feature map contains both relevant and irrelevant features) into a kernel $\tilde{K}$ with implicit features that emphasise relevant features (larger $w_j$) and suppress irrelevant or misleading ones (smaller $w_j$), as shown in figure 1. That is, we may pre-tune (re-weight) our kernel to “fit” a particular problem, adjusting the implicit feature map in a principled manner without any explicit (direct) knowledge of the feature map itself.

To achieve this, in section 2.1 we describe $m$-kernels (tensor kernels [15; 16], moment functions [4]) formally, and then in section 3 introduce a new concept, free kernels, which are families of kernels embodying the same underlying feature map over a range of orders (examples are given in table 1). We then formally prove how $m$-kernels may be re-weighted in theorem 1 to emphasise/suppress features, and in section 4 develop an algorithm (algorithm 1) that utilises the concept of kernel re-weighting to tune kernels by suppressing irrelevant features and emphasising important ones.

We demonstrate our method on accelerated Bayesian Optimisation (BO [18]). By pre-tuning the covariance function (kernel) of the Gaussian Process (GP [14]) using auxiliary data we show a speedup in convergence due to better modelling of the objective function. We consider (1) new short polymer fibre design using microfluidic devices (where auxiliary data is generated by an older, slightly different device) and (2) design of a new hybrid Aluminium alloy (where auxiliary data is based on 46 existing patents for aluminium 6000, 7000 and 2000 series). In both cases kernel pre-tuning results in superior performance.

Our main contributions are:

  • Introduction of the concept of free kernels: families of $m$-kernels whose corresponding feature weights and (unweighted) feature maps are independent of the kernel order (definition 1, section 3).

  • Construction of a range of $m$-kernel analogues of standard kernels (table 1).

  • Development of kernel re-weighting theory: a method of tuning free kernels to adjust implied feature weights (theorem 1, section 3).

  • Design of an algorithm using kernel re-weighting for pre-tuning of kernels to fit data (algorithm 1, section 4).

  • Demonstration of our algorithm on accelerated Bayesian Optimisation (section 5).

1.1 Notation

Sets are written $\mathbb{A}, \mathbb{B}, \ldots$; with $\mathbb{N}$ the natural numbers, $\mathbb{R}$ the reals and $\mathbb{R}_+$ the non-negative reals. Column vectors are bold lower case $\mathbf{a}$. Matrices are bold upper case $\mathbf{A}$. Element $i$ of vector $\mathbf{a}$ is $a_i$. Element $i,j$ of matrix $\mathbf{A}$ is $A_{ij}$. $\mathbf{A}^{\mathsf{T}}$ is the transpose, $\odot$ the elementwise product, $\mathbf{a}^{\odot k}$ the elementwise power, $|\mathbf{a}|$ the elementwise absolute value and $\mathrm{sgn}(\mathbf{a})$ the elementwise sign. $\mathbf{1}$ is a vector of $1$s and $\mathbf{0}$ a vector of $0$s. The Kronecker product is $\otimes$. The $m$-inner-product is $\langle \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m \rangle_m = \sum_i x_{1i} x_{2i} \cdots x_{mi}$ [5; 15; 16].

2 Problem Statement and Background

The kernel trick is well known in machine learning. A (Mercer) kernel is a function $K : \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ for which there exists a corresponding feature map $\boldsymbol{\varphi} : \mathbb{X} \to \mathbb{R}^{d_F}$, $d_F \in \mathbb{N} \cup \{\infty\}$, such that $\forall \mathbf{x}, \mathbf{x}' \in \mathbb{X}$:

$K(\mathbf{x}, \mathbf{x}') = \langle \boldsymbol{\varphi}(\mathbf{x}), \boldsymbol{\varphi}(\mathbf{x}') \rangle$ (1)

Geometrically, if $\theta$ is the angle between $\boldsymbol{\varphi}(\mathbf{x})$ and $\boldsymbol{\varphi}(\mathbf{x}')$ in feature space:

$K(\mathbf{x}, \mathbf{x}') = \| \boldsymbol{\varphi}(\mathbf{x}) \|_2 \, \| \boldsymbol{\varphi}(\mathbf{x}') \|_2 \cos\theta$ (2)

so $K(\mathbf{x}, \mathbf{x}')$ is a measure of the similarity of $\mathbf{x}$ and $\mathbf{x}'$ in terms of their alignment in feature space. For a normalised kernel such as the RBF kernel this simplifies to $K(\mathbf{x}, \mathbf{x}') = \cos\theta$.
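To make the geometric reading of (1)-(2) concrete, the short sketch below (ours, not from the paper) evaluates an RBF kernel numerically and checks that, since the kernel is normalised so that $K(\mathbf{x},\mathbf{x}) = 1$, the kernel value can be read directly as $\cos\theta$ in the implicit feature space; the function name `rbf_kernel` is our own.

```python
import numpy as np

def rbf_kernel(x, xp, gamma=1.0):
    """Standard RBF (Gaussian) kernel; normalised, so K(x, x) = 1."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

x, xp = np.array([0.2, 0.5]), np.array([0.4, 0.1])
k = rbf_kernel(x, xp)

# Since K(x, x) = K(x', x') = 1, the kernel value equals cos(theta) between
# phi(x) and phi(x') in the implicit feature space (equation (2)).
cos_theta = k / np.sqrt(rbf_kernel(x, x) * rbf_kernel(xp, xp))
print(k, cos_theta)  # identical values in (0, 1]
```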

In practice a kernel is usually selected from a set of well-known kernels (e.g. a polynomial kernel or an RBF kernel), possibly with additional hyper-parameter tuning and/or consideration of kernel combinations (e.g. multi-kernel learning [10; 1]). However the resulting kernel (and the feature map implied by it) may still be viewed as a “least-worst fit” from a set of readily available feature maps.

In the present paper we show how the feature map implied by a given kernel may be tuned, in a principled manner, to better fit the data. We use techniques from reproducing kernel Banach spaces [4; 21; 6] and $p$-norm regularisation [16; 15] to show how a kernel $K$ can be pre-trained or re-weighted by scaling the features of the feature map $\boldsymbol{\varphi}$ embodied by it:

$\boldsymbol{\varphi}(\mathbf{x}) \to \mathbf{w} \odot \boldsymbol{\varphi}(\mathbf{x})$ (3)

to emphasise important features (in feature space) over unimportant ones. The geometry of this operation is shown in figure 1: important features can be emphasised or amplified, while irrelevant or misleading features are de-emphasised.

2.1 $m$-Kernels

A number of generalisations of the basic concept of kernel functions arise in generalised-norm SVMs [12; 13], reproducing kernel Banach space (RKBS) theory [4; 21; 6] and $p$-norm regularisation [16; 15]. Of interest here is the $m$-kernel (tensor kernel [16], moment function [4]), which is a function $K : \mathbb{X}^m \to \mathbb{R}$ for which there exists an (unknown) feature map $\boldsymbol{\varphi} : \mathbb{X} \to \mathbb{R}^{d_F}$ such that:

$K(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m) = \sum_j \varphi_j(\mathbf{x}_1) \varphi_j(\mathbf{x}_2) \cdots \varphi_j(\mathbf{x}_m)$ (4)

(so Mercer kernels are $2$-kernels). Discussion of the properties of $m$-kernels may be found in [15; 16; 4]. Examples of $m$-kernels include:

  • $m$-inner-product kernels: By analogy with the inner-product kernels it may be shown [15] that, given $f : \mathbb{R} \to \mathbb{R}$ expandable as a Taylor series $f(\xi) = \sum_q a_q \xi^q$, the function:

    $K(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m) = f \left( \langle \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m \rangle_m \right)$ (5)

    is an $m$-kernel if and only if all terms $a_q$ in the series are non-negative.

  • $m$-direct-product kernels: Similarly, for any Taylor-expandable $g : \mathbb{R} \to \mathbb{R}$ with non-negative terms, the function:

    (6)

    is an $m$-kernel (a special case of a Taylor kernel [15]).

A sample of $m$-kernels is presented in table 1. The generalised RBF kernel in this table is constructed from the exponential kernel, and reduces to the standard RBF kernel when $m = 2$. The implied feature map for this kernel is $\varphi_j(\mathbf{x}) = e^{-\frac{\gamma}{2} \| \mathbf{x} \|_2^2} \varphi_j^{\exp}(\mathbf{x})$, where $\boldsymbol{\varphi}^{\exp}$ is the feature map of the exponential kernel. Note that this relation is independent of $m$.

Linear: $K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m$
Polynomial order $q$: $K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \left( 1 + \langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m \right)^q$
Hyperbolic sine ($\nu > 0$): $K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \sinh \left( \nu \langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m \right)$
Exponential ($\nu > 0$): $K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \exp \left( \nu \langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m \right)$
Inverse Gudermannian ($\nu > 0$): $K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \mathrm{gd}^{-1} \left( \nu \langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m \right)$
Log ratio: $K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \ln \left( \frac{1 + \langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m}{1 - \langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m} \right)$ (assuming $| \langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m | < 1$)
RBF ($\gamma > 0$): $K(\mathbf{x}_1, \ldots, \mathbf{x}_m) = \exp \left( -\frac{\gamma}{2} \sum_i \| \mathbf{x}_i \|_2^2 \right) \exp \left( \gamma \langle \mathbf{x}_1, \ldots, \mathbf{x}_m \rangle_m \right)$
Table 1: Examples of (free) $m$-kernels. See text (section 2.1) for discussion of the RBF kernel.
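The following sketch illustrates the $m$-inner-product construction (5) for the simple choice $f(\xi) = \xi^2$, whose (finite) feature map can be written out explicitly, and checks equation (4) numerically; the construction follows our reconstruction of (4)-(5) above and all helper names are ours.

```python
import numpy as np
from itertools import product

def m_inner(*xs):
    """m-inner-product <x_1, ..., x_m>_m = sum_i x_1i x_2i ... x_mi."""
    return np.sum(np.prod(np.stack(xs), axis=0))

def poly2_m_kernel(*xs):
    """m-kernel built from f(xi) = xi**2 (all Taylor coefficients non-negative)."""
    return m_inner(*xs) ** 2

def poly2_features(x):
    """Explicit finite feature map for f(xi) = xi**2: all ordered products x_i * x_j."""
    return np.array([x[i] * x[j] for i, j in product(range(len(x)), repeat=2)])

rng = np.random.default_rng(0)
x1, x2, x3 = rng.normal(size=(3, 2))

lhs = poly2_m_kernel(x1, x2, x3)                                  # kernel evaluation (m = 3)
rhs = np.sum(poly2_features(x1) * poly2_features(x2) * poly2_features(x3))
print(np.isclose(lhs, rhs))                                       # True: matches equation (4)
```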

2.2 $p$-Norm Support Vector Machines

A canonical application of $m$-kernels is the $p$-norm support vector machine (SVM) ($p$-SVM [16], max-margin moment classifier [4]). Let $\mathbb{D} = \{ (\mathbf{x}_i, y_i) : i = 1, 2, \ldots, N \}$ be a training set. Following [16] the aim is to find a sparse (in $\mathbf{w}$) trained machine:

$g(\mathbf{x}) = \langle \mathbf{w}, \boldsymbol{\varphi}(\mathbf{x}) \rangle + b$ (7)

(where $\boldsymbol{\varphi}$ is implied by an $m$-kernel $K$) to fit the data. $\mathbf{w}$, $b$ are found by solving the $p$-norm SVM training problem, where $p$ is dual to $m$ (i.e. $\frac{1}{p} + \frac{1}{m} = 1$):

$\min_{\mathbf{w}, b} \; R \left( \| \mathbf{w} \|_p \right) + C \sum_i E \left( g(\mathbf{x}_i), y_i \right)$ (8)

where $R$ is strictly monotonically increasing, $E$ is an arbitrary empirical risk function, and the use of $p$-norm regularisation with $1 < p \leq 2$ encourages sparsity of $\mathbf{w}$ in feature space. Following [16; 15]:

$\mathbf{w} = \Big( \sum_i \alpha_i \boldsymbol{\varphi}(\mathbf{x}_i) \Big)^{\odot (m-1)}$ (9)

(representor theorem) and hence:

$g(\mathbf{x}) = \sum_{i_1, i_2, \ldots, i_{m-1}} \alpha_{i_1} \alpha_{i_2} \cdots \alpha_{i_{m-1}} K(\mathbf{x}, \mathbf{x}_{i_1}, \mathbf{x}_{i_2}, \ldots, \mathbf{x}_{i_{m-1}}) + b$ (10)

where $K$ is an $m$-kernel with implied feature map $\boldsymbol{\varphi}$ (the $m$-kernel trick). Moreover we may completely suppress $\mathbf{w}$ and construct a dual training problem entirely in terms of $\boldsymbol{\alpha}$ [15; 16] - e.g. for ridge regression (quadratic $R$ and squared-error empirical risk $E$) the dual training problem is:

(11)

Similar results, analogous to the “standard” SVMs (e.g. binary classification), may likewise be constructed [15; 16; 4].
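As an illustration of the $m$-kernel trick, the sketch below evaluates a trained machine of the form (10) for $m = 3$ using the linear $m$-kernel; here $\boldsymbol{\alpha}$ is arbitrary rather than the solution of (8), so this only shows how the expansion would be evaluated, not how it is trained.

```python
import numpy as np
from itertools import product

def lin_m_kernel(*xs):
    """Linear m-kernel: the m-inner-product of its arguments."""
    return np.sum(np.prod(np.stack(xs), axis=0))

def predict(x, X, alpha, b=0.0, m=3, kernel=lin_m_kernel):
    """Evaluate g(x): sum over (m-1)-tuples of alpha-products times K(x, x_i1, ..., x_i(m-1)), plus b."""
    g = b
    for idx in product(range(len(X)), repeat=m - 1):
        g += np.prod(alpha[list(idx)]) * kernel(x, *X[list(idx)])
    return g

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))       # training inputs
alpha = rng.normal(size=5)        # dual-style coefficients (illustrative, not trained)
print(predict(rng.normal(size=3), X, alpha))
```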

3 Making Kernels from $m$-Kernels - Free Kernels and Kernel Re-Weighting

In the present context we wish to leverage the additional expressive power of $m$-kernels to directly tune the feature map to suit the problem (or problems) at hand. In particular we will demonstrate how kernels may be pre-tuned or learnt for a particular dataset and then transferred to other problems in related domains. We begin with the following definition:

Definition 1 (Free kernels)

Let $m \in \mathbb{N}$, $m \geq 2$. A free kernel (of order $m$) is a family of functions $K_q : \mathbb{X}^q \to \mathbb{R}$ indexed by $q \in \{ 2, 3, \ldots, m \}$ for which there exists an (unweighted) feature map $\boldsymbol{\psi} : \mathbb{X} \to \mathbb{R}^{d_F}$ and feature weights $\gamma_j$, both independent of $q$, such that $\forall \mathbf{x}_1, \ldots, \mathbf{x}_q \in \mathbb{X}$:

$K_q(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_q) = \sum_j \gamma_j \, \psi_j(\mathbf{x}_1) \psi_j(\mathbf{x}_2) \cdots \psi_j(\mathbf{x}_q)$ (12)

For fixed $q$ a free kernel of order $m$ defines (is) a $q$-kernel with implied feature map:

$\varphi_j(\mathbf{x}) = \gamma_j^{1/q} \, \psi_j(\mathbf{x})$ (13)

We assume free kernels of order $m \geq 2$ throughout unless otherwise specified. Note that the $m$-inner-product and $m$-direct-product kernels are free kernels (it is straightforward to show that their unweighted feature maps are products of input coordinates and their feature weights are the corresponding non-negative Taylor coefficients, both independent of the order). It follows that all of the kernels in table 1 are free kernels (the RBF kernel has unweighted feature map $\psi_j(\mathbf{x}) = e^{-\frac{\gamma}{2} \| \mathbf{x} \|_2^2} \psi_j^{\exp}(\mathbf{x})$ and weights $\gamma_j = \gamma_j^{\exp}$, where $\boldsymbol{\varphi}^{\exp}$, $\boldsymbol{\psi}^{\exp}$ and $\boldsymbol{\gamma}^{\exp}$ are the feature map (13), unweighted feature map and weights of the exponential kernel). Given a free kernel of order $m$ we have the following key theorem that enables us to re-weight or tune the kernel:

Theorem 1

Let $\{ K_q \}$ be a free kernel of order $m$ with unweighted feature map $\boldsymbol{\psi}$ and feature weights $\boldsymbol{\gamma}$; and let $\boldsymbol{\alpha} \in \mathbb{R}^N$ and $\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_N \in \mathbb{X}$. Then, for $2 \leq q < m$, the function $\tilde{K}_q$ defined by:

$\tilde{K}_q(\mathbf{x}_1, \ldots, \mathbf{x}_q) = \sum_{i_1, \ldots, i_{m-q}} \alpha_{i_1} \cdots \alpha_{i_{m-q}} K_m(\mathbf{x}_1, \ldots, \mathbf{x}_q, \tilde{\mathbf{x}}_{i_1}, \ldots, \tilde{\mathbf{x}}_{i_{m-q}})$ (14)

defines a free kernel of order $q$ with unweighted feature map $\boldsymbol{\psi}$ and weights $\tilde{\gamma}_j = \gamma_j w_j$, where:

$w_j = \Big( \sum_i \alpha_i \psi_j(\tilde{\mathbf{x}}_i) \Big)^{m-q}$

and we note that $\mathbf{w}$ has the form of the representation (9) of a weight vector in a $p$-norm SVM.

Proof:

This result follows from definition 1 by substitution and application of equation (13).
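Below is a numerical sanity check of the re-weighting identity as we have reconstructed it (the $q = 2$, $m = 4$ case), using a free kernel built from $f(\xi) = \xi^2$ whose unweighted feature map is explicit; all names and the specific kernel choice are ours.

```python
import numpy as np
from itertools import product

def psi(x):
    """Explicit unweighted feature map for the f(xi) = xi**2 free kernel: all products x_i * x_j."""
    d = len(x)
    return np.array([x[i] * x[j] for i, j in product(range(d), repeat=2)])

def K(*xs):
    """Order-q member of the family: K_q(x_1, ..., x_q) = <x_1, ..., x_q>_q ** 2."""
    return np.sum(np.prod(np.stack(xs), axis=0)) ** 2

rng = np.random.default_rng(2)
Xt = rng.normal(size=(4, 3))      # auxiliary points x~_i defining the re-weighting
alpha = rng.normal(size=4)        # coefficients alpha_i
x, xp = rng.normal(size=(2, 3))

# Re-weighted 2-kernel built from the order-4 member (our reconstruction of eq. (14)).
K_rw = sum(alpha[a] * alpha[b] * K(x, xp, Xt[a], Xt[b])
           for a, b in product(range(4), repeat=2))

# The same quantity written directly in feature space, with w_j = (sum_i alpha_i psi_j(x~_i))**2.
w = (alpha @ np.array([psi(z) for z in Xt])) ** 2
print(np.isclose(K_rw, np.sum(w * psi(x) * psi(xp))))  # True
```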

4 The Kernel Pre-Tuning Algorithm

Figure 2: Process to derive re-weighted kernel (dotted box). The (implied, not explicit) process in feature space is presented below the dotted box.

Having established our theoretical results we arrive at the core of our method - an algorithm for tuning kernels using re-weighting (theorem 1) to fit a dataset. Our algorithm is detailed in algorithm 1 and illustrated in figure 2. It is assumed that we are given a dataset $\mathbb{D}$ from which to infer feature relevance, and a free kernel $K$ of order $m$. Then, assuming $m = 4$ (and hence $q = 2$) for simplicity, we proceed as follows:

  1. The free kernel defines a $2$-kernel $K_2$, implying a feature map $\boldsymbol{\varphi}$ by (13).

  2. Train an SVM using $\mathbb{D}$ and $K_2$ to obtain $\boldsymbol{\alpha}$, and hence $\mathbf{w}$, the implied weights in feature space (theorem 1).

  3. Using $\boldsymbol{\alpha}$ and $K_4$, construct the re-weighted kernel $\tilde{K}$ using (14), where $\tilde{K}$ has the re-weighted implied feature map given by theorem 1.

In the more general case $m > 4$, a $p$-norm SVM ($1 < p < 2$) generates a sparse weight vector $\mathbf{w}$, but the concept is the same. Note that at no point in this process do we need to explicitly know the implied feature map or weights - all work is done entirely in kernel space.

0:  Dataset $\mathbb{D}$, free kernel $K$ of order $m$, target order $q$.
  Train a $p$-norm SVM ($1 < p < 2$) with $\mathbb{D}$ using the $2$-kernel from the family to obtain $\boldsymbol{\alpha}$.
  Construct the re-weighted free kernel $\tilde{K}$ of order $q$ using (14): $\tilde{K}_q(\mathbf{x}_1, \ldots, \mathbf{x}_q) = \sum_{i_1, \ldots, i_{m-q}} \alpha_{i_1} \cdots \alpha_{i_{m-q}} K_m(\mathbf{x}_1, \ldots, \mathbf{x}_q, \mathbf{x}_{i_1}, \ldots, \mathbf{x}_{i_{m-q}})$.
Algorithm 1 Kernel Tuning (re-weighting) Algorithm.
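A compact sketch of algorithm 1 for $m = 4$, $q = 2$ is given below. To keep it self-contained we substitute kernel ridge regression for the $p$-norm SVM step and use an exponential-type free kernel; both substitutions, and all names, are ours rather than the paper's.

```python
import numpy as np

def free_K(*xs, gamma=1.0):
    """Order-q member of an exponential-type free kernel family (our reconstruction):
    K_q(x_1, ..., x_q) = exp(gamma * sum_i x_1i x_2i ... x_qi)."""
    return np.exp(gamma * np.sum(np.prod(np.stack(xs), axis=0)))

def pretune_kernel(X, y, lam=1e-2, gamma=1.0):
    """Algorithm 1 sketch (m = 4, q = 2): fit alpha on auxiliary data, return the re-weighted 2-kernel."""
    n = len(X)
    K2 = np.array([[free_K(X[i], X[j], gamma=gamma) for j in range(n)] for i in range(n)])
    alpha = np.linalg.solve(K2 + lam * np.eye(n), y)   # kernel ridge stand-in for the SVM step

    def K_tuned(x, xp):
        # Double sum over auxiliary points: our reconstruction of eq. (14) with q = 2, m = 4.
        return sum(alpha[a] * alpha[b] * free_K(x, xp, X[a], X[b], gamma=gamma)
                   for a in range(n) for b in range(n))
    return K_tuned

rng = np.random.default_rng(3)
X_aux = rng.uniform(-1, 1, size=(8, 2))
y_aux = np.sin(3 * X_aux[:, 0])                        # illustrative auxiliary responses
K_tuned = pretune_kernel(X_aux, y_aux)
print(K_tuned(X_aux[0], X_aux[1]))
```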

5 Application: Accelerated Bayesian Optimisation

In this section we present a practical example of the application of kernel pre-tuning via re-weighting, namely accelerated Bayesian Optimisation.

Bayesian Optimisation (BO) [18] is a form of sequential model-based optimisation (SMBO) that aims to find $\mathbf{x}^* = \operatorname{argmax}_{\mathbf{x}} f(\mathbf{x})$ with the least number of evaluations of an expensive (to evaluate) function $f$. It is assumed that $f$ is a draw from a zero mean Gaussian Process (GP) [14] with covariance function (kernel) $K$. As with kernel selection, $K$ is typically not given a-priori but selected from a set of “known” covariance functions (kernels) using heuristics such as max-log-likelihood. Nevertheless the speed of convergence is critically dependent on having a good model for $f$, which requires selection of an appropriate covariance function.

In this experiment we consider the case where we have access to prior knowledge that is related to - but not generated by - $f$, in the form of an auxiliary dataset $\tilde{\mathbb{D}} = \{ (\tilde{\mathbf{x}}_i, \tilde{y}_i) : i = 1, 2, \ldots, N \}$. In general $\tilde{y}_i \neq f(\tilde{\mathbf{x}}_i)$, but we assume that both $f$ and the process generating $\tilde{\mathbb{D}}$ are influenced by the same (or very similar) features. We use $\tilde{\mathbb{D}}$ to tune our covariance function via algorithm 1 to fit $\tilde{\mathbb{D}}$ (and hence, by assumption, $f$), giving us a better fit for our covariance function and accelerated convergence. Our algorithm is presented in algorithm 2, which is like a standard BO algorithm except for the covariance function pre-tuning step.

0:  Auxiliary dataset $\tilde{\mathbb{D}}$, free kernel $K$ of order $m$.
  Generate pre-tuned kernel $\tilde{K}$ using algorithm 1.
  Modelling $f \sim \mathcal{GP}(0, \tilde{K})$, proceed:
  for $t = 1, 2, \ldots$ do
     Select test point $\mathbf{x}_t = \operatorname{argmax}_{\mathbf{x}} a_t(\mathbf{x})$ (acquisition function $a_t$).
     Perform experiment $y_t = f(\mathbf{x}_t) + \epsilon_t$.
     Update $\mathbb{D}_t = \mathbb{D}_{t-1} \cup \{ (\mathbf{x}_t, y_t) \}$.
  end for
Algorithm 2 Bayesian Optimisation with Kernel re-weighting.
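The sketch below fills in the steps of algorithm 2 for a finite candidate grid with a GP-UCB acquisition; the GP posterior update is standard [14], and the covariance passed in would be the pre-tuned kernel from algorithm 1 (here a plain RBF stands in so the example runs on its own).

```python
import numpy as np

def gp_posterior(K, X, y, Xstar, noise=1e-6):
    """Zero-mean GP posterior mean and standard deviation at candidate points Xstar."""
    Kxx = np.array([[K(a, b) for b in X] for a in X]) + noise * np.eye(len(X))
    Ksx = np.array([[K(s, b) for b in X] for s in Xstar])
    Kss = np.array([K(s, s) for s in Xstar])
    L = np.linalg.cholesky(Kxx)
    coef = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ksx @ coef
    v = np.linalg.solve(L, Ksx.T)
    var = np.maximum(Kss - np.sum(v ** 2, axis=0), 1e-12)
    return mu, np.sqrt(var)

def bo_ucb(f, K, candidates, n_iter=10, kappa=2.0, seed=0):
    """Bayesian optimisation (maximisation) over a finite candidate grid with GP-UCB."""
    rng = np.random.default_rng(seed)
    X = [candidates[rng.integers(len(candidates))]]
    y = [f(X[0])]
    for _ in range(n_iter):
        mu, sd = gp_posterior(K, X, np.array(y), candidates)
        x_next = candidates[np.argmax(mu + kappa * sd)]   # GP-UCB acquisition
        X.append(x_next)
        y.append(f(x_next))
    best = int(np.argmax(y))
    return X[best], y[best]

# Toy usage: a plain RBF covariance stands in for the pre-tuned kernel.
K_rbf = lambda a, b: np.exp(-5.0 * np.sum((a - b) ** 2))
grid = np.random.default_rng(4).uniform(-1, 1, size=(50, 2))
print(bo_ucb(lambda x: -np.sum(x ** 2), K_rbf, grid))
```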

As noted previously we model $f \sim \mathcal{GP}(0, \tilde{K})$, allowing us to model the posterior at iteration $t$ with mean $\mu_t(\mathbf{x})$ and variance $\sigma_t^2(\mathbf{x})$ in the usual manner [14]. For the acquisition function we test both expected improvement (EI) [8] and GP upper confidence bound (GP-UCB) [19] (alternatives include probability of improvement (PI) [9] and predictive entropy search (PES) [7]).
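For reference, the standard closed forms of the two acquisition functions (for maximisation, given the posterior mean and standard deviation and the incumbent best value) are sketched below; the notation is ours.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI for maximisation: E[max(f - y_best, 0)] under the posterior N(mu, sigma^2)."""
    z = (mu - y_best) / np.maximum(sigma, 1e-12)
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def gp_ucb(mu, sigma, beta=4.0):
    """GP-UCB: optimistic score mu + sqrt(beta) * sigma."""
    return mu + np.sqrt(beta) * sigma

print(expected_improvement(np.array([0.2]), np.array([0.1]), 0.15))
print(gp_ucb(np.array([0.2]), np.array([0.1])))
```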

5.0.1 Short Polymer Fibres

Figure 3: Device geometry for short polymer fibre injection.

In this experiment we have tested our algorithm on the real-world application of optimising short polymer fibre (SPF) production to achieve a given (median) target length [11]. This process involves the injection of one polymer into another in a special device [20]. The process is controlled by geometric parameters (channel width (mm), constriction angle (degrees), device position (mm)) and flow factors (butanol speed (ml/hr), polymer concentration (cm/s)) that parametrise the experiment - see figure 3. Two devices (A and B) were used. Device A is driven by a gear pump and allows for three butanol speeds (86.42, 67.90 and 43.21 ml/hr). The newer device B has a lobe pump and allows butanol speeds of 98, 63 and 48 ml/hr. Our goal is to design a new short polymer fibre for device B that results in a (median) target fibre length of 500μm.


Figure 4: Short Polymer Fibre design. Comparison of algorithms in terms of minimum squared distance from the set target versus iterations. GP-UCB and EI indicate acquisition function used. MK indicates mixture kernel used. rmK indicates our proposed method.

We write the device parameters as $\mathbf{x}$ and the result of an experiment (median fibre length) on each device as $y_A(\mathbf{x})$ and $y_B(\mathbf{x})$, respectively. Device A has been characterised to give a dataset $\tilde{\mathbb{D}} = \{ (\tilde{\mathbf{x}}_i, y_A(\tilde{\mathbf{x}}_i)) \}$ of input/output pairs. We aim to minimise the squared distance to the target length, $f(\mathbf{x}) = (y_B(\mathbf{x}) - y^*)^2$, noting that $f \neq y_A$ (the objective differs from the function generating $\tilde{\mathbb{D}}$, although both relate to fibre length). Device B has been similarly characterised on a grid of candidate parameter settings, and this grid forms our search space for Bayesian optimisation.

For this experiment we have used the free RBF kernel from table 1. An SVM was trained using $\tilde{\mathbb{D}}$ and this kernel (with hyperparameters selected to minimise leave-one-out mean-squared-error (LOO-MSE)) to obtain $\boldsymbol{\alpha}$. The re-weighted kernel obtained from this was normalised (to ensure good conditioning along the diagonal of the Gram matrix) and used in Bayesian optimisation as per algorithm 2. All data was normalised and all experiments were averaged over repeated runs.
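The hyperparameter selection step can be pictured as a small grid search over leave-one-out mean-squared-error. The sketch below uses kernel ridge regression with an RBF Gram matrix as a stand-in learner; the grids and all names are illustrative, not the values used in the experiment.

```python
import numpy as np

def loo_mse(K, y, lam):
    """Leave-one-out MSE for kernel ridge regression with Gram matrix K."""
    n = len(y)
    err = 0.0
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        a = np.linalg.solve(K[np.ix_(keep, keep)] + lam * np.eye(n - 1), y[keep])
        err += (y[i] - K[i, keep] @ a) ** 2
    return err / n

def select_hyperparameters(X, y, gammas=(0.1, 1.0, 10.0), lams=(1e-3, 1e-2, 1e-1)):
    """Pick (gamma, lambda) minimising LOO-MSE over a small illustrative grid."""
    best = None
    for g in gammas:
        D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-g * D2)                            # RBF Gram matrix
        for lam in lams:
            score = loo_mse(K, y, lam)
            if best is None or score < best[0]:
                best = (score, g, lam)
    return best

rng = np.random.default_rng(5)
X = rng.uniform(size=(20, 3))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=20)
print(select_hyperparameters(X, y))   # (loo_mse, gamma, lambda)
```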

We have tested both the EI and GP-UCB acquisition functions. Figure 4 shows the convergence of our proposed algorithm. Also shown for comparison are standard Bayesian optimisation (using a standard RBF kernel as our covariance function), and a variant of our algorithm where a kernel mixture model trained on $\tilde{\mathbb{D}}$ is used as our covariance function - specifically:

$K_{\mathrm{MK}}(\mathbf{x}, \mathbf{x}') = v_1 K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}') + v_2 K_{\mathrm{M1/2}}(\mathbf{x}, \mathbf{x}') + v_3 K_{\mathrm{M3/2}}(\mathbf{x}, \mathbf{x}')$ (15)

where $K_{\mathrm{RBF}}$ is an RBF kernel, $K_{\mathrm{M1/2}}$ a Matérn 1/2 kernel and $K_{\mathrm{M3/2}}$ a Matérn 3/2 kernel; the mixture weights $v_1, v_2, v_3 \geq 0$ and all relevant (kernel) hyperparameters are selected to minimise LOO-MSE on $\tilde{\mathbb{D}}$. Relevant hyperparameters in Bayesian optimisation were selected using max-log-likelihood at each iteration. As can be seen, our proposed approach outperforms the other methods with both GP-UCB and EI acquisition functions.
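For clarity, a sketch of a mixture covariance of the form (15) is given below, with illustrative (not fitted) weights and length-scales.

```python
import numpy as np

def mixture_kernel(x, xp, v=(0.5, 0.3, 0.2), ell=1.0):
    """Convex combination of RBF, Matern-1/2 and Matern-3/2 kernels (illustrative parameters)."""
    r = np.linalg.norm(x - xp)
    k_rbf = np.exp(-0.5 * (r / ell) ** 2)
    k_m12 = np.exp(-r / ell)
    k_m32 = (1.0 + np.sqrt(3.0) * r / ell) * np.exp(-np.sqrt(3.0) * r / ell)
    return v[0] * k_rbf + v[1] * k_m12 + v[2] * k_m32

print(mixture_kernel(np.array([0.1, 0.2]), np.array([0.3, 0.0])))
```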

5.0.2 Aluminium Alloy Design using Thermo-Calc

Figure 5: Regression coefficients for 25 phases as determined from patent data.
Figure 6: Aluminium alloy design. Comparison of algorithms in terms of maximum utility score versus iterations. GP-UCB and EI indicate acquisition function used. rmK indicates our proposed method.

This experiment considers optimising a new hybrid Aluminium alloy for target yield strength. Designing an alloy is an expensive process: casting an alloy and then measuring its properties usually takes a long time. An alloy has certain phase structures that determine its material properties. For example, phases such as C14LAVES and ALSC3 are known to increase yield strength whilst others such as AL3ZR_D023 and ALLI_B32 reduce the yield strength of the alloy. However a precise function relating the phases to yield strength does not exist. The simulation software Thermo-Calc takes a mixture of component elements as input and computes the phase composition of the resulting alloy; we use Thermo-Calc for this computation. We consider 11 elements as potential constituents of the alloy and 24 phases.

A dataset of 46 closely related alloys filed as patents was collected. This dataset consists of information about the composition of the elements in each alloy and its yield strength. The phase compositions extracted from Thermo-Calc simulations for these alloy compositions were used to estimate the positive or negative contribution of each phase to the yield strength of the alloy using linear regression. The weights retrieved for these phases were then used to formulate a utility function. Figure 5 shows the regression coefficients for the phases contributing to the yield strength.

The kernel selection and tuning procedure used here was the same as for the short polymer fibre experiment. We have tested both the EI and GP-UCB acquisition functions. Figure 6 shows the convergence of our proposed algorithm compared to standard Bayesian optimisation (using a standard RBF kernel as our covariance function). Relevant hyperparameters in Bayesian optimisation (for our method, the kernel mixture and standard Bayesian optimisation) were selected using max-log-likelihood at each iteration. As can be seen from figure 6, our proposed approach outperforms standard Bayesian optimisation by a significant margin for both EI and GP-UCB.

5.0.3 Simulated Experiment

Figure 7: Simulated experiment: (a) target optimisation function in experiment 3, (b) comparison of algorithms in terms of minimum objective value versus iterations. GP-UCB and EI indicate acquisition function used. MK indicates mixture kernel used. rmK indicates our proposed method.

In this experiment we consider the use of kernel re-weighting to incorporate domain knowledge into a kernel design. We aim to minimise the function shown in figure 7(a), which is rotationally symmetric about the origin. Noting this symmetry, we select an additional dataset $\tilde{\mathbb{D}}$ to exhibit the same property: a set of vectors whose directions are selected uniformly at random and whose responses do not depend on direction. Thus $\tilde{\mathbb{D}}$ reflects the rotational symmetry of the target optimisation function but not its form. As for the previous experiments a free RBF kernel was chosen and re-weighted using algorithm 1, with hyperparameters selected to minimise LOO-MSE. However in this case we have not normalised the re-weighted kernel, but rather used a composite kernel which implies a two-layer feature map: the first layer being the re-weighted feature map implied by $\tilde{K}$ and the second layer being the standard feature map implied by the RBF kernel. All experiments were averaged over repeated runs.
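As an indication of how an auxiliary dataset carrying only the symmetry information might be generated, the sketch below draws inputs at random and assigns responses that depend only on the radius $\|\mathbf{x}\|$, so they are invariant under rotation; the exact construction used in the experiment is not reproduced here, so this is purely illustrative.

```python
import numpy as np

def make_rotation_symmetric_dataset(n=50, d=2, seed=6):
    """Auxiliary data whose responses depend only on the radius ||x||, and hence are
    invariant under any rotation of the inputs (illustrative construction)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    y = np.cos(3.0 * np.linalg.norm(X, axis=1))   # purely radial response
    return X, y

X_aux, y_aux = make_rotation_symmetric_dataset()
print(X_aux.shape, y_aux[:3])
```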

We have tested both the EI and GP-UCB acquisition functions. Figure 7 shows the convergence of our proposed algorithm compared to standard Bayesian optimisation (using a standard RBF kernel) and standard Bayesian optimisation with the kernel mixture model (15) from the short polymer fibre experiment, trained on $\tilde{\mathbb{D}}$, used as the covariance function. Curiously in this case, while our method combined with GP-UCB outperforms the alternatives, the results are less clear for our method combined with EI. The precise reason for this will be investigated in future work.

6 Conclusion

In this paper we have presented a novel approach to kernel tuning. We have based our method on $m$-kernel techniques from reproducing kernel Banach space theory and $p$-norm regression. We have defined free kernels - families of $m$-kernels whose underlying feature map and weights are independent of the kernel order - along with a means of constructing them (with examples), and shown how their properties may be utilised to tune them by (implicitly) re-weighting the features in feature space in a principled manner. As an application we have presented an accelerated Bayesian optimisation algorithm that pre-tunes the covariance function on auxiliary data to achieve accelerated convergence, demonstrating the efficacy of our proposal.

Acknowledgments

This research was partially funded by the Australian Government through the Australian Research Council (ARC) and the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning. Prof. Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

References

  • Bach et al. [2004] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning (ICML), Banff, 2004.
  • Cortes and Vapnik [1995] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20(3):273–297, 1995.
  • Cristianini and Shawe-Taylor [2005] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK, 2005.
  • Der and Lee [2007] Ricky Der and Daniel Lee. Large-margin classification in Banach spaces. In Proceedings of the JMLR Workshop and Conference 2: AISTATS2007, pages 91–98, 2007.
  • Dragomir [2004] Sever S. Dragomir. Semi-inner products and Applications. Hauppauge, New York, 2004.
  • Fasshauer et al. [2015] Gregory E. Fasshauer, Fred J. Hickernell, and Qi Ye. Solving support vector machines in reproducing kernel Banach spaces with positive definite functions. Applied and Computational Harmonic Analysis, 38(1):115–139, 2015.
  • Hernández-Lobato et al. [2014] José Miguel Hernández-Lobato, Matthew Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In NIPS, pages 918–926, 2014.
  • Jones et al. [1998] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
  • Kushner [1964] Harold J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.
  • Lanckriet et al. [2004] Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, April 2004.
  • Li et al. [2017] Cheng Li, David Rubín de Celis Leal, Santu Rana, Sunil Gupta, Alessandra Sutti, Stewart Greenhill, Teo Slezak, Murray Height, and Svetha Venkatesh. Rapid Bayesian optimisation for synthesis of short polymer fiber materials. Scientific Reports, 7(1):5683, 2017.
  • Mangasarian [1997] Olvi L. Mangasarian. Arbitrary-norm separating plane. Technical Report 97-07, Mathematical Programming, May 1997.
  • Mangasarian [1999] Olvi L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24:15–23, 1999.
  • Rasmussen [2006] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.
  • Salzo and Suykens [2016] Saverio Salzo and Johan A. K. Suykens. Generalized support vector regression: duality and tensor-kernel representation. arXiv preprint arXiv:1603.05876, 2016.
  • Salzo et al. [2018] Saverio Salzo, Johan A. K. Suykens, and Lorenzo Rosasco. Solving ℓp-norm regularization with tensor kernels. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, April 2018.
  • Schölkopf and Smola [2001] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, Massachusetts, 2001. ISBN 0262194759.
  • Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of NIPS 2012, pages 2951–2959, 2012.
  • Srinivas et al. [2012] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, May 2012.
  • [20] Alessandra Sutti, Mark Kirkland, Paul Collins, and Ross John George. An apparatus for producing nano-bodies: Patent WO 2014134668 A1.
  • Zhang et al. [2009] Haizhang Zhang, Yuesheng Xu, and Jun Zhang. Reproducing kernel Banach spaces for machine learning. Journal of Machine Learning Research, 10:2741–2775, 2009.