1 Introduction
We begin by formally defining the notion of the Gaussian process as a prior distribution over functions $f : \mathcal{X} \to \mathbb{R}$, where we consider $\mathcal{X}$ to be a "state space" of parameters. In particular, given tuples $(x_i, f(x_i))$, we assume that $f(x_i) \sim \mathcal{N}(\mu(x_i), \sigma^2(x_i))$. We say that a series of $N$ such tuples induces a multivariate Gaussian distribution in $\mathbb{R}^N$. This Gaussian architecture is appealing for several reasons:
It elegantly fits a basis function to the data, allowing for trivial inference of the behavior of all points in $\mathcal{X}$.

The underlying Gaussian assumptions permit statistical notions of expected improvement and uncertainty to arise in closed form from the fitted model.

The previous two points lead naturally to a framework that enables Gaussian processes to optimize parameters in machine learning models via a principled search of $\mathcal{X}$.
Optimization frameworks of this form offer an immediate advantage over discrete parameter optimization methodologies such as $k$-fold cross-validation, which requires $N$ performance evaluations of the learned model if $N$ is the cardinality of the discrete parameter set. This is computationally expensive, and it fails to generate knowledge of the model's performance at any $x$ that is not a member of the discrete parameter set used by cross-validation.
Current generation Gaussian process optimization methods exploit the expected improvement of the model performance above the current best at all points in $\mathcal{X}$. The improvement at $x$ is $I(x) = \max(0, f(x) - f^{\star})$, where $f^{\star}$ is the current best score of the objective function. It can be shown [3] that the expected improvement is:

$$\mathbb{E}[I(x)] = (\mu(x) - f^{\star})\,\Phi(z) + \sigma(x)\,\phi(z) \quad (1)$$

Here we represent the standard deviation of the prediction at $x$ as $\sigma(x)$ and let $z = (\mu(x) - f^{\star}) / \sigma(x)$. We also denote the CDF of the standard normal distribution as $\Phi(\cdot)$ and similarly for the standard normal PDF, $\phi(\cdot)$. The essential idea of these optimization algorithms is to pursue function evaluations at those points yielding the highest expected improvement in the objective function, thereby extracting more information about the nature of the true, underlying objective function itself. This process is continued until no further function evaluations are expected to yield improvements. In this work we additionally consider applications of the probability of improvement $P(I(x) > 0)$ to enhancing the Monte Carlo nature of our algorithm. The probability of improvement can be derived as:

$$P(I(x) > 0) = P(f(x) > f^{\star}) \quad (2)$$

$$= \Phi\!\left(\frac{\mu(x) - f^{\star}}{\sigma(x)}\right) \quad (3)$$

Here we say that $P(f(x) > f^{\star}) = 1 - \Phi\!\left(\frac{f^{\star} - \mu(x)}{\sigma(x)}\right) = \Phi\!\left(\frac{\mu(x) - f^{\star}}{\sigma(x)}\right)$, which is trivially shown to be true by the symmetry of the standard normal distribution. For the purposes of this work, we will refer to the expected improvement, probability of improvement, and mean-value criteria for point selection as anticipation equations.
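As a concrete illustration, the two anticipation equations above can be sketched in a few lines of Python; the function names and the pure-stdlib implementation are our own conventions, not part of the original method:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form expected improvement of Eq. (1) for a maximization problem."""
    if sigma <= 0.0:
        return 0.0  # no posterior uncertainty: no anticipated improvement
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return (mu - f_best) * cdf + sigma * pdf

def probability_of_improvement(mu, sigma, f_best):
    """Closed-form probability of improvement of Eq. (3)."""
    if sigma <= 0.0:
        return 1.0 if mu > f_best else 0.0
    z = (mu - f_best) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Note that at $\mu(x) = f^{\star}$ the expected improvement reduces to $\sigma(x)\,\phi(0) \approx 0.399\,\sigma(x)$ and the probability of improvement to exactly $1/2$.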
1.1 Squared Exponential Covariance Function
Equally necessary to the definition of the Gaussian process is the covariance kernel, which permits the Gaussian process to express a versatile set of basis functions to fit the underlying objective function. A common choice of kernel is the squared exponential, which defines a matrix $K$:

$$K_{ij} = \theta_0 \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\theta_1^2}\right) \quad (4)$$

and a vector $k$ with entries $k_i = k(x^{\ast}, x_i)$ for a query point $x^{\ast}$. The hyperparameters $\theta_0, \theta_1$ of the Gaussian process may be learned via maximum likelihood estimation by maximizing the evidence of the fitted model. These quantities are functionally related to the prediction and the variance of the prediction as follows:

$$\mu(x^{\ast}) = k^{\top} K^{-1} y \quad (5)$$

$$\sigma^2(x^{\ast}) = k(x^{\ast}, x^{\ast}) - k^{\top} K^{-1} k \quad (6)$$
The squared exponential kernel is a frequent choice within the Gaussian process literature, so we select it here as a practical (and interpretable) baseline for our proposed methodology. Some authors have criticized this choice of kernel as providing an unreasonably smooth interpolation of the basis [1]. The alternative option is the Matérn 5/2 kernel of Snoek et al. [1], though we do not implement it here:

$$K_{M52}(x, x') = \theta_0 \left(1 + \sqrt{5 r^2(x, x')} + \frac{5}{3} r^2(x, x')\right) \exp\!\left(-\sqrt{5 r^2(x, x')}\right) \quad (7)$$

$$r^2(x, x') = \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\theta_d^2} \quad (8)$$
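Equations (4)-(6) can be sketched with NumPy as follows; the default hyperparameter values, the jitter term, and the function names here are illustrative assumptions rather than the settings used in our experiments:

```python
import numpy as np

def sq_exp_kernel(A, B, theta0=1.0, length=1.0):
    """Squared exponential covariance, Eq. (4): theta0 * exp(-|a-b|^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return theta0 * np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, X_star, noise=1e-8):
    """GP predictive mean and variance of Eqs. (5)-(6) at query points X_star."""
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))      # small jitter for stability
    k = sq_exp_kernel(X, X_star)                          # cross-covariances
    K_inv = np.linalg.inv(K)
    mu = k.T @ K_inv @ y                                  # Eq. (5)
    var = np.diag(sq_exp_kernel(X_star, X_star) - k.T @ K_inv @ k)  # Eq. (6)
    return mu, var
```

At the training inputs the posterior mean interpolates the observations and the variance collapses toward zero, while far from the data the variance reverts to the prior amplitude $\theta_0$.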
2 Machine Learning Problem Formalism
In the context of machine learning we are typically presented with a model $M$, which is a function of the observations $X$, the targets $y$, and the model parameters $p$. The efficacy of this model can then be evaluated by a score function $S$, which is most commonly either the accuracy (to be maximized) or the error (to be minimized). Because $X$ and $y$ are fixed, the ability of the model to generate predictions depends necessarily on $p$ (and perhaps also on random starting conditions in, for example, neural networks). Regardless of whether the model parameters are discrete (for example, the number of trees grown in a decision forest) or continuous (for example, the kernel width parameter in an SVM with a Gaussian kernel), it is possible to fit a regression function through the score function values at the points $p$. Using this architecture, the fundamental optimization procedure is as follows:
Algorithm 1: Original Gaussian Process Optimization
Input: A labeled data set $(X, y)$ and initial parameters $p_0$.
Output: Proposed best parameters $p^{\star}$ which maximize the score function.
Algorithm: Learn $M$ using the data and $p_0$.
Evaluate $S_0 = S(M)$ and initialize the set of tuples $T$ with $(p_0, S_0)$.
Initialize $p^{\star} = p_0$, $S^{\star} = S_0$.
While: Stopping criterion False
Fit a Gaussian process to $T$. Infer a $p'$ that is anticipated to yield the greatest improvement by an anticipation equation.
Evaluate $S' = S(p')$ and add the tuple $(p', S')$ to $T$.
If: $S' > S^{\star}$, set $p^{\star} = p'$ and $S^{\star} = S'$.
Return: $p^{\star}$
The weakness of this algorithm is that, under most circumstances, if there is no indication that a scoring function evaluation at $p$ will lead to improvement, that point will not be evaluated. This is true even when the Gaussian process knows very little about the nature of the function at $p$. As a result, this optimization procedure can be prone to finding poor local optima due to, for instance, bad initialization of $p_0$. This can be combated to an extent by pursuing multiple random restarts of the algorithm; however, that process begins to resemble precisely the kind of cross-validation procedure we wished to avoid.
In the algorithm, we indicate an unspecified stopping criterion for the optimization. In our experiments, we specify that the algorithm should complete a predetermined number of steps unless it converges to a maximum (either local or global) of its own accord and suspects that no further function evaluations are worthwhile, in which case termination is immediate.
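Under the stated stopping criterion, Algorithm 1 can be sketched end-to-end for a one-dimensional discrete parameter grid. Everything below — the grid search, the unit-amplitude squared exponential kernel with zero prior mean, the jitter term, and the tolerance — is an illustrative assumption rather than the exact implementation used in our experiments:

```python
import numpy as np
from math import erf

def _gp(Xs, ys, grid, l=1.0):
    # GP posterior mean/std on a 1-D grid with a squared exponential kernel
    Xs, grid = np.asarray(Xs, float), np.asarray(grid, float)
    K = np.exp(-0.5 * (Xs[:, None] - Xs[None, :]) ** 2 / l ** 2) + 1e-8 * np.eye(len(Xs))
    k = np.exp(-0.5 * (Xs[:, None] - grid[None, :]) ** 2 / l ** 2)
    Kinv = np.linalg.inv(K)
    mu = k.T @ Kinv @ np.asarray(ys, float)
    var = np.clip(1.0 - np.einsum('ij,ik,kj->j', k, Kinv, k), 0.0, None)
    return mu, np.sqrt(var)

def _ei(mu, sd, best):
    # Expected improvement of Eq. (1), elementwise over the grid
    z = np.where(sd > 0, (mu - best) / np.where(sd > 0, sd, 1.0), 0.0)
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    cdf = 0.5 * (1 + np.vectorize(erf)(z / np.sqrt(2)))
    return np.where(sd > 0, (mu - best) * cdf + sd * pdf, 0.0)

def gp_optimize(score, grid, p0, n_iter=20, eps=1e-6):
    """Algorithm 1: greedily evaluate the point of highest expected improvement."""
    T = [(p0, score(p0))]
    best_p, best_s = T[0]
    for _ in range(n_iter):
        mu, sd = _gp([p for p, _ in T], [s for _, s in T], grid)
        ei = _ei(mu, sd, best_s)
        if ei.max() < eps:        # stopping criterion: no anticipated improvement
            break
        p = float(grid[int(ei.argmax())])
        s = score(p)
        T.append((p, s))
        if s > best_s:
            best_p, best_s = p, s
    return best_p, best_s
```

On a smooth unimodal score this sketch typically locates the maximizer in a handful of evaluations, mirroring the behavior described above.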
3 A Hybrid Optimization Algorithm
It is apparent that it would be preferable if our optimization algorithm incorporated in itself a mechanism to search for maxima in regions about which the Gaussian process can infer very little. However, it is equally apparent that an algorithm that searches only in those low-knowledge regions will be inefficient. Therefore, a superior algorithm would choose to evaluate regions of high uncertainty only a small fraction of the time, and would otherwise devote its attention to maximizing the scoring function in the typical fashion. To this end, we propose to incorporate what nearly amounts to a Metropolis-Hastings-like step such that the algorithm will use biased "coin flips" to determine whether or not an uncertain region is evaluated in the next iteration.
We note that exploring the region of highest uncertainty offers an immediate advantage over other common uncertainty-based approaches, namely the method of searching the Gaussian process' upper confidence bound. In particular, the upper confidence bound would require the additional tuning of a width parameter. We can begin to express this idea in the following algorithm, which preserves the core of the Gaussian process optimization algorithm, yet incorporates a kind of exploratory awareness that can lead to gains.
Algorithm 2: Hybrid Gaussian Process Optimization
Input: A labeled data set $(X, y)$ and initial parameters $p_0$.
Output: Proposed best parameters $p^{\star}$ which maximize the score function.
Algorithm: Learn $M$ on the data and $p_0$.
Evaluate $S_0 = S(M)$ and initialize the set of tuples $T$ with $(p_0, S_0)$.
Initialize $p^{\star} = p_0$, $S^{\star} = S_0$, and set a threshold $t \in [0, 1]$.
While: Stopping criterion False
Fit a Gaussian process to $T$.
Infer a $p'$ that is anticipated to yield the greatest improvement by an anticipation equation.
Obtain the closed-form standard deviations of all points in $\mathcal{X}$ and retrieve the point of highest uncertainty, $p_{\sigma} = \arg\max_{p \in \mathcal{X}} \sigma(p)$.
Generate a random uniform value $u \sim U(0, 1)$.
If: $u < t$
Evaluate $S_{\sigma} = S(p_{\sigma})$ and add the tuple $(p_{\sigma}, S_{\sigma})$ to $T$.
If: $S_{\sigma} > S^{\star}$, set $p^{\star} = p_{\sigma}$ and $S^{\star} = S_{\sigma}$.
Else:
Evaluate $S' = S(p')$ and add the tuple $(p', S')$ to $T$.
If: $S' > S^{\star}$, set $p^{\star} = p'$ and $S^{\star} = S'$.
Return: $p^{\star}$
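The biased coin flip at the heart of the hybrid algorithm amounts to a one-line decision. A minimal sketch follows; the function name and the explicit `u` argument (for testability) are our own conventions:

```python
import random

def hybrid_select(p_ei, p_sigma, t, u=None):
    """One iteration's branch in the hybrid scheme: with probability t,
    evaluate the point of highest posterior uncertainty (exploration);
    otherwise follow the anticipation equation's candidate (exploitation)."""
    if u is None:
        u = random.random()  # u ~ U(0, 1)
    return p_sigma if u < t else p_ei
```

Over many iterations, a fraction of roughly $t$ of the evaluations is spent exploring low-knowledge regions, with the remainder devoted to the usual anticipation-driven search.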
In our experiments, we select a small fixed threshold $t$. In the interest of demonstrating the efficacy of our new algorithm, we construct a toy example that shows an instance where expected improvement optimization terminates before finding the global maximum, whereas our algorithm does precisely the opposite. In particular, for an input $x$, we define our score function as a simple closed-form curve possessing both a local and a global maximum. We initialize both algorithms with an identical triplet of known function points, and ask the algorithms to run twenty iterations unless convergence is achieved. In the case of the original optimization algorithm, the Gaussian process quickly finds the local optimum, but chooses to discontinue searching the space after four iterations. By contrast, the hybrid architecture also finds the local optimum in four iterations, but then evaluates the point of largest uncertainty, which is near the global maximum. This phenomenon is illustrated in Figure 1 and leads us to validate the hypothesis that our algorithm is capable of finding improved maxima in optimization problems.
Details of Experiments for the Employed Data Set

Domain | Raw Features | Response | Data Set Cardinality
Australian Credit Scoring | 16 | Desired credit approval of individuals based on characteristics | 690

Table 1: Data set descriptions for the experiments used to validate the efficacy of the proposed algorithm. We summarize here the domain of the application, the input features to the algorithm, the response variable we wish to predict, and the number of examples provided in the data.
3.1 Variable Threshold Algorithm
For some purposes it may be desirable not to use a fixed threshold for selecting the proportion of iterations that search areas of high uncertainty. We therefore present an additional algorithm which incorporates a dynamic thresholding for choosing to explore low-knowledge regions. This methodology is principled in the sense that it employs the probability of improvement of the highest-uncertainty point as a scaling parameter on a "basis" threshold $t_b$, which may equal unity if so desired. This permits exploration of unknown spaces a portion of the time (unlike the original algorithm), yet also recognizes that it can be advantageous to focus closely on maximizing expected improvement in a fashion that is inversely proportional to the probability of improvement at the location of highest uncertainty in $\mathcal{X}$.
Algorithm 3: Variable Threshold Gaussian Process Optimization
Input: A labeled data set $(X, y)$ and initial parameters $p_0$.
Output: Proposed best parameters $p^{\star}$ which maximize the score function.
Algorithm: Learn $M$ on the data and $p_0$.
Evaluate $S_0 = S(M)$ and initialize the set of tuples $T$ with $(p_0, S_0)$.
Initialize $p^{\star} = p_0$, $S^{\star} = S_0$, and set a "basis" threshold $t_b \in [0, 1]$.
While: Stopping criterion False
Fit a Gaussian process to $T$.
Infer a $p'$ that is anticipated to yield the greatest improvement by an anticipation equation.
Obtain the closed-form standard deviations of all points in $\mathcal{X}$ and retrieve the point of highest uncertainty, $p_{\sigma} = \arg\max_{p \in \mathcal{X}} \sigma(p)$.
Obtain the probability of improvement $PI(p_{\sigma})$ for $p_{\sigma}$, and generate a random uniform value $u \sim U(0, 1)$.
If: $u < t_b \cdot PI(p_{\sigma})$
Evaluate $S_{\sigma} = S(p_{\sigma})$ and add the tuple $(p_{\sigma}, S_{\sigma})$ to $T$.
If: $S_{\sigma} > S^{\star}$, set $p^{\star} = p_{\sigma}$ and $S^{\star} = S_{\sigma}$.
Else:
Evaluate $S' = S(p')$ and add the tuple $(p', S')$ to $T$.
If: $S' > S^{\star}$, set $p^{\star} = p'$ and $S^{\star} = S'$.
Return: $p^{\star}$
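The dynamic threshold test can be sketched as follows, with the probability of improvement of Eq. (3) evaluated at the highest-uncertainty point; the function names and `u` argument are our own conventions for illustration:

```python
import math
import random

def explore_probability(mu_sigma, sd_sigma, f_best, t_basis=1.0):
    """Effective exploration threshold: the basis threshold t_basis scaled by
    the probability of improvement, Eq. (3), at the point of highest
    posterior uncertainty."""
    pi = 0.5 * (1.0 + math.erf((mu_sigma - f_best) / (sd_sigma * math.sqrt(2.0))))
    return t_basis * pi

def variable_threshold_select(p_ei, p_sigma, mu_sigma, sd_sigma, f_best,
                              t_basis=1.0, u=None):
    """Choose the next evaluation point: explore p_sigma when u falls below
    the scaled threshold, otherwise exploit the anticipation candidate p_ei."""
    if u is None:
        u = random.random()  # u ~ U(0, 1)
    if u < explore_probability(mu_sigma, sd_sigma, f_best, t_basis):
        return p_sigma
    return p_ei
```

When the uncertain point is unlikely to improve on the incumbent, the exploration probability shrinks and the search concentrates on expected improvement, as the text above describes.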
4 Experimental Results
We now turn our attention to analyzing the performance of the variable thresholding algorithm in application to a common machine learning benchmark. We use the Australian credit approval data set available at the UCI Machine Learning Repository [4]. We summarize important statistics of this data set in Table 1. We employ this data set for testing the algorithm because it offers a range of variable types: continuous and categorical variables, in addition to missing values.
For the parameter selection stage, we train a random forest, which relies on minimizing an impurity measurement in a series of binary splits. To give an intuitive idea of the random forest's approach to machine learning, we provide the following formal definition. Given a set of candidate splitting tests $S$ at a particular node in a decision tree, we seek the split $s^{\star}$ that satisfies:

$$s^{\star} = \arg\max_{s \in S} \left[ I(R) - \frac{|R_L|}{|R|} I(R_L) - \frac{|R_R|}{|R|} I(R_R) \right], \qquad I(R) = 1 - \sum_{c} p(c \mid R)^2 \quad (9)$$

where $p(c \mid R)$ represents the class posterior probability (of class $c$) for the binary split for a data point located in the region of variable space identified as $R$, and $R_L$, $R_R$ are the two regions produced by the split $s$. In the case of this optimization experiment, we attempt to identify the setting for the number of grown trees that simultaneously maximizes prediction accuracy and minimizes computational complexity.

For credit approval classification, our algorithm correctly identifies the optimal setting of parameters within ten iterations, having converged by the ninth. By contrast, the original Gaussian optimization algorithm fails to identify the best number of trees to grow in the forest, opting for a value far larger than is empirically shown to be necessary. The variable threshold process of parameter selection, by virtue of its exploratory capability, identifies that approximately forty decision trees are necessary to achieve maximum accuracy on unnormalized features. By contrast, the original approach terminates with a selection of 97 decision trees, a significant increase in the computational complexity of the learning algorithm. We report in Table 2 some of the statistics related to the classification results of the random forest and to the convergence of the variable threshold algorithm.

Details of Experiments for the Variable Threshold Algorithm

Statistic | Average | Minimum | Maximum | Standard Deviation
Predictive Accuracy of Random Forest | | | |
Convergence Time of Optimization Algorithm | | | |
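The split criterion of Eq. (9), with Gini impurity, can be sketched for a single feature as follows; this is an illustrative exhaustive search, not the library implementation with which we trained:

```python
def gini(labels):
    """Gini impurity I(R) = 1 - sum_c p(c|R)^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((m / n) ** 2 for m in counts.values())

def best_split(xs, ys):
    """Exhaustive instance of Eq. (9): choose the threshold on one feature
    that maximizes the impurity decrease of the induced binary split."""
    n = len(xs)
    parent = gini(ys)
    best_t, best_gain = None, -1.0
    for t in sorted(set(xs))[1:]:              # candidate thresholds between values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

Each tree in the forest applies this greedy criterion recursively; the number of such trees is precisely the parameter our optimization selects.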
5 Conclusion and Discussion
We have presented here two novel frameworks for Gaussian process optimization. In the case of variable thresholding, we find that we are able to produce results that are superior to those yielded by the original Gaussian process approach. We believe that this particular approach to hyperparameter value assignment has many benefits over other competing techniques such as $k$-fold cross-validation. We hope that these algorithms will find use in other machine learning settings where parameter optimization is crucial. In particular, we foresee neural network learning as a promising future application of the approach.
References
[1] Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. arXiv, August 29, 2012.
[2] Frean, Marcus, and Phillip Boyle. Using Gaussian Processes to Optimize Expensive Functions. Victoria University of Wellington, 2008.
[3] Benassi, Romain, Julien Bect, and Emmanuel Vazquez. Robust Gaussian Process-Based Global Optimization Using a Fully Bayesian Expected Improvement Criterion. Learning and Intelligent Optimization, 2011.
[4] Bache, K., and M. Lichman. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science, 2013.
[5] Compustat Database. Wharton Research Data Services, University of Pennsylvania. Accessed 13 Apr. 2013. https://wrdsweb.wharton.upenn.edu/wrds/.