Bayesian Optimization for Parameter Tuning of the XOR Neural Network

09/22/2017 · by Lawrence Stewart, et al.

When applying Machine Learning techniques to problems, one must select model parameters that ensure the system converges but also does not become stuck in a local minimum of the objective function. Tuning these parameters becomes a non-trivial task for large models, and it is not always apparent whether the user has found the optimal parameters. We aim to automate the process of tuning a Neural Network (where only a limited number of parameter search attempts are available) by implementing Bayesian Optimization. In particular, by assigning Gaussian Process Priors to the parameter space, we utilize Bayesian Optimization to tune an Artificial Neural Network used to learn the XOR function, with the result of achieving higher prediction accuracy.




1 Introduction

This paper addresses the use of Bayesian Optimization for parameter tuning in Artificial Neural Networks that learn by Gradient Descent. We will focus on what are arguably the two most important network parameters: the learning rate η and the node activation function hyper-parameter α. For clarification on the definition of α, suppose node j in a Neural Network is a hyperbolic tanh node with hyper-parameter α. Then the output y_j of this node will be:

    y_j = tanh( α ∑_i w_ij x_i ),

where the w_ij are the synaptic weighting elements and the x_i are the inputs to node j (i.e. α is the scale factor for the node function). Whilst we have opted for η and α as our parameters, the ideas discussed throughout this paper generalise to any number of chosen parameters.
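As a concrete illustration, the scaled node output can be written in a few lines of Python. This is our own sketch (function name and NumPy usage are assumptions; the paper's released code is in Julia):

```python
import numpy as np

def tanh_node_output(x, w, alpha):
    """Output of a tanh node with activation hyper-parameter alpha:
    y = tanh(alpha * sum_i w_i * x_i)."""
    return np.tanh(alpha * np.dot(w, x))
```

For α = 1 this reduces to a standard tanh node; varying α rescales the slope of the activation.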

The selection of such parameters can be critical to a Network’s learning process. For example, a learning rate η that is too small can stop the system from converging, whilst taking η too high could lead to the system becoming stuck in a local minimum of the cost function. It can often be computationally expensive or time costly to evaluate a large set of network parameters, and in this case finding the correct tuning becomes an art in itself. We consider an automated parameter tuning technique for scenarios in which an exhaustive parameter search is too costly to evaluate.

1.1 Translation to a Black-Box Problem

Suppose we have a fixed Neural Network topology trained for a fixed number of epochs of a chosen learning method (e.g. 10,000 epochs of Gradient Descent Back-Propagation). Then we denote this network by N. We also denote the network parameter space (the set of free parameters available to tune the network) for N by P. For example, if we wish to tune the learning rate η and the activation function hyper-parameter α, then P = {(η, α)}.

With this terminology one can view the objective/cost function associated with the Neural Network as a function L of the parameter space for N:

    L : P → ℝ.    (1)

The objective of optimally tuning a network over a parameter space can now be summarised as finding p* ∈ P such that:

    p* = argmin_{p ∈ P} L(p).    (2)

1.2 Bayesian Optimization

Bayesian Optimization is a powerful algorithmic approach to finding the extrema of Black-Box functions that are costly to evaluate. Bayesian Optimization techniques are some of the most efficient approaches with respect to the number of function evaluations required [mockus1994application]. Suppose we have n observations D_1:n = {(p_i, L(p_i)) : i = 1, …, n}, where p_i ∈ P. The Bayesian Optimization algorithm works by using a Utility function U to decide the next best point to evaluate. The algorithm, written in terms of the parameter optimization problem (2), can be summarised as follows:

1: for n = 1 to N_max do
2:     Find p_{n+1} = argmin_{p ∈ P} U(p | D_1:n)
3:     Sample the objective function at the chosen point: L_{n+1} = L(p_{n+1})
4:     Augment the data: D_1:n+1 = D_1:n ∪ {(p_{n+1}, L_{n+1})}
5:     Update the Prior Distribution
Algorithm 1 Bayesian Optimization

The algorithm terminates when n reaches N_max or if the Utility Function chooses the same element of P on consecutive iterations. For more information on Bayesian Optimization see [nandotut].
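To make Algorithm 1 concrete, the following Python sketch runs the loop over a discrete candidate grid, with a Gaussian Process prior and a Lower Confidence Bound utility (both introduced in Section 1.3). This is an illustration only: the kernel length-scale, jitter and κ values are our assumptions, not the settings of the paper's Julia code.

```python
import numpy as np

def sq_exp_kernel(A, B, length=0.2):
    """Squared exponential kernel k(p, p') = exp(-||p - p'||^2 / (2 l^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / length**2)

def gp_posterior(X, y, Xs, jitter=1e-4):
    """GP posterior mean/std at test points Xs, via a Cholesky factorisation."""
    K = sq_exp_kernel(X, X) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = sq_exp_kernel(X, Xs)
    mu = Ks.T @ a
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v**2, 0), 0.0, None)  # prior variance is 1
    return mu, np.sqrt(var)

def bayes_opt(f, candidates, n_max=20, kappa=2.0, seed=0):
    """Algorithm 1: start from one random point, then repeatedly evaluate the
    LCB-minimising candidate; stop at n_max or when the utility re-selects
    an already-evaluated point."""
    rng = np.random.default_rng(seed)
    i0 = rng.integers(len(candidates))
    X = candidates[[i0]]
    y = np.array([f(candidates[i0])])
    for _ in range(n_max - 1):
        mu, sd = gp_posterior(X, y, candidates)
        nxt = candidates[int(np.argmin(mu - kappa * sd))]  # LCB utility
        if any(np.allclose(nxt, xi) for xi in X):
            break                                          # termination rule
        X = np.vstack([X, nxt])
        y = np.append(y, f(nxt))
    best = int(np.argmin(y))
    return X[best], y[best]
```

Here f stands for the black-box objective L of (1); in Section 2 it is the trained network's Mean Square Error.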

1.3 Choice of Parameter Space, Prior and Utility Function

This paper aims to optimize the learning rate η and node function hyper-parameter α as in Section 1.1, i.e. P = {(η, α)}. We will take our prior distribution (as described in Algorithm 1) to be a Gaussian Process with a standard squared exponential kernel [rasmussen_gaussian]:

    k(p, p′) = exp( −‖p − p′‖² / (2ℓ²) ).
For speed, the Julia code does not run the naive Sherman-Morrison-Woodbury matrix inversion [rasmussen_gaussian]. Instead it uses a Cholesky method detailed in [rasmussen_gaussian]; both methods are explained further within the code file. For very large parameter spaces one may want to look into using the Gaussian Process update method of [ambikasaran2014fast].

The chosen utility function is the Lower Confidence Bound [nandotut]:

    U(p | D_1:n) = μ(p) − κσ(p),

where μ and σ are the posterior mean and standard deviation of the Gaussian Process and κ > 0 balances exploration against exploitation. It is worth noting that whilst this utility function is indeed sufficient for tuning the XOR Neural Network, it may be necessary to use an IMGPO utility function (exponential convergence) for larger scale problems [kawaguchi2015bayesian].
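Over a discrete candidate set, the LCB utility amounts to a one-line selection rule. The Python fragment below is illustrative, and the value of κ is an assumption:

```python
import numpy as np

def lcb_next_point(candidates, mu, sd, kappa=2.0):
    """Pick the candidate minimising LCB(p) = mu(p) - kappa * sd(p)."""
    return candidates[int(np.argmin(mu - kappa * sd))]
```

Larger κ favours points where the Gaussian Process is uncertain (exploration); smaller κ favours points with low predicted error (exploitation).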

2 XOR Neural Network Optimization

Consider the XOR function [compbook] (Exclusive Or) defined as:

    XOR : {0,1}² → {0,1},   XOR(x1, x2) = (x1 + x2) mod 2.
XOR has traditionally been a function of interest as it cannot be learned by a simple 2-layer Neural Network [elman1990finding]. Instead the problem requires a 3-layer Neural Network (hence there will be 3 nodes affected by activation function hyper-parameter tuning). Figure 1 shows the topology of the Neural Network required to learn the XOR function. The Neural Network consists of an input layer followed by two hidden layers of sigmoid functions, where the output of a sigmoid node with hyper-parameter α is

    σ(x) = 1 / (1 + e^{−αx}).
Figure 1: XOR Neural Network.

The target goal is to optimize L(η, α) over the space P described in Section 1.3. The Neural Network’s attributed loss function is the Mean Square Error, and the network topology N is as shown in Figure 1. This can all be neatly summarised as finding p* = argmin_{p ∈ P} L(p).

For Deep Neural Networks, or any Neural Network where the number of epochs is large, it can become very time costly to evaluate L. Furthermore, to achieve a small step size when discretizing P, an exhaustive search of the Network parameters becomes time costly. In fact, for higher dimensional variants of P it becomes simply impossible to exhaustively evaluate the parameter space. To keep our proposed solution generalised we will limit the number of parameter searches allowed to 20.
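The black-box L(η, α) can be sketched as follows: train the Figure 1 network by Gradient Descent for a fixed number of epochs and return the final Mean Square Error. This is an illustrative NumPy version (the paper's released code is in Julia); the 2-2-1 layer sizes, weight initialisation and epoch count are our assumptions:

```python
import numpy as np

def xor_mse(eta, alpha, epochs=2000, seed=0):
    """Final MSE of a small sigmoid network trained on XOR, viewed as a
    black-box function of (learning rate eta, hyper-parameter alpha)."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    T = np.array([[0], [1], [1], [0]], float)          # XOR targets
    W1 = rng.normal(0, 1, (2, 2)); b1 = np.zeros(2)
    W2 = rng.normal(0, 1, (2, 1)); b2 = np.zeros(1)
    sig = lambda z: 1.0 / (1.0 + np.exp(-alpha * z))   # scaled sigmoid
    for _ in range(epochs):
        h = sig(X @ W1 + b1)                           # forward pass
        y = sig(h @ W2 + b2)
        dy = (y - T) * alpha * y * (1 - y)             # d sigma(alpha z)/dz
        dh = (dy @ W2.T) * alpha * h * (1 - h)
        W2 -= eta * h.T @ dy; b2 -= eta * dy.sum(0)    # gradient descent step
        W1 -= eta * X.T @ dh; b1 -= eta * dh.sum(0)
    h = sig(X @ W1 + b1)
    y = sig(h @ W2 + b2)
    return float(np.mean((y - T)**2))
```

A parameter search (random or Bayesian) then simply calls xor_mse at each chosen (η, α).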

2.1 Code Structure

Figure 2: Mean Square Error for points selected by Random Grid Search. Figure 3: Mean Square Error for points selected by Bayesian Optimization.

The Julia code is available with this pre-print on arXiv. The code is distributed under a Creative Commons Attribution 4.0 International Public License. If you use this code please attribute to L. Stewart and M.A. Stalzer Bayesian Optimization for Parameter Tuning of the XOR Neural Network, 2017.

The code compares Bayesian Optimization to a Random Grid Search of P, a technique commonly used to achieve an acceptable tuning of Neural Networks. During the Random Grid Search process the code selects 20 random points of P and evaluates L for each point. During Bayesian Optimization, the code computes L for one random element of P, say p1, and forms D1 = {(p1, L(p1))}. It then runs the Bayesian Optimization algorithm described in Section 1.2, with the number of parameter searches again limited to 20.
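The Random Grid Search baseline is straightforward. The following illustrative Python fragment (names and seeding are ours, not the paper's Julia code) evaluates the objective at 20 randomly chosen points of a discretized parameter space and keeps the best:

```python
import numpy as np

def random_grid_search(f, candidates, n=20, seed=0):
    """Evaluate f at n distinct randomly chosen grid points; return the best."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(candidates), size=n, replace=False)
    vals = np.array([f(candidates[i]) for i in idx])
    b = int(np.argmin(vals))
    return candidates[idx[b]], float(vals[b])
```

Unlike Algorithm 1, this baseline has no convergence criterion, so it always spends its full budget of n evaluations.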

3 Results

The results for the experiment denoted above in Section 2 are as follows:

                  Random Search   Bayesian Opt
MSE               0.2681          0.1767
# Search Points   20              6
Run Time (s)      1.5411          0.6012
Table 1:

Here # Search Points is the number of points that each of the processes evaluated (this will always be 20 for Random Search, as it has no convergence criterion). Figure 2 shows the Mean Square Error plot for the randomly selected elements of P, whilst Figure 3 shows the Mean Square Error plot for the points selected by Bayesian Optimization. The Merlot colored plane in Figure 3 denotes the minimum MSE reached by Random Grid Search for comparative purposes.

Figure 4 shows the development of Mean Square Error with time for Bayesian Optimization and also includes Random Grid Search for ease of error comparison.

Figure 4: Mean Square Error against time for Bayesian Optimization and Random Grid Search.

It is clear that Bayesian Optimization achieved higher accuracy whilst also running faster than Random Grid Search; this is because the algorithm converged to the optimal value with respect to the LCB metric after only 6 of the allowed 20 evaluations.

This raises the question of how much more accurate the result obtained by the LCB metric is (i.e. is the LCB a suitable metric for tuning the XOR Neural Network)? To answer this it is useful to see the time required for Random Grid Search to achieve a selection of threshold Mean Square Errors, including the Bayesian Optimal 0.1767. The results are displayed in Table 2.

Threshold MSE   # Search Points   Time to Reach Threshold (s)
0.190           2034              121.8
0.185           5685              454.5
0.177           N/A               N/A
Table 2:

One can see that for Random Grid Search to reach even the moderately lower error of 0.190, the consequential time cost vastly increases. When the threshold was set to the Bayesian Optimal (0.177), Random Grid Search did not reach it at all.

4 Concluding Remarks and Future Works

We have shown in Section 3 that Bayesian Optimization can be successfully implemented to attain a highly accurate tuning of the XOR Neural Network in very little time. There are many avenues for future development, a few of which are briefly summarised below:

  • Apply Bayesian Optimization to more complex Neural Network topologies, e.g. Recurrent or Deep Neural Nets. To do this a more sophisticated utility function may be required [kawaguchi2015bayesian].

  • Apply Bayesian Optimization to tuning higher dimensional variants of P. Further parameters could be added, e.g. momentum [momentum] and regularization rate [regular]. Another possibility is to have a different learning rate/node function hyper-parameter for each synapse/node. For both these tasks the use of Random Embeddings may be useful [wang2016bayesian].

  • Test a wide variety of Gaussian Process Kernels and use a faster Kernel update method [ambikasaran2014fast]. It would also be interesting to investigate the effect of using different prior distributions in Bayesian Optimization [DNGO].


This research is funded by the Caltech SURF program and the Gordon and Betty Moore Foundation through Grant GBMF4915 to the Caltech Center for Data-Driven Discovery.