1 Introduction
This paper addresses the use of Bayesian Optimization for parameter tuning in Artificial Neural Networks that learn by Gradient Descent. We focus on what are arguably the two most important network parameters: the learning rate $\alpha$ and the node activation function hyperparameter $\beta$. To clarify the definition of $\beta$, suppose node $i$ in a Neural Network is a hyperbolic tangent node with hyperparameter $\beta$. Then the output of this node, $y_i$, will be:
$$y_i = \tanh\Big(\beta \sum_j w_{ij} x_j\Big)$$
where $w_{ij}$ are the synaptic weighting elements and $x_j$ are the inputs to the node (i.e. $\beta$ is the scale factor for the node function). Whilst we have opted for $\alpha$ and $\beta$ as our parameters, the ideas discussed throughout this paper generalise to any number of chosen parameters.
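As a minimal illustration of the role of the activation function hyperparameter, the following Python sketch computes the output of a single tanh node whose weighted input is multiplied by a scale factor. The names `tanh_node`, `weights`, `inputs` and `beta` are chosen here for illustration and are not taken from the accompanying Julia code.

```python
import math


def tanh_node(weights, inputs, beta):
    """Output of a tanh node whose weighted input is scaled by beta."""
    # Weighted sum of the inputs arriving at the node.
    pre_activation = sum(w * x for w, x in zip(weights, inputs))
    # beta rescales the argument of the node function.
    return math.tanh(beta * pre_activation)
```

A larger `beta` makes the node saturate faster for the same weighted input, which is why it interacts with the learning rate during training.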
The selection of such parameters can be critical to a Network's learning process. For example, a learning rate that is too small can prevent the system from converging within the available epochs, whilst one that is too large can cause the updates to overshoot, leaving the system oscillating or stuck in a poor local minimum of the cost function. It is often computationally expensive or time costly to evaluate a large set of network parameters, and in this case finding the correct tuning becomes an art in itself. We consider an automated parameter tuning technique for scenarios in which an exhaustive parameter search is too costly to evaluate.
1.1 Translation to a Black-Box Problem
Suppose we have a fixed Neural Network topology $N$ trained with $n$ epochs of a chosen learning method (e.g. 10,000 epochs of Gradient Descent Backpropagation). We denote this by $N_n$. We also denote the network parameter space (the set of free parameters available to tune the network) for $N_n$ by $\Theta$. For example, if we wish to tune the learning rate $\alpha$ and the activation function hyperparameter $\beta$, then $\Theta \subset \mathbb{R}^2$, with elements $\theta = (\alpha, \beta)$.
With this terminology one can view the objective/cost function associated with the Neural Network as a function $f$ of the parameter space for $N_n$:
(1) $f : \Theta \to \mathbb{R}$
The objective of optimally tuning a network over a parameter space can now be summarised as finding $\theta^* \in \Theta$ such that:
(2) $f(\theta^*) = \min_{\theta \in \Theta} f(\theta)$
1.2 Bayesian Optimization
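Bayesian Optimization is a powerful algorithmic approach to finding the extrema of black-box functions that are costly to evaluate. Bayesian Optimization techniques are among the most efficient approaches with respect to the number of function evaluations required [1]. Suppose we have $t$ observations $\mathcal{D}_{1:t} = \{(\theta_i, f(\theta_i))\}_{i=1}^{t}$, where each $\theta_i$ is an element of the parameter space and $f$ is the cost function. The Bayesian Optimization algorithm uses a Utility function to decide the next best point to evaluate. Written in terms of the parameter optimization problem (2), the algorithm can be summarised as follows:
Algorithm 1: Bayesian Optimization
1. Place a prior distribution over $f$ and condition it on the observations $\mathcal{D}_{1:t}$.
2. Select the next point $\theta_{t+1}$ by optimizing the Utility function over the parameter space.
3. Evaluate $f(\theta_{t+1})$ and form $\mathcal{D}_{1:t+1} = \mathcal{D}_{1:t} \cup \{(\theta_{t+1}, f(\theta_{t+1}))\}$.
4. Set $t \leftarrow t + 1$ and return to step 1.
The algorithm terminates when the number of evaluations reaches the maximum allowed, or if the Utility Function chooses the same element of the parameter space on consecutive iterations. For more information on Bayesian Optimization see [2].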
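The two stopping rules can be sketched in Python as a plain sequential loop. This is an illustration under stated assumptions rather than the authors' Julia implementation: the surrogate model and the utility function are hidden behind a hypothetical `propose` callback, and the names `run_bayes_opt`, `first_point` and `max_evals` are invented here.

```python
def run_bayes_opt(objective, propose, first_point, max_evals):
    """Sequential search loop with an evaluation budget and a repeat check.

    objective   : the costly black-box function to minimise
    propose     : hypothetical callback; given the observations so far it
                  returns the next point (in practice: update the prior
                  with the observations, then optimise the utility function)
    first_point : initial point to evaluate
    max_evals   : budget on the number of objective evaluations
    """
    observations = [(first_point, objective(first_point))]
    while len(observations) < max_evals:
        candidate = propose(observations)
        # Stop if the same point is chosen twice in a row.
        if candidate == observations[-1][0]:
            break
        observations.append((candidate, objective(candidate)))
    # Report the best point seen, together with the full history.
    best_point, best_value = min(observations, key=lambda obs: obs[1])
    return best_point, best_value, observations
```

Keeping `propose` as a callback separates the stopping logic from the choice of prior and utility function, which are discussed next.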
1.3 Choice of Parameter Space, Prior and Utility Function
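This paper aims to optimize the learning rate $\alpha$ and the node function hyperparameter $\beta$ as in Section 1.1, i.e. over the parameter space of pairs $\theta = (\alpha, \beta)$. We take our prior distribution (as described in Algorithm 1) to be a Gaussian Process with a standard squared exponential kernel [3]:
$$k(\theta, \theta') = \exp\left(-\tfrac{1}{2} \lVert \theta - \theta' \rVert^2\right)$$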
For speed, the Julia code does not use the naive approach based on the Sherman-Morrison-Woodbury matrix identity [3]. Instead it uses a Cholesky method detailed in [3]; both methods are explained further within the code file. For very large parameter spaces one may want to look into using the Gaussian Process update method of [4].
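To make the Cholesky route concrete, the following pure-Python sketch computes the Gaussian Process posterior mean and variance at a single test point by factorising the kernel matrix and solving two triangular systems, following the standard Cholesky-based prediction steps in [3]. It is not the authors' Julia code. The zero prior mean, unit length scale, unit signal variance and the small `noise` jitter are assumptions made here, as are all function names.

```python
import math


def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel between two points (assumed unit variance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def cholesky(A):
    """Lower-triangular L with L * L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


def gp_posterior(X, y, x_star, noise=1e-6):
    """Posterior mean and variance at x_star under a zero-mean GP prior."""
    n = len(X)
    K = [[sq_exp_kernel(X[i], X[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    # alpha = K^{-1} y, obtained from two triangular solves.
    alpha = backward_sub(L, forward_sub(L, y))
    k_star = [sq_exp_kernel(xi, x_star) for xi in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    var = sq_exp_kernel(x_star, x_star) - sum(vi * vi for vi in v)
    return mean, max(var, 0.0)
```

The factorisation avoids forming an explicit matrix inverse, which is both cheaper and numerically more stable for the small kernel matrices that arise with a limited evaluation budget.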
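The chosen utility function is the Lower Confidence Bound [2]:
$$\mathrm{LCB}(\theta) = \mu(\theta) - \kappa \, \sigma(\theta)$$
where $\mu(\theta)$ and $\sigma(\theta)$ are the posterior mean and standard deviation of the Gaussian Process at $\theta$, and $\kappa \geq 0$ balances exploitation against exploration. The next point to evaluate is the minimizer of the LCB.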
It is worth noting that whilst this utility function is sufficient for tuning the XOR Neural Network, it may be necessary to use the IMGPO utility function (which has exponential convergence) for larger scale problems [5].
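Selecting the next point with a Lower Confidence Bound can be sketched as follows. The sketch assumes a finite list of candidate points and a hypothetical `posterior` callable that returns the posterior mean and standard deviation at a candidate; the default `kappa=2.0` is an arbitrary illustrative value, not one taken from the paper.

```python
def lower_confidence_bound(mean, std, kappa=2.0):
    """LCB utility for minimisation: low mean or high uncertainty scores well."""
    return mean - kappa * std


def next_point(candidates, posterior, kappa=2.0):
    """Return the candidate with the smallest LCB.

    posterior : hypothetical callable mapping a candidate to (mean, std)
    """
    def score(candidate):
        mean, std = posterior(candidate)
        return lower_confidence_bound(mean, std, kappa)

    return min(candidates, key=score)
```

With `kappa = 0` the rule reduces to picking the smallest posterior mean; larger values push the search towards regions where the model is uncertain.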
2 XOR Neural Network Optimization
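Consider the XOR (Exclusive Or) function [6], defined for $x_1, x_2 \in \{0, 1\}$ as:
$$\mathrm{XOR}(x_1, x_2) = \begin{cases} 0 & \text{if } x_1 = x_2 \\ 1 & \text{if } x_1 \neq x_2 \end{cases}$$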
XOR has traditionally been a function of interest as it cannot be learned by a simple 2 layer Neural Network [7]. Instead the problem requires a 3 layer Neural Network (hence there will be 3 nodes affected by activation function hyperparameter tuning). Figure 1 shows the topology of the Neural Network required to learn the XOR function. The Neural Network consists of an input layer followed by two layers of sigmoid nodes, where the output of a sigmoid node $i$ with hyperparameter $\beta$ is
$$y_i = \frac{1}{1 + e^{-\beta \sum_j w_{ij} x_j}}$$
Figure 1: XOR Neural Network
The goal is to optimize the learning rate $\alpha$ and the activation function hyperparameter $\beta$ over a two-dimensional parameter space. The Neural Network's loss function is the Mean Square Error, its topology is as shown in Figure 1, and the number of training epochs is fixed, so the setting fits the black-box formulation of Section 1.1.
For Deep Neural Networks, or any Neural Network where the number of epochs is large, it can become very time costly to evaluate the cost function. Furthermore, achieving a small step size when discretizing the parameter space makes an exhaustive search of the Network parameters time costly. In fact, for higher dimensional variants of the parameter space an exhaustive evaluation becomes simply impossible. To keep our proposed solution general, we limit the number of parameter searches allowed to 20.
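The black-box view of this section can be illustrated with a small Python function that trains an XOR network for a given learning rate and activation hyperparameter and returns the final Mean Square Error. It is a sketch under several assumptions that are not taken from the paper or its Julia code: a 2-2-1 layout with bias terms, per-sample gradient descent updates, uniform random initial weights, and a default of 2000 epochs. The names `xor_mse`, `alpha` and `beta` are illustrative.

```python
import math
import random

XOR_DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def sigmoid(z, beta):
    """Sigmoid node function with scale factor beta (argument clamped for safety)."""
    scaled = max(-500.0, min(500.0, beta * z))
    return 1.0 / (1.0 + math.exp(-scaled))


def xor_mse(alpha, beta, epochs=2000, seed=0):
    """Train a 2-2-1 sigmoid network on XOR and return its Mean Square Error."""
    rng = random.Random(seed)
    # Each node holds two weights and a bias (the bias is an assumption).
    hidden = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(2)]
    output = [rng.uniform(-1.0, 1.0) for _ in range(3)]

    def forward(x):
        h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2], beta) for w in hidden]
        y = sigmoid(output[0] * h[0] + output[1] * h[1] + output[2], beta)
        return h, y

    for _ in range(epochs):
        for x, target in XOR_DATA:
            h, y = forward(x)
            # Backpropagated errors; the factor beta comes from the chain rule.
            delta_out = (y - target) * beta * y * (1.0 - y)
            delta_hid = [delta_out * output[j] * beta * h[j] * (1.0 - h[j])
                         for j in range(2)]
            for j in range(2):
                output[j] -= alpha * delta_out * h[j]
                hidden[j][0] -= alpha * delta_hid[j] * x[0]
                hidden[j][1] -= alpha * delta_hid[j] * x[1]
                hidden[j][2] -= alpha * delta_hid[j]
            output[2] -= alpha * delta_out

    return sum((forward(x)[1] - t) ** 2 for x, t in XOR_DATA) / len(XOR_DATA)
```

Each call to `xor_mse` is one evaluation of the cost function, so its cost grows linearly with the number of epochs; this is what makes a small evaluation budget attractive.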
2.1 Code Structure
The Julia code is available with this preprint on arXiv. The code is distributed under a Creative Commons Attribution 4.0 International Public License. If you use this code, please attribute it to L. Stewart and M.A. Stalzer, "Bayesian Optimization for Parameter Tuning of the XOR Neural Network", 2017.
The code compares Bayesian Optimization to a Random Grid Search of the parameter space, a technique commonly used to achieve an acceptable tuning of Neural Networks. During the Random Grid Search process the code selects 20 random points of the parameter space and evaluates the cost function $f$ at each point. During Bayesian Optimization, the code evaluates $f$ at one random element of the parameter space, say $\theta_1$, and forms the initial observation set $\mathcal{D}_1 = \{(\theta_1, f(\theta_1))\}$. It then runs the Bayesian Optimization algorithm described in Section 1.2 with a maximum of 20 evaluations.
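The Random Grid Search baseline can be sketched in a few lines of Python. The sketch assumes the parameter space is given as a finite list of grid points and that points are drawn without replacement; neither detail is confirmed by the text, and the names `random_grid_search` and `n_points` are illustrative.

```python
import random


def random_grid_search(objective, grid, n_points=20, seed=0):
    """Evaluate the objective at randomly chosen grid points; return the best."""
    rng = random.Random(seed)
    # Draw distinct points (sampling without replacement is an assumption).
    points = rng.sample(grid, min(n_points, len(grid)))
    evaluated = [(point, objective(point)) for point in points]
    return min(evaluated, key=lambda pair: pair[1])
```

For example, `grid` could be a list of (learning rate, hyperparameter) pairs and `objective` a function that trains the network and returns its Mean Square Error.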
3 Results
The results for the experiment described above in Section 2 are as follows:
Table 1: Results of Random Grid Search and Bayesian Optimization
                   Random Search    Bayesian Opt
Minimum MSE        0.2681           0.1767
# Search Points    20               6
Run Time           1.5411           0.6012
Here, Search Points is the number of points that each process evaluated (this is always 20 for Random Search, as it has no convergence criterion). Figure 2 shows the Mean Square Error plot for the randomly selected elements of the parameter space, whilst Figure 3 shows the Mean Square Error plot for the points selected by Bayesian Optimization. For comparison, the Merlot colored plane in Figure 3 denotes the minimum MSE reached by Random Grid Search.
Figure 4 shows the development of Mean Square Error with time for Bayesian Optimization and also includes Random Grid Search for ease of error comparison.
Bayesian Optimization achieved a lower Mean Square Error than Random Grid Search (0.1767 against 0.2681) whilst evaluating fewer points (6 against 20) and running faster. This is because the algorithm converged to the optimal value with respect to the LCB Metric.
This raises the question of how much more accurate the result obtained by the LCB Metric is (i.e. is the LCB a suitable Metric for tuning the XOR Neural Network?). To answer this, it is useful to see the time required for Random Grid Search to reach a selection of threshold Mean Square Errors, including the Bayesian Optimal 0.1767. The results are displayed in Table 2.
Table 2: Cost for Random Grid Search to reach a threshold Mean Square Error
Threshold MSE    # Search Points    Time to Reach Threshold (s)
0.190            2034               121.8
0.185            5685               454.5
0.177            N/A                N/A
One can see that for Random Grid Search to reach a moderately lower error of just 0.190, the time cost increases vastly: 2034 search points and 121.8 s were required. When the threshold was set to the Bayesian Optimal (0.177), Random Grid Search did not reach it at all.
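A threshold experiment of this kind can be sketched in Python by counting how many random evaluations are needed before the objective falls to a given level. The cap `max_points`, the sampling with replacement and the function name `evaluations_to_threshold` are assumptions made for illustration; the paper does not state how the N/A entries were determined.

```python
import random


def evaluations_to_threshold(objective, grid, threshold, max_points, seed=0):
    """Count random evaluations until objective <= threshold.

    Returns None if the threshold is not reached within max_points.
    """
    rng = random.Random(seed)
    for n_evaluated in range(1, max_points + 1):
        point = rng.choice(grid)  # sampling with replacement (assumption)
        if objective(point) <= threshold:
            return n_evaluated
    return None
```

Wrapping the call with a timer such as `time.perf_counter` would give the corresponding time to reach the threshold.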
4 Concluding Remarks and Future Works
We have shown in Section 3 that Bayesian Optimization can be successfully applied to tune the XOR Neural Network, reaching a lower Mean Square Error than Random Grid Search with fewer evaluations and in less time. There are many avenues for future development, a few of which are briefly summarised below:

Apply Bayesian Optimization to more complex Neural Network topologies, e.g. Recurrent or Deep Neural Nets. To do this a more sophisticated utility function may be required [5].

Apply Bayesian Optimization to tuning higher dimensional variants of the parameter space. Further parameters could be added, e.g. momentum [8] and regularization rate [9]. Another possibility is to have a different learning rate/node function hyperparameter for each synapse/node. For both these tasks the use of Random Embeddings may be useful [10].

Test a wide variety of Gaussian Process Kernels and use a faster Kernel update method [4]. It would also be interesting to investigate the effect of using different prior distributions in Bayesian Optimization [11].
Acknowledgements
This research is funded by the Caltech SURF program and the Gordon and Betty Moore Foundation through Grant GBMF4915 to the Caltech Center for Data-Driven Discovery.
References
[1] Jonas Mockus. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 4(4):347–365, 1994.
[2] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.
[3] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning, volume 1. MIT Press, Cambridge, 2006.
[4] Sivaram Ambikasaran, Daniel Foreman-Mackey, Leslie Greengard, David W. Hogg, and Michael O'Neil. Fast direct methods for Gaussian processes. arXiv preprint arXiv:1403.6015, 2014.
[5] Kenji Kawaguchi, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Bayesian optimization with exponential convergence. In Advances in Neural Information Processing Systems, pages 2809–2817, 2015.
[6] Alexander S. Kulikov and Gerhard J. Woeginger. Computer Science - Theory and Applications. 11th International Computer Science Symposium in Russia, CSR, 2016.
[7] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[8] Robert A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307, 1988.
[9] Federico Girosi, Michael Jones, and Tomaso Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269, 1995.
[10] Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.
[11] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180, 2015.