1 Introduction
Over the last decades, machine learning algorithms and deep neural networks have shown remarkable potential in numerous fields. Despite their success, however, algorithms of this kind may still be hard to design, and their performance usually depends heavily on the choice of numerous criteria (learning rate, optimiser, structure of the layers, etc.), called hyper-parameters. In practice, finding an optimal combination of those hyper-parameters can often make the difference between bad or average results and state-of-the-art performance. The search for this optimal setting can be performed by maximising the accuracy of the given learning algorithm. This process is particularly arduous, however, because evaluating such a function is usually very expensive.
On the basis of that observation, we decided to develop a strategy for predicting the value of the objective function at low cost, in an off-line fashion. Based on Support Vector Machines (SVM) and curve fitting, this method makes it possible to predict the accuracy of an artificial neural network (NN) by exploiting only the behaviour of the NN over the first epochs of its learning. To the best of our knowledge, no other article in the literature mentions approaches for a low-cost approximation of the accuracy of an NN, and in this respect, the ideas behind our method may constitute a breakthrough in the domain of artificial intelligence. The algorithm we propose can be of particular interest for a rather fast quality assessment of a new learning algorithm. In particular, it may facilitate hyper-parameter optimisation.
Section 2 of this article describes the main steps of our method for predicting the accuracy of a given NN. Section 3 deals with an application of our method to a topical issue: hyper-parameter optimisation. Section 4 shows the results of the experiments on the MNIST and CIFAR-10 databases. Finally, the paper ends with concluding remarks in Section 5.
2 Methodology
A database is created beforehand by training the NN algorithm of interest up to convergence, with different sets of randomly chosen hyper-parameters. The prediction accuracies of the NN algorithm are gathered throughout this process, and eventually constitute a database to which the method described below is applied. Note that a new database has to be created every time a new set of training/test samples is considered.
Between SVM and curve fitting
The goal is to be able to predict the final accuracy of the NN algorithm after convergence, based only on its behaviour over the first epochs of its learning. To do so, we have combined two techniques, well-known in the literature: Support Vector Machines (SVM) [12] and curve fitting [13].
We first present a theoretical background of the mentioned techniques, and then indicate how we used them in this work.
An SVM is a discriminative classifier formally defined by a separating hyper-plane. In other words, given labelled training data (supervised learning), the algorithm outputs an optimal hyper-plane which categorises new examples. SVM algorithms are characterised by the use of kernels, the sparseness of the solution and the capacity control obtained by acting on a margin, or on the number of support vectors. SVMs can be applied not only to classification problems but also to regression. One of the most important ideas in SVM for regression is that representing the solution by means of small subsets of training points gives enormous computational advantages.
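We consider the well-known dual problem of the SVM algorithm [12], whose solution is given by:
$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, K(x_i, x) + b$$
where $N$ is the number of support vectors, $\alpha_i$ and $\alpha_i^*$ are the multipliers of the support vectors and $b$ is a constant. $K$ is the kernel function, which we can choose to be linear, polynomial or Gaussian. In particular, the Gaussian kernel has the form
$$K(x_i, x) = \exp\left(-\frac{\|x_i - x\|^2}{2\sigma^2}\right)$$
where $\sigma > 0$ is the kernel width.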
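To make the kernel expansion concrete, the following is a minimal sketch in plain Python of how a trained regression SVM with a Gaussian kernel evaluates a new point. It assumes the usual parametrisation of the Gaussian kernel by a width `sigma`, and it stores one combined multiplier per support vector in `coeffs`; the function names and the numbers in the usage lines are illustrative assumptions, not taken from the authors' code.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors.

    Assumed parametrisation: exp(-||x - z||^2 / (2 * sigma^2)).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svr_predict(x, support_vectors, coeffs, bias, sigma=1.0):
    """Evaluate f(x) = sum_i coeffs[i] * K(sv_i, x) + bias.

    coeffs[i] is the combined multiplier of the i-th support vector.
    """
    return bias + sum(
        c * gaussian_kernel(sv, x, sigma)
        for c, sv in zip(coeffs, support_vectors)
    )


# Hypothetical usage: each feature vector holds the accuracies observed
# over the first three epochs; all values below are made up.
support_vectors = [[0.20, 0.35, 0.45], [0.10, 0.10, 0.11]]
coeffs = [0.4, -0.3]
estimate = svr_predict([0.22, 0.33, 0.47], support_vectors, coeffs, bias=0.5)
```

Only the support vectors enter the sum, which is why a sparse solution makes each new evaluation cheap.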
Curve fitting is the process of constructing a curve that has the best fit to a series of data points, possibly subject to constraints. In other words, the goal of curve fitting is to find the appropriate parameters that, based on a chosen model, describe experimental data as precisely as possible. The values of such parameters are usually computed by the Least Squares (LS) method, which minimises the square of the error between the original data and the values predicted by the model. In this work, the fitting function is chosen to have the form :
(1) |
where and are the parameters to be computed by non-linear LS.
Our methodology is based on the following procedure: knowing the accuracies over the first epochs of its training process, the accuracy reached by an NN after convergence is predicted thanks to an SVM algorithm, beforehand trained on the created database. This value is considered as an approximation of the objective function of interest, in the case it is neither greater than , nor smaller than the maximum of the initial observed accuracies. Otherwise, the final accuracy of the NN is predicted based on a curve fitting strategy: given the values of the accuracies of the first epochs, the corresponding curve is fitted thanks to the function (1). Once the parameters and are computed, the function is used to predict the value of the accuracy after convergence of the algorithm. It is to be noted that the parameters are subject to the following constraints:
(2) |
where is the maximum among the accuracies over the first epochs, and is the maximum number of epochs that are allowed to be seen by the learning algorithm. In particular, the first constraint forces the curve to reach an accuracy at the final epoch higher than the one already observed during the first epochs.
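The selection rule between the two predictors can be sketched as follows. This is only an illustration under stated assumptions: `svm_predict` and `curve_fit_predict` are hypothetical callables standing in for the trained SVM and the constrained curve fit, and `upper_bound` defaults to 1.0 on the assumption that accuracies are expressed as fractions.

```python
def predict_final_accuracy(early_acc, svm_predict, curve_fit_predict,
                           upper_bound=1.0):
    """Return the SVM prediction when it is plausible, else the curve-fit one.

    early_acc         -- accuracies observed over the first epochs
    svm_predict       -- hypothetical callable: early_acc -> predicted accuracy
    curve_fit_predict -- hypothetical callable: early_acc -> predicted accuracy
    upper_bound       -- assumed largest admissible accuracy
    """
    lower_bound = max(early_acc)  # best accuracy already observed
    svm_value = svm_predict(early_acc)
    if lower_bound <= svm_value <= upper_bound:
        return svm_value
    # The SVM output is implausible: fall back to the fitted curve.
    return curve_fit_predict(early_acc)


# Hypothetical usage with constant stand-ins for the two predictors.
early = [0.30, 0.42, 0.50]
accepted = predict_final_accuracy(early, lambda a: 0.70, lambda a: 0.65)
fallback = predict_final_accuracy(early, lambda a: 0.45, lambda a: 0.65)
```

Passing the curve fit as a callable means the fit is only computed when the SVM prediction is rejected.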
Combining those two techniques for the prediction of the final accuracy makes the prediction process more efficient, which is why we decided not to use SVM alone. Furthermore, the curve fitting method seemed appealing to us insofar as the fitting function can easily be constrained, as explained above. This heuristic choice was then validated by the experiments.
This method can be convenient from a computational point of view because, once the database is created and the SVM is trained, it provides new evaluations of the objective function in a short time. The time taken to create the database must be taken into account; the convenience of such a method increases with the complexity of the problem.
3 Application: from prediction to hyper-parameter optimisation
Being able to predict the accuracy of an NN for a given hyper-parameter setting is particularly advantageous for finding the optimal combination of hyper-parameters. In this section, we apply the method described above to the hyper-parameter optimisation (HO) of a Convolutional Neural Network (CNN). We first describe the state-of-the-art HO approaches. We then present the algorithm we designed to find the optimised hyper-parameters.
3.1 Related works
Two of the most intuitive and widespread methods are grid search and random search [6]. These techniques, however, are not well suited to applications where a given set of hyper-parameters is costly to evaluate. As a result, Sequential Model-Based Optimisation (SMBO) [1] algorithms have been employed in many settings where the performance evaluation of a model is expensive. They approximate the black-box objective function that is to be maximised by a surrogate function that is cheaper to evaluate. At each iteration of the algorithm, the new point where the surrogate is to be evaluated is chosen by maximising a chosen criterion. Several SMBO algorithms have been proposed in the literature; they differ in the criteria by which they optimise the surrogate, and in the way they model the surrogate given the observation history. Two of the most famous SMBO approaches are the Bayesian Optimisation approach [4, 5] and the Tree-structured Parzen Estimator strategy [2].
More recently, new hyper-parameter optimisation methods based on reinforcement learning have emerged [8, 9, 10, 11]. The goal for most of them was to find the neural network (NN) or CNN architectures that are likely to yield optimised performance. Thus, they sought the appropriate architectural hyper-parameters, such as the number of layers or the structure of each convolutional layer, but many other hyper-parameters, such as the learning rate and regularisation parameters, are manually chosen in the end. In any case, while all the above-mentioned strategies aim to evaluate the expensive objective function (which, in the case of an NN or CNN, is the prediction accuracy) as seldom as possible, to the best of our knowledge very few algorithms offer a method to reduce the cost of each evaluation.
3.2 Our method applied to HO
Here, we assume that we are given a CNN, and we seek its optimal hyper-parameters. Basing the optimisation of these hyper-parameters on our above-mentioned method actually implies the choice of other parameters, such as the type of kernel function, or the form of the fitting function (1). However, those new parameters can be chosen much more easily. For instance, SVM for regression is already a very well-known method in the literature, and its results are robust with respect to its hyper-parameters. This is definitely not the case for NNs in general, which makes the choice of their parameters much harder.
We focus on the optimisation of the learning rate, the optimiser and the mini-batch size leading to the best accuracy, but this method could be extended to other hyper-parameters (number of layers, number of neurons in each layer, activation function).
The procedure we propose consists in the following steps. First, a database is created, as above-mentioned, and an SVM algorithm is trained on it. We chose to create this database using 10
of the possible settings of hyper-parameters, randomly selected with uniform probability. Although time-consuming, this preliminary step may lead to a less expensive hyper-parameter optimisation process than the ones already known in the literature.
In our HO approach, each hyper-parameter of interest is represented as a vector containing values in a range chosen by the experimenter. To this vector corresponds a second vector of probabilities, initialised with a uniform distribution. At each iteration of the exploration process, a set of hyper-parameters is randomly chosen, based on the probabilities in the corresponding vectors. The CNN is parameterised with the selected hyper-parameters and trained for a few epochs. Based on its behaviour at the beginning of its learning, and thanks to the methodology detailed in Section 2, the accuracy of the CNN after convergence is predicted; this value is considered as a reward. If the reward at the current iteration is higher than the reward at the previous iteration, the probabilities corresponding to the selected hyper-parameters and their neighbourhood in the vector are increased, while the other ones are penalised. Otherwise, the probabilities of the selected hyper-parameters are penalised and the other ones are increased. Thus, the hyper-parameters are weighted by probabilities that are modified throughout the exploration process, until one value for each hyper-parameter reaches a probability greater than some threshold. Then, our exploration algorithm is considered to have converged.
Finally, the NN is parameterised with the sets of hyper-parameters corresponding to the 10 highest predicted final accuracies, and brought to convergence. The setting leading to the best observed final accuracy is defined as the optimal hyper-parameter setting.
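The exploration loop described above can be sketched in plain Python as follows. The text does not specify the size of the probability updates, the width of the neighbourhood, or the threshold, so the multiplicative factor `step`, the radius of one index and the default `threshold` are assumptions made for illustration. The callable `predict_reward` is a hypothetical stand-in for "train for a few epochs, then predict the final accuracy".

```python
import random


def reweight(probs, idx, factor, radius):
    """Scale the entries within `radius` of idx by `factor`, then renormalise.

    Renormalising moves every other entry in the opposite direction.
    """
    lo, hi = max(0, idx - radius), min(len(probs), idx + radius + 1)
    scaled = [p * factor if lo <= i < hi else p for i, p in enumerate(probs)]
    total = sum(scaled)
    return [p / total for p in scaled]


def explore(space, predict_reward, threshold=0.9, step=1.5,
            max_iter=200, seed=0):
    """Probability-guided search over a dict {name: list of candidate values}."""
    rng = random.Random(seed)
    # One uniform probability vector per hyper-parameter.
    probs = {name: [1.0 / len(vals)] * len(vals)
             for name, vals in space.items()}
    history, prev_reward = [], None
    for _ in range(max_iter):
        # Draw one candidate index per hyper-parameter.
        picked = {name: rng.choices(range(len(vals)), weights=probs[name])[0]
                  for name, vals in space.items()}
        setting = {name: space[name][i] for name, i in picked.items()}
        reward = predict_reward(setting)
        history.append((reward, setting))
        if prev_reward is not None:
            improved = reward > prev_reward
            for name, i in picked.items():
                if improved:
                    # Reinforce the selected value and its neighbours.
                    probs[name] = reweight(probs[name], i, step, radius=1)
                else:
                    # Penalise the selected value only.
                    probs[name] = reweight(probs[name], i, 1.0 / step, radius=0)
        prev_reward = reward
        # Converged once every hyper-parameter has one dominant value.
        if all(max(p) > threshold for p in probs.values()):
            break
    return probs, history


# Hypothetical usage with a toy reward instead of a real CNN.
space = {"learning_rate": [1e-4, 1e-3, 1e-2],
         "batch_size": [32, 64, 128],
         "optimiser": ["SGD", "ADAM"]}


def toy_reward(s):
    penalty = 0.0 if s["optimiser"] == "ADAM" else 0.1
    return 0.9 - 10 * abs(s["learning_rate"] - 1e-3) - penalty


probs, history = explore(space, toy_reward)
# Candidates that would then be trained to convergence.
top10 = sorted(history, key=lambda t: t[0], reverse=True)[:10]
```

For a categorical hyper-parameter such as the optimiser, the notion of neighbourhood depends on the order in which the candidates are listed, so a radius of zero may be preferable there.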
4 Experiments
In order to evaluate and tune our method, we performed different tests. We decided to consider a widely used type of NN, namely a CNN for classification. We picked two well-known datasets, MNIST (http://yann.lecun.com/exdb/mnist/) and CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), and we designed two different networks, each capable of classifying images into one of 10 classes. For the MNIST dataset, the network is composed of an input layer, two sequences of convolutional and max-pooling layers, a fully connected layer and an output layer. We make use of Rectified Linear Unit (ReLU) activations and the dropout technique. For the CIFAR dataset, on the other hand, the CNN is composed of an input layer, four sequences of convolutional and max-pooling layers, four fully connected layers and an output layer, exploiting ReLU activations, batch normalisation, and dropout.
For the training process of the two networks, we chose to include a form of regularisation: early stopping. This technique is used to avoid overfitting when training a learner with an iterative method, such as gradient descent. In particular, if the training of the network does not provide better results after a given number of epochs, the learning process is stopped, because it is considered to be stuck in a minimum and continuing may lead to overfitting.
As already mentioned, we first needed a database to train our SVM on. With the underlying idea of this paper, we were awarded a project that granted us the use of CINECA resources on the Marconi cluster (https://www.cineca.it/en/content/marconi). This grant enabled us to perform all our tests and generate the databases for MNIST and CIFAR. The database consists of a table in which each row corresponds to a full training of a network. We picked the learning rate, the optimiser and the batch size as hyper-parameters, and we stored them along with the accuracies observed at every epoch to create the database. In fact, the number of fully trained networks needed was quite small. In our case, we needed only 44 examples: we used 35 to train the SVM and 9 to test it. Once we had collected the database, we were able to determine which SVM configuration was most suitable for our problem. We tried three different kernels (linear, polynomial and Gaussian) and picked the last one because it outperformed the other two in terms of loss (Mean Squared Error, MSE).
Sample | Ground-truth | Linear | Polynomial | Gaussian |
---|---|---|---|---|
0 | 0.7129 | 0.816350 | 0.854324 | 0.713435 |
1 | 0.1871 | 0.197152 | 0.207737 | 0.175775 |
2 | 0.1000 | 0.200523 | 0.204122 | 0.200774 |
3 | 0.6820 | 0.704606 | 0.599408 | 0.781832 |
4 | 0.4340 | 0.381075 | 0.301048 | 0.456969 |
5 | 0.6369 | 0.572801 | 0.489510 | 0.638199 |
6 | 0.7315 | 0.747803 | 0.682245 | 0.805067 |
7 | 0.1000 | 0.190538 | 0.203211 | 0.175411 |
8 | 0.4783 | 0.410684 | 0.325788 | 0.491291 |
MSE | 0 | 0.04136 | 0.1138 | 0.0320 |
Table 1 reports the predictions of the three kernels on the test set for the CIFAR dataset, based on the first three epochs, together with the corresponding MSE losses. It shows that the Gaussian kernel outperforms the other two.
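The kernel comparison can be re-derived from the nine test rows listed in Table 1 with a few lines of plain Python. This is a sketch for checking the table, not part of the authors' code. Recomputing the last row suggests that the reported figures agree, to the reported precision, with the sum of the squared errors over the nine rows; dividing by nine gives the per-sample mean and leaves the ranking of the kernels unchanged.

```python
# Values copied from Table 1 (rows 0 to 8).
ground_truth = [0.7129, 0.1871, 0.1000, 0.6820, 0.4340,
                0.6369, 0.7315, 0.1000, 0.4783]
predictions = {
    "linear": [0.816350, 0.197152, 0.200523, 0.704606, 0.381075,
               0.572801, 0.747803, 0.190538, 0.410684],
    "polynomial": [0.854324, 0.207737, 0.204122, 0.599408, 0.301048,
                   0.489510, 0.682245, 0.203211, 0.325788],
    "gaussian": [0.713435, 0.175775, 0.200774, 0.781832, 0.456969,
                 0.638199, 0.805067, 0.175411, 0.491291],
}


def squared_error_sum(y_true, y_pred):
    """Sum of squared differences between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Sum of squared errors per kernel, and its mean over the test rows.
sse = {k: squared_error_sum(ground_truth, v) for k, v in predictions.items()}
mse = {k: v / len(ground_truth) for k, v in sse.items()}

# Kernel with the smallest error.
best_kernel = min(sse, key=sse.get)
```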
We also ran some tests to understand which features were most important to feed to the SVM predictor and, given the results, finally chose to feed the method only with the epoch accuracies. Regarding the curve fitting, we also tried several functions in order to find the one that produces the best fit, which led to the function (1). Figure 1 shows a graphical example of the prediction made by the SVM and the curve fitting. In the chart, the full dots represent the real values of the accuracies, epoch by epoch; the star at the final epoch is the SVM prediction, while the line is the curve obtained with the fitting. For both SVM and curve fitting, only the accuracies of the first three epochs are used to predict the accuracy at the final epoch.
Figure 2 shows the results of our method applied to the prediction of the final accuracy of the CNN for the CIFAR-10 dataset. In this case, we use only the accuracies of the first two or four epochs of training to predict the final one. The orange line represents the prediction made by our method, the blue one the ground truth (i.e. the network is fully trained up to convergence with the same hyper-parameters). As can be seen, the method can effectively provide a satisfactory prediction of the final behaviour of the network after only a few epochs. Encouraged by those results, we applied our strategy to a challenging topical case: automatically tuning the hyper-parameters of a network, with the procedure explained in Section 3. We performed this further test using the first two epochs to predict the final one. After 200 iterations, we were able to find the parameters that lead to the best final accuracy, confirmed by a real full convergence. The hyper-parameters provided by our method are: a learning rate of 0.0008425, a mini-batch size of 128, and the ADAM optimiser.
The source code can be found at https://git.hipert.unimore.it/mverucchi/optics.
5 Conclusion
We proposed and implemented a novel approach to predict the final behaviour of a learning process. This new method exploits both SVM and curve fitting to foresee the resulting accuracy of a long training process, using only some initial steps. We applied this technique to a CNN, in order to quickly understand whether the training of the network will end well or badly. The results show that the predictions achieved with our technique are quite similar to the ground truth, and confirm that this strategy can be of particular interest in the hyper-parameter optimisation domain. In future work, we will focus on a more complete procedure to automatically tune hyper-parameters of a network, such as the number of layers or the activation function, exploiting our SVM-curve-fitting predictor.
Acknowledgements
The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under the CLASS Project (https://class-project.eu/), grant agreement n .
This work was also partially supported by INdAM-GNCS (Research Projects 2018).
Furthermore, it was partially supported by INdAM Doctoral Programme in Mathematics and/or Applications Cofunded by Marie Sklodowska-Curie Actions (INdAM-DP-COFUND-2015) whose grant number is .
References
- [1] F. Hutter, H. Hoos, and K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration. In LION-5, 2011. Extended version as UBC Tech report TR-2010-10.
- [2] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, Algorithms for hyper-parameter optimization. In NIPS. 2011.
- [3] D.R. Jones, A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345–383, 2001.
- [4] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams and N. de Freitas, Taking the human out of the loop: A review of bayesian optimization, Proc. IEEE 104(1) (2016) 148–175.
- [5] J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. In L.C.W. Dixon and G.P. Szego, editors, Towards Global Optimization, volume 2, pages 117–129. North Holland, New York, 1978
- [6] J. Bergstra and Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res. 13(1) (2012) 281–305
- [7] J. Snoek, H. Larochelle and R. P. Adams, Practical Bayesian optimization of machine learning algorithms, Adv. Neural Inf. Process. Syst. 25 (2012) 2951–2959.
- [8] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning, Int. Conf. Learning Representations, Toulon, France, 2017, pp. 1–16.
- [9] B. Baker, O. Gupta, N. Naik and R. Raskar, Designing neural network architectures using reinforcement learning, Int. Conf. Learning Representations, Toulon, France, 2017, pp. 1–18.
- [10] Z. Zhong, J. Yan, W. Wei, J. Shao and C.-L. Liu, Practical block-wise neural network architecture generation, Conf. Computer Vision and Pattern Recognition, Salt Lake City, Utah, USA, 2018, arXiv preprint: 1708.05552.
- [11] H. Cai, T. Chen, W. Zhang, Y. Yu and J. Wang, Efficient architecture search by network transformation, AAAI Conf. Artificial Intelligence, New Orleans, Louisiana, USA, 2018, pp. 2787–2794
- [12] O. Chapelle and V. Vapnik, Model Selection for Support Vector Machines. In Advances in Neural Information Processing Systems, Vol 12, (1999)
- [13] Sandra Lach Arlinghaus, PHB Practical Handbook of Curve Fitting. CRC Press, 1994.