Embedded hyper-parameter tuning by Simulated Annealing

06/04/2019
by Matteo Fischetti, et al.

We propose a new metaheuristic training scheme that combines Stochastic Gradient Descent (SGD) and Discrete Optimization in an unconventional way. Our idea is to define a discrete neighborhood of the current SGD point containing a number of "potentially good moves" that exploit gradient information, and to search this neighborhood using a classical metaheuristic scheme borrowed from Discrete Optimization. In the present paper we investigate the use of a simple Simulated Annealing (SA) metaheuristic that accepts or rejects a candidate new solution in the neighborhood with a probability that depends both on the quality of the new solution and on a parameter (the temperature) that is modified over time to lower the probability of accepting worsening moves. We use this scheme as an automatic way to perform hyper-parameter tuning, hence the title of the paper. A distinctive feature of our scheme is that hyper-parameters are modified within a single SGD execution (and not in an external loop, as is customary) and evaluated on the fly on the current minibatch, i.e., their tuning is fully embedded within the SGD algorithm. The use of SA for training is not new, but previous proposals were mainly intended for non-differentiable objective functions for which SGD cannot be applied due to the lack of gradients. On the contrary, our SA method requires differentiability of (a proxy of) the loss function, and leverages the availability of a gradient direction to define local moves that have a high probability of improving the current solution. Computational results on image classification (CIFAR-10) are reported, showing that the proposed approach improves the final validation accuracy for modern Deep Neural Networks such as ResNet34 and VGG16.
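The NumPy sketch below illustrates the general mechanism on a toy logistic-regression problem; it is not the authors' code. The choice of the learning rate as the tuned hyper-parameter, the multiplicative neighborhood `lr_moves`, the geometric cooling schedule, and all numerical constants are illustrative assumptions. Only the overall structure follows the scheme described in the abstract: gradient-based candidate moves, acceptance or rejection on the current minibatch, and a temperature that is lowered over time.

```python
# Minimal sketch (assumptions marked below): SGD on a toy logistic-regression
# problem where the learning rate is tuned on the fly by Simulated Annealing.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data as a stand-in for a real dataset
# (the paper uses CIFAR-10 with ResNet34/VGG16).
X = rng.normal(size=(1000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + 0.1 * rng.normal(size=1000) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def minibatch_loss(w, xb, yb):
    # Logistic loss on one minibatch: the "solution quality" used by SA.
    p = sigmoid(xb @ w)
    eps = 1e-12
    return -np.mean(yb * np.log(p + eps) + (1.0 - yb) * np.log(1.0 - p + eps))

def minibatch_grad(w, xb, yb):
    return xb.T @ (sigmoid(xb @ w) - yb) / len(yb)

w = np.zeros(20)                 # current SGD point
lr = 0.1                         # hyper-parameter being tuned (assumed choice)
lr_moves = [0.5, 1.0, 2.0]       # discrete multiplicative neighborhood (assumed)
T, cooling = 1.0, 0.999          # SA temperature and its decay (assumed)
batch_size, n_steps = 32, 2000

for step in range(n_steps):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]

    # Candidate move: perturb the hyper-parameter, then take a gradient step,
    # so the move exploits gradient information.
    cand_lr = lr * rng.choice(lr_moves)
    cand_w = w - cand_lr * minibatch_grad(w, xb, yb)

    # Evaluate old and new solutions on the *current* minibatch (on the fly).
    delta = minibatch_loss(cand_w, xb, yb) - minibatch_loss(w, xb, yb)

    # Metropolis acceptance: improvements are always accepted; worsening moves
    # are accepted with probability exp(-delta / T), which shrinks as T cools.
    if delta <= 0 or rng.random() < np.exp(-delta / T):
        w, lr = cand_w, cand_lr

    T *= cooling                 # lower the temperature over time

print("final loss:", minibatch_loss(w, X, y), "final lr:", lr)
```

Note that the accept/reject test runs inside the single training loop, not in an outer tuning loop, which is the "embedded" aspect the abstract emphasizes.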
