In many machine learning situations we want to find parametersminimizing some objective function averaged over given (size ) dataset:
where is objective function for -th object of dataset. We would like to minimize through gradient descent, however, dataset can be very large, and calculation of
is often relatively costly, for example in backpropagation of neural network. Therefore, we would like to work on gradients calculated from succeeding subsets of dataset (mini-batches), or even single objects (original SGD). Additionally, objective function can have some intrinsic randomness.
In stochastic gradient descent (SGD) framework , we can ask for some approximation of gradient for a chosen point in successive times :
where corresponds to average over succeeding subset (mini-batch) or even a single element - we can assume that averages to the real gradient over some time.
Hence we would like to exploit statistical trends from sequence to optimize evolution of - providing fast convergence to local minimum. As calculation of is relatively costly and such convergence is often very slow, improving it with more accurate (and costly) modelling might provide significant benefits especially for deep learning applications.
A natural basic approach is working on momentum : evolve toward some (e.g. exponential moving) average of . There are also many other approaches, for example popular ADAM  estimating both average gradient and coordinate-wise squared gradient.
Such standard simple methods leave two basic opportunities for improved exploitation of statistical trends from sequence, discarding valuable information this sequence contains. We would like to practically use them here:
They do not try to estimate distance from local extremum (), which is suggested in evolution of gradients, and could allow to optimize the crucial choice of step size. A direct approach to do it is estimating second derivative in the chosen direction, however, it is problematic due to knowing only noisy stochastic gradients. We will focus on indirect approach instead: parametrize local directional behavior with second order polynomial (of known extremum), and update its parameters to improve agreement with calculated .
They focus only on tracing a single most promising direction, erasing information about the remaining ones. In high dimensional optimization problems it might be worth to simultaneously trace multiple statistically promising directions from the
sequence, especially near saddles - problematic and frequent in deep learning. Intuitively, calculating PCA: covariance matrix of approximated gradients using e.g. exponential moving average, beside single dominant eigenvector, there are often multiple eigenvectors with close eigenvalues - also worth to be included in considerations.
Both these opportunities could be exploited if trying to model the entire Hessian, but it is usually much too costly: requiring memory and computations in every step. However, we can focus on some chosen number of looking most promising directions (as modelled eigenvectors), reducing this memory and computational cost to . There are also numerical problems with inverting this Hessian, especially in stochastic setting and close to zero eigenvalues. This article propose to use local parametrization to prevent them: containing necessary information for efficient evolution of , and simple enough for inexpensive update for agreement with newly calculated gradients.
Such considered dimensions can vary (in rotation and size) throughout the optimization to find the best (local) tradeoffs between modelling cost and convergence improvement. For we should get similar method as using only momentum, but additionally trying to estimate distance from extremum in considered direction: attracting if minimum, repelling if maximum, with step size depending on this estimated distance. For larger we perform such behavior in simultaneously directions, e.g. attracting in some directions and repulsing in others near modelled saddle. Finally, for we should get modelling of the entire Hessian, without requirement of inverting it.
Optimizing local parametrization should repair issues of other methods trying to exploit restricted Hessian, e.g.:
L-BFGS  uses differences of gradients to approximate Hessian. However, they are very noisy in stochastic setting. Statistical trends can be extracted from this noise with slowly evolving parameters of local approximation,
TONGA  trying to recreate non-centered covariance matrix of with exponential moving average. Such averaging does not include evolution of gradients with position, what is included in evolution of parameters of local approximation.
K-FAC  uses block approximation of Hessian - what is usually much more costly, and the real gradients can mix such blocks, what is excluded in such models.
The current version of this article describes work in progress: general concepts leaving large freedom. The difficult choice of details is planned to be explored in future.
Ii Local selective second order parametrization
We would like to model nearly orthonormal locally promising directions (: identity matrix):
as approximations of eigenvectors of Hessian of objective function in current point (upper index enumerates step):
We also need to parametrize first order terms - for convenience of direct optimization, we will use positions of extrema in these directions: denote their coordinates as . Also absolute height should be parameterized , however, it can be neglected if working only on gradients (unless exploiting values of objective function). Finally, in time , the used model of local behavior of objective function around has the following parametrisation and gradient:
The directions should describe recently statistically dominating trends in sequence. Then , parameters describe parabola in -th direction . Assuming objective function indeed decreases on considered sequence, decrease of gradients in direction suggests approaching its local minimum . In contrast, increase of gradients in suggests going away from its local maximum . We could also jump between opposite slopes of a valley. All such scenarios can be included if fitting parabola: estimating , parameters describing second order behavior in given direction. Based on approximation they provide, we can attract toward minimum or repulse from maximum, with step size depending on estimated distance and its uncertainty.
Ii-a Concepts for parametrization update
In every step we would like to update parametrization to reduce residue : difference between calculated gradient and gradient of current parametrization:
Let us decompose it into currently considered and neglected directions:
For orthonormal set of vectors, we would have : that gradient in not considered directions is treated as residue. As our parametrization completely neglects this behavior, it needs a special treatment - we will generally assume that rotation is around its central point and slow enough that it itself does not require modification of parabola, which will be separately updated.
Finally, residue reducing parameter optimization step should be split into a few concurrent tasks discussed later:
should be mainly used to update shape of parabola: in direction , to improve gradients agreement in these directions: .
The part is completely neglected in currently considered directions: should cause a tiny rotation of the entire toward this direction. Hence, multiple appearance of such new statistically significant direction should rotate to include it, at cost of recently less frequent directions.
The term should be also used for internal rotation in subspace spanned by to improve diagonality of Hessian in this basis: as .
The applied modification of considered vectors need to maintain approximate orhtonormality: .
There might be also other criteria worth to be included in minimization step, like reduction of value using current parametrization (lowest local valley) considered in the first version of this article. However, we should also have in mind negative eigenvalues, especially close to a saddle - this reduction could be optimized only over directions with positive . For simplicity it is omitted in this version of article.
Ii-B Proper optimization step
After reducing residue with parameter update step (we will discuss in the following sections), we can perform the proper optimization step for based on current parametrization, for example for some parameter describing confidence in the parametrization:
Thanks to considering the sign of estimated eigenvalue, it has separate behaviors for attracting and repulsive directions:
In attractive directions it reduces distance to estimated minimum times, allowing for exponential decrease of distance if its estimated position would remain unchanged.
In repulsive directions we would analogously get exponential increase of distance.
It might be worth to use more sophisticated update step, e.g. replacing sign with some activation function like: smooth and being 0 for , it additionally can be asymmetric: e.g. giving stronger repulsion than attraction.
We can also try to evaluate uncertainty of parametrization, especially of , e.g. from exponential weighted average of squared updates of this value. Then directions with higher than average uncertainty could use lower step, or higher step for lower uncertainty.
There remains question of initialization of parameters. Choosing
depends on the problem, for neural networks is usually made in a random way, preferably using probability distribution (maybe also correlations) appearing in a given type of network. A natural initial choice for eigenvalues is, of positions to make in the modelled extremum: . However, adding some bias at the beginning due to no information might be worth considering.
The choice of initial orthonormal directions seems difficult, but can be made completely random if eigenvalues are initialized with 0 - they will be learned from the first gradients, it might be worth increasing learning rates and decreasing confidence in parametrization in the beginning, e.g. by starting with large uncertainty.
Alternatively, we can try to start with considered directions, for example using the first calculated gradient (maybe on a larger mini-batch), or starting with some number of steps of standard e.g. momentum method choosing a single looking promising direction .
Then we can gradually increase the number of considered directions . Using current directions, we can average calculated gradients with subtracted the already used components:
finally normalizing it while including in the considered directions and increasing . We can also try to automatically reduce , for example when some eigenvalue remains close to zero.
Iii Details of parametrization update
The most difficult questions and choices of the discussed method regard update of parametrization, especially that its choice might be differently optimized for various tasks.
Iii-a Optimizing and for distance estimation
For each chosen direction , we would like to update and to reduce contribution of residue in this direction:
Minimization of gives minus gradient:
Suggesting to use gradient descent for some :
However, as we know the desired value: , we can try to optimize the step in analogy to Newton’s method. In linear approximation, (12) will reduce by . It suggests to use instead:
were would correspond to as in Newton’s method. In SGD should be chosen correspondingly to representativeness of mini-batch in the entire dataset.
The discussed gradient descent assumes that changes of and
have the same weights, modifying this balance might allow for additional optimization. Alternative approach to optimize these parameters, based on least square linear regression, is presented in Appendix.
Iii-B Rotating considered subspace
Due to complete negligence of all but modelled directions, there are required some assumptions of behavior in the remaining directions, like that slow rotation of the basis does not require modification of , parameters. Alternative choice might be e.g. slow reduction of during such rotation, however, it would lead to artificial reductions for oscillations.
Subspace rotation should only use residue outside the considered directions: . A basic choice of corresponding update step is just adding to each vector for some tiny .
It might be worth to include dependence on eigenvalue, e.g. adding to for some small positive . It would make high curvature directions more conservative, however, would cause neglecting long gentle slopes.
Adding to all and orthonormalizing would rotate the entire subspace toward the new vector, with strength proportional to its length - causing rotation toward steep slopes, which should correspond to large eigenvalues. It again brings question of customizations also focusing on gentle slopes. Additional question regards handling oscillations between opposite slopes of a valley - they should be rare in presented approach, if we would like to also learn these directions we can for example add : multiplied by for some of directions.
Iii-C Internal basis rotation
Part of gradient in the considered basis: should agree with the modelled behavior using diagonalized Hessian: . While we have discussed , update separately, we should also update internal rotation of this basis to maintain diagonalization, reduce possible non-diagonal terms.
A natural tool for this purpose is reducing residue in this subspace :
We would like to update in this direction. Its first term corresponds to shifting directions toward the residue. Its second term corresponds to modification of previous parametrization, and might be worth weakening or completely removing.
A basic update choice, for some parameter , is (rotation of subspace is neglected here for simplicity):
As for , , in analogy to Newton’s method, we can use the fact that we know the desired value: perfect agreement for . In linear approximation neglecting second (14) term, (15) will reduce by , suggesting to use update step:
for some describing how representative are chosen mini-batches for the entire sample, which can be chosen as from (13).
Iii-D Maintaining orthonormality
The modification of vectors should maintain approximate orthonormality: , what suggests to modify by adding satisfying . However, such update would maintain orthonormality only for infinitesimal steps.
A small divergence from orthonormality is not a problem for behavior of discussed approach, however, it can accumulate to problematic situations like gluing two considered directions. Hence, for example every some number of steps, there should be performed orthonormalization of the considered vectors. A standard choice is Gram-Schmidt, but it depends on order of vectors. As perfect orthonormality is not required, we can instead use approximated but symmetric orthonormalization step:
Iv Algorithm example
This section summarizes a basic choice of algorithm (simplified and neglecting upper time index).
Initialization - choose:
number of considered directions, can grow from initial ,
describing confidence in parametrization, can be reduced in the beginning,
describing confidence in used stochastic gradients, representativeness of used mini-batch,
tiny describing speed of exploration of new directions,
- initial parameters, e.g. chosen randomly using probability distribution of parameters of given type of network,
choose - for example a random size orthonormal set of vectors.
Then until some convergence condition do optimization step (for all indexes ):
Calculate stochastic gradient in current ,
Calculate decomposed residue:
Update , defining shapes of parabolas (can be added modelling of uncertainty especially of ):
Update - explore new directions from , and use to improve diagonal form of Hessian:
Orthonormalize (not necessarily in every step):
Update - attract to local directional minima , repulse from maxima :
The most difficult part, also leaving options for customizations, is updating directions . There was omitted here second term in formula, which might turn out crucial. The subspace rotation term can be reduced for large to stabilize steep directions, adding larger flexibility to those close to 0.
V Conclusions and further work
There was presented a general concept for exploiting opportunities missed by standard SGD convergence methods: trying to simultaneously model multiple promising directions and estimating distances from their local extrema. While optimizing the details will require a lot of experimental work (and is task dependent), the potential benefits from convergence improvements might be significantly larger than the cost of such more sophisticated modelling, especially for deep learning models.
There are a lot of details to choose from, like used mini-batches, hyperparameters (can be evolving in time), optimization criteria. Their choice might depend on the task and finally should be made automatically - what is planned for future work on concrete examples.
The discussed approach updates parametrization exploiting only calculated gradients. Calculating more information to improve this update might be also worth considering, like value of objective function, or some second derivatives, e.g. in the directions.
The case might be also worth considering: trying to model the entire Hessian, but not directly from second derivatives or differences of gradients, but by slowly updating parametrization being local approximation, trying to maintain diagonalized form. Parametrization should better handle stochasticity, extract statistical trends.
A tempting simplification is putting eigenvalues into eigenvectors: work on instead, what would allow to remove eigenvalues from considerations and enforcement of unitary vectors. They were separated for better presentation, and to separate and rotational optimization.
To estimate , parameters, alternative approach might be based on least squares linear regression.
Let us start with one-dimensional static situation: basing on sequence, we would like to find parameters of approximation of objective function, to minimize distances, let say MSE: .
Necessary condition of zeroing derivatives becomes:
Their solution is (least squares linear regression):
However, objective function is not exactly parabola - should be only locally approximated this way. Additionally, considered directions can rotate.
Above least-squares regression would be suitable if directions were fixed and objection function was just a paraboloida. Generally, the four averages (18) should be calculated for all considered coordinates: to be used in (19) separately for coordinates. They should also be calculated in adaptive way, e.g. using exponential moving average:
and analogously for . While it might be the most accurate approach, it can also lead to issues like zeroing of denominator.
-  H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers. Springer, 1985, pp. 102–109.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” nature, vol. 323, no. 6088, p. 533, 1986.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  D. C. Liu and J. Nocedal, “On the limited memory bfgs method for large scale optimization,” Mathematical programming, vol. 45, no. 1-3, pp. 503–528, 1989.
-  N. L. Roux, P.-A. Manzagol, and Y. Bengio, “Topmoumoute online natural gradient algorithm,” in Advances in neural information processing systems, 2008, pp. 849–856.
-  J. Martens and R. Grosse, “Optimizing neural networks with kronecker-factored approximate curvature,” in International conference on machine learning, 2015, pp. 2408–2417.