A block coordinate descent optimizer for classification problems exploiting convexity

06/17/2020
by Ravi G. Patel, et al.

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface compared to gradient-based approaches. We introduce a coordinate descent method for training deep neural networks on classification tasks that exploits the global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of the hidden layers as providing an adaptive basis and of the linear layer as providing an optimal fit of that basis to the data. By alternating between a second-order method that finds globally optimal parameters for the linear layer and gradient descent that trains the hidden layers, we ensure an optimal fit of the adaptive basis to the data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer, not with the depth and width of the hidden layers; furthermore, the approach applies to arbitrary hidden-layer architectures. Previous work applying this adaptive-basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of that approach to classification problems. We first prove that the resulting Hessian matrix is symmetric positive semi-definite and that the Newton step realizes a global minimizer. By studying the classification of manufactured two-dimensional point-cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained with NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting possible gains of second-order methods over gradient descent.
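
To make the alternating scheme concrete, here is a minimal illustrative sketch in PyTorch of the idea the abstract describes: a Newton solve for the convex cross-entropy subproblem in the linear-layer weights, alternated with gradient descent on the hidden layers. This is not the authors' implementation; the toy data, network sizes, iteration counts, damping term, and the `newton_step` helper are all assumptions made for the example.

```python
# Hedged sketch of the hybrid Newton/Gradient Descent (NGD) idea; all details below are assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy 2-D point-cloud data with 3 classes (stand-in for the manufactured data in the paper).
N, D_in, K = 300, 2, 3
X = torch.randn(N, D_in)
y = torch.randint(0, K, (N,))

# Hidden layers (the "adaptive basis") and a separate linear classification layer.
hidden = torch.nn.Sequential(torch.nn.Linear(D_in, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, 16), torch.nn.Tanh())
D = 16                                  # basis dimension produced by `hidden`
W = torch.zeros(K, D)                   # linear-layer weights (bias omitted for brevity)

opt = torch.optim.SGD(hidden.parameters(), lr=1e-2)
damping = 1e-6                          # small Tikhonov term; the Hessian is only semi-definite

def newton_step(Phi, y, W):
    """One damped Newton step for the convex multinomial cross-entropy in W, with the basis Phi fixed."""
    P = torch.softmax(Phi @ W.T, dim=1)                      # (N, K) class probabilities
    Y = F.one_hot(y, K).float()
    g = ((P - Y).T @ Phi) / len(y)                           # gradient, shape (K, D)
    S = torch.diag_embed(P) - P[:, :, None] * P[:, None, :]  # (N, K, K) per-sample curvature
    H = torch.einsum('ni,nj,nkl->kilj', Phi, Phi, S).reshape(K * D, K * D) / len(y)
    step = torch.linalg.solve(H + damping * torch.eye(K * D), g.reshape(-1))
    return W - step.reshape(K, D)

for epoch in range(50):
    # (1) Newton solve for the linear layer on the current basis.
    with torch.no_grad():
        Phi = hidden(X)
        for _ in range(5):                                   # a few Newton iterations on the convex subproblem
            W = newton_step(Phi, y, W)
    # (2) Gradient descent on the hidden layers with the linear layer frozen.
    opt.zero_grad()
    loss = F.cross_entropy(hidden(X) @ W.T, y)
    loss.backward()
    opt.step()

print("final training loss:", float(loss))
```

The damping term is an assumption of this sketch: it keeps the linear solve well-posed because, as stated above, the Hessian of the linear-layer subproblem is only guaranteed to be positive semi-definite. Note also that the Hessian here is (K * D) x (K * D), depending only on the linear layer, regardless of how deep or wide `hidden` is.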

Related research

Research of Damped Newton Stochastic Gradient Descent Method for Neural Network Training (03/31/2021)
First-order methods like stochastic gradient descent (SGD) are recently t...

Krylov Subspace Descent for Deep Learning (11/18/2011)
In this paper, we propose a second order optimization method to learn mo...

First-order and second-order variants of the gradient descent: a unified framework (10/18/2018)
In this paper, we provide an overview of first-order and second-order va...

Component-Wise Natural Gradient Descent – An Efficient Neural Network Optimization (10/11/2022)
Natural Gradient Descent (NGD) is a second-order neural network training...

SGD momentum optimizer with step estimation by online parabola model (07/16/2019)
In stochastic gradient descent, especially for neural network training, ...

A fast point solver for deep nonlinear function approximators (08/30/2021)
Deep kernel processes (DKPs) generalise Bayesian neural networks, but do...

LocoProp: Enhancing BackProp via Local Loss Optimization (06/11/2021)
We study a local loss construction approach for optimizing neural networ...
