Deep Gradient Boosting
Stochastic gradient descent (SGD) has been the dominant optimization method for training deep neural networks due to its many desirable properties. One of the more remarkable and least understood qualities of SGD is that it generalizes relatively well to unseen data even when the neural network has millions of parameters. In this work, we show that SGD is an extreme case of deep gradient boosting (DGB) and as such is intrinsically regularized. The key idea of DGB is that back-propagated gradients calculated using the chain rule can be viewed as pseudo-residual targets. Thus, at each layer the weight update is obtained by solving the corresponding gradient boosting problem. We hypothesize that some learning tasks can benefit from a less strict regularization requirement, and this approach provides a way to control it. We test this hypothesis on a number of benchmark data sets and show that DGB indeed outperforms SGD in a subset of cases, while under-performing on tasks that are more prone to over-fitting, such as image recognition.
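The sketch below is a minimal NumPy illustration of one reading of the abstract's idea, not the paper's exact algorithm: the back-propagated gradient at a layer's output is treated as a pseudo-residual target, and the layer's weight update is obtained by fitting that target from the layer's input via a regularized least-squares (gradient boosting style) problem. The function names and the ridge parameter `eps` are assumptions introduced here for illustration; as the regularization grows, the update collapses to a rescaled SGD step, which is consistent with the claim that SGD is an extreme case of DGB.

```python
# Illustrative sketch only; names (sgd_update, dgb_update, eps) are hypothetical.
import numpy as np

def sgd_update(X, G, lr=0.1):
    """Standard SGD step for a linear layer Y = X @ W.
    X: (batch, d_in) layer inputs.
    G: (batch, d_out) back-propagated gradients w.r.t. the layer outputs,
       i.e. the pseudo-residual targets."""
    return -lr * X.T @ G  # gradient of the loss w.r.t. W

def dgb_update(X, G, lr=0.1, eps=1e-2):
    """DGB-style step as sketched here: fit the pseudo-residuals -G from the
    layer inputs with a ridge-regularized least-squares problem, i.e. solve
    (X.T X + eps * I) dW = -X.T G for the weight update dW."""
    d_in = X.shape[1]
    A = X.T @ X + eps * np.eye(d_in)
    return lr * np.linalg.solve(A, -X.T @ G)

# As eps grows, (X.T X + eps*I)^(-1) approaches (1/eps) * I, so the DGB step
# reduces to a rescaled SGD step: SGD as the "extreme case" of DGB.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
G = rng.normal(size=(32, 4))
big_eps = 1e6
diff = dgb_update(X, G, lr=1.0, eps=big_eps) * big_eps - sgd_update(X, G, lr=1.0)
print("max deviation from SGD step at large eps:", np.abs(diff).max())  # close to 0
```

Under this reading, the `eps`-like regularization term is what would let DGB interpolate between a fully fitted boosting step and the implicitly regularized SGD step, matching the abstract's point about controlling how lax the regularization is.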