On the benefits of non-linear weight updates

07/25/2022
by Paul Norridge, et al.

Recent work has suggested that the generalisation performance of a DNN is related to the extent to which the Signal-to-Noise Ratio (SNR) is optimised at each of its nodes. However, Gradient Descent methods do not always lead to SNR-optimal weight configurations. One way to improve SNR performance is to suppress large weight updates and amplify small ones. Such balancing is already implicit in some common optimizers, but we propose an approach that makes it explicit: a non-linear function is applied to the gradients before the DNN parameter updates are made. We investigate the performance of such non-linear approaches. The result is an adaptation to existing optimizers that improves performance on many problem types.
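The abstract does not specify which non-linear function the authors use, but the stated goal (suppress large updates, amplify small ones) can be illustrated with an odd power function with exponent below one: for gradient components smaller than 1 in magnitude it increases them, and for larger components it shrinks them. The sketch below is a hypothetical illustration of that idea, not the paper's actual method; the function and exponent are assumptions.

```python
import numpy as np

def nonlinear_transform(grad, exponent=0.5):
    """Illustrative non-linear gradient transform (assumed, not from the paper).

    Applies sign(g) * |g|**exponent element-wise. With 0 < exponent < 1,
    components with |g| < 1 are amplified and components with |g| > 1 are
    suppressed, balancing large and small weight updates.
    """
    return np.sign(grad) * np.abs(grad) ** exponent

def sgd_step(weights, grad, lr=0.01, exponent=0.5):
    """A plain SGD step with the transform applied to the gradient first."""
    return weights - lr * nonlinear_transform(grad, exponent)

grads = np.array([4.0, 0.25, -4.0])
print(nonlinear_transform(grads))  # [ 2.   0.5 -2. ]
```

The transform leaves the sign of each gradient component unchanged, so the update direction per coordinate is preserved; only the relative magnitudes are rebalanced, which is the behaviour the abstract attributes to the method.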


