Effect of Activation Functions on the Training of Overparametrized Neural Nets

08/16/2019
by Abhishek Panigrahi, et al.

It is well known that overparametrized neural networks trained with gradient-based methods quickly achieve small training error under appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. The limiting case, in which the network size approaches infinity, has also been considered. These results either assume that the activation function is ReLU or they crucially depend on the minimum eigenvalue of a certain Gram matrix determined by the data, the random initialization, and the activation function. In the latter case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds. On the empirical side, a contemporary line of investigation has proposed a number of alternative activation functions that tend to perform better than ReLU, at least in some settings, but no clear understanding has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of the activation function on the training of highly overparametrized 2-layer neural networks. We show that for smooth activations, such as tanh and swish, the minimum eigenvalue can be exponentially small depending on the span of the dataset, implying that training can be very slow. In contrast, for activations with a "kink," such as ReLU, SELU, and ELU, all eigenvalues are large under minimal assumptions on the data. Several new ideas are involved. Finally, we corroborate our results empirically.
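As a rough numerical companion to this claim, the sketch below estimates the minimum eigenvalue of the Gram matrix for a 2-layer network under two activations. It assumes the standard 2-layer form H_ij = E_{w ~ N(0, I)}[sigma'(w . x_i) sigma'(w . x_j)] <x_i, x_j>, estimated by Monte Carlo over random Gaussian weights; the toy dataset, sample counts, and function names are illustrative choices, not the paper's experimental setup.

# Hypothetical illustration (not the paper's code): Monte Carlo estimate of the
# minimum eigenvalue of the 2-layer Gram matrix
#     H_ij = E_{w ~ N(0, I)} [ sigma'(w . x_i) * sigma'(w . x_j) ] * (x_i . x_j)
# for a ReLU-style activation versus a smooth one (tanh).
import numpy as np

def min_eig_gram(X, act_deriv, num_samples=50_000, seed=0):
    """Estimate lambda_min(H) for data X (n x d) given the activation's derivative."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_samples, X.shape[1]))  # random first-layer weights w_k
    S = act_deriv(W @ X.T)                              # S[k, i] = sigma'(w_k . x_i)
    H = (S.T @ S / num_samples) * (X @ X.T)             # entrywise product with <x_i, x_j>
    return np.linalg.eigvalsh(H)[0]                     # eigvalsh returns ascending eigenvalues

# Illustrative toy data: 20 unit-norm points confined to a 4-dimensional
# subspace of R^40, loosely echoing the "small span" regime discussed above.
rng = np.random.default_rng(1)
basis = np.linalg.qr(rng.standard_normal((40, 4)))[0]   # orthonormal basis of the subspace
X = rng.standard_normal((20, 4)) @ basis.T
X /= np.linalg.norm(X, axis=1, keepdims=True)

relu_deriv = lambda z: (z > 0).astype(float)            # derivative of ReLU (the "kink")
tanh_deriv = lambda z: 1.0 - np.tanh(z) ** 2            # derivative of tanh (smooth)

print("estimated lambda_min, ReLU:", min_eig_gram(X, relu_deriv))
print("estimated lambda_min, tanh:", min_eig_gram(X, tanh_deriv))

For data of small span such as this, the result summarized above predicts the tanh estimate to be dramatically smaller than the ReLU one; note that Monte Carlo error limits how small an eigenvalue such a probe can resolve, so it is a sanity check rather than a verification of the exponential bound.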


Related research

09/10/2022  APTx: better activation function than MISH, SWISH, and ReLU's variants used in deep learning
Activation Functions introduce non-linearity in the deep neural networks...

09/09/2021  ErfAct and PSerf: Non-monotonic smooth trainable Activation Functions
An activation function is a crucial component of a neural network that i...

05/21/2018  On the Selection of Initialization and Activation Function for Deep Neural Networks
The weight initialization and the activation function of deep neural net...

05/24/2022  Imposing Gaussian Pre-Activations in a Neural Network
The goal of the present work is to propose a way to modify both the init...

05/30/2023  Benign Overfitting in Deep Neural Networks under Lazy Training
This paper focuses on over-parameterized deep neural networks (DNNs) wit...

11/15/2022  Characterizing the Spectrum of the NTK via a Power Series Expansion
Under mild conditions on the network initialization we derive a power se...

03/05/2018  How to Start Training: The Effect of Initialization and Architecture
We investigate the effects of initialization and architecture on the sta...
