Sharper analysis of sparsely activated wide neural networks with trainable biases

01/01/2023
by   Hongru Yang, et al.

This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, unlike in previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results characterizes the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can converge as fast as the original network. Beyond prior work, not only are the biases updated by gradient descent in this setting, but a finer analysis also improves the width required to keep the network close to its NTK. Secondly, a generalization bound for the trained network is provided. A width-sparsity dependence is established, which yields a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analyses (up to logarithmic factors). As a by-product, if the biases are initialized to zero, the width requirement improves on the previous bound for shallow networks' generalization. Lastly, since the generalization bound depends on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies that smallest eigenvalue. Surprisingly, although trainable biases are not shown to be necessary, they help identify a nice data-dependent region in which a much finer analysis of the NTK's smallest eigenvalue can be conducted; this leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.
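To make the setup concrete, the sketch below illustrates the kind of model the abstract describes: a one-hidden-layer ReLU network of width m with 1/sqrt(m) output scaling, biases that are trainable and initialized to a constant beta rather than zero, and full-batch gradient descent on the squared loss. This is a minimal illustration under assumed standard NTK-style conventions, not the paper's code; the random pruning mask stands in for the paper's sparsification scheme, the fixed random +/-1 output weights, the data, and all hyperparameters (m, beta, lr, sparsity) are illustrative choices, and the paper's exact scaling and constants may differ.

```python
# Minimal NumPy sketch (illustrative, not the paper's implementation):
# one-hidden-layer ReLU network, trainable biases initialized to a constant,
# full-batch gradient descent, with a random pruning mask as a stand-in
# for sparsification.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 64, 10, 4096           # samples, input dim, hidden width (assumed values)
beta, lr, steps = 0.1, 0.5, 200  # constant bias init, step size, GD iterations
sparsity = 0.5                   # fraction of hidden neurons kept after pruning

# Illustrative data: inputs on the unit sphere, arbitrary real labels.
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)

# NTK-style initialization: Gaussian first-layer weights, constant biases,
# fixed random +/-1 output weights, 1/sqrt(m) output scaling.
W = rng.standard_normal((m, d))
b = np.full(m, beta)
a = rng.choice([-1.0, 1.0], size=m)

# Sparsification stand-in: keep each hidden neuron with probability `sparsity`.
mask = (rng.random(m) < sparsity).astype(float)

def forward(W, b):
    pre = X @ W.T + b                      # (n, m) pre-activations
    act = np.maximum(pre, 0.0) * mask      # pruned ReLU activations
    return act @ a / np.sqrt(m), pre

for _ in range(steps):
    out, pre = forward(W, b)
    resid = out - y                                             # (n,)
    grad_act = (pre > 0).astype(float) * mask * a / np.sqrt(m)  # d out / d pre
    # Gradient descent on 0.5 * mean((f(X) - y)^2), updating weights AND biases.
    gW = (resid[:, None] * grad_act).T @ X / n
    gb = (resid[:, None] * grad_act).sum(axis=0) / n
    W -= lr * gW
    b -= lr * gb

print("final squared loss:", 0.5 * np.mean((forward(W, b)[0] - y) ** 2))
```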


