Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods

12/06/2020
by Taiji Suzuki, et al.

Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. Towards answering this question, we evaluate the excess risk of a deep learning estimator trained by noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority to a class of linear estimators that includes the neural tangent kernel approach, the random feature model, other kernel methods, the k-NN estimator, and so on. We consider a teacher-student regression model and show that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate, especially in a high-dimensional setting. The obtained excess risk bounds are so-called fast learning rates, which are faster than the O(1/√n) rate obtained by the usual Rademacher complexity analysis; this discrepancy is induced by the non-convex geometry of the model. Moreover, the noisy gradient descent used for neural network training provably reaches a near-global optimal solution even though the loss landscape is highly non-convex. Although the noisy gradient descent does not employ any explicit or implicit sparsity-inducing regularization, it achieves a preferable generalization performance that dominates that of linear estimators.

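To make the training procedure described above concrete, the following minimal sketch runs noisy gradient descent (gradient Langevin dynamics) with ridge regularization on a mildly overparameterized two-layer network in a teacher-student regression setting. The tanh activation, the widths, and all hyperparameters (lam, eta, beta, T) are illustrative assumptions, not values or choices taken from the paper.

```python
# A minimal sketch (not the paper's exact algorithm or constants): noisy gradient
# descent, i.e. gradient Langevin dynamics, with ridge regularization on a mildly
# overparameterized two-layer network in a teacher-student regression setting.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes and hyperparameters (assumed, not from the paper).
d, m_teacher, m_student, n = 10, 5, 50, 200   # input dim, teacher/student widths, sample size
lam, eta, beta, T = 1e-3, 1e-2, 1e4, 5000     # ridge weight, step size, inverse temperature, steps

# Teacher network generates the regression targets, plus observation noise.
W_teacher = rng.normal(size=(m_teacher, d))
a_teacher = rng.normal(size=m_teacher)
X = rng.normal(size=(n, d))
y = np.tanh(X @ W_teacher.T) @ a_teacher + 0.1 * rng.normal(size=n)

# Mildly overparameterized student: width m_student > m_teacher.
W = 0.1 * rng.normal(size=(m_student, d))
a = 0.1 * rng.normal(size=m_student)

def loss_and_grads(W, a):
    """Ridge-regularized squared loss and its gradients with respect to (W, a)."""
    H = np.tanh(X @ W.T)                      # hidden activations, shape (n, m_student)
    resid = H @ a - y                         # prediction residuals
    loss = 0.5 * np.mean(resid ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(a ** 2))
    dH = 1.0 - H ** 2                         # derivative of tanh
    grad_a = H.T @ resid / n + lam * a
    grad_W = ((resid[:, None] * dH) * a).T @ X / n + lam * W
    return loss, grad_W, grad_a

for _ in range(T):
    _, grad_W, grad_a = loss_and_grads(W, a)
    # Langevin update: a gradient step plus Gaussian noise of scale sqrt(2 * eta / beta).
    W -= eta * grad_W + np.sqrt(2 * eta / beta) * rng.normal(size=W.shape)
    a -= eta * grad_a + np.sqrt(2 * eta / beta) * rng.normal(size=a.shape)

final_loss, _, _ = loss_and_grads(W, a)
print(f"regularized training loss after {T} noisy gradient steps: {final_loss:.4f}")
```

The injected Gaussian noise is what lets the iterate escape poor local minima of the non-convex loss and, in the paper's analysis, approach a near-global optimum despite the non-convex landscape.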

