
Backward Feature Correction: How Deep Learning Performs Deep Learning
How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this in terms of hierarchical learning. By hierarchical learning we mean that the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions, which reduces sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and automatically, simply by applying stochastic gradient descent (SGD). On the conceptual side, we present, to the best of our knowledge, the FIRST theory result indicating how very deep neural networks can still be sample and time efficient on certain hierarchical learning tasks, when NO KNOWN non-hierarchical algorithms (such as kernel methods, linear regression over feature mappings, tensor decomposition, sparse coding) are efficient. We establish a new principle called "backward feature correction", which we believe is the key to understanding hierarchical learning in multi-layer neural networks. On the technical side, we show that for regression and even for binary classification, for every input dimension d > 0, there is a concept class consisting of degree-ω(1) multivariate polynomials so that, using ω(1)-layer neural networks as learners, SGD can learn any target function from this class in poly(d) time using poly(d) samples to any 1/poly(d) error, by learning to represent it as a composition of ω(1) layers of quadratic functions. In contrast, we present lower bounds stating that several non-hierarchical learners, including any kernel methods and neural tangent kernels, must suffer from d^ω(1) sample or time complexity to learn functions in this concept class, even to d^(-0.01) error.
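To make "a composition of layers of quadratic functions" concrete, the following is a minimal sketch, not the paper's actual concept class. It assumes each layer maps a vector h to the entries (w_j · h)^2 with random weights, and that the target is the sum of the last layer's outputs. The names `quadratic_layer`, `make_target`, and `empirical_degree` are hypothetical. Because every layer here is a homogeneous quadratic, stacking `depth` layers gives a polynomial of degree 2^depth, which the sketch checks by rescaling the input.

```python
import math
import random


def quadratic_layer(weights, h):
    """Return the vector whose j-th entry is (w_j . h)^2, a homogeneous quadratic in h."""
    return [sum(w * x for w, x in zip(row, h)) ** 2 for row in weights]


def make_target(d, width, depth, seed=0):
    """Build a toy target: `depth` stacked quadratic layers on a d-dimensional input.

    Illustrative assumption only; the paper's concept class may be structured differently.
    """
    rng = random.Random(seed)
    dims = [d] + [width] * depth
    layers = [
        [[rng.gauss(0.0, 1.0) / math.sqrt(fan_in) for _ in range(fan_in)]
         for _ in range(fan_out)]
        for fan_in, fan_out in zip(dims[:-1], dims[1:])
    ]

    def target(x):
        h = list(x)
        for weights in layers:
            h = quadratic_layer(weights, h)
        return sum(h)  # scalar output: sum of the last layer's features

    return target


def empirical_degree(f, x, t=2.0):
    """Estimate the degree of a homogeneous polynomial f from f(t*x) = t^degree * f(x)."""
    ratio = f([t * v for v in x]) / f(x)
    return math.log2(ratio) / math.log2(t)


if __name__ == "__main__":
    f = make_target(d=5, width=4, depth=3, seed=1)
    x = [0.5, -1.0, 0.25, 2.0, 1.5]
    # Three quadratic layers compose to degree 2**3 = 8.
    print(empirical_degree(f, x))
```

The degree doubles with every added layer, so a hierarchical representation needs only `depth` simple layers, while a learner that treats the target as a flat polynomial in d variables must handle on the order of d^(2^depth) monomials. This is the kind of gap the abstract's upper and lower bounds formalize.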