The Principles of Deep Learning Theory
This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
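The "tuning networks to criticality" idea can be illustrated numerically. The following is a minimal sketch (not code from the book; the function name and parameter choices are illustrative assumptions): it propagates a random input through a deep tanh MLP and records the per-layer pre-activation variance. For tanh, the critical initialization is weight variance `sigma_w2 = 1` and bias variance `sigma_b2 = 0`; away from criticality the signal variance decays (or grows) exponentially with depth.

```python
import numpy as np

def preact_variance_through_depth(sigma_w2, sigma_b2,
                                  width=1000, depth=50, seed=0):
    """Track pre-activation variance layer by layer in a random tanh MLP.

    Weights are drawn i.i.d. with variance sigma_w2 / width (so the
    variance map is width-independent at leading order), biases with
    variance sigma_b2. Illustrative sketch, not the book's code.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)  # unit-variance input
    variances = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / width), size=(width, width))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
        z = W @ x + b               # pre-activations of this layer
        variances.append(z.var())
        x = np.tanh(z)              # activations fed to the next layer
    return variances

# Critical tanh initialization: variance stays order one,
# decaying only polynomially in depth.
critical = preact_variance_through_depth(sigma_w2=1.0, sigma_b2=0.0)

# Off-critical initialization: variance vanishes exponentially.
vanishing = preact_variance_through_depth(sigma_w2=0.5, sigma_b2=0.0)
```

After 50 layers the off-critical run has collapsed to numerically negligible signal, while the critical run retains an order-of-magnitude larger variance, which is the single-network analogue of the criticality condition discussed in the abstract.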