
- Provable Memorization via Deep Neural Networks using Sub-linear Parameters
It is known that Θ(N) parameters are sufficient for neural networks to m...
- A Unifying View on Implicit Bias in Training Linear Neural Networks
We study the implicit bias of gradient flow (i.e., gradient descent with...
- Minimum Width for Universal Approximation
The universal approximation property of width-bounded networks has been ...
- SGD with shuffling: optimal rates without component convexity and large epoch requirements
We study without-replacement SGD for solving finite-sum optimization pro...
- O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Transformer networks use pairwise attention to compute contextual embedd...
- Low-Rank Bottleneck in Multi-head Attention Models
Attention based Transformer architecture has enabled significant advance...
- Are Transformers universal approximators of sequence-to-sequence functions?
Despite the widespread adoption of Transformer models for NLP tasks, the...
- Are deep ResNets provably better than linear predictors?
Recently, a residual network (ResNet) with a single residual block has b...
- Finite sample expressive power of small-width ReLU networks
We study universal finite sample expressivity of neural networks, define...
- Efficiently testing local optimality and escaping saddles for ReLU networks
We provide a theoretical algorithm for checking local optimality and esc...
- A Critical View of Global Optimality in Deep Learning
We investigate the loss surface of deep linear and nonlinear neural netw...
- Global optimality conditions for deep neural networks
We study the error landscape of deep linear and nonlinear neural network...