Cohen et al. (2021) empirically study the evolution of the largest eigen...
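For context, the quantity tracked in that line of work is the largest eigenvalue of the training-loss Hessian ("sharpness"). A minimal NumPy sketch of how it is typically estimated, via power iteration on Hessian-vector products; the toy quadratic loss and iteration count are assumptions for illustration, not the paper's setup:

```python
import numpy as np

def top_hessian_eigenvalue(hess_vec, d, iters=100, rng=np.random.default_rng(0)):
    """Estimate the largest Hessian eigenvalue ("sharpness") by power iteration,
    using only Hessian-vector products."""
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hess_vec(v)
        v = hv / np.linalg.norm(hv)
    return v @ hess_vec(v)

# Toy quadratic loss 0.5 * w^T H w, so the Hessian is the fixed matrix H.
H = np.diag([5.0, 1.0, 0.5])
print(top_hessian_eigenvalue(lambda v: H @ v, d=3))  # approx. 5.0
```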
In Reinforcement Learning (RL), enhancing sample efficiency is crucial, ...
Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent ...
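For reference, a minimal NumPy sketch of the basic SAM update rule (ascend to a nearby worst-case point, then apply that point's gradient at the original weights); the toy objective, step size, and radius `rho` are illustrative assumptions, not the setting analyzed above:

```python
import numpy as np

# Toy objective, used only to illustrate the update rule.
def loss(w):
    return 0.5 * np.sum(w ** 2) + np.sin(3 * w).sum()

def grad(w):
    return w + 3 * np.cos(3 * w)

def sam_step(w, lr=0.1, rho=0.05):
    """One SAM step: perturb the weights along the gradient to radius rho,
    then descend using the gradient evaluated at the perturbed point."""
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent (perturbation) direction
    g_sharp = grad(w + eps)                      # gradient at the perturbed weights
    return w - lr * g_sharp                      # applied to the original weights

w = np.array([2.0, -1.5])
for _ in range(100):
    w = sam_step(w)
print(loss(w))
```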
We investigate how pair-wise data augmentation techniques like Mixup aff...
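As background, a minimal sketch of standard pairwise Mixup in NumPy: each example is convexly combined with a randomly paired example from the same batch, and the labels are mixed with the same weight. The Beta parameter `alpha` and the random within-batch pairing are generic choices, not necessarily those studied in the paper:

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=np.random.default_rng(0)):
    """Pairwise Mixup: blend inputs and labels of randomly paired examples."""
    lam = rng.beta(alpha, alpha)          # mixing coefficient
    perm = rng.permutation(len(x))        # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 4))           # toy batch: 8 examples, 4 features
y = np.eye(3)[rng.integers(0, 3, 8)]      # one-hot labels for 3 classes
x_mix, y_mix = mixup_batch(x, y)
```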
We study convergence lower bounds of without-replacement stochastic grad...
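To make the algorithm concrete, a sketch of without-replacement (random-reshuffling) SGD on a least-squares finite sum; the problem instance and step size are illustrative assumptions, and the sketch shows only the algorithm, not the lower-bound constructions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def component_grad(x, i):
    """Gradient of the i-th component f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
lr = 0.01
for epoch in range(50):
    order = rng.permutation(n)   # fresh permutation each epoch (no replacement)
    for i in order:              # every component is used exactly once per epoch
        x -= lr * component_grad(x, i)
print(0.5 * np.mean((A @ x - b) ** 2))
```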
We uncover how SGD interacts with batch normalization and can exhibit un...
Stochastic gradient descent-ascent (SGDA) is one of the main workhorses ...
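As a reference point, a minimal sketch of simultaneous stochastic gradient descent-ascent on a toy strongly-convex–strongly-concave saddle problem; the objective, noise model, and step size are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def grads(x, y, noise=0.1):
    """Stochastic gradients of f(x, y) = 0.5*x**2 + x*y - 0.5*y**2."""
    gx = x + y + noise * rng.standard_normal()   # df/dx plus noise
    gy = x - y + noise * rng.standard_normal()   # df/dy plus noise
    return gx, gy

x, y = 2.0, -1.0
lr = 0.05
for _ in range(2000):
    gx, gy = grads(x, y)
    x, y = x - lr * gx, y + lr * gy   # descend in x, ascend in y (simultaneous update)
print(x, y)                           # approaches the saddle point (0, 0)
```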
In distributed learning, local SGD (also known as federated averaging) a...
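For orientation, a sketch of local SGD (federated averaging) with M workers each taking H local steps between averaging rounds on per-worker least-squares data; the data split, step size, and H are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, H, d, n_per = 4, 10, 5, 32     # workers, local steps, dimension, samples per worker
data = [(rng.standard_normal((n_per, d)), rng.standard_normal(n_per)) for _ in range(M)]

def local_sgd_round(x, lr=0.01):
    """One communication round: each worker runs H local SGD steps starting from
    the shared iterate, then the server averages the resulting local iterates."""
    local_iterates = []
    for A, b in data:
        x_m = x.copy()
        for _ in range(H):
            i = rng.integers(len(b))                # sample one local example
            x_m -= lr * (A[i] @ x_m - b[i]) * A[i]  # local SGD step
        local_iterates.append(x_m)
    return np.mean(local_iterates, axis=0)          # federated averaging

x = np.zeros(d)
for _ in range(100):
    x = local_sgd_round(x)
```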
We propose matrix norm inequalities that extend the Recht-Ré (2012) conj...
It is known that Θ(N) parameters are sufficient for neural networks to m...
We study the implicit bias of gradient flow (i.e., gradient descent with...
The universal approximation property of width-bounded networks has been ...
We study without-replacement SGD for solving finite-sum optimization pro...
Transformer networks use pairwise attention to compute contextual embedd...
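For concreteness, a sketch of the pairwise (scaled dot-product) self-attention map that produces contextual embeddings; the single head, absence of masking, and toy dimensions are simplifying assumptions:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (n, d) token embeddings; returns (n, d) contextual embeddings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ V                                # mix values by attention weights

rng = np.random.default_rng(0)
n, d = 6, 8                                           # 6 tokens, embedding dimension 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)                     # (6, 8) contextual embeddings
```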
Attention-based Transformer architecture has enabled significant advance...
Despite the widespread adoption of Transformer models for NLP tasks, the...
Recently, a residual network (ResNet) with a single residual block has b...
We study universal finite sample expressivity of neural networks, define...
We provide a theoretical algorithm for checking local optimality and esc...
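As generic background only (not the paper's test, which is tailored to the network structure), a sketch of checking second-order stationarity and escaping a strict saddle by stepping along a negative-curvature direction of the Hessian; the tolerances, step size, and toy function are assumptions:

```python
import numpy as np

def check_and_escape(grad, hess, w, g_tol=1e-6, curv_tol=1e-6, step=0.1):
    """If the gradient is (near) zero, inspect the Hessian: a nonnegative spectrum
    certifies a second-order stationary point; otherwise step along the most
    negative eigenvector, which decreases the quadratic approximation of the loss."""
    g = grad(w)
    if np.linalg.norm(g) > g_tol:
        return w - step * g, "gradient step"
    eigvals, eigvecs = np.linalg.eigh(hess(w))        # eigenvalues in ascending order
    if eigvals[0] >= -curv_tol:
        return w, "second-order stationary"
    v = eigvecs[:, 0]                                 # most negative curvature direction
    return w + step * v, "escaped saddle"

# Toy example: f(w) = w0^2 - w1^2 has a strict saddle at the origin.
grad = lambda w: np.array([2 * w[0], -2 * w[1]])
hess = lambda w: np.diag([2.0, -2.0])
w, status = check_and_escape(grad, hess, np.zeros(2))
print(w, status)
```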
We investigate the loss surface of deep linear and nonlinear neural netw...
We study the error landscape of deep linear and nonlinear neural network...