Going beyond stochastic gradient descent (SGD), what new phenomena emerg...
We identify incremental learning dynamics in transformers, where the dif...
We study when the neural tangent kernel (NTK) approximation is valid for...
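For reference, the NTK approximation in question is the standard first-order Taylor expansion of the network in its parameters around initialization (textbook definitions, not details taken from this paper):

\[
f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0),
\qquad
K_{\mathrm{NTK}}(x,x') = \left\langle \nabla_\theta f(x;\theta_0),\, \nabla_\theta f(x';\theta_0) \right\rangle .
\]

When the linearization remains accurate throughout training, gradient descent on the network behaves like kernel regression with \(K_{\mathrm{NTK}}\).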
Training stability is of great importance to Transformers. In this work,...
The grokking phenomenon as reported by Power et al. (arXiv:2201.02177)...
In this paper, we study the representation of neural networks from the v...
Deep linear networks trained with gradient descent yield low rank soluti...
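This low-rank bias is easy to observe numerically. The sketch below is an illustration under assumed settings, not the paper's experiment: it trains a depth-3 linear network by full-batch gradient descent to match a rank-1 matrix observed on a random subset of entries, then compares the singular values of the learned end-to-end product against a directly trained single matrix. The dimension, learning rate, step count, and masked-regression objective are all hypothetical choices.

```python
# Minimal NumPy sketch (illustrative, not the paper's code): deep linear
# network W3 @ W2 @ W1 fit by gradient descent to a partially observed
# rank-1 target, versus a depth-1 baseline on the same objective.
import numpy as np

rng = np.random.default_rng(0)
d, depth, lr, steps = 20, 3, 0.1, 5000

# Rank-1 target with unit spectral norm, observed on ~30% of entries.
u, v = rng.standard_normal(d), rng.standard_normal(d)
target = np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))
mask = rng.random((d, d)) < 0.3

def product(Ws):
    """End-to-end matrix Ws[-1] @ ... @ Ws[0]."""
    P = Ws[0]
    for W in Ws[1:]:
        P = W @ P
    return P

# Small near-identity initialization, common in deep linear analyses.
Ws = [0.1 * np.eye(d) + 0.01 * rng.standard_normal((d, d)) for _ in range(depth)]

for _ in range(steps):
    G = mask * (product(Ws) - target)  # dL/dP for L = 0.5 * ||mask * (P - target)||_F^2
    grads = []
    for i in range(depth):
        left = product(Ws[i + 1:]) if i + 1 < depth else np.eye(d)
        right = product(Ws[:i]) if i > 0 else np.eye(d)
        grads.append(left.T @ G @ right.T)  # chain rule through the matrix product
    for W, g in zip(Ws, grads):
        W -= lr * g

# Depth-1 baseline: gradient descent on a single matrix, same objective.
W1 = 0.01 * rng.standard_normal((d, d))
for _ in range(steps):
    W1 -= lr * (mask * (W1 - target))

print("deep product svals: ", np.linalg.svd(product(Ws), compute_uv=False)[:5].round(3))
print("single matrix svals:", np.linalg.svd(W1, compute_uv=False)[:5].round(3))
```

With a small, near-balanced initialization, the deep product's spectrum typically concentrates on its top singular value, while the depth-1 baseline merely memorizes the observed entries and spreads its spectrum across many directions.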
We analyze the learning dynamics of infinitely wide neural networks with...
Yang (2020a) recently showed that the Neural Tangent Kernel (NTK) at ini...
Modern neural network performance typically improves as model size incre...
Recent results in the theoretical study of deep learning have shown that...
A recent body of work has focused on the theoretical study of neural net...
The Hessian of neural networks can be decomposed into a sum of two matri...
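The splitting referred to is presumably the standard Gauss-Newton decomposition: for a loss \(L(\theta) = \sum_{(x,y)} \ell(f(x;\theta), y)\), the chain rule gives

\[
\nabla^2_\theta L = \sum_{(x,y)} \left[ J^{\top} (\nabla^2_f \ell)\, J + \sum_i \frac{\partial \ell}{\partial f_i}\, \nabla^2_\theta f_i(x;\theta) \right],
\qquad J = \nabla_\theta f(x;\theta).
\]

The first (Gauss-Newton) term is positive semidefinite whenever \(\ell\) is convex in the network output; the second term carries the network's own curvature.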
Normalization techniques play an important role in supporting efficient ...
Deep Residual Networks offer a premium in performance compared to...
Deep learning techniques are renowned for supporting effective transfer ...