Foundation models, i.e. large neural networks pre-trained on large text
...
In this work, we study the large-scale pretraining of BERT-Large with
di...
Optimization in machine learning, both theoretical and applied, is prese...
Adaptive gradient-based optimizers such as AdaGrad and Adam are among th...
We characterize the singular values of the linear transformation associa...
Preconditioned gradient methods are among the most general and powerful ...
We describe a framework for deriving and analyzing online optimization
a...
Messages often refer to entities such as people, places and events. Corr...