It is widely acknowledged that large models have the potential to delive...
Large language models, which are often trained for hundreds of thousands...
All-MLP architectures have attracted increasing interest as an alternati...
Mixture of Experts layers (MoEs) enable efficient scaling of language mo...
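
The conditional computation named in the entry above is the core MoE idea: each token is routed to a small subset of expert sub-networks, so total parameters grow while per-token compute stays roughly constant. A minimal top-1-routing sketch (class name, layer sizes, and the routing rule are illustrative assumptions, not the architecture of the paper excerpted above):

    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        """Toy Mixture-of-Experts layer with top-1 routing (illustrative)."""
        def __init__(self, d_model: int, n_experts: int):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)      # router
            self.experts = nn.ModuleList(
                nn.Linear(d_model, d_model) for _ in range(n_experts)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
            probs = self.gate(x).softmax(dim=-1)           # routing probabilities
            choice = probs.argmax(dim=-1)                  # one expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = choice == i                         # tokens sent to expert i
                if mask.any():
                    # Only routed tokens touch this expert, so adding experts
                    # grows capacity without growing per-token compute.
                    out[mask] = expert(x[mask]) * probs[mask, i].unsqueeze(-1)
            return out
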
Large-scale autoregressive language models such as GPT-3 are few-shot le...
During pretraining, the Pre-LayerNorm transformer suffers from a gradien...
Stateful optimizers maintain gradient statistics over time, e.g., the ex...
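
The entry above is truncated, but the mechanism it describes is standard: optimizers such as SGD with momentum and Adam carry per-parameter statistics from step to step, so optimizer state costs memory on the order of the model itself. A minimal sketch of both update rules (function names, hyperparameter defaults, and the dict-based state are assumptions for illustration):

    import numpy as np

    def momentum_step(param, grad, state, lr=0.01, beta=0.9):
        # SGD with momentum: an exponentially smoothed sum of past gradients.
        state["m"] = beta * state.get("m", np.zeros_like(param)) + grad
        return param - lr * state["m"]

    def adam_step(param, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Adam keeps two buffers (first and second gradient moments), so its
        # state alone uses twice the memory of the parameters.
        t = state["t"] = state.get("t", 0) + 1
        state["m"] = b1 * state.get("m", np.zeros_like(param)) + (1 - b1) * grad
        state["v"] = b2 * state.get("v", np.zeros_like(param)) + (1 - b2) * grad ** 2
        m_hat = state["m"] / (1 - b1 ** t)                 # bias correction
        v_hat = state["v"] / (1 - b2 ** t)
        return param - lr * m_hat / (np.sqrt(v_hat) + eps)
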
Recent state-of-the-art approaches to summarization utilize large pre-tr...
We present a series of modifications which improve upon Graph WaveNet's ...
Generative seq2seq dialogue systems are trained to predict the next word...
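
As a brief illustration of the training objective named in the entry above, next-word prediction is ordinarily implemented as token-level cross-entropy with teacher forcing. A hedged sketch (function name, shapes, and the teacher-forcing convention are assumptions, not the systems from the excerpted abstract):

    import torch
    import torch.nn.functional as F

    def next_word_loss(logits, targets):
        # logits:  (batch, seq_len, vocab) model scores at each position
        # targets: (batch, seq_len) the word that actually comes next
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # flatten positions
            targets.reshape(-1),                  # one gold label per position
        )

    # Teacher forcing: the decoder reads tokens[:, :-1] and is scored on
    # tokens[:, 1:], i.e. on predicting each next word of the dialogue.
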
One of the biggest bottlenecks in a machine learning workflow is waiting...
In computer vision, virtually every state-of-the-art deep learning syste...