In this work, we develop and release Llama 2, a collection of pretrained...
We present a theory for the previously unexplained divergent behavior no...
Recent work has shown that fine-tuning large pre-trained language models...
Large language models, which are often trained for hundreds of thousands...
Mixture-of-Experts (MoE) layers enable efficient scaling of language mo...
Large-scale autoregressive language models such as GPT-3 are few-shot le...