Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model

05/23/2023
by Leo Z. Liu et al.

Large and sparse feed-forward networks (S-FFN), such as Mixture-of-Experts (MoE), have proven to be an efficient approach for scaling up Transformer model size when pretraining large language models. By activating only part of the FFN parameters, conditioned on the input, S-FFN improves generalization performance while keeping training and inference costs (in FLOPs) fixed. In this work, we analyze the two major design choices of S-FFN, the memory block (or expert) size and the memory block selection method, under a general conceptual framework of sparse neural memory. Using this unified framework, we compare several S-FFN architectures for language modeling and provide insights into their relative efficacy and efficiency. Based on this analysis, we find that a simpler selection method, Avg-K, which selects blocks through their mean aggregated hidden states, achieves lower perplexity in language model pretraining than existing MoE architectures.
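To make the selection method concrete, below is a minimal PyTorch sketch of what Avg-K-style block selection could look like. The function name, tensor shapes, and the reading of "mean aggregated hidden states" as a mean over each block's key vectors (the FFN's first-layer weights) are assumptions made for illustration, not the authors' implementation.

```python
import torch

def avg_k_select(hidden_state, keys, block_size, k):
    """Illustrative Avg-K routing sketch (not the paper's code):
    score each memory block by the dot product between a token's
    hidden state and the block's mean key vector, then activate
    the top-k blocks.

    hidden_state: (d_model,) hidden state of one token
    keys:         (num_keys, d_model) FFN first-layer weights viewed as keys
    block_size:   number of key vectors per memory block (expert)
    k:            number of blocks to activate per token
    """
    num_blocks = keys.shape[0] // block_size
    # Mean-aggregate the keys within each block -> (num_blocks, d_model)
    block_means = keys.view(num_blocks, block_size, -1).mean(dim=1)
    # One score per block -> (num_blocks,)
    scores = block_means @ hidden_state
    # Indices of the k highest-scoring blocks
    return torch.topk(scores, k).indices
```

Note that under this reading the per-block means depend only on the weights, so they can be precomputed and reused across tokens; selection then costs a single matrix-vector product per token and requires no separately learned router.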


