Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture

03/27/2023
by Peiyu Liu, et al.

In this paper, we propose a highly parameter-efficient approach for scaling pre-trained language models (PLMs) to greater depth. Unlike prior work that shares all parameters or adds extra blocks, we design a more capable parameter-sharing architecture based on the matrix product operator (MPO). MPO decomposition reorganizes and factorizes a parameter matrix into two parts: a central tensor that contains most of the information and auxiliary tensors that hold only a small proportion of the parameters. Based on this decomposition, our architecture shares the central tensor across all layers to reduce the model size, while keeping layer-specific auxiliary tensors (together with adapters) to preserve adaptation flexibility. To improve model training, we further propose a stable initialization algorithm tailored to the MPO-based architecture. Extensive experiments demonstrate the effectiveness of the proposed model in reducing the model size while achieving highly competitive performance.
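
Below is a minimal sketch, not the authors' released implementation, of the kind of MPO (tensor-train style) decomposition described above: a weight matrix is factorized by sequential SVDs into a chain of local tensors, where the large middle tensor plays the role of the central tensor and the small boundary tensors act as auxiliary tensors. The function names, mode factorizations, and bond ranks are illustrative assumptions.

```python
import numpy as np

def mpo_decompose(W, in_modes, out_modes, ranks):
    """Factorize W (I x J) into len(in_modes) local tensors via sequential SVDs.

    ranks holds the bond dimensions d_1 .. d_{n-1}; d_0 = d_n = 1.
    Returns tensors of shape (d_{k-1}, i_k, j_k, d_k).
    """
    n = len(in_modes)
    assert len(out_modes) == n and len(ranks) == n - 1
    I, J = W.shape
    assert I == int(np.prod(in_modes)) and J == int(np.prod(out_modes))

    # Reorder axes so the k-th local tensor owns the mode pair (i_k, j_k).
    T = W.reshape(list(in_modes) + list(out_modes))
    perm = [x for k in range(n) for x in (k, k + n)]  # (i_1, j_1, i_2, j_2, ...)
    T = T.transpose(perm)

    cores, d_prev = [], 1
    for k in range(n - 1):
        i_k, j_k = in_modes[k], out_modes[k]
        M = T.reshape(d_prev * i_k * j_k, -1)
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        d_k = min(ranks[k], S.size)                 # truncate to the bond rank
        cores.append(U[:, :d_k].reshape(d_prev, i_k, j_k, d_k))
        T = np.diag(S[:d_k]) @ Vt[:d_k]             # push the remainder rightward
        d_prev = d_k
    cores.append(T.reshape(d_prev, in_modes[-1], out_modes[-1], 1))
    return cores

def mpo_reconstruct(cores, I, J):
    """Contract the local tensors back into an (I x J) matrix."""
    n = len(cores)
    T = cores[0]
    for core in cores[1:]:
        T = np.tensordot(T, core, axes=([T.ndim - 1], [0]))
    T = T.squeeze(axis=(0, T.ndim - 1))             # drop the dummy d_0, d_n bonds
    perm = list(range(0, 2 * n, 2)) + list(range(1, 2 * n, 2))
    return T.transpose(perm).reshape(I, J)

# Illustrative example: a 768 x 3072 feed-forward weight split into three
# local tensors.  The large middle tensor holds the vast majority of the
# parameters (the "central tensor"); the two small boundary tensors are the
# "auxiliary tensors".
W = np.random.randn(768, 3072).astype(np.float32)
cores = mpo_decompose(W, in_modes=(8, 12, 8), out_modes=(16, 12, 16), ranks=(128, 128))
print([c.shape for c in cores])   # [(1, 8, 16, 128), (128, 12, 12, 128), (128, 8, 16, 1)]
W_hat = mpo_reconstruct(cores, 768, 3072)
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))  # near zero: full bond ranks keep it exact
```

Under this factorization, the dominant central tensor would be shared across all layers while the small auxiliary tensors (plus adapters) stay layer-specific, which is where the parameter savings come from when stacking more layers.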

