GLM-130B: An Open Bilingual Pre-trained Model

by   Aohan Zeng, et al.

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and disconvergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B – the largest Chinese language model – across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization, without quantization aware training and with almost no performance loss, making it the first among 100B-scale models. More importantly, the property allows its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most ever affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at .


page 2

page 29

page 34

page 36


StyleBERT: Chinese pretraining by font style information

With the success of down streaming task using English pre-trained langua...

HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

Audio codec models are widely used in audio communication as a crucial t...

BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark

To advance Chinese financial natural language processing (NLP), we intro...

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive languag...

DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

This paper presents a new pre-trained language model, DeBERTaV3, which i...

OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch

Large language models (LLMs) with billions of parameters have demonstrat...

BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new st...

Please sign up or login with your details

Forgot password? Click here to reset