CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model

03/03/2020
by   Liang Xu, et al.
0

In this paper, we introduce the Chinese corpus from CLUE organization, CLUECorpus2020, a large-scale corpus that can be used directly for self-supervised learning such as pre-training of a language model, or language generation. It has 100G raw corpus with 35 billion Chinese characters, which is retrieved from Common Crawl. To better understand this corpus, we conduct language understanding experiments on both small and large scale, and results show that the models trained on this corpus can achieve excellent performance on Chinese. We release a new Chinese vocabulary with a size of 8K, which is only one-third of the vocabulary size used in Chinese Bert released by Google. It saves computational cost and memory while works as good as original vocabulary. We also release both large and tiny versions of the pre-trained model on this corpus. The former achieves the state-of-the-art result, and the latter retains most precision while accelerating training and prediction speed for eight times compared to Bert-base. To facilitate future work on self-supervised learning on Chinese, we release our dataset, new vocabulary, codes, and pre-trained models on Github.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/03/2020

CLUECorpus2020: A Large-scale Chinese Corpus for Pre-trainingLanguage Model

In this paper, we introduce the Chinese corpus from CLUE organization, C...
research
04/13/2020

CLUE: A Chinese Language Understanding Evaluation Benchmark

We introduce CLUE, a Chinese Language Understanding Evaluation benchmark...
research
08/01/2023

JIANG: Chinese Open Foundation Language Model

With the advancements in large language model technology, it has showcas...
research
07/31/2020

Improving NER's Performance with Massive financial corpus

Training large deep neural networks needs massive high quality annotatio...
research
03/27/2023

Meeting Action Item Detection with Regularized Context Modeling

Meetings are increasingly important for collaborations. Action items in ...
research
11/19/2022

Metaphorical Language Change Is Self-Organized Criticality

One way to resolve the actuation problem of metaphorical language change...
research
06/06/2023

The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter

Large pre-trained transformers are show-stealer in modern-day deep learn...

Please sign up or login with your details

Forgot password? Click here to reset