LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation

08/30/2021
by   Jian Guan, et al.

Standard multi-task benchmarks are essential for driving the progress of general pretraining models toward generalizing across diverse downstream tasks. However, existing benchmarks such as GLUE and GLGE tend to focus on short-text understanding and generation tasks, without considering long text modeling, which requires distinct capabilities such as modeling long-range commonsense and discourse relations, as well as the coherence and controllability of generation. The lack of standardized benchmarks makes it difficult to fully evaluate these capabilities of a model and to fairly compare different models, especially Chinese pretraining models. Therefore, we propose LOT, a benchmark comprising two understanding tasks and two generation tasks for evaluating Chinese long text modeling. We construct the datasets for these tasks from various kinds of human-written Chinese stories. In addition, we release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters. We pretrain LongLM on 120G of Chinese novels with two generative tasks: text infilling and conditional continuation. Extensive experiments on LOT demonstrate that LongLM matches the performance of similar-sized pretraining models on the understanding tasks and substantially outperforms strong baselines on the generation tasks.
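The two pretraining objectives named above can be illustrated with a minimal sketch. The snippet below shows one common way to build a text-infilling training pair, in which selected spans are replaced by sentinel tokens in the source and recovered in the target (T5-style span masking); the function name, the sentinel format, and the fixed span positions are illustrative assumptions, not the paper's exact preprocessing.

```python
def make_infilling_example(text, spans):
    """Build a (source, target) pair for text-infilling pretraining.

    Each (start, end) span in `text` is replaced by a sentinel token
    in the source; the target lists each sentinel followed by the
    removed span. Sentinel format follows the T5 convention
    (<extra_id_0>, <extra_id_1>, ...) as an illustrative assumption.
    """
    spans = sorted(spans)
    source_parts, target_parts = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        source_parts.append(text[cursor:start])  # text before the masked span
        source_parts.append(sentinel)            # placeholder in the source
        target_parts.append(sentinel)            # marker in the target
        target_parts.append(text[start:end])     # the removed span to predict
        cursor = end
    source_parts.append(text[cursor:])           # trailing unmasked text
    return "".join(source_parts), "".join(target_parts)


src, tgt = make_infilling_example("the quick brown fox jumps", [(4, 9), (16, 19)])
# src == "the <extra_id_0> brown <extra_id_1> jumps"
# tgt == "<extra_id_0>quick<extra_id_1>fox"
```

Conditional continuation, the second objective, needs no masking: a document is simply split at some point, the prefix serves as the encoder input, and the model is trained to generate the remainder.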

