AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation

09/24/2020
by   Huishuang Tian, et al.

Ancient Chinese is the essence of Chinese culture. Several natural language processing tasks target the ancient Chinese domain, such as ancient-modern Chinese translation, poem generation, and couplet generation. Previous studies usually use supervised models that rely heavily on parallel data, which is difficult to obtain at scale for ancient Chinese. To make full use of the more easily available monolingual ancient Chinese corpora, we release AnchiBERT, a pre-trained language model based on the BERT architecture and trained on large-scale ancient Chinese corpora. We evaluate AnchiBERT on both language understanding and generation tasks, including poem classification, ancient-modern Chinese translation, poem generation, and couplet generation. The experimental results show that AnchiBERT outperforms BERT as well as non-pretrained models and achieves state-of-the-art results in all cases.
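The abstract does not describe the pretraining procedure in detail. As a minimal sketch, assuming AnchiBERT follows BERT's standard masked-language-model objective (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged), the data-preparation step over a monolingual ancient Chinese corpus might look like the following. The function name `mask_tokens` and the example line from the Analects are illustrative, not from the paper; ancient Chinese is treated character-by-character, as is common for Chinese BERT variants.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking for masked-language-model pretraining.

    Of the positions selected with probability mask_prob:
      80% become "[MASK]", 10% become a random vocab token,
      10% keep the original token. Returns the corrupted sequence
      and per-position prediction targets (None = not a target).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # the model must predict the original
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: token left unchanged on purpose
    return masked, labels

# One character per token, a common choice for Chinese BERT variants.
corpus_line = list("學而時習之不亦說乎")
vocab = sorted(set(corpus_line))
masked, labels = mask_tokens(corpus_line, vocab, seed=42)
```

The corrupted sequence `masked` is fed to the encoder, and the loss is computed only at positions where `labels` is not `None`; fine-tuning for the downstream tasks (classification, translation, generation) then starts from the pretrained weights.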

research · 08/31/2019
NEZHA: Neural Contextualized Representation for Chinese Language Understanding
The pre-trained language models have achieved great successes in various...

research · 09/18/2023
Proposition from the Perspective of Chinese Language: A Chinese Proposition Classification Evaluation Benchmark
Existing propositions often rely on logical constants for classification...

research · 11/04/2022
Generation of Chinese classical poetry based on pre-trained model
In order to test whether artificial intelligence can create qualified cl...

research · 07/02/2022
Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk
Language is the principal tool for human communication, in which humor i...

research · 10/12/2020
OCNLI: Original Chinese Natural Language Inference
Despite the tremendous recent progress on natural language inference (NL...

research · 10/11/2022
HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea
Historical records in Korea before the 20th century were primarily writt...

research · 02/01/2021
Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised Learning
The majority of Chinese characters are monophonic, i.e. their pronunciati...
