HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

10/11/2022
by   Haneul Yoo, et al.
18

Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters and not understood by modern Korean or Chinese speakers. Historians with expertise in this time period have been analyzing the documents, but that process is very difficult and time-consuming, and language models would significantly speed up the process. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show there are significant improvements gained by training on the two corpora. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has not been studied much by the historians, and not at all by the NLP community.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/31/2019

NEZHA: Neural Contextualized Representation for Chinese Language Understanding

The pre-trained language models have achieved great successes in various...
research
03/29/2021

Contextual Text Embeddings for Twi

Transformer-based language models have been changing the modern Natural ...
research
09/24/2020

AnchiBERT: A Pre-Trained Model for Ancient ChineseLanguage Understanding and Generation

Ancient Chinese is the essence of Chinese culture. There are several nat...
research
11/19/2019

Towards Lingua Franca Named Entity Recognition with BERT

Information extraction is an important task in NLP, enabling the automat...
research
03/30/2023

Yes but.. Can ChatGPT Identify Entities in Historical Documents?

Large language models (LLMs) have been leveraged for several years now, ...
research
08/25/2020

Conceptualized Representation Learning for Chinese Biomedical Text Mining

Biomedical text mining is becoming increasingly important as the number ...
research
04/19/2021

Modeling "Newsworthiness" for Lead-Generation Across Corpora

Journalists obtain "leads", or story ideas, by reading large corpora of ...

Please sign up or login with your details

Forgot password? Click here to reset