Mining Local Gazetteers of Literary Chinese with CRF and Pattern based Methods for Biographical Information in Chinese History

by Chao-Lin Liu, et al.

Person names and location names are essential building blocks for identifying events and social networks in historical documents written in literary Chinese. We take the lead in exploring the algorithmic recognition of named entities in literary Chinese for historical studies, using language-model-based and conditional-random-field-based (CRF) methods, and we extend this work to mining the structure of historical documents. We evaluated the methods on texts extracted from more than 220 volumes of local gazetteers (Difangzhi). Difangzhi is a huge collection, and the single most important one, of information about officers who served in local governments in Chinese history. Our methods performed very well on these realistic tests: thousands of names and addresses were identified from the texts. A good portion of the extracted names match the biographical information currently recorded in the China Biographical Database (CBDB) of Harvard University, and many others can be verified by historians and will become new additions to CBDB.




