Active Learning for Chinese Word Segmentation in Medical Text

by Tingting Cai, et al.

Electronic health records (EHRs) stored in hospital information systems comprehensively reflect patients' diagnosis and treatment processes, which makes them essential to clinical data mining. Chinese word segmentation (CWS) is a fundamental task in Chinese natural language processing. Currently, most state-of-the-art CWS methods depend heavily on large-scale manually annotated data, and producing such annotations is time-consuming and expensive, especially in the medical field. In this paper, we present an active learning method for CWS in medical text. To make effective use of the complete segmentation history, we propose a new scoring model for the sampling strategy that combines information entropy with a neural network. In addition, to capture interactions between adjacent characters, K-means clustering features are added to the word segmenter. We evaluate the proposed method experimentally on EHRs collected from the Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine. The results show that it outperforms the reference methods and can effectively reduce the cost of manual annotation.
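The abstract does not give the scoring model itself, so the following is only a minimal sketch of the entropy half of the idea: plain entropy-based uncertainty sampling, without the neural-network component or the segmentation history the paper describes. It assumes a segmenter that outputs, for each character, a probability distribution over the common B/M/E/S tag set; the function names, the sentence ids, and the example probabilities are all hypothetical.

```python
import math


def tag_entropy(probs):
    """Shannon entropy (in nats) of one character's tag distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_uncertainty(char_probs):
    """Mean per-character entropy, so length alone does not raise the score."""
    if not char_probs:
        return 0.0
    return sum(tag_entropy(p) for p in char_probs) / len(char_probs)


def select_for_annotation(pool, k):
    """Return the ids of the k sentences the segmenter is least sure about."""
    ranked = sorted(
        pool,
        key=lambda sid: sentence_uncertainty(pool[sid]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical unlabeled pool: sentence id -> per-character distributions
# over the assumed tag set (B, M, E, S). The numbers are made up.
pool = {
    "s1": [[0.97, 0.01, 0.01, 0.01], [0.01, 0.01, 0.97, 0.01]],
    "s2": [[0.25, 0.25, 0.25, 0.25], [0.40, 0.10, 0.40, 0.10]],
    "s3": [[0.70, 0.10, 0.10, 0.10], [0.10, 0.10, 0.70, 0.10]],
}

print(select_for_annotation(pool, 2))  # ['s2', 's3']
```

In an active learning loop, the selected sentences would be sent to a human annotator, added to the training set, and the segmenter retrained before the next round of selection.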






