Active Learning for Chinese Word Segmentation in Medical Text

08/22/2019
by   Tingting Cai, et al.

Electronic health records (EHRs) stored in hospital information systems record patients' diagnosis and treatment processes in detail and are essential to clinical data mining. Chinese word segmentation (CWS) is a fundamental task in Chinese natural language processing. Most state-of-the-art CWS methods currently depend heavily on large-scale manually annotated data, which is time-consuming and expensive to produce, especially in the medical field. In this paper, we present an active learning method for CWS in medical text. To make effective use of the complete segmentation history, we propose a new scoring model for the sampling strategy that combines information entropy with a neural network. In addition, to capture interactions between adjacent characters, K-means clustering features are added to the word segmenter. We evaluate the proposed CWS method on medical text. Experimental results on EHRs collected from the Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine show that the proposed method outperforms the other reference methods and can effectively reduce the cost of manual annotation.
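
The entropy component of such a sampling strategy can be illustrated with a minimal sketch of plain entropy-based uncertainty sampling. This is not the scoring model proposed in the paper, which also involves a neural network and the segmentation history; the function names, the four-way label distribution per character (for example B/M/E/S tags), and the length normalization are all assumptions made for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one character's label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_uncertainty(label_probs):
    """Mean per-character entropy, so long sentences are not favored.

    `label_probs` is a list with one probability distribution per
    character (hypothetical segmenter output, e.g. over B/M/E/S tags).
    """
    if not label_probs:
        return 0.0
    return sum(token_entropy(p) for p in label_probs) / len(label_probs)


def select_for_annotation(pool, k):
    """Return indices of the k most uncertain sentences in the pool."""
    scores = [sentence_uncertainty(s) for s in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Illustrative pool: one confidently segmented sentence, one uncertain one.
confident = [[1.0, 0.0, 0.0, 0.0], [0.97, 0.01, 0.01, 0.01]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
chosen = select_for_annotation([confident, uncertain], k=1)
```

In an active learning loop, the selected sentences would be sent to annotators, added to the training set, and the segmenter retrained before the next round of selection.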

research
04/05/2020

Confident Coreset for Active Learning in Medical Image Analysis

Recent advances in deep learning have resulted in great successes in var...
research
08/19/2022

End-to-end Clinical Event Extraction from Chinese Electronic Health Record

Event extraction is an important work of medical text processing. Accord...
research
07/16/2021

The Application of Active Query K-Means in Text Classification

Active learning is a state-of-the-art machine learning approach to deal with...
research
08/28/2020

Cost-Quality Adaptive Active Learning for Chinese Clinical Named Entity Recognition

Clinical Named Entity Recognition (CNER) aims to automatically identify ...
research
08/28/2019

Onto Word Segmentation of the Complete Tang Poems

We aim at segmenting words in the Complete Tang Poems (CTP). Although it...
research
02/05/2020

Multi-Fusion Chinese WordNet (MCW) : Compound of Machine Learning and Manual Correction

Princeton WordNet (PWN) is a lexicon-semantic network based on cognitive...
research
05/21/2019

A realistic and robust model for Chinese word segmentation

A realistic Chinese word segmentation tool must adapt to textual variati...
