De-identification of medical records using conditional random fields and long short-term memory networks

by Zhipeng Jiang, et al.

The CEGS N-GRID 2016 Shared Task 1 in Clinical Natural Language Processing focuses on the de-identification of psychiatric evaluation records. This paper describes our team's two participating systems, one based on conditional random fields (CRFs) and one based on long short-term memory networks (LSTMs). A pre-processing module performs sentence detection and tokenization before de-identification. For the CRF system, rich, manually engineered features were used to train the model. For the LSTM system, a character-level bi-directional LSTM network represents each token and classifies its tag, and a decoding layer stacked on top decodes the most probable protected health information (PHI) terms. The LSTM-based system attained an i2b2 strict micro-F_1 measure of 89.86%.
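
The abstract does not say how the decoding layer is implemented. As a minimal sketch of the general idea, the following assumes a Viterbi-style decoder that combines per-token tag scores (such as a bi-directional LSTM might produce) with tag-transition scores. The tag set, the scores, and the function name `viterbi_decode` are all hypothetical illustrations, not the authors' implementation.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag] is the score of `tag` at token t; transitions[(a, b)]
    is the score of moving from tag a to tag b (missing entries count as 0).
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0].get(tag, 0.0) for tag in tags}]
    back: List[Dict[str, str]] = []  # back[t-1][tag]: best previous tag
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t].get(cur, 0.0)
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical BIO tag set and toy scores for the tokens "Dr", "John", "Smith".
tags = ["O", "B-NAME", "I-NAME"]
emissions = [
    {"O": 2.0, "B-NAME": 0.1, "I-NAME": 0.0},
    {"O": 0.5, "B-NAME": 1.0, "I-NAME": 1.2},
    {"O": 0.3, "B-NAME": 0.2, "I-NAME": 1.5},
]
# Assumed constraint: an I- tag may not directly follow O.
transitions = {("O", "I-NAME"): -10.0}

decoded = viterbi_decode(emissions, transitions, tags)
print(decoded)  # ['O', 'B-NAME', 'I-NAME']
```

In this toy input, picking the best tag for each token independently would label "John" as `I-NAME` directly after `O`, which is not a well-formed PHI span. Scoring the whole sequence lets the transition penalty steer the output to `B-NAME` followed by `I-NAME`.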



Natural Language Processing Accurately Categorizes Indications, Findings and Pathology Reports from Multicenter Colonoscopy

Colonoscopy is used for colorectal cancer (CRC) screening. Extracting de...

SEAL: Scientific Keyphrase Extraction and Classification

Automatic scientific keyphrase extraction is a challenging problem facil...

De-identification In practice

We report our effort to identify the sensitive information, subset of da...

Scope resolution of predicted negation cues: A two-step neural network-based approach

Neural network-based methods are the state of the art in negation scope ...

Effective Neural Solution for Multi-Criteria Word Segmentation

We present a simple yet elegant solution to train a single joint model o...

An Efficient Architecture for Predicting the Case of Characters using Sequence Models

The dearth of clean textual data often acts as a bottleneck in several n...

Document classification using a Bi-LSTM to unclog Brazil's supreme court

The Brazilian court system is currently the most clogged up judiciary sy...