SemClinBr – a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks

by Lucas Emanuel Silva e Oliveira, et al.

The high volume of research focusing on extracting patient information from electronic health records (EHRs) has increased the demand for annotated corpora, which are a valuable resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multi-purpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. In this study, we developed a semantically annotated corpus using clinical texts from multiple medical specialties, document types, and institutions. We present the following: (1) a survey listing common aspects and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated and can guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations. The result of this work is SemClinBr, a corpus of 1,000 clinical notes labeled with 65,117 entities and 11,263 relations, which can support a variety of clinical NLP tasks and boost the secondary use of EHRs for the Portuguese language.
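To make the idea of a corpus "labeled with entities and relations" concrete, here is a minimal Python sketch of how one annotated note might be represented and how corpus-level counts could be tallied. The class names, the example sentence, and the labels (`"Sign or Symptom"`, `"Body Location"`, `"located_in"`) are hypothetical illustrations; they are not taken from the SemClinBr annotation schema or its release format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Entity:
    """A labeled character span in a note (end offset is exclusive)."""
    start: int
    end: int
    label: str  # hypothetical semantic label


@dataclass(frozen=True)
class Relation:
    """A labeled link between two entities, referenced by list index."""
    head: int
    tail: int
    label: str  # hypothetical relation label


@dataclass
class AnnotatedNote:
    text: str
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def span_text(self, entity: Entity) -> str:
        """Return the surface text covered by an entity."""
        return self.text[entity.start:entity.end]


def corpus_stats(notes: List[AnnotatedNote]) -> Dict[str, float]:
    """Count notes, entities, and relations, plus per-note averages."""
    n_notes = len(notes)
    n_entities = sum(len(n.entities) for n in notes)
    n_relations = sum(len(n.relations) for n in notes)
    return {
        "notes": n_notes,
        "entities": n_entities,
        "relations": n_relations,
        # Guard against division by zero for an empty corpus.
        "entities_per_note": n_entities / n_notes if n_notes else 0.0,
        "relations_per_note": n_relations / n_notes if n_notes else 0.0,
    }


# Invented example sentence, not drawn from the corpus.
note = AnnotatedNote(
    text="Paciente com dor no peito.",
    entities=[
        Entity(13, 16, "Sign or Symptom"),  # "dor"
        Entity(20, 25, "Body Location"),    # "peito"
    ],
    relations=[Relation(0, 1, "located_in")],
)

stats = corpus_stats([note])
```

Applied to the figures reported above, the same averages work out to roughly 65 entities and 11 relations per note (65,117 and 11,263 divided by 1,000 notes).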







Building a comprehensive syntactic and semantic corpus of Chinese clinical texts

Objective: To build a comprehensive corpus covering syntactic and semant...

Hierarchical Annotation for Building A Suite of Clinical Natural Language Processing Tasks: Progress Note Understanding

Applying methods in natural language processing on electronic health rec...

A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature

We present a corpus of 5,000 richly annotated abstracts of medical artic...

NUBES: A Corpus of Negation and Uncertainty in Spanish Clinical Texts

This paper introduces the first version of the NUBes corpus (Negation an...

An Empirical Study of UMLS Concept Extraction from Clinical Notes using Boolean Combination Ensembles

Our objective in this study is to investigate the behavior of Boolean op...

Prague Dependency Treebank – Consolidated 1.0

We present a richly annotated and genre-diversified language resource, t...

Implementing a Portable Clinical NLP System with a Common Data Model - a Lisp Perspective

This paper presents a Lisp architecture for a portable NLP system, terme...