Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP

08/30/2022
by Johann Frei, et al.

Obtaining text datasets with semantic annotations is a laborious process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts often requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major interconnected problems, such as the lack of task-matching datasets and of task-specific pre-trained models. In our work, we suggest leveraging pretrained language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset which we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at: https://github.com/frankkramer-lab/GPTNERMED
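To make the described pipeline concrete, the following is a minimal sketch of the general idea: prompt a general-purpose language model to produce German medical sentences with inline entity annotations, then convert them into spaCy's training format for a small, efficient NER model. The placeholder model name ("gpt2"), the prompt wording, and the <DRUG>/<DOSE> tag scheme are illustrative assumptions, not the exact setup used for GPTNERMED.

```python
# Hypothetical sketch: synthesize annotated German medical sentences with a
# general-purpose LM and store them as silver-standard spaCy training data.
import re

import spacy
from spacy.tokens import DocBin
from transformers import pipeline

# 1) Generate a synthetic sentence with inline entity tags.
#    Any sufficiently capable causal LM can be substituted for the placeholder.
generator = pipeline("text-generation", model="gpt2")  # placeholder model
prompt = (
    "Schreibe einen deutschen medizinischen Satz und markiere Medikamente "
    "mit <DRUG>...</DRUG> und Dosierungen mit <DOSE>...</DOSE>:\n"
)
out = generator(prompt, max_new_tokens=60, do_sample=True)[0]["generated_text"]
raw = out[len(prompt):]  # keep only the newly generated continuation

# 2) Parse the inline tags into (start, end, label) character spans.
TAG = re.compile(r"<(?P<label>DRUG|DOSE)>(?P<text>.+?)</(?P=label)>")

def parse_annotated(sentence: str):
    """Strip the inline tags and return plain text plus entity spans."""
    plain, spans, last = "", [], 0
    for m in TAG.finditer(sentence):
        plain += sentence[last:m.start()]
        start = len(plain)
        plain += m.group("text")
        spans.append((start, len(plain), m.group("label")))
        last = m.end()
    plain += sentence[last:]
    return plain, spans

# 3) Store the silver-standard example in spaCy's binary format,
#    ready for training a compact German NER model.
nlp = spacy.blank("de")
db = DocBin()
text, spans = parse_annotated(raw)
doc = nlp.make_doc(text)
ents = [doc.char_span(s, e, label=l) for s, e, l in spans]
doc.ents = [e for e in ents if e is not None]  # drop misaligned spans
db.add(doc)
db.to_disk("./silver_train.spacy")
```

In practice one would generate many such sentences, validate the parsed spans, and then train a lightweight NER model (e.g. via spaCy's training CLI) on the resulting silver-standard corpus instead of manually annotated data.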


