FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

09/19/2023
by   Jamil Zaghir, et al.

Natural language processing (NLP) applications such as named entity recognition (NER) on low-resource corpora do not benefit from recent advances in large language models (LLMs), since they still need larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection. Leveraging a language-agnostic BERT-based approach, it is an efficient way to expand low-resource corpora with little human effort, using only open data resources that are already available. Quantitative and qualitative evaluations of semi-automatic data generation strategies are often lacking. Our evaluation of the crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2,051 synthetic clinical cases in French. The corpus is now available (https://zenodo.org/record/8355629) for researchers and practitioners to develop and refine French NLP applications in the clinical field, making it the largest open annotated corpus with linked medical concepts in French.
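To make the idea of crosslingual annotation projection concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a word alignment between a source sentence and its translation is already available (here it is written by hand, standing in for whatever a multilingual BERT-based aligner would produce), and it carries an entity span from the source tokens over to the target tokens. The function name `project_spans`, the label `SYMPTOM`, and the example sentences are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A span is (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def project_spans(source_spans: List[Span],
                  alignment: List[Tuple[int, int]]) -> List[Span]:
    """Project token-level entity spans through a word alignment.

    `alignment` holds (source_index, target_index) pairs. Each source span
    is mapped to the smallest contiguous target span covering every target
    token aligned to it. Spans with no aligned token are dropped.
    """
    src_to_tgt: Dict[int, Set[int]] = {}
    for src_idx, tgt_idx in alignment:
        src_to_tgt.setdefault(src_idx, set()).add(tgt_idx)

    projected: List[Span] = []
    for start, end, label in source_spans:
        targets = sorted(
            t for s in range(start, end) for t in src_to_tgt.get(s, ())
        )
        if not targets:
            continue  # nothing to project; a human reviewer would handle this
        projected.append((targets[0], targets[-1] + 1, label))
    return projected


# Hypothetical example: an English sentence and its French translation.
source_tokens = ["The", "patient", "has", "acute", "abdominal", "pain"]
target_tokens = ["Le", "patient", "présente", "une", "douleur",
                 "abdominale", "aiguë"]

# Hand-written alignment (assumption): note the reordered adjectives.
alignment = [(0, 0), (1, 1), (2, 2), (3, 6), (4, 5), (5, 4)]

# One annotated entity on the source side: "acute abdominal pain".
source_spans = [(3, 6, "SYMPTOM")]

target_spans = project_spans(source_spans, alignment)
start, end, label = target_spans[0]
projected_text = " ".join(target_tokens[start:end])
```

Taking the smallest covering span keeps the projected annotation contiguous even when word order differs between the two languages, at the cost of occasionally including an unaligned token that falls inside the span.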


