A Dataset for Plain Language Adaptation of Biomedical Abstracts

10/21/2022
by Kush Attal, et al.

Although an exponentially growing body of health-related literature is available to a broad audience online, the language of scientific articles can be difficult for the general public to understand. Adapting this expert-level language into plain language versions is therefore necessary for the public to reliably comprehend the vast health-related literature. Deep Learning algorithms for automatic adaptation are a possible solution; however, gold standard datasets are needed for proper evaluation. The datasets proposed thus far consist of either pairs of comparable professional- and general public-facing documents or pairs of semantically similar sentences mined from such documents. This leads to a trade-off between imperfect alignments and small test sets. To address this issue, we created the Plain Language Adaptation of Biomedical Abstracts dataset, the first manually adapted dataset that is both document- and sentence-aligned. It contains 750 adapted abstracts, totaling 7643 sentence pairs. Along with describing the dataset, we benchmark automatic adaptation on it with state-of-the-art Deep Learning approaches, setting baselines for future research.
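To make "both document- and sentence-aligned" concrete, the following is a minimal Python sketch of how one adapted abstract might be held in memory. The class names, the `doc_id` field, and the placeholder sentences are all hypothetical illustrations; they are not the dataset's actual schema or contents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SentencePair:
    """One source sentence and its plain-language adaptation (hypothetical layout)."""
    source: str
    adaptation: str


@dataclass
class AdaptedAbstract:
    """A document-level pair that keeps its sentence-level alignment."""
    doc_id: str  # hypothetical identifier, not a field name taken from the dataset
    pairs: List[SentencePair] = field(default_factory=list)

    def source_text(self) -> str:
        # Document-level view of the expert-language abstract.
        return " ".join(p.source for p in self.pairs)

    def adapted_text(self) -> str:
        # Document-level view of the plain-language adaptation.
        return " ".join(p.adaptation for p in self.pairs)


def count_sentence_pairs(abstracts: List[AdaptedAbstract]) -> int:
    """Total number of aligned sentence pairs across a collection of abstracts."""
    return sum(len(a.pairs) for a in abstracts)


# Placeholder sentences written for this sketch, not drawn from the dataset.
example = AdaptedAbstract(
    doc_id="example-1",
    pairs=[
        SentencePair("Source sentence one.", "Plain sentence one."),
        SentencePair("Source sentence two.", "Plain sentence two."),
    ],
)
```

Storing the sentence pairs in order and deriving the document-level text from them is one way to get both alignments from a single record; the released dataset may organize its files differently.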


