PLOD: An Abbreviation Detection Dataset for Scientific Documents

04/26/2022
by   Leonardo Zilio, et al.

Detecting and extracting abbreviations from unstructured text can improve the performance of Natural Language Processing tasks such as machine translation and information retrieval. However, the publicly available datasets do not contain enough data to train models based on deep neural networks that generalise well. This paper presents PLOD, a large-scale dataset for abbreviation detection and extraction that contains 160k+ segments automatically annotated with abbreviations and their long forms. We manually validated a set of instances and automatically validated the complete dataset. We then used the dataset to train several baseline models for detecting abbreviations and long forms. The best models achieved an F1-score of 0.92 for detecting abbreviations and 0.89 for detecting their corresponding long forms. We publicly release the dataset, our code, and all the models at https://github.com/surrey-nlp/PLOD-AbbreviationDetection
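
To illustrate how separate F1-scores for abbreviations and long forms can arise from one set of predictions, the following is a minimal sketch of a per-label, token-level F1 computation. The label names `AC`, `LF`, and `O`, the function `f1_for_label`, and the example sentence are assumptions made for illustration; the abstract does not specify the tagging scheme or whether the reported scores were computed at token level or span level.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """Token-level F1 for one label (hypothetical helper, not from the paper)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    # Count true positives, false positives, and false negatives for `label`.
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)

    if tp == 0:
        return 0.0  # avoids division by zero when nothing was matched

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative tokens: The natural language processing ( NLP )
# Assumed labels: AC = abbreviation, LF = long form, O = other.
gold = ["O", "LF", "LF", "LF", "O", "AC", "O"]
pred = ["O", "LF", "LF", "O", "O", "AC", "AC"]

f1_abbreviation = f1_for_label(gold, pred, "AC")
f1_long_form = f1_for_label(gold, pred, "LF")
print(f"abbreviation F1: {f1_abbreviation:.2f}")
print(f"long form F1:    {f1_long_form:.2f}")
```

In this toy example the model finds the abbreviation but also tags one extra token as `AC`, and it misses one long-form token, so the two labels receive different scores even though they come from the same prediction.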
