MeSHup: A Corpus for Full Text Biomedical Document Indexing

04/28/2022
by Xindi Wang, et al.

Medical Subject Heading (MeSH) indexing is the problem of assigning to a given biomedical document the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are manually annotated by human curators, which is time-consuming and costly; a computational system that can assist the indexing is therefore highly valuable. Developing supervised MeSH indexing systems calls for a large-scale annotated text corpus, and a publicly available one that permits robust evaluation and comparison of different systems is important to the research community. We release a large-scale annotated MeSH indexing corpus, MeSHup, which contains 1,342,667 full-text articles in English, together with their associated MeSH labels and metadata, including authors and publication venues, collected from the MEDLINE database. We train an end-to-end model on our corpus that combines features from documents and their associated labels, and we report the new baseline.
