Applying deep learning techniques on medical corpora from the World Wide Web: a prototypical system and evaluation

BACKGROUND: The amount of biomedical literature is growing rapidly, and it is becoming increasingly difficult to keep manually curated knowledge bases and ontologies up to date. In this study we applied the word2vec deep learning toolkit to medical corpora to test its potential for identifying relationships from unstructured text. We evaluated the efficiency of word2vec in identifying properties of pharmaceuticals based on mid-sized, unstructured medical text corpora available on the web. Properties included relationships to diseases ('may treat') or physiological processes ('has physiological effect'). We compared the relationships identified by word2vec with manually curated information from the National Drug File - Reference Terminology (NDF-RT) ontology as a gold standard.

RESULTS: Our results revealed a maximum accuracy of 49.28%, which suggests a limited ability of word2vec to capture linguistic regularities on the collected medical corpora compared with other published results. We were able to document the influence of different parameter settings on result accuracy and found an unexpected trade-off between ranking quality and accuracy. Pre-processing corpora to reduce syntactic variability proved to be a good strategy for increasing the utility of the trained vector models.

CONCLUSIONS: Word2vec is a very efficient implementation for computing vector representations and can identify relationships in textual data without any prior domain knowledge. We found that the ranking and retrieved results generated by word2vec were not of sufficient quality for automatic population of knowledge bases and ontologies, but could serve as a starting point for further manual curation.
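The comparison against a gold standard described above can be pictured with a minimal sketch. It ranks terms by cosine similarity to a drug's vector and counts how often a gold-standard term appears among the top k neighbours. Everything here is an assumption for illustration: the function names are hypothetical, the three-dimensional vectors and the gold pairs are invented toy values rather than data from the study, and the study's actual accuracy measure may be defined differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_neighbours(term, vectors, k):
    """Return the k terms most similar to `term`, best first."""
    query = vectors[term]
    scored = [
        (cosine(query, vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def hit_rate_at_k(gold, vectors, k):
    """Fraction of gold entries with at least one expected term in the top k."""
    hits = 0
    for drug, expected in gold.items():
        if expected & set(rank_neighbours(drug, vectors, k)):
            hits += 1
    return hits / len(gold)


# Invented toy vectors (NOT trained embeddings) standing in for a word2vec model.
TOY_VECTORS = {
    "aspirin": [0.9, 0.1, 0.0],
    "headache": [0.8, 0.2, 0.1],
    "insulin": [0.1, 0.9, 0.0],
    "diabetes": [0.2, 0.8, 0.1],
    "fracture": [0.0, 0.1, 0.9],
}

# Invented 'may treat' pairs standing in for a curated gold standard.
TOY_GOLD = {"aspirin": {"headache"}, "insulin": {"diabetes"}}

if __name__ == "__main__":
    print(rank_neighbours("aspirin", TOY_VECTORS, 2))
    print(hit_rate_at_k(TOY_GOLD, TOY_VECTORS, 1))
```

In a setting like the one the abstract describes, the toy dictionary would be replaced by vectors trained on the medical corpora, and the gold pairs by relationships taken from the curated ontology. Varying k is one simple way to observe how a stricter or looser cut-off changes the hit rate.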
