
Knowledge Graph-based Neurodegenerative Diseases and Diet Relationship Discovery

To date, there are no effective treatments for most neurodegenerative diseases. However, certain foods may be associated with these diseases and bring an opportunity to prevent or delay neurodegenerative progression. Our objective is to construct a knowledge graph for neurodegenerative diseases using literature mining to study their relations with diet. We collected biomedical annotations (Disease, Chemical, Gene, Species, SNP Mutation) in the abstracts from 4,300 publications relevant to both neurodegenerative diseases and diet using PubTator, an NIH-supported tool that can extract biomedical concepts from literature. A knowledge graph was created from these annotations. Graph embeddings were then trained with the node2vec algorithm to support potential concept clustering and similar concept identification. We found several food-related species and chemicals that might come from diet and have an impact on neurodegenerative diseases.



1 Scientific Background

Neurodegenerative diseases are a heterogeneous group of disorders characterized by the progressive degeneration of the structure and function of the central or peripheral nervous system [1]. Common neurodegenerative diseases, such as Alzheimer’s disease and Parkinson’s disease, are usually incurable, and the degeneration of nerve cells is difficult to stop. These diseases impair many everyday functions, such as balance, movement, talking, and breathing. Studies have indicated that diet may help prevent or delay neurodegenerative diseases and cognitive decline [2]. However, further research is needed to better understand the underlying mechanisms and to reveal potential interactions with clinical and pharmacokinetic factors.

The objective of this paper is to study potential relations between neurodegenerative diseases and diet using a knowledge graph-based approach. The term knowledge graph was popularized by Google, which used it to enhance information retrieval from different sources. In this paper, we encode biomedical concepts and their rich relations into a network (knowledge graph) through literature mining. Literature mining is a data mining technique that identifies entities such as genes, diseases, and chemicals in the literature, discovers global trends, and facilitates hypothesis generation based on existing knowledge. It enables researchers to survey a massive amount of literature quickly and to reveal hidden relations between entities that might be hard to discover by manual analysis. We introduce a biomedical knowledge graph that focuses specifically on neurodegenerative diseases and diet. It can be used to study and discover underlying relations between the two and to give researchers a comprehensive overview of food, diet, and neurodegenerative diseases.

2 Materials and Methods

We first retrieved the abstracts that were related to neurodegenerative diseases and food/diet from PubMed [3]. Biomedical entities in the abstracts were then extracted using PubTator [4]. In this paper, relations between entities were determined by co-occurrence, based on which the knowledge graph was constructed. Finally, we generated node embeddings and analyzed two representative neurodegenerative diseases based on the cluster of embeddings. A visualization of our knowledge graph is provided using Neo4j. An overview of this pipeline is illustrated in Figure 1.

Figure 1: Overview of pipeline.

2.1 Biomedical Concept Extraction

To retrieve abstracts that are relevant to both neurodegenerative diseases and diet, we searched the PubMed database with the query “(Alzheimer’s disease OR Parkinson’s disease OR Prion disease OR Huntington disease OR (neurodegenerative disease)) AND (eat or diet or food)”. PubMed is a public database that comprises more than 32 million citations for biomedical literature from MEDLINE, life science journals, and online books [3]. The collected abstracts came from different types of studies: Randomized Controlled Trials, Clinical Trials, and Meta-Analyses. The publication dates of these studies ranged from 1975 to 2020.
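As an illustration only, the sketch below assembles a boolean query of this shape and an NCBI E-utilities ESearch URL for it. The term grouping is our reading of the quoted query, and the helper names (`build_query`, `build_esearch_url`) and the `retmax` value are hypothetical; the paper does not say how the search was actually submitted. No network request is made.

```python
from urllib.parse import urlencode

# Terms copied from the query quoted above; the grouping is an assumption.
DISEASE_TERMS = [
    "Alzheimer's disease",
    "Parkinson's disease",
    "Prion disease",
    "Huntington disease",
    "neurodegenerative disease",
]
DIET_TERMS = ["eat", "diet", "food"]

# Public ESearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(disease_terms, diet_terms):
    """Join each group with OR, then require one term from each group."""
    diseases = " OR ".join(f"({term})" for term in disease_terms)
    diets = " OR ".join(diet_terms)
    return f"({diseases}) AND ({diets})"


def build_esearch_url(query, retmax=100):
    """Return an ESearch URL for the query (hypothetical helper)."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_query(DISEASE_TERMS, DIET_TERMS)
url = build_esearch_url(query)
print(query)
print(url)
```

Keeping the two term lists separate makes it easy to extend the disease or diet vocabulary without touching the boolean structure.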

2.2 PubTator

PubTator Central (PTC) [4] is a web-based tool developed by NCBI that provides automatic annotations of biomedical concepts such as genes, chemicals, and diseases. Using PubTator, we extracted biomedical entities from the relevant abstracts and classified them into five concept categories: Disease, Chemical, Species, Gene, and SNP&Mutation (including DNA and protein mutations). We assumed that co-occurrence of two entities in the same abstract indicates some relationship between them. For every Disease occurring in an abstract, we linked it with all other concepts co-occurring in that abstract. We repeated this process for all abstracts and used the resulting pairs to construct our knowledge graph. In the knowledge graph, we created a node for each biomedical concept. For each linked pair of nodes, we stored their co-occurrence frequency and their source literature on the edge.
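The co-occurrence linking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the input format (a mapping from a PMID to a list of `(name, type)` annotations), the function name `build_disease_edges`, and the choice to count each pair at most once per abstract are ours.

```python
from collections import defaultdict


def build_disease_edges(abstracts):
    """Link every Disease in an abstract to all other concepts in it.

    abstracts: dict mapping a PMID to a list of (name, type) annotations
    (hypothetical input format). Returns a dict mapping a concept pair to
    its co-occurrence count and the set of source PMIDs.
    """
    edges = defaultdict(lambda: {"count": 0, "pmids": set()})
    for pmid, annotations in abstracts.items():
        concepts = set(annotations)  # assumption: count a pair once per abstract
        diseases = [c for c in concepts if c[1] == "Disease"]
        pairs = set()
        for disease in diseases:
            for other in concepts:
                if other == disease:
                    continue
                if other[1] == "Disease":
                    # Disease-Disease pairs are unordered; sort to avoid duplicates.
                    pair = tuple(sorted((disease, other)))
                else:
                    pair = (disease, other)
                pairs.add(pair)
        for pair in pairs:
            edges[pair]["count"] += 1
            edges[pair]["pmids"].add(pmid)
    return dict(edges)


# Toy annotations for illustration only; not data from the paper.
toy_abstracts = {
    "pmid_1": [("Alzheimer Disease", "Disease"), ("Polyphenols", "Chemical"),
               ("tau", "Gene")],
    "pmid_2": [("Alzheimer Disease", "Disease"), ("Polyphenols", "Chemical"),
               ("Diabetes Mellitus", "Disease")],
}
edges = build_disease_edges(toy_abstracts)
for pair, info in sorted(edges.items()):
    print(pair, info["count"], sorted(info["pmids"]))
```

Note that concepts are only linked through a Disease: in the toy data, Polyphenols and tau share an abstract but receive no edge of their own.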

2.3 Network embeddings representation learning

We used Neo4j [5], a widely used graph database management system, to construct a knowledge graph of all Disease-concept relationships. Figure 2 shows a visualization of this knowledge graph, in which different colors represent different types of biomedical concepts. To quantitatively compare the relationships between biomedical concepts, we used node2vec [6] to map each graph node to a fixed-length vector, learned by maximizing the likelihood of preserving each node's network neighborhood. Specifically, node2vec generates fixed-length random walks starting from each node and feeds them into the word2vec algorithm to obtain a numerical representation. We used a random walk length of 10 and 100 dimensions for the representations. We also used the occurrence frequency of each Disease-related pair as the edge weight between the two nodes. Node2vec computes the transition probabilities between nodes from these weights and uses them to generate the random walks. To visualize the embedding space, we applied t-distributed stochastic neighbor embedding (t-SNE) [7] to reduce the dimension from 100 to 2 while preserving the local structure. Specifically, t-SNE:

1. Converts the pairwise distances between points in the high-dimensional space into conditional probabilities that represent how similar the points are.

2. Computes the analogous pairwise probabilities in the low-dimensional space using a heavy-tailed t-distribution.

3. Minimizes the Kullback-Leibler divergence between the distributions from step 1 and step 2 using gradient descent.
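For reference, the three steps correspond to the standard t-SNE formulation [7]. The notation below is ours rather than the paper's: x_i denotes the 100-dimensional embedding of node i, y_i its 2-dimensional image, n the number of nodes, and sigma_i a per-point bandwidth determined by the perplexity setting, which the paper does not report.

```latex
% Step 1: similarity of x_j to x_i in the original space (Gaussian kernel),
% symmetrized into a joint distribution P.
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Step 2: similarity in the low-dimensional space (Student t, one degree of freedom).
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Step 3: cost minimized over the y_i by gradient descent.
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The heavy tails of q_{ij} let moderately distant points sit far apart in two dimensions, which is why clusters in the projection tend to be well separated.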

Figure 2: Neo4j knowledge graph visualization.
Figure 3: t-SNE visualization of node2vec embeddings.

The t-SNE projection of the node embeddings is shown in Figure 3. Clear clusters of diseases can be seen in the lower part of the figure. Several diseases at the bottom right are separated from the majority; we found that these are diseases that spread among both animals and humans, for example, Chronic Wasting Disease, Creutzfeldt-Jakob Syndrome, and Prion Disease.

3 Results

From the 4,300 abstracts, we obtained 1,188, 1,309, 822, 322, and 40 unique entities (concepts) for Disease, Chemical, Gene, Species, and SNP&Mutation, respectively. These biomedical concepts form 21,521, 8,048, 5,042, and 161 unique relationships of type Disease-Chemical, Disease-Gene, Disease-Species, and Disease-SNP&Mutation, respectively. The most frequent Disease-concept pairs are listed in Table 1. We noticed that polyphenols, which are usually found in fruits and vegetables, co-occur frequently with multiple neurodegenerative diseases. Polyphenols have been reported to reduce the risk of neurodegenerative disease [8]. Moreover, many epidemiological studies have reported a relation between omega-3 intake and Alzheimer’s Disease [9]. The appearance of omega-3 fatty acids among the chemical-disease pairs in Table 1 is consistent with these reports. Olea europaea and Curcuma longa are the two top-ranked food-related species associated with multiple diseases. Diabetes Mellitus, Cardiovascular Diseases, and Obesity are the three top-ranked non-neurodegenerative diseases that appear in disease-disease pairs. Of the 4,300 abstracts, only 20 contain SNP&Mutation-Disease pairs, which suggests a lack of research in this area. We do not show these pairs in Table 1 because their occurrences are sparse.

Chemical-Disease Pair

| Chemical Name | Disease Name | Count |
| --- | --- | --- |
|  | Neurodegenerative Diseases | 175 |
| Lipids | Alzheimer Disease | 167 |
| Fatty Acids, Omega-3 | Alzheimer Disease | 131 |
| Lipids | Neurodegenerative Diseases | 124 |
| Polyphenols | Alzheimer Disease | 122 |
| Polyphenols | Neoplasms | 108 |

Gene-Disease Pair

| Gene Name | Disease Name | Count |
| --- | --- | --- |
| Abeta | Alzheimer Disease | 169 |
| tau | Alzheimer Disease | 110 |
| insulin | Alzheimer Disease | 108 |
| Apo-E | Alzheimer Disease | 80 |
| Abeta | Neurodegenerative Diseases | 75 |
| insulin | Diabetes Mellitus | 74 |

Species-Disease Pair

| Species Name | Disease Name | Count |
| --- | --- | --- |
| Olea europaea | Neurodegenerative Diseases | 37 |
| Olea europaea | Neoplasms | 29 |
| Olea europaea | Alzheimer Disease | 28 |
| Curcuma longa | Alzheimer Disease | 26 |
| Curcuma longa | Neurodegenerative Diseases | 22 |
| Curcuma longa | Neoplasms | 20 |

Disease-Disease Pair

| Disease Name | Disease Name | Count |
| --- | --- | --- |
| Alzheimer Disease | Diabetes Mellitus | 275 |
| Neurodegenerative Diseases | Diabetes Mellitus | 261 |
| Diabetes Mellitus | Cardiovascular Diseases | 212 |
| Alzheimer Disease | Cardiovascular Diseases | 194 |
| Parkinson Disease | Obesity | 146 |
| Neurodegenerative Diseases | Obesity | 135 |

Table 1: Most frequent entity-entity pairs extracted from PubMed

In the embedding space, we found the top-10 nearest neighbors of two representative neurodegenerative diseases: Alzheimer’s Disease and Parkinson’s Disease. Chemicals and food-related species that might be related to diet are highlighted. We also included a more general concept, neurodegenerative diseases, which may refer to one or more diseases from literature, in Table 2.
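A nearest-neighbor lookup of this kind can be sketched as below. The paper does not state which distance metric was used; Euclidean distance is assumed here, and the function name `nearest_neighbors` and the two-dimensional toy vectors are purely illustrative (the real embeddings have 100 dimensions).

```python
import math


def nearest_neighbors(embeddings, query, k=10):
    """Return the k concepts closest to `query` by Euclidean distance.

    embeddings: dict mapping a concept name to its vector (list of floats).
    The metric is an assumption; the paper does not specify it.
    """
    target = embeddings[query]
    scored = [
        (name, math.dist(vector, target))
        for name, vector in embeddings.items()
        if name != query  # skip the query node itself
    ]
    scored.sort(key=lambda item: item[1])  # smallest distance first
    return scored[:k]


# Toy vectors for illustration only; not embeddings from the paper.
toy_embeddings = {
    "Alzheimer Disease": [0.0, 0.0],
    "AChE": [1.0, 0.0],
    "Panax ginseng": [0.0, 2.0],
    "Silicon": [3.0, 4.0],
}
neighbors = nearest_neighbors(toy_embeddings, "Alzheimer Disease", k=2)
print(neighbors)  # [('AChE', 1.0), ('Panax ginseng', 2.0)]
```

Swapping `math.dist` for a cosine distance would be a one-line change if a different notion of closeness is preferred.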

| Alzheimer’s Disease: Name | Distance | Type | Parkinson’s Disease: Name | Distance | Type | Neurodegenerative Diseases: Name | Distance | Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Oryctolagus cuniculus | 1.32 | Species | Amines | 1.65 | Chemical | Age | 1.52 | Gene |
| AChE | 1.43 | Gene | rasagiline | 1.66 | Chemical | Endocannabinoids | 1.56 | Chemical |
| insulin receptor | 1.44 | Gene | Nicotine | 1.70 | Chemical | Polysaccharides | 1.56 | Chemical |
| Panax ginseng | 1.45 | Species | Dronabinol | 1.77 | Chemical | Allium sativum | 1.57 | Species |
|  | 1.45 | Chemical | entacapone | 1.78 | Chemical | Sphingolipids | 1.57 | Chemical |
|  | 1.46 | Chemical | CB2 | 1.80 | Gene | PX clade | 1.58 | Species |
|  | 1.47 | Gene | Arrhythmias, Cardiac | 1.81 | Disease | Isoflavones | 1.58 | Chemical |
| Fluorodeoxyglucose F18 | 1.47 | Chemical | Mucuna pruriens | 1.82 | Species | Thiamine | 1.58 | Chemical |
|  | 1.48 | Chemical | Uric Acid | 1.82 | Chemical | Agaricus bisporus | 1.58 | Species |
| Silicon | 1.51 | Chemical | Cysteine | 1.83 | Chemical | Crocus sativus | 1.61 | Species |

Table 2: Nearest neighbours of 3 diseases

4 Conclusion

In this study, we built a framework to construct and visualize a knowledge graph that links biomedical knowledge about neurodegenerative diseases from PubMed. Specifically, we focused on relationships between neurodegenerative diseases and food/diet. Our preliminary analysis indicated that the pipeline can be used to identify biomedical concepts that are semantically close to each other and to reveal relationships between diet and diseases of interest. There are many ways to improve this framework further, such as linking more concepts to existing knowledge bases, applying graph network analysis to find hidden relationships, and adding supervised or semi-supervised methods for relationship extraction. Linking sparse knowledge from the fast-growing literature would benefit knowledge and information retrieval and may promote the discovery of new knowledge. The framework is flexible and can be used for other applications such as drug repurposing, therapeutic discovery, and clinical decision support for neurodegenerative and other diseases. The resulting knowledge graph can help researchers with data-driven knowledge discovery and new hypothesis generation.


This work was partially supported by the National Institutes of Health under award number RF1AG072799.