1 Scientific Background
Neurodegenerative diseases are a heterogeneous group of disorders characterized by the progressive degeneration of the structure and function of the central or peripheral nervous system. Common neurodegenerative diseases, such as Alzheimer’s disease and Parkinson’s disease, are usually incurable, and the degeneration of nerve cells is difficult to stop. Neurodegenerative diseases affect many human activities, such as balance, movement, talking, and breathing. Studies have indicated that diet may help prevent or delay neurodegenerative diseases and cognitive decline. However, further research is needed to better understand the underlying mechanisms and to reveal potential interactions with clinical and pharmacokinetic factors.
The objective of this paper is to study potential relations between neurodegenerative diseases and diet using a knowledge graph-based approach. The concept of the knowledge graph originated from Google, where it was used to enhance information retrieval from different sources. In this paper, we encode biomedical concepts and their rich relations into a network (knowledge graph) through literature mining. Literature mining is a data mining technique that identifies entities such as genes, diseases, and chemicals in the literature, discovers global trends, and facilitates hypothesis generation based on existing knowledge. It enables researchers to study a massive amount of literature quickly and to reveal hidden relations between entities that might be hard to discover by manual analysis. We introduce a biomedical knowledge graph that focuses specifically on neurodegenerative diseases and diet. It can be used to study and discover underlying relations between the two and to give researchers a comprehensive overview of food, diet, and neurodegenerative diseases.
2 Materials and Methods
We first retrieved abstracts related to neurodegenerative diseases and food/diet from PubMed. Biomedical entities in the abstracts were then extracted using PubTator. Relations between entities were determined by co-occurrence, and the knowledge graph was constructed from these relations. Finally, we generated node embeddings and analyzed two representative neurodegenerative diseases based on clusters of the embeddings. A visualization of our knowledge graph is provided using Neo4j. An overview of this pipeline is illustrated in Figure 1.
2.1 Biomedical Concept Extraction
To retrieve abstracts that are relevant to both neurodegenerative diseases and diet, we searched the PubMed database with the query “(Alzheimer’s disease OR Parkinson’s disease OR Prion disease OR Huntington disease OR (neurodegenerative disease)) AND (eat or diet or food)”. PubMed is a public database that comprises more than 32 million citations for biomedical literature from MEDLINE, life science journals, and online books. The collected abstracts came from three types of studies: Randomized Controlled Trials, Clinical Trials, and Meta-Analyses. The publication dates of these studies ranged from 1975 to 2020.
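As an illustration of how such a query could be assembled programmatically, the following minimal sketch builds the boolean search string and an NCBI E-utilities `esearch` URL. The paper does not state how the search was issued, so the use of E-utilities, the helper names `build_query` and `build_esearch_url`, and the `retmax` value are assumptions. The study-type and date filters are not included, and no network request is made.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed here; not named in the paper).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

DISEASE_TERMS = [
    "Alzheimer's disease",
    "Parkinson's disease",
    "Prion disease",
    "Huntington disease",
    "neurodegenerative disease",
]
DIET_TERMS = ["eat", "diet", "food"]


def build_query(disease_terms, diet_terms):
    """Join each term group with OR, then require both groups with AND."""
    diseases = " OR ".join(f"({term})" for term in disease_terms)
    diets = " OR ".join(diet_terms)
    return f"({diseases}) AND ({diets})"


def build_esearch_url(query, retmax=100):
    """Return an esearch URL for PubMed; this only builds the string."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_query(DISEASE_TERMS, DIET_TERMS)
url = build_esearch_url(query)
```

Keeping the two term groups in separate lists makes it straightforward to extend the search to other diseases or dietary terms without editing the query string by hand.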
PubTator Central (PTC) is a web-based tool developed by NCBI that provides automatic annotations of biomedical concepts such as genes, chemicals, and diseases. Using PubTator, we extracted biomedical entities from the relevant abstracts and classified them into five concept categories: Disease, Chemical, Species, Gene, and SNP&Mutation (including DNA and protein mutations). We assumed that co-occurrence in the same abstract indicates a relationship between two entities. For every occurrence of a Disease in an abstract, we linked it with all other co-occurring concepts. We repeated this process for all abstracts and used the resulting pairs to construct our knowledge graph. In the knowledge graph, we created a node for each biomedical concept. For each pair of nodes, we also stored information such as their co-occurrence frequency and their source literature on the edge.
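The co-occurrence step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input format (a dict from PubMed ID to `(name, type)` tuples), the sample annotations, and the function name `build_edges` are hypothetical, and the count here is the number of abstracts in which a pair co-occurs, which is one possible reading of "occurrence frequency".

```python
from collections import defaultdict

# Hypothetical, already-parsed annotations: PubMed ID -> (concept, type) tuples.
# Real PubTator output would first need to be converted into this shape.
abstracts = {
    "pmid_1": [("Alzheimer Disease", "Disease"), ("Polyphenols", "Chemical"),
               ("tau", "Gene")],
    "pmid_2": [("Alzheimer Disease", "Disease"), ("Polyphenols", "Chemical")],
    "pmid_3": [("Parkinson Disease", "Disease"), ("Uric Acid", "Chemical")],
}


def build_edges(annotated):
    """Link every Disease to each other concept found in the same abstract."""
    edges = defaultdict(lambda: {"count": 0, "sources": []})
    for pmid, concepts in annotated.items():
        unique = set(concepts)  # count each concept once per abstract
        diseases = [c for c in unique if c[1] == "Disease"]
        for disease in diseases:
            for other in unique:
                if other == disease:
                    continue
                # A disease-disease pair is reachable from both ends;
                # keep only one orientation so it is not counted twice.
                if other[1] == "Disease" and other < disease:
                    continue
                key = (disease, other)
                edges[key]["count"] += 1
                edges[key]["sources"].append(pmid)
    return edges


edges = build_edges(abstracts)
```

Each edge record keeps both the frequency and the list of source abstracts, which mirrors the two pieces of edge information mentioned in the text.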
2.3 Network Embedding Representation Learning
We used Neo4j, a widely used graph database management system, to construct a knowledge graph of all Disease-concept relationships. In the visualization of this knowledge graph, different colors represent different types of biomedical concepts. To quantitatively compare the relationships between biomedical concepts, we used node2vec to map each graph node to a fixed-length vector, learned by maximizing the likelihood of preserving each node's network neighborhood. Specifically, node2vec generates fixed-length random walks starting from each node and feeds them into the word2vec algorithm to obtain the numerical representations. We used a random walk length of 10 and 100 dimensions for the representations. We also used the occurrence frequency of each Disease-related pair as the weight of the edge between the two nodes. Node2vec computes the transition probabilities between nodes from these weights and uses them to generate the random walks. To visualize the embedding space, we applied t-distributed stochastic neighbor embedding (t-SNE) to reduce the dimension from 100 to 2 while preserving the local structure. Specifically, t-SNE:
1. Converts the pairwise distances between points in the high-dimensional space into conditional probabilities that represent similarity.
2. Computes analogous pairwise probabilities in the low-dimensional space using a heavy-tailed t-distribution.
3. Minimizes the Kullback-Leibler divergence between the probability distributions from steps 1 and 2 using gradient descent.
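For reference, the three steps can be written in the standard t-SNE formulation of Van Der Maaten and Hinton. The symbols are not defined in this paper and are introduced here for illustration: $x_i$ are the 100-dimensional node embeddings, $y_i$ their 2-dimensional images, $N$ the number of nodes, and $\sigma_i$ a per-point bandwidth set by the perplexity. Note that the standard method symmetrizes the conditional probabilities of step 1 into joint probabilities before comparing them with step 2.

```latex
% Step 1: similarities in the high-dimensional space
p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}

% Step 2: similarities in the low-dimensional space (Student t, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Step 3: cost minimized by gradient descent over the y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```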
The t-SNE projection of the node embeddings is shown in Figure 3. Clear clusters of diseases can be seen in the lower part of Figure 3. Several diseases at the bottom right are separated from the majority; we found that these are diseases that spread among both animals and humans, for example, Chronic Wasting Disease, Creutzfeldt-Jakob Syndrome, and Prion Disease.
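The weighted random walks that node2vec feeds into word2vec, as described in this section, can be sketched as follows. This is a simplified illustration that assumes node2vec's return and in-out parameters are both 1, in which case each step depends only on the edge weights; the paper does not report these parameters. The toy graph, the number of walks per node, and the function names are hypothetical.

```python
import random

# Toy weighted, undirected graph; weights stand in for co-occurrence counts.
graph = {
    "Alzheimer Disease": {"Polyphenols": 122, "tau": 110},
    "Polyphenols": {"Alzheimer Disease": 122},
    "tau": {"Alzheimer Disease": 110},
}


def random_walk(graph, start, walk_length, rng):
    """Walk up to walk_length nodes, choosing neighbors by edge weight."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph.get(walk[-1], {})
        if not neighbors:
            break  # dead end: stop this walk early
        nodes = list(neighbors)
        weights = [neighbors[node] for node in nodes]
        # Transition probability is proportional to the edge weight.
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, num_walks, walk_length, seed=0):
    """Start num_walks walks from every node, in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, walk_length, rng))
    return walks


# Walk length 10 matches the setting reported in the text.
walks = generate_walks(graph, num_walks=2, walk_length=10)
```

Each walk is a sequence of node names that plays the role of a sentence; passing the list of walks to a word2vec implementation with a vector size of 100 would then yield one embedding per node.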
From the 4,300 abstracts, we extracted 1,188, 1,309, 822, 322, and 40 unique entities (concepts) for Disease, Chemical, Gene, Species, and SNP&Mutation, respectively. These biomedical concepts form 21,521, 8,048, 5,042, and 161 unique relationships for Disease-Chemical, Disease-Gene, Disease-Species, and Disease-SNP&Mutation, respectively. The most frequent Disease-Concept pairs are listed in Table 1. We noticed that polyphenols, which are commonly found in fruits and vegetables, frequently co-occur with multiple neurodegenerative diseases. Polyphenols have been reported to reduce the risk of neurodegenerative disease. Moreover, many epidemiological studies have reported a relation between omega-3 intake and Alzheimer’s disease. The appearance of omega-3 fatty acids among the chemical-disease pairs in Table 1 is consistent with this. Olea europaea and Curcuma longa are the two top-ranked food-related species associated with multiple diseases. Diabetes Mellitus, Cardiovascular Diseases, and Obesity are the three top-ranked non-neurodegenerative diseases that appear in disease-disease pairs. Of the 4,300 abstracts, only 20 contain SNP&Mutation-Disease pairs, which suggests a lack of research in this area. We do not show these pairs in Table 1 because their occurrences are sparse.
| Chemical-Disease Pair | | | Gene-Disease Pair | | |
| Chemical Name | Disease Name | Count | Gene Name | Disease Name | Count |
| | Neurodegenerative Diseases | 175 | Abeta | Alzheimer Disease | 169 |
| Lipids | Alzheimer Disease | 167 | tau | Alzheimer Disease | 110 |
| Fatty Acids, Omega-3 | Alzheimer Disease | 131 | insulin | Alzheimer Disease | 108 |
| Lipids | Neurodegenerative Diseases | 124 | Apo-E | Alzheimer Disease | 80 |
| Polyphenols | Alzheimer Disease | 122 | Abeta | Neurodegenerative Diseases | 75 |
| Species-Disease Pair | | | Disease-Disease Pair | | |
| Species Name | Disease Name | Count | Disease Name | Disease Name | Count |
| | Neurodegenerative Diseases | 37 | Alzheimer Disease | Diabetes Mellitus | 275 |
| | Neoplasms | 29 | Neurodegenerative Diseases | Diabetes Mellitus | 261 |
| Olea europaea | Alzheimer Disease | 28 | Diabetes Mellitus | Cardiovascular Diseases | 212 |
| Curcuma longa | Alzheimer Disease | 26 | Alzheimer Disease | Cardiovascular Diseases | 194 |
| Curcuma longa | Neurodegenerative Diseases | 22 | Parkinson Disease | Obesity | 146 |
| Curcuma longa | Neoplasms | 20 | Neurodegenerative Diseases | Obesity | 135 |
In the embedding space, we found the top-10 nearest neighbors of two representative neurodegenerative diseases: Alzheimer’s Disease and Parkinson’s Disease. Chemicals and food-related species that might be related to diet are highlighted. We also included a more general concept, neurodegenerative diseases, which may refer to one or more diseases from literature, in Table 2.
| Alzheimer’s Disease | | | Parkinson’s Disease | | | Neurodegenerative Diseases | | |
| | 1.48 | Chemical | Uric Acid | 1.82 | Chemical | Agaricus bisporus | 1.58 | Species |
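The nearest-neighbor lookup behind Table 2 can be sketched as below. This is an illustration only: the paper does not state which distance metric was used, so Euclidean distance is an assumption, and the 2-dimensional vectors, the placeholder concept names, and the function name `nearest_neighbors` are made up for the example (the actual embeddings have 100 dimensions).

```python
import math

# Toy 2-D embeddings with invented values; names other than the query
# are placeholders, not concepts taken from the knowledge graph.
embeddings = {
    "Alzheimer Disease": (0.0, 0.0),
    "concept_a": (1.0, 0.0),
    "concept_b": (0.0, 2.0),
    "concept_c": (3.0, 4.0),
}


def nearest_neighbors(embeddings, query, k=10):
    """Return the k concepts closest to `query` as (name, distance) pairs."""
    query_vector = embeddings[query]
    distances = [
        (math.dist(query_vector, vector), name)
        for name, vector in embeddings.items()
        if name != query  # a node is not its own neighbor
    ]
    return [(name, distance) for distance, name in sorted(distances)[:k]]


neighbors = nearest_neighbors(embeddings, "Alzheimer Disease", k=2)
```

With real embeddings, `k=10` would reproduce the top-10 setting used in the text, and the returned distances would correspond to the numeric column of the table under the Euclidean assumption.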
In this study, we built a framework to construct and visualize a knowledge graph that links biomedical knowledge related to neurodegenerative diseases from PubMed. Specifically, we focused on relationships between neurodegenerative diseases and food/diet. Our preliminary analysis indicated that the pipeline can be used to identify biomedical concepts that are semantically close to each other and to reveal relationships between diet and diseases of interest. Many possibilities exist to further improve this framework, such as linking more concepts with existing knowledge bases, implementing graph network analysis to uncover hidden relationships, and adding supervised or semi-supervised methods for relationship extraction. Linking sparse knowledge from the fast-growing literature would benefit knowledge and information retrieval and may promote the discovery of new knowledge. This framework is flexible and can be used for other applications such as drug repurposing, therapeutic discovery, and clinical decision support for neurodegenerative diseases and other diseases. The resulting knowledge graph can help researchers with data-driven knowledge discovery and new hypothesis generation.
This paper is partially supported by the National Institutes of Health under award number RF1AG072799.
-  “Neurodegenerative diseases - Latest research and news — Nature.” https://www.nature.com/subjects/neurodegenerative-diseases (accessed Sep. 09, 2021).
-  J. Joseph, G. Cole, E. Head, and D. Ingram. “Nutrition, brain aging, and neurodegeneration”. Journal of Neuroscience, 2009, vol. 29, no. 41, doi: 10.1523/JNEUROSCI.3520-09.2009.
-  “PubMed.” https://pubmed.ncbi.nlm.nih.gov/ (accessed Sep. 09, 2021).
-  C. H. Wei, A. Allot, R. Leaman, and Z. Lu. “PubTator central: automated concept annotation for biomedical full text articles”. Nucleic Acids Res, vol. 47, no. W1, 2019, doi: 10.1093/nar/gkz389.
-  “Graph Database Platform — Graph Database Management System — Neo4j.” https://neo4j.com/ (accessed Sep. 09, 2021).
-  A. Grover and J. Leskovec. “Node2vec: Scalable feature learning for networks”. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, vol. 13-17-August-2016, doi: 10.1145/2939672.2939754.
-  L. Van Der Maaten and G. Hinton. “Visualizing data using t-SNE”. J. Mach. Learn. Res., vol. 9, 2008.
-  K. S. Bhullar and H. P. V. Rupasinghe. “Polyphenols: Multipotent therapeutic agents in neurodegenerative diseases”. Oxid. Med. Cell. Longev., 2013, doi: 10.1155/2013/891748.
-  G. M. Cole, Q. L. Ma, and S. A. Frautschy. “Omega-3 fatty acids and dementia”. Prostaglandins Leukot. Essent. Fat. Acids, vol. 81, no. 2–3, 2009, doi: 10.1016/j.plefa.2009.05.015.