Towards context in large scale biomedical knowledge graphs

01/23/2020 · by Jens Dörpinghaus, et al.

Contextual information is widely considered for NLP and knowledge discovery in life sciences since it highly influences the exact meaning of natural language. The scientific challenge is not only to extract such context data, but also to store this data for further query and discovery approaches. Here, we propose a multiple step knowledge graph approach using labeled property graphs based on polyglot persistence systems to utilize context data for context mining, graph queries, knowledge discovery and extraction. We introduce the graph-theoretic foundation for a general context concept within semantic networks and show a proof-of-concept based on biomedical literature and text mining. Our test system contains a knowledge graph derived from the entirety of PubMed and SCAIView data and is enriched with text mining data and domain specific language data using BEL. Here, context is a more general concept than annotations. This dense graph has more than 71M nodes and 850M relationships. We discuss the impact of this novel approach with 27 real world use cases represented by graph queries.


1 Background

The amount of available and stored data is constantly increasing in many areas in the course of digitalization. This growth poses a great challenge for storage and requires the development of new storage technologies. At the same time, with more available data and different storage technologies, new applications based on the data are of great interest. Large data collections are used for data mining and knowledge discovery to answer new and complex questions more efficiently. For this purpose, data is often stored in non-relational databases, and while there are many types available, one of the more interesting and promising is the knowledge graph. In this database structure, the entities of a domain are stored as nodes in a graph, while connections between these entities are represented by edges. This allows networks within the data to be visualized and analyzed in order to discover new applications.

Current systems use RDF (Resource Description Framework) triple stores, systems that inherently have some serious limitations, especially when compared to a labeled property graph. For example, nodes and edges have no internal structure, which rules out complex queries like subgraph matchings or traversals, and it is not possible to uniquely identify instances of relationships of the same type; see [1]. Several approaches have been made to create RDF knowledge graphs, for example Bio2RDF (see [2] and [3], reviewed by [4] or [5]). For our generalized concept of context, we require labeled property graph structures.

Context is a widely discussed topic in text mining and knowledge extraction since it is an important factor in determining the correct semantic sense of unstructured text. In [6], Nenkova and McKeown discuss the influence of context on text summarization. Ambiguity is an issue for both common language words and those in a scientific context. The challenge in this field is not only to extract such context data, but also to be able to store this data for further natural language processing (NLP), querying and discovery approaches. Here, we propose a multiple-step knowledge graph based approach to utilize context data for biological research and knowledge expression, based on our results published in [7]. We present a proof of concept using biomedical literature and give an outlook on additional improvements which can be implemented in the next generation of knowledge extraction, e.g. training approaches from artificial intelligence and machine learning.

Knowledge graphs have been shown to play an important role in recent knowledge mining and discovery. A knowledge graph (sometimes also called a semantic network) is a systematic way to connect information and data to knowledge on a more abstract level compared to language graphs. This type of data structure has many advantages in terms of searching within biomedical data and serves as a vital tool capable of generating novel ideas. Another important attribute when generating knowledge is context, and therefore connecting knowledge graphs using contextual information can further enhance data analysis and hypothesis generation.

As a basis for this work, we generated a knowledge graph that initially contains publication metadata from PubMed (https://www.ncbi.nlm.nih.gov/pubmed), which comprises more than 30 million biomedical documents. In subsequent steps, the knowledge graph was expanded to include BEL (Biological Expression Language) relations and named entities obtained from text mining using JProMiner (see [8]) and stored in SCAIView (https://www.scaiview.com/), as well as ontologies and terminologies like MeSH. This results in a graph with a very high number of nodes and edges. Saving and managing such a graph is challenging due to the limited horizontal scalability of graph databases; it is therefore to be expected that search queries on the graph have long runtimes. This paper presents a polyglot persistence approach to tackle this challenge using Neo4j (https://neo4j.com/), a graph database with native graph storage.

Here, we use a general definition of context data, assuming that each information entity can also be contextual information for other entities; for example, a document can serve as context for other documents (e.g. by citing or referring to them). An author is not only metainformation for a document, but also context in their own right (through other publications, affiliations, co-author networks, …). Other data is more obviously purely contextual: named entities, topic maps, keywords, etc. extracted from documents with text mining. However, relations extracted from a text document may also stand on their own, occurring in multiple documents and remaining valuable without the original textual information.

Figure 1: Proposed workflow to extend a knowledge graph. Starting with a document graph, basic metainformation like authors, keywords, etc. is added. This can be used as a basis for text mining, which in turn extends the graph; for example, named entity recognition (NER) may use keywords as context. Topic detection may also benefit from already assigned keywords, journals or author information. The graph can also be extended by knowledge discovery processes, for example finding parameters of a clinical trial, progression within electronic health records, etc. In any case, new context information is added to the initial graph and improves the input of further algorithms.

We begin with a simple document graph and, in a first step, add context metainformation (see Figure 1). This leads to an initial knowledge graph which can be used for preliminary context-based text mining approaches. In doing so, additional context data is added to the knowledge graph, such as entities or concepts from ontologies or relations extracted from the analyzed text. The resulting knowledge graph can be used as a starting basis for more detailed text mining approaches which utilize the novel context data. These steps can be repeated several times to further enrich the graph.

In fact, using a graph structure to house data has several additional advantages for knowledge extraction: biological and medical researchers, for example, are interested in exploring the mechanisms of living organisms and gaining a better understanding of the underlying fundamental biological processes of life. Systems biology approaches, such as integrative knowledge graphs, are important for deciphering the mechanism of a disease by considering the system as a whole, which is also known as the holistic approach. To this end, disease modeling and pathway databases both play an important role. Knowledge graphs built using BEL are widely applied in the biomedical domain to convert unstructured textual knowledge into a computable form. The BEL statements that form knowledge graphs are semantic triples that consist of concepts, functions and relationships [9]. In addition, several databases and ontologies implicitly form a knowledge graph. For example, Gene Ontology (see [10]) and DrugBank (see [11] or [12]) cover a large number of relations and references to other fields.

There are still several crucial issues to consider when converting literature to knowledge such as evaluating the quality and completeness of such networks. Furthermore, in order to generate new knowledge, context of concepts in a knowledge graph must be considered.

We first present a preliminary overview of information theory and management. Afterwards, we introduce and discuss the novel approach of managing and mining contextual data in knowledge graphs. Finally, we give a detailed list of issues that need to be addressed and show the results of evaluating real use cases.

1.1 Preliminaries

A knowledge graph is a systematic way to connect information and data to knowledge. It is thus a crucial concept for generating knowledge and wisdom and for searching within data, information and knowledge. As described above, context is essential for generating knowledge or even wisdom; thus, connecting knowledge graphs with context is a crucial feature.

Definition 1.1.

(Knowledge Graph) We define a knowledge graph as a graph $G = (E, R)$ with entities $e \in E$ coming from a formal structure like ontologies.

The relations $r \in R$ can be ontology relations; thus, in general, every ontology $O_i$ which is part of the data model is a subgraph of $G$, indicating $O_i \subseteq G$. In addition, we allow inter-ontology relations between two nodes $e_1, e_2$ with $e_1 \in O_1$, $e_2 \in O_2$ and $O_1 \neq O_2$. In more general terms, we define $R = \{R_1, \dots, R_m\}$ as a list of either inter-ontology or inner-ontology relations. Both $E$ as well as $R$ are finite discrete spaces.

Every entity $e \in E$ may have some additional metainformation which needs to be defined with respect to the application of the knowledge graph. For instance, there may be several node sets (some ontologies, some document spaces (patents, research data, …), author sets, journal sets, …) so that $E = E_1 \cup \dots \cup E_n$. The same holds for $R$ when several context relations come together, such as "is cited by", "has annotation", "has author", "is published in", etc.

Definition 1.2.

(Context) We define context as a set $C$ with context subsets $C_i \subseteq C$. This is a finite, discrete set. Every node $v \in E$ and every edge $r \in R$ may have one or more contexts, denoted by $con(v) \subseteq C$ or $con(r) \subseteq C$.

It is also possible to set $con(v) = \emptyset$. Thus we have a mapping $con: E \cup R \to \mathcal{P}(C)$. If we use a quite general approach towards context, we may set $C = E \cup R$. Therefore, every inter-ontology relation defines context of two entities, but the relations within an ontology can also be seen as context.
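For illustration, assume a document node $d$ with author $a$, MeSH annotation $k$ and journal $j$ (hypothetical entities). Under the general choice $C = E \cup R$, the context of an element is simply a set of other graph elements:

\[
con(d) = \{a, k, j\} \subseteq C, \qquad con\big((d, d')\big) = \{\text{"is cited by"}\}
\]

where $(d, d')$ is a citation edge between two documents. Context is thus not stored apart from the graph; it is part of it.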

With the neighborhood $N(v)$ of a node $v$, every node set $E' \subseteq E$ induces a subgraph $G[E'] \subseteq G$:

Definition 1.3.

(Extended Context Subgraph, Graph Embeddings) With $\hat{G}[E']$ we denote the extended context subgraph, which also contains the neighbors $N(v)$ of each node $v \in E'$, i.e. the context of that node.

From a graph drawing perspective, if $\hat{G}[E']$ defines a proper surface, we can think about a graph embedding of another subgraph $G'' \subseteq G$ on $\hat{G}[E']$. This concept was introduced in [17]. Here, semantic knowledge graph embeddings were displayed between different layers. Every layer (for example: molecular layer, document layer, mechanism layer) corresponds to another context, defining new contexts on other layers. See Figure 2 for an illustration.

Definition 1.4.

(Context Metagraph) We can create the metagraph $M = (C, R_C)$ of these contexts. Each context $c \in C$ is identified by a node in $M$. If there is a connection in $G$ between two contexts $c_1$ and $c_2$, we add an edge $(c_1, c_2)$ to $R_C$. This means $(c_1, c_2) \in R_C$ if there are elements $v_1, v_2 \in E \cup R$ with $c_1 \in con(v_1)$ and $c_2 \in con(v_2)$ such that $v_1 = v_2$, or $v_1$ and $v_2$ are adjacent nodes, or one is an edge incident to the other.
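A minimal Cypher sketch of how such a metagraph could be materialized in Neo4j is shown below; the label Context and the relationship types hasContext and metaEdge are illustrative assumptions, not the schema used later in this paper:

    // Assumed schema: every context is a Context node attached to
    // knowledge-graph elements via hasContext. Create a metagraph
    // edge whenever two adjacent nodes carry two different contexts.
    MATCH (c1:Context)<-[:hasContext]-(v1)-[]-(v2)-[:hasContext]->(c2:Context)
    WHERE c1 <> c2
    MERGE (c1)-[:metaEdge]->(c2)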

Figure 2: Illustration of a knowledge graph with context (left). Context is illustrated by colored nodes (green, red, orange) connected to nodes. The colored areas describe the extended context subgraph or context embedding of these contexts. On the right, the corresponding context metagraph is shown. Every context in the knowledge graph refers to a node in the metagraph. The references to the original knowledge graph are illustrated by blue edges. The edges within the metagraph describe whether an edge from one context to the next exists in the original graph.

Adding edges between the knowledge graph $G$ or a subgraph $G' \subseteq G$ and the metagraph $M$ will lead to a novel graph. This can either be seen as the inverse mapping $con^{-1}$ or as the hypergraph given by

\[
G^M = \big(E \cup C,\; R \cup R_C \cup \{(v, c) : v \in E \cup R,\; c \in con(v)\}\big).
\]

This graph can be seen as an extension of the original knowledge graph where contexts connect not only to the initial nodes; every two nodes in $G$ are also connected by a hyperedge if they share the same context, as shown in Figure 3.

If $C = E \cup R$, this will lead to new edges in $G^M$, thus enriching the original graph. This step should be performed after every additional extension of the graph $G$.
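This enrichment step can be sketched in Cypher under the same assumed hasContext schema as above: any two entities attached to the same context node become connected, mirroring the hyperedges in Figure 3:

    // Connect every pair of nodes that share a context (sketch).
    // The id() comparison avoids duplicate and self pairs.
    MATCH (v1)-[:hasContext]->(c:Context)<-[:hasContext]-(v2)
    WHERE id(v1) < id(v2)
    MERGE (v1)-[:sharesContext]->(v2)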

Figure 3: This figure describes the hypergraph between the context metagraph $M$ and the original knowledge graph $G$ or a subgraph $G' \subseteq G$. This graph is sorted by contexts. The hyperedges, illustrated by sets, connect nodes with their contexts, but also nodes sharing the same context.

We denote this hypergraph on a knowledge graph $G$ and a metagraph $M$ with $G^M$. We can add multiple metagraphs $M_1$ and $M_2$, which is denoted by $G^{M_1, M_2}$.

The resulting graph can thus be seen as an enrichment of the original knowledge graph with contexts. It can be used to answer several research questions and to find graph-theoretic formulations of research questions.

If the mapping $con$ is well defined for the domain set, then the graph $G^M$ can be generated in polynomial time. Since this is generally not the case, this step usually involves data or text mining tasks to generate further contexts from free texts or knowledge graph entities. With respect to the notation described in [18], this problem can be formulated as

\[
\delta: \Delta \to \Theta \qquad (1)
\]

Here, the domain set $\Delta$ is explicitly given by $E \cup R$ or – if additional full texts supporting the knowledge graph exist – by these documents as well, which in our case is the domain subset $\Delta' \subseteq \Delta$. Therefore, we need to find a description function $\delta$ with a description set $\Theta$ which holds all contexts. To find relevant contexts, we also need to measure the error, as defined by an error function $\varepsilon$.

Several research questions must be considered. First, what metainformation can be used to generate context for a new metagraph? Several promising candidates include authors, citations, affiliations, journals, MeSH terms and other keywords, since they are all available in most databases. We also need to consider text mining results such as NER, relationship mining, etc. Given more general data, including study data, genomics, images, etc., we might also consider side effects, disease labels, or population labels (sex, age, social class, etc.). Figure 1 shows a proof of concept for a less complex text mining metadata approach, which describes the process of starting with a simple document graph that is extended with more context data derived from text mining. We discuss this in more detail in the next section.

The second research question addresses the application of this novel approach to biomedical research as well as to text classification and clustering, NLP and knowledge discovery, with a focus on artificial intelligence (AI). How can we use the context metagraph to answer biomedical questions? What can we learn from connections between contexts, and what do they look like in the knowledge graph? How can we write efficient graph queries utilizing context? It may also be useful to filter paths in the knowledge graph according to a given context or to generate novel visualizations. A possible question might be to learn about mechanisms linked to co-morbidities or mechanisms contextualized by drug information. The metagraph may also contain information about cause-and-effect relationships in the knowledge graph that are "valid" in a biomedical sense under certain conditions, as well as contextualization based on demographic or polypharmacy information. We will discuss several use cases in the last section of this paper.

1.2 Method

1.2.1 Technical setup

We illustrate the following methods with example runs on PubMed and PMC data. Both sources are already included in the SCAIView NLP-pipeline. PubMed contains 30 million abstracts from biomedical literature, while PMC houses nearly 4 million full-text articles.

First and foremost, the knowledge graph must be stored and accessed by the software in an efficient manner. To this end, a software component was written to integrate the knowledge graph into our SCAIView microservice architecture, see [19]. This integration also ensures that the knowledge graph is constantly updated with preprocessed data. The software component also provides an API to execute several queries on the knowledge graph and is capable of returning the result in JSON Graph Format (http://jsongraphformat.info/), which can be easily displayed by many frontend frameworks.

Our software component was written in Java using Spring Boot (http://spring.io/projects/spring-boot) and Spring Data (https://spring.io/projects/spring-data) to access the database backend in an abstract way and to ensure the exchangeability of the database technology. The database backend in our case is the graph database Neo4j (https://neo4j.com/). Neo4j supports an initial bulk import, allowing us to import the massive knowledge graph in one step. The bulk import tool of Neo4j requires the input data in CSV file format. To this end, we designed a software component that exports the data derived from SCAIView as CSV files.

Storing a large knowledge graph from PubMed, such as the one presented here, in a single database is not a simple task, and we expected the execution of our graph queries to be very slow due to the size of the knowledge graph. To speed up the queries, we decided to partition the data using polyglot persistence. Polyglot persistence means combining heterogeneous data storage technologies within a single application. Instead of storing all of the data in one database, we chose to store different parts of the data in different database technologies. The benefit of polyglot persistence is that each database technology has different strengths, and the application can take advantage of them all.

In Neo4j, the graph structure is stored separately from the properties of nodes and edges. This organization makes traversing the knowledge graph easier; however, storing and accessing string attributes takes longer than integer attributes [20]. To take advantage of this characteristic of Neo4j, we designed a storage system that encodes the string attributes of the graph as integers using polyglot persistence. By encoding and storing these attributes in key-value databases, we reduced the data size of the knowledge graph and were able to speed up property access in Neo4j. Figure 4 provides an illustration of the designed polyglot persistence system.
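The effect on queries can be sketched as follows; the concrete integer codes and the Redis key layout are illustrative assumptions:

    // Full: the string attribute lives in Neo4j and is matched directly.
    MATCH (d:Document {source: "PubMed"}) RETURN count(d);

    // Poly1/Poly2: the application first resolves the string in Redis
    // (e.g. GET source:PubMed -> 3), then queries Neo4j with the code.
    MATCH (d:Document {source: 3}) RETURN count(d);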

Figure 4: Example of a stored document node in Neo4j. On the left side, a PubMed document is stored with all of its attributes. On the right side, using polyglot persistence, the same document stores integer encodings for two attributes in Neo4j. The encoding of these attributes is stored in the key-value database Redis.

In two iterations, we selected suitable attributes of all node types, leading to three systems: the original one using only Neo4j (called Full) and two polyglot persistence systems (called Poly1 and Poly2). Full stores all data directly in Neo4j. Poly1 stores some information in an additional Redis database, while Poly2 combines multiple Redis databases with the Neo4j graph database.

We implemented another software component to execute the data preprocessing step for Poly1 and Poly2. It uses the created CSV input files of Full to run the data encoding in key-value databases and generates CSV input files for the Neo4j graph databases of the polyglot persistence systems. The whole process is illustrated in Figure 5.

Figure 5: The software component scaiview-neo4j-csv creates CSV files for the bulk import in Neo4j from SCAIView data. The created files are used as input for the system called Full. The second software component cdv-scenario-creator uses the CSV files, runs the encoding of the selected string attributes and creates CSV import files for Poly1 and Poly2.

To compare the execution runtime of queries on all three systems Full, Poly1 and Poly2, we collected 27 real world graph queries for the given knowledge graph. The query runtimes are discussed in Section 2.

1.2.2 Creating a document and context graph with basic context extraction

The first step in creating a document and context graph with basic context extraction is to define the entity sets and their relations. The articles and abstracts from PubMed and PMC already contain a lot of contextual data. We may define $D$ as the document set, containing one node per document. Furthermore, we may add a set $S$ for the sources of documents. Thus, each document can be interpreted as contextual data of a particular data source.

All metadata are stored in new node sets. $A$ stores the set of authors and $F$ stores their affiliations, which are again considered context for the authors. Another relevant piece of contextual information is the publisher, in our case the journal set $J$. PubMed has several classifications for documents, including Books and Documents, Case Reports, Classical Article, Clinical Study, Clinical Trial, Journal Article, and Review. We store this classification in $T$.

Other important context is the annotation set $K$, which stores multiple types of annotations such as named entities or keywords, all of which come from the MeSH tree; see [21] and https://www.nlm.nih.gov/mesh/intro_trees.html. Therefore, $K$ inherently contains a hierarchy, with edges between parent and child terms. The value of MeSH terms and their hierarchy for knowledge extraction was shown in several recent studies [22]. Figure 6 depicts the knowledge graph of a single document.

Figure 6: This figure is an illustration of a single document within the context graph. The document node (purple) has several gray annotation nodes, four red publication type nodes, and a pink author node with a gray affiliation. The source (PubMed) is annotated in a green node, the journal in a yellow node.
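A minimal Cypher sketch of the document node in Figure 6 could look as follows; the labels Source and Journal and the relationship names hasSource, isPublishedIn and hasAffiliation are assumptions, while Document, Author, Affiliation, Entity, isAuthor and hasAnnotation follow the queries shown in Section 2:

    // One document with its basic context (sketch; IDs and names are placeholders).
    CREATE (s:Source {name: "PubMed"})
    CREATE (j:Journal {name: "Some Journal"})
    CREATE (d:Document {documentID: "PMID:0"})
    CREATE (a:Author {forename: "Jane", surname: "Doe"})
    CREATE (f:Affiliation {name: "Some Institute"})
    CREATE (k:Entity {source: "MESH", preferredLabel: "Alzheimer Disease"})
    CREATE (d)-[:hasSource]->(s), (d)-[:isPublishedIn]->(j),
           (a)-[:isAuthor]->(d), (a)-[:hasAffiliation]->(f),
           (d)-[:hasAnnotation]->(k)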

All other relations can be added between these sets, for example $isAuthor \subseteq A \times D$, $isPublishedIn \subseteq D \times J$, etc. With this information, it is – from an algorithmic point of view – quite easy to combine all context relations such as isAuthor, hasAnnotation, hasCitation, etc., though these edges should also store additional provenance information, as shown in Figure 7.

Figure 7: This figure is an illustration of the initial document and context graph. A PubMed node is the source of document nodes (purple). There are several context annotations like article type (red), keywords (gray), authors (pink) and journal (yellow). Authors have additional context (affiliations, gray).

1.2.3 Extending the knowledge graph using NLP-technologies

The initial knowledge graph can be extended by NLP technologies. Terminologies and ontologies have been widely considered research topics in recent years. They play an important role in data and text mining as well as in knowledge representation in the semantic web. They have become increasingly important as data providers began publishing their data in semantic web formats, namely RDF ([23]) and OWL ([24]), to improve integration. The term terminology refers to the SKOS meta-model [25], which can be summarized as concepts, units of thought which can be identified, labeled with lexical strings, assigned notations (lexical codes), documented with various types of notes, linked to other concepts, organized into informal hierarchies and association networks, aggregated, grouped into labeled and/or ordered collections, and mapped to other concepts. Several complex models have been proposed in the literature and implemented in software, see [26]. Controlled vocabularies contain lists of entities which may be completed to a synonym ring to control synonyms. Ontologies also provide properties and can establish associative relationships, which can also be done by thesauri or terminologies. See [27] and [28] for a complete list of all models.

Here we define terminologies, similar to thesauri, as sets of concepts. They form a DAG with child and parent concepts. Additionally, we have an associative relation which identifies related concepts. Each concept has at least one label, one of which is used as the preferred identifier, while all others are synonyms. To sum up, using ontologies or terminologies for NER has several advantages. In particular, it leads to a hierarchy within these ontologies and orders named entities according to these relations. However, we must consider not only ontologies and terminologies, but also controlled vocabularies such as MeSH. Here, we have additional annotations with different provenances: some derived as keywords delivered with the data and some obtained from NER.
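A terminology concept under this definition could be sketched in Cypher as follows; the property and relationship names (synonyms, isChildOf, related) are illustrative assumptions:

    // A concept with preferred label and synonyms, placed in the DAG (sketch).
    CREATE (p:Entity {source: "MESH", preferredLabel: "Dementia"})
    CREATE (c:Entity {source: "MESH", preferredLabel: "Alzheimer Disease",
                      synonyms: ["AD", "Alzheimer's disease"]})
    CREATE (c)-[:isChildOf]->(p)
    CREATE (c)-[:related]->(:Entity {source: "MESH",
                                     preferredLabel: "Amyloid beta-Peptides"})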

Other examples of terminologies are the Alzheimer's Disease Ontology (ADO, see [29]) and the Neuro-Image Terminology (NIFT, see [30]), both coming with their own hierarchies. The process of NER leads to another context relation between documents and entities. Since not all ontologies or terminologies are described using the RDF or OBO format, we have to add data from multiple external sources via a central tool capable of providing all the necessary ontology data. We use a semantic lookup platform containing OLS and OxO (see [31]).

Additional context data useful for knowledge extraction are citations, i.e. edges between two document nodes in $D$. Data from PMC already contains citation data with unique identifiers (PubMed IDs). Some data is available from WikiData, see [32] and [33]. Other sources are rare, but exist, see [34]. Especially for PubMed, a lot of research addresses this difficult topic, see for example [35].

Figure 8: This figure is an illustration of biological knowledge within the context graph. The document node (purple) has several gray annotation nodes which come from different terminologies found with NER. The relation extraction task found the relations "Levomilnacipran" inhibits "BACE1", "BACE1" improves "Neuroprotection" and "BACE1" improves "Memory". These relations are illustrated with red edges. Since the document describes a clinical trial, this is context for the relations as well. All other context is illustrated by colored sets, defining subgraphs.
Figure 9: This figure is an excerpt illustrating a single entity (MESHD:Alzheimers) within the context graph. The node (gray) has several gray annotation nodes, green context nodes, documents as references (purple) and biological events (red). Whereas Figure 8 shows a small example, we can see here that the knowledge graph can become very complex.

Furthermore, we can consider the relational information between entities. For example, BEL statements naturally form knowledge graphs by way of semantic triples that consist of concepts, functions and relationships [9]. To tackle such complex tasks, researchers constantly gather and accumulate new knowledge by performing experiments and by studying scientific literature that includes the results of experiments performed by other researchers. Existing solutions are primarily based on methods of biomedical text mining, which extract key information from unstructured biomedical text (such as publications, patents, and electronic health records). Several information systems have been introduced to support curators in generating these networks, such as BELIEF, a workflow that builds BEL-like statements semi-automatically by retrieving publications from a relevant corpus generator system called SCAIView, see [36] and [37].

Figure 8 illustrates a few basic relations, such as "Levomilnacipran" inhibits "BACE1", "BACE1" improves "Neuroprotection" and "BACE1" improves "Memory", all of which were found using relation extraction methods on named entities in a document. It is important to note that context for a document can also be context for the derived relations and vice versa. If an entity that forms part of a relation has synonyms, or is found within another document with a different context, this may lead to a deeper understanding of the statement. An example of this interconnectedness is shown in Figure 9. Due to their complexity, the resulting graph structures become difficult to parse and interpret manually, thus requiring algorithmic approaches for proper analysis.
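In the graph, such a statement can be sketched as a hasRelation edge whose context property carries the provenance document, matching the pattern used by query 1 in Section 2 (the document ID here is a placeholder):

    // "Levomilnacipran inhibits BACE1", extracted from one document (sketch).
    MATCH (s:Entity {preferredLabel: "Levomilnacipran"}),
          (t:Entity {preferredLabel: "BACE1"}),
          (doc:Document {documentID: "PMID:0"})
    CREATE (s)-[:hasRelation {function: "inhibits",
                              context: doc.documentID}]->(t)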

2 Results

2.1 Real world use cases for testing

We collected 27 real world questions and queries from scientific projects. They are of varying complexity (Table 1) and can be used to test the biomedical knowledge graph. Some of them use local structures, for example conjunctive regular path queries (CRPQ, see [38]), which combine subgraph patterns with queries regarding paths (problems 1, 3, 5, 7, 9, 10, 13, 15, 20), or the extended version ECRPQ (8, 18, 22). Other local structures include regular path queries (RPQ, see [39]) (problems 2, 11, 14, 16, 17, 19, 21) and finding shortest paths (problems 4, 12). Additional queries use global structures such as centrality, including PageRank (6, 23), betweenness centrality (25) or degree centrality (26). Another global problem is community detection, for example Louvain modularity (24) or connected components (27).

# Query Input Example Output
1 Which author was the first to state that {Entity1} has an enhancing effect on {Entity2}? APP, gamma Secretase Complex Author and document title
2 Which genes {Entity1} play a role in two diseases {Entity2}? Entity.source = HGNC, MESH subgraph of genes with 2 diseases
3 In which journal was it published that {Entity1} has an enhancing effect on {Entity2}? APP, gamma Secretase Complex Document and Journal
4 What is the shortest way between {Entity1} and {Entity2} and what is on that way? axonal transport, LRP3 path between nodes
5 Where was it published that {Entity1} has an enhancing effect on {Entity2} and what documents cite this? APP, gamma Secretase Complex List of publishing and citing documents
6 What are the most important entities in context of {Entity1} disease? Alzheimer’s Page Rank of neighboring entities
7 Which authors publish in the same journal on the topic {Entity1} and have not yet published together? Alzheimer’s disease List of author couples
8 Find a path of biological entities that connects {Entity1} with {Entity2} Alzheimer’s disease, ACHE path of entities
9 Are there authors within the same affiliation who make contradictory statements regarding protein {Entity1} and protein {Entity2}? apoptotic process, SLC25A21 number of statements for both variants
10 Do the data in the literature correlate with the concomitant diseases for illness {Entity1}? So are the genes mentioned in {Entity1} documents also mentioned in {Entity2} documents of the concomitant disease? Alzheimer’s, Down syndrome genes involved in both diseases in the literature
11 Does the function of a gene {Entity} differ in different contexts? IL1B List of all functions in contexts
12 How far apart are {document1} and {document2}? PMID:16160056, PMID:16160050 Shortest path between documents
13 Does the biological process on gene {Entity1} also exist in context of {Entity2}? And what author describes it? APOE, brain outcome graph in context of the brain
14 Are there BEL statements that have no source, so should be checked? - List of relations
15 How many sources are there for the statements of a contradictory BEL statement? hasRelation. function = increases, decreases number of sources for each of the cases
16 Is there also a relation between the documents describing the entities {Entity1} and {Entity2} that matches the relation in a BEL statement with the entities {Entity1} and {Entity2}? APP, Alzheimer document pairs
17 Find the oldest document describing an entity {entity} APP Oldest Document
18 Is a reviewer {Author1} suitable for a proposal with the author {Author} or is there a conflict of interest? Does the reviewer have relationships with the author in the form of joint work or equal affiliation? Ulrich Rothe, A. Castillo Potential Graph between the authors
19 On which topics does the author {Author} write most? Ulrich Rothe List of the most frequent annotations
20 In which other journals could the author {Author} write with his main topics? Which journal in which he has not yet published would suit him from his main topics? Ulrich Rothe List of journals that could fit him
21 Which Affiliation has the most publications on the topic {Entity} in the Journal {Journal}? D008358, Biotechnology letters Affiliation with the highest number of publications
22 From when is the document cited in documents dealing with the subject {Entity}? D017629 publication date of cited document
23 Which document is the most cited paper in connection with {Entity}, of papers that also annotate {Entity}? Determined by PageRank. D017629 Most cited paper-type document
24 Which entities have many relations with {Entity}? Determined by Community Detection. APP surrounding community graph
25 Which author connects the two subject areas {Entity1} and {Entity2} most strongly? Alzheimer Disease, Parkinson Author with highest betweenness centrality
26 Which gene {Entity} is the most important? Entity.source = HGNC Entity with highest degree centrality
27 Are there strongly connected components between the entities? Assignment of the entities to cliques
Table 1: Biomedical example queries on knowledge graphs with context data

Because the general subgraph isomorphism problem is known to be NP-complete, while other problems, such as finding shortest paths, are solvable in polynomial time, we expect our queries to require a wide range of runtimes. The queries given in Table 1 are formulated as Cypher queries. Query 2 is a relatively simple query given by match (sickness1:Entity {source: "MESH"}) <-[:hasRelation]- (gene:Entity {source: "HGNC"}) -[:hasRelation]-> (sickness2:Entity {source: "MESH"}) return gene, sickness1, sickness2. However, more complex subgraph patterns can also be formulated, such as query 18, which is given by match p=(author:Author {forename: "Ulrich", surname: "Rothe"})-[:isAuthor]-(doc:Document)-[:isAuthor]-(reviewer:Author {forename: "A", surname: "Castillo"}), p2=(author)-[]-(doc1:Document)-[:hasCitation]-(doc2)-[:isAuthor]-(reviewer), p3=(author)-[]-(a:Affiliation)-[]-(reviewer) return p, p2, p3 limit 10.
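Shortest-path problems such as queries 4 and 12 map to Cypher's built-in shortestPath function; a sketch for query 12 follows, with an upper hop limit added as an assumption to bound the search:

    // Query 12 (sketch): how far apart are two documents?
    MATCH (d1:Document {documentID: "PMID:16160056"}),
          (d2:Document {documentID: "PMID:16160050"})
    MATCH p = shortestPath((d1)-[*..10]-(d2))
    RETURN p, length(p)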

2.2 Storing the Knowledge Graph

Storing all of the data in one graph database without using Redis (Full) requires 58.9 GB of memory, while Poly1 only uses 50.82 GB (Neo4j) and 0.9 GB (Redis). The third system, Poly2, uses 50.74 + 10.2 GB (Neo4j) and 1.4 GB (Redis).

The import data amounts to about 50 GB and generates nearly 160M nodes with relations. These nodes are merged by Neo4j into unique nodes. In the end, we obtained 71M unique nodes and 850M relationships. Given the input data, we created ~30M nodes describing documents from PubMed and PMC, about 17M author nodes, 21M affiliations and around 5M entities. The graph contains 554M annotation relationships and 850M relationships in total.

2.3 Polyglot persistence systems

Figure 10 shows the runtime results of the 27 real world queries described in Table 1.

Figure 10: Runtime results of 27 real world queries. The queries are grouped in four diagrams with similar runtimes for a better overview. We see that the execution time of most queries is improved with Poly1 and Poly2. In the best case the improvement is 43%.

We see that the execution of some queries required a large amount of time, with the longest query taking more than one hour. Interestingly, the execution time for most of the queries improved when run using either the Poly1 or Poly2 implementation. Seven of the 27 queries did not terminate.

For most queries, the polyglot persistence systems achieve better results, in the best case an improvement of up to 43%. However, there are differences between the systems for some of the queries: Poly1 can sometimes yield better results than Poly2 and vice versa. Contrary to expectations, Full achieved the best query time in only a few cases. The advantage of Poly1 over Poly2 can be explained by the fact that the memory consumption of Poly2 increased significantly due to the conversion from string to integer, which slows down query execution. For the queries in which Poly2 performed better, the queries take advantage of the optimized polyglot data schema despite the higher memory consumption of the database. This is significant, for example, in queries 8 and 17.

The differences in the observed running times become clearer when analyzing the percent change in runtime compared to Full, as shown in Table 2. For both systems, the average percent decrease in runtime is calculated over all queries in order to compare the two polyglot systems with each other and with Full.

Query Poly1 Poly2 Problem
14 26.8% 25.8% RPQ
27 23.8% -2.6% Connected Components
11 22.5% 17.7% RPQ
8 18.2% 43.3% ECRPQ
2 11.5% 22.9% RPQ
15 10.3% 4.5% CRPQ
20 9.2% 2.5% CRPQ
23 7.7% 6.8% Page Rank
26 6.8% 2.4% Degree Centrality
16 6.6% 5.1% RPQ
5 5.4% 4.6% CRPQ
22 3.8% 3.5% ECRPQ
17 3.1% 31.9% RPQ
10 -0.2% 7.0% CRPQ
3 -2.3% 7.9% CRPQ
19 -2.3% 8.0% RPQ
1 -2.5% 4.9% CRPQ
13 -4.1% 4.8% CRPQ
21 -11.0% -0.3% RPQ
18 -15.7% -15.1% ECRPQ
Average 5.8% 9.8%
Table 2: Decrease in the runtimes of Poly1 and Poly2 compared to Full in %, sorted by Poly1 in decreasing order.

There is no information for queries 4, 6, 7, 9, 12, 24 and 25, for which no runtime could be determined, as they did not run to completion on any of the systems. These queries are primarily graph algorithms categorized as local and global structures in the schema discussed earlier.

The results do not show a clear trend for any of the categories discussed. The RPQ class improves on average by 15.8% and the ECRPQ class by 10.5% (both for Poly2). The classes CRPQ, Page Rank, Degree Centrality and Connected Components remain in the single-digit percentage range. In general, the subcategories of local structures seem to benefit more from the polyglot persistence designs. In addition, queries that only need to consider a few node and edge types (often Entity and hasRelation) tend to experience a greater decrease in runtime than queries with many node and edge types.

2.4 Graph Queries

Here, we present the results of some of the 27 queries introduced above. Query 1 returns a subgraph: Which author was the first to state that {Entity1} has an enhancing effect on {Entity2}? We may execute this query using match (n:Entity {preferredLabel: "APP"})-[r:hasRelation {function: "increases"}]->(m:Entity {preferredLabel: "gamma Secretase Complex"}), (doc:Document {documentID: r.context})<-[r2:isAuthor]-(author:Author) return doc, author order by doc.publicationDate limit 1.

Figure 11: A result subgraph example for query 1: Which author was the first to state that {Entity1} has an enhancing effect on {Entity2}? On the left, the first author (blue node) and the publication (orange); on the right, the result shows the first 10 authors (blue) with their publications on this topic (orange).

A result graph can be found in Figure 11. On the left, the isAuthor relation with the first author can be found. On the right, the limit parameter was changed to 10 and thus the result graph shows the first 10 publications and authors.

Query 2 returns a subgraph: Which genes {Entity1} play a role in two diseases {Entity2}? We may execute this query using match (sickness1:Entity {source: "MESH", preferredLabel: "Alzheimer Disease"}) <-[:hasRelation]- (gene:Entity {source: "HGNC"}) -[:hasRelation]-> (sickness2:Entity {source: "MESH", preferredLabel: "Down Syndrome"}) return gene, sickness1, sickness2 limit 25. One example output graph can be found in Figure 12. Due to the limitation of our model to Alzheimer's disease, it is not surprising to find only one gene – APP. If we remove the limitation to two distinct diseases, the database returns a larger graph, see Figure 13. Here we see that we may need to utilize inherent ontology information to filter those nodes that cover diseases. But we also see a second gene – TNF – associated with other diseases like diabetes.

Figure 12: A result subgraph example for query 2: Which genes {Entity1} play a role in two diseases {Entity2}?
Figure 13: A result subgraph example for query 2 without limitation to two distinct diseases: Which genes {Entity1} play a role in two diseases {Entity2}?

Other queries return not a subgraph but values. For example, query 26 may use built-in Cypher procedures: CALL algo.degree('MATCH (e:Entity {source: "HGNC"}) RETURN id(e) as id', 'MATCH (e1:Entity) <-[:hasAnnotation]- (d1:Document) RETURN id(e1) as source, id(d1) as target', {graph: 'cypher', write: false}). This query answers the question "Which gene {Entity} is the most important?" by returning the entity with the highest degree centrality.

3 Discussion and Conclusion

Here we introduce the graph-theoretic foundation for a general context concept within semantic networks and show a proof-of-concept based on biomedical literature and text mining. Our test system contains a knowledge graph derived from PubMed data which is then enriched with text mining data and domain specific language data coming from BEL. This dense graph has more than 71M nodes and 850M relationships. We discuss the impact of this novel approach using 27 real world use cases and graph queries.

This proof-of-concept of a biomedical knowledge graph combines several sources of data by relating their contextual data to one another. We processed data from PubMed and PMC, which generated more than 30M document and metadata nodes. This initial knowledge graph was extended with results from the text mining and NLP tools already included in our software, as well as with named entities from ontologies also stored in SCAIView. In addition, we added data generated by domain specific languages such as BEL. Thus, we were able to assess both small datasets and large collections of data.

There were several issues with data integration and missing data. Initially, we tried to integrate publication data from several external sources, but some publishers used OCR technologies to convert PDF documents into XML structures. These proved problematic to process, as some fields were either missing or incorrectly filled.

We have not yet solved the issue of author and affiliation disambiguation, which remains a widely discussed topic, see [40]. An interesting novel approach – also based on Neo4j database technology – was introduced in [41]. Franzoni et al. used topological and semantic structures within the graph for author disambiguation. Taking this into consideration, we plan to integrate such state-of-the-art technologies into our software in the future.

Furthermore, performance remains a major problem for some semantic queries due to the massive latency of requests. Although the software is integrated into our microservice architecture, see [19], some queries did not run to completion. Here, we attempted to improve our initial setup by establishing a polyglot persistence architecture in the database backend [7]. The results generated through this modification are very encouraging, and we have outlined additional topics for further research.

Storing and querying a giant knowledge graph as a labeled property graph is still a technological challenge. Here we demonstrated how our data model is able to support the understanding and interpretation of biomedical data. We presented several real world use cases that utilize our massive knowledge graph derived from PubMed data and enriched with additional contextual data. Finally, we showed a working example in the context of biologically relevant information using SCAIView.

Appendix

Funding

This work was supported by the Fraunhofer Society under the MAVO Project.

Authors' contributions

This new approach goes back to an initial idea of JD and was developed by JD, AS and BS. The datasets for evaluation were produced by MJ. The manuscript was written by JD, AS and BS. All authors read and approved the final manuscript.

Acknowledgements

Valuable suggestions during the development of this method were provided by Jürgen Klein and Vanessa Lage-Rupprecht. We thank Tim Steinbach for providing some illustrations to this work. In addition we thank Alexander Esser for his input on the initial research paper. We thank Martin Hofmann-Apitius for supporting this research activity and his valuable input.

References

  • [1] Desai, M., G Mehta, R., P Rana, D.: Issues and challenges in big graph modelling for smart city: An extensive survey. International Journal of Computational Intelligence & IoT 1(1) (2018)
  • [2] Dumontier, M., Callahan, A., Cruz-Toledo, J., Ansell, P., Emonet, V., Belleau, F., Droit, A.: Bio2rdf release 3: a larger connected network of linked data for the life sciences. In: Proceedings of the 2014 International Conference on Posters & Demonstrations Track, vol. 1272, pp. 401–404 (2014)
  • [3] Callahan, A., Cruz-Toledo, J., Ansell, P., Dumontier, M.: Bio2rdf release 2: improved coverage, interoperability and provenance of life science linked data. In: Extended Semantic Web Conference, pp. 200–212 (2013). Springer
  • [4] Li, S., Xin, L.: Research on integration and sharing of scientific data based on linked data——a case study of bio2rdf. Research on Library Science 21 (2014)
  • [5] Natsiavas, P., Koutkias, V., Maglaveras, N.: Exploring the capacity of open, linked data sources to assess adverse drug reaction signals. In: SWAT4LS, pp. 224–226 (2015)
  • [6] Aggarwal, C.C., Zhai, C.: An introduction to text mining. In: Mining Text Data, pp. 1–10. Springer (2012)
  • [7] Dörpinghaus, J., Stefan, A.: Knowledge extraction and applications utilizing context data in knowledge graphs. In: 2019 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 265–272 (2019). IEEE
  • [8] Hanisch, D., Fundel, K., Mevissen, H.-T., Zimmer, R., Fluck, J.: ProMiner: rule-based protein and gene entity recognition. BMC bioinformatics 6 Suppl 1, 14 (2005)
  • [9] Fluck, J., Klenner, A., Madan, S., Ansari, S., Bobic, T., Hoeng, J., Hofmann-Apitius, M., Peitsch, M.: Bel networks derived from qualitative translations of bionlp shared task annotations. In: Proceedings of the 2013 Workshop on Biomedical Natural Language Processing, pp. 80–88 (2013)
  • [10] Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al.: Gene ontology: tool for the unification of biology. Nature genetics 25(1), 25 (2000)
  • [11] Wishart, D.S., Feunang, Y.D., Guo, A.C., Lo, E.J., Marcu, A., Grant, J.R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z., et al.: Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research 46(D1), 1074–1082 (2017)
  • [12] Khan, K., Benfenati, E., Roy, K.: Consensus qsar modeling of toxicity of pharmaceuticals to different aquatic organisms: Ranking and prioritization of the drugbank database compounds. Ecotoxicology and environmental safety 168, 287–297 (2019)
  • [13] Hey, J.: The data, information, knowledge, wisdom chain: the metaphorical link. Intergovernmental Oceanographic Commission 26, 1–18 (2004)
  • [14] Zeleny, M.: Management support systems: Towards integrated knowledge management. Human systems management 7(1), 59–70 (1987)
  • [15] Ackoff, R.L.: From data to wisdom. Journal of applied systems analysis 16(1), 3–9 (1989)
  • [16] Rowley, J.: The wisdom hierarchy: representations of the dikw hierarchy. Journal of Information Science 33(2), 163–180 (2007)
  • [17] Dörpinghaus, J., Jacobs, M.: Semantic knowledge graph embeddings for biomedical research: Data integration using linked open data. Posters and Demo Track of the 15th International Conference on Semantic Systems. (Poster and Demo Track at SEMANTiCS 2019) (2451), 46–50 (2019)
  • [18] Dörpinghaus, J., Darms, J., Jacobs, M.: What was the question? a systematization of information retrieval and nlp problems. In: 2018 Federated Conference on Computer Science and Information Systems (FedCSIS) (2018). IEEE
  • [19] Dörpinghaus, J., Klein, J., Darms, J., Madan, S., Jacobs, M.: Scaiview – a semantic search engine for biomedical research utilizing a microservice architecture. In: Proceedings of the Posters and Demos Track of the 14th International Conference on Semantic Systems - SEMANTiCS2018 (2018)
  • [20] Webber, J., Eifrem, E.: Graph Databases, (2015)
  • [21] Rogers, F.B.: Medical subject headings. Bulletin of the Medical Library Association 51, 114–116 (1963)
  • [22] Yang, H., Lee, H.: Research trend visualization by mesh terms from pubmed. International journal of environmental research and public health 15(6), 1113 (2018)
  • [23] Cyganiak, R., Wood, D., Lanthaler, M.: RDF 1.1 concepts and abstract syntax. W3C recommendation, W3C (February 2014). http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
  • [24] Patel-Schneider, P., Rudolph, S., Krötzsch, M., Hitzler, P., Parsia, B.: OWL 2 web ontology language primer (second edition). Technical report, W3C (December 2012). http://www.w3.org/TR/2012/REC-owl2-primer-20121211/
  • [25] Summers, E., Isaac, A.: SKOS simple knowledge organization system primer. W3C note, W3C (August 2009). http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/
  • [26] Zeng, M., Hlava, M., Qin, J., Hodge, G., Bedford, D.: Knowledge organization systems (kos) standards. Proceedings of the Association for Information Science and Technology 44(1), 1–3 (2007)
  • [27] Guidelines for the construction, format, and management of monolingual controlled vocabularies. Standard, National Information Standards Organization, Baltimore, Maryland, U.S.A. (2005)
  • [28] Zeng, M.: Knowledge organization systems (kos) 35, 160–182 (2008)
  • [29] Malhotra, A., Younesi, E., Gündel, M., Müller, B., Heneka, M.T., Hofmann-Apitius, M.: Ado: A disease ontology representing the domain knowledge specific to alzheimer’s disease. Alzheimer’s & Dementia 10(2), 238–246 (2014)
  • [30] Iyappan, A., Younesi, E., Redolfi, A., Vrooman, H., Khanna, S., Frisoni, G.B., Hofmann-Apitius, M.: Neuroimaging feature terminology: A controlled terminology for the annotation of brain imaging features. Journal of Alzheimer’s Disease 59(4), 1153–1169 (2017)
  • [31] Madan, S., Fiosins, M., Bonn, S., Fluck, J.: A Semantic Data Integration Methodology for Translational Neurodegenerative Disease Research. (2018). doi:10.6084/m9.figshare.7339244.v1
  • [32] Voß, J.: Classification of knowledge organization systems with wikidata. In: NKOS@ TPDL, pp. 15–22 (2016)
  • [33] Vrandečić, D.: Toward an abstract wikipedia. In: Ortiz, M., Schneider, T. (eds.) 31st International Workshop on Description Logics (DL). CEUR Workshop Proceedings, Aachen (2018)
  • [34] Oßwald, A., Schöpfel, J., Jacquemin, B.: Continuing professional education in open access. a french-german survey. LIBER Quarterly. The journal of the Association of European Research Libraries 26(2), 43–66 (2015)
  • [35] Volanakis, A., Krawczyk, K.: Sciride finder: a citation-based paradigm in biomedical literature search. Scientific reports 8(1), 6193 (2018)
  • [36] Madan, S., Hodapp, S., Senger, P., Ansari, S., Szostak, J., Hoeng, J., Peitsch, M., Fluck, J.: The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016 (2016)
  • [37] Madan, S., Szostak, J., Dörpinghaus, J., Hoeng, J., Fluck, J.: Overview of BEL Track: Extraction of Complex Relationships and their Conversion to BEL. Proceedings of the BioCreative VI Workshop (2017)
  • [38] Wood, P.T.: Query Languages for Graph Databases. SIGMOD Rec. 41(1), 50–60 (2012). doi:10.1145/2206869.2206879
  • [39] Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J., Vrgoč, D.: Foundations of Modern Query Languages for Graph Databases. ACM Comput. Surv. 50(5), 68–16840 (2017). doi:10.1145/3104031
  • [40] Kim, J.: Correction to: Evaluating author name disambiguation for digital libraries: a case of dblp. Scientometrics 118(1), 383–383 (2019)
  • [41] Franzoni, V., Lepri, M., Milani, A.: Topological and semantic graph-based author disambiguation on dblp data in neo4j. arXiv preprint arXiv:1901.08977 (2019)
  • [42] Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)
  • [43] Cai, D., Wu, G.: Content-aware attributed entity embedding for synonymous named entity discovery. Neurocomputing 329, 237–247 (2019)
  • [44] Prajapati, P., Sivakumar, P.: Context dependency relation extraction using modified evolutionary algorithm based on web mining. In: Emerging Technologies in Data Mining and Information Security, pp. 259–267. Springer, Göttingen (2019)

  • [45] Cook, S.A.: The complexity of theorem-proving procedures. In: Proceedings of the Third Annual ACM Symposium on Theory of Computing, pp. 151–158 (1971). ACM

  • [46] Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L.B., Bourne, P.E., et al.: The fair guiding principles for scientific data management and stewardship. Scientific data 3 (2016)