Generating Explanations for Biomedical Queries

09/24/2013
by Esra Erdem, et al.
Sabancı University

We introduce novel mathematical models and algorithms to generate (shortest or k different) explanations for biomedical queries, using answer set programming. We implement these algorithms and integrate them in BioQuery-ASP. We illustrate the usefulness of these methods with some complex biomedical queries related to drug discovery, over the biomedical knowledge resources PharmGKB, DrugBank, BioGRID, CTD, SIDER, Disease Ontology and Orphadata. To appear in Theory and Practice of Logic Programming (TPLP).



1 Introduction

Recent advances in health and life sciences have led to the generation of a large amount of biomedical data, represented in various biomedical databases and ontologies. That these databases and ontologies are represented in different formats and are constructed and maintained independently of each other at different locations has brought about many challenges for answering complex biomedical queries that require the integration of knowledge represented in these ontologies and databases. One challenge for the users is to be able to express such a biomedical query in natural language, and to obtain its answers in an understandable form. Another challenge is to extract relevant knowledge from different knowledge resources and integrate it appropriately, using also auxiliary definitions, such as chains of gene-gene interactions, cliques of genes based on gene-gene relations, or similarity/diversity of genes/drugs. Furthermore, once an answer is found for a complex query, the experts may need further explanations about the answer.

Table 1 displays a list of complex biomedical queries that are important from the point of view of drug discovery. In the queries, drug-drug interactions denote negative interactions among drugs, and gene-gene interactions denote both negative and positive interactions among genes. Consider, for instance, the query Q6. Synthesizing new molecules by changing substituents of a parent compound may lead to different biochemical and physiological effects, and each trial may lead to different indications. Such studies are important for the rapid discovery of new molecules. For example, while developing the drug Lovastatin (a member of the drug class of Hmg-coa reductase inhibitors, used for lowering cholesterol) from Aspergillus terreus (a sort of fungus) in 1979, scientists at Merck derived a new molecule named Simvastatin that belongs to the same drug category (a hypolipidemic drug used to control elevated cholesterol) and targets the same gene. Therefore, identifying genes targeted by a group of drugs automatically, by means of queries like Q6, may be useful for experts.

Once an answer to a query is found, the experts may ask for an explanation to have a better understanding. For instance, an answer for the query Q3 in Table 1 is “ADRB1”. A shortest explanation for this answer is as follows:

  • The drug Epinephrine targets the gene ADRB1 according to CTD.
    The gene DLG4 interacts with the gene ADRB1 according to BioGRID.

An answer for the query Q8 is “CASK”. A shortest explanation for this answer is as follows:

The distance of the gene CASK from the start gene is 2.

    The gene CASK interacts with the gene DLG4 according to BioGRID.

    The distance of the gene DLG4 from the start gene is 1.

        The gene DLG4 interacts with the gene ADRB1 according to BioGRID.

        ADRB1 is the start gene.

(Statements with more indentation provide explanations for statements with less indentation.)

  • Q1: What are the drugs that treat the disease Asthma and that target the gene ADRB1?
  • Q2: What are the side effects of the drugs that treat the disease Asthma and that target the gene ADRB1?
  • Q3: What are the genes that are targeted by the drug Epinephrine and that interact with the gene DLG4?
  • Q4: What are the genes that interact with at least … genes and that are targeted by the drug Epinephrine?
  • Q5: What are the drugs that treat the disease Asthma or that react with the drug Epinephrine?
  • Q6: What are the genes that are targeted by all the drugs that belong to the category Hmg-coa reductase inhibitors?
  • Q7: What are the cliques of 5 genes that contain the gene DLG4?
  • Q8: What are the genes that are related to the gene ADRB1 via a gene-gene interaction chain of length at most …?
  • Q9: What are the most similar genes that are targeted by the drug Epinephrine?
  • Q10: What are the genes that are related to the gene DLG4 via a gene-gene interaction chain of length at most … and that are targeted by the drugs that belong to the category Hmg-coa reductase inhibitors?
  • Q11: What are the drugs that treat the disease Depression and that do not target the gene ACYP1?
  • Q12: What are the symptoms of diseases that are treated by the drug Triadimefon?
  • Q13: What are the most similar drugs that target the gene DLG4?
  • Q14: What are the closest drugs to the drug Epinephrine?

Table 1: A list of complex biomedical queries.

To address the first two challenges described above (i.e., representing complex queries in natural language and finding answers to queries efficiently), novel methods and a software system, called BioQuery-ASP [Erdem et al. (2011)] (Figure 1), have been developed using Answer Set Programming (ASP) [Marek and Truszczyński (1999), Niemelä (1999), Lifschitz (2002), Baral (2003), Lifschitz (2008), Brewka et al. (2011)]:

  • Erdem and Yeniterzi [Erdem and Yeniterzi (2009)] developed a controlled natural language, BioQuery-CNL, for expressing biomedical queries related to drug discovery. For instance, queries Q1–Q10 in Table 1 are in this language. Recently, this language has been extended (as BioQuery-CNL*) to cover queries Q11–Q13 [Oztok (2012)]. Algorithms have also been introduced to translate a given query in BioQuery-CNL (resp. BioQuery-CNL*) into a program in ASP.

  • Bodenreider et al. [Bodenreider et al. (2008)] introduced methods to extract biomedical information from various knowledge resources and integrate them by a rule layer. This rule layer not only integrates those knowledge resources but also provides definitions of auxiliary concepts.

  • Erdem et al. [Erdem et al. (2011)] have introduced an algorithm for query answering by identifying the relevant parts of the rule layer and the knowledge resources with respect to a given query.

The details of representing biomedical queries in natural language and answering them using ASP are explained in a companion article. The focus of this article is the last challenge: generating explanations for biomedical queries.

Figure 1: System overview of BioQuery-ASP.

Most of the existing biomedical querying systems (e.g., web services built over the available knowledge resources) support keyword search but not complex queries like the ones in Table 1. None of the existing systems provides informative explanations about the answers; at best, they point to related web pages of the knowledge resources available online.

The contributions of this article can be summarized as follows.

  • We have formally defined “explanations” in ASP, utilizing properties of programs and graphs. We have also defined variations of explanations, such as “shortest explanations” and “k different explanations”.

  • We have introduced novel generic algorithms to generate explanations for biomedical queries. These algorithms can compute shortest or k different explanations. We have analyzed the termination, soundness, and complexity of these algorithms.

  • We have developed a computational tool, called ExpGen-ASP, that implements these explanation generation algorithms.

  • We have shown the applicability of our methods by generating explanations for answers of complex biomedical queries related to drug discovery.

  • We have embedded ExpGen-ASP into BioQuery-ASP so that the experts can obtain explanations regarding the answers of biomedical queries, in a natural language.

The rest of the article is organized as follows. In Section 2, we provide a summary of Answer Set Programming. Next, in Section 3, we give an overview of BioQuery-ASP, in particular, the earlier work done on answering biomedical queries in ASP. Then, in Sections 4–6, we provide some definitions and algorithms related to generating shortest or k different explanations for an answer, also in ASP. Next, Section 7 illustrates the usefulness of these algorithms on some complex queries over the biomedical knowledge resources PharmGKB [McDonagh et al. (2011)] (http://www.pharmgkb.org/), DrugBank [Knox et al. (2010)] (http://www.drugbank.ca/), BioGRID [Stark et al. (2006)] (http://thebiogrid.org/), CTD [Davis et al. (2011)] (http://ctd.mdibl.org/), SIDER [Kuhn et al. (2010)] (http://sideeffects.embl.de/), Disease Ontology [Schriml et al. (2012)] (http://disease-ontology.org) and Orphadata (http://www.orphadata.org). In Sections 8 and 9, we discuss how to present explanations to the user in natural language, and the embedding of these algorithms in BioQuery-ASP. In Section 10, we provide a detailed analysis of the related work on “justifications” [Pontelli et al. (2009)] in comparison to explanations; and in Section 11, we briefly discuss other related work. We conclude in Section 12 by summarizing our contributions and pointing out some possible future work. Proofs are provided in the online appendix of the paper.

2 Answer Set Programming

Answer Set Programming (ASP) [Marek and Truszczyński (1999), Niemelä (1999), Lifschitz (2002), Baral (2003), Lifschitz (2008), Brewka et al. (2011)] is a declarative programming paradigm oriented towards solving combinatorial search problems as well as knowledge-intensive problems. The idea is to represent a problem as a “program” whose models (called “answer sets” [Gelfond and Lifschitz (1988), Gelfond and Lifschitz (1991)]) correspond to the solutions. The answer sets for a given program can then be computed by specialized systems called answer set solvers. ASP has a high-level representation language that allows recursive definitions, aggregates, weight constraints, optimization statements, and default negation.

ASP also provides efficient solvers, such as clasp [Gebser et al. (2007)]. Due to the continuous improvement of ASP solvers and the highly expressive representation language of ASP, which is supported by a strong theoretical foundation resulting from years of intensive research, ASP has been applied fruitfully to a wide range of areas. Here are, for instance, three applications of ASP used in industry:

  • Decision Support Systems: An ASP-based system was developed to help flight controllers of the Space Shuttle solve some planning and diagnostic tasks [Nogueira et al. (2001)] (used by United Space Alliance).

  • Automated Product Configuration: A web-based commercial system uses an ASP-based product configuration technology [Tiihonen et al. (2003)] (used by Variantum Oy).

  • Workforce Management: An ASP-based system was developed to build teams of employees to handle incoming ships by taking into account a variety of requirements, e.g., skills, fairness, regulations [Ricca et al. (2012)] (used by the Gioia Tauro seaport).

Let us briefly explain the syntax and semantics of ASP programs and describe how a computational problem can be solved in ASP.

2.1 Programs

Syntax

The input language of ASP is built from three sets of symbols, namely constant symbols, predicate symbols, and variable symbols, where the intersection of the constant symbols and the variable symbols is empty. The basic elements of ASP programs are atoms. An atom is an expression of the form p(t_1, ..., t_n), where p is a predicate symbol and each term t_i (1 ≤ i ≤ n) is either a constant or a variable. A literal is either an atom A or its negated form not A.

An ASP program is a finite set of rules of the form:

A_0 ← A_1, ..., A_m, not A_{m+1}, ..., not A_n     (1)

where 0 ≤ m ≤ n, each A_i (1 ≤ i ≤ n) is an atom, and A_0 is an atom or ⊥.

For a rule r of the form (1), A_0 is called the head of the rule and denoted by Head(r). The conjunction of the literals A_1, ..., A_m, not A_{m+1}, ..., not A_n is called the body of r. The set {A_1, ..., A_m} of atoms (called the positive part of the body) is denoted by B⁺(r), the set {A_{m+1}, ..., A_n} of atoms (called the negative part of the body) is denoted by B⁻(r), and the set of all the atoms in the body is denoted by B(r) = B⁺(r) ∪ B⁻(r).

We say that a rule r is a fact if n = 0, and we usually omit the ← sign. Furthermore, we say that a rule r is a constraint if the head of r is ⊥, and we usually omit the ⊥ sign.

Semantics (Answer Sets)

Answer sets of a program are defined over ground programs. We call an atom, rule, or program ground if it does not contain any variables. Given a program Π, U_Π denotes the set of all the constants appearing in Π, and A_Π denotes the set of all the ground atoms that can be constructed from the atoms in Π with the constants in U_Π. Also, ground(Π) denotes the set of all the ground rules obtained by substituting all variables in the rules of Π with constants in U_Π in all possible ways.

A subset X of A_Π is called an interpretation for Π. A ground atom A is true with respect to an interpretation X if A ∈ X; otherwise, it is false. Similarly, a set of atoms is true (resp., false) with respect to X if each atom in it is true (resp., false) with respect to X. An interpretation X satisfies a ground rule r if Head(r) is true with respect to X whenever B⁺(r) is true and B⁻(r) is false with respect to X. An interpretation X is called a model of a program Π if it satisfies all the rules in Π.

The reduct Π^X of a program Π with respect to an interpretation X is defined as follows:

Π^X = { Head(r) ← B⁺(r) : r ∈ Π, B⁻(r) ∩ X = ∅ }.

An interpretation X is an answer set for a program Π if it is a subset-minimal model for Π^X; AS(Π) denotes the set of all the answer sets of a program Π.

For example, consider the following program Π:

p ← not q
r ← p     (2)

and take the interpretation X = {p, r}. The reduct Π^X is as follows:

p
r ← p     (3)

The interpretation X is a model of the reduct (3). Let us take a strict subset of X, say Y = {p}. Then, the reduct Π^Y is again equal to (3); however, Y does not satisfy (3), and neither does any other strict subset of X. Therefore, X is a subset-minimal model of Π^X; hence an answer set of Π. Note also that X is the only answer set of Π.

2.2 Generate-And-Test Representation Methodology with Special ASP Constructs

The idea of ASP [Lifschitz (2008)] is to represent a computational problem as a program whose answer sets correspond to the solutions of the problem, and to find the answer sets for that program using an answer set solver.

When we represent a problem in ASP, two kinds of rules play an important role: those that “generate” many answer sets corresponding to “possible solutions”, and those that can be used to “eliminate” the answer sets that do not correspond to solutions. The rules

p ← not q
q ← not p     (4)

are of the former kind: they generate the answer sets {p} and {q}. Constraints are of the latter kind. For instance, adding the constraint

← p

to program (4) eliminates the answer sets for the program that contain p.

In ASP, we use special constructs of the form

{A_1, ..., A_n}     (5)

(called choice expressions), and of the form

l {A_1, ..., A_n} u     (6)

(called cardinality expressions), where each A_i is an atom and l and u are nonnegative integers denoting the “lower bound” and the “upper bound” [Simons et al. (2002)]. Programs using these constructs can be viewed as abbreviations for normal nested programs defined in [Ferraris and Lifschitz (2005)]. Expression (5) describes subsets of {A_1, ..., A_n}. Such expressions can be used in heads of rules to generate many answer sets. For instance, the answer sets for the program

{p_1, ..., p_n} ←     (7)

are arbitrary subsets of {p_1, ..., p_n}. Expression (6) describes the subsets of the set {A_1, ..., A_n} whose cardinalities are at least l and at most u. Such expressions can be used in constraints to eliminate some answer sets. For instance, adding the constraint

← 2 {p_1, ..., p_n}

to program (7) eliminates the answer sets for (7) whose cardinalities are at least 2. We abbreviate the rules

{A_1, ..., A_n} ← Body
← Body, not l {A_1, ..., A_n} u

by the rule

l {A_1, ..., A_n} u ← Body.

In ASP, there are also special constructs that are useful for optimization problems. For instance, to compute answer sets that contain the maximum number of elements from the set {p_1, ..., p_n}, we can use the following optimization statement:

maximize {p_1, ..., p_n}.
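To make the optimization construct concrete, here is a minimal sketch in gringo syntax (the predicates p/1 and in/1 are hypothetical, chosen only for illustration):

p(1..3).
% generate: choose an arbitrary subset of the elements
{ in(X) : p(X) }.
% prefer answer sets that contain as many chosen elements as possible
#maximize [ in(X) : p(X) ].

Here each chosen element contributes weight 1 to the objective, so clasp reports answer sets containing all three elements as optimal.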

2.3 Presenting Programs to Answer Set Solvers

Once we represent a computational problem as a program whose answer sets correspond to the solutions of the problem, we can use an answer set solver to compute the solutions of the problem. To present a program to an answer set solver, like clasp, we need to make some syntactic modifications.

Recall that answer sets for a program are defined over ground programs. Thus, the input of ASP solvers should be ground instantiations of the programs. For that, programs go through a “grounding” phase in which the variables in the program (if any) are substituted by constants. For clasp, we use the “grounder” gringo [Gebser et al. (2011)].

Although the syntax of the input language of gringo is somewhat more restricted than the class of programs defined above, it provides a number of useful special constructs. For instance, the head of a rule can be an expression of one of the forms (5) or (6). The body can also contain cardinality expressions. In the input language of gringo, :- stands for ←, and each rule is followed by a period. For facts, :- is dropped. For instance, the rule

1 {p, q, r} 1 ←

can be presented to gringo as follows:

1{p,q,r}1.

Variables in a program are represented by strings whose initial letters are capitalized. The constants and predicate symbols, on the other hand, start with a lowercase letter. For instance, the program consisting of the rules

p(i) ← not p(i+1)     (1 ≤ i ≤ n)

can be presented to gringo as follows:

index(1..n).
p(I) :- not p(I+1), index(I).

Here, the auxiliary predicate index is a “domain predicate” used to describe the ranges of variables. Variables can also be used “locally” to describe a list of atoms. For instance, the rule

1 {p(1), ..., p(n)} 1 ←

can be expressed in gringo as follows:

index(1..n).
1{p(I) : index(I)}1.

3 Answering Biomedical Queries

We have earlier developed the software system BioQuery-ASP [Erdem et al. (2011)] (see Figure 1) to answer complex queries that require appropriate integration of relevant knowledge from different knowledge resources and auxiliary definitions such as chains of drug-drug interactions, cliques of genes based on gene-gene relations, or similar/diverse genes. As depicted in Figure 1, BioQuery-ASP takes a query in a controlled natural language and transforms it into ASP. Meanwhile, it extracts knowledge from biomedical databases and ontologies, and integrates them in ASP. Afterwards, it computes an answer to the given query using an ASP solver.

Let us give an example to illustrate these stages; the details of representing biomedical queries in natural language and answering them using ASP are explained in a companion article.

First of all, let us mention that knowledge related to drug discovery is extracted from the biomedical databases/ontologies and represented in ASP. If the biomedical ontology is in RDF(S)/OWL then we can extract such knowledge using the ASP solver dlvhex [Eiter et al. (2006)] by making use of external predicates. For instance, consider as an external theory a Drug Ontology described in RDF. All triples from this theory can be exported using the external predicate &rdf:

triple_drug(X,Y,Z) :- &rdf["URI for Drug Ontology"](X,Y,Z).

Then the names of drugs can be extracted by dlvhex using the rule:

drug_name(A) :- triple_drug(_,"drugproperties:name",A).

Some knowledge resources are provided as relational databases, or more often as a set of triples (probably extracted from ontologies in RDF). In such cases, we use short scripts to transform the relations into ASP.
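For instance, a triple stating that the drug Epinephrine targets the gene ADRB1 would become the following ASP fact (the relation drug_gene_ctd is the one used in the rule layer below; the concrete triple is our illustration):

drug_gene_ctd("Epinephrine","ADRB1").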

To relate the knowledge extracted from the biomedical databases or ontologies and also provide auxiliary definitions, a rule layer is constructed in ASP. For instance, drugs targeting genes are described by the relation drug_gene defined in the rule layer as follows:

drug_gene(D,G) :- drug_gene_pharmgkb(D,G).
drug_gene(D,G) :- drug_gene_ctd(D,G).

where drug_gene_pharmgkb and drug_gene_ctd are relations for extracting knowledge from relevant knowledge resources. The auxiliary concept of reachability of a gene from another gene by means of a chain of gene-gene interactions is defined in the rule layer as well:

gene_reachable_from(X,1) :- gene_gene(X,Y), start_gene(Y).
gene_reachable_from(X,N+1) :- gene_gene(X,Z),
   gene_reachable_from(Z,N), 0 < N, N < L,
   max_chain_length(L).
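As a small illustration of these rules, consider the following hypothetical facts (consistent with the explanation for Q8 given in the introduction; in practice gene_gene would itself be derived from gene_gene_biogrid):

start_gene("ADRB1").
max_chain_length(2).
gene_gene("DLG4","ADRB1").
gene_gene("CASK","DLG4").

From these, the first rule derives gene_reachable_from("DLG4",1), and the second rule then derives gene_reachable_from("CASK",2).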

Now, consider, for instance, the query Q11 from Table 1.

  • What are the drugs that treat the disease Depression and that do not target the gene ACYP1?

Queries of this type might be important in terms of drug repurposing [Chong and Sullivan (2007)], which has achieved a number of successes in drug development, including the famous example of Pfizer’s Viagra [Gower (2009)].

This query is then translated into the following program in the language of gringo:

what_be_drugs(DRG) :-  cond1(DRG), cond2(DRG).
cond1(DRG) :- drug_disease(DRG,"Depression").
cond2(DRG) :- drug_name(DRG), not drug_gene(DRG,"ACYP1").
answer_exists :- what_be_drugs(DRG).
:- not answer_exists.

where cond1 and cond2 are invented relations, and drug_name, drug_disease and drug_gene are defined in the rule layer.
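For instance, with the following hypothetical facts (in practice, drug_name and drug_disease are derived from the knowledge resources rather than given directly), the program above derives what_be_drugs("Fluoxetine"): cond1 holds by the second fact, and cond2 holds since there is no fact stating that Fluoxetine targets ACYP1, so the default negation succeeds:

drug_name("Fluoxetine").
drug_disease("Fluoxetine","Depression").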

Once the query and the rule layer are in ASP, the parts of the rule layer that are relevant to the given query are identified by an algorithm [Erdem et al. (2011)]. For some queries, the relevant part of the program is almost 100 times smaller than the whole program (considering the number of ground rules).

Then, given the query as an ASP program and the relevant knowledge as an ASP program, we can find answers to the query by computing an answer set for the union of these two programs using clasp. For the query above an answer computed in this way is “Fluoxetine”.

4 Explaining an Answer for a Query

Once an answer is found for a complex biomedical query, the experts may need informative explanations about the answer, as discussed in the introduction. With this motivation, we study generating explanations for complex biomedical queries. Since the queries, knowledge extracted from databases and ontologies, and the rule layer are in ASP, our studies focus on explanation generation within the context of ASP.

Before we introduce our methods to generate explanations for a given query, let us introduce some definitions regarding explanations in ASP.

Let Π be the relevant part of a ground ASP program with respect to a given biomedical query Q (also a ground ASP program); Π contains rules describing the knowledge extracted from biomedical ontologies and databases, the knowledge integrating them, and the background knowledge. Rules in Π generally do not contain cardinality/choice expressions in the head; therefore, we assume that in Π only the bodies of rules contain cardinality expressions. Let X be an answer set for Π. Let A be an atom in X that characterizes an answer to the query Q. The goal is to find an “explanation” as to why A is computed as an answer to the query Q, i.e., why is A in X? Before we introduce a definition of an explanation, we need the following notations and definitions.

We say that a set X of atoms satisfies a cardinality expression of the form

l {A_1, ..., A_n} u

if the cardinality of X ∩ {A_1, ..., A_n} is within the lower bound l and the upper bound u. Also, X satisfies a set C of cardinality expressions (denoted by X ⊨ C) if X satisfies every element of C.

Let Π be a ground ASP program, r be a rule in Π, A be an atom, and X and E be two sets of atoms. Let C(r) denote the set of cardinality expressions that appear in the body of r. We say that r supports the atom A using atoms in X but not in E (or with respect to X but E) if the following hold:

Head(r) = A,   B⁺(r) ⊆ X \ E,   B⁻(r) ∩ X = ∅,   X ⊨ C(r).     (8)

We denote the set of rules in Π that support A with respect to X but E by Support_Π(A, X, E).
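As a small illustration of this notion (a hypothetical two-rule program, not taken from the biomedical rule layer), consider:

p ← q, not r
q

With X = {p, q} and E = ∅, the first rule supports p with respect to X but ∅, since its positive body {q} is contained in X \ E and its negative body {r} is disjoint from X. With E = {q}, however, the rule no longer supports p.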

We now introduce definitions about explanations in ASP. We first define a generic tree whose vertices are labeled by either atoms or rules.

Definition 1 (Vertex-labeled tree)

A vertex-labeled tree for a program Π and a set X of atoms is a tree (V, E) together with a labeling function ℓ that maps V to X ∪ Π. In this tree, the vertices labeled by an atom (resp., a rule) are called atom vertices (resp., rule vertices).

For a vertex-labeled tree T and a vertex v in T, we introduce the following notations:

  • Anc(v) denotes the set of atoms that are labels of ancestors of v.

  • RuleDesc(v) denotes the set of rule vertices that are descendants of v.

  • Ch(v) denotes the set of children of v.

  • Sib(v) denotes the set of siblings of v.

  • Out(v) denotes the set of out-going edges of v.

  • deg(v) denotes the degree of v and is equal to |Out(v)|.

  • If deg(v) = 0, then v is a leaf vertex.

  • Leaves(T) denotes the set of leaves of T.

  • The root of T is the root of the underlying tree.

  • T is empty if its vertex set is empty.

We now define a specific class of vertex-labeled trees which contains all possible “explanations” for an atom.

Definition 2 (And-or explanation tree)

Let Π be a ground ASP program, X be an answer set for Π, and A be an atom in X. The and-or explanation tree T for A with respect to Π and X is a vertex-labeled tree that satisfies the following:

  • for the root v of the tree, ℓ(v) = A;

  • for every atom vertex v, Ch(v) consists of exactly one rule vertex u with ℓ(u) = r for each rule r in Support_Π(ℓ(v), X, Anc(v) ∪ {ℓ(v)});

  • for every rule vertex v, Ch(v) consists of exactly one atom vertex u with ℓ(u) = B for each atom B in B⁺(ℓ(v));

  • each leaf vertex is a rule vertex.

Let us explain the conditions in Definition 2 in detail.

  1. The root of the and-or explanation tree is labeled by the atom A. Intuitively, the tree contains all possible explanations for A.

  2. For every atom vertex v, there is an out-going edge to a rule vertex u under the following conditions: the rule that labels u supports the atom that labels v, using atoms in X but not any atom that labels an ancestor of u. We want to exclude the atoms labeling ancestors of u to ensure that the height of the and-or explanation tree is finite (e.g., otherwise, due to cyclic dependencies, the tree might be infinite).

  3. For every rule vertex v, there is an out-going edge to an atom vertex u if the atom that labels u is in the positive body of the rule that labels v. In this way, we make sure that every atom in the positive body of the rule that labels v takes part in explaining the head of that rule.

  4. Together with Conditions 2 and 3 above, this condition guarantees that the leaves of the and-or explanation tree are rule vertices that are labeled by facts in the reduct of the given ASP program Π with respect to the given answer set X. Intuitively, this condition expresses that the leaves are self-explanatory.

Example 1

Let Π be the program

p ← q, r
p ← r
q ← r
r

and X = {p, q, r}. The and-or explanation tree for p with respect to Π and X is shown in Figure 2. Intuitively, the and-or explanation tree includes all possible “explanations” for an atom. For instance, according to Figure 2, the atom p has two explanations:

  • One explanation is characterized by the rules that label the vertices in the left subtree of the root: p is in X because the rule

    p ← q, r

    supports p. Moreover, this rule can be “applied to generate p” because q and r, the atoms in its positive body, are in X. Further, q is in X because the rule

    q ← r

    supports q. Further, r is in X because r is supported by the fact

    r

    which is self-explanatory.

  • The other explanation is characterized by the rules that label the vertices in the right subtree of the root: p is in X because the rule

    p ← r

    supports p. Further, this rule can be “applied to generate p” because r is in X. In addition, r is in X because r is supported by the fact

    r

    which is self-explanatory.

Figure 2: The and-or explanation tree for Example 1.
Proposition 1

Let Π be a ground ASP program and X be an answer set for Π. For every atom A in X, the and-or explanation tree for A with respect to Π and X is not empty.

Note that in the and-or explanation tree, atom vertices are the “or” vertices, and rule vertices are the “and” vertices. Then, we can obtain a subtree of the and-or explanation tree that contains an explanation, by visiting only one child of every atom vertex and every child of every rule vertex, starting from the root of the and-or explanation tree. Here is the precise definition of such a subtree, called an explanation tree.

Definition 3 (Explanation tree)

Let Π be a ground ASP program, X be an answer set for Π, A be an atom in X, and T be the and-or explanation tree for A with respect to Π and X. An explanation tree T′ in T is a vertex-labeled tree such that

  • T′ is a subtree of T;

  • the root of T′ is the root of T;

  • for every atom vertex v of T′, |Ch_{T′}(v)| = 1;

  • for every rule vertex v of T′, Ch_{T′}(v) = Ch_T(v).

Example 2

Let T be the and-or explanation tree in Figure 2. Then, Figure 3 illustrates the explanation trees in T. These explanation trees characterize the two explanations for p explained in Example 1.

Figure 3: Explanation trees for Example 2.

After having defined the and-or explanation tree and an explanation tree for an atom, let us now define an explanation for an atom.

Definition 4 (Explanation)

Let Π be a ground ASP program, X be an answer set for Π, and A be an atom in X. A vertex-labeled tree E is an explanation for A with respect to Π and X if there exists an explanation tree T′ in the and-or explanation tree for A with respect to Π and X such that

  • the vertices of E are the rule vertices of T′ (with the same labels);

  • there is an edge (u, v) in E if and only if v is a grandchild of u in T′.

Intuitively, an explanation can be obtained from an explanation tree by “ignoring” its atom vertices.

Example 3

Let Π and X be defined as in Example 1. Then, Figure 4 depicts the two explanations for p with respect to Π and X, described in Example 1.

(a)

(b)
Figure 4: Explanations for Example 3.

So far, we have considered only positive programs in the examples. Our definitions can also be applied to programs that contain negation and aggregates in the bodies of rules.

Example 4

Let Π be the program

p
t
r ← not p
r ← t, s, not q
s ← t, 1 {p, q} 2

and X = {p, t, s, r}. The and-or explanation tree for r with respect to Π and X is shown in Figure 5(a). Here, the rule r ← not p is not included in the tree as p is in X, whereas the rule r ← t, s, not q is in the tree as q is not in X and t and s are in X. Also, the rule s ← t, 1 {p, q} 2 is in the tree as t is in X and the cardinality expression 1 {p, q} 2 is satisfied by X. An explanation for r with respect to Π and X is shown in Figure 5(b).

(a)

(b)
Figure 5: (a) The and-or explanation tree for r and (b) an explanation for r.

Note that our definition of an and-or explanation tree considers only the positive body parts of the rules to provide explanations. Therefore, explanation trees do not provide further explanations for negated literals (e.g., why an atom is not included in the answer set) or aggregates (e.g., why a cardinality constraint is satisfied), as seen in the example above.

5 Generating Shortest Explanations

As can be seen in Figure 4, there might be more than one explanation for a given atom. Hence, it is not surprising that one may prefer some explanations to others. Consider biomedical queries about chains of gene-gene interactions, like the query Q8 in Table 1. Answers of such queries may contain chains of gene-gene interactions with different lengths. For instance, an answer for this query is “CASK”. Figure 6 shows an explanation for this answer. Here, “CASK” is related to “ADRB1” via a gene-gene interaction chain of length 2 (the chain “CASK”–“DLG4”–“ADRB1”). Another explanation is partly shown in Figure 7. Now, “CASK” is related to “ADRB1” via a gene-gene interaction chain of length 3 (the chain “CASK”–“DLG1”–“DLG4”–“ADRB1”). Since gene-gene interactions are important for drug discovery, it may be more desirable for the experts to reason about chains with shortest lengths.

With this motivation, we consider generating shortest explanations. Intuitively, an explanation E is shorter than another explanation E′ if the number of rule vertices involved in E is less than the number of rule vertices involved in E′. Then we can define shortest explanations as follows.

   gene_gene("CASK","DLG4"),
  

   gene_gene_biogrid("CASK","DLG4")

gene_gene_biogrid("CASK","DLG4")
dummy

   gene_gene("DLG4","ADRB1"),
   start_gene("ADRB1")

   gene_gene_biogrid("DLG4","ADRB1")

gene_gene_biogrid("DLG4","ADRB1")

start_gene("ADRB1")
dummy
Figure 6: A shortest explanation for Q8.

   gene_gene("CASK","DLG1"),
  

   gene_gene_biogrid("CASK","DLG1")

  
  

Figure 7: Another explanation for Q8.
Definition 5 (Shortest explanation)

Let Π be a ground ASP program, X be an answer set for Π, A be an atom in X, and E be an explanation (with vertices V_E) for A with respect to Π and X. Then, E is a shortest explanation for A with respect to Π and X if there exists no explanation E′ (with vertices V_{E′}) for A with respect to Π and X such that |V_{E′}| < |V_E|.

Example 5

Let Π and X be defined as in Example 1. Then, Figure 4(b) depicts the shortest explanation for p with respect to Π and X.

To compute shortest explanations, we define a weight function that assigns weights to the vertices of the and-or explanation tree. Basically, the weight of an atom vertex (“or” vertex) is equal to the minimum weight among the weights of its children, and the weight of a rule vertex (“and” vertex) is equal to the sum of the weights of its children plus 1. Then the idea is to extract a shortest explanation by propagating the weights of the leaves up and then traversing the vertices that contribute to the weight of the root. Let us define the weight of vertices in the and-or explanation tree.

Definition 6 (Weight function)

Let Π be a ground ASP program, X be an answer set for Π, A be an atom in X, and T be the and-or explanation tree for A with respect to Π and X. The weight function w for T maps every vertex in T to a positive integer and is defined as follows:

w(v) = 1, if v is a leaf vertex;
w(v) = min { w(u) : u ∈ Ch(v) }, if v is an atom vertex;
w(v) = 1 + Σ_{u ∈ Ch(v)} w(u), if v is a non-leaf rule vertex.
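As a worked sketch of this definition, consider the and-or explanation tree of Example 1 (as reconstructed above; the numbers below depend on that example program). The weights propagate bottom-up as follows:

\[
\begin{aligned}
w(v) &= 1 && \text{for every leaf vertex labeled by the fact } r,\\
w(q \leftarrow r) &= 1 + w(r) = 2,\\
w(p \leftarrow q,\, r) &= 1 + w(q) + w(r) = 4,\\
w(p \leftarrow r) &= 1 + w(r) = 2,\\
w(p) &= \min(4, 2) = 2.
\end{aligned}
\]

Hence a shortest explanation for p contains 2 rule vertices, namely those labeled by p ← r and r, in agreement with Example 5.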

Input: Π: a ground ASP program, X: an answer set for Π, A: an atom in X.
Output: a shortest explanation for A w.r.t. Π and X, or an empty vertex-labeled tree.
1  T := createTree(Π, X, A, ∅);
2  if T is not empty then
3      v := root of T;
4      calculateWeight(Π, X, T, v, w);
5      E := extractExp(Π, X, T, v, w, ⊥, min);
6      return E;
7  else
8      return the empty vertex-labeled tree;
Algorithm 1 Generating Shortest Explanations

Using this weight function, we develop Algorithm 1 to generate shortest explanations. Let us describe this algorithm. Algorithm 1 starts by creating the and-or explanation tree T for A with respect to Π and X (Line 1); for that it uses Algorithm 2. If T is not empty, then Algorithm 1 assigns weights to the vertices of T (Line 4), using Algorithm 3. As the final step, Algorithm 1 extracts a shortest explanation from T (Line 5), using Algorithm 4. The idea is to traverse an explanation tree of T, with the help of the weight function, and construct an explanation, which would be a shortest one, by contemplating only the rule vertices in the traversed explanation tree. If T is empty, Algorithm 1 returns an empty vertex-labeled tree.

Algorithm 2 (with the call createTree(Π, X, A, ∅)) creates the and-or explanation tree for A with respect to Π and X recursively. With a call createTree(Π, X, e, S), where S intuitively denotes the atoms labeling the atom vertices created so far, the algorithm considers two cases: e being an atom or a rule. In the former case, 1) the algorithm creates an atom vertex for e, 2) it identifies the rules that support e, 3) for each such rule, it creates a vertex-labeled tree (i.e., a subtree of the resulting and-or explanation tree), and 4) it connects these trees to the atom vertex. In the latter case, if e is a rule in Π, 1) the algorithm creates a rule vertex for e, 2) it identifies the atoms in the positive part of the body of e, 3) it creates the and-or explanation tree for each such atom, and 4) it connects these trees to the rule vertex.

Once the and-or explanation tree is created, Algorithm 3 assigns weights to all vertices in the tree by propagating the weights of the leaves (i.e., 1) up to the root in a bottom-up fashion, using the weight function definition (i.e., Definition 6).

After that, Algorithm 4 (with the call extractExp(Π, X, T, v, w, ⊥, min)) extracts a shortest explanation in a top-down fashion starting from the root by examining the weights of the vertices. In particular, if a visited vertex v is an atom vertex, then the algorithm proceeds with the child of v with the minimum weight; otherwise, it considers all the children of v.

Input: Π: a ground ASP program, X: an answer set for Π, e: an atom in X or a rule in Π, S: a set of atoms in X.
Output: a vertex-labeled tree.
1  T := the empty vertex-labeled tree;
2  if e is an atom in X then
3      create an atom vertex v with ℓ(v) := e and add it to T;
4      foreach rule r ∈ Support_Π(e, X, S) do
5          T_r := createTree(Π, X, r, S ∪ {e});
6          if T_r is not empty then
7              v_r := root of T_r;
8              add T_r to T and add the edge (v, v_r);
9      if Ch(v) = ∅ then return the empty vertex-labeled tree;
10 else if e is a rule in Π then
11     create a rule vertex v with ℓ(v) := e and add it to T;
12     foreach atom B ∈ B⁺(e) do
13         T_B := createTree(Π, X, B, S);
14         if T_B is empty then return the empty vertex-labeled tree;
15         v_B := root of T_B;
16         add T_B to T and add the edge (v, v_B);
17 return T;
Algorithm 2 createTree
Input: Π: a ground ASP program, X: an answer set for Π, T: the and-or explanation tree, v: a vertex in T, w: the (partially computed) weight function.
Output: the weight w(v) of v.
1  if v is an atom vertex then
2      foreach u ∈ Ch(v) do calculateWeight(Π, X, T, u, w);
3      w(v) := min { w(u) : u ∈ Ch(v) };
4  else if v is a rule vertex then
5      w(v) := 1;
6      foreach u ∈ Ch(v) do w(v) := w(v) + calculateWeight(Π, X, T, u, w);
7  return w(v);
Algorithm 3 calculateWeight
Input: Π: a ground ASP program, X: an answer set for Π, T: the weighted and-or explanation tree, v: a vertex in T, w: a weight function on the vertices of T, p: a rule vertex of the explanation constructed so far or ⊥, op: the string min or max.
Output: a vertex-labeled tree E (an explanation).
1  if v is an atom vertex then
2      pick the op-weighted child u of v (e.g., a minimum-weighted child if op = min);
3      extractExp(Π, X, T, u, w, p, op);
4  else if v is a rule vertex then
5      create a vertex v′ with ℓ(v′) := ℓ(v) and add it to E;
6      if p ≠ ⊥ then add the edge (p, v′) to E;
7      foreach u ∈ Ch(v) do extractExp(Π, X, T, u, w, v′, op);
8  return E;
Algorithm 4 extractExp

The execution of Algorithm 1 is also illustrated in Figure 8. First, the and-or explanation tree is generated, which has a generic structure as in Figure 8(a). Here, yellow vertices denote atom vertices and blue vertices denote rule vertices. Then, this tree is weighted as in Figure 8(b). Then, starting from the root, a subtree of the and-or explanation tree is traversed by visiting the minimum-weighted child of every atom vertex and every child of every rule vertex. This process is shown in Figure 8(c), where red vertices form the traversed subtree. From this subtree, an explanation is extracted by ignoring atom vertices and keeping the parent-child relationship of the tree as it is. The resulting explanation is depicted in Figure 8(d).

Figure 8: A generic execution of Algorithm 1.
Proposition 2

Given a ground ASP program Π, an answer set X for Π, and an atom A in X, Algorithm 1 terminates.

Proposition 3

Given a ground ASP program Π, an answer set X for Π, and an atom A in X, Algorithm 1 either finds a shortest explanation for A with respect to Π and X or returns an empty vertex-labeled tree.

Proposition 4

Given a ground ASP program Π, an answer set X for Π, and an atom A in X, the time complexity of Algorithm 1 is exponential in the size of X.

We generate the complete and-or explanation tree while finding a shortest explanation. In fact, we can find a shortest explanation by creating a partial and-or explanation tree using a branch-and-bound idea. In particular, the idea is to compute the weights of vertices during the creation of the and-or explanation tree and, in case there exists a branch of the and-or explanation tree that exceeds the weight of a vertex computed so far, to stop branching on unnecessary parts of the and-or explanation tree. Then, a shortest explanation can be extracted by the same method used previously, i.e., by traversing a subtree of the and-or explanation tree and ignoring the atom vertices in this subtree. For instance, consider Figure 8(b). Assume that we first create the right branch of the root. Since the weight of an atom vertex is equal to the minimum weight among its children's weights, we know that the weight of the root is at most 2. Now, we check whether it is necessary to branch on the left child of the root. Note that the weight of a rule vertex is equal to 1 plus the sum of its children's weights. As the left child of the root has two children, its weight is at least 3. Therefore, it is redundant to branch on the left child of the root. This improvement is not implemented yet and is left as future work.

6 Generating Different Explanations

When there is more than one explanation for an answer of a query, it might be useful to provide the experts with several more explanations that are different from each other. For instance, consider the query Q5 in Table 1.

  • What are the drugs that treat the disease Asthma or that react with the drug Epinephrine?

An answer for this query is “Doxepin”. According to one explanation, “Doxepin” reacts with “Epinephrine” with respect to DrugBank. At this point, the expert may not be convinced and may ask for a different explanation. Another explanation for this answer is that “Doxepin” treats “Asthma” according to CTD. Motivated by this example, we study generating different explanations.

Input: Π: a ground ASP program, X: an answer set for Π, A: an atom in X, k: a positive integer. Assume there are n different explanations for A w.r.t. Π and X.
Output: min(n, k) different explanations for A with respect to Π and X.
1  ExpSet := ∅;  R := ∅;
2  T := createTree(Π, X, A, ∅);
3  v := root of T;
4  for i := 1 to k do
5      calculateDifference(Π, X, T, v, R, d);
6      if no explanation different from those in ExpSet can be extracted then return ExpSet;
7      E_i := extractExp(Π, X, T, v, d, ⊥, max);
8      ExpSet := ExpSet ∪ {E_i};
9      R := R ∪ RV(E_i);
10 return ExpSet;
Algorithm 5 Generating k Different Explanations

We introduce an algorithm (Algorithm 5) to compute k different explanations for an atom A in X with respect to Π and X. For that, we define a distance measure between a set S of (previously computed) explanations and a (to be computed) explanation E. We consider the rule vertices RV(S) and RV(E) contained in S and E, respectively. Then, we define the function dist that measures the distance between S and E as follows:

dist(S, E) = |RV(E) \ RV(S)|.

In the following, we sometimes use S and E instead of RV(S) and RV(E) in dist. Also, we denote by RV(T) the set of rule vertices of a vertex-labeled tree T.

Let us now explain Algorithm 5. It computes a set ExpSet of different explanations iteratively. Initially, ExpSet = ∅. First, we compute the and-or explanation tree T (Line 2). Then, we enter a loop that iterates at most k times (Line 4). At each iteration i, an explanation E_i that is most distant from the previously computed explanations is extracted from T. Let us denote the set of rule vertices included in the previously computed explanations by R. Then, essentially, at each iteration we pick an explanation E_i such that dist(R, E_i) is maximum. To be able to find such an E_i, we need to define the “contribution” d(v) of each vertex v in T to the distance measure if v is included in the explanation E_i:

d(v) = 1 if v is a rule vertex not contained in R, and d(v) = 0 otherwise.

Note that this function is different from dist. Intuitively, v contributes to the distance measure if it is not included in R. The contributions of the vertices in T are computed by Algorithm 6 (Line 5) by propagating the contributions up, in the spirit of Algorithm 3. Then, E_i is extracted from the weighted tree T by using Algorithm 4 (Line 7).
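As a worked sketch on the tree of Example 1 (again relying on the example program reconstructed there): suppose the shortest explanation, whose rule vertices are labeled by p ← r and the fact r, has already been computed, so R consists of those two vertices. Since every rule vertex of the left subtree is a tree vertex not in R, each contributes 1, and the contributions propagate as

\[
d(q \leftarrow r) = 1 + d(r) = 2,\qquad
d(p \leftarrow q,\, r) = 1 + d(q) + d(r) = 4,\qquad
d(p \leftarrow r) = 0,\qquad
d(p) = \max(4, 0) = 4,
\]

so the explanation extracted next is the left one, none of whose rule vertices occurs in R.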

Input: Π: a ground ASP program, X: an answer set for Π, T: the and-or explanation tree, v: a vertex in T, R: a set of rule vertices of T, d: the (partially computed) contribution function.
Output: the contribution d(v) of v.
1  if v is an atom vertex then
2      foreach u ∈ Ch(v) do calculateDifference(Π, X, T, u, R, d);
3      d(v) := max { d(u) : u ∈ Ch(v) };
4  else if v is a rule vertex then
5      if v ∉ R then d(v) := 1;
6      else d(v) := 0;
7      foreach u ∈ Ch(v) do d(v) := d(v) + calculateDifference(Π, X, T, u, R, d);
8  return d(v);
Algorithm 6 calculateDifference

The execution of Algorithm 5 is also illustrated in Figure 9. Similar to Algorithm 1, which generates shortest explanations, first the and-or explanation tree is created, which has a generic structure as shown in Figure 9(a). Recall that yellow vertices denote atom vertices and blue vertices denote rule vertices. For the sake of the example, assume that one explanation has already been computed and R contains its rule vertices. Then, the goal is to generate an explanation that contains as many rule vertices different from the ones in R as possible. For that, the vertices are assigned weights according to the contribution function, as depicted in Figure 9(b). Here, the weight of the root implies that there exists an explanation which contains that many rule vertices different from the rule vertices in R, and this explanation is the most different one. Then, starting from the root, a subtree of the and-or explanation tree is traversed by visiting the maximum-weighted child of every atom vertex, and every child of every rule vertex. This subtree is shown in Figure 9(c) by red vertices. Finally, an explanation is extracted from this subtree by ignoring the atom vertices and keeping the parent-child relationship as it is. This explanation is illustrated in Figure 9(d).

Figure 9: A generic execution of Algorithm 5.
Proposition 5

Given a ground ASP program Π, an answer set X for Π, an atom A in X, and a positive integer k, Algorithm 5 terminates.

Proposition 6

Let Π be a ground ASP program, X be an answer set for Π, A be an atom in X, and k be a positive integer. Let n be the number of different explanations for A with respect to Π and X. Then, Algorithm 5 returns min(n, k) different explanations for A with respect to Π and X.

Furthermore, at each iteration of the loop in Algorithm 5, the distance is maximized.

Proposition 7

Let Π be a ground ASP program, X be an answer set for Π, A be an atom in X, and k be a positive integer. Let n be the number of explanations for A with respect to Π and X. Then, at the end of each iteration i (1 ≤ i ≤ min(n, k)) of the loop in Algorithm 5, dist(ExpSet, E_i) is maximized, i.e., there is no other explanation E′ such that dist(ExpSet, E′) > dist(ExpSet, E_i).

This result leads us to some useful consequences. First, Algorithm 5 computes “longest” explanations if k = 1 (since initially R = ∅, every rule vertex contributes to the distance). The following corollary shows how to compute longest explanations.

Corollary 1

Let Π be a ground ASP program, X be an answer set for Π, A be an atom in X, and k = 1. Then, Algorithm 5 computes a longest explanation for A with respect to Π and X.

Next, we show that Algorithm 5 computes different explanations such that for every i (1 ≤ i ≤ min(n, k)) the explanation E_i is the most distant explanation from the previously computed explanations.

Corollary 2

Let Π be a ground ASP program, X be an answer set for Π, A be an atom in X, and k be a positive integer. Let n be the number of explanations for A with respect to Π and X. Then, Algorithm 5 computes min(n, k) different explanations E_1, ..., E_{min(n,k)} for A with respect to Π and X such that for every i (1 ≤ i ≤ min(n, k)), dist({E_1, ..., E_{i−1}}, E_i) is maximized.

The following proposition shows that the time complexity of Algorithm 5 is exponential in the size of the given answer set.

Proposition 8

Given a ground ASP program Π, an answer set X for Π, an atom A in X, and a positive integer k, the time complexity of Algorithm 5 is exponential in the size of X.

7 Experiments with Biomedical Queries

Our algorithms for generating explanations are applicable to the queries Q1, Q2, Q3, Q4, Q5, Q8, Q10, Q11 and Q12 in Table 1. The ASP programs for the other queries involve choice expressions. For instance, the query Q7 asks for cliques of 5 genes. We use the following rule to generate a candidate set of 5 genes that might form a clique.

5{clique(GEN):gene_name(GEN)}5.

Our algorithms apply to ASP programs that contain a single atom in the heads of the rules, and negation and cardinality expressions in the bodies of the rules. Therefore, our methods are not applicable to the queries that are represented by ASP programs with choice expressions, like the one above.
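For completeness, here is a minimal sketch of how such a clique candidate could be tested (this constraint is our illustration rather than the exact encoding in the rule layer; it assumes gene_gene is symmetric):

% eliminate candidates in which two distinct chosen genes do not interact
:- clique(G1), clique(G2), G1 != G2, not gene_gene(G1,G2).

This constraint eliminates the candidate sets generated by the choice rule above in which some pair of distinct chosen genes does not interact.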

Query | CPU Time | Explanation Size | Answer Set Size | And-Or Tree Size | gringo calls
Q1  | 52.78s   | 5  | 1,964,429  | 16   | 0
Q2  | 67.54s   | 7  | 2,087,219  | 233  | 1
Q3  | 31.15s   | 6  | 1,567,652  | 15   | 0
Q4  | 1245.83s | 6  | 19,476,119 | 6690 | 4
Q5  | 41.75s   | 3  | 1,465,817  | 16   | 0
Q8  | 40.96s   | 14 | 1,060,288  | 28   | 4
Q10 | 1601.37s | 14 | 1,612,128  | 3419 | 193
Q11 | 113.40s  | 6  | 2,158,684  | 5528 | 5
Q12 | 327.22s  | 5  | 10,338,474 | 10   | 1
Table 2: Experimental results for generating shortest explanations for some biomedical queries, using Algorithm 1.

In Table 2, we present the results for generating shortest explanations for the queries Q1, Q2, Q3, Q4, Q5, Q8, Q10, Q11 and Q12. In this table, the second column denotes the CPU timings to generate shortest explanations in seconds. The third column consists of the sizes of explanations, i.e., the number of rule vertices in an explanation. In the fourth column, the sizes of answer sets, i.e., the number of atoms in an answer set, are given. The fifth column presents the sizes of the and-or explanation trees, i.e., the number of vertices in the tree.

Before telling what the last column presents, let us clarify an issue regarding the computation of explanations. Since answer sets contain millions of atoms, the relevant ground programs are also huge. Thus, first grounding the programs and then generating explanations over those ground programs is an overkill in terms of computational efficiency. Instead, we apply another method and perform grounding only when it is necessary. To better explain the idea, let us present our method by an example. At the beginning, we have a ground atom for which we are looking for shortest explanations. Assume that this atom is what_be_genes("ADRB1"). Then, we find the rules whose heads are of the form what_be_genes(GN), and instantiate GN with "ADRB1". For instance, assume that the following rule exists in the program (a schematic rule, in the style of the encoding of Q11 above):

what_be_genes(GN) :- cond1(GN), cond2(GN).

Then, by such an instantiation, we obtain the following instance of this rule:

what_be_genes("ADRB1") :- cond1("ADRB1"), cond2("ADRB1").

Next, if the rules that we obtain by instantiating their heads are not ground, we ground them using the grounder gringo, considering the answer set. We apply the same method to the atoms that are now ground, to find the relevant rules and ground them if necessary. This allows us to deal with a relevant subset of the rules while generating explanations. The last column of Table 2 presents the number of times gringo is called for such incremental grounding. For instance, for the queries Q1, Q3 and Q5, gringo is never called; however, gringo is called 193 times during the computation of a shortest explanation for the query Q10.

As seen from the results presented in Table 2, the computation time is not strongly related to the size of the explanation. As also suggested by the complexity analysis of Algorithm 1, the computation time for generating shortest explanations greatly depends on the sizes of the answer set and the and-or explanation tree. For instance, for the query Q4, the answer set contains approximately 19 million atoms, the size of the and-or explanation tree is 6690, and it takes 1245 CPU seconds to compute a shortest explanation, whereas for the query Q8, the answer set contains approximately 1 million atoms, the and-or explanation tree has 28 vertices, and it takes 40 CPU seconds to compute a shortest explanation. The number of times gringo is called during the computation also affects the computation time. For instance, for the query Q10, the answer set contains approximately 1.6 million atoms, the and-or explanation tree has 3419 vertices, and it takes 1600 CPU seconds to compute a shortest explanation.

Table 3 shows the computation times for generating different explanations for the answers of the same queries, when such explanations exist. As seen from these results, the time for computing 2 and 4 different explanations differs only slightly from the time for computing shortest explanations.

Query | CPU Time (2 different) | CPU Time (4 different) | CPU Time (Shortest)
Q1  | 53.73s   | -        | 52.78s
Q2  | 66.88s   | 67.15s   | 67.54s
Q3  | 31.22s   | -        | 31.15s
Q4  | 1248.15s | 1251.13s | 1245.83s
Q5  | -        | -        | 41.75s
Q8  | -        | -        | 40.96s
Q10 | 1600.49s | 1602.16s | 1601.37s
Q11 | 113.25s  | 112.83s  | 113.40s
Q12 | -        | -        | 327.22s
Table 3: Experimental results for generating different explanations for some biomedical queries, using Algorithm 5.

8 Presenting Explanations in a Natural Language

An explanation for an answer of a biomedical query may not be easy to understand, since the user may know neither the syntax of ASP rules nor the meanings of the predicates. Therefore, it is better to present explanations to the experts in a natural language.

Observe that the leaves of an explanation denote facts extracted from the biomedical resources. Also, some internal vertices contain informative explanations, such as the position of a drug in a chain of drug-drug interactions. Therefore, there is a corresponding natural language explanation for some vertices in the tree. Such a correspondence can be stored in a predicate look-up table, like Table 4. Given such a look-up table, a pre-order depth-first traversal of an explanation, generating the natural language expressions corresponding to the vertices of the explanation, leads to an explanation in natural language [Oztok (2012)].

For instance, the explanation in Figure 6 is expressed in natural language as illustrated in the introduction.

Predicate Expression in Natural Language
gene_gene_biogrid(x,y) The gene x interacts with the gene y according to BioGRID.
drug_disease_ctd(x,y) The disease y is treated by the drug x according to CTD.
drug_gene_ctd(x,y) The drug x targets the gene y according to CTD.
gene_disease_ctd(x,y) The disease y is related to the gene x according to CTD.
disease_symptom_do(x,y) The disease x has the symptom y according to Disease Ontology.
drug_category_drugbank(x,y) The drug x belongs to the category y according to DrugBank.
drug_drug_drugbank(x,y) The drug x reacts with the drug y according to DrugBank.
drug_sideeffect_sider(x,y) The drug x has the side effect y according to SIDER.
disease_gene_orphadata(x,y) The disease x is related to the gene y according to Orphadata.
drug_disease_pharmgkb(x,y) The disease y is treated by the drug x according to PharmGKB.
drug_gene_pharmgkb(x,y) The drug x targets the gene y according to PharmGKB.
disease_gene_pharmgkb(x,y) The disease x is related to the gene y according to PharmGKB.
start_drug(x) The drug x is the start drug.
start_gene(x) The gene x is the start gene.
drug_reachable_from(x,l) The distance of the drug x from the start drug is l.
gene_reachable_from(x,l) The distance of the gene x from the start gene is l.
Table 4: Predicate look-up table used while expressing explanations in natural language.

9 Implementation of Explanation Generation Algorithms

Based on the algorithms introduced above, we have developed a computational tool called ExpGen-ASP [Oztok (2012)], using the programming language C++. Given an ASP program and its answer set, ExpGen-ASP generates shortest explanations as well as different explanations.

The inputs of ExpGen-ASP are

  • an ASP program Π,

  • an answer set X for Π,

  • an atom A in X,

  • an option that is used to generate either a shortest explanation or k different explanations,

  • a predicate look-up table,

and the outputs are

  • a shortest explanation for A with respect to