Molecular biology accumulates data and mechanisms suspected to play key roles in the cellular ecosystem. The activity of discovery currently outpaces human abilities to follow and collate new mechanisms. For instance, p53, a protein family relevant to cell apoptosis and cancer formation, is mentioned in the title or in the abstract of about 4700 papers for the year 2018 alone (PubMed).
In 2014, DARPA financed a large program named “Big Mechanism”, for about $45M, pointing explicitly to the problem of extraction and integration of molecular biology facts from biological literature [Cohen2015]. Along this line of research, our intention is to provide a formal basis for describing structural and dynamic biological knowledge suitable for composition and reasoning. To illustrate the type of knowledge we are aiming at, here is a typical sentence from a molecular biology paper:
“The activation of Raf-1 by activated Src requires phosphorylation of Raf-1 on Y340 and/or Y341 […]. Tyrosine phosphorylation and activation of Raf-1 have been shown to be coincident. However, others have been unable to detect phosphotyrosine in active Raf-1.” [mason1999serine]
At this level of abstraction, proteins are considered as chains of amino acid residues such as Y340 and Y341, which are identified by their type (Y for tYrosine) and their position in the chain (resp. 340 and 341). Proteins have names, here Raf-1 and Src, and are usually divided into domains or regions that cover sub-sequences of amino acids. Domains may also be given a name. For instance, Raf-1 has a “Zinc finger” domain in the 137-183 region (see uniprot.org/uniprot/P09560).
Importantly, static names of proteins, domains and residues can be completed with dynamic attributes. Here, “phosphorylation” denotes the attachment of a phosphate group to a protein residue, which tends to modify the protein structure. One then talks about a phosphorylated protein, a phosphorylated domain or, as in the example above, a phosphorylated residue. Other dynamic attributes such as “active” are commonplace in molecular biology.
Underlying the snippet of biological literature given above is the notion of protein interactions: “the activation of Raf-1” by “activated Src” indicates that Raf-1 and active Src can bind to each other so that phosphorylation of Raf-1 by Src may occur. Stable binding of proteins requires complementary domains that stick together with various affinities. The binding state of a protein (or a region) is therefore also a dynamic, relational property.
We express observations using formulas. Their models are transitions, which represent a biological change from a precondition state to a postcondition state. States are forests of linked trees. Trees encode proteins. The root of a tree represents the entire protein, and children represent sub-parts of the protein. Nodes can have static names (Raf-1, Src, Y340, …) and dynamic attributes (Phos, Active, …).
The logic we introduce in this paper has the following design constraints:
– The logic describes changes in a compositional way and works as a basis for knowledge representation.
– One is in principle able to run queries on the information to judge the impact of adding new knowledge to an existing base.
– The logic accommodates both knowledge collation and biological modelling, which applies a parsimony assumption on available biomechanisms. This corresponds to commonsense reasoning: changes not implied by observations cannot occur. This assumption is expressed with a second-order operator on formulas. The formalism introduced in this paper allows one to mix both activities, knowledge collation and modelling, in a single logic while maintaining queryability.
We represent mixtures of proteins (states) as labelled forests. The trees have bounded height and degree. A root represents a whole protein, and children represent domains, subdomains or residues depending on their height in the tree. Transitions are pairs of forests, with overlapping underlying sets. They represent one step of biological change. The first element of a transition is the precondition, and the second element is the postcondition. Static labels (Src, Raf-1) are not allowed to change during the transition, and neither does the structure of the trees. Dynamic labels () may change. We encode changes by copying each dynamic predicate: for instance, means “ is phosphorylated in the precondition”, while means “ is phosphorylated in the postcondition”.
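As a rough illustration of this encoding, the sketch below stores a transition as a static part (tree structure and names) together with one copy of the dynamic label set for the precondition and one for the postcondition. The class name `Transition`, its field names and the node identifiers (`raf1`, `r`, `y340`) are assumptions made for this example and are not notation from the paper; links and node creation are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    parent: dict  # static: node -> parent (a root is its own parent)
    names: dict   # static: node -> static name, e.g. "Raf-1"
    pre: set      # dynamic: (label, node) pairs true in the precondition
    post: set     # dynamic: (label, node) pairs true in the postcondition

    def label_changes(self):
        """Dynamic labels whose truth value differs between pre and post."""
        return self.pre ^ self.post

    def root(self, node):
        """Follow parent pointers up to the node standing for the protein."""
        while self.parent[node] != node:
            node = self.parent[node]
        return node


# Hypothetical example: a Raf-1 tree with a region and a tyrosine residue.
t = Transition(
    parent={"raf1": "raf1", "r": "raf1", "y340": "r"},
    names={"raf1": "Raf-1", "r": "R", "y340": "Y340"},
    pre={("Active", "raf1"), ("Phos", "y340")},
    post={("Phos", "y340")},
)
print(t.label_changes())  # the only change: Raf-1 loses Active
print(t.root("y340"))     # the protein node that owns the residue
```

Keeping the static part outside `pre` and `post` makes it impossible, by construction, for names or tree structure to differ between the two states.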
The other dynamic aspect is a functional and symmetric relation Link which represents protein-protein interactions, typically noncovalent bonds. Functionality captures the fact that binding sites are resources, so is incompatible with if . It comes with a copy for the postcondition, . If represents a protein connected to multiple partners, the corresponding links are distributed on separate children nodes of .
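The two requirements on Link can be checked mechanically on a finite relation. The following is a minimal sketch, assuming the relation is stored as a set of ordered pairs; the function names and the site identifiers (`s1`, `r1`, `r2`) are hypothetical.

```python
def is_symmetric(link):
    """Every pair (a, b) must come with its mirror image (b, a)."""
    return all((b, a) in link for (a, b) in link)


def is_functional(link):
    """Each node has at most one partner: binding sites are resources."""
    partner = {}
    for a, b in link:
        # setdefault records the first partner seen and returns it afterwards
        if partner.setdefault(a, b) != b:
            return False
    return True


ok = {("s1", "r1"), ("r1", "s1")}
# s1 is now bound to both r1 and r2, which functionality forbids
bad = ok | {("s1", "r2"), ("r2", "s1")}

print(is_symmetric(ok), is_functional(ok))
print(is_symmetric(bad), is_functional(bad))
```

A protein with several partners would, as described above, carry each link on a separate child node, so that each individual node still passes `is_functional`.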
A transition contains zero or more changes (edge removal, label change, etc). Transitions can be ordered along their changes. Intuitively, if two transitions have the same precondition and one contains all the changes of the other, then they are comparable along a change order. We introduce nonmonotonic reasoning with the operator : for a formula , denotes the models that are minimal along the change order. In models of , no unnecessary changes occur.
Consider the following formula:
where is a term denoting the parent of in its tree (or itself, in the case where is a root). The models of are all transitions such that if, in the postcondition, the interpretation of is Active, then is not linked to any protein (it is ‘free’) also in the postcondition. We can compose observations and for example add the observation that ends up connected to an Src protein:
In figure 1(a), we show the transition , which satisfies . The precondition is on the left of the arrow, and the postcondition is on the right. Changes are highlighted in green. is the region R bearing the Y340 residue of Raf-1, which is phosphorylated. The Raf-1 node, , loses the label Active. A link between and is created.
There is, so to speak, an open-world assumption on changes: for instance, a transition that has thousands of trees in the precondition and deletes them all in the postcondition satisfies . While knowledge collation has to be made with an open-world assumption (more structure or more changes might be added when more knowledge is accessible), modelling focuses on dynamics and is an activity that is intrinsically parsimonious with regard to changes: the dynamics of a model are restricted to what is implied by current knowledge. Therefore, we also want the ability to reason with the additional assumption that all relevant observations have been made. To remove transitions with spurious changes, we use the operator . Transitions that are models of have the changes required for to be linked to an Src in the postcondition, and such that either is free or ’s parent is inactive in the postcondition, and no other changes.
In figure 1(b), we see the transition , which satisfies . and have the same precondition, but contains additional changes, highlighted in red: the tyrosine residue of becomes unphosphorylated, and a new protein is created. Intuitively, those additional changes are not required by Obs, and we will show that does not satisfy .
Structure of the paper
Section 2 introduces the vocabulary. Section 3 describes two classes of structures, transitions and transitions of forests of linked bounded trees, or FLBs. Modulo the theory of FLBs, first-order satisfiability is not decidable, but satisfiability in the prenex fragment is. Section 4 defines a change order on transitions; two transitions with the same precondition are comparable if all the changes of one are included in the other. We denote the change-minimal models of a formula with . Section 5 introduces deduction rules and states the main theorem: modulo the theory of supported FLBs, deducible formulas are in a class with decidable satisfiability and validity. Section 6 defines the new constructs of unified circumscription and preservation, which allow one to prove the main theorem. Both of general relevance, the former can express as a second-order formula, and the latter, through model-theoretic properties, defines a class of for which is actually first-order.
Formulas are first-order except when otherwise noted. Equality is allowed, but not constant symbols. Signatures are noted , structures are noted, , and interpretations from a set of variables to a structure domain are noted .
By “nodes” we mean elements in the domain of a structure.
For a first-order quantifier prefix and any formula, we write to mean that is equivalent to a prenex first-order formula with quantifier prefix in .
If is a term, a set of terms, or a formula, is the set of variables mentioned in . For a structure , a term , and , is the interpretation of in under . We also lift the semantic brackets to sets of terms. Moreover, for a formula over the tuple of variables , is the set of tuples from such that .
Transitions are a generic framework for representing changes between states. The vocabulary for a transition is given by a transition signature of the form . and (for Dynamic) are tuples of relational symbols. and are similar: they have matching length and pointwise arities. (for Static) is a tuple of symbols.
provides the vocabulary for describing the precondition, for describing the postcondition, for describing the invariant part, and P, for describing the elements present in the pre- and postconditions.
contains a distinguished unary predicate symbol (for Presence) so node creation/destruction can be encoded. The coherence constraint prevents spurious nodes that would inhabit a structure yet be encoded as nonexistent. is supported if and is a support formula if .
A formula is pre if it does not use any symbol from . If is a formula, is where every symbol from has been replaced by its counterpart from .
3 Transitions, Forests of Linked Bounded trees
An FLB signature (for Forest of Linked Bounded trees) is a transition signature specialised for representing transitions on bounded forests with dynamic links between nodes. There is a convenience parent function symbol which goes up one level in the tree. Other function names play the role of tree edge labels. Nodes can be linked through a functional and symmetric relation Link (or in the postcondition). (One can increase the number of binding partners by allowing a subtree under a node.)
A transition signature is an FLB signature whenever :
where are the unary presence symbols. are binary link symbols. (dynamic Labels) is a tuple of unary predicate symbols. is a tuple of unary functions (child-of functions). bounds the degree of the trees. parent is a distinguished unary function. (static names) is a tuple of unary predicate symbols.
Not all -structures are forests of linked bounded trees. With a -structure, is the loop-free union of graphs of (loop-free meaning that, for any , is not in the graph even if ). is an FLB whenever is a forest, the loop-free graph of is the parent relation in that forest, and are symmetric and functional.
For every , the -FLBs are the FLBs with trees of height at most .
In figure 1(b), is a -FLB. The symbols Raf-1 and Tyr are in of the underlying signature. Phos is in and is in . There is a function symbol such that . The creation of a link between and is encoded as and . The creation of is encoded as and .
-FLBs can be characterised by a finite, first-order, universal theory (that is, is in the prenex class). We do not reproduce it here in full detail. It is of the following form:
ParentSpec forces parent to behave as a parent function. forces paths through to be of length at most . (Note that the signature bounds the degree of the trees, while the theory bounds their height.) FunSymLink and ensure that Link and are functional and symmetric.
A -structure is an -FLB iff .
The proof uses to prevent cycles and ParentSpec to force uniqueness of paths from the roots.
Formulas modulo are a good candidate for knowledge representation, but querying is not possible in general. Let be the theory of supported -FLBs:
First-order satisfiability modulo is undecidable for .
The proof is by reduction from domino problems. Colors are labels in , and trees have height , colored roots, and leaves. Each leaf is a direction (up, down, left, right) and the only allowed links are between up-down or left-right pairs with appropriately colored roots.
For any FLB signature, satisfiability modulo is still achievable in a restricted fragment:
For , satisfiability modulo in the fragment is decidable.
This can be proved by adapting the classic proof of decidability for the Bernays-Schönfinkel-Ramsey fragment in relational FO with equality to our non-relational signatures. We show that restrains functions enough to maintain decidability because iterated function application becomes stationary after a bounded number of steps. As in the original proof, we get a small model property as a byproduct and a description of that model (it has just enough trees to host the existential witnesses required by the part).
FLB signatures and their associated theories describe state transitions on forests of bounded trees with static and dynamic labels as well as a dynamic, functional link relation between nodes. While satisfiability is not decidable in general, it is in the fragment. Note in our example that . The next section introduces commonsense reasoning by characterising transitions which, given a precondition, only apply the changes that are necessary to satisfy a formula.
4 Change minimisation
For , and an arity(A)-sized tuple of variables, describes the changes in A: . For simplicity, the tuple may be omitted.
-structures can be ordered along a partial change order : for any two -structures, let whenever for all :
So means that they have equal preconditions, that contains at least the elements in , and that any change that occurs in also occurs in .
Consider and from figures 1(a) and 1(b). : their preconditions () are equal, their static parts () are equal on their common elements, has one more element () and every change in is present in , but .
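The comparison can be made concrete on finite data. The sketch below is a simplified reading of the change order, not the paper's definition: it assumes a transition is a dictionary with a node set and two sets of `(symbol, node)` facts, that presence is recorded as a fact with the hypothetical symbol `"P"`, and it ignores the static part and links entirely.

```python
def changes(t):
    """Facts whose truth value differs between precondition and postcondition."""
    return t["pre"] ^ t["post"]


def leq(t1, t2):
    """Simplified change order: same precondition, t2 has at least the
    elements of t1, and every change of t1 also occurs in t2."""
    return (t1["pre"] == t2["pre"]
            and t1["nodes"] <= t2["nodes"]
            and changes(t1) <= changes(t2))


pre = {("P", "raf1"), ("P", "src"), ("Active", "raf1")}

# t1: Raf-1 loses Active, nothing else happens.
t1 = {"nodes": {"raf1", "src"},
      "pre": pre,
      "post": {("P", "raf1"), ("P", "src")}}

# t2: same change, plus the creation of a node that is absent in the precondition.
t2 = {"nodes": {"raf1", "src", "new"},
      "pre": pre,
      "post": {("P", "raf1"), ("P", "src"), ("P", "new")}}

print(leq(t1, t2))  # t1 is below t2
print(leq(t2, t1))  # but not the other way round
```

Because node creation shows up as a change of the presence fact, a transition that creates an extra node is never below one that does not.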
The -minimal models of a first-order formula are expressed as (“minimised ”):
With a formula, iff , and there is no such that .
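Over a finite list of candidate models, this definition amounts to filtering out every candidate that has a strictly smaller one. The sketch below shows only that filtering step; the helper names are hypothetical, the order is passed in as a parameter, and set inclusion on change sets is used as a stand-in order (which presumes that all candidates share the same precondition). The operator defined in the paper ranges over all models of the formula, not over a hand-picked finite list.

```python
def minimal(models, leq):
    """Keep the models that have no strictly smaller model in the list."""
    return [m for m in models
            if not any(leq(n, m) and not leq(m, n) for n in models)]


def subset(a, b):
    """Stand-in order: a is below b when its changes are included in b's."""
    return a <= b


# Each candidate is represented only by its set of changes.
candidates = [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"c"})]
print(minimal(candidates, subset))
```

Here `{"a", "b"}` is discarded because `{"a"}` is strictly below it, while `{"a"}` and `{"c"}` are incomparable and both survive.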
Intuitively, if represents existing knowledge of a biological mechanism, represents the current best model (in the biological sense) implied by that knowledge.
One may naturally ask for a syntactic definition of . In section 6.2, we will see that, in general, is second-order expressible . In the meantime, the next section provides deduction rules that can produce formulas of the form . It defines a class of formulas with minimal models that can be captured in a first-order fragment, rather than in second-order logic only.
5 Deduction rules
We introduce deduction rules for the judgement , which should be seen as a typing property for formulas.
For any term , is any binary atom where and both appear. If is a tuple of relational symbols, is the set of literals that use symbols of .
is any set of first-order variables, , and is a formula. In a judgment of the form , we say that is the context. Functions of FLB signatures are unary, so for any term , is a singleton and for , .
The judgment implies that, for any and , there is a “protected” subset parameterized by , and such that removing changes of outside of preserves satisfaction of (see section 6.1).
We state the main theorem of the paper and informally describe the rules. The remaining sections introduce the main theoretical tools that are necessary to prove the theorem.
If , then and .
Proposition 3.3 and Theorem 5.1 imply that, modulo , validity and satisfiability are decidable for -deducible formulas. In particular, consider the rule ”rule:circum”, which has no special proviso. Any deducible formula can be minimised along (modulo ), and the result is not only first-order expressible, but also equivalent both to a formula in and to one in .
”rule:static” and ”rule:dynamic” both introduce literals, but ”rule:dynamic”, being about the postcondition (note the proviso ), must protect the elements mentioned in . ”rule:weak” says that the protected area can always be expanded. ”rule:bool” says that boolean combinations are allowed. ”rule:inv” says that, if the protected area is small enough, it can be ignored as long as constraints on the postcondition are extended to the precondition. While ”rule:bool” and ”rule:inv” may both produce new conjunctions, ”rule:inv” can remove an element from the protected set provided additional constraints are satisfied. ”rule:forall” and ”rule:exists” introduce quantifiers. The proviso for ”rule:exists” requires a proper guard () and increases the protection distance by . The proviso for ”rule:forall” allows a vacuous guard () in some cases, and does not always increase the protection distance. The asymmetry between ”rule:forall” and ”rule:exists” reflects the asymmetry in the notion of “protection”, cf. section 6.1.
and are deducible. For instance, is derived by applying ”rule:forall” to as and as (both introduced with ”rule:dynamic”).
6 Proof elements for Theorem 5.1
We focus on techniques with general applicability. Subsection 6.1 introduces preservation, the main semantic invariant which is implied by . Preservation captures a notion of constraint locality at the semantic level which then translates to syntactic expressivity properties. Subsections 6.2 and 6.3 detail how the operator is constructed as an instantiation of unified circumscription, a generalisation of existing circumscription schemes. Subsection 6.4 sketches how preservation implies first-order expressibility of circumscribed formulas modulo and why the resulting first-order formula lives in both and .
The intuition behind preservation is to find classes of formulas that provide useful static information on their -minimal models. In particular, it implies that changes in minimal models are in a ball of bounded radius, which lets them be characterised by first-order formulas. Preservation also interacts well with formula composition.
Let be an FLB signature. For , let , and . Let be an FLB for . is the set of trees of . For , is the set of vertices of . For , is the tree such that .
A node is modified whenever at least one of the following is true:
– for some unary
– There is such that
– There is such that
In particular, an external link deletion (some with ) does not make a modified element. A tree is modified whenever at least one of its elements is modified. The set of modified elements in is denoted by . For any tree , the set of modified elements outside is .
For any nodes the link distance is the distance between and in the graph with nodes and edges .
For , the ball of radius around is:
If we protect a ball of radius around a set , we can clear the changes of a tree outside of that protected area and produce a new FLB . Intuitively, we:
Pick a tree far enough (at distance ) from a special set (), then
Clear any modification that relates to , and
Clear external edge deletions that relate to and unprotected, unmodified trees.
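The protected area used in these steps is a ball in a graph, which can be computed by breadth-first search. The sketch below assumes the graph is already given as an adjacency dictionary; which edges it contains is fixed by the link distance defined in the text and is not reconstructed here. The function name `ball` and the node names are hypothetical.

```python
from collections import deque


def ball(adj, centres, radius):
    """Nodes at distance at most `radius` from some node of `centres`."""
    dist = {c: 0 for c in centres}
    queue = deque(centres)
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the requested radius
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


# A path a - b - c - d, given as an undirected adjacency dictionary.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(ball(adj, {"a"}, 1))
print(ball(adj, {"a"}, 2))
```

A tree is then a candidate for clearing only if none of its nodes belongs to the returned set.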
Let be a relation and a set, are the tuples of that mention at least one element of . are the tuples of that mention no element of .
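On finite relations, these two restrictions are plain filters. A minimal sketch follows, with hypothetical function names and a relation stored as a set of tuples:

```python
def touching(rel, s):
    """Tuples of rel that mention at least one element of s."""
    return {t for t in rel if any(x in s for x in t)}


def avoiding(rel, s):
    """Tuples of rel that mention no element of s."""
    return {t for t in rel if not any(x in s for x in t)}


link = {("a", "b"), ("b", "a"), ("c", "d"), ("d", "c")}
print(touching(link, {"a"}))
print(avoiding(link, {"a"}))
```

The two results always partition the relation, which is what lets the definition of a sub treat links touching a given tree separately from all other links.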
With , , a tree of that does not intersect , we say that is a -sub of with cleared tree whenever, for all :
If is not specified, we say that is a -sub of . If is not specified, we say that is a sub of . Note that the resulting sub is not uniquely defined (elements of the domain may disappear).
We illustrate subs in figure 2. We assume no changes in . The pre- and postconditions are superimposed: there are changing links between with a solid link for an addition, and a dashed one for a deletion (no link is both in the pre- and postcondition). The effect of going from to a -sub of with cleared tree is illustrated by the red areas that indicate which link changes are cleared (i.e. are in but not in ). The striped area is . The link deletion from to is not cleared, because it touches a node in the tree of the kernel ; neither is the link deletion between and because is changing (both through a link addition with and an internal link deletion). However, the link addition between and is cleared (unconditionally), as well as the internal link deletion on since, even though is changing, it is also the cleared tree and thus unprotected. Finally, the link between and is cleared because is neither changing nor in .
The idea is that, for a class of formulas, satisfaction is preserved by taking subs. If a change (unary predicate change, or edge deletion, or edge addition) is present in but not present in , we say that it has been cleared. This new relation between structures induces a property on formulas we call preservation:
For , a set of variables, a formula is preserved under if for all FLBs , all , all -subs of , all , implies .
If then is preserved under .
We give a proof sketch for each rule:
”rule:dynamic”: Take as an example. In an FLB , the only clearing of changes that could invalidate would be the clearing of . If , , and so can never be the cleared tree.
”rule:static”: Constraints on the precondition, on equality or on static properties cannot be invalidated by clearing changes, as that only modifies postconditions. Taking subs protects elements in the image of the interpretation of variables.
”rule:weak”: Taking a larger protected area (either by adding elements to or by increasing ) can only protect more trees.
”rule:bool”: We explain for . Consider a -sub of : if , then by hypothesis, so does .
”rule:circum”: Minimising a formula modulo leaves only models that have no strict -subs, so the claim becomes vacuously true.
”rule:inv”: Consider for example : as shown with ”rule:dynamic”, must be protected. More precisely, suppose is false in and is true. can be made false by clearing the wrong tree. Consider . Clearing changes on may no longer invalidate the formula. The rule ”rule:inv” extends this reasoning to first-order specifications that require a single protected element.
”rule:forall” and ”rule:exists”: The important aspect of quantification is that a variable becomes “hidden” from . If the interpretation of a variable had to be protected, the new context must ensure that the protection remains, even once the variable has become unreachable. functions as a guard: it links to and adds to the context.
There are two differences between ”rule:forall” and ”rule:exists” which make ”rule:forall” more relaxed.
First, the proviso in ”rule:exists” excludes formulas such as (with ). Taking the sub of an FLB may remove elements from the domain, so the existence of an element satisfying a static property is never guaranteed under subs. In ”rule:forall”, is possible as long as needs no protection (i.e. ), because a universal quantifier is not invalidated by domain reduction.
Second, the protection distance is systematically increased by in the case of ”rule:exists”, but not in the case of ”rule:forall”. For instance is preserved under , but is preserved under , not under : if a link between (the images of) and is created, clearing changes in ’s tree will unconditionally clear that link creation, thus invalidating the formula. So we either need to protect directly, or we need to protect a ball of radius at least around . In the case of ”rule:forall”, the asymmetry in the definition of modification is exploited: link deletions to protected trees may not be cleared, so it is not necessary to extend the protection radius. For instance, is preserved under .
Preservation becomes useful when one considers -minimal models of a preserved formula. First, we need to make the definition of explicit.
6.2 Unified Circumscription
Circumscription is an umbrella term for second-order characterisations of the minimal models of a first-order formula along an order. We combine general domain circumscription (GDC) [Doherty1998, McCarthy1980] and parallel predicate circumscription [McCarthy1986].
Any signature is partitioned into tuples of predicates and functions:
As in both GDC and parallel circumscription, some predicates are varying (). As in GDC, the domain is circumscribed and some predicates and functions are fixed on the restricted domain (). As in parallel circumscription, some predicates are circumscribed () and others are fixed on the initial domain (). For simplicity, we omit varying and fixed functions from the definition, as they are not needed here.
Such a partition on induces an associated order: for , whenever
The -minimal models of a formula can be described by a second-order formula:
where is a unary predicate symbol and is similar to . specifies that behaves like a domain (closed by function application, nonempty), is with all quantifications relativised by (e.g. becomes ), and symbols in substituting symbols in . means that every component of every relation in is in , and for two similar relational tuples, is the componentwise inclusion.
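Relativisation of quantifiers is a purely syntactic transformation: a universally quantified body is guarded by an implication from the domain predicate, and an existentially quantified body by a conjunction with it. The sketch below shows this standard construction on a small formula representation; the tuple encoding of formulas and the predicate name `Q` are assumptions made for the example.

```python
def relativise(phi, dom):
    """Restrict every quantifier in phi to the unary predicate named dom."""
    op = phi[0]
    if op == "forall":
        _, x, body = phi
        guard = ("atom", dom, x)
        return ("forall", x, ("implies", guard, relativise(body, dom)))
    if op == "exists":
        _, x, body = phi
        guard = ("atom", dom, x)
        return ("exists", x, ("and", guard, relativise(body, dom)))
    if op in ("and", "or", "implies"):
        return (op, relativise(phi[1], dom), relativise(phi[2], dom))
    if op == "not":
        return ("not", relativise(phi[1], dom))
    return phi  # atoms are left unchanged


# forall x. exists y. Link(x, y)
phi = ("forall", "x", ("exists", "y", ("atom", "Link", "x", "y")))
print(relativise(phi, "Q"))
```

The output reads as: for all x in Q, there is a y in Q such that Link(x, y).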
With a first-order formula on , the models of are the -minimal models of .
The proof builds upon [Doherty1998]. Given a model of and , the internal structure of can be “plugged in” the tuple and verifies the left-hand side of the main implication in ; the right-hand side implies that cannot be strictly smaller than . For the other direction, with a -minimal model of , we construct models from any that verify the antecedent, and by minimality of show that they verify the consequent. It is easy to see that can also contain FO formulas that use fixed or varying predicates [Etherington1986].
6.3 Application of unified circumscription to transitions
Let be a transition signature. Let with purely functional and purely relational. Let be the tuple of formulas of the form for . Consider the circumscription order induced by the following mapping:
That is, the precondition of a transition is fixed (), static information on the remaining elements may not change (, ), the postcondition can change freely (), and both the domain and changes are minimised (). We check that the change ordering is actually an instantiation of unified circumscription:
For , iff .
6.4 Main theorem
If is preserved under then is first-order expressible.
The proof gradually removes second-order quantification from (cf. section 6.2). First, the restriction to FLBs (by ) removes the universal quantification on . Next, the domain is covered by P and (by Support), so the universal quantification on can be removed. Next we show that minimal models of preserved formulas have changes localised around the images of the variables in and within a radius . With this bound on the changes present in the minimal models of , the universal quantification on can be replaced with first-order quantification. This translation is global and not compositional as in e.g. the reduction of some modal logics to FO.
If is preserved under , in and , then is in and .
A refinement of lemma 6.4. The proof of this lemma exploits the locality of changes and the functionality of Link and to switch quantifiers as necessary: modulo functionality of , .
Theorem 5.1 (restated)
If then and .
7 Related work
Circumscription dates back to [McCarthy1980]. We use an instantiation of unified circumscription, a new flavor of circumscription which generalises [Doherty1998]. Previous works on taming circumscription require global syntactic properties of the formulas [Doherty1998, nonnengart1999elimination, Conradie2006] and only consider satisfiability or FO-expressivity of circumscribed formulas. [Doherty2004] uses circumscription for characterising weakest preconditions to reactions. The Floyd-Hoare tradition extends to e.g. separation logic [reynolds2002separation], with an emphasis on model checking, and can allow more than 2 states, which can be first-class or modal [Reiter1991, harel2001dynamic], with a focus on program traces.
There are biological knowledge bases with different degrees of formalism [uniprot, biopax]. Other modelling uses resource-aware logics [Despeyroux2016, Boniolo2010], or logic rules for specification and modality for queries [pathwayLogic]. Full expressivity comparison with existing logics of changes (Hoare-like, modal, etc) would require more space than currently available.
8 Conclusion and future work
We have introduced a framework for describing and reasoning with molecular biology knowledge. We follow the tradition of taking graph rewriting as a domain-specific language for biology [kappa, BNGL]. Biological entities are described at the level of proteins in the form of bounded trees containing encodings of domains, subdomains and residues. Links between the trees represent protein-protein interactions. Proteins and their parts have both static and dynamic properties. Formulas represent observations of changes as a pair of forests Precondition, Postcondition with shared underlying sets. The theory of forest transitions is . This theory does not have decidable satisfiability, but modulo , the fragment has.
As a knowledge representation tool, the logic describes changes in a compositional way, and a closed-world assumption on changes can be applied with a minimisation operator , defined using a variant of circumscription. A proof system produces formulas that can be queried, in the sense that validity and satisfiability are decidable, including minimised formulas, which a priori were only second-order expressible.
The proof uses a semantic property, preservation, to ensure that the change-minimal models of deducible formulas are first-order expressible. In addition, syntactic manipulation modulo shows that deducible formulas are in the fragments and .
Importantly, some formulas with unguarded existential quantifiers can be first-order circumscribed. As future work, we plan to extend the definition of preservation to capture a larger class of formulas. In particular, we wish to identify a logical fragment where automatic synthesis of graph rewriting rules from -minimised specifications becomes a possibility. The hope is to assist and partly automate biological modelling, from the description of observations at a high level of abstraction, to the execution of simulations and the validation of hypotheses. Future research also includes optimising the compilation to first-order and introducing reaction rates, i.e. transition weights between the preconditions and postconditions.