Proofs of life: molecular-biology reasoning simulates cell behaviors from first principles

Science relies on external correctness: statistical analysis and reproducibility, with ready applicability but inherent false positives/negatives. Mathematics uses internal correctness: conclusions must be established by detailed reasoning, with high confidence and deep insights but not necessarily real-world significance. Here, we formalize the molecular-biology reasoning style; establish that it constitutes an executable first-principle theory of cell behaviors that admits predictive technologies, with a range of correctness guarantees; and show that we can fully account for the standard reference: Ptashne, A Genetic Switch. Everything works for principled reasons and is presented within an open-ended meta-theoretic framework that seemingly applies to any reductionist discipline. The framework is adapted from a century-long line of work on mathematical reasoning. The key step is to not admit reasoning based on an external notion of truth but work only with what can be justified from considered assumptions. For molecular biology, the induced theory involves the concurrent running/interference of molecule-coded elementary processes of physiology change over the genome. The life cycle of the single-celled monograph organism is predicted in molecular detail as the aggregate of the possible sequentializations of the coded-for processes. The difficult question of molecular coding, i.e., the specific means of gene regulation, is addressed via a detailed modeling methodology. We establish a complementary perspective on science, complete with a proven correctness notion, and use it to make progress on long-standing and critical open problems in biology.

Introduction

“A satisfactory description or computation of [modular organization, complex ensemble behavior that might be called emergent behavior, and robustness in biological processes] is a critical challenge for the future of biology” [1]. Concretely, the cited report calls for “effective conceptual treatment,” “a theory of constructive engineering principles of life,” and “a theoretical basis for how biological entities generate aggregates of higher complexity.” Relying on standard logical meta-theory and adaptations of known computation theory, we show that molecular-biology reasoning answers the triple call. Our approach is adapted from a century-long line of work on mathematics, including axiomatic reasoning [2], constructivity [3], and the Curry-Howard Correspondence [4, 5]. The line of work, including our adaptation, is mathematics itself and is one of several that use computers [6]. We complement the case for applied mathematics in natural science [7]: pure-mathematics methodology can also be unreasonably effective.

Towards the end of the 19th century, mathematicians started getting concerned about the use of increasingly advanced proof methods and the appearance of increasingly complex proofs. After several decades of foundational efforts on several fronts, Whitehead and Russell [2] succeeded in showing that a substantial body of mathematics could be undertaken using a small set of reasoning rules that seemingly avoided inconsistency problems: rule clashes that allow everything to be inferred. The response was a change in attitude throughout pure mathematics, such that readers needed to be convinced that proofs could be spelled out in axiomatic detail rather than be persuaded in an intuitive or a political sense [8]. Accompanied by precise definitions that the reasoning could proceed from, the effect was to democratize authority, including for young researchers. While opaque to non-users, having step-by-step correctness available cuts down on the issues to consider at any given time and turns development into a game to be explored: “[the methods] give valuable results in regions, such as infinite number, which had formerly been regarded as inaccessible to human knowledge” [2]. To make the activity feasible, to scale to bigger proofs, to automate relevant parts, to not overlook details, and to ensure general high levels of confidence, formal proofs can be constructed and checked for rule compliance on a computer [9, 10]. The fully-axiomatic style of working affects what intermediate results get used and helps “uncover new and rather elegant nuggets of mathematics” via low-to-high level optimizations [11]. Modes of computation that guarantee the logical correctness of their results are often made available as proof by reflection [11, 10], including the innate Curry-Howard, or ‘reasoning-as-computation’ [4, 5].

Constructivity (and its canonical form: intuitionism) is possibly the most subversive idea in mathematics [3]: it replaces appeals to what ought to or might as well be true with the stricter requirement that detailed justification is proffered, specifically that any proof of a compound statement can be decomposed to proofs of constituent statements that can be recomposed to prove the original statement. For mathematics, it spurns the equivalent axioms of A ∨ ¬A (excluded middle) and ¬¬A → A (double-negation elimination), for any formula, A. In the former case, neither A nor ¬A need be the case. In the latter, we would admit as justified anything additional (A) as long as this does not violate consistency (¬¬A). Our interest in constructivity is purely practical.
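The relationship between the two spurned axioms can be checked mechanically. The following is a minimal Lean 4 sketch, not taken from the article or its Coq development: the theorem names are hypothetical, and `A` stands for an arbitrary proposition. It shows that excluded middle for `A` yields double-negation elimination for `A`, and that the double negation of excluded middle is provable without any classical axiom, which is why adopting either axiom never violates consistency.

```lean
namespace ConstructivitySketch

-- Excluded middle for A gives double-negation elimination for A.
theorem dne_of_em (A : Prop) (em : A ∨ ¬A) : ¬¬A → A :=
  fun hnn => em.elim (fun ha => ha) (fun hn => absurd hn hnn)

-- Excluded middle cannot be refuted constructively:
-- its double negation has a direct proof.
theorem not_not_em (A : Prop) : ¬¬(A ∨ ¬A) :=
  fun h => h (Or.inr (fun a => h (Or.inl a)))

end ConstructivitySketch
```

Neither proof uses `Classical.em`; what cannot be derived in this setting is `A ∨ ¬A` itself for an arbitrary `A`.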

This article consists of a main body of text, end matter, and separate supplementary text, with sections and appendices. The main body is written for the broadest possible scientific audience and is organized around our overall construction as applied to molecular biology. The key specifics for the monograph are presented in figures that come with extensive legends. The figure legends are listed between the main body and end matter, specifically after Discussion.

The article is accompanied by an instructional video: “Using the CEqEA tool: A synthetic Genetic Switch”. The video is available at:

http://ceqea.sourceforge.net/extras/instructionalPoL.mp4

All technology we present, including the small kernel to verify rule compliance of axiomatic molecular-biology reasoning, is available as free software: CEqEA, or the Cascaded-Equilibria Emergence Assistant [12]. The requisite mathematical background is presented in Section S1, including computer verification of the key meta-theory of the tool by means of the Coq Proof Assistant [13], see Section S1.3. The meta-theory 1) guarantees that our axiomatization of molecular-biology reasoning makes sense, 2) clarifies fundamental molecular and structural aspects of biology as revealed in the reasoning, and 3) turns reasoning into engineering principles. Community-developed reasoning principles get selected for clarity, i.e., we ultimately rely on the power of simplicity [14].

We apply our molecular-biology technology to the standard reference: Ptashne [15], referred to as ‘the monograph’, see Sections 1,S2. The reasoning in the monograph is not as detailed as what we present, see Section S2.3.2, but it is a close fit, see Appendices SA–SG. In particular, it is only possible to make a factual distinction between our reasoning and the monograph content in the case of statements that are superseded by additions to the latest edition [15], see Appendix SH. Irrespective of any particulars of our framework, the closeness means that the monograph is coherent to an extent that rivals any mathematical text [9]. Of the several hundred molecular-biology texts we have studied for this work, none appeared to match the monograph’s level of precision. Our verification of the monograph proceeded at the same rate as for similar mathematical texts: 1–2 days per page, excluding framework development but including time to deeply understand the text. It is not necessary to trust our tool’s verification kernel: the proof details are listed in Section S2.3.2 and can be checked by hand.

1 A Genetic Switch

Molecular biology is the bottom-up view on gene regulation:

“Our goal is to understand gene regulation in terms of the interaction of molecules.”[15]:xiv:23/App.SA

Genes are segments of DNA that are used to produce sequence-determined molecules that play functional roles inside organisms. The molecules may be nucleic acids or polypeptides, often in specific conformation, e.g., proteins [16].

“The workings of every organism have been determined by its evolutionary history, and the precise description we give of a process in one organism will probably not apply in detail to another. The answer is to be found in the context of the fundamental biological process called development. Briefly put, the issue is as follows: all cells of a given individual organism inherit the same set of blueprints in the form of DNA molecules. But as a higher organism develops from a fertilized egg a striking variety of different cell types emerges. Underlying the process of development is the selective use of genes, the phenomenon we call gene regulation. At various stages, depending in part on environmental signals, cells choose to use one or another set of genes, and thereby to proceed along one or another developmental pathway. What molecular mechanisms determine these choices? The lambda life cycle is a paradigm for this problem: the virus chooses one or another mode of growth depending upon extracellular signals, and we understand in considerable detail the molecular interactions that mediate these processes. We believe that analogous interactions are likely to underlie many developmental processes; by establishing a description for the particular case of lambda, we develop ideas that inform other studies even though no other case looks exactly like lambda.”

[15]:xiii:7/App.SA

The key DNA segment for the organism here looks as follows, see Section S2.1:

The illustration shows the beginning of two genes (cI, cro), located on either side of a central region. The two genes are on separate DNA strands, which means they are transcribed in opposite directions: away from the central region. Transcription is done by RNA polymerase (RNAP): an enzyme that has the ability to separate the two DNA strands, read off the particular base pairs on one of them, and assemble a strand of RNA that matches the base pairs in a particular way. DNA segments that allow RNAP to bind in the requisite manner are called promoters. The illustration includes two promoters in the central region (PRM, PR) that permit the discussed transcription. The cI gene may also be transcribed from a promoter that is located some distance away from the central region (PRE). The produced RNA may later be translated to polypeptides that may fold to proteins, in this case CI (aka repressor) and Cro. The central region also permits specific binding of the proteins in question, with the locations referred to as operators (OR1, OR2, OR3). These bindings happen with different affinities, meaning at different effective concentrations, which serves as a basic determinant of the regulation of the two genes. The bindings may take place either intrinsically or cooperatively between operators, which affects both their strengths and durations, with varying effects on other bindings. For example, some operator bindings may prevent other bindings but they may also serve to recruit RNAP to promoters that would not otherwise get occupied (dashed promoter lines). Some of the cooperative bindings involve operators that are separated by several thousand base pairs, which entails super-coiling of the DNA. We must be able to accurately account for all these specifics, see later. Presence of CI will be associated with the considered organism surviving by its DNA being embedded (lysogenized) in and passively replicated alongside a host organism (the lysogenic cycle). 
Presence of Cro is associated with the host’s DNA-replication machinery being commandeered to produce on the order of a hundred copies of the parasite DNA, followed by host destruction (lysis) and the release of viable parasite offspring (the lytic attack). Interactions with the environment mean that the lytic attack results if many non-infected neighbouring hosts exist at the time of initial infection. Otherwise, the lysogenic cycle is established and maintained in the face of host replication, super-infection, and more. If an infected host has its DNA damaged, e.g., by UV light, the parasite has the ability to switch its mode of growth from an ongoing lysogenic cycle to the lytic attack (induction). The combination of a (stable) lysogenic cycle and a (destructive) lytic attack with induction is called the temperate life cycle.

2 Overview

We have identified 481 factual statements in the monograph and classified them into eleven levels of abstraction, see Appendices SA–SH. Five levels with 180 statements make up the organism-specific reasoning, see Theorem 2. Two levels with 23 factual statements address general aims and means of molecular-biology reasoning, i.e., its foundation. One level covers counterfactual reasoning about select mutants: we address mutations systematically, with a corresponding focus. Two levels concern, e.g., DNA conformation and consequences: these statements provide background, remain implicit to the reasoning, and are not discussed here. The final level contains 18 reasoning statements that are implied to be superseded by statements added to the latest monograph edition [15].

Theorem 0 (Overview)

Of the 180 reasoning statements in the monograph, 81 (‘population inferences’, ‘causation arguments’, ‘pathway arguments’, ‘cell behaviors’) follow progressively and axiomatically from the 99 (‘molecular basis’) within a logic that is consistent and compatible with the 23 foundation statements. The 18 superseded statements require assumptions contradicting the 99 and are contradicted by inferences not covered by the 81.

The work involves extracting theory from practice and developing meta-theory:

Theorem 0 (…)

Biological reasoning over a ‘molecular basis’ can be defined as open-system concurrent computation over the discrete manifestations of the molecule-coded regulation seen as ‘elementary processes of physiology change’. The mode of computation is induced from the reasoning rules’ meta-theory-determined structural proof theory and simulates reasoned-about cell behaviors: given execution will exhibit the considered phenotype property for the specific reasons from the genotype. Computation allows us to develop a phenotype for the monograph organism that incorporates the recent discoveries, is structured according to the coded-for dynamics, and revises but subsumes the monograph’s observation-based account. Our prediction of the possible sequentializations of the monograph concurrency reifies its life-cycle illustration.

3 Principles

In order to arrive at the axiomatization, we noted that the organism-specific reasoning in the monograph and throughout molecular biology covers five progressive levels of abstraction from molecules to biology, see Theorem 2. We:

  1. identified the levels of abstraction used,

  2. represented each level in a symbolic form, and

  3. connected the symbolic forms while ensuring smoothness by:

    a. fine-tuning the symbolic forms to be conceptually pure and distinct,

    b. abstracting each symbolic form to have a straightforward appearance,

    c. introducing intermediate symbolic forms as necessary, and

    d. parameterizing technical choices, e.g., as modalities, see Section S2.1.

Applying 1–3 is an exercise in taming concepts and combinatorics, which is a staple of the computer-science discipline of formal methods and of structured programming. Here, it was guided by the bottom-up nature of molecular biology, with existing staging already doing some constraining of technical issues. What we want to accomplish with 1–3 will have been increasingly accomplished with the maturation of a discipline. Pervasive abstraction, 3b, facilitated smoothness and, with 3d, was necessary to account for all details in the monograph, including interaction variations, cf. Guet et al. [17]. All perceived non-linearity was seen to be the result of nested simplicity, see Wolpert [18]. The requisite tautness of the permitted reasoning (see Theorem 8, Proposition 10) was helped by conceptual purity, 3a. Our final axiomatization proceeds across a dozen stages, 3c, with the load distributed evenly in small and surprisingly standard technicalities.

(A)

+ !         |  $proteinC / cro   |  @PR_tr  ;                 //[15:13]

(B)

@PR_tr   =  RNAP.tr                                           //[14:6][6:-9a]
             |-- @OR1xor2_Cro                                 //[94:1][93:F4.24:1]
               : @OR1xor2_Cro +- !                            //[94:1][93:F4.24:1]
               : @OR1and2_Cro -- !                            //[94:1][93:F4.24:1]
               : @OLR12?_CI_rng  ! {|incdntl(CI)<mdtr,sink>}  //[111:F5.3:1][24:-8]
               ;

(C)

@OR1and2_Cro<@OR1_Cro,@OR2_Cro>  =  max&(@OR1_Cro,@OR2_Cro) ; //[93:F4.24:1]
@OR1xor2_Cro<> = min:(@OR1_Cro,@OR2_Cro) excl @OR1and2_Cro-- ;//[93:F4.24:1]

(D)

@OR1_Cro          =  @OR1_Cro_int     |--   @OLR12?_CI_rng ;  //intrinsic[17:6]
@OR2_Cro          =  @OR2_Cro_int     |--   @OLR12?_CI_rng ;  //intrinsic[17:6]

(E)

@OLR12?_CI_rng<>  =  @OLR123_CI--     : @OLR12_CI-- ;         //abbreviation

(F)

@OLR123_CI        =  (+) @OR3_CI_int  ;                       //[112:9]
@OLR12_CI         =  (+) @OR1_CI_int  excl  @OLR123_CI-- ;    //[86:F4.18][110:-4]

(G)

@OR1_CI_int<>     =  CI.l ;                                   //[86:F4.18:1]
@OR3_CI_int<>     =  CI.h ;                                   //[86:F4.18:1]

(H)

@OR1_Cro_int<>    =  Cro.h ;                                  //[25:F1.23:1]
@OR2_Cro_int<>    =  Cro.h ;                                  //[25:F1.23:1]

(I)

 $genomeI / cI    .  $proteinC / [ 0 {l} {m} {h} ] ;          //[12:-8]
 $genomeI / cro   .  $proteinC / [ 0 {l} {h} ] ;              //[12:-8]
Figure 1: MIG/RI-specification: regulation of gene cro in the [AGS3] genotype

For using the axiomatization, the starting point: how to formally represent the real world, requires considerable attention, as outlined in Figure 1. Section S2.1 provides the full details, following best practices in formal methods: we prefix a formal language (MIG, for Modal Influence Graph) designed for transparency with a modeling methodology (RI, for Regulation Interface, see Section S2.1.1) designed for safety. We refer to the combination as MIG/RI specification. RI is given with motifs for capturing common types of molecular interaction, amounts to incrementally writing down organisms’ molecular programming with no second-guessing allowed, and involves working over states that are physiologically determined: the nominal concentrations for which bindings may have an effect. Covariant, neutral, and contravariant effects are possible. Covariant effects derive from what is ostensibly specified, typically expression. Contravariant effects result from syntax annotations that our tool uses to determine, e.g., when decay explicitly outpaces production. Neutral effects may result both from lack of action and, e.g., from balanced decay and production. An example scenario: the monograph organism contains a gene (cI) that is transcribed from two promoters that require physiological presence in one of several configurations of specific recruiters of the transcribing molecules to have material effects of varying rates. The promoters may be inhibited in several ways with varying rates of effect-from-inhibitor-concentration change. Additionally, the gene product may be subject to proteolysis by a host protein. The requisite notion of compound contravariance is a combinatorial problem for MIG/RI specifications, cf. Ptashne [19], see Hay et al. [20]. Relatedly, certain combinations of interactions may preclude each other. These are identified as logical contradictions and similar incompatibilities. 
Coded-for observable concentrations are automatically inferred later, once everything that may affect the matter is known, see Proposition 10, cf. Guet et al. [17]. More generally, a key objective at the starting stage is to actively avoid category mistakes: we use categories to pinpoint details and to make sure components match their usages and are not conflated, cf. Ptashne [21].

Definition 1 (Instance-sorted states, homogeneity)

Consider some notion of states sorted into schema instances: different concentrations of a molecule population will, e.g., be states of the same sort. A set of states is homogeneous if it does not contain two states of the same sort.
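Definition 1 can be made concrete with a small Python sketch. The `State` type, its field names, and the example populations are illustrative assumptions rather than the representation used by the CEqEA tool; the sketch only encodes the stated condition that a homogeneous set holds at most one state per sort.

```python
from typing import Iterable, NamedTuple


class State(NamedTuple):
    """Hypothetical state: a sort (e.g. a molecule population) at a level."""
    sort: str   # schema instance, e.g. "CI" or "Cro"
    level: str  # e.g. a nominal concentration: "0", "l", "h"


def is_homogeneous(states: Iterable[State]) -> bool:
    """True if no two distinct states in the set share a sort."""
    seen_sorts = set()
    for state in set(states):  # set semantics: duplicates are one state
        if state.sort in seen_sorts:
            return False
        seen_sorts.add(state.sort)
    return True


# One concentration per population: homogeneous.
ok = is_homogeneous({State("CI", "l"), State("Cro", "0")})
# Two concentrations of the same population: not homogeneous.
clash = is_homogeneous({State("CI", "l"), State("CI", "0")})
```

Treating the input as a set means that listing the same state twice does not count as a clash, which matches the set-based wording of the definition.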

Definition 2 (Basics)

  • g ranges over formal genotypes, see Figure 1 and Section S2.1.

  • and range over sets of states, with used for compartments.

  • , range over causations: Reactants, Mediators, Products, Inhibitors, see Figure 3.

  • means ‘ codes for ’, see Figure 3 and Section S2.2.

  • Coinhibition is causation-on-causation ‘inhibition’, for when one of two opposing causations is stronger — coinhibition-free() is a predicate for its absence.

  • A genotype, , is regular if implies that and contain one state each of the same instance sort, meaning if all causations are concentration changes, or similar.

Definition 3 (Positively-validated causations)
Definition 4 (Coextension logic)

Compartment changes, A ↪ A′, are justified, g ⊢^co A ↪ A′, from genotypes, g, by:

[(interference)       if ] g^co A(A _i R_i)_i P_i
(sequence)
    g ⊢^co A_a ↪ A_b      g ⊢^co A_b ↪ A_c
    ----------------------------------------
              g ⊢^co A_a ↪ A_c
Figure 2: Coextension logic (default modalities)

(B)

“In a lysogen, repressor bound at OR1 and OR2 keeps cro off while it stimulates transcription of its own gene cI”   [15]:22:1/App.SE

“More Cro is made until it reaches a level at which OR1 and OR2 are also filled and polymerase is prevented from binding to PR”   [15]:25:7/App.SE
Figure 3: Causations vs ‘causation arguments’ for protein Cro; nested inhibition

The specifics of the last two stages —which is what a reasoner works with— are due to us: causations and compartment changes, see Figure 2. Causations consolidate the information in ‘causation arguments’, see Figure 3, i.e., resulted from 3a,3b, and include consideration of the difference between their construction (as the discrete manifestations of regulation, typically concentration changes) and their use (as elementary processes of physiology change, typically changes to the bindings that the gene product may be involved in). Causations have formative relevance to later technologies, see Propositions 7,10, and seemingly to biology itself, see Discussion. Compartment changes sit between ‘pathway arguments’ and ‘cell behaviors’, see Figure 4, i.e., resulted from 3a,3c.

(D)

    g ⊢^co {CI.l,Cro.0,DNA.ss,RecA.*} ↪ {CI.0,Cro.0,DNA.ss,RecA.*}   [[[SOS] ;  [],[PR_tr],[PRM_tr]]]
    g ⊢^co {CI.0,Cro.0,DNA.ss,RecA.*} ↪ {CI.0,Cro.l,DNA.ss,RecA.*}   [[[PR_tr],[SOS] ;  []]]
    ----------------------------------------------------------------
    g ⊢^co {CI.l,Cro.0,DNA.ss,RecA.*} ↪ {CI.0,Cro.l,DNA.ss,RecA.*}

(E)

[CI.l, Cro.0, RecA.*] + [DNA.ss]
-> [SOS]
x> ![] |-- [DNA.ss] ;              [PR_tr] |-- [CI.l!@[OLR12,OLR123]] ; [OLR12_CI, PRM_tr] |-- [SOS]
[Cro.0, RecA.*] + [CI.0, DNA.ss]
-> [PR_tr] ; [SOS]
x> ![] |-- [DNA.ss]
[Cro.l, RecA.*] + [CI.0, DNA.ss]
Linearized coextension derivation Section S2.3.2:2/c/
“Two changes result [from the SOS response’s RecA*-mediated cleavage of CI]. First, as repressor vacates O1 and O2 the rate of repressor synthesis drops (because repressor is required to turn on transcription of its own gene); and second, polymerase binds to P to begin transcription of cro.” ‘Pathway arg.’  [15]:24:-7/App.SF
Figure 4: Coextension derivations vs ‘pathway arguments’

For precise definitions, well-formedness is always a primary concern. In case of a reasoning axiomatization, we need to consider, e.g., inadvertent admittance of paradoxes. The problem is that simple rules do not necessarily result in simple behaviors when put together, or even in isolation. Consistency (Theorem 1) guarantees that a given system of rules makes distinctions: some form of sense is being made. Paradoxes would make everything provable, see Section S1.1.4.

Theorem 1 (Consistency)

Coextension logic (Figure 2) cannot justify arbitrary compartment changes.

To address the sense being made, we note that molecular-biology reasoning is scientifically reductionist:

“Biologists work on systems that have evolved. This gives us hope that any given case can be understood reductively. Nature built the system in steps, each step making an improvement on the previous version and so, this line of thought goes, the investigator can take it apart, study it in bits, and, perhaps, see how it all fits together.” [15]:xi:1/App.SB

The corresponding meta-theoretic property for axiomatizations is constructivity: any proof of a compound statement can be decomposed to proofs of constituent statements that can be recomposed to prove the original statement. A key insight here is that constructivity affirms Ptashne’s ‘perhaps’: the justification for a biological property is always meticulously and exclusively stated in terms of molecular interactions. Decomposition cannot be guaranteed, e.g., with state-space modeling and truth-based reasoning, meaning properties there may have spurious justifications, see Discussion. The next result needs unpacking:

Theorem 2 (Coding)

Assume states, s, in a genotype, g, are testable: ¬s ∨ ¬¬s holds. If g ⊢^co A ↪ A′ can be justified in coextension logic then the corresponding propositional implication is provable in the canonical constructive logic, intuitionistic logic, under an encoding of coextension syntax as (varying) mathematical formulas. In particular, genotypes code for compartment changes in the computable sense, up to timing, stochasticity, and divergence.

The states in a MIG/RI-specification represent molecular ground notions, typically concentrations. Testability is an extra-logical open-system assumption prescribing that populations are either considered at a different concentration than the one at hand (¬s) or no conflict arises if we insist on the one (¬¬s): it is already there or the population was not being considered. Constructively, not-not does not cancel out, meaning ¬¬s and s are not identical, see Discussion.

The stronger decidability assumption for states, that s ∨ ¬s holds, is invoked with state spaces and truth-based reasoning. Subverting testability, decidability is an extra-logical closed-system assumption, see Sections S1.1.6,S1.1.9, cf.

“We wish to understand which steps are controlled by internal cellular programs and which by extracellular signals.” [15]:3:1/App.SB
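The sense in which decidability is the stronger assumption can be checked in a few lines of Lean 4. The sketch below assumes the usual logical reading of the two notions, with testability of a state `s` as `¬s ∨ ¬¬s` and decidability as `s ∨ ¬s`; this reading and the theorem names are assumptions made for illustration, not quotations from the article's formal development.

```lean
namespace TestabilitySketch

-- Decidability of s implies testability of s.
theorem testable_of_decidable (s : Prop) (h : s ∨ ¬s) : ¬s ∨ ¬¬s :=
  h.elim (fun hs => Or.inr (fun hn => hn hs)) (fun hn => Or.inl hn)

-- Having s always gives ¬¬s; the converse is what constructive
-- reasoning does not provide in general.
theorem not_not_of_holds (s : Prop) (hs : s) : ¬¬s :=
  fun hn => hn hs

end TestabilitySketch
```

The reverse implication, from testability to decidability, has no proof of this kind, which is consistent with treating testability as the weaker, open-system assumption.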

While the result establishes that molecular-biology reasoning is a special case of mathematical reasoning, we need more to avoid false-positive verification.

The Curry-Howard Correspondence (CHC) establishes that an intuitionistic proof is a computable function that does what is proved [4, 5, 22]. CHC is the mathematics version of the reasoning-computation correspondence for molecular biology we develop here. Theorem 2 shows that there exists a computable function that takes as arguments a genotype, g, two compartments, A, A′, and evidence that the three are coextension related: g ⊢^co A ↪ A′. With the g-coded causations as building blocks, it then mimics the compartment change. The function could be presented as a terminating Turing Machine [5], cf. Brenner [23]. Call it MolBio. It remains terminating but becomes organism-specific when specialized to a given g. To the extent of coextension logic’s expressivity, this means that genotypes translate to computable behavior when we have a way of taking an A and producing a derived A′, see Proposition 7. 1) The result constrains the sense that coextension logic makes to something reasonable: provided choices are fixed at the start, a genotype, g, can accomplish only what MolBio specialized to g can, based exclusively on information in g, hence ‘coding’. 2) The result does not address our ultimate concern: the means of computation, including anything pertaining to choices. The mathematics means are a form of closed-system sequential computation, as seen with CHC. The molecular-biology means are a form of open-system concurrent computation, as we show. ‘Open-system’ refers to the fact that molecular-biology reasoning explicitly considers choices that are not made up front but may come at any time, including from the environment. Specifically, ‘timing’ concerns the relative durations of causations, ‘stochasticity’ is the possibility that causations may proceed from concentrations other than the nominal physiological ones, and ‘divergence’ refers to conflicting causations. We discuss the issues in the relevant technical context following Proposition 7.
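The mathematics side of the correspondence can be illustrated with a minimal Lean 4 sketch. It is a generic textbook instance of Curry-Howard, not the MolBio function itself, and the names are hypothetical: a proof that two implications can be chained is literally the program that composes two functions, much as the (sequence) rule chains two compartment changes.

```lean
namespace CurryHowardSketch

-- As a proof: from A → B and B → C, conclude A → C.
theorem chain (A B C : Prop) (f : A → B) (g : B → C) : A → C :=
  fun a => g (f a)

-- As a program: the same term composes two functions on data.
def compose {α β γ : Type} (f : α → β) (g : β → γ) : α → γ :=
  fun a => g (f a)

end CurryHowardSketch
```

The two definitions have the same body; only the universe of the types differs, which is the sense in which the proof "does what is proved".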

The constructivity in Theorem 2 is indirect: its de- and recomposition of proofs is relative to mathematical formulas. We also have direct constructivity.

Proposition 3 (Constructivity)

Any compartment change from a regular genotype, g, can be justified by interfering g-coded causations (in an arbitrary order).

4 Engineering: inner reasoning structure

Logical meta-theory is not mathematics-specific and applies here, too. With molecular biology being concrete, several issues assume special interest.

Theorem 4 (Consequence)

For regular genotypes, coextension logic is a logic: anything assumed follows as a consequence and if all of one set of assumptions are consequences of another then all consequences of the first are also consequences of the second. In formal shorthand, with implicit outer ‘forall’s ():

In principle, Theorem 4 is a basic sanity check akin to Theorem 1, but its justification includes its proof and corollaries. Given a reasoning axiomatization, ‘structural proof theory’ refers to the inner structure of the axiomatized reasoning style: its geometry. Our core point is that inner reasoning structure constitutes ‘engineering principles’ for the subject matter in case of (direct) constructivity, see National Research Council (US) [1]. A key methodological point is that the properties in Theorem 4 help establish structural proof theory. For example, the property in Corollary 5 follows from Theorem 4 but is often difficult to establish by itself, partly because it reveals that reasoning in a logic allows for manipulation of assumptions even if the specific rules do not, as here.

Corollary 5 (Cut)

For regular genotypes, the cut rule is admissible for coextension logic: adding it to the defining rules would not allow us to justify more compartment changes. In formal shorthand, with ␣ meaning ‘the (non-depleted) genotype that also codes for ’:

(Cut is not used for any reasoning here nor is it added to Figure 2. The formal result is in a more restricted form, see Section S1.3.2.)

Concretely, the cut rule addresses when the manifestations of genes may take the place of the genes themselves, e.g., medicinal intervention () to ensure unchanged cytoplasm operation () in case of a depleted genotype (). This is not a simple issue. More work on cut remains.

The properties in Theorem 4 are an example of induction loading. When a result (Corollary 5) is difficult to prove, a stronger result (Theorem 4) can be easier owing to stronger induction hypotheses: an analytic insight into the abstract nature of the problem is being expressed synthetically in the stronger result. Theorem 4 captures that the key issue in structural proof theory, i.e., in understanding how proofs hang together, is the interplay between assumptions (here: genes) and their usage (here: expression), with the proof having to account for how the interplay propagates across the axiomatization (here: compartment dynamics). To address the issue, we defined (and proved equivalent, see Propositions 3,6) a version of coextension logic where interference is between either nothing, one causation, or two interferences, i.e., where interferences are arbitrarily ordered. Ordered interference enjoys smoother proofs of the two properties, i.e., ordering serves to separate concerns. The other concerns amount to order manipulation.

Proposition 6 (Normalization)

Causation interferences for a compartment change that have been justified in a specific order can be justified concurrently.

Without Propositions 3,6, distributed/ordered vs. concurrent/unordered interference becomes problematic: which is right? We use unordered interference because it is more direct, has the least administrative overhead, see Figures 2,4, and allows us to reason over the graphs of causations: cascaded causation diagrams, see Figure 5. Graphs are good for visualization, automation, and analysis.

Figure 5: [AGS3] cascaded causation diagram

Proposition 7 (Automated reasoning)

Letting a cascaded causation diagram for a genotype, , act successively on compartment to change it to builds a coextension derivation of , see Figure 4 and video.

A cascaded causation diagram acts on a compartment by considering the nodes whose content is in the compartment; considering their out-edges; excluding edges that are (co-)inhibited; and updating the compartment with the states from the targeted nodes, provided the result is homogeneous, see Section S1.4.2. For reasons we discuss along with Proposition 10, we call the action CCP (for Calculus of Coextensive Processes). To be exact, it is CCPall, where ‘all’ is a strategy for choosing out-edges. The ‘all’ strategy is appropriate when timing is trivial: all causations proceed at the same timescale, and stochastic behaviors are ignored: operators only function in line with the concentrations that nominally lead to their differential occupancy. Our CCPall implementation is interactive in part to allow timing and stochastic issues to be modeled manually, as needed, see video. We say we have encountered a divergence when a targeted compartment is not homogeneous, i.e., if two causations attempt to send some population towards different concentrations or a causation attempts to undermine a (non-auto-)mediating population. Coextension logic is designed to allow for reasoning about all that may happen while cascaded causation diagrams and CCPall provide a convenient platform for attempting to predict what will happen.
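As a rough illustration of the action just described, the following Python sketch pursues all enabled, non-inhibited edges in one step and reports a divergence when two edges send the same population towards different states. It is a toy under stated assumptions, not our tool: the names `Edge`, `make_edge`, `step_all`, `run_all`, and the populations `P` and `Q` are hypothetical, compartments are simplified to a mapping from population to state, and the second kind of divergence (undermining a mediating population) is not modeled.

```python
from dataclasses import dataclass


class Divergence(Exception):
    """Two enabled edges send one population towards different states."""


@dataclass(frozen=True)
class Edge:
    source: frozenset      # (population, state) pairs that must be in the compartment
    inhibitors: frozenset  # (population, state) pairs that exclude the edge
    target: frozenset      # (population, state) pairs written when the edge is pursued


def make_edge(source, target, inhibitors=None):
    """Build an Edge from plain dicts (hypothetical helper)."""
    return Edge(frozenset(source.items()),
                frozenset((inhibitors or {}).items()),
                frozenset(target.items()))


def step_all(compartment, edges):
    """One 'all'-strategy step: pursue every enabled, non-inhibited edge at once.

    Returns the updated compartment, or None if no edge is applicable.
    Raises Divergence if the targeted compartment would not be homogeneous.
    """
    present = set(compartment.items())
    enabled = [e for e in edges
               if e.source <= present and not (e.inhibitors & present)]
    if not enabled:
        return None
    updates = {}
    for e in enabled:
        for population, state in e.target:
            if updates.setdefault(population, state) != state:
                raise Divergence(population)
    return {**compartment, **updates}


def run_all(compartment, edges, max_steps=100):
    """Iterate step_all until nothing applies, a state repeats, or max_steps is reached."""
    trace = [dict(compartment)]
    for _ in range(max_steps):
        nxt = step_all(trace[-1], edges)
        if nxt is None or nxt in trace:
            break
        trace.append(nxt)
    return trace


# Toy diagram (assumption, not the [AGS3] diagram): Q rises stepwise unless P is high.
toy_edges = [
    make_edge({"Q": "zero"}, {"Q": "low"}, inhibitors={"P": "high"}),
    make_edge({"Q": "low"}, {"Q": "high"}, inhibitors={"P": "high"}),
]
```

Calling `run_all({"P": "zero", "Q": "zero"}, toy_edges)` walks `Q` up through its states, whereas starting from `P` at `"high"` leaves the compartment unchanged because both edges are inhibited. Stopping on a repeated state stands in for the manual loop breaking mentioned in the text.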

Apropos of Theorem 2, we note that the monograph organism exhibits only CCPall-divergences involving mediation and only in a few edge cases that are easily resolved. As seen when we do rapid phenotyping below, mutants of the monograph organism exhibit divergences proper as a result of overlapping cooperative bindings that are persistent and mutually exclusive. We would imagine programmatic/divergence-free behavior (pathways, to a first approximation) is pervasive within evolved organisms, cf. [15]:3:1/App.SB and Brenner [23].

5 Emergence: partitioning the reasoning space

Technically, CCPall halts when it encounters a divergence, when no causations are applicable, or when directed to do so by MIG/RI-specified expiration (here: to capture host lysis). We can also manually interrupt CCPall, e.g., to break loops, and initiate or alter progress to address specific points of interest, e.g., fresh/super infection, (loop) perturbation, and environmental signals. Progress alteration may come at the price of a new coextension derivation, see Proposition 7.

Host is healthy or not:

stands for ‘absent a host SOS response’, its negation ‘during a host SOS response’;

Host is doomed or not:

stands for ‘with CII below high concentration and with CI (repressor) below physiological concentration’, its negation ‘with CII at high concentration or with CI at physiological concentration’.

  1. lysogeny’s means take precedence over a lytic attack

  2. CII determines ’s pathway at fresh infection : lysogeny at high vs lytic below

  3. the lysogenic cycle is homeostatic, i.e., is sustainable and absorbs perturbations

    1. protein concentrations are CI controlled in the lysogenic cycle

      1. an initial CI concentration is and must be established by high CII

      2. once established, CI remains physiological, Cro non-physiological

    2. a switch to the lytic attack is not effected by

      1. natural or operative means — the basis of sustainability

      2. Cro-perturbation — the key means of homeostasis

      3. super-infection -immunity

    3. a switch to the lytic attack is effected by UV-irradiation (via )

  4. the lytic attack expires the host and, else, is sustainable but not homeostatic

    1. protein concentrations are programmatic during a lytic attack

      1. the lytic attack is constitutive or

      2. once initiated, Cro remains physiological and CI inoperative or

      3. a lytic attack expires the host after Cro has reached high or

    2. ignoring host expiration, a switch to the lysogenic cycle is not effected by

      1. natural or operative means or — the basis of sustainability

      2. CI-perturbation , although the lytic attack may be set back

      3. super-infection or — an aspect of anti-immunity, see Retrodiction (V)

    3. ignoring host expiration, a switch to lysogeny is likely to be effected by

      1. CI-perturbation (and for the lytic attack to be viable, see 3/d/)

    4. a lytic attack is unlikely and

Figure 6: [AGS3] ab-intra phenotype (computer-verified)

Theorem 8 (Ab-intra phenotyping)

Figure 6 is an ab-intra phenotype for the monograph organism: the properties reflect the structure of the reasoning space, see video. In particular, the reasoning behind Figure 6 1) proceeds from the monograph’s (regular) genotype, 2) is generated automatically by Figure 5 acting under CCPall on user-given start compartments: the s, with termination conditions that generate other s, e.g., a few steps or until expiration, looping, or divergence, 3) has been computer-verified to comply with Figure 2, 4) contains the monograph’s extant reasoning, and 5) covers the monograph’s extant phenotype inferences. The monograph’s superseded parts are incompatible with Figure 6 and its justifications.

The result does not render molecular-biology reasoning trivial. Instead, it shifts the burden to the choice of start points and termination conditions. We make the choices based on what is automatable, with special consideration of real-life events, symmetries, and completeness. The term ‘ab intra’ is due to us but refers to a known phenomenon: axiomatic reasoning is a natural Occam’s razor with the effect of streamlining the choice and statement of properties as dictated by the interaction of lower-level issues [2, 24, 11]. For example, an early milestone in computer-verified reasoning showed that the property progression in the standard reference for axiomatic mathematical reasoning: Whitehead and Russell [2], allows for a similar kind of automation as here [25]. The wider challenge there at this time is to integrate the in-the-small issue of automation with knowledge-management technologies and user-driven proving strategies [9].

The claim in Theorem 8,1) is the subject of Section S2.1: MIG/RI specification of the monograph, see Appendix SC. The scenarios in 2),3) describe use of our tool, see Section S2.3 and video. The claims in 4),5) are documented in Appendices SD–SG. The last claim is documented in Appendix SH: Retrodiction (I) below contradicts the superseded inferences while Retrodiction (III) contradicts the superseded assumptions.

6 Monograph Retrodiction

We refer to our formal treatment of the monograph organism as [AGS3].

(I)

[AGS3] does not require Cro action at OR3 against CI, i.e., it goes further in the direction that the monograph took the -narrative in its latest edition:

“that Cro must bind OR3 to trigger the transition to lytic growth, although not excluded, remains uncertain.”[15]:121:-4/App.SH

The quoted mechanism was inferred indirectly, based on OR3 conservation prior to the recent discovery that an adjacent CI-octamer binding mediates long-distance CI-tetramerization across O{L,R}3 [26, 27, 28]. It is contradicted by Figure 6:0/, which is a consequence of the discovered elaborately-cooperative CI-binding, see Section S2.3.2:0/. See also Section S2.5.1.

(II)

[AGS3]’s two modes of growth have different stability properties: lysogeny is homeostatic proper; the lytic attack is only so in parts, see Retrodiction (III), and else is not subject to perturbation, see Retrodiction (IV).

(III)

[AGS3]’s lytic attack reaffirms that Cro action at OR3 is CI-contravariant only late in the attack. The late action increases offspring production:

“If repressor were added to a phage beginning its lytic cycle, growth would be inhibited.”[15]:62:-1/App.SG

The relevant Cro-affinity is strong but involves non-cooperative binding with a linear effect function: the effect on CI is first neutral then negative.

(IV)

[AGS3]’s switch is made efficient partly by not involving Cro action at OR3: Figure 6:0/ implies that proteolysis will have left free CI at zero/sub-physiological concentration once CI’s elaborately-cooperative promoter control ceases prior to a switch.

(V)

[AGS3]-variants may exhibit anti-immunity without inhibition of cII, cf.:

“The anti-immune phenotype is evidently a consequence of partial repression by Cro of PL and PR and hence diminished expression of cII and cIII.”[15]:92:-7

A variant is anti-immune if it can prevent lysogeny by a super-infecting wild type in a contrived lysis-free ‘lytic cycle’. Weakened Cro binding at OR{1,2} suffices, see Section S2.5.7: higher Cro concentration will be maintained, resulting in continuous CI-contravariance, see Retrodiction (III).

7 Simulation

Figure 7: Molecular-biology reasoning as computing

Summarising the applied side up to this point, we note that we may ignore the reasoning aspect and view cascaded causation diagrams simply as executable code under CCPall, see Proposition 7 and Figure 7.

Proposition 9 (Open-system concurrent simulation)

Figure 5 simulates the monograph organism under CCPall: its execution exhibits the considered phenotype properties for the specific reasons from the genotype, including in response to user-imposed differential timing (pursuit of only some enabled causations), environmental signals (external stochasticity: pursuit of edges guarded by non-considered populations), and perturbation (internal stochasticity: pursuit of edges guarded by concentrations other than considered ones). Divergences can be resolved by the user or our tool can do it at random, along multiple strategies.

To be clear, we are dealing with three distinct perspectives: 1) real-life molecules and the biology they sustain, 2) the textual molecular biology in the monograph that reasons about the connection, and 3) our framework that sits in between the other two perspectives. The text reflects the defining features of the considered real life. Our formalism reifies the text in its entirety. The final question is whether all parts of our formal model are related to the considered real life. The fact that Section 6 seems to resolve uncertainties rather than add surprises suggests an affirmative answer. Methodologically, there are three possible sources for any differences: A) the formalism involves principles that do not match real life, B) the formally-specified organism is a poor match for the idealised real-life organism, and C) the idealised organism is a poor match for the real-life organisms, either due to heterogeneous populations or poor understanding. Figure 1 and Section S2.1 represent our best efforts at addressing B) while C) is partly outside of scope and partly tied to A): can our execution complement molecular-biology groundwork, i.e., to what extent is CEq a first-principle framework for real life? Partly to address A) beyond Section 6, we now present derived technologies that involve consideration first of all computation and then of segmented computation that need not reflect the reasoning structure.

Sequentialization

CCPall execution is best thought of as the interaction of edges as standalone entities rather than the action of a graph on states. ‘Interacting edges’ is a process calculus perspective, a class of models of open-system concurrent computation [29]. Process calculi address not just interleaving or parallel execution of sequential computation but concurrency as a principal notion. Execution in process calculi, including CCPall, need not have designated beginning or end: we are concerned mostly with them just running, meaning they address computation but not computability per se. Their result notion is external [30], as we discuss. Our tool supports the edges-as-processes view by showing implied edges when a node is not used directly but its constituent parts are used separately. The process view provides insights into the workings of an organism and allows us to explore why pathways do or do not exist, see Section S2.5.1.

Figure 8: [AGS3] synchrotype prediction

Proposition 10 (Open-system synchrotyping)

Our prediction of the possible sequentializations of Figure 5 seen as concurrent CCPall code: the [AGS3]-synchrotype, reifies the monograph’s life-cycle illustration, with molecular pathway deciders added, including outside factors, see Figure 8. Synchrotypes operate over observable states, found by factoring out ‘stochastic fluctuations’ [17], etc.

We introduce the name ‘synchrotype’ for the predicted kind of information, to complement ‘genotype’ and ‘phenotype’. The core of our prediction technology is an adaptation of trace monoids as used for process calculi. Conceptually, trace monoids bundle together synchronized processes, leaving only synchrony-breaking changes, i.e., any sequential progress. Technically, trace monoids are to be defined over all linearized coextension derivations (‘CCPall traces’), see Figure 4, by factoring out how process/causation occurrences may be reordered across interferences, see Section S1.5. The main difficulty here is that causations do not come with explicit handshaking constructs that control synchronization/reordering. For that, we need to integrate inhibition into cascading, etc. For example, the lytic-attack processes (cro-transcription) can only come after those that maintain lysogeny (cI-transcription from PRM) because CI at OR{1,2} effects synchronization by simultaneous inhibition respectively auto-mediation. Our prediction technology applies to cascaded causation diagrams, with matching motivation. In principle, we collapse looping edges and take intermediates of the affinity-transposing states in the connected nodes as ‘observable states’:

“As [Cro] binds it turns off [PR], but as the cells grow and divide, the concentration of [Cro] drops, and [PR] turns on again. A steady state is reached at which the rate of synthesis of [Cro] just balances its rate of dilution and, presumably, a constant concentration of [Cro] is maintained. In this situation [Cro] diminishes (turns down), but does not abolish, its own synthesis.”[15]:92:-3/App.SB

The loops/‘observable states’ are tested for coexistability under inhibition in order to scale up to compartment-wide observations and the considered cascaded causation diagram is correspondingly convolved as the final step. The difficulty is that ‘loop’ is not a simple notion for edges with negative conditions under coexistability. Instead, we introduce sustained equilibria in place of plain loops.

Definition 11

Given a classification of states as A(lways)/T(ransitory)/N(ever) sustainers, we first find all strongly-connected components over the graph without A-inhibited edges. We then classify any found component as a sustained equilibrium of the given ATN type if it has no out-edges in the graph without A- and/or T-inhibited edges. By default, we consider only direct inhibitors for sustaining: indirect inhibition tends to express precedence, i.e., other action dominates.
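One possible reading of Definition 11 can be sketched in Python as follows. The sketch is illustrative only: the function names, the edge encoding `(source, target, inhibitors)`, and the `exit_blockers` parameter are assumptions, as is the choice to treat unclassified inhibitors as N. Strongly-connected components are found with the textbook Kosaraju algorithm, which need not be what our tool uses.

```python
def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm over (source, target) pairs; returns a list of frozensets."""
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for s, t in edges:
        succ[s].append(t)
        pred[t].append(s)

    order, seen = [], set()

    def visit(n):  # first pass: record finishing order on the graph
        seen.add(n)
        for m in succ[n]:
            if m not in seen:
                visit(m)
        order.append(n)

    for n in nodes:
        if n not in seen:
            visit(n)

    components, assigned = [], set()

    def collect(n, component):  # second pass: walk the reversed graph
        assigned.add(n)
        component.add(n)
        for m in pred[n]:
            if m not in assigned:
                collect(m, component)

    for n in reversed(order):
        if n not in assigned:
            component = set()
            collect(n, component)
            components.append(frozenset(component))
    return components


def sustained_equilibria(nodes, edges, classes, exit_blockers=("A", "T")):
    """Components of the graph without A-inhibited edges that have no out-edges
    in the graph without edges inhibited by a class in exit_blockers.

    edges:   iterable of (source, target, inhibitors), inhibitors a set of state names
    classes: mapping state name -> "A", "T" or "N" (unlisted states count as "N")
    """
    def blocked(inhibitors, kinds):
        return any(classes.get(i, "N") in kinds for i in inhibitors)

    core = [(s, t) for s, t, inh in edges if not blocked(inh, {"A"})]
    exits = [(s, t) for s, t, inh in edges if not blocked(inh, set(exit_blockers))]
    return [c for c in strongly_connected_components(nodes, core)
            if not any(s in c and t not in c for s, t in exits)]


# Toy graph (assumption): a two-node loop whose only exit is inhibited by state "r".
toy_nodes = ["n1", "n2", "n3"]
toy_graph = [("n1", "n2", set()), ("n2", "n1", set()), ("n2", "n3", {"r"})]
```

With `"r"` classified as T, the loop `{n1, n2}` counts as sustained because its only exit is removed; with `"r"` classified as N, the exit remains and only the terminal node `n3` qualifies. Note that the sketch, following the definition literally, also reports single-node components.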

The default definition of coexistability, below, is too strict. It, e.g., does not allow for lock-stepped interleaving, but it mostly suffices for the monograph.

Definition 12

A collection of sustained equilibria are orthogonal if no schema instance is in different states in different equilibria and if no equilibrium loses strong-connectivity under inhibition by the content of the other equilibria.

We shall not pursue the issue of a perfect definition here, other than mention that our tool also comes with a notion that is too lax, called reconcilable equilibria: each node in an equilibrium must be able to pair up with a node from each of the other equilibria without having some schema instance be in two states.

The trace-monoid perspective on our synchrotype prediction is that processes are nominated as independent of each other, i.e., reorderable, first within sustained equilibria, Definition 11, and then across coexistable equilibria, Definition 12. If a given ATN-sustained equilibrium cannot coexist with another, it is possible a subsumed one can, where sustainers are reclassified from N to T to A.
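For readers unfamiliar with trace monoids, the underlying idea can be illustrated with the textbook construction, in which an independence relation is simply given. The Python sketch below computes the Foata normal form of a word: two words denote the same trace exactly when their normal forms coincide. The function name and the encoding of independence as a set of unordered pairs are assumptions for illustration; in our setting, independence is not given up front but has to be nominated from the causations, as described above.

```python
def foata_normal_form(word, independent):
    """Foata normal form of `word` under a symmetric independence relation.

    independent: set of frozenset({a, b}) pairs of letters that may be reordered.
    Returns a tuple of steps; each step is a sorted tuple of pairwise
    independent letters, and each letter sits in the earliest possible step.
    """
    def dependent(a, b):
        # Every letter depends on itself; otherwise consult the relation.
        return a == b or frozenset((a, b)) not in independent

    steps = []    # steps[k] collects the letters placed in step k
    placed = []   # (letter, step index) for every letter seen so far
    for a in word:
        k = 0
        for b, step_of_b in placed:
            if dependent(a, b):
                k = max(k, step_of_b + 1)  # must come after every dependent predecessor
        if k == len(steps):
            steps.append([])
        steps[k].append(a)
        placed.append((a, k))
    return tuple(tuple(sorted(step)) for step in steps)


# Toy alphabet (assumption): a and b may be reordered, c depends on both.
toy_independence = {frozenset(("a", "b"))}
```

Here `"abc"` and `"bac"` share a normal form, since `a` and `b` are synchronized into one step, whereas `"acb"` does not: moving `b` past `c` would break a dependency, which is the kind of synchrony-breaking sequential progress that remains after factoring out reordering.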

Rapid phenotyping

In addition to reasoning about the wild type’s temperate phenotype, the monograph includes counterfactual reasoning about mutants exhibiting clear, virulent, and anti-immune phenotypes, with the latter needing a host that cannot lyse. The clear phenotype differs from temperate by pursuing a lytic attack on all fresh infection, see Figure 6:1/. The virulent phenotype additionally overrides the wild-type’s ability to withstand super-infection, see Figure 6:2/b/iii/. The anti-immune phenotype differs from temperate by being able to prevent lysogeny from a super-infecting wild type, see Figure 6:3/c/ii/. Inspired by the automation in the wild-type’s ab-intra phenotype, see Figure 6, a few concrete tests can be used to rapidly phenotype variant organisms, see Section S2.5.3: what is the outcome from fresh infection, can Cro’s concentration increase in the presence of CI, what is the outcome from having CI at its highest and Cro at zero concentration, what is the outcome from there of raising the observed Cro concentration respectively introducing short, single-stranded DNA, and what is the outcome from having Cro at its highest and CI at zero concentration (with expiration turned off in the tool options). Section S2.5.2 presents a generic MIG-specification that allows us to vary the intrinsic affinities of the proteins for the operators, for 1,728 variants. The interactive analysis of each variant takes less than a minute. The initial challenge is to account for alternate-pairwise CI cooperativity, a form of meta-regulation:

“Because repressor dimers at OR1 and OR2 interact, or repressor dimers at OR2 and OR3 interact, we say the cooperativity is ‘alternate pairwise’.”[15]:21:-2

1) Our MIG language is sufficiently abstract that it can accommodate within a single specification all the regulation changes that result from varying the intrinsic affinities, i.e., MIG covers also the considered meta-regulation, see Section S2.5.2. 2) Our RI modeling is sufficiently robust that the generic specification needs only local changes from the wild type, see Figure 1. 3) A caveat: the monograph does not discuss all changes to the highly-cooperative molecule interactions that may result from varying the intrinsic affinities and our analysis is based in part on surmises. An example concerns favored CI binding to OR3 over OR1, where the wild-type’s favored cooperativity direction becomes moot. We surmise that the resulting long-distance octamer does not allow for a subsequent long-distance tetramer on the free O{L,R} operators, i.e., the discovery that prompted the latest monograph edition [15]. We imagine that the operators will not be sitting across from each other unless a strand of the super-coiled DNA is in the opposite orientation of the wild-type case, see Retrodiction (I). 4) We have seemingly identified a new anti-immune variant, see Retrodiction (V) and Section S2.5.7.

Figure 9: Phenotype phase space for [AGS3]-mutants: varying CI affinities

5) Figure 9 shows the phenotype phase space from varying CI’s intrinsic operator affinities, with temperate, clear, virulent, and hybrids, see Section S2.5.4. The temperate-clear hybrid codes for a divergence: alternate-pairwise cooperativity may go either way, implying that the phenotype is chosen stochastically. However, the choice gets made repeatedly and the variant will probably manifest as clear. Higher than wild-type CI affinity for OR3 (bottom of Figure 9) results in the clear phenotype: CI works against CI-maintained lysogeny. Otherwise, the exhibited phenotype will tend from temperate through clear to virulent as it becomes increasingly difficult for CI to inhibit cro-transcription (front-left to rear-right in Figure 9). To the extent of the available mutation information, Figures 6,9 are similarly compatible with the monograph, see Section S2.5.5. An alternative explanation for the prevalence of clear over virulent variants (other than involving fewer mutations [15]:68:13) is that mutations that increase CI’s OR3 affinity trump other mutations affecting CI’s affinities in terms of the exhibited phenotype.

Discussion

“Among the sciences, mathematics is distinguished by its precise language and clear rules of argumentation” [9]. We bring molecular biology into the fold, with several derived benefits. Fundamentally, we have verified that the standard molecular-biology reference [15] is internally correct: its reasoning complies with a consistent logic, i.e., it makes sense. Rather than relying on existing technology for showing internal correctness, we recapitulate the modern pillars of mathematical reasoning for the distinct case of molecular biology (reasoning):

Principia Mathematica [2]

axiomatized mathematical practice and gave rise to the standard for rigor of the 20th century. It is #23 in “The Modern Library’s Top 100 Nonfiction Books of the Century,” see https://www.nytimes.com/library/books/042999best-nonfiction-list.html.

The Curry-Howard Correspondence [4, 5]

says that mathematical practice is closed-system sequential computation and vice versa. We establish that molecular-biology practice is open-system concurrent computation. Mathematics-type computation can mimic the molecular-biology type provided termination is guaranteed, except in the presence of choices/stochasticity.

“The new standard for rigor” [9]

looks set to become computer-assisted reasoning with verification and development technology, using axiomatics and proof by reflection (e.g., the innate form of Curry-Howard, as here).

Constructivity [3]

guarantees that molecular-biology reasoning is molecule-based simulation of the biology. The details will depend on the involved concepts: constructivity is relative to a language of properties, meaning we have general alignment of concerns when there is language matching at all levels. Reductionist reasoning is (all-level) constructive by definition. Closed-system assumptions, use of state spaces, and most extrapolation from data violate constructivity for fundamental reasons, see Section S1.6.4.

The outcome here is “a theory of constructive engineering principles of life [over molecular interactions]” [1], i.e., first principles from reductionism. This should be understood in contrast to Anderson [31]: while reductionism per se is not constructionism, reductionist reasoning is. The induced theory consists of:

Molecular programming:

MIG/RI-specifications capture how molecules may interact: RI stands for Regulation Interface; states nominally transpose affinities, i.e., they capture when bindings may have material effects; regulation may be arbitrarily nested; regulatory units may be assigned a name for subsequent tracking in the reasoning; and modalities may be used to account for variations in the manifestation of the programming, e.g., whether a mediator needs to remain in place to see a change through.

Causations

are constructed as the independently-operating and discretely-regulated manifestations of the molecular programming, accomplished as conversion of propositional versions of regulatory expressions to disjunctive normal form. Causations are elementary processes of physiology change.

Concurrency:

Everything biological is inferred by assessing how causations may interfere as concurrent processes. The interference works as a process calculus, a type of computation that need not have designated beginning or end and where no particular form of input or output is involved: the computation just runs, as it reacts to any and all environment changes. All runs are guaranteed to be coded for by justifying molecular interactions.

Sequential results:

An external result notion exists that aggregates how the processes can interfere, by factoring out how they may be reordered across interferences. The notion captures the possible sequential forms the concurrency may take: the coded-for life cycles/the organism synchrotype.

Open-system modeling

involves weaker logical assumptions than a closed-system approach, meaning an open-system framework will be sound for more real-world scenarios. Here, closedness admits false-positive argumentation for properties that hold for different reasons, see Section S1.6.4. Aside from avoiding state spaces and any appeal to truth within our setup, open-system modeling dictates that we do not enforce inhibition when predicting synchrotypes but retain the possibility as a pathway decider, see Figure 8.

While the induced mode of computation will not account for all of molecular biology, it does involve abstract concepts that go beyond the monograph [32] and seemingly are needed throughout the discipline [19]: we establish the basics.
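The conversion to disjunctive normal form mentioned under Causations can be illustrated with a generic Python sketch. It handles plain propositional formulas only: negations are pushed inward with De Morgan's laws, conjunctions are distributed over disjunctions, and contradictory disjuncts are dropped. The tuple encoding of formulas and the function names are assumptions, and the sketch deliberately omits our definition of propositional inhibition and any maximality condition on disjuncts, see Section S1.3.1 for the actual translation.

```python
def dnf(formula, positive=True):
    """Disjunctive normal form as a set of disjuncts.

    formula:  ("var", name) | ("not", f) | ("and", f, g) | ("or", f, g)
    positive: False while under an odd number of negations
    Each disjunct is a frozenset of literals (name, polarity).
    """
    kind = formula[0]
    if kind == "var":
        return {frozenset({(formula[1], positive)})}
    if kind == "not":
        return dnf(formula[1], not positive)
    left = dnf(formula[1], positive)
    right = dnf(formula[2], positive)
    # De Morgan: under negation, "and" behaves as "or" and vice versa.
    if (kind == "or") == positive:
        return left | right
    # Distribute conjunction over the disjuncts of both sides.
    return {l | r for l in left for r in right}


def consistent(disjunct):
    """A disjunct is contradictory if it holds a state both positively and negatively."""
    return not any((name, not polarity) in disjunct for name, polarity in disjunct)


def causation_conditions(formula):
    """Non-contradictory disjuncts; each would be the conditions of one causation."""
    return {d for d in dnf(formula) if consistent(d)}


# Toy expression (assumption): (a or b) and not (a and c)
toy_formula = ("and",
               ("or", ("var", "a"), ("var", "b")),
               ("not", ("and", ("var", "a"), ("var", "c"))))
```

For the toy expression, distribution yields four disjuncts, of which the one requiring `a` both present and absent is discarded, leaving three candidate condition sets. The product in the distribution step is also where the exponential growth discussed under Complexity originates.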

Validating prediction:

The key concept in our reasoning and computation formalisms is causations. As detailed above, causations are constructed as exactly the independently-operating processes of physiology change, and we predict that they can be identified in biology, too. Based on independence, we may detect them combinatorially. One candidate scenario is digitation. The hypothesis would be that each causation resulting from the regulation of a growth factor is responsible for the growth of one digit. Figure 3:(E) shows that the open issue of Fibonacci-many digits might be explained by nested inhibition but the prediction is more general.

Hybrid modeling:

Owing to coding, i.e., the absence of incorrect argumentation relative to computability within CEq, it is possible to instrument a genotype with affinity values and push them through to kinetics values on the edges in cascaded causation diagrams alongside the regulatory-expression names that track the sites of action, cf. Section S1.6.4. In what seemingly is not a surprise [33], numerical and symbolic methods that rely on and respect the same reductionist backbone can be expected to combine:

ab initio:

first-principle numerical calculation over physical properties of base entities.

ab intra:

first-principle symbolic computation over logical relationships between base entities.

Complexity:

Our translation of the regulation of a gene into the coded-for causations may need to consider as many as double-exponential-many potential causations in the size of the start expression. If we abandon the requirement that causations operate independently of each other, i.e., if we admit false-positive argumentation/site-action, some exponential-many actual causations collapse to linear-many, see Figure 3:(E). The complexity for a genotype is the summation over the individual genes.

The notation called Gene Regulation Networks (GRNs) is typically used with state-space analysis. The combination violates most of the above issues and, e.g., cannot easily account for nested regulation, as found in molecular biology. In effect, GRNs correspond to our discretely-regulated causations but without the benefit of our exponential subtlety in unraveling the manifestations of regulation, see Section S1.6.4 and Figure 3:(E): if we used GRNs for molecular programming, they might become big and it would be difficult to get all details right and coherent, or we would admit false-positive site-action. It is not clear if GRNs come with a distinct open-system/sub-state-space notion of model.

Our approach is intensional: we address regulatory engineering principles over mature concepts. GRN-based approaches are extensional: they work with effects throughout. As shown, life cycles, etc., are intensional notions. In general, intensional notions (e.g., physiological concentrations) are not easily recovered from extensional ones (e.g., observable concentrations), see Section S1.6.

Given proven reasoning principles, logical meta-theory makes fairly easy work of engineering structure and enables computational treatment of dynamics. Our axiomatization for molecular biology has been tested in exacting detail:

  • It involves abstract principles [32] that are subsumed by standard reasoning, see Theorem 2.

  • It has been computer-verified to enjoy standard reasoning meta-theory, see Theorems 1,4 and Corollary 5.

  • It has been computer-verified to account for the extant parts of the standard reference [15], see Theorem 8.

  • It admits derived usages covering all rather than just specific reasoning, including automation, see Theorem 8 and Proposition 9.

  • Its derived usages admit their own derived usages, see Proposition 10, cf. [1].

The induced theory appears to be simple. This is a reflection of conceptual purity and is a hard-won property. Without simplicity, there would be caveats in our story for Ptashne [15]. Instead, as seen, the theory’s inner structure is non-trivial.

Figure legends

Figure 1:

A cross-section of the formal [AGS3]-genotype, covering regulation of cro-expression. The text is in our MIG (Modal Influence Graph) language, written in accordance with our RI (Regulation Interface) modeling methodology applied to Ptashne [15], see Section S2.1. The combination is called MIG/RI specification. The items are listed in reverse for discussion purposes. Macros, for regulatory expressions, are prefixed with @ and categories with $. Conjunction is written &, disjunction :, inhibition |--, and contravariance !. The text inside ! {|…} is a modality for the contravariant effect. A | is a separator and a ; is a terminator. Comments are after //; here, they are used to indicate the main quotations in Ptashne [15] being formalized under RI. (A) The cro gene is transcribed from the PR promoter, which increases (+) the concentration of Cro protein; Cro decays naturally and is subject to passive contravariance (!): the population decreases unless maintained. (B) PR-transcription is constitutive but inhibited by either of the OR{1,2} operators; Cro has a neutral effect at one site at first but is contravariant at higher concentrations (+-) and at dual occupancy (--); CI binding is cooperative, immediately contravariant, and may outlast its free concentration ({|…}). (C) Dual Cro occupancy of OR{1,2} happens at the highest concentration, if both are well-defined (&); single occupancy requires either to be well-defined (:) and is excluded by dual. (D) Cro regulates with concentrations matching intrinsic affinities but CI binds preferentially. (E) Highly-cooperative CI binding is either as an octamer at O{L,R}{1,2} plus a tetramer at O{L,R}3 or as just an octamer. (F) CI’s octamer binding at O{L,R}{1,2} (@OLR12_CI) is initiated from OR1 and the cooperativity makes it physiological at a lower concentration ((+) ...) than that transposing the intrinsic affinity; octamer+tetramer binding forms when CI is able to bind also at OR3, with cooperativity across O{L,R}3 lowering the effective concentration. (G) CI binds OR{1,3} with different intrinsic affinities, specifically at low and high nominal concentrations. (H) Cro binds OR{1,2} at high nominal concentration. (I) cI/CI and cro/Cro are declared as genes whose protein products bind operators differentially, at several nominal concentrations.

Figure 2:

Axiomatic compartment changes over causations, i.e., over the discretely-regulated physiology changes coded for by a genotype, see Figure 3 and Section S2.2. In Definition 3, means ‘defined to be’, see Section S1.1.1. Definition 4 is an inductive definition of a formal proof system by proof rules, see Section S1.1.1: ‘if above the line, then below’, with meaning: ‘we [can] infer from ’. The are user-chosen, indexed subsets of the positively-validated causations in the considered compartment. The must be (co-)inhibition-free and produce a homogeneous compartment when combined. We consider open systems: mediators and inhibitors need not occur as reactants or products.

Figure 3:

(A) The causations coded for by Figure 1 (but without macro names, see Figure 4); for the specifics, see next. See Section S2.2 for all causations. (B) The ‘causation arguments’ in Ptashne [15] typically synthesize information spread over multiple causations. Reading (A) in textual order, [15]:22:1/App.SE makes reference to the inhibition on the first two causations: constitutive cro-expression. With the first two, the third causation completes [15]:25:7/App.SE while stressing that the referenced operator bindings are contravariant, see Figure 1:(B): they result in an opposite effect, meaning effective decay. The last six causations are the result of contravariance from [15]:22:1/App.SE. More subtly, [15]:22:1/App.SE’s “[i]n a lysogen” is seen as the CI inhibition on Cro-mediated auto-decay in the third causation: CI and Cro bind to the same operators to mediate Cro-decay, but CI binds preferentially, see Figure 1:(D). Without the CI inhibition, our framework would admit false-positive argumentation for effects that do take place but for different reasons. (C) The essence of our translation from the MIG language, see Figure 1, to causations, see (A), is to view regulatory expressions as propositional formulas () and convert these to disjunctive normal form (2DNF) by using commutative laws to push all negations to the inside, conjunctions to the middle, and disjunctions to outermost, see Section S1.3.1. Each non-contradictory disjunct (i.e., maximal collection of conjunctions of arbitrarily-negated states) then supplies the conditions for a causation. A key aspect is the listed definition of propositional inhibition. The obvious definition may appear to be , with the listed example instead becoming . The first of these two disjuncts would become a causation that can go ahead in the presence of although is stated to inhibit the considered inhibition by . The second can also go ahead, but the first would be false-positive argumentation. The definition we give ensures that all considered causations operate independently of each other. (D) Consider schemas, , with one active () and one non-active state each. Let be the function that constructs -nested inhibition over the by structural recursion, see S1.1.1. The example in (C) is . (E) Partly to illustrate the effective structure resulting from (C), we prove that gives rise to Fibonacci number () many causations. The base cases involve and . The proof steps for the recurrence are (1) definition of inhibition; (2) standard distributive laws for conjunction over disjunction, Section S1.1.7; (3) the conjunctive clauses in a DNF are the union of clauses over a disjunction; (4) conjoining a variable onto an formula does not change the number of clauses and, as it is a distinct variable, no causations get invalidated; no causations result from a contradiction; (5) by definitions; propagate negation by De Morgan’s laws (DM), Section S1.1.7; (6) repeat the step in (3); (7) repeat the step in (4); (8) applying DM twice leaves disjunctions and conjunctions intact. Use (4)[(1) left-hand] twice in [(5) left-hand]=(8). For propositional inhibition as , the result for would be . ( is a bit more than 2.)
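The translation idea in (C) can be illustrated with a minimal sketch, written here in Python rather than in MIG and not taken from the CEqEA tool. It assumes formulas built only from states, negation, conjunction, and disjunction; the paper's own definition of propositional inhibition is not reproduced, and the maximality condition on disjuncts is not enforced. All names (var, neg, conj, disj, dnf, causation_conditions) are hypothetical. Negations are pushed inward by De Morgan's laws, conjunction is distributed over disjunction, and contradictory disjuncts are dropped.

```python
from itertools import product

# Hypothetical formula constructors (illustration only, not MIG syntax).
def var(name):
    return ("var", name)

def neg(f):
    return ("not", f)

def conj(f, g):
    return ("and", f, g)

def disj(f, g):
    return ("or", f, g)

def dnf(formula, positive=True):
    """Return the DNF of `formula` as a set of clauses.

    Each clause is a frozenset of (state, polarity) literals and is read
    as a conjunction; the set of clauses is read as a disjunction.
    `positive` is False while we are underneath an odd number of negations.
    """
    kind = formula[0]
    if kind == "var":
        return {frozenset({(formula[1], positive)})}
    if kind == "not":
        # Push the negation inward by flipping the polarity.
        return dnf(formula[1], not positive)
    left = dnf(formula[1], positive)
    right = dnf(formula[2], positive)
    # De Morgan: under a negation, "and" behaves as "or" and vice versa.
    acts_as_or = (kind == "or") == positive
    if acts_as_or:
        # The clauses of a disjunction are the union of the clauses.
        return left | right
    # Distribute conjunction over disjunction: pair up all clauses.
    return {l | r for l, r in product(left, right)}

def consistent(clause):
    """A clause is contradictory if it holds a state and its negation."""
    return not any((name, not pol) in clause for name, pol in clause)

def causation_conditions(formula):
    """Keep only the non-contradictory disjuncts of the DNF."""
    return {clause for clause in dnf(formula) if consistent(clause)}

# Example: A & !(B & !C) has the two disjuncts {A, !B} and {A, C}.
A, B, C = var("A"), var("B"), var("C")
example = causation_conditions(conj(A, neg(conj(B, neg(C)))))
```

In this sketch, each surviving clause plays the role of the conditions for one causation: a literal with polarity True stands for a state that must be present, and one with polarity False for a state that must be absent.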

Figure 4:

Consider (A) the listed set of states, with instance-sorting given by the part before ‘.’, and (B) these ‘[named]:’ causations coded for over them, where (C) [SOS] coinhibits (outpaces) [PRM_tr]; the listed causations are Section S2.2.5: 1a1:1;2.1:3;2.1:4;3.2:2;4.1:1 for [AGS3]. (D) An example coextension derivation, see Figure 2, with rule names omitted: (interference) occurrences are top-most and include positively-validated causations in , with (co-)inhibitees to the right of semi-colons and the rest used for . (E) Our tool outputs coextension derivations in linearized form, with enabled positively-validated causations after -> or, if (co-)inhibited, after x>. Inert states not used as reactants are written after +. Our linearized coextension derivations reify the ‘pathway arguments’ in Ptashne [15]. For the example, [15]:24:-7/App.SF’s “SOS response” is listed in l.2; it is effected by (activated) RecA* in l.1; SOS counteracts cI expression from PRM, see l.3, which results in a CI-concentration decrease: l.1 vs. l.5; “vacating repressor” (CI) is OLR12_CI occurring in l.3 but not in ll.5–7; “transcription of cro” from PR is listed in l.6, resulting in a Cro-concentration increase: l.5 vs. l.9. See video.

Figure 5:

The graph of the coded-for causations from the [AGS3] genotype, see Section S1.4.1. The darker nodes in boxes list schema instances that do not change across any edges, i.e., environmental influencers. The magenta node with (inject) was specified as an entry seed in the [AGS3] genotype: a channel into the considered compartment, see Section S2.1.9. The edge labels show regulators and where they act: <source-only> ; <source+target> ; <target-only>. An edge with a filled head has a target-only mediator, which must be present for the edge to run. A -/ indicates direct/indirect inhibitors. A ! indicates inhibitors with a contravariant effect that will appear elsewhere. The red text in the Cro.h node indicates system expiration (here: host lysis), see Section S2.1.9.

Figure 6:

Our formally-substantiated ‘pathway properties’ for Ptashne [15] are developed ab intra, see Theorem 8: they reflect the structure of the reasoning space generated by Figure 5 acting on compartments, see video. The listed [AGS3] phenotype and its reasoning cover Ptashne [15], see Theorem 8. The items in italics are not in Ptashne [15], see Retrodictions.

Figure 7:

The practical aspect of this work is a tool that supports two usage perspectives: molecular-biology [reasoning / computing]. The use case of Genotype [molecular basis / programming] to Physiology Changes [‘causation arguments’ / executable code] to Synchrotype [open-system life cycles / (aggregated) sequential form] is automated. Exploration of Phenotypes [‘pathway arguments’ + ‘pathway properties’ + ‘cell behaviors’ / perturbable concurrent computation] is interactive. The affixed circles indicate the computational nature of the locations: physiology changes operate concurrently; the simulations may effect these step-wise; synchrotypes factor out how the concurrent processes may synchronize to each other in a stepping-free way, resulting in a sequential presentation of all possible behaviors, i.e., a prediction of the possible system life cycles.

Figure 8:

Our prediction of [AGS3]-sequentializations reifies the temperate life-cycle illustration in Ptashne [15], see [15]:133:15/App.SG. The construction convolves relevant parts of Figure 5, with originating nodes in []. The obtained graph reveals molecular details that would affect its traversal. (A) Synchronization nodes consist of coexistable causation nodes, with channels (magenta) and expiration (red text) from the causation level. The expiration node is not terminal because lysis is not regulatorily terminal, see Retrodiction (II). (B) Edges with a filled arrow have a target-only mediator: the direct route to lysogeny is available only to a super-infection in the considered case of CII below high concentration, see Figure 6:1/,2/b/iii/. It is not filled for CII at high, see Section S2.4.3. (C) Dashed edges are inhibited in the target node. CI is an inhibitor for the left edge, i.e., only continued lysogeny is possible in case of super-infection, see Figure 6:2/b/iii/. See Retrodiction (III) for the right edge. (D) Dotted edges are inhibited in the source node. The inhibitor here is CI, meaning a lytic attack requires any CI to become non-physiological by outside means (here: host-based proteolysis), see Figure 6:2/{a,b}/ vs. 2/c/. (E) Reflexive edges indicate active self-regulation (of lysogeny), see Figure 6:2/a/. (F) Boxes indicate ‘synchro-sustainability’ (of lysogeny), see Figure 6:2/a/.

Figure 9:

Phenotypes exhibited by varying CI’s intrinsic affinities for OR{1,2,3}, including the possibility that higher-than-wild-type concentration is needed for physiological binding. The origin in the figure is for binding at low concentration (high affinity), with also medium, high, and extra-high concentrations shown. The wild type is indicated with bold border (low, high, high nominal concentrations). Green (upper left) is temperate, red (upper right) virulent, and yellow (lower) clear, see Section S2.5. The mixed-color squares are hybrids that likely will be identified as the most destructive phenotype (most used color).

Supporting material

Instructional video:

“Using the CEqEA tool: A synthetic Genetic Switch”,
http://ceqea.sourceforge.net/extras/instructionalPoL.mp4 — with insert from St-Pierre and Endy [34] (with permission).

coext.v (included):

Proof scripts for the Coq Proof Assistant [13] to formally verify the logical meta-theory of our CEqEA tool [12], see Section S1.3.2.

lambdaAGS3-RI.mig (included):

Our MIG/RI specification for CEqEA that captures the premises that the reasoning in Ptashne [15] proceeds from. We refer to it as the [AGS3]-genotype, see Section S2.1.

lambdaAGS3-cert.txt (included):

CEqEA’s certification of the form-to-function compilation of lambdaAGS3-RI.mig/the [AGS3]-genotype, see Section S2.2.

lambdaAGS3-phys.mig (included):

A reconstituted MIG specification of the physiological influences coded for by the [AGS3]-genotype: a minimal but non-mutable organism account — extracted from lambdaAGS3-cert.txt.

lambdaVar-generic.mig (included):

1,728 variants of lambdaAGS3-RI.mig, see Section S2.5.

Acknowledgements

RV thanks Jittisak Senachak for coding prototypes of CEqEA’s form-to-function compiler, Olivier Danvy for comments on the manuscript, and Olivier Danvy, Hiroakira Ono, Kiyoyuki Terakura, and Mun’de Vestergaard for discussions. The authors declare no competing interests and no funding sources.

Contributions

RV conceived of and did the work. EP advised on visualization, adapted his ZGRViewer tool for the requirements of interactive CCP usage in CEqEA, and made Figures 7–9 and the video with RV.

References

  • [1] National Research Council [US], Committee on Defining and Advancing the Conceptual Basis of Biological Sciences in the 21st Century. The Role of Theory in Advancing 21st-Century Biology: Catalyzing Transformative Research. The National Academies Press, Washington, DC, 2008.
  • [2] Alfred North Whitehead and Bertrand Russell. Principia Mathematica, Vols.I,II,III. Cambridge University Press, 1910,1912,1913.
  • [3] Andrej Bauer. Five stages of accepting constructive mathematics. Bulletin of the American Mathematical Society, 54(3), 2017.
  • [4] William Howard. The formulae-as-types notion of construction. In Jonathan Seldin and Roger Hindley, editors, To H.B. Curry: Essays on Combinatory Logic, Lambda Calculus and Formalism, pages 479–490. Academic Press, 1969. Published 1980.
  • [5] Philip Wadler. Propositions as types. Commun. ACM, 58(12):75–84, November 2015.
  • [6] Thomas C. Hales. Mathematics in the age of the Turing machine, volume 42 of ASL Lecture Notes in Logic, chapter 7. Cambridge University Press, 2014.
  • [7] Eugene Wigner. The unreasonable effectiveness of [applied] mathematics in the natural sciences. Communications on Pure and Applied Mathematics, 13(1), 1960. “The great mathematician fully, almost ruthlessly, exploits the domain of permissible reasoning and skirts the impermissible. That his recklessness does not lead him into a morass of contradictions is a miracle in itself: certainly it is hard to believe that our reasoning power was brought, by Darwin’s process of natural selection, to the perfection which it seems to possess. However, this is not our present subject”.
  • [8] Frank Quinn. A revolution in mathematics? What really happened a century ago and why it matters today. Notices of the AMS, 59(1), January 2012.
  • [9] Jeremy Avigad and John Harrison. Formally verified mathematics. Communications of the ACM, 57(4), 2014.
  • [10] Thomas Hales, Mark Adams, Gertrud Bauer, Tat Dat Dang, John Harrison, Le Truong Hoang, Cezary Kaliszyk, Victor Magron, Sean McLaughlin, Tat Thang Nguyen, Quang Truong Nguyen, Tobias Nipkow, Steven Obua, Joseph Pleso, Jason Rute, Alexey Solovyev, Thi Hoai An Ta, Nam Trung Tran, Thi Diep Trieu, Josef Urban, Ky Vu, and Roland Zumkeller. A formal proof of the Kepler Conjecture. Forum of Mathematics, Pi, 5, 2017.
  • [11] Georges Gonthier. Formal proof — the Four-Color Theorem. Notices of the AMS, 55(11), 2008.
  • [12] René Vestergaard. CEqEA [k(e-as-in-met)k-e-a], the Cascaded-Equilibria Emergence Assistant. http://ceqea.sourceforge.net/, 2009–.
  • [13] LogiCal Team. The Coq Proof Assistant. https://coq.inria.fr/, 1989–.
  • [14] Richard Kelsey, William Clinger, and Jonathan Rees, editors. Revised report on the algorithmic language Scheme. Higher-Order and Symbolic Computation, 11(1):7–105, 1998. “Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary”.
  • [15] Mark Ptashne. A Genetic Switch: Phage Lambda Revisited. Cold Spring Harbor Laboratory Press, 3rd edition, 2004.
  • [16] Francis Crick. Central dogma of molecular biology. Nature, 227:561–563, 1970.
  • [17] Calin C. Guet, Michael B. Elowitz, Weihong Hsing, and Stanislas Leibler. Combinatorial synthesis of genetic networks. Science, 296(5572), 2002. “Boolean-type models neglect many potentially important intracellular phenomena, including stochastic fluctuations in the levels of components and the detailed biochemistry of protein-DNA interactions”.
  • [18] Lewis Wolpert. The Unnatural Nature of Science. Faber and Faber Limited, London, 1992.
  • [19] Mark Ptashne. Principles of a switch. Nature Chemical Biology, 7, 2011.
  • [20] Deborah Hay, Jim R. Hughes, Christian Babbs, James O. J. Davies, Bryony J. Graham, Lars L. P. Hanssen, Mira T. Kassouf, A. Marieke Oudelaar, Jacqueline A. Sharpe, Maria C. Suciu, Jelena Telenius, Ruth Williams, Christina Rode, Pik-Shan Li, Len A. Pennacchio, Jacqueline A. Sloane-Stanley, Helena Ayyub, Sue Butler, Tatjana Sauka-Spengler, Richard J. Gibbons, Andrew J. H. Smith, William G. Wood, and Douglas R. Higgs. Genetic dissection of the α-globin super-enhancer in vivo. Nature Genetics, 48:895–903, 2016.
  • [21] Mark Ptashne. Epigenetics: Core misconcept. PNAS, 110(18), 2013.
  • [22] Morten Heine Sørensen and Pawel Urzyczyn. Lectures on the Curry-Howard Isomorphism, Volume 149 (Studies in Logic and the Foundations of Mathematics). Elsevier Science Inc., New York, NY, USA, 2006.
  • [23] Sydney Brenner. Turing centenary: Life’s code script. Nature, 482(7386), 2012.
  • [24] Tobias Nipkow. Winskel is (almost) right: Towards a mechanized semantics textbook. Formal Aspects of Computing, 10:171–186, 1998.
  • [25] Martin Davis, David Luckham, and John McCarthy. Citation for Hao Wang as winner of the [1983] milestone award in automated theorem-proving. In W.W. Bledsoe and D.W. Loveland, editors, Automated Theorem Proving: After 25 Years, pages 47–48. American Mathematical Society, 1984.
  • [26] Bernard Révet, Brigitte von Wilcken-Bergmann, Heike Bessert, Andrew Barker, and Benno Müller-Hill. Four dimers of repressor bound to two suitably spaced pairs of operators form octamers and DNA loops over large distances. Current Biology, 9(3), 1999.
  • [27] Ian B. Dodd, Alison J. Perkins, Daniel Tsemitsidis, and J. Barry Egan. Octamerization of lambda CI repressor is needed for effective repression of P(RM) and efficient switching from lysogeny. Genes & Development, 15(22), 2001.
  • [28] Ian B. Dodd, Keith E. Shearwin, Alison J. Perkins, Tom Burr, Ann Hochschild, and J. Barry Egan. Cooperativity in long-range gene regulation by the lambda CI repressor. Genes & Development, 18:344–54, 03 2004.
  • [29] Robin Milner. Elements of interaction (Turing Award Lecture). Communications of the ACM, 36(1), 1993.
  • [30] Antoni Mazurkiewicz. Concurrent program schemes and their interpretations. Technical Report PB 78, DAIMI, Aarhus University, 1977.
  • [31] Philip W. Anderson. More is different. Science, 177(4047), 1972.
  • [32] Phantom, 2018. Ptashne [15] portrays the same interactions and dynamics as Lloyd Webber’s “The Phantom of the Opera”: cro/Cro is the chorus girl, cI/CI is the prima donna whose influence on others is not limited by her presence (‘in free concentration’), RecA is The Phantom whose actions may curb the prima donna’s reign and set the chorus girl free to live out her constitutive abilities with destructive consequences, cII/CII is a fickle public clamoring for star power until overwhelmed by an alternative, RNAP is the opera leadership that promotes one or the other performer according to circumstances, and ssDNA is the patron whose presence provokes The Phantom to change the course of events.
  • [33] Albert Einstein. Geometry and experience, 1921. An expanded form of an address to the Prussian Academy of Sciences in Berlin on January 27th, 1921.
  • [34] François St-Pierre and Drew Endy. Determination of cell fate selection during phage lambda infection. PNAS, 105(52):20705–20710, 2008.
  • [35] Tsutomu Hosoi. Pseudo two-valued evaluation method for intermediate logics. Studia Logica, 45:3–8, 1986.