Chemical Transformation Motifs - Modelling Pathways as Integer Hyperflows

12/07/2017 ∙ by Jakob L. Andersen, et al. ∙ 0

We present an elaborate framework for formally modelling pathways in chemical reaction networks on a mechanistic level. Networks are modelled mathematically as directed multi-hypergraphs, with vertices corresponding to molecules and hyperedges to reactions. Pathways are modelled as integer hyperflows and we expand the network model by detailed routing constraints. In contrast to the more traditional approaches like Flux Balance Analysis or Elementary Mode analysis we insist on integer-valued flows. While this choice makes it necessary to solve possibly hard integer linear programs, it has the advantage that more detailed mechanistic questions can be formulated. It is thus possible to query networks for general transformation motifs, and to automatically enumerate optimal and near-optimal pathways. Similarities and differences between our work and traditional approaches in metabolic network analysis are discussed in detail. To demonstrate the applicability of the mathematical framework to real-life problems we first explore the design space of possible non-oxidative glycolysis pathways and show that recent manually designed pathways can be further optimised. We then use a model of sugar chemistry to investigate pathways in the autocatalytic formose process. A graph transformation-based approach is used to automatically generate the reaction networks of interest.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Chemical reactions are intrinsically many-to-many relationships on molecules. Chemical reaction networks thus are naturally modelled as directed multi-hypergraphs [76], with vertices being molecules and directed hyperedges being reactions. A wide variety of methods has been devised to characterise the function of the networks in terms of pathways. Nevertheless, the very notion of a pathway has remained difficult to define formally. While biochemical tradition stipulates that certain combinations of enzyme-catalysed reactions constitute pathways, competing mathematical approaches single out reaction sets with specific properties that may serve as natural pathways in a particular setting. Flux Balance Analysis (FBA) [30, 46, 60] is geared towards finding a single pathway that is optimal in a certain sense, usually w.r.t. a biomass function defined from experiments. In contrast, Elementary Flux Modes (EFM) [72] and Extremal Pathways (ExPa) [50, 51] consider particular sets of pathways that implicitly describe the complete solution space. An important formal difference is that EFMs and ExPas have a mechanistic interpretation as integer valued fluxes, while FBA pathways do not have this property. Reversible reactions are in ExPa represented as separate pairs of mutually inverse hyperedges, while this is not done in FBA and EFM which instead allow for negative fluxes. A general mechanistic model related to EFM pathways introduces additional constraints to limit futile flux [13]. Interpretations of chemical networks that disregard the stoichiometry of the network have been popular since they are amenable to efficient methods from network analysis (e.g., see [62]) but are much less expressive [63].

We introduce here a versatile model for general mechanistic pathways, where routing constraints can be introduced to limit futile branches. Pathways are modelled as integer-valued hyperflows, which are the natural generalization of flows on directed graphs to directed hypergraphs [33]. Hyperflows are mathematically equivalent to chemical fluxes. They have been studied in their own right for restricted classes of hypergraphs [21, 32] and provide a direct link to the rich computer science literature (see e.g., [1]). Hyperflows can be found with Linear Programming (LP), as in FBA. We will show, however, that there are cases where it is important to insist on integer hyperflows. Integer hyperflows can be found with Integer Linear Programming (ILP), a technique that has been applied in previous pathway models [63, 13] and that we will also make extensive use of. As we shall see, LP and ILP solutions can be vastly different. It is well-known that LP can be solved in polynomial time [48, 44], while ILP is NP-hard in the general case [45]. Not surprisingly, the restricted problem of finding a maximum integer hyperflow in a chemical reaction network is also NP-hard, even for bi-molecular reactions (bounded degree hyperedges) [2]. Nevertheless, modern ILP solvers readily deal with most problem instances of practical relevance. Specialised (Mixed) ILP modification schemes have been devised to enumerate EFMs [25] and minimal reaction sets in an FBA setting [41, 19, 68]. Here we use a tree search algorithm in combination with the underlying ILP solver to enumerate alternative optimal and near-optimal pathways.

Pathways are only one way to capture the structure of a chemical network. The theory of Chemical Organizations (CO) [42, 23] considers higher-level properties of pathways and focusses on closure properties of subnetworks. A closely related topic is the recognition of catalytic and autocatalytic cycles. Autocatalysis forms the basis of autopoietic systems, a particular type of organizational structure. In prebiotic chemistry [69], autocatalysis has long been hypothesized to be one of the dominating mechanisms that eventually lead up to the origin of life. Logically described by the simple scheme , autocatalysis may be instantiated by a wide variety of distributed chemical schemes [16, 64] akin to autocatalytic set models [47, 40]. We will show that the geometric interpretation of integer hyperflows also facilitates the definition of algebraic criteria for autocatalyis, thereby allowing us to identify autocatalytic cycles as the solutions of an ILP problem. To this end we devise here an expanded network model that represents each vertex (compound) as a bipartite digraph. This allows us to efficiently implement additional topological constraints to handle futile cycles and to treat catalytic and autocatalytic pathways that are not immediately meaningful in the traditional FBA framework without mechanistic pathways.

From a chemical point of view the notion of a chemical transformation motif (CTM) takes on a central role in this context. It refers to a coherent subset of chemical reactions with a well-defined interface to the remainder of the network and implements a coherent overall chemical transformation. Metabolic pathways and subnetworks exhibiting overall (auto)catalysis are particular examples of CTMs. Hence CTMs are formal specification of a set of integer hyperflows in a chemical reaction network, possibly in conjunction with additional constraints on the allowed hyperflows. As an example we will investigate the many ways of instantiating the transformation motif defined by the conversion of 1 glucose to 3 acetylphosphates and the alternative implementations of the well-known autocatalytic Formose process.

The approach presented here combines aspects of previous pathway models. It can be used for finding single pathways as well as for the targeted, large-scale enumeration of pathways, both with respect to a given optimisation criterion, but without the burden of characterising the complete solution space.

2 Model

In this section we develop a formal model for pathways in reaction networks. For conciseness, we entirely use the language of hypergraphs. We first describe the basic pathway model and a simple notion of autocatalysis, characterised by the signature of the overall reaction of a pathway. This model allows for misleading, or chemically “uninteresting”, pathways. An expanded model is therefore introduced to narrow the space of pathways.

Note that the core model aims at formalising the notion of a chemical pathway, and therefore does not by itself include any optimality criteria. This model forms the basis of a practical implementation in terms of integer linear programs (ILP), which we will describe in detail in the following section. Different optimality criteria can then be added to the model in the form of linear objective functions.

2.1 Directed Multi-Hypergraphs

A directed multi-hypergraph

is an ordered pair of a vertex set

and a set of directed hyperedges . Each edge is itself a pair of multisets (hence “multi-”hypergraph) of vertices and , called the tail and head of . In the following we refer to directed multi-hypergraphs simply as hypergraphs. Fig. 1 gives an example of the visualisation scheme we generally use throughout this contribution.

Figure 1: Visualization of an abstract reaction network consisting of the reactions , , and . Note that we use a simplified visualisation scheme for 1-to-1 reactions, without a box but simply with a direct arrow between the compounds.

Since we will use both normal sets and multisets we introduce the following notation for multisets. When constructing a multiset we use double curly brackets (i.e., ) to distinguish them from normal set constructors, . For a multiset and some element we use to denote the number of occurrences of in . We introduce a multi-membership operator, , used in iteration contexts. E.g., .

2.2 Integer Flows on Extended Hypergraphs

Throughout, we assume that is a directed multi-hypergraph. Syntactically we will use a superscripted plus  to refer to “out”-related elements relative to vertices (e.g., out-edges), and a superscripted minus to refer to “in”-related elements.

We need a mechanism to inject and extract molecules from the network, i.e., what is called “exchange reactions” in metabolic networks. We therefore define the extended hypergraph of with where and designate the additional “half-edges” and , for each . An example is shown in Fig. 2.

Figure 2: fig:modelEx:network a small network, . fig:modelEx:extended the extended network, . Note that most of the input/output edges in the extended network will be constrained in the final formulation, and thus for many chemical networks many of these edges will effectively be removed to model specific interface conditions.

For a vertex and an edge we use the multiplicity function for multisets to write for the number of occurrences of in the head of (and for the tail). We use and to denote a set of incident out-edges and in-edges respectively, and thus use as the set of out-edges from , restricted to the edge set , i.e., . Likewise, denotes the restricted set of incident in-edges of .


An integer hyperflow on is a function satisfying, for each the conservation constraint


That is, the sum of flow out of each vertex must be the same as the sum of flow into it. We mostly speak of integer hyperflows, and will for brevity refer to them simply as flows.

In order to constrain the in- and out-flow to certain vertices we specify a set of inputs (sources) and outputs (targets/sinks) . Thus


serve as additional constraints in an I/O-constrained extended hypergraph, which is completely specified by the the triple .

We adopt the notion of an overall flow from the chemical overall reaction for a pathway, which is simply a convenient notation for the I/O flow. For a flow we syntactically write the overall flow as

However, as usual for chemistry we omit the terms with vanishing coefficients.

Flows are non-negative by definition. It is therefore necessary to model every reversible reaction by two separate edges and . This separation of the flow will later allow us to define useful chemical constraints on the flow.

Finally, as in many typical flow problems, a capacity function can be added to limit the flow from above, i.e., . We will, however, not explicitly make use of a capacity function in this contribution.

2.3 Specialised Flows – Overall (Auto)catalysis

The nature of autocatalysis is that the product of a chemical reaction catalyses its own formation. This interaction leads to an exponential time behaviour in the growth characteristics of the product, as well as, to a positive correlation of initial concentration and the reaction rate. Autocatalytic reactions have been investigated for over a century and instances of this behaviour have been found in a wide range of topologically different chemical systems, which are based on a rich chemical setup (for a recent review see [15]). Usually several reactions are organised in a cyclic network to achieve the proper topology for autocatalysis.

We here define a simple notion of both catalysis and autocatalysis in terms of the I/O flow of a network. As we only constrain the flow of the overall reaction, we call this overall (auto)catalysis. Catalysis means in chemical terms that a molecule, the catalyst, is first consumed by some reaction and then regenerated by a subsequent reaction in such a way that overall the catalyst is neither consumed nor produced. We therefore say that a vertex is overall catalytic in a flow if and only if (i) the input and output flows of are non-zero and (ii) the input and output flows of are equal, i.e., iff


Similarly, autocatalysis, refers to a situation where a molecule is consumed in a reaction only to be (re)produced in subsequent reactions in higher quantities than what has been consumed originally. In terms of flow we say a vertex is overall autocatalytic in a flow if and only if


We extend the terminology to say that a flow is overall (auto)catalytic if some vertex is overall (auto)catalytic in .

2.4 Chemically Simple Flow and Vertex Expansion

Representing reversible reactions as pairs of irreversible ones gives rise to trivial pathways consisting of edge and its inverse with equal positive flow. Consider the example in Fig. 2(a).

Figure 3: Simplification of a flow to an equivalent flow , by removal of futile 2-edge sub-pathways. fig:vertexExp:1 The molecule is created through , but can only be interpreted as being consumed through the reverse reaction. fig:vertexExp:2 After removal of 1 flow from the reactions , the molecule now participates in a futile 2-edge flow. fig:vertexExp:3 Removing 1 flow from and the I/O edges , we arrive at the simplest flow.

Here we see three pairs of reversible reactions with positive flow: , , and the I/O reactions . However, we can argue that this flow is not “simple” in the sense that there is no interpretation of the flow without a futile conversion of matter. This problem has of course been recognized in the literature. Additional ILP constraints were used in [63, 13] to prune such cases. However, this approach seems quite difficult to control in practice. Disallowing all cases of flow on both a reaction and its reverse as in [13], for example, turns out to be too restrictive. Another approach is to simply cancel out flow as a post-processing step [43]. In the following we illustrate that certain seemingly futile flows should be allowed. To facilitate precise constraints we therefore advocate a refinement of the network model itself.

In the pathway a single copy of is created, through the reaction , and it can only be routed into a single reaction, . The sub-pathway is thus a futile 2-edge branch that we can simplify away, yielding the equivalent flow in Fig. 2(b). The same reasoning can now be applied to , and subsequently , resulting in the flow depicted in Fig. 2(c).

The original flow in Fig. 2(a) fulfils the requirement for overall autocatalysis in vertex (Eq. (4)), but clearly the in-flow of 1  is not involved in the extra production of , which goes against the idea of autocatalysis. The simplified flow, Fig. 2(c), is however not overall autocatalytic, and it is therefore desirable to constrain the model such that futile 2-edge flows are not possible, in order to further approach a precise characterisation of autocatalysis.

From the shown example it is tempting to simply disallow flow on both an edge and its inverse. This is the approach effectively used in FBA-related methods, and also in flows on normal graphs [1, 11]. Since flow on an edge and its inverse is inherently a part of the I/O specification for autocatalysis, some exceptions must be made for I/O flow. However, as illustrated with the counterexample in Fig. 4 even internal edges can have seemingly futile 2-cycles.

Figure 4: Example of a flow with meaningful flow on an edge and its inverse, in the network from Fig. 1(b). Only edges with non-zero flow are shown.

We can interpret this flow such that no flow is directly reversed:

  1. twice

  2. twice

Since this interpretation has the flow on mutually inverse edges temporally separated, it should not be excluded from the pathway model.

To facilitate fine-grained control of flow where directly reversed flow can be disallowed we expand the extended hypergraph into a larger network. Each vertex is expanded into a subnetwork that represents the routing of flow internally in the expanded vertex. Formally, for each


That is, is replaced with a bipartite graph with the vertex partitions representing the in-edges and out-edges of respectively, and the edge set being the complete set of edges from the in-partition to the out-partition. We say that is the set of transit edges of .

The original edges are reconnected in the natural way; for each the reconnected edge is with and . We finally define the expanded hypergraph as

We expand the definition of a flow function to and pose the usual conservation constraints, but on : for all


The I/O constraints translates directly to the expanded network. In Sec. A we formally describe the relationship between flows on the extended and the expanded network.

Using the expanded network we can prevent flow from being directly reversed: for each pair of mutually inverse edges we consider the transit edges they induce, i.e., the edges for all . None of these edges may have flow through them. A flow must thus satisfy


Fig. 5 shows the expanded version of the network from Fig. 1(b), with these constraints in effect.

Figure 5: The network from Fig. 1(b) expanded into . The vertices of are the small filled circles, while the large circles, , , and , only serves as visual grouping of the actual vertices. fig:modelEx:expanded:full the expanded network without transit edges constrained in Eq. (8). fig:modelEx:expanded:constrained the expanded network, simplified using the source set and sink set . fig:modelEx:expanded:flow the example flow from Fig. 4 in the expanded network, with only edges with non-zero flow shown. Note that no 2-cycles exist in this flow.

Note that the expansion of the network also opens the possibility of forbidding other 2-sequences of edges, and in general the possibility of posing constraints on the routing of flow internally in vertices.

When querying for chemical pathways with partially unknown I/O specification we have found it useful to distinguish between reversible reactions that are in the original network and the reversible I/O reactions. That is, we may choose to not pose the above constraints on transit edges when and , thus allowing excess input-flow to be routed directly out of the network again. Fig. 4 shows a valid flow with a meaningful 2-cycle. The expanded flow in Fig. 5 no longer has 2-cycles.

In Eq. (3) and (4) we defined the I/O constraints for overall catalysis and autocatalysis. These constraints are converted in the obvious manner to the expanded network . However, the expanded network reveals another possibility for somewhat misleading flows, exemplified in Fig. 5(a).

Figure 6: A simplified network with an overall autocatalytic flow. fig:strictCataProblem:nonStrict the vertex is both overall autocatalytic and an intermediate vertex in the flow. fig:strictCataProblem:strict the same motif for overall autocatalysis, but without being an intermediate vertex.

Vertex is overall autocatalytic, but is also utilised as an intermediary molecule. The same autocatalytic motif can be expressed by a simpler flow, Fig. 5(b).

In the interest of finding the simplest (auto)catalytic flows, we introduce the following constraints. Let be a flow and a vertex satisfying the I/O constraints for overall catalysis (3) (resp. overall autocatalysis (4)). In the expanded network must additionally satisfy the transit constraints (note that does not include the I/O edges):

That is, all transit flow in an overall (auto)catalytic vertex must flow either from the input edge or towards the output edge .

If a compound is overall autocatalytic, it merely means that; if it is available then even more can be produced. However, this does not mean that it cannot be produced solely by the other input compounds. Solutions can therefore be found that may be surprising. One such solution is illustrated in Fig. 7.

Figure 7: The vertex is overall autocatalytic, but not autocatalytic in the chemical sense.

As a variant of our definition of overall autocatalysis we define that a vertex , is exclusively overall autocatalytic if and only if it is overall autocatalytic and is not trivially reachable from the other input vertices, . A vertex is trivially reachable from a vertex set if it can be marked during a simple breadth-first marking of the hypergraph . For completeness, the pseudocode is shown in Algorithm 1.

Input : A directed (multi-)hypergraph .
Input : A set of starting vertices .
Output : A marked subset of the vertices.
foreach do mark while no more hyperedges can be marked do
       foreach do
             if all are marked then
                   mark foreach do mark
Algorithm 1 Breadth-first marking of a hypergraph.

Note that breadth-first marking of hypergraphs, and variations thereof, has in the literature also been referred to as finding scopes of molecules [37]. Breadth-first marking has in those studies been used alone to analyse metabolic networks, and define set-theoretical notions of pathways and later of autocatalysis [54]. The methods thus do not have focus on the underlying mechanism of the pathways, which is our aim in this contribution.

2.5 Properties of the Expanded Hypergraph

The expansion of the networks obviously changes the size of the underlying model, and it is therefore necessary to investigate how large the expanded network can get, in order to bound the complexity of algorithms. In this section we only state the results. The proofs can be found in the appendix.

For a directed multi-hypergraph , let the size of it be denoted by . Note that if the hypergraph is represented by a bipartite normal graph, then this size corresponds to the number of vertices and edges.


The size of the extended network and the expanded network is polynomial in the size of the original network.


A feasible flow on can be converted into an equivalent feasible flow in , with: , for all .


Let be a feasible flow on . It can then be decided in polynomial time, in the size of , if a feasible flow in exists such that for all . If it exists it can be computed in polynomial time.

The problem of finding a pathway with maximum production of specific compounds is known to be NP-hard, even for networks with bounded degree reactions [2]. The last proposition underlines that the expansion of the network does not drastically increase the complexity of finding pathways, but the proof of it also shows a potentially practical algorithmic approach to working with flows in the expanded network. In Sec. 2.7 we will however show an approach to directly find the flows in the expanded network using ILP.

2.6 Comparison to Existing Methods

The basic pathway model described in Sec. 2.2 is quite similar to the formalism used in FBA, EFM and ExPa, with the latter two methods primarily aiming to categorise specific classes of pathways [61]. The basic model is also similar to the model presented in [13], although the specification of allowed I/O flow is phrased differently. We briefly recast FBA in terms of hypergraphs as the underlying model of reaction networks to clarify the similarities but also the differences with our present approach. The mathematical development of FBA, EFM, and ExPa is based upon the concepts of the stoichiometric matrix and

flux vectors

. These are analogous to a directed multi-hypergraph and non-integer hyperflows. Let denote an I/O-constrained reaction network as defined in Sec. 2.2, with . The network can be represented as two matrices, an out-incidence matrix and an in-incidence matrix , both in the domain . That is, each row represents molecules and each column represents reactions. Let vertices and hyperedges have some arbitrary total order, and . Then for each pair of vertices and reactions, , the matrices are defined as and . Thus the columns of represents the tail-multiset of each hyperedge, likewise for and the head-multisets. The actual stoichiometric matrix is defined as , which in chemical terms is the change of the number of each molecule that each reaction induces. The stoichiometric matrix describes both the proper chemical reactions and transport of material from and to the outside, equivalent to the extension of a hypergraph to an extended hypergraph .

The stoichiometric matrix completely describes the original reaction network, and thus is equivalent to the I/O-constrained reaction network if and only if all hyperedges have disjoint head and tail. All direct catalysts, however, are cancelled out in the stoichiometric matrix, hence the equivalence fails whenever there are reaction hyperedges with . This somewhat limits the scope of FBA. Although it is possible in principle to replace reactions with direct catalysts by a sequences of intermediate reactions that consume and regenerate the catalyst, the resulting FBA network is no longer equivalent to the original one and allows the drainage of intermediates. This alters flux solutions and may be undesirable, e.g., when modelling the concerted action of enzyme complexes. Inspired by natural biosynthetic pathways industrial biocatalysis research [34, 24] currently intensively investigates multi-enzyme cascade reactions, i.e., the combination of several enzyme reactions in concurrent one-pot processes [66], because of their prospect towards a “greener” and more sustainable chemical future. Intermediates from these cascades cannot be accessed as substrates by reactions outside the cascade; hence they require special treatment when represented explicitly.

A flux vector models a pathway, and must satisfy the usual conservation constraint, (cmp. Eq. (1)). Reversible reactions are modelled in one of two ways:
(i) Combined: reversible reactions are modelled as a single reaction, but with the flow/flux allowed to be negative. The flow/flux of irreversible reactions is non-negative. This is the approach followed when finding EFMs [72].
(ii) Separate: reversible reactions are modelled as two inverse reactions, and the flow/flux on all reactions must be non-negative. This is the approached followed when finding ExPas [71]. We also follow this approach in our contribution both for mathematical simplicity and because it allows us to make use of the enhanced modelling capabilities offered by the expanded network.

The extension of the stoichiometric matrix to incorporate I/O reactions can also be implemented using both the “combined” and the “separate” way of handling reversible reactions. The I/O constraints from Eq. (2), specified by and , translate naturally to the corresponding constraints on the extended flux vector.

Figure 8: Examples of reaction networks with not totally unimodular stoichiometric matrices. fig:tumMulti all entries in a TU matrix must be , , or . fig:tumSquare the submatrix consisting of the top two rows has determinant .

With FBA we additionally define a linear objective function to find an optimal flux vector, possibly with additional linear constraints [70]. As a flux vector is real-valued, and all the stated constraints are linear, it can be found using linear programming (LP) in polynomial time [48, 44]. Herein lies a major difference to the model presented in this contribution, where we require an integer hyperflow. We can thus characterise the linear program from FBA as the LP relaxation of the basic pathway problem presented here.

The LP relaxation of an ILP yields an integer solution only under special conditions. The best known sufficient condition is that the matrix of constraint coefficients is totally unimodular (TU), i.e., when all its square submatrices have determinants , , or , and thus all entries of the matrix are also , , or . This is the case for example for integer flows in graphs [1, 11]. As the simple examples in Fig. 8 shows, this not true in general for stoichiometric matrices and hence for hyperflows.

Even though total unimodularity is not a necessary condition, it is not too difficult to construct reaction networks with linear optimisation problems where the integer problem and the LP relaxation have drastically different optimal solutions. As an example consider the carbon rearrangement network described later, in Sec. 3.1, and the question: given 1 xylulose 5-phosphate () and an arbitrary amount of phosphate (), find a pathway that maximises the production of acetylphosphate (). As contains 5 carbon atoms and AcP contains 2, it is clear that the maximum production from a single molecule must be at most 2 . It turns out that the optimal integer solution produces just 1 , by the single reaction . However, the optimal solution to the LP relaxation of the problem yields 2.5 , via the pathway described in Tab. 1.

Table 1: A non-integer pathway with maximum production of AcP from 1 X5P. The optimal integer solution consists of the shaded reaction with flow 1. Relaxing the integrality constraint thus allows for a recycling pathway using to produce 1.5 extra copies of . See Sec. 3 for a table of molecule abbreviations.

This non-integer pathway is incidentally a superposition of the optimal integer pathway and a recycling pathway for converting 1 into 1.5 .

A scaling of the non-integer flow with a factor 2 gives an integer solution, and indeed our methods can also find this as well as many other solutions implementing the same motif. Since LP solutions with integer-valued constraint matrices and objective functions with integer coefficients are rational, it is mathematically always possibly to scale the LP solution to integer values. The actual numbers, however, may become very large. Taking physiological constraints into account, the number of available individual molecules may be small, as low as 100 copies [36], and even smaller for biological macromolecules [29, 57]. Small numbers may have a profound influence on chemical kinetics and can in many cases cause qualitative changes in the behaviour of systems [35, 53]. It is an important feature of the ILP framework to allow for directly expressing such biological constraints. At the same time, it provides pathway solutions that have a direct mechanistic interpretation. On the other hand, in some cases where composite pseudo-reactions are used to model experimental results, as in the case of biomass as objective function, non-integer coefficients may appear in way that makes scaling difficult or undesirable. Such models, however, inherently are are approximations consistent with the LP relaxation. Mechanistic pathways are no longer well-defined in such as setting.

In the previous example we maximised the production of a specific molecule, and saw that the ILP solution have objective value 1 and the LP relaxation have objective value 2.5. The ratio between these values is known as the integrality gap, and it is known that this gap can scale with the input instance. For a simple example, consider the reaction networks stemming from the polynomial-time reduction described in the appendix, reducing the well-known Independent Set problem to maximising the production of a single molecule in a reaction network.

Figure 9: Reduction from the independent set problem to the problem of maximising output from a molecule in a reaction network, applied to the two graphs fig:independent:K3 and fig:independent:K4 . The hyperedges are annotated with both a feasible integer flow and a feasible non-integer flow, written as . Allowing a maximum inflow of 1 to all vertices and maximising the outflow of corresponds to finding a maximum independent set in the original graph.

Applying the reduction to complete graphs with vertices and formulating the problem in terms of hyperflows, we obtain an integrality gap of : for integer flows we can use at most 1 reaction, thus giving a maximum output of 1. When the integrality constraint is removed we can let the flow be 0.5 on all reactions, giving an output of . The reaction networks for the complete graphs of size 3 and 4 are shown in Fig. 9. This illustrates that the use of the LP relaxation is not just a technical detail, but changes the nature of the problem entirely.

2.7 Implementation Using Integer Linear Programming

The ILP formulation characterising feasible flows is based on an expanded hypergraph . The flow function is modelled by an integer variable for each edge , and by constraints for flow conservation. The basic constraints are thus for all and

This definition is similar to an ILP formulation of a classical network flow problem, but with important differences; is a hypergraph so an edge may be in both and (or and ) for . Additionally, is a multi-hypergraph, and thus the coefficients and are introduced, which may be larger than .

The basic formulation can be augmented with constraints, e.g., on the I/O edges, and an objective function depending on the specific problem to be solved. Additionally, the constraints for chemical flows specified in Eq. (8) are added in the obvious way. However, to reduce the size of the ILP we merge transit nodes where no incident transit edges have associated constraints, i.e., those created from irreversible reactions.

In the following sections we describe constraints for finding overall catalytic and autocatalytic flows. For the formulation we will use to denote a classical “large enough” constant, and as some parts of the models for overall catalysis and autocatalysis are similar we will first describe the formulation of these common parts.

In many cases of practical relevance not only the optimal solution is relevant because alternative, near optimal solutions may be easier to find and realize in an evolutionary context. A linear program may have an uncountable number of optimal solutions as the variables are in the domain of real numbers. The solutions can, however, be described by enumerating the, possibly exponentially many, corners of the optimal face of the polyhedron defined by the linear program [74]. This has also been applied to FBA [55]. The restriction of ILP keeps the set of candidate solution countable (even finite with flow capacity constraints), and it becomes straight-forward to enumerate not only optimal solutions, but also near-optimal solutions, see Sec. 2.8.

In our definition of (auto)catalysis we require that if a vertex is (auto)catalytic, then no flow can enter the vertex from the network and exit the vertex to the network again. Let be the indicator variable for the vertex being (auto)catalytic, then the requirement is trivially enforced by the constraints for all

We model catalysis by introducing an indicator variable for each indicating whether is catalytic or not. Thus we can enforce a solution to be catalytic by posing the constraint . The actual constraints for the indicator variables are obtained partially by the constraints above on strictness of flow. Below follows the last requirement, Eq. 3, which is realised through a set of auxiliary indicator variables,

As for overall catalysis we model overall autocatalysis with a set of indicator variables for all , and force a solution to overall autocatalytic with the constraint .

We use the constraints for strictness of flow and model the remaining constraint, Eq. 4, using the auxiliary variable set , indicating :

2.8 Solution Enumeration

A typical use of solvers for integer programs is to find a single optimal solution. However, from a chemical perspective we are also interested in near-optimal solutions and in some cases even all solutions. The structure of our formulation additionally have influence on when two solutions are considered different. Often we might not consider two solutions different if they only differ in the flow on the transit edges, i.e., those introduced by the vertex expansion. This makes it difficult to use build-in features in solvers, such as the solution pool in IBM ILOG CPLEX, to enumerate solutions.

For finding multiple solutions we therefore explore a search tree based on the domain of the variables; each vertex in the tree represents a restriction of the variable domains, with children representing more constrained domains. Note that this tree, in theory, is infinite as some variable may have no upper bound. In each vertex we use an ILP solver to find an optimal solution for the sub-problem. If the problem is infeasible the sub-tree is pruned, otherwise a path to a leaf in the tree is constructed to represent the solution found by the ILP solver. The quality of the found solution at the same time acts as a lower bound on the objective function of the sub-tree (when minimising the function). Vertices in the tree are explored in order of increasing lower bound. If a different value of flow is not to be considered a difference in the solution we simply do not consider the corresponding variables to be part of the branching procedure.

2.9 Software

The pathway model is implemented as an extension of the larger software package, () [10, 3], which can be downloaded at The software combines the pathway analysis with methods for working with generative chemistries [6, 7], including the algorithms for network expansion [8] also used in this contribution. The tight integration between the methods makes it a convenient tool to design artificial chemistries, both high-level systems like DNA-templated computing [5], or hypothetical prebiotic chemistries [4, 9]. Our implementation uses IBM ILOG CPLEX 12.5 for solving integer linear programs. An upcoming version of is in preparation that will include the pathway model, and a preliminary version is accessible at, with examples of usage. The integrality constraints and the implicit constraints depending on this assumption can easily be disabled in our implementation. In this mode the framework is no longer guaranteed to produce mechanistic pathways. Instead, it reproduces classical FBA. Solution enumeration, as described above, is not well-defined in this setting.

3 Application Scenarios

Here we illustrate the strength of our modelling framework by analysing two specific chemical systems. We shall see that beyond well studied objectives such as the minimization of the number of reactions used or maximizing the yield of a target compound, the objective can also be augmented with constraints for a chemical transformation pattern, for example overall autocatalysis. Both chemistries are modelled using the graph grammar approach described in [3, 8, 6, 7]. A visualisation of all rules can be found in the appendix, along with tables of molecule name abbreviations.

3.1 Carbon Rearrangement in the Non-oxidative Glycolysis Pathway

Microbial synthesis of plant-derived secondary metabolites currently is a hot topic in metabolic engineering [75, 58]. Impressive examples include the synthesis of the anti-malarial drug precursor artemisinic acid [67], and of (S)-reticuline [27], an important intermediate towards the benzylisoquinoline alkaloids codein or morphine. The idea of engineering microbes to synthesize useful compounds also extends to the production of fuel. One approach is to couple the catabolic pathway known as glycolysis, which decomposes glucose (a molecule) into acetyl coenzyme A units (, a molecule), to a designed synthetic pathway, condensing into the skeleton of the target molecule. The drawback of this approach is, however, that 2 of the 6 carbon atoms from glucose are lost as during normal oxidative glycolysis, which pushes the yield of this pathway down to , in terms of atom economy. A lossy process for producing educts for the synthetic biofuel producing pathway is naturally a bad idea from an economic perspective.

A recent study [17] attacks the aforementioned problem by hand crafting a non-oxidative glycolysis (NOG) pathway, which prevents the carbon atom loss of its natural counterpart. The general logic of this designed pathway is to couple the splitting reaction that produces the desired body and a body as putative wast, to a carbon rearrangement network, which then recycles the body into molecules, that can be fed back into the NOG as educts. With this strategy NOG achieves a carbon atom economy. The architecture of metabolic networks in general shows this kind of self-referential topology; every putative waste molecule is recycled, making metabolic networks very different from unevolved chemical reaction networks (e.g., atmosphere or geochemical networks), which do not show this property.

The paper[17] discusses several sources of variation for the structure of the NOG pathway. First, the splitting reaction can be performed by two types of phosphoketolase (PK) enzymes, differing only in the their input sugar preference; either fructose (F) or xylulose (X). Second, the carbon rearrangement network can go either via fructose 1,6-bisphosphate (, a sugar) or sedoheptulose 1,7-bisphosphate (, a sugar). Three pathways are shown (Fig. 2, [17]) that exploits this freedom in the design of the NOG pathway motif. In this section we illustrate that many more equivalent solutions can be found automatically, using enumeration of mechanistic pathways. In order to explore related reactions for which concrete enzymes may not yet exist, we use a generic model of the chemistry.

The molecules are encoded as graphs in the straight-forward manner, though without stereochemical information. This implies that certain classes of molecules are represented as a single molecule, e.g., and .

We have modelled the generic transformation rules listed in Tab. 2, and shown in the appendix.

Abbr. Name Description
AL Aldolase A generic aldol addition.
AlKe Aldose-Ketose Aldehyde to ketone conversion.
KeAl Ketose-Aldose Ketone to aldehyde conversion.
PHL Phosphohydrolase Use water to cleave off phosphate.
PK Phosphoketolase Break C-C-bond and add phosphate.
TAL Transaldolase Move -end.
TKL Transketolase Move -end.
Table 2: List of generic transformation rules for modelling the non-oxidative glycolysis chemistry. The full details of the rules are shown in the appendix.

In [17] the use of phosphoketolase is associated with specific names for the specific reactions:

  • XPK for

  • FPK for

We extend the naming scheme to cover educts with 7 and 8 carbons:

  • SPK for

  • OPK for

To create the reaction network we use the starting molecules , AcP, G3P, DHAP, E4P, R5P, Ru5P, F6P, S7P, and FBP.

The network is created by iterating the application of all the transformation rules listed in Tab. 2, until no new molecules are discovered. This chemistry is theoretically infinite, so we impose the restriction that no molecule with more than 8 carbon atoms may be created. This is a reasonable constraint, since even under physiological conditions carbohydrates are inherently metastable compounds. For carbohydrates larger than the decomposition reactions become a dominating reaction channel. Although alternatives (L-type PPP) or extended reaction sequences via higher carbohydrates have been suggested, their biochemical evidence has been questioned (see discussion in recent review [73]). The network generation therefore terminates, and results in a network with 81 molecules and 414 reactions. The ILP for finding pathways contains 12077 variables, of which 10477 are due to transit edges. Without transit node merging the number of transit edges would have been 23805.

For the overall reaction we have enumerated all solutions using at most 8 unique reactions and with at most 11 reactions happening in total. Additionally we disable the AL and PK reactions with small educt molecules; those with less than 3 carbon atoms each. This is in agreement with the experimentally characterised reversible ping-pong mechanism of these two enzymes [52]. This results in 263 different pathways, which were computed in 5 minutes on a desktop computer with an Intel® CoreTM i7-6700 CPU (). In Tab. 3 we categorise the pathways w.r.t. (i) the number of unique reactions used, (ii) the number of reactions, (iii) whether the only bisphosphate used is FBP, and (iv) the histogram of different PK reactions (see the modelling above). The table shows the number of solutions for each combination.

Table 3: Overview of number of NOG pathways. Categories marked with subscript , , and refer to Fig. 2 of [17], and we see that not only are there alternate solutions in the exact same categories, but in the case of we even find a shorter pathway with the same properties. Note that the left block is similar to the middle block of categories, but with 1 less reactions used. This is due to a replacement of part of the pathway with a shorter pathway, see Tab. 4. The pathway shown in Fig. 9(a) is from the framed blue category, and the pathway in Fig. 9(b) is from the framed green category. Their counterparts with the replacement from Tab. 4 are the unframed blue and green numbers.

Interestingly it turns out that the solution space where FBP is the only bisphosphate used is quite similar to the space where other bisphosphates are allowed but with only 7 unique reactions. The solutions are distributed in the same manner except for a 1-shift in the number of reactions. There is a 1-to-1 mapping between these two sets of solutions such that the only difference in the pathway is the sub-pathways described in Tab. 4.

Table 4: Two pathways with the same overall reaction. tab:nogDiffLong a 3-reaction pathway using FBP as bisphosphate. This sub-pathway is highlighted in Fig. 9(a). tab:nogDiffShort a 2-reaction pathway using a bisphosphate with 7 carbons. This sub-pathway is highlighted in Fig. 9(b).
Figure 10: Two solution pathways for NOG. fig:nogLessA A pathway similar to [17, Fig. 2a], but using fewer reactions. This solution category is the framed blue cell of Tab. 3. The highlighted sub-pathway is the pathway from Tab. 3(a). fig:nogShortest The shortest pathway found, denoted by the framed green cell in Tab. 3. The highlighted sub-pathway is the pathway from Tab. 3(b).

In Fig. 9(a) one of the solutions is illustrated in detail. This solution has similar properties to the solution shown in Fig. 2a in [17]: the phosphoketolase reactions all have F6P as educt, and the only bisphosphate used is FBP. However, this solution can be regarded as being shorter as it uses fewer reactions, though the number of unique reactions is the same. Allowing other bisphosphates than FBP to be used enables even shorter solutions to be found. Fig. 9(b) shows the shortest solution, which uses a bisphosphate with 7 carbon atoms. Its use of phosphoketolase is also different, as it uses both XPK, FPK, and SPK.

The basic structure of the NOG solutions combine a productive splitting reaction with a recycling network, which converts the “waste” produced during the splitting reaction back into starting material using carbon rearrangements. The major source of variability in the solutions stem from the flexibility and combinatorial nature of the carbon rearrangement chemistry. Our investigation nicely illustrates how vast the chemical network design space is. Even if the reaction chemistry is restricted to only a handful of enzyme functionalities, systematic exploration of the network design space without computational approaches is inefficient and many interesting solutions may be missed.

3.2 Overall Autocatalysis in the Formose Process

Carbohydrates , vulgo sugars, can formally be viewed as polymers of formaldehyde building blocks. The chemical reactivity of sugars is dominated by the two functional groups (1) carbonyl group () and (2) vicinal hydroxyl groups (), making carbohydrates amenable to keto-enol tautomerization, aldol addition, and retro-aldol fragmentation reactions. Due to this intrinsic reactivity, sugars are potentially labile compounds, which readily isomerize into complex mixtures under non-neutral conditions. The formose process, described by Butlerov [20] is one of the scarce examples of an autocatalytic reaction network, which generates complex mixtures of sugars from an aqueous formaldehyde solution under high-pH conditions (for a recent review on autocatalysis see [15]). The formose process has intensively been studied (for a recent review see [26]) for its potential to produce biologically significant carbohydrates from formaldehyde under prebiotic conditions [14]. The time-concentration behaviour of formaldehyde consumption during the formose process shows a linear lag phase, followed by exponential consumption, and a levelling off when the formose processes runs out of formaldehyde supply. This is the point where the clear reaction mixture starts to turn yellow and the generated sugars isomerize to a combinatorially complex mixture of compounds and black tar. The core autocatalytic cycle of the formose process usually found in the literature [65, 26, 69], glycolaldehyde “fixates”, via a series of keto-enol tautomerizations and aldol additions, two formaldehyde molecules (thereby doubling in size) followed by a retro-aldol fragmentation, resulting in two copies of glycolaldehyde. However, experimental evidence exists [65, 49] that this base cycle cannot account for the massive consumption of formaldehyde, after the lag-phase. The reasons is that, under the reaction conditions, enolization of the carbonyl group and aldol addition are much faster than “ketonization” (restoring the carbonyl group) required to close the autocatalytic base cycle. From this experimental evidence the following mechanistic picture arises; fast repeated addition of formaldehyde to enolized keto groups produces larger sugars, draining material from the base cycle and retro-aldol fragmentation of larger sugars replenishes the base cycle (or variants) with short carbohydrates. It is unknown how these higher order cycles are structurally organized around the base autocatalytic cycle and how they are interconnected.

For the formose chemistry we adopt the following naming scheme for molecules; C, where specifies the number of carbon atoms and indicates the position of the double bond. We use for aldehydes, for enol forms, and for ketones. The model follows [14] using only two basic types of reactions, keto-enol tautomerism and aldol reaction. Both are reversible, thus giving the four transformation rules listed in Tab. 5 and the appendix. Starting molecules are formaldehyde () and glycolaldehyde (). The network is expanded to include all derivable molecules with at most 9 carbon atoms, thus reaching a size of 284 molecules and 978 reactions. The computation time was 8 seconds on a Intel® CoreTM i7-6700 CPU (). The ILP for finding pathways contains 33241 variables, of which 28240 are due to transit edges. In this network all reactions are reversible and no transit node merging can thus been performed.

Abbr. Name Description
KeEn Keto-Enol Keto to enol form conversion.
EnKe Enol-Keto Enol to keto form conversion.
AA Aldol addition Merge an enol and a keto.
RAA Retro-aldol add. Split a keto into enol and keto.
Table 5: List of generic transformation rules for modelling the formose chemistry.

In the network we have enumerated overall autocatalytic pathways starting from those of minimum size. Specifically, pathways with the overall reaction

with minimum number of unique reactions used. For the purpose of this enumeration we consider two solutions equivalent if the set of reactions used is the same, thus how many times a reaction is used is ignored.

Maximum #C
Unique reactions used 4 5 6 7 8 9   Sum
6 0 0 1 1 1 2 5
7 0 0 0 0 0 2 2
8 1 5 7 17 37 68 135
9 0 0 12 12 37 69 130
10 0 12 50 274 849
11 0 5 41 190 738
Table 6: Overview of the number of autocatalytic flows in the formose chemistry. Solutions are grouped by the number of unique reactions used, and by the number of carbon atoms in the largest molecule used. We were not able to compute the missing entries due the demand of computation time (more than 200 hours) and memory (more than 64 GB RAM). Four of the smallest pathways can be found in the appendix.

Tab. 6 shows the resulting number of solutions found, grouped by the number of reactions used and the maximum size of molecules involved. The enumeration was split into 6 queries, one for each row of the table, and the combined computation time was approximately 134 hours, using a computing nodes with Intel® XeonTM E5-2665 CPUs (). The enumeration procedure issued in total 1,565,756 optimisation queries to CPLEX, arising from a search trees of a combined size of 3,129,218 nodes. The unknown entries in the table are due to high memory demands (more than ) for single enumeration runs. A selection of pathways can be found in the appendix.

The computational analysis of the chemical space of the formose process reveals that the density of autocatalytic cycles is very high. The majority of the enumerated autocatalytic cycles involve higher sugars (), but conform to the overall reaction of the shortest possible overall autocatalytic cycle, referred to as base cycle. The higher cycles branch off from compounds in the base cycle and merge back to the base cycle further downstream. The resulting structure of interwoven autocatalytic cycles is highly self-referential and shows, with respect to this property, similarities to evolved metabolic networks, where also all tentative waste compounds are recycled by feeding them back into the metabolic network. The massive drain of material by the feeding of the higher autocatalytic cycles considerably slows down the turn-over of the base cycle or even result in breaking of the cycle. Furthermore, even if the higher autocatalytic cycles themselves would not turn over, the base cycle would still be replenished with / compounds generated by retro-aldol fragmentation reactions of longer sugars. This results in a mechanistic scenario where compounds of the base cycle massively fixate formaldehyde in a fast polymerization type process to form longer sugars, which themselves feedback to their starting points on lower levels via fragmentation reactions, refilling these crucial compounds in the base cycle. In that way the fragmentation reactions compensates for the material loss of the autocatalytic base and higher cycles.

4 Discussion

The model presented here, based on integer hyperflows, provides a versatile framework for querying reaction networks for pathways. The restriction to integers, although harder to solve in the general case, has the advantage that the network flow solution can be directly interpreted in terms of mechanisms. Furthermore, questions such as “how many of a specific product can be formed from a limited amount of starting molecules” can easily be formulated and answered in our framework, due to the use of ILP. This approach also enables a targeted enumeration of pathways of interest. Naturally, such systematic enumerations can become computationally much more challenging than finding a single optimal solution. However, on the other hand, it allows for exploring the space of inferred mechanisms.

Autocatalysis, for instance, is frequently discussed as one of the key mechanistic concepts to understand the transition from abiotic to biotic chemistry. Reaction chemistries that “maximize” the emergence of this reaction pattern are considered the most plausible predecessors of the chemistries employed by present-day biochemistry. To identify these potential precursor reaction chemistries, a strict algorithmic approach for the search and identification of functional subnetworks in arbitrary reaction chemistry is indispensable. We applied our technique to the formose process and found an intricate topological structure of cascading autocatalytic cycles that feed upon each other, fitting well to the existing experimental evidence.

The NOG problem is a prime example of the problem setting in engineering cellular metabolisms, a branch of Synthetic Biology. Given starting material, a target molecule, and a set of enzymes, the task is to find a network that implements the desired chemical transformation. Because of lack of efficient computational network design methods, the established workflow rests on directed evolution and screening [18]

. This requires searching for the desired transformation network in metabolic networks of existing organisms, transplanting the identified pathways into the microbial cell factory, followed by improving the performance of the pathway in the alien environment of the host cell. Although impressive examples of this approach have been published

[67, 75, 31], this strategy must fail, if no natural pathway is known that implements the desired transformation.

The pathway modelling framework presented here is implemented as part of a larger software package that includes the methods for automatic generation of reaction networks using graph transformation [3, 10]. As illustrated with both the formose and NOG chemistries, this combination results in an extremely powerful framework for attacking a wide range of problems from Chemistry and Biology. The grounding of the graph transformation approach in well established mathematical theory allows us to provide efficient algorithms for an intermediary level of detail. For instance, it gives us a handle on tracking individual atoms throughout the network, e.g., following specific pathways. The graph transformation formalism also makes it possible to lump reaction sequences into a single overall reaction [7], thereby enabling precise computer-assisted coarse graining operations on reaction networks.

In recent years the interest in CTMs resurfaced in the context of designing alternative networks performing the same function as their natural archetypes [28], as well as in the context of the notion of optimality in biochemical network structure. While the earliest work focused on small, well-characterised pathways such as the pentose phosphate pathway and the TCA cycle [56], recent work extends the original approaches to larger networks including diverse reaction chemistry such as the central carbon metabolism [59] or the fixation pathways [12]. In this literature the concept of CTM is only discussed implicitly and to our knowledge there has been no explicit attempt to formalize CTMs.

Here, we have illustrated how higher-level CTMs can be detected efficiently in given reaction networks using overall (auto)catalytic pathways as an example. This is possible whenever input and output of a motif can be specified as a linear constraint, which includes in particular sets of allowed input molecules and desired output molecules. In combination with the graph transformation formalism our framework is even capable of designing alternative pathways enforcing, e.g., the use of prescribed intermediates.

Last but not least, the use of directed multi-hypergraphs is sufficient to strictly enforce chemical constraints. At the same time they make complex network transformations convenient to understand, since hyperflows correspond to subnetworks that have intuitively understandable graphical representations.


This work was supported in part by the Volkswagen Stiftung proj. no. I/82719, the COST-Action CM1304 “Systems Chemistry”, and by the Danish Council for Independent Research, Natural Sciences, grants DFF-1323-00247 and DFF-7014-00041. It is also supported by the ELSI Origins Network (EON), which is supported by a grant from the John Templeton Foundation. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the John Templeton Foundation.


  • [1] R. K. Ahuja, T. L. Magnanti, and J.B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs, NJ, 1993.
  • [2] J. L. Andersen, C. Flamm, D. Merkle, and P. F. Stadler. Maximizing output and recognizing autocatalysis in chemical reaction networks is NP-complete. J. Systems Chem., 3:1, 2012.
  • [3] Jakob L. Andersen. MedØlDatschgerl (MØD)., 2017.
  • [4] Jakob L. Andersen, Tommy Andersen, Christoph Flamm, Martin M. Hanczyc, Daniel Merkle, and Peter F. Stadler. Navigating the chemical space of hcn polymerization and hydrolysis: Guiding graph grammars by mass spectrometry data. Entropy, 15(10):4066–4083, 2013.
  • [5] Jakob L. Andersen, Christoph Flamm, Martin M. Hanczyc, and Daniel Merkle. Towards an optimal DNA-templated molecular assembler. In ALIFE 14: The Fourteenth Conference on the Synthesis and Simulation of Living Systems, volume 14, pages 557–564, 2014.
  • [6] Jakob L Andersen, Christoph Flamm, Daniel Merkle, and Peter F. Stadler. Inferring chemical reaction patterns using graph grammar rule composition. J. Syst. Chem., 4:4, 2013.
  • [7] Jakob L Andersen, Christoph Flamm, Daniel Merkle, and Peter F Stadler. 50 shades of rule composition: From chemical reactions to higher levels of abstraction. In François Fages and Carla Carla Piazza, editors, Proceedings of the 1st International Conference on Formal Methods in Macro-Biology, volume 8738 of LNCS, pages 117–135. Springer-Verlag, Berlin Heidelberg, 2014.
  • [8] Jakob L. Andersen, Christoph Flamm, Daniel Merkle, and Peter F. Stadler. Generic strategies for chemical space exploration. International Journal of Computational Biology and Drug Design, 7(2/3):225 – 258, 2014.
  • [9] Jakob L. Andersen, Christoph Flamm, Daniel Merkle, and Peter F. Stadler. In silico support for Eschenmoser’s glyoxylate scenario. Isr. J. Chem., 2015.
  • [10] Jakob L. Andersen, Christoph Flamm, Daniel Merkle, and Peter F. Stadler. A software package for chemically inspired graph transformation. In Rachid Echahed and Mark Minas, editors, Graph Transformation: 9th International Conference, ICGT 2016, in Memory of Hartmut Ehrig, Held as Part of STAF 2016, Vienna, Austria, July 5-6, 2016, Proceedings, pages 73–88, Cham, 2016. Springer International Publishing.
  • [11] Jørgen Bang-Jensen and Gregory Z Gutin. Digraphs: Theory, algorithms and applications. Springer Monographs in Mathematics, 2009.
  • [12] Arren Bar-Even, Elad Noor, Nathan E. Lewis, and Ron Milo. Design and analysis of synthetic carbon fixation pathways. Proc Natl Acad Sci, 107:8889–8894, 2010.
  • [13] John E. Beasley and Francisco J. Planes. Recovering metabolic pathways via optimization. Bioinformatics, 23(1):92–98, 2007.
  • [14] S. A. Benner, H.-J. Kim, M.-J. Kim, and A. Ricardo. Planetary organic chemistry and the origins of biomolecules. Cold Spring Harbor Perspect. Biol., 2(7), 2010.
  • [15] Andrew J. Bissette and Stephen P. Fletcher. Mechanisms of Autocatalysis. J Angew Chemie Int Ed, 52(49):12800–12826, 2013.
  • [16] DG Blackmond. An examination of the role of autocatalytic cycles in the chemistry of proposed primordial reactions. J Angew Chemie Int Ed, 48:386–390, 2009.
  • [17] Igor W Bogorad, Tzu-Shyang Lin, and James C Liao. Synthetic non-oxidative glycolysis enables complete carbon conservation. Nature, 502(7473):693–697, 2013.
  • [18] Patrick M. Boyle and Pamela A. Silver. Parts plus pipes: Synthetic biology approaches to metabolic engineering. Metab. Eng., 14:223–232, 2012.
  • [19] Anthony P. Burgard, Shankar Vaidyaraman, and Costas D. Maranas. Minimal reaction sets for escherichia coli metabolism under different growth requirements and uptake environments. Biotechnol. Progr., 17(5):791–797, 2001.
  • [20] Alexandr Mikhaylovich Butlerov. Einiges über die chemische Structur der Körper. Zeitschrift für Chemie, 4:549–560, 1861.
  • [21] Riccardo Cambini, Giorgio Gallo, and Maria Grazia Scutellà. Flows on hypergraphs. Math. Program., 78:195–217, 1997.
  • [22] R Caspi, T Altman, K Dreher, C A Fulcher, P Subhraveti, I M Keseler, A Kothari, M Krummenacker, M Latendresse, L A Mueller, Q Ong, S Paley, A Pujar, A G Shearer, M Travers, D Weerasinghe, P Zhang, and P D Karp. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res., 40:D742–D753, 2012.
  • [23] Florian Centler, Christoph Kaleta, Pietro Speroni di Fenizio, and Peter Dittrich. Computing chemical organizations in biological networks. Bioinformatics, 24:1611–1618, 2008.
  • [24] Jung-Min Choi, Sang-Soo Han, and Hak-Sung Kim. Industrial applications of enzyme biocatalysis: Current status and future aspects. Biotech Adv, 33:1443–1454, 2015.
  • [25] Luis F. de Figueiredo, Adam Podhorski, Angel Rubio, Christoph Kaleta, John E. Beasley, Stefan Schuster, and Francisco J. Planes. Computing the shortest elementary flux modes in genome-scale metabolic networks. Bioinformatics, 25(23):3158–3165, 2009.
  • [26] Irina V. Delidovich, Alexandr N. Simonov, Oxana P. Taran, and Valentin N. Parmon. Catalytic Formation of Monosaccharides: From the Formose Reaction towards Selective Synthesis. ChemSusChem, 7(7):1833–1846, 2014.
  • [27] William C DeLoache, Zachary N Russ, Lauren Narcross, Andrew M Gonzales, Vincent J J Martin, and John E Dueber. An enzyme-coupled biosensor enables (s)-reticuline production in yeast from glucose. Nature Chem Biol, 11:465–471, 2015.
  • [28] Oliver Ebenhöh and Reinhart Heinrich. Stoichiometric design of metabolic networks: multifunctionality, clusters, optimization, weak and strong robustness. Bull. Math. Biol., 65:323–357, 2003.
  • [29] Nina Fedoroff and Walter Fontana. Small numbers of big molecules. Science, 297:1129–1130, 2002.
  • [30] D A Fell and J R Small. Fat synthesis in adipose tissue. an examination of stoichiometric constraints. Biochem. J., 238:781–786, 1986.
  • [31] Elena Fossati, Lauren Narcross, Andrew Ekins, Jean-Pierre Falgueyret, and Vincent J. J. Martin. Synthesis of morphinan alkaloids in saccharomyces cerevisiae. PLoS One, 10(4):e0124459, 2015.
  • [32] G. Gallo, C Gentile, D Pretolani, and G. Rago. Max Horn SAT and the minimum cut problem in directed hypergraphs. Math. Programming, 80:213–237, 1998.
  • [33] Giorgio Gallo, Giustino Longo, Stefano Pallottino, and Sang Nguyen. Directed hypergraphs and applications. Discrete Appl. Math., 42(2–3):177–201, 1993.
  • [34] Eduardo García-Junceda, Iván Lavandera, Dörte Rother, and Joerg H Schrittwieser. (chemo)enzymatic cascades – nature’s synthetic strategy transferred to the laboratory. J Mol Cat B: Enzymatic, 114:1–6, 2015.
  • [35] D T Gillespie. Stochastic simulation of chemical kinetics. Annu. Rev. Phys. Chem., 58:35–55, 2007.
  • [36] Purnananda Guptasarma. Does replication-induced transcription regulate synthesis of the myriad low copy number proteins of escherichia coli? Bioessays, 17(11):987–997, 1995.
  • [37] Thomas Handorf, Oliver Ebenhöh, and Reinhart Heinrich. Expanding metabolic networks: Scopes of compounds, robustness, and evolution. J. Mol. Evol., 61:498–512, 2005. 10.1007/s00239-005-0027-1.
  • [38] Gemma L Holliday, Claudia Andreini, Julia D Fischer, Syed Asad Rahman, Daniel E Almonacid, Sophie T Williams, and William R Pearson. MACiE: exploring the diversity of biochemical reactions. Nucleic Acids Res., 40:D783–D789, 2012.
  • [39] Gemma L. Holliday, Gail J. Bartlett, Daniel E. Almonacid, Noel M. O’Boyle, Peter Murray-Rust, Janet M. Thornton, and John B. O. Mitchell. MACiE: a database of enzyme reaction mechanisms. Bioinformatics, 21:4315–4316, 2005.
  • [40] Wim Hordijk. Autocatalytic Sets: From the Origin of Life to the Economy. BioSci, 63(11):877–881, 2013.
  • [41] Sudhakar Jonnalagadda and Rajagopalan Srinivasan. An efficient graph theory based method to identify every minimal reaction set in a metabolic network. BMC Syst. Biol., 8(1):28, Mar 2014.
  • [42] C Kaleta, F Centler, and P Dittrich. Analyzing molecular reaction networks: from pathways to chemical organizations. Mol. Biotechnol., 34:117–123, 2006.
  • [43] Christoph Kaleta, Luís Filipe de Figueiredo, and Stefan Schuster. Can the whole be less than the sum of its parts? pathway analysis in genome-scale metabolic networks using elementary flux patterns. Genome Res., 19(10):1872–1883, 2009.
  • [44] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. In

    Proceedings of the sixteenth annual ACM symposium on Theory of computing

    , pages 302–311. ACM, 1984.
  • [45] R. M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85–103, NY, 1972. Plenum Press.
  • [46] Kenneth J Kauffman, Purusharth Prakash, and Jeremy S Edwards. Advances in flux balance analysis. Current Opinion Biotech., 14:491–496, 2003.
  • [47] S. A. Kauffman. Autocatalytic sets of proteins. J. Theor. Biol., 119(1):1 – 24, 1986.
  • [48] Leonid G Khachiyan. Polynomial algorithms in linear programming. USSR Computational Mathematics and Mathematical Physics, 20(1):53–72, 1980.
  • [49] Hyo-Joong Kim, Alonso Ricardo, Heshan I. Illangkoon, Myong Jung Kim, Matthew A. Carrigan, Fabianne Frye, and Steven A. Benner. Synthesis of Carbohydrates in Mineral-Guided Prebiotic Cycles. J. Am. Chem. Soc., 133(24):9457–9468, 2011.
  • [50] S. Klamt and J. Stelling. Combinatorial complexity of pathway analysis in metabolic networks. Mol. Biol. Rep., 29:233–236, 2002.
  • [51] Steffen Klamt and Jörg Stelling. Two approaches for metabolic pathway analysis? Trends Biotechnol., 21:64–69, 2003.
  • [52] Roelco J Kleijn, Wouter A van Winden, Walter M van Gulik, and Joseph J Heijnen. Reviciting the C-label disribution of the non-oxidative branch of pentose phosphate pathway based upon kinetic and genetic evidence. FEBS J., 272:4970–4982, 2005.
  • [53] Peter Kreyssig, Christian Wozar, Stephan Peter, Tomás Veloz, Bashar Ibrahim, and Peter Dittrich. Effects of small particle numbers on long-term behaviour in discrete biochemical systems. Bioinformatics, 30:i475–i481, 2014.
  • [54] Ádám Kun, Balázs Papp, and Eörs Szathmáry. Computational identification of obligatorily autocatalytic replicators embedded in metabolic networks. Genome Biol., 9:R51, 2008.
  • [55] Sangbum Lee, Chan Phalakornkule, Michael M Domach, and Ignacio E Grossmann. Recursive milp model for finding all the alternate optima in lp models for metabolic networks. Computers & Chemical Engineering, 24(2):711–716, 2000.
  • [56] Enrique Meléndez-Hevia, Thomas G. Waddell, and Francisco Montero. Optimization of metabolism: The evolution of metabolic pathways towards simplicity through the game of the pentose phosphate cycle. J. Theor. Biol., 166:201–220, 1994.
  • [57] Ron Milo and Rob Phillips. Cell Biology by the Numbers. Garland Science, New York, NY, 2015.
  • [58] Jens Nielsen. Yeast cell factories on the horizon. Science, 349:1050–1051, 2015.
  • [59] Elad Noor, Eran Eden, Ron Milo, and Uri Alon. Central carbon metabolism as a minimal biochemical walk between precursors for biomass and energy. Mol. Cell, 39:809–820, 2010.
  • [60] Jeffrey D Orth, Ines Thiele, and Bernhard Ø Palsson. What is flux balance analysis? Nat. Biotech., 28:245–248, 2010.
  • [61] Jason A Papin, Joerg Stelling, Nathan D Price, Steffen Klamt, Stefan Schuster, and Bernhard O Palsson. Comparison of network-based pathway analysis methods. Trends Biotechnol., 22(8):400–405, 2004.
  • [62] F. J. Planes and J. E. Beasley. Path finding approaches and metabolic pathways. Discrete Appl. Math., 157(10):2244 – 2256, 2009. Networks in Computational Biology.
  • [63] Francisco J. Planes and John E. Beasley. A critical examination of stoichiometric and path-finding approaches to metabolic pathways. Briefings Bioinf., 9(5):422–436, 2008.
  • [64] Raphaël Plasson, Axel Brandenburg, Ludovic Jullien, and Hugues Bersini. Autocatalyses. J. Phys. Chem. A, 115(28):8073–8085, 2011.
  • [65] Alonso Ricardo, Fabianne Frye, Matthew A. Carrigan, Jeremiah D. Tipton, David H. Powell, and Steven A. Benner. 2-Hydroxymethylboronate as a Reagent To Detect Carbohydrates: Application to the Analysis of the Formose Reaction. J. Org. Chem., 71:9503–9505, 2006.
  • [66] Emanuele Ricca, Birgit Brucher, and Joerg H Schrittwieser. Multi-enzymatic cascade reactions: Overview and perspectives. Adv Synth Catal 353, 2239 – 2262, 353:2239–2262, 2011.
  • [67] Dae-Kyun Ro, Eric M. Paradise, Mario Ouellet, Karl J. Fisher, Karyn L. Newman, John M. Ndungu, Kimberly A. Ho, Rachel A. Eachus, Timothy S. Ham, James Kirby, Michelle C. Y. Chang, Sydnor T. Withers, Yoichiro Shiba, Richmond Sarpong, and Jay D. Keasling. Production of the antimalarial drug precursor artemisinic acid in engineered yeast. Nature, 440:940–943, 2006.
  • [68] Annika Röhl and Alexander Bockmayr. A mixed-integer linear programming approach to the reduction of genome-scale metabolic networks. BMC Bioinf., 18(1):2, Jan 2017.
  • [69] Kepa Ruiz-Mirazo, Carlos Briones, and Andrés de la Escosura. Prebiotic systems chemistry: New perspectives for the origins of life. Chem. Rev., 114:285–366, 2014.
  • [70] Joanne M. Savinell and Bernhard Ø Palsson. Network analysis of intermediary metabolism using linear optimization. I. Development of mathematical formalism. J. Theor. Biol., 154:421–454, 1992.
  • [71] Christophe H Schilling, David Letscher, and Bernhard Ø Palsson. Theory for the systemic definition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective. J. Theor. Biol., 203(3):229–248, 2000.
  • [72] S Schuster and C Hilgetag. On elementary flux modes in biochemical reaction systems at steady state. J. Biol. Syst., 2(02):165–182, 1994.
  • [73] Anna Stincone, Alessandro Prigione, Thorsten Cramer, Mirjam M C Wamelink, Kate Campbell, Eric Cheung, Viridiana Olin-Sandoval, Nana-Maria Grüning, Antje Krüger, Mohammad Tauqeer Alam, Markus A Keller, Michael Breitenbach, Kevin M Brindle, Joshua D Rabinowitz, and Markus Ralser. The return of metabolism: biochemistry and physiology of the pentose phosphate pathway. Biol. Rev., 90:927–963, 2015.
  • [74] Garret Swart. Finding the convex hull facet by facet. Journal of Algorithms, 6(1):17–48, 1985.
  • [75] Kate Thodey, Stephanie Galanie, and Christina D Smolke. A microbial biomanufacturing platform for natural and semisynthetic opioids. Nature Chem Biol, 10:837–844, 2014.
  • [76] Andrew V Zeigarnik. On hypercycles and hypercircuits in hypergraphs. Discrete Mathematical Chemistry, 51:377–383, 2000.

Appendix A Properties of the Expanded Hypergraph

This section is the full version, including proofs, of the corresponding section in the paper.

a.1 Size of the Expanded Network

For a directed multi-hypergraph , let the size of it be denoted by . Note that if the hypergraph is seen as a bipartite normal graph, then this size corresponds to the number of vertices and edges.


The size of the extended network and the expanded network is polynomial in the size of the original network.


The size of the extended network is , as two half-edges are added to each vertex. For the expanded network, , the size depends on the in- and out-degree of the vertices in the extended network. Let denote the in-degree of , and the out-degree. Note that the degree counts the number of unique incident edges, so for the size contribution of to is still only 1. Then the size of the expanded network is

where the inequality stems from the fact that at most all vertices are in all head and tail sets, in the original network. ∎

a.2 Translation of Flow


A feasible flow on can be converted into an equivalent feasible flow in , with: , for all .


If is feasible, Eq. (7) holds for all . By the definition of , we can say that Eq. (7) holds for all for all . Recall that all transit edges have singleton heads and tails, and . Thus, by addition of Eq. (7) in each we get that :

Here, the flow along each transit edge is first added and then subtracted again, so we can simplify the expression to, :

Using the definition of and (Eq. (5) and (6)) one verifies that these relaxed constraints are exactly those of Eq. (1), i.e., the constraints on flows in . ∎


Let be a feasible flow on . It can then be decided in polynomial time, in the size of , if a feasible flow in exists such that for all . If it exists it can be computed in polynomial time.


The proof proceeds by a reduction to finding a feasible flow in bipartite normal directed graphs, with balance constraints. We refer to [1, 11] for a definition of this problem. Recall that the edges of are translated directly into a subset of the edges in , and we as such are tasked with finding a feasible flow on all the transit edges, which can be decomposed into finding a feasible flow for each expanded vertex independently. Let , then the hypergraph only contains edges with singleton head and tail multisets. It is therefore a normal directed, bipartite graph. We then define the flow balance function as and . Using the natural lower bound of flow and infinite upper bound finally gives us the complete specification. A feasible integer flow, if one exists, can be found in polynomial time in the size of the network[1, 11]. ∎

Appendix B Reduction from Independent Set to Maximum Output

This reduction is intended to serve as an example for the section on comparison to existing methods, in the context of LP relaxation. For a wider set of proofs of the computational complexity of integer hyperflow problems, see [2], where also reductions to bounded hyperedge degree networks are described. We here reduce between the optimisation versions of the problems, though it can easily be adopted to the decision versions. In order to easily compare the reduction to the LP relaxation of the Independent Set problem, we will also state the ILP formulation of the problem.

Independent Set: Given an undirected graph , find a set of vertices , of maximum cardinality, such that no edge in the graph is between vertices of , i.e., .

Maximum Output: Given

  • a directed multi-hypergraph ,

  • a special output vertex ,

  • and lower and upper bounds on flow ,

find an integer hyperflow , with maximum output of , i.e., .

b.1 ILP Formulation of Independent Set

Let be the input graph.

The resulting independent set is the vertices with .

b.2 Reduction to Maximum Output

Let be the input graphs for the Independent Set problem. We will first construct a directed multi-hypergraph :

We thus construct a vertex for each edge in and an extra “goal vertex”. The hyperedges correspond to the vertices of , with the goal vertex as the head and the rest of the vertices corresponding to the incident edges of as the tail.

The hypergraph is then I/O-extended to , and we define the lower bound on flow to be 0 on all hyperedges. We set the upper bound to 0 for the input flow to the goal vertex, and 1 for the input to the remaining vertices. All other upper bounds are left infinite.

After maximisation of the output flow of the goal vertex we can construct the independent set as all the vertices where the corresponding hyperedge has .

Appendix C Molecules

The following sections contains tables of the molecule abbreviations used in the main text. Some molecules are modelled using non-chemical vertex labels that represent unimportant substructures. For those molecules we explicitly visualise the labelled graph, while for the strictly chemical graphs we show a SMILES string.

c.1 Molecule Abbreviations Rules for NOG


Abbreviations Name SMILES/Visualisation
AcCoA Acetyl-CoA [collapse hydrogens=false][scale=0.6, align=c, trim=0 -5pt 0 -5pt]CC(=O)S[CoA]
AcP Acetyl phosphate OP(O)(=O)OC(=O)C
Carbon dioxide O=C=O
CoA Coenzyme A [collapse hydrogens=false][scale=0.6, align=c, trim=0 -5pt 0 -5pt][CoA]S
DHAP Dihydroxyacetone phosphate OP(O)(=O)OCC(=O)CO
E4P Erythrose 4-phosphate OP(O)(=O)OCC(O)C(O)C=O
F6P Fructose 6-phosphate OCC(=O)C(O)C(O)C(O)COP(=O)(O)O
FBP Fructose 1,6-bisphosphate OC(COP(O)(O)=O)C(O)C(O)C(COP(O)(O)=O)=O
G3P Glyceraldehyde 3-phosphate C(C(C=O)O)OP(=O)(O)O
Water O
Phosphate O=P(O)(O)O
R5P Ribose 5-phosphate OP(O)(=O)OCC(O)C(O)C(O)C=O
Ru5P, X5P Ribulose 5-phosphate, Xylulose 5-phosphate OCC(=O)C(O)C(O)COP(=O)(O)O
S7P Sedoheptulose 7-phosphate O=P(O)(OCC(O)C(O)C(O)C(O)C(=O)CO)O
SBP Sedoheptulose 1,7-bisphosphate OC(COP(O)(O)=O)C(O)C(O)C(O)C(COP(O)(O)=O)=O

Molecule Abbreviations for Formose

Abbreviations Name SMILES
Formaldehyde C=O
Glycolaldehyde OCC=O

Appendix D Transformation Rules

The following sections contains visualisations of all graph transformation rules used to generated the analysed networks. Each rule is annotated with references to the data its modelling is based on. Most of these reference are to entries in the MACiE (Mechanism, Annotation, and Classification in Enzymes) database [39, 38].

Transformation Rules for NOG



MACiE entry 0052.