This paper covers three main topics which support, motivate, and reinforce each other: reverse automatic differentiation (AD), string diagrams, and (hyper)graph rewriting.
AD is an established technique for evaluating the derivative of a function specified by a computer program, a particularly challenging exercise when the program contains higher-order sub-terms. This technique has recently come to prominence due to its important role in algorithms for machine learning (DBLP:journals/jmlr/BaydinPRS17). We focus in particular on the influential algorithm defined in (DBLP:journals/toplas/PearlmutterS08), which lies at the foundation of many practical implementations of AD. The main novel contribution of our paper is to prove, for the first time, the soundness of this particular style of AD algorithm.
String diagrams are a formal graphical syntax used in the representation of morphisms in monoidal categories (selinger2010survey) which is finding an increasing number of applications in a wide range of mathematical, physical, and engineering domains. We contribute to the development of string diagrams by formulating a new hierarchical string diagram calculus, with associated equations, suitable for the representation of closed monoidal (and cartesian closed) structures. This innovation is, as we shall see in the paper, warranted: the hierarchical string diagrammatic syntax allows for a more intelligible formulation of a complex algorithm and, most importantly, a new style of inductive argument which leads to a relatively simple proof of soundness.
Finally, hierarchical hypergraphs are given as a concrete and efficient representation of hierarchical string diagrams, which paves the way towards an efficient and effective implementation of AD as graph rewriting in the well-established framework of double-pushout (DPO) rewriting (DBLP:conf/mfcs/EhrigK76). Moreover, we identify a class of hierarchical hypergraphs, which we call hypernets, that are a sound and complete representation of the hierarchical string diagram calculus. This is the third and final contribution of our paper.
2. Higher-order string diagrams
String diagrams are a convenient alternative notation for constructing morphisms, in particular in (strict) monoidal categories. In this paper we largely build on the syntax proposed in (DBLP:conf/csl/Mellies06), with only a few cosmetic changes aimed at making higher-order concepts more perspicuous. String diagrams in this work are to be read from top to bottom.
2.1. Functorial string diagrams
We start with the basic language of categories, ranged over by . This language consists of a collection of objects ranged over by and two families of terminal symbols, identities , represented as an -labelled vertical stem pics/tikzit/components/id-typed , and morphisms , represented by labelled boxes with an -labelled top stem (which we sometimes call input or operand) and a -labelled bottom stem (which we call output or result) pics/tikzit/components/morphism-typed . We may distinguish (families of) terminal symbols in the diagram language with particular geometrical shapes instead of labelled boxes, much in the way we have artificially distinguished identities from other morphisms.
Terms, ranged over by , are created using composition. Given , and we write as the stacking of the diagrams for and . Since the output of must match the input of we connect the corresponding stems, to give a graph-like appearance to the string diagram pics/tikzit/components/sequential-composition-typed . We enforce two properties of composition familiar from category theory. First, composition is associative, meaning . This identification is subsumed by the diagrammatic notation. Second, we require the identity axiom . Diagrammatically, this means the lengths of the stems of a diagram can be lengthened or shortened without ambiguity.
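The two laws of composition can be checked on a small executable model. The following is a minimal sketch, not part of the calculus itself: the class `Morphism`, the method `then`, and the object names `"A"` to `"D"` are hypothetical, and morphisms are modelled as plain Python functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Morphism:
    """A morphism with a source object, a target object and an underlying function."""
    src: str
    tgt: str
    fn: Callable

    def then(self, other: "Morphism") -> "Morphism":
        """Sequential composition: stack `self` on top of `other`."""
        if self.tgt != other.src:
            # The output stem of `self` must match the input stem of `other`.
            raise TypeError(f"cannot connect stem {self.tgt} to stem {other.src}")
        return Morphism(self.src, other.tgt, lambda x: other.fn(self.fn(x)))


def identity(obj: str) -> Morphism:
    """The identity morphism on `obj` (a bare stem)."""
    return Morphism(obj, obj, lambda x: x)


# Three composable morphisms, chosen only for illustration.
f = Morphism("A", "B", lambda x: x + 1)
g = Morphism("B", "C", lambda x: 2 * x)
h = Morphism("C", "D", lambda x: x - 3)

left = f.then(g).then(h)    # (f ; g) ; h
right = f.then(g.then(h))   # f ; (g ; h)
padded = identity("A").then(f).then(identity("B"))  # lengthened stems
```

Both bracketings compute the same function, and `padded` behaves exactly like `f`, which is the sense in which stems can be lengthened or shortened without ambiguity.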
We extend our string diagram language with labelled frames which indicate mappings between morphisms of different categories. The application of a mapping to a morphism, as a string diagram, is indicated by drawing an -labelled frame around the morphism and modifying the stems of the diagram as appropriate, as seen in Fig. 1. Note that in this diagram the stems and morphisms inside the frame are in a different color from the frame and the stems outside the frame. This is an indication that objects and morphisms belong to potentially distinct categories. When the map goes from a category to itself, we may use the same color inside and out of the frame, but often leave the frame itself a different color to emphasize the mapping.
Such morphism mapping constitutes a functor if it satisfies the following properties. First, there must be a mapping on objects, which we also denote by common abuse of notation, such that and for all in the source category. Second, this mapping must respect basic categorical structures, expressed in the language of string diagrams in Fig. 2. We use for the identity functor. Given two functors with matching domains and codomains we write and in the sequel.
The diagrammatic notation can be generalised to bifunctors in the obvious way, by drawing two side-by-side boxes for the arguments. One bifunctor that plays a special role in string diagrams is the tensor product or monoidal product, in particular when it is strict. The tensor is represented diagrammatically as:
pics/tikzit/pam/bifunctor = pics/tikzit/pam/parallel-comp-strict = pics/tikzit/pam/parallel-comp
The diagram above contains three representations. In the first one we can see tensor as a bifunctor, with the two separate boxes indicating the two arguments of the bifunctor. The second one is special notation for the tensor, essentially hiding the functorial box and using a graphical convention (the horizontal line) to represent the ‘unravelling’ of the tensored-labelled stem into components. Finally, the third one is special notation for strict monoidal tensor, in which the tensor is represented as the list of its components . The strict diagram absorbs the associativity isomorphisms and makes the tensor associative on the nose:
pics/tikzit/pam/left-assoc = pics/tikzit/pam/triple-assoc = pics/tikzit/pam/right-assoc
Henceforth we will work in the strict setting, but it will sometimes be convenient to de-strictify a diagram and group individual stems into stems with tensor types. Coherence (and strictness) ensure that such de-strictifications can always be performed unambiguously.
In the strict setting we also have special notation for the unit of the tensor, which we represent as the empty list; identity on is represented as empty space. It is immediate then, diagrammatically, that .
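The list-based reading of the strict tensor can be illustrated concretely. The sketch below is only an illustration under stated assumptions: the names `Diagram`, `tensor`, `UNIT` and the base type `"R"` are hypothetical, types are modelled as Python tuples of names, and diagrams as functions on tuples of values.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Types = Tuple[str, ...]  # a tensor of objects, written as the list of its components


@dataclass(frozen=True)
class Diagram:
    src: Types
    tgt: Types
    fn: Callable[[tuple], tuple]  # maps a tuple of inputs to a tuple of outputs


def identity(types: Types) -> Diagram:
    return Diagram(types, types, lambda xs: xs)


def tensor(f: Diagram, g: Diagram) -> Diagram:
    """Parallel composition: f drawn to the left of g."""
    n = len(f.src)
    return Diagram(f.src + g.src, f.tgt + g.tgt,
                   lambda xs: f.fn(xs[:n]) + g.fn(xs[n:]))


UNIT = identity(())  # identity on the unit of the tensor: empty space

# Illustrative primitive diagrams over a single base type "R".
neg = Diagram(("R",), ("R",), lambda xs: (-xs[0],))
add = Diagram(("R", "R"), ("R",), lambda xs: (xs[0] + xs[1],))
dup = Diagram(("R",), ("R", "R"), lambda xs: (xs[0], xs[0]))

left = tensor(tensor(neg, add), dup)   # (neg ⊗ add) ⊗ dup
right = tensor(neg, tensor(add, dup))  # neg ⊗ (add ⊗ dup)
```

Because tuple concatenation is associative and has the empty tuple as unit, `left` and `right` have identical types and behaviour, and tensoring with `UNIT` changes nothing. This mirrors how the strict diagrams absorb the associativity and unit isomorphisms.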
We further extend the string diagram with the concept of natural transformation between functors with the same domains and same codomains. Natural transformations are object-indexed families of morphisms written as (or just ) which obey the following family of axioms, expressed in the language of string diagrams as:
pics/tikzit/pam/natl-trans-pre = pics/tikzit/pam/natl-trans-post
One particularly interesting example of natural transformation is symmetry, written as , for which we use the special geometric shape of two crossing wires. The fact that it is a natural transformation immediately implies that
Symmetry is also an involution, i.e. .
For functors and such that natural transformations (called the counit), (called the unit) exist, they form an adjunction if and only if they satisfy the following family of axioms:
For all pics/tikzit/pam/counit and pics/tikzit/pam/unit we have that pics/tikzit/pam/unitcounit2 and pics/tikzit/pam/unitcounit1 .
In this situation, we say is a left adjoint and is the right adjoint.
We adopt the convention of writing the counit of an adjunction as a downward pointing semicircle, the unit as an upward pointing semicircle, and omitting the label when the map is clear from context. Note that (DBLP:conf/csl/Mellies06) does not discuss adjunctions specifically, although the streamlining of the notation and calculations with adjunctions is a prime benefit of the string diagram notation, and no additional technical content is required.
2.2. String diagrams for monoidal-closed and cartesian-closed categories
Monoidal closed categories and cartesian closed categories are categorical models for the linear and simply-typed lambda calculus, respectively. A monoidal closed category arises if for every object in the category, the (endo)functor has a right adjoint . Diagrammatically, we depict these functors as:
pics/tikzit/pam/right-product and pics/tikzit/pam/exponential
Instantiated to the families of functors above, the naturality and adjunction equations are expressed in string diagrams as in Fig. 3. The counit of the adjunction is normally called eval, and we call the unit coeval for the sake of symmetry in terminology and by analogy with compact-closed categories.
To further expand our diagrammatic language to cartesian closed categories, one easy way is to add natural transformations (contraction) and (weakening) such that (heunen2012lectures). We represent both of these natural transformations with a black dot, disambiguated by the quantity of results. The monoid equations are pics/tikzit/equations/copy-monoid . Copying and discarding are both consequences of naturality, i.e. pics/tikzit/equations/natural-copy and pics/tikzit/equations/natural-discard , respectively.
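The naturality of contraction and weakening described above has a direct computational reading. The following sketch assumes a set-theoretic model in which morphisms are Python functions; the names `copy`, `discard` and the sample morphism `f` are hypothetical.

```python
def copy(x):
    """Contraction: one operand, two results."""
    return (x, x)


def discard(x):
    """Weakening: one operand, no results."""
    return ()


def f(x):
    """An arbitrary morphism, chosen only for illustration."""
    return 3 * x + 1


x = 4

# Natural copy: copying the result of f equals copying the operand
# and applying f to each copy.
lhs_copy = copy(f(x))
rhs_copy = tuple(f(v) for v in copy(x))

# Natural discard: discarding the result of f equals discarding the operand.
lhs_discard = discard(f(x))
rhs_discard = discard(x)
```

In this model the two sides of each equation agree for any total, side-effect-free `f`, which is what makes copying and discarding of whole sub-diagrams admissible.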
Here we have presented adjunctions with unit and counit natural transformations. An equivalent description of adjunctions involves a natural bijection between sets of morphisms. In the case of monoidal or cartesian closed categories, this bijection is between and . This bijection is known as “currying”, and is a more germane presentation for the lambda calculus. We define abstraction, the composition of the unit of the adjunction with the functorial box for , as syntactic sugar denoted by a plain box with rounded corners.
This structure for abstraction gives our notion of a hierarchical string diagram, which is to say a string diagram which may contain other string diagrams in these boxes.
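The currying bijection mentioned above can be sketched in a few lines. This is an illustration only: the names `curry`, `uncurry` and `eval_` are hypothetical, and the order of the two arguments is an assumption, since it depends on which side the tensor is taken.

```python
def curry(f):
    """Send a two-argument map f(a, x) to its transpose a -> (x -> f(a, x))."""
    return lambda a: lambda x: f(a, x)


def uncurry(g):
    """Inverse direction of the bijection."""
    return lambda a, x: g(a)(x)


def eval_(fn, x):
    """The counit of the adjunction: apply a function to an argument."""
    return fn(x)


def sample(a, x):
    """A sample morphism out of a product, chosen only for illustration."""
    return 10 * a + x


abstracted = curry(sample)     # plays the role of an abstraction box
round_trip = uncurry(abstracted)
```

Evaluating `eval_(abstracted(2), 3)` gives the same value as `sample(2, 3)`, which is the computational content of the adjunction equation relating abstraction and eval.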
Terms written as string diagrams can be presented in a particular form, which will turn out to lead to some useful insights:
Definition 2.1 (Foliations).
A foliation is a string diagram written as the sequential composition of a list of diagrams called leafs. A singleton leaf is a diagram consisting of a non-identity atomic string diagram (symmetry, evaluation, operation, contraction, or weakening) or an abstraction, tensored with any number of identities. A maximally sequential foliation is a foliation comprising only singleton leafs. A maximally sequential hierarchical foliation is a maximally sequential foliation which is either abstraction free, or in which all abstracted diagrams are also maximally sequential hierarchical foliations.
For instance, if are not identities then the maximally sequential foliations of pics/tikzit/components/foliation-fxg are pics/tikzit/components/foliation-fg and pics/tikzit/components/foliation-gf . The following is an obvious generalisation of a folklore theorem about monoidal categories.
Lemma 2.2.
Any hierarchical string diagram can be written as a (non-unique) maximally sequential hierarchical foliation.
The proof is straightforward. The graphical intuition which underlies the proof is that whenever two morphisms are “level” in a diagram one of them can be “shifted” using identities, then tensors and compositions can be reorganised using the functoriality of the tensor.
Foliations are convenient because syntactic transformations can be presented recursively on the foliation. This spares us the need to define ‘big’ rules for sequential and tensorial composition. Instead only ‘small’ rules for composing a term with a singleton leaf are required. This makes transformations easier to specify, and also makes for simpler inductive proofs, using the foliation as a list.
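To make the "foliation as a list" idea concrete, the sketch below interprets a maximally sequential foliation one singleton leaf at a time. The encoding is an assumption made for this example: a leaf is a triple of a wire offset, a number of consumed wires and a node function, and the surrounding identities are implicit.

```python
from typing import Callable, List, Tuple

# A singleton leaf: one non-identity node placed at wire offset `pos`,
# consuming `n_in` wires; all other wires pass through as identities.
Leaf = Tuple[int, int, Callable[..., tuple]]


def run_leaf(leaf: Leaf, wires: tuple) -> tuple:
    pos, n_in, node = leaf
    return wires[:pos] + node(*wires[pos:pos + n_in]) + wires[pos + n_in:]


def run_foliation(leaves: List[Leaf], wires: tuple) -> tuple:
    """Interpret a foliation by a single pass over its list of leaves."""
    for leaf in leaves:
        wires = run_leaf(leaf, wires)
    return wires


def f(x):
    return (x + 1,)


def g(y):
    return (2 * y,)


# The two maximally sequential foliations of the tensor of f and g.
f_then_g = [(0, 1, f), (1, 1, g)]
g_then_f = [(1, 1, g), (0, 1, f)]
```

Both foliations send the wires `(3, 5)` to `(4, 10)`. Only the "small" rule `run_leaf` had to be written; sequential and parallel composition are recovered from the list structure.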
2.3. Explicit substitution in string diagrams
In this section we illustrate the use of hierarchical string diagrams to represent the simply typed lambda calculus with explicit substitutions. This is an interesting example in its own right, but more importantly it sets the scene for the next section, where we define an automatic differentiation algorithm. The explicit substitutions play an essential role, as they give us a handle on managing closures, which the AD algorithm requires.
Hierarchical string diagrams with rules for copying and discarding are a ready-made graphical syntax for the lambda calculus with explicit substitutions (DBLP:journals/jfp/AbadiCCL91). Syntactically, these calculi fall mainly in two categories, those using deBruijn indices or those using named variables. The former have better formal properties and their formalisation can be mechanised, but are not a very human-readable notation. The latter are easier to read but have some subtle failures of alpha equivalence. Formalising alpha equivalence for calculi of explicit substitution is a somewhat tricky problem, the solution of which leads back to rather intricate notations (DBLP:journals/iandc/FernandezG07). String diagrams thus seem like an improved syntax for explicit substitutions, as they are both formal and, we contend, rather readable. The graphical notation is variable-free therefore alpha equivalence is not an issue, and other equational properties are also rendered obvious by the diagrammatic representation.
We pick for comparison a presentation of the calculus of explicit substitutions with named variables (DBLP:conf/csl/Kesner07), leaving aside alpha equivalence. An es-term is inductively defined as a variable , an application , an abstraction or a substituted term , where and are es-terms and a variable. The terms and both bind in . The set of free variables of a term , denoted is defined as usual.
Note that the syntactic object , an explicit substitution, is not a term because of the way variable is bound. By contrast, in deBruijn formulations of the lambda calculus with explicit substitutions, substitutions are terms.
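The grammar of es-terms and the usual definition of free variables can be written down as a small data type. This is a sketch under assumptions: the constructor names `Var`, `App`, `Lam`, `Sub` and the function `fv` are hypothetical, and alpha equivalence is ignored, as in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class App:
    fun: "Term"
    arg: "Term"


@dataclass(frozen=True)
class Lam:
    var: str
    body: "Term"


@dataclass(frozen=True)
class Sub:
    """A substituted term: `body` with `var` explicitly substituted by `arg`."""
    body: "Term"
    var: str
    arg: "Term"


Term = Union[Var, App, Lam, Sub]


def fv(t: Term) -> FrozenSet[str]:
    """Free variables; abstraction and substitution both bind `var` in `body`."""
    if isinstance(t, Var):
        return frozenset({t.name})
    if isinstance(t, App):
        return fv(t.fun) | fv(t.arg)
    if isinstance(t, Lam):
        return fv(t.body) - {t.var}
    if isinstance(t, Sub):
        return (fv(t.body) - {t.var}) | fv(t.arg)
    raise TypeError(f"not an es-term: {t!r}")


# (lambda x. x y) with y explicitly substituted by z
example = Sub(Lam("x", App(Var("x"), Var("y"))), "y", Var("z"))
```

Here `fv(example)` is the singleton containing `z`: the variable `y` is bound by the substitution and `x` by the abstraction.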
The following key equations and reduction rules are considered:
We qualify these as ‘key’ because for more precise resource analysis the (Lamb) and (Comp) rewrites usually are given ‘linear’ forms depending on whether the substituted variable occurs in the term or not. For our purposes here we can assume the linear versions are subsumed by the general version and the (Gc) axiom.
The string diagram interpretation of es is given in Fig. 4. The proof of soundness is a straightforward exercise. Some of the axioms are simply instances of the identity (Var), associativity of composition (Comp), and naturality of the contraction (App) or weakening (Gc) — we leave them as an exercise. The two non-trivial axioms and their proofs are in Fig. 5.
Finally, the CE structural rule is also rather interesting, as it requires proving that pics/tikzit/ex-sub/ce-lhs = pics/tikzit/ex-sub/ce-rhs . The proof is an immediate consequence of the functoriality of the tensor and of the identity law. What is interesting is that the two diagrams look very similar as graphs. Indeed, the intuition that diagrams represented by isomorphic graphs denote equal morphisms will be made rigorous in Sec. 4.
To emphasise the syntactic nature of the transformation we will call the objects the types of the diagrams. Since we are situated in a strict-monoidal setting we will write a composite tensor of objects as a list of types. We write a generic typed term in the language of string diagrams as pics/tikzit/components/op-generic .
3. A graphical AD algorithm
This section presents the main technical result of our paper: the definition, and proof of soundness, of an algorithm for performing reverse-mode automatic differentiation on hierarchical string diagrams. The algorithm can be considered a simplified version of that presented in (DBLP:journals/toplas/PearlmutterS08). This algorithm is remarkable for being one of the first such algorithms that can be applied to code containing closures and higher-order functions. It is particularly in the treatment of higher-order features where we draw inspiration from their work.
The soundness property of the algorithm is technically interesting because simple inductive proofs of correctness do not seem possible. When it is simply asked to take the gradient of a higher-order function, the algorithm is in fact unsound. However, when taking the gradient of a function with ground-type inputs and outputs only, the results are correct even if the function contains higher-order terms.
Unlike the original algorithm, however, we do not provide automatic differentiation as a first-class entity. This means, implicitly, that we also do not have a means to perform ‘higher order differentiation’ in the sense of differentiating the differential operator itself. In the original work, this was achieved by extending the language with rich runtime reflection capabilities whose formalisation is entirely outside of the scope of our paper. Our algorithm instead is formulated as a meta-level set of rules on hierarchical string diagrams or, in actual implementation, on their hypernet representation. This is akin to the source-to-source transformation approach to automatic differentiation.
The setting for this algorithm is that of a (strict) cartesian closed category generated from one object , representing the real numbers, and a collection of primitive operations (addition, multiplication, trigonometric functions, etc.) and their gradients, along with a collection of nullary primitive operations for real constants. Among these, real addition pics/tikzit/components/add-sd and zero pics/tikzit/components/zero-sd must be included. In the string diagrams throughout this section we represent constants as a triangle instead of a box, just for improved readability. We write pics/tikzit/components/oplus-sd and pics/tikzit/components/barzero-sd for the obvious extension of pics/tikzit/components/add-sd and pics/tikzit/components/zero-sd to bundles.
Each of the provided primitive operations must also come equipped with a pullback diagram: for a primitive operation pics/tikzit/components/op-generic of type , its pullback diagram pics/tikzit/components/op-generic must have type . We make no assumptions about the pullback diagram of an operator, other than its type. However, the correctness result in this section will require pullback diagrams to be ‘correct’ implementations of the gradient of the corresponding operation.
3.1. Rewrite rules on string diagrams
The AD algorithm consists of three separate sets of transformations, the application of which we denote by differently coloured boxes around a diagram. We emphasise that these boxes represent meta-level transformations, and are not to be confused with object-level entities such as the rounded rectangles that we use to denote abstraction.
To reduce clutter we use coloured boxes rather than labelled frames to indicate string-diagram transformations. The first transformation, whose only rewrite rule can be found in Fig. 6, is denoted by a blue box and is the entry point of the algorithm. Given an input diagram with operands of type and results of type , this transformation produces an adjoint diagram with operands of type and results of type , corresponding to the result of the original diagram plus an abstraction, the backpropagator, that computes the gradient of the original diagram at the point at which the adjoint diagram is evaluated. In particular, if the original diagram produces a single result of type , when evaluating the backpropagator at we will obtain the gradient of the original diagram.
This transformation consists, in turn, of two components: a forward pass transformation (in orange, rewrite rules in Fig. 7) and a reverse pass transformation (in green, rewrite rules in Fig. 7). As the naming suggests, these correspond to the forward and reverse passes commonly employed in reverse-mode AD. The forward pass executes the original function ‘as is’, whereas the reverse pass computes the gradient of every sub-expression, in reverse order of execution. In our algorithm, as is usually the case in reverse-mode AD systems, some intermediate values computed during the forward pass are preserved and passed along to the diagram corresponding to the reverse pass. This is shown in Fig. 6 as a bundle of type flowing from the forward-pass computation into the backpropagator.
The rewrites for the forward-pass transformation, depicted in Fig. 7, are self-explanatory, as they are limited to constructing a copy of the original diagram. Only two cases (those for evaluation and abstraction) merit some attention. For abstraction, the diagram enclosed by the bubble is recursively transformed using the blue rule — that is, any abstraction in the primal diagram is replaced by a new abstraction that computes the adjoint of the original one. Then, when function evaluation in the primal diagram is translated by the forward pass, the result of the adjoint application contains both the result of the original abstraction and a backpropagator which is not used in the forward pass but is set aside for the reverse pass.
The rules governing the reverse pass transformation, in Fig. 7, are more involved, so we provide here an intuitive explanation for each. The first rule, which handles constants, states that constants do not contribute to the gradient of the graph. The second rule computes the gradient of the identity function to be the identity. Contraction is transformed into addition, since the gradient of the diagonal map is the addition of tangent vectors, and weakened variables become zero. Each primitive operation is replaced by its corresponding pullback diagram, which receives as additional operands the copies of the inputs to the operation in the forward pass. This is why we require every primitive operation to be mapped to a pullback diagram of the appropriate type. Some examples of pullback diagrams for common operations can be found in Fig. 8.
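The first-order reverse-pass rules can be mimicked by hand on a tiny example. The sketch below is not the rewrite system itself: it is a manual transcription for the function sending x to x * sin(x), and all pullback names (`mul_pullback`, `sin_pullback`, `copy_pullback`, `discard_pullback`) are hypothetical.

```python
import math


# Hypothetical pullbacks: each receives the saved forward inputs of its
# operation plus an incoming sensitivity, and returns one sensitivity per input.
def mul_pullback(x, y, s):
    return (y * s, x * s)


def sin_pullback(x, s):
    return (math.cos(x) * s,)


def copy_pullback(s1, s2):
    """Contraction is transformed into addition."""
    return (s1 + s2,)


def discard_pullback():
    """Weakened variables become zero."""
    return (0.0,)


def f_reverse(x, seed=1.0):
    """Gradient of x * sin(x), computed with a forward and a reverse pass."""
    # Forward pass: run the original computation, keeping intermediate values.
    x1, x2 = x, x            # contraction
    s = math.sin(x2)
    # Reverse pass: visit the operations in reverse order of execution.
    dx1, ds = mul_pullback(x1, s, seed)
    (dx2,) = sin_pullback(x2, ds)
    (dx,) = copy_pullback(dx1, dx2)
    return dx
```

With the seed set to 1.0, `f_reverse(x)` agrees with the analytic derivative sin(x) + x cos(x). The saved values `x1`, `x2` and `s` play the role of the bundle passed from the forward pass to the backpropagator.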
The reverse pass handling of application and abstraction is, both in the original algorithm and in our interpretation of it, difficult to back up with compelling intuitions, but we shall try our best.
Remember that the forward pass transforms every abstraction in the primal diagram in order to compute the original value together with a backpropagator. The latter is captured by the reverse pass in every application rule. When rewriting an application node, the reverse pass instead applies the backpropagator given by the forward rule. This backpropagator in turn produces a wire for every operand of the body of the original abstraction, which are swapped into the correct order. The abstraction rule in the reverse pass then expects the sensitivity of an abstraction to consist of a bundle of wires corresponding to the sensitivities of each captured wire.
As an example illustrating this algorithm and its handling of closures in particular, we provide in Fig. 8(a) a diagram that might result from a program like let mul y = x * y in mul x + x, with the free variable x corresponding to its single operand. On the right, in Fig. 8(b), we show the result of applying the adjoint transformation to this diagram (see the appendix on animations for an animated step-by-step derivation). It is a mere calculation to check that the resulting backpropagator, when applied to input , can be evaluated to the correct derivative of the polynomial .
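The treatment of the closure in this example can be imitated directly with Python closures. The following is a hand-written sketch in the spirit of the adjoint transformation, not the output of the rewrite rules; the names `adjoint`, `mul_adj`, `backprop` and `backpropagator` are hypothetical.

```python
def adjoint(x):
    """Hand-written adjoint of `let mul y = x * y in mul x + x`."""
    # Forward pass. The free variable x is used three times (contraction).
    x1, x2, x3 = x, x, x

    def mul_adj(y):
        # The abstraction is replaced by its adjoint: it returns the original
        # result together with a backpropagator.
        result = x1 * y

        def backprop(s):
            # One sensitivity for the captured wire x1, one for the argument y.
            return (y * s, x1 * s)

        return result, backprop

    r, mul_bp = mul_adj(x2)   # evaluation; the backpropagator is set aside
    out = r + x3

    def backpropagator(s):
        # Reverse pass, in reverse order of execution.
        s_r, s_x3 = s, s             # pullback of addition
        s_x1, s_x2 = mul_bp(s_r)     # application uses the saved backpropagator
        return s_x1 + s_x2 + s_x3    # contraction becomes addition

    return out, backpropagator
```

The program computes x * x + x, whose derivative is 2x + 1. At x = 3.0, `adjoint` returns the value 12.0 and a backpropagator that maps the seed 1.0 to 7.0.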
Lemma 3.1.
Every rule erases one node in the fringe of the left-hand side diagram, and no two rules can be applied to erase the same node. Therefore, if two rules can apply to the same diagram, it must be the case that they apply to different fringe nodes. It is then easily checked that every pair of such rules commutes, modulo a permutation of the wires that are propagated from the forward to the reverse pass. For a concrete example, consider the two sequences of rewrites in Fig. 10. ∎
Remark 1.
The proof above, although very simple, illustrates a proof method that is made possible by using string diagrams: induction on the length of the foliation of the diagram (Def. 2.1). The ‘fringe’ mentioned in the proof above is simply the ‘bottom’ (in this case) leaf in the chosen foliation, noting that the foliation is not unique. This proof method also benefits from the absence of names and all related bureaucratic concerns (free vs. bound, alpha equivalence, capture-avoiding substitution).
3.2. Reverse derivative categories
In order to prove that the algorithm we have given is correct, we need to select an appropriate semantic domain that reflects the behaviour of the gradient operator from calculus. The obvious choice is the setting of reverse derivative categories (cockett2019reverse). In simple terms, these are cartesian categories equipped with a ‘reverse differential combinator’ which behaves, in a suitable sense, like taking the gradient of a function in multivariate calculus. For a more thorough treatment and explanation, we refer the reader to loc. cit. They are defined as follows:
Definition 3.2.
(cockett2019reverse, Def. 13) A reverse derivative category is a cartesian left-additive category endowed with a combinator sending each morphism to a morphism which satisfies the following conditions:
One caveat of reverse derivative categories is that they do not naturally accommodate higher-order functions. Indeed, there is no natural notion of a reverse derivative category with exponentials. In contrast, cartesian differential categories can be extended to differential λ-categories (bucciarelli2010categorical), that is, cartesian differential categories which are cartesian closed and where the differential combinator is ‘well-behaved’ with respect to abstraction. This limitation is of no concern to us, however: we do not claim that the AD algorithm in this paper produces correct gradients for arbitrary higher-order diagrams, only for those whose inputs and outputs have first-order types, even if they do contain higher-order sub-terms. The first-order setting of reverse derivative categories is sufficient.
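For intuition, the reverse differential combinator can be instantiated on smooth real functions, where the reverse derivative of f takes a point a and a sensitivity b and returns the derivative of f at a multiplied by b. The sketch below assumes this one-dimensional model; the names `R_square`, `R_sin` and `compose_R` are hypothetical, and the chain-rule shape used in `compose_R` is the standard one from calculus rather than a quotation of the axioms.

```python
import math


def square(a):
    return a * a


def R_square(a, b):
    """Reverse derivative of square: derivative at a, applied to sensitivity b."""
    return 2 * a * b


def R_sin(a, b):
    """Reverse derivative of sin."""
    return math.cos(a) * b


def compose_R(f, R_f, R_g):
    """Reverse derivative of `g after f`, obtained by the chain rule.

    The sensitivity c is first pulled back through g at the point f(a),
    then through f at the point a.
    """
    return lambda a, c: R_f(a, R_g(f(a), c))


# Reverse derivative of a |-> sin(a * a)
R_h = compose_R(square, R_square, R_sin)
```

Evaluating `R_h(a, 1.0)` gives 2a cos(a²), the ordinary derivative of sin(a²). Note that the sensitivities flow in the opposite direction to the computation, which is the defining feature of the reverse combinator.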
Henceforth, we will assume that the strict cartesian category generated by the object and the collection of primitive operators and pullback diagrams defined above is a reverse derivative category. We will use the notation pics/tikzit/components/op-generic-grad to denote the reverse derivative pics/tikzit/components/op-generic in diagrammatic form. In addition, we require that this reverse derivative category satisfies:
The of the left-additive structure coincides with pics/tikzit/components/zero-sd
The of the left-additive structure coincides with pics/tikzit/components/add-sd
For each primitive operation pics/tikzit/components/op-generic , we have pics/tikzit/components/op-generic pics/tikzit/components/op-generic-grad
Using this notation, all the equations in Definition 3.2 can be written diagrammatically. The graphical translation of conditions [RD.1] and [RD.3]-[RD.5], which will be relevant to us later, can be found in Fig. 11.
Our proof of correctness proceeds in two steps. First, we prove that our AD transformation is compatible with Beta reduction, that is to say, whenever two diagrams are equivalent modulo Beta reduction, then so are their adjoints. Then, we show that the AD transformation is correct for diagrams featuring only first-order nodes (that is to say, no abstractions or applications). For both of these steps, we will make use of the following technical result, which simply states that the forward and reverse passes are compositional.
Lemma 3.3.
A trivial induction on the maximally sequential hierarchical foliation of . ∎
The proof proceeds by straightforward application of the rewrite rules. We provide the calculation in full in Fig. 13 (an animated version of which can be found in the appendix on animations).
Lemma 3.5.
For every diagram pics/tikzit/components/op-generic whose operands and results are all of a first-order type, there is a Beta-equivalent diagram pics/tikzit/components/op-generic that contains no instances of abstraction or evaluation and whose every node has all first-order inputs and outputs.
Since our graphical language is simply-typed, evaluation of the diagram pics/tikzit/components/op-generic is guaranteed to terminate, following an argument similar to the proofs of strong normalisation for the simply-typed λ-calculus (such as the one in (girard1989proofs, Chapter 6)), or for interaction nets (e.g. (DBLP:journals/tcs/Mackie00)). Such a normal form cannot contain any redexes, and so any abstraction or evaluation node must be connected to an operand or a result wire, which cannot be the case as these have all first-order types. ∎
Theorem 3.6.
For every diagram pics/tikzit/components/op-generic whose operands and results are all first-order, Eqn. 13(a) holds.
Applying Lemma 3.5 and Lemma 3.4, it suffices to consider the case where pics/tikzit/components/op-generic contains no instances of abstraction or evaluation. Applying the rewrite rule in Fig. 6 and calculating gives the diagram in Eqn. 13(b). The result then follows by induction on the foliation of the diagram pics/tikzit/components/op-generic . We show one case in full.
= pics/ad/soundness-dup-grad-2 = pics/ad/soundness-dup-grad-3 = pics/ad/soundness-dup-grad-4 = pics/ad/soundness-dup-grad-5 =
= pics/ad/soundness-dup-grad-7 = pics/ad/soundness-dup-grad-8 = pics/ad/soundness-dup-grad-9 = pics/ad/soundness-dup-grad-10
4. Hierarchical hypergraphs and rewriting
Even though string diagrams are convenient for mathematical reasoning, the actual implementation of string diagram rewriting poses a challenge. In order to perform a rewrite step, we need to find a match in a string diagram, but the presence of a redex may depend on which representation of the diagram we pick among those that are equivalent according to the laws of symmetric monoidal categories (DBLP:conf/lics/BonchiGKSZ16). A solution is to identify a data structure interpreting string diagrams with the property that equivalent representations of a string diagram all have the same interpretation. For standard string diagrams in symmetric monoidal categories, such a data structure is provided by hypergraphs with interfaces, and the interpretation allows one to model string diagram rewriting efficiently as double-pushout rewriting of the corresponding hypergraphs (DBLP:conf/lics/BonchiGKSZ16). In this section, we do something similar for our hierarchical string diagrams: we devise a suitable combinatorial structure, called hypernets, and show that two hierarchical string diagrams are interpreted as the same hypernet whenever they are equivalent modulo the laws of symmetric monoidal categories (Theorem 4.12). This allows us to conclude that rewriting of hierarchical string diagrams can be ‘implemented’ as double-pushout rewriting of hypernets (Theorem 4.14).
Hierarchical hypergraphs have been used before many times, see e.g. (DBLP:journals/cuza/BruniGL10; DBLP:journals/jcss/DrewesHP02; DBLP:journals/jcss/Palacz04). Our approach is broadly similar, but with enough subtle differences that it is necessary to give our own definitions. For a more detailed comparison, see Sec. 5.2.
4.1. Hierarchical hypergraphs and hypernets
A hierarchical hypergraph is a labelled, directed hypergraph with a parent relationship which determines the hierarchical structure. We fix sets of vertex and edge labels and . When comparing hierarchical hypergraphs to string diagrams, should also be the set of base types in the string diagrams, while should be the added operations.
Definition 4.1.
A hierarchical hypergraph is a tuple comprising a finite set of vertices , a finite set of edges , source and target functions , labelling functions and , and parent functions and .
While the source, target, and labelling functions are standard for labelled, directed hypergraphs, we must add some conditions to the parent functions. First, we require an edge and any of its source and target vertices to have the same parent: namely that for all and respectively. Second, the parent relation must be acyclic, so that repeatedly applying should eventually end up in the right summand of . More precisely, we assume for all there is some such that where is the element of 1 and is the extension of adding .
When the parent of a vertex or edge is the element from the right summand, we say it is a outermost vertex or outermost edge. If the label of an edge is , we say (with some abuse) that it is unlabelled. When considering multiple hierarchical hypergraphs, we use subscripts to disambiguate these data.
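As an informal illustration, which is not part of the formal development, the data of a hierarchical hypergraph and the two conditions on the parent functions can be sketched in Python. The class name, the field names, the string encoding of vertices and edges, and the use of `None` for the outermost level are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical encoding: vertices and edges are named by strings, and None
# stands for the single element of the right summand (the outermost level).
Parent = Optional[str]


@dataclass
class HierarchicalHypergraph:
    vlabel: Dict[str, str]             # vertex -> vertex label
    elabel: Dict[str, Optional[str]]   # edge -> edge label (None = unlabelled)
    src: Dict[str, List[str]]          # edge -> ordered list of source vertices
    tgt: Dict[str, List[str]]          # edge -> ordered list of target vertices
    vparent: Dict[str, Parent]         # vertex -> parent edge, or None
    eparent: Dict[str, Parent]         # edge -> parent edge, or None


def same_parent_condition(g: HierarchicalHypergraph) -> bool:
    """Every edge shares its parent with all of its sources and targets."""
    return all(
        g.vparent[v] == g.eparent[e]
        for e in g.elabel
        for v in g.src[e] + g.tgt[e]
    )


def parent_acyclic(g: HierarchicalHypergraph) -> bool:
    """Iterating the edge-parent map from any edge eventually reaches None."""
    for e in g.elabel:
        seen = set()
        cur: Parent = e
        while cur is not None:
            if cur in seen:
                return False
            seen.add(cur)
            cur = g.eparent[cur]
    return True


# A small made-up example: an unlabelled outermost edge "box" whose inner
# hypergraph holds one edge "f" from vertex "x" to vertex "y".
example = HierarchicalHypergraph(
    vlabel={"a": "A", "b": "B", "x": "A", "y": "B"},
    elabel={"box": None, "f": "f"},
    src={"box": ["a"], "f": ["x"]},
    tgt={"box": ["b"], "f": ["y"]},
    vparent={"a": None, "b": None, "x": "box", "y": "box"},
    eparent={"box": None, "f": "box"},
)
```

Since the parent of a vertex is always an edge (or the outermost marker), checking acyclicity along the edge-parent map is enough in this encoding.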
We borrow terms from graph theory for hierarchical hypergraphs. An important example is that we call a hierarchical hypergraph connected when for every pair of outermost vertices, there is a sequence of edges (oriented either forward or reverse) joining the two vertices.
In every hierarchical hypergraph , associated to every edge is a subgraph consisting of edges (and vertices ) satisfying (and ) for some (and ). We denote this subgraph and call it “the inner hypergraph of ”. If a subgraph of a hierarchical hypergraph has the property that for all , we call down-closed. When depicting a hierarchical hypergraph, we indicate the inner hypergraph of an edge by nesting the inner subgraph within its edge, like abstraction in hierarchical string diagrams. An example can be seen in Fig. 15.
We give an example hierarchical hypergraph in Fig. 15. This hypergraph has six vertices (the six black dots), which we name from top to bottom then left to right. These vertices are labelled as follows: , , , and . There are three edges: with label , with label , and unlabelled but with an inner hypergraph. The sources and targets of these edges are mostly clear (and mostly one-element lists), except . As mentioned above, is the parent edge for most of the graph, so and for all other edges and vertices the parent function returns .
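The inner hypergraph of an edge and the down-closed property can be made concrete with a short Python sketch. It assumes a hypothetical encoding in which the parent functions are dictionaries and `None` marks the outermost level, and it reads down-closedness as "the inner hypergraph of every edge of the subgraph is contained in the subgraph".

```python
from typing import Dict, Optional, Set, Tuple

Parent = Optional[str]


def inner_hypergraph(
    e: str, vparent: Dict[str, Parent], eparent: Dict[str, Parent]
) -> Tuple[Set[str], Set[str]]:
    """All vertices and edges whose chain of parents passes through e.

    Assumes the parent relation is acyclic, as the definition requires.
    """
    def lies_below(p: Parent) -> bool:
        # Walk up the chain of parents starting at p, looking for e.
        while p is not None:
            if p == e:
                return True
            p = eparent[p]
        return False

    vertices = {v for v, p in vparent.items() if lies_below(p)}
    edges = {x for x, p in eparent.items() if lies_below(p)}
    return vertices, edges


def is_down_closed(
    sub_vertices: Set[str], sub_edges: Set[str],
    vparent: Dict[str, Parent], eparent: Dict[str, Parent],
) -> bool:
    """The subgraph contains the inner hypergraph of each of its edges."""
    for e in sub_edges:
        vs, es = inner_hypergraph(e, vparent, eparent)
        if not (vs <= sub_vertices and es <= sub_edges):
            return False
    return True
```

For instance, with an edge `"inner"` nested in an edge `"box"`, the inner hypergraph of `"box"` contains `"inner"` together with everything nested inside `"inner"`.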
Definition 4.2.
In a hierarchical hypergraph , a vertex is an input vertex if it never occurs as a target, an output vertex if it never occurs as a source, an interface if it is either an input or an output vertex, and an isolated vertex if it is both an input and an output vertex.
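The four notions in this definition depend only on the source and target functions, as the following Python sketch shows. The function name and the dictionary encoding of the source and target functions are assumptions made for illustration.

```python
from typing import Dict, List, Set


def classify_vertices(
    vertices: Set[str], src: Dict[str, List[str]], tgt: Dict[str, List[str]]
) -> Dict[str, Set[str]]:
    """Sort vertices into input, output, interface and isolated vertices."""
    used_as_source = {v for vs in src.values() for v in vs}
    used_as_target = {v for vs in tgt.values() for v in vs}
    inputs = {v for v in vertices if v not in used_as_target}
    outputs = {v for v in vertices if v not in used_as_source}
    return {
        "input": inputs,                 # never occurs as a target
        "output": outputs,               # never occurs as a source
        "interface": inputs | outputs,   # input or output
        "isolated": inputs & outputs,    # both input and output
    }
```

With a single edge from `a` to `b` and a further vertex `c` touched by no edge, `a` and `c` are input vertices, `b` and `c` are output vertices, and `c` is the only isolated vertex.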
We think of vertices of hierarchical hypergraphs as representing objects in a category and edges representing morphisms from the product of the source objects to the product of the target objects. However, hierarchical hypergraphs are generally much more expressive than string diagrams: multiple edges can use the same vertex as a source or a target, and there could be cycles in the graph. We will therefore be interested in a more restricted class of hypergraphs, which we call hypernets.
Definition 4.3.
A hypernet is a hierarchical hypergraph with the following additional properties: (1) acyclicity, (2) all vertices occur as a source for at most one (and at most once in ), (3) all vertices occur as a target for at most one (and at most once in ), (4) there are specified total orderings on the input and output vertices, and (5) if and only if is the empty hypergraph.
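Properties (1) to (3) can be checked mechanically. The sketch below is one possible reading, under the assumption that acyclicity refers to the directed graph on vertices with an arc from each source of an edge to each of its targets; the function names and the dictionary encoding are hypothetical. Properties (4) and (5) are not covered.

```python
from collections import Counter
from typing import Dict, List, Set


def used_at_most_once(incidence: Dict[str, List[str]]) -> bool:
    """No vertex appears twice, across all edges and within each list."""
    counts = Counter(v for vs in incidence.values() for v in vs)
    return all(n <= 1 for n in counts.values())


def is_acyclic(
    vertices: Set[str], src: Dict[str, List[str]], tgt: Dict[str, List[str]]
) -> bool:
    """Kahn's algorithm on the vertex graph with an arc s -> t whenever
    some edge has s among its sources and t among its targets."""
    succ = {v: set() for v in vertices}
    for e in src:
        for s in src[e]:
            succ[s].update(tgt[e])
    indeg = {v: 0 for v in vertices}
    for s in succ:
        for t in succ[s]:
            indeg[t] += 1
    ready = [v for v in vertices if indeg[v] == 0]
    visited = 0
    while ready:
        v = ready.pop()
        visited += 1
        for t in succ[v]:
            indeg[t] -= 1
            if indeg[t] == 0:
                ready.append(t)
    # Every vertex is visited exactly when there is no directed cycle.
    return visited == len(indeg)


def satisfies_properties_1_to_3(
    vertices: Set[str], src: Dict[str, List[str]], tgt: Dict[str, List[str]]
) -> bool:
    return (
        is_acyclic(vertices, src, tgt)
        and used_at_most_once(src)
        and used_at_most_once(tgt)
    )
```

Because an edge and its sources and targets share a parent, a single global check of this kind covers every level of the hierarchy at once.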
For now, we return our focus to hierarchical hypergraphs and situate them in a category.
Definition 4.4.
A morphism of hierarchical hypergraphs is a pair of functions with and .
These functions are required to respect the structure of the hierarchical hypergraphs in the following senses:
Note that we do not require that outermost vertices and edges are sent to outermost vertices and edges, due to the condition in (5) and (6). If conditions (5) and (6) hold for all and , we say the morphism is strict.
Hierarchical hypergraphs and the morphisms between them form a category. This category clearly has finite coproducts given by disjoint union. We investigate pushouts in this category in order to support double pushout rewriting. When restricted to strict morphisms, all pushouts exist and can be computed as in . Unfortunately, the category of hierarchical hypergraphs with strict morphisms is not expressive enough for the rewriting tasks we require.
When allowing all hierarchical hypergraph morphisms, the category does not have all pushouts or even pushouts along monos. This is primarily due to ambiguities in the parents of outermost vertices and edges. Two non-strict morphisms can embed a graph into two unmergeable parts of different graphs. However, there are enough pushouts in this category to support the double pushout structure we need.
In essence, given an arbitrary span with the property that the outermost interfaces of and are isomorphic together with a (monomorphic) matching of the leftmost graph in the span in another hierarchical hypergraph, the next few lemmas give conditions for a unique (up to isomorphism) and completing the following diagram, where all squares are pushouts:
Here is a copy of the outermost interface vertices of (and ) and (resp. ) is the inclusion of these vertices in the graph. That the inner two squares are the only pushout or pushout complement in this format is straightforward. Lemma 4.8 gives the requirements on in the leftmost square to make it a pushout complement. These conditions entail being a mono, which we will see in Lemma 4.7 is an important criterion to get the existence of a pushout in the rightmost square.
Example 4.5.
We will illustrate the graph rewriting process with a running example. For now, we just give an example of the input data we are expecting: a span and a matching. We start with a span corresponding to a particular instance of the Abs rule and a matching of the left-hand side of this span in another hierarchical hypergraph.
To avoid clutter, we omit the vertex labels and do not give a full description of the morphisms other than to say they are the obvious map preserving edge labels. The goal of this section is to formally describe how the copy of in is replaced with a copy of .
Definition 4.6.
Suppose is a hierarchical hypergraph consisting of only isolated vertices, and let and be morphisms. We say and have complementary images if
for all ,
for all , and
either never occurs as a source and never occurs as a target or vice versa.
Though the following result does not completely characterize pushouts in this category, it gives enough pushouts for us to construct the rightmost square in the diagram above.
Lemma 4.7.
Suppose is a hypergraph of isolated vertices, sends all vertices of to outermost vertices in the connected hypergraph , and the images of the vertices under have a single common parent. Then the span has a pushout.
If further and are monos and they have complementary images, then and being hypernets implies the pushout is as well.
The pushout can be formed by taking the disjoint union of and , then identifying the images of under the respective maps. The parent of the outermost vertices and edges of is defined to be the common parent of the . The remaining properties of hierarchical hypergraphs are straightforward.
That and are monos with complementary images ensures that when this identification occurs, equivalence classes of vertices have at most two representatives (one from and one from ) and that the resulting equivalence class is used as an input or an output by at most one edge from either graph. This makes the quotient a hypernet. ∎
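The construction in this proof can be sketched in Python as a disjoint union followed by identification of interface images. This is a simplified illustration and not the paper's formal construction: the function `glue`, the names `piece` and `context` for the two graphs, the tags `"P"` and `"C"`, and the dictionary encoding are assumptions, labels are omitted, and both interface maps are assumed injective.

```python
from typing import Dict, Optional, Tuple

Name = Tuple[str, str]   # (tag, original name); "C" = context, "P" = piece


def glue(piece: dict, context: dict,
         into_piece: Dict[str, str], into_context: Dict[str, str]) -> dict:
    """Glue `piece` into `context` along shared interface vertices.

    Each graph is a dict with keys "src", "tgt", "vparent", "eparent".
    Both interface maps are assumed injective, and `into_piece` is assumed
    to hit only outermost vertices of `piece`.
    """
    parents = {context["vparent"][into_context[k]] for k in into_piece}
    if len(parents) != 1:
        raise ValueError("interface images need a single common parent")
    (common,) = parents

    def lift(p: Optional[str], tag: str) -> Optional[Name]:
        return None if p is None else (tag, p)

    # Interface vertices of the piece are identified with context vertices.
    merged = {into_piece[k]: ("C", into_context[k]) for k in into_piece}

    def piece_vertex(v: str) -> Name:
        return merged.get(v, ("P", v))

    def piece_parent(p: Optional[str]) -> Optional[Name]:
        # Outermost items of the piece move under the common parent.
        return lift(common, "C") if p is None else ("P", p)

    out = {"src": {}, "tgt": {}, "vparent": {}, "eparent": {}}
    for v, p in context["vparent"].items():
        out["vparent"][("C", v)] = lift(p, "C")
    for e, p in context["eparent"].items():
        out["eparent"][("C", e)] = lift(p, "C")
        out["src"][("C", e)] = [("C", v) for v in context["src"][e]]
        out["tgt"][("C", e)] = [("C", v) for v in context["tgt"][e]]
    for v, p in piece["vparent"].items():
        if v not in merged:
            out["vparent"][("P", v)] = piece_parent(p)
    for e, p in piece["eparent"].items():
        out["eparent"][("P", e)] = piece_parent(p)
        out["src"][("P", e)] = [piece_vertex(v) for v in piece["src"][e]]
        out["tgt"][("P", e)] = [piece_vertex(v) for v in piece["tgt"][e]]
    return out
```

Tagging every name with its graph of origin keeps the union disjoint. Reparenting the outermost items of the piece under the common parent is the one step that has no counterpart in the flat case.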
Next we establish the result constructing pushout complements in the leftmost square.
Lemma 4.8.
Suppose is a hypergraph of isolated vertices, is a bijection of vertices of with outermost interface vertices in the connected graph . Further suppose is a monomorphism with the following properties: (1) the image of is down-closed, and (2) edges in outside the image of are incident only with vertices outside the image of or vertices in . Then there is a unique graph such that is a pushout complement to . Further, is a monomorphism.
Note that condition (2) is the dangling condition from double pushout rewriting, and that is a monomorphism strengthens the identification condition. Hence, it is not surprising to define . The wrinkle introduced by hierarchical hypergraphs is the hierarchy: down-closedness (1) is required in order to remove (most of) the image of from without introducing ambiguity in the parent relationships. When an edge is in the image of in , this condition ensures all of its children are also in the image, so we do not need to redefine its parent after deletion. ∎
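Condition (2) and the removal step can be illustrated with a small Python sketch. The function names and the set-and-dictionary encoding are assumptions made here; parent functions are omitted, and the sketch takes for granted that the matched part is down-closed, so that no surviving item loses its parent.

```python
from typing import Dict, List, Set, Tuple


def dangling_ok(
    src: Dict[str, List[str]], tgt: Dict[str, List[str]],
    matched_vertices: Set[str], matched_edges: Set[str], interface: Set[str],
) -> bool:
    """Edges outside the match touch no matched vertex except interface ones."""
    removable = matched_vertices - interface
    for e in src:
        if e in matched_edges:
            continue
        if any(v in removable for v in src[e] + tgt[e]):
            return False
    return True


def remove_match(
    vertices: Set[str], src: Dict[str, List[str]], tgt: Dict[str, List[str]],
    matched_vertices: Set[str], matched_edges: Set[str], interface: Set[str],
) -> Tuple[Set[str], Dict[str, List[str]], Dict[str, List[str]]]:
    """Delete the matched part, keeping only its interface vertices."""
    if not dangling_ok(src, tgt, matched_vertices, matched_edges, interface):
        raise ValueError("dangling condition violated")
    keep_vertices = vertices - (matched_vertices - interface)
    keep_src = {e: vs for e, vs in src.items() if e not in matched_edges}
    keep_tgt = {e: vs for e, vs in tgt.items() if e not in matched_edges}
    return keep_vertices, keep_src, keep_tgt
```

If an interface vertex is dropped from `interface` while an outside edge still uses it, the check fails, which is the situation the dangling condition is meant to rule out.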
Example 4.9.
The leftmost pushout (complement) square excises the matching of from the graph in which it is embedded. The next two squares replace with while the portion of the morphism from marks where the result should be reinserted in . Finally, the rightmost pushout glues back into .
Remark 2.
This example illustrates one of the distinctive features of our approach to hierarchical hypergraph rewriting, namely the ability to send outermost vertices (and edges) to inner vertices (and edges). This is crucial, because both legs of the span require it. Equally crucial, but maybe less obvious, is that morphisms can send outermost vertices (and edges) to images with different parents, as seen in the right leg of the span. It is these novel properties that provide the required level of expressiveness we need to formulate our string diagram equations as graph rewrites.
4.2. Soundness and completeness
Now that we understand how hierarchical hypergraphs can be rewritten, we turn to the connection between string diagrams and hypernets. Suppose we fix a set of basic types and a set of operations from which the string diagrams of section 2 are built. From these basic types, generate types according to the rules of LABEL:sec:types. We restrict vertex labels for our hypernets to these atomic types. Just as with string diagrams, there is a notion of well-typedness for hypernets.
Definition 4.10.
A well-typed hypernet satisfies the following properties. (1) If is an operation, then the types of match the input types of this operation and similarly the types of match the output types of . (2) If , let be the types of the list of input interfaces in in the interface order given on the hypernet. Similarly let be the list of types on the output interfaces of . Then there is a list of types such that and .
As we will see, though property (4) of hypernets (a specified total ordering on the input and output vertices) is not so important in rewriting, it is necessary to establish a typing for hypernets. To reflect this, we depict the interface ordering graphically by drawing a copy of the interface in the specified ordering left-to-right in a blue box and the correspondence between the ordered interface and the hypergraph.
Every string diagram has an interpretation as a hypernet with these labels, inductively defined based on a(ny) decomposition of the string diagram. For base cases, contraction, weakening, evaluation, and any of the basic operations are interpreted as a single hyperedge with the corresponding label from . Identity and unary swap are slightly subtle: in the hypernet representation they do not require an edge. Rather, they are graphs of isolated vertices with different input and output orderings, as shown in Figures 16(c) and 16(d). For induction steps, compositions and abstraction are as expected, with abstraction taking advantage of the hierarchical structure. We write for the interpretation of a string diagram under this scheme.
Note that this interpretation scheme absorbs the equations required of a symmetric monoidal category. Sequentially composing an identity hypernet with any other hypernet (that can be composed with that identity) results in a hypernet isomorphic to the given hypernet. Similarly, associativity of compositions and tensoring with the unit also yield isomorphic hypernets. Finally, the two sides of the equations for naturality and idempotency of symmetry, when interpreted under this scheme, also result in isomorphic hypernets.
In the other direction, we can show that every hypernet arises as the interpretation of a string diagram, and further that all string diagrams which have a given hypernet as their interpretation are equivalent modulo the equations of symmetric monoidal categories. Before we do this, it is useful to establish a result about hypernets similar to the foliation decomposition of lemma 2.2.
Lemma 4.11.
Let be a hypernet, and be a connected, down-closed, outermost-level subnet. Then there is a decomposition of into the sequential composition of (1) a hypernet, (2) the parallel composition of with some identity hypernets, and (3) another hypernet.
Topologically sort the hypernet. The edges after the last edge of form (3). Edges before the last of but not including any edges from form (1). Finally, any edges of , together with new vertices (identity hypernets) for any outputs of (1) not used in form (2). ∎
Note that this decomposition is certainly not unique.
Theorem 4.12 (Hierarchical Definability).
Let be a well-typed hypernet. There exists a string diagram such that . If is any other string diagram with the property that , then .
For existence, we induct on the number of edges of the hypernet. If the hypernet has no edges, the output ordering is some permutation of the input ordering. Since all permutations are generated by the set of adjacent transpositions, there is a combination of unary swaps which has this hypernet as its interpretation.
If the hypernet has at least one edge, it has an outermost-level edge. Let be an arbitrary outermost-level edge, and be the hypernet inside (if there is one). By lemma 4.11, we can decompose this hypernet into three pieces. Hypernets (1) and (3) do not contain , so by the induction hypothesis they have a string diagram representation. If (2) has a basic label, it is the interpretation of the corresponding symbol in the signature. If (2) does not have a basic label (and thus has an interior hypernet), the induction hypothesis again tells us the interior is the interpretation of a string diagram. Then is the interpretation of the abstraction of that string diagram. The sequential composition of these three string diagrams is then a string diagram represented by . ∎
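The base case of the existence argument uses the fact that every permutation is a composite of adjacent transpositions. A minimal Python sketch of one way to compute such a composite, by bubbling each entry into place, is given below. The function names and the representation of interface orderings as lists of vertex names are assumptions made for illustration, and the sequence of swaps found is generally not the only one.

```python
from typing import List, Sequence


def adjacent_swaps(inputs: Sequence[str], outputs: Sequence[str]) -> List[int]:
    """Positions i such that swapping entries i and i+1, in the order
    returned, rearranges `inputs` into `outputs`.

    Assumes `outputs` is a permutation of `inputs` without repeats.
    """
    work = list(inputs)
    swaps: List[int] = []
    for i, wanted in enumerate(outputs):
        j = work.index(wanted, i)      # current position of the wanted entry
        for k in range(j, i, -1):      # bubble it left to position i
            work[k - 1], work[k] = work[k], work[k - 1]
            swaps.append(k - 1)
    return swaps


def apply_swaps(inputs: Sequence[str], swaps: Sequence[int]) -> List[str]:
    """Replay a list of adjacent swaps on an ordering."""
    work = list(inputs)
    for i in swaps:
        work[i], work[i + 1] = work[i + 1], work[i]
    return work
```

Each returned position corresponds to one swap of two adjacent wires, placed in parallel with identities on the remaining wires, so replaying the list with `apply_swaps` recovers the output ordering from the input ordering.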
We denote the string diagram (up to symmetric monoidal axioms) of a hypernet as .
Lemma 4.13.
Let be a hypernet. For every connected subnet which contains all its descendants (as in assumption (1) of Lem. 4.8), there is a decomposition of the string diagram which includes the string diagram .
Axioms of the Cartesian structure are not “baked in” to the hypernet structure in the way the properties of symmetric monoidal structures are. Instead, these equations are modeled by bidirectional rewriting rules.
Theorem 4.14.
Every equation of Cartesian closed categories has a corresponding bidirectional rewrite rule. That is, for every equation of Cartesian closed categories (expressed as string diagrams) there is a rewrite rule such that applying this rule in a hypernet
5. Related work
5.1. String diagrams
Cartesian closed categories have been thoroughly studied in the context of logic and type theory, because of the well-known correspondence of their internal language with the λ-calculus and intuitionistic logic (sorensen2006lectures). The linear version of this triad involves monoidal rather than cartesian categories, but also proof nets, and linear logic, as indicated already in the original paper (DBLP:journals/tcs/Girard87). (DBLP:conf/csl/Mellies06) provides the foundation on which we build our language of string diagrams, noting that all the basic ingredients are already there.
The route of using an enhancement of the monoidal closed structure with additional properties to control sharing is fruitful and has been employed many times. For example it is found in (DBLP:journals/jlp/BonchiSZ18), where the manipulation of variables endowed with algebraic theories is modeled as a cartesian structure on the top of a linear structure, or in (DBLP:conf/popl/Ghica07) to specify multiplexers and demultiplexers in high-level synthesis.
To the best of our knowledge, we provide the first fully specified string diagrammatic language for cartesian closed categories generated as a graphical syntax quotiented by equations. Our approach shares similarities with the formalisms of sharing graphs for describing λ-calculus computations (DBLP:conf/popl/Lamping90). The main difference is that string diagrams, albeit graphical in appearance, can be manipulated as a syntax, whereas sharing graphs are usually studied as combinatorial objects. Unlike syntax, reasoning about graphs algebraically requires a higher degree of technical sophistication (DBLP:journals/tcs/Guerrini99). Finally, sharing graphs are typically used to study low-level computational models for functional languages, in particular quantitative models (DBLP:journals/lmcs/MuroyaG19), whereas our approach is more focussed on equational reasoning and rewriting, and does not have the ambition of investigating the resources employed during computation.
Monoidal closed categories extend not only to cartesian closed categories, but also to ∗-autonomous categories. This second variation is very much relevant to the study of multiplicative linear logic and it has been extensively studied in terms of proof nets. Our graphical calculus is essentially different from proof nets. The grammar of generating morphisms does not stem from a sequent calculus, and we capture the intended semantics via equations rather than a correctness criterion. But the connection might be made precise relying on the existing translations between proof nets and string diagrams (hughes; shulman). This is not the only possible extension. For example, another direction of extending monoidal categories is to traced monoidal categories (DBLP:conf/tlca/Hasegawa97), which has interesting applications to modeling circuits with feedback (DBLP:conf/csl/GhicaJL17). Finally, a different style of hierarchical string diagrams appears in the literature to represent universal properties graphically such as Kan extensions (DBLP:conf/mpc/Hinze12) and free monads (DBLP:conf/icfp/PirogW16).
The only other proposal for a string-diagram language for monoidal closed categories which we are aware of is that of (baez2010physics). We found a great deal of inspiration in their proposal, but in the process of fully working out the equational properties and the combinatorial structure we felt compelled to deviate somewhat from loc. cit. To keep the language of types as simple as possible and as strict as possible they propose an intriguing graphical innovation, a so-called clasp operator on stems. The exponential type is represented, using the clasp, as . Much like in our own language, a bubble is used to represent currying. A simple example which uses both these graphical devices is the name of a function , represented as . We found it difficult to work out some of the unspecified details in particular for the clasp, e.g. how it can be used to represent higher-order objects (e.g. ). But, more fundamentally, it was unclear to us what status we can give the clasp both as a syntactic and as a combinatorial object. To conclude, the approach in loc. cit. is ambitious and innovative and, if the details were figured out, potentially more elegant in that it preserves an appealing visual parallel between monoidal closed and compact closed structures. However, filling in the missing details proved to be too challenging.
5.2. Rewriting of hierarchical graphs
The notion of hierarchical hypergraph used in this paper is closely inspired by, and a formalisation of, the graphs used in (spartan).
Although there is no consensus on a standard definition of hierarchical graphs, the various approaches to rewriting on these structures (DBLP:journals/cuza/BruniGL10; DBLP:journals/jcss/DrewesHP02; DBLP:journals/jcss/Palacz04) give slight variations on the idea of graphs containing other graphs and notions of morphism between them. Some of the variations are minor: our hierarchical hypergraphs are directed, but some works do not make this choice (DBLP:journals/jcss/DrewesHP02). Other differences are much more stark. Sometimes edges are permitted to connect vertices with different parents, as in (DBLP:journals/jcss/Palacz04), sometimes this is prohibited (as it is here), and sometimes it is possible with the aid of an explicit renaming function, as in (DBLP:journals/cuza/BruniGL10). Some approaches consider only “strict morphisms” sending items in the outermost level to the outermost level, but others consider a larger class where this need not hold. Due to the subtle but technically significant differences between our requirements and the properties of previous works, it was not possible to reuse previous work wholesale, and we found it necessary to introduce our own variation.
The formal correspondence between monoidal closed categories and hierarchical hypergraphs lies in a tradition of analogous results relating string diagram rewriting and double-pushout hypergraph rewriting, see (DBLP:conf/lics/BonchiGKSZ16). To the best of our knowledge, such a correspondence has not been spelled out in the way presented in our work, although the idea of linking the exponential structure of closed categories with the hierarchy structure of hierarchical hypergraphs may be found in (DBLP:journals/entcs/CocciaGM02). Although it does not use string diagrams or other categorical tools, the algebraic specification language for hierarchical graphs studied in (DBLP:conf/tgc/BruniGL10) is aiming towards similar goals.
5.3. Syntax as a graph-like data structure
Representing intermediate stages of the compiler as graphs is a long-established practice in compiler design and engineering. Graphs are an efficient syntactic representation which are recognised as a better target for optimisation and analysis than raw text. In its simplest incarnation the graph representation of terms is just an abstract syntax tree, but more sophisticated representations were increasingly used (DBLP:conf/irep/ClickP95), sometimes leading to specific and novel optimisation techniques (DBLP:conf/pldi/NandiWAWDGT20).
The use of graph-like representation outside of compiler engineering has a lot of untapped potential, as advocated by some (DBLP:journals/corr/abs-2102-02363). This is not entirely new, for example interaction nets are a graph-like semantics of higher-order computation (DBLP:conf/popl/Lafont90), but they are specified at a fairly high level of informality which string diagrams and hypernets make fully formal in two different ways.
Although not presented explicitly as a string diagram language, the treatment of closures in (DBLP:journals/entcs/SchweimeierJ99) is related in methodology to our work. However, its use of partially-traced partially-closed premonoidal categories as the categorical setting, in order to accommodate effects, is significantly different from our cartesian closed categorical language.
Finally, another related line of work which we found inspirational is the use of graph-like languages inspired by proof nets to bridge the gap between syntax and abstract machines, in order to provide a quantitative analysis of reduction strategies for the lambda calculus (DBLP:journals/tcs/Accattoli15).
5.4. Automatic differentiation
Our AD algorithm is an adaptation of (DBLP:journals/toplas/PearlmutterS08). Beyond the presentation based on string diagrams, the main differences are that our algorithm applies to simply-typed, recursion-free code and it acts as a source-to-source transformation, lacking the reflection features that enabled higher-order differentiation in the original work. We chose to focus on this algorithm for a few reasons. First, reverse-mode AD is both more immediately useful (see (DBLP:journals/jmlr/BaydinPRS17) for a comparison of both approaches) and harder to implement and prove correct than forward-mode AD. Simple forward-mode AD algorithms based on operator overloading (DBLP:conf/icfp/Karczmarczuk98; DBLP:conf/popl/PearlmutterS07) capable of handling higher-order functions predate (DBLP:journals/toplas/PearlmutterS08). Second, it is to our knowledge the first published algorithm for performing reverse-mode AD on higher-order code. (An earlier algorithm appears in (karczmarczuk2000adjoint); however, it is argued in (DBLP:journals/toplas/PearlmutterS08) that this algorithm results in different computation graphs, and worse asymptotic complexity, than ‘traditional’ reverse-mode AD.) It also forms the basis of a number of efficient implementations (DBLP:journals/corr/BaydinPS16; DBLP:journals/corr/SiskindP16a) and does not require more complex features, unlike e.g. (DBLP:journals/pacmpl/WangZDWER19) which makes use of mutable state and continuations or (DBLP:journals/pacmpl/BrunelMP20) which relies on a limited form of continuations to encode dual spaces.
A wave of recent research has also tackled the issues of correctness in automatic differentiation. Notably, (DBLP:journals/pacmpl/BrunelMP20) and (DBLP:conf/esop/Vakar21) provide correct reverse-mode AD algorithms capable of handling closures. Unlike the first work, however, our algorithm is purely functional and, while the second one can correctly differentiate terms with higher-order inputs and outputs, it does so by using a more expensive representation of tangents of function spaces. The main contribution of our approach, however, is the simplicity of the involved proofs thanks to our diagrammatic notation which we believe improves on the readability of the original paper (DBLP:journals/toplas/PearlmutterS08) and the denser proofs in newer literature (DBLP:journals/pacmpl/BrunelMP20; DBLP:conf/fossacs/HuotSV20; DBLP:conf/esop/Vakar21).
6. Conclusion and further work
In this paper we have presented a recipe for provably correct reverse automatic differentiation built around a new, hierarchical, calculus of string diagrams. As we have seen, the string diagram presentation simplifies much of the bookkeeping of variables which a term calculus would require, which, we believe, makes a complicated algorithm more readable. More importantly, the new perspective offered by string diagrams and in particular the presentation of terms as foliations opens the door for new and useful proof techniques. Finally, the combinatorial representation of string diagrams as hypergraphs makes it possible to formulate automatic differentiation using the established language of DPO graph rewriting.
In this paper we have not discussed implementation matters, yet these are of the essence. This algorithm is a practical one and it can be incorporated into real-life compilers for real-life programming languages. We surmise that the new and improved perspective on AD that string diagrams offer will help handle other challenging features of real-life languages, such as effects and, in particular, the crucial role that closures play. This work is ongoing.