 # The differential calculus of causal functions

Causal functions of sequences occur throughout computer science, from theory to hardware to machine learning. Mealy machines, synchronous digital circuits, signal flow graphs, and recurrent neural networks all have behaviour that can be described by causal functions. In this work, we examine a differential calculus of causal functions which includes many of the familiar properties of standard multivariable differential calculus. These causal functions operate on infinite sequences, but this work gives a different notion of an infinite-dimensional derivative than either the Fréchet or Gateaux derivative used in functional analysis. In addition to showing many standard properties of differentiation, we show causal differentiation obeys a unique recurrence rule. We use this recurrence rule to compute the derivative of a simple recurrent neural network called an Elman network by hand and describe how the computed derivative can be used to train the network.

## Authors


## 1 Introduction

Many computations on infinite data streams operate in a causal manner, meaning their $k$th output depends only on the first $k$ inputs. Mealy machines, clocked digital circuits, signal flow graphs, recurrent neural networks, and discrete time feedback loops in control theory are a few examples of systems performing such computations. When designing these kinds of systems to fit some specification, a common issue is figuring out how adjusting one part of the system will affect the behaviour of the whole. If the system has some real-valued semantics, as is especially common in machine learning or control theory, the derivative of these semantics with respect to a quantity of interest, say an internal parameter, gives a locally-valid first-order estimate of the system-wide effect of a small change to that quantity. Unfortunately, since the most natural semantics for infinite data streams is in an infinite-dimensional vector space, it is not practical to use the resulting infinite-dimensional derivative.

To get around this, one tactic is to replace the infinite system by a finite system obtained by an approximation or heuristic and take derivatives of the replacement system. This can be seen, for example, in backpropagation through time, which trains a recurrent neural network by first unrolling the feedback loop the appropriate number of times and then applying traditional backpropagation to the unrolled network.

This tactic has the advantage that we can take derivatives in a familiar (finite-dimensional) setting, but the disadvantage that it is not clear what properties survive the approximation process from the unfamiliar (infinite-dimensional) setting. For example, it is not immediately clear whether backpropagation through time obeys the usual rules of differential calculus, like a sum or chain rule, nor is this issue confronted in the literature, to the best of our knowledge. Thus, useful compositional properties of differentiation are ignored in exchange for a comfortable setting in which to do calculus.

In this work, we take advantage of the fact that causal functions between sequences are already essentially limits of finite-dimensional functions and therefore have derivatives which can also be expressed as essentially limits of the derivatives of these finite-dimensional functions. This leads us to the basics of a differential calculus of causal functions. Unlike the general case of arbitrary functions between sequences, this limiting process lets us avoid the use of normed vector spaces, and so we believe our notion of derivative is distinct from Fréchet derivatives.

Outline. In section 2, we define causal functions and recall several mechanisms by which these functions on infinite data can be defined. In particular, we recall a coalgebraic scheme finding causal functions as the behaviour of Mealy machines (proposition 2.1), and give a definitional scheme in terms of so-called finite approximants (definition 2.2). In section 3, we define differentiability and derivatives of causal functions on real-vector sequences (definition 3.2) and compute several examples. In section 4, we obtain several rules for our differential causal calculus analogous to those of multivariable calculus, including a chain rule, parallel rule, sum rule, product rule, reciprocal rule, and quotient rule (propositions 4.1, 4.1, 4.1, 4.1, 4.2, and 4.2, respectively). We additionally find a new rule without a traditional analogue we call the recurrence rule (theorem 4.3). Finally, in section 5, we apply this calculus to find derivatives of a simple kind of recurrent neural network called an Elman network  by hand. We also demonstrate how to use the derivative of the network with respect to a parameter to guide updates of that parameter to drive the network towards a desired behaviour.

## 2 Causal functions of sequences

A sequence or stream in a set $A$ is a countably infinite list of values from $A$, which we also think of as a function from the natural numbers $\omega$ to $A$. If $\sigma$ is a stream in $A$, we denote its value at $k \in \omega$ by $\sigma_k$. We may also think of a stream as a listing of its image, like $(\sigma_0, \sigma_1, \sigma_2, \ldots)$. The set of all sequences in $A$ is denoted $A^\omega$.

Given and , we can form a new sequence by prepending to . The sequence is defined by and . This operation can be extended to prepend arbitrary finite-length words by the obvious recursion. Conversely, we can destruct a given sequence into an element and a second sequence with functions and defined by and .

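[slicing] If $\sigma$ is a stream and $i \leq j$ are natural numbers, the slicing $\sigma_{i:j}$ is the list $(\sigma_i, \sigma_{i+1}, \ldots, \sigma_j)$.

[causal function] A function $f : A^\omega \to B^\omega$ is causal when $\sigma_{0:k} = \tau_{0:k}$ implies $f(\sigma)_k = f(\tau)_k$ for all $\sigma, \tau \in A^\omega$ and $k \in \omega$.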

### 2.1 Causal functions via coalgebraic finality

A standard coalgebraic approach to causal functions is to view them as the behaviour of Mealy machines.

[Mealy functor] Given two sets , the functor is defined by on objects and on morphisms.

-coalgebras are Mealy machines with input alphabet and output alphabet , and possibly an infinite state space. The set of causal functions carries a final -coalgebra using the following operations, originally observed by Rutten in .

The Mealy output of a causal function is the function defined by for any .

Given and a causal function , the Mealy (-)derivative of is the causal function defined by .

Note is well-defined even though may be freely chosen due to the causality of .

[Proposition 2.2, ] The set of causal functions carries an -coalgebra via , which is a final -coalgebra.

Hence, a coalgebraic methodology for defining causal functions is to define a Mealy machine and take the image of a particular state in the final coalgebra. By constructing the Mealy machine cleverly, one can ensure the resulting causal function has some desired properties. This is the core idea behind the “syntactic method” using GSOS definitions in . In that work, a Mealy machine of terms is built in such a way that all causal functions can be recovered.

Suppose is a vector space over . This vector space structure can be extended to componentwise in the obvious way. To illustrate the coalgebraic method, we characterise this structure with coalgebraic definitions.

To define sequence vector sum coalgebraically, we define a Mealy machine with one state, satisfying and . Then is defined to be the image of in the final -coalgebra.

Note that technically the vector sum in should be a function of type , so we are tacitly using the isomorphism between and . We will be using similar recastings of sequences in the sequel without bringing up this point again.

The zero vector can similarly be defined by a single state Mealy machine with input alphabet 1 and output alphabet , satisfying and . The zero vector of is the global element picked out by the image of .

Finally, scalar multiplication can be defined with a Mealy machine with states , such that and . Then , where is the image of in the final -coalgebra.

We immediately begin dropping the subscripts from and and when the relevant vector space can be inferred from context.

### 2.2 Causal functions via finite approximation

Another approach to causal functions is to consider them as a limit of finite approximations, replacing the single function on infinite data with infinitely many functions on finite data. There are (at least) two approaches with this general style, which we briefly describe next.

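Let $f : A^\omega \to B^\omega$ be a causal function and $\sigma \in A^\omega$.

The pointwise approximation of $f$ is the sequence of functions $U_k(f) : A^{k+1} \to B$ defined by $U_k(f)(\sigma_{0:k}) = f(\sigma)_k$.

The stringwise approximation of $f$ is the sequence of functions $T_k(f) : A^{k+1} \to B^{k+1}$ defined by $T_k(f)(\sigma_{0:k}) = f(\sigma)_{0:k}$.

Again, these are well-defined, even though the entries of $\sigma$ beyond index $k$ are arbitrary, because $f$ is causal. We chose the letters $U$ and $T$ deliberately: sometimes the pointwise approximants of a causal function are called its Unrollings, and the stringwise approximants are called its Truncations.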
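The two families of approximants can be made concrete with a small sketch. It is only an illustration under stated assumptions: a causal function is modelled by its action on finite prefixes (a list of $k+1$ inputs gives a list of $k+1$ outputs), and `running_sum`, `pointwise`, `stringwise`, and `stringwise_from_pointwise` are hypothetical names, not notation from the text.

```python
from typing import Callable, List, Sequence

# Assumption for illustration: a causal function is represented by what it
# does to finite prefixes, mapping k+1 inputs to the first k+1 outputs.
PrefixFn = Callable[[Sequence[float]], List[float]]


def running_sum(prefix: Sequence[float]) -> List[float]:
    """Example causal function: output k is the sum of inputs 0..k."""
    out: List[float] = []
    total = 0.0
    for x in prefix:
        total += x
        out.append(total)
    return out


def pointwise(f: PrefixFn, k: int) -> Callable[[Sequence[float]], float]:
    """U_k(f): the k-th output as a function of the first k+1 inputs."""
    def u_k(prefix: Sequence[float]) -> float:
        if len(prefix) != k + 1:
            raise ValueError("U_k expects exactly k+1 inputs")
        return f(prefix)[k]
    return u_k


def stringwise(f: PrefixFn, k: int) -> Callable[[Sequence[float]], List[float]]:
    """T_k(f): the first k+1 outputs as a function of the first k+1 inputs."""
    def t_k(prefix: Sequence[float]) -> List[float]:
        if len(prefix) != k + 1:
            raise ValueError("T_k expects exactly k+1 inputs")
        return f(prefix)
    return t_k


def stringwise_from_pointwise(f: PrefixFn, k: int) -> Callable[[Sequence[float]], List[float]]:
    """Rebuild T_k(f) from U_0(f), ..., U_k(f), one output at a time."""
    def t_k(prefix: Sequence[float]) -> List[float]:
        return [pointwise(f, j)(prefix[: j + 1]) for j in range(k + 1)]
    return t_k
```

For the prefix `[1.0, 2.0, 3.0, 4.0]`, `stringwise_from_pointwise(running_sum, 3)` and `stringwise(running_sum, 3)` return the same list, which is the sense in which the two families are inter-obtainable.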

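Conversely, given an arbitrary collection of functions $u_k : A^{k+1} \to B$ for $k \in \omega$, there is a unique causal function whose pointwise approximation is the sequence $(u_k)_{k \in \omega}$. Thus we have the following bijective correspondence:

$$\{\text{causal functions } A^\omega \to B^\omega\} \;\cong\; \{\text{families } (u_k : A^{k+1} \to B)_{k \in \omega}\} \tag{1}$$

We can nearly do the same for stringwise approximations, but the sequence $t_k : A^{k+1} \to B^{k+1}$ must satisfy $t_{k+1}(\sigma_{0:k+1})_{0:k} = t_k(\sigma_{0:k})$ for all $\sigma \in A^\omega$ and $k \in \omega$.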

The interchangeability between a causal function and its approximants is a crucial theme in this work. Since a function’s pointwise and stringwise approximants are inter-obtainable, we will sometimes refer to a causal function’s “finite approximants” by which we mean either family of approximants.

### 2.3 Causal functions via recurrence

Finite approximants are a very flexible way of defining causal functions, but causal functions may have a more compact representation when they conform to a regular pattern. Recurrence is one such pattern where a causal function is defined by repeatedly using an ordinary function $g : A \times B \to B$ and an initial value $i \in B$ to obtain $\mathrm{rec}_i(g) : A^\omega \to B^\omega$ via:

$$[\mathrm{rec}_i(g)(\sigma)]_k = \begin{cases} g(\sigma_0, i) & \text{if } k = 0 \\ g(\sigma_k, [\mathrm{rec}_i(g)(\sigma)]_{k-1}) & \text{if } k > 0 \end{cases}$$

Recurrent definitions can be converted into finite approximant definitions using the following: . Note these pointwise approximants satisfy the recurrence relation .

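The unary running product function $\prod : \mathbb{R}^\omega \to \mathbb{R}^\omega$ can be defined by a recurrence relation:

$$\textstyle\prod(\sigma) = \tau \iff \left\{\; \tau_{k+1} = \sigma_{k+1} \cdot \tau_k \ \text{ after } \ \tau_0 = \sigma_0 \cdot 1 \right.$$

Here $\cdot$ is multiplication of reals and $\prod = \mathrm{rec}_1(\cdot)$. In approximant form, $U_k(\prod)(\sigma_{0:k}) = \prod_{i=0}^{k} \sigma_i$.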
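A recurrence of this kind translates directly into a lazy stream transformer. The sketch below is a minimal illustration, assuming streams are modelled as Python iterables; `rec` and `running_product` are names chosen here for the example.

```python
from itertools import count, islice
from typing import Callable, Iterable, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def rec(g: Callable[[A, B], B], init: B) -> Callable[[Iterable[A]], Iterator[B]]:
    """Build the causal function defined by tau_0 = g(sigma_0, init) and
    tau_{k+1} = g(sigma_{k+1}, tau_k), acting lazily on a stream."""
    def causal(stream: Iterable[A]) -> Iterator[B]:
        prev = init
        for x in stream:
            prev = g(x, prev)  # each output uses only inputs seen so far
            yield prev
    return causal


# Running product: g is multiplication of reals and the initial value is 1.
running_product = rec(lambda x, acc: x * acc, 1.0)

# The input may be infinite; only the requested outputs are computed.
first_five = list(islice(running_product(count(1)), 5))
print(first_five)  # [1.0, 2.0, 6.0, 24.0, 120.0]
```

Because the generator consumes one input per output, causality holds by construction: the $k$th output cannot depend on later inputs.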

A special case of recurrent causal functions occurs when there is an such that for all . In this case, and in particular does not depend on the initial value or any entry for . We denote by in this special case since it maps componentwise across the input sequence.

## 3 Differentiating causal functions

Our goal in this work is to develop a basic differential calculus for causal functions. Thus we will focus our attention on causal functions between real-vector sequences $(\mathbb{R}^n)^\omega \to (\mathbb{R}^m)^\omega$ for $n, m \in \omega$, specializing from causal functions on general sets from the last section. We will draw many of our illustrating examples for derivatives from Rutten’s stream calculus, which describes many such causal functions between real-number streams. More importantly, that work establishes many useful algebraic properties of these functions rigorously via coalgebraic methods.

There are many different approaches one might consider to defining differentiable causal functions. One might be to take the original coalgebraic definition and replace the underlying category ($\mathbf{Set}$) with a category of finite-dimensional Cartesian spaces and differentiable (or smooth) maps. Unfortunately, the space of differentiable functions between finite-dimensional spaces is not finite-dimensional, so the exponential needed to define the functor in this category does not exist.

Another approach is to think of causal functions as functions between infinite dimensional vector spaces and take standard notions from analysis, like Fréchet derivatives, and apply them in this context. However, norms on sequence spaces usually impose a finiteness condition like bounded or square-summable on the domains and ranges of sequence functions. These restrictions are compatible with many causal functions like the pointwise sum function above, but other causal functions like the running product function become significantly less interesting.

Our approach to differentiating causal functions is to consider a causal function differentiable when all of its finite approximants are differentiable via the correspondence (1). We will develop this idea rigorously in section 3.2, but first we need to know a bit about linear causal functions.

### 3.1 Linear causal functions

Stated abstractly, the derivative of a function at a point is a linear map which provides an approximate change in the output of a function given an input representing a small change in the input to that function . Since linear functions are in bijective correspondence with their slopes, typically in single-variable calculus the derivative of a function at a point is instead given as a single real number. In multivariable calculus, derivatives are usually represented by (Jacobian) matrices since matrices represent linear maps between finite dimensional spaces. Linear functions between infinite dimensional vector spaces do not have a similarly compact, computationally-useful representation, but we can still define derivatives of (causal) functions at points to be linear (causal) maps.

We described the natural vector space structure of in Example 2.1. A linear causal function is a causal function which is also linear with respect to this vector space structure.

A causal function is linear when and for all and .

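Let $f : (\mathbb{R}^n)^\omega \to (\mathbb{R}^m)^\omega$ be a causal function. The following are equivalent:

1. $f$ is linear,

2. $U_k(f)$ is linear for all $k \in \omega$, and

3. $T_k(f)$ is linear for all $k \in \omega$.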

This refines the correspondence (1), allowing us to define a linear causal function by naming linear finite approximants.

Since linear functions between finite dimensional vector spaces can be represented by matrices, we can think of a linear causal function as a limit of the matrices representing its finite approximants. This view results in row-finite infinite matrices, such as:

$$\begin{bmatrix} A_{00} & 0 & 0 & \cdots \\ A_{10} & A_{11} & 0 & \cdots \\ A_{20} & A_{21} & A_{22} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

where the $A_{ij}$ are $m$-row, $n$-column blocks such that for all $j > i$ the entries of $A_{ij}$ are 0. These are related to the matrices for the approximants of the causal function $f$ as follows.

1. The matrix $\begin{bmatrix} A_{k0} & A_{k1} & \cdots & A_{kk} \end{bmatrix}$ is the matrix representing $U_k(f)$.

2. The matrix $\begin{bmatrix} A_{00} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ A_{k0} & \cdots & A_{kk} \end{bmatrix}$ is the matrix representing $T_k(f)$. The compatibility conditions on the functions $T_k(f)$ ensure that the matrix for $T_k(f)$ can be found in the upper left corner of the matrix for $T_{k+1}(f)$. Note also that the block lower triangular shape of the matrices for $T_k(f)$ is a consequence of causality: the first $m$ outputs can depend only on the first $n$ inputs, so the last $nk$ entries in each of the top $m$ rows must all be 0, and so on.

Unlike finite-dimensional matrices, we do not think these infinite matrices are a computationally useful representation, but they are conceptually useful to get an idea of how causal linear functions can be considered the limit of their linear truncations.

### 3.2 Definition of derivative

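As we have mentioned, we will use the derivatives of the approximants of a causal function to define the derivative of the causal function itself. We denote the $m$-row, $n$-column Jacobian matrix of a differentiable function $\varphi : \mathbb{R}^n \to \mathbb{R}^m$ at $x \in \mathbb{R}^n$ by $J\varphi(x)$. Recall this matrix is

$$J\varphi(x) = \begin{bmatrix} \frac{\partial \varphi_1}{\partial x_1}(x) & \frac{\partial \varphi_1}{\partial x_2}(x) & \cdots & \frac{\partial \varphi_1}{\partial x_n}(x) \\ \frac{\partial \varphi_2}{\partial x_1}(x) & \frac{\partial \varphi_2}{\partial x_2}(x) & \cdots & \frac{\partial \varphi_2}{\partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial \varphi_m}{\partial x_1}(x) & \frac{\partial \varphi_m}{\partial x_2}(x) & \cdots & \frac{\partial \varphi_m}{\partial x_n}(x) \end{bmatrix}$$

where $\varphi = (\varphi_1, \ldots, \varphi_m)$ and $x = (x_1, \ldots, x_n)$. We will also be glossing over the distinction between a matrix and the linear function it represents, using $J\varphi(x)$ to mean either when convenient.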

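A causal function $f : (\mathbb{R}^n)^\omega \to (\mathbb{R}^m)^\omega$ is differentiable at $\sigma \in (\mathbb{R}^n)^\omega$ if all of its finite approximants $U_k(f)$ are differentiable at $\sigma_{0:k}$ for all $k \in \omega$. If $f$ is differentiable at $\sigma$, the derivative of $f$ at $\sigma$ is the unique linear causal function $D^*f(\sigma) : (\mathbb{R}^n)^\omega \to (\mathbb{R}^m)^\omega$ satisfying $U_k(D^*f(\sigma)) = J(U_k(f))(\sigma_{0:k})$.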

In this definition we are using the correspondence (1), refined in Lemma 3.1, which allows us to define a causal (linear) function by specifying its (linear) finite approximants. We could equally well have used stringwise approximants in this definition rather than pointwise approximants, as the following lemma states.

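The causal function $f$ is differentiable at $\sigma$ if and only if each $T_k(f)$ is differentiable at $\sigma_{0:k}$ for all $k \in \omega$. In this case, $D^*f(\sigma)$ satisfies $T_k(D^*f(\sigma)) = J(T_k(f))(\sigma_{0:k})$.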

Though we have mentioned this is not particularly useful computationally, the derivative of a differentiable function at a point has a representation as a row-finite infinite matrix.

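If $f$ is differentiable at $\sigma$, each $T_k(f)$ has an $m(k+1)$-row, $n(k+1)$-column Jacobian matrix representing its derivative at $\sigma_{0:k}$. Let $A_{ij}$ be the $m$-row, $n$-column blocks of this Jacobian, so that

$$J(T_k(f))(\sigma_{0:k}) = \begin{bmatrix} A_{00} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ A_{k0} & \cdots & A_{kk} \end{bmatrix}.$$

The derivative of $f$ at $\sigma$ is the linear causal function represented by the row-finite infinite matrix

$$D^*f(\sigma) = \begin{bmatrix} A_{00} & 0 & 0 & \cdots \\ A_{10} & A_{11} & 0 & \cdots \\ A_{20} & A_{21} & A_{22} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

Note that this linear causal function can be evaluated at a sequence $\Delta\sigma$ by multiplying the infinite matrix by $\Delta\sigma$, considered as an infinite column vector.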
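The triangular shape and the nesting of these Jacobians can be observed numerically. The following sketch is an illustration only, assuming scalar streams ($n = m = 1$), the running product as the causal function, and a central finite difference in place of an exact Jacobian; `truncation` and `jacobian` are hypothetical helper names.

```python
from typing import Callable, List, Sequence


def truncation(prefix: Sequence[float]) -> List[float]:
    """T_k of the running product: output j is the product of inputs 0..j."""
    out: List[float] = []
    acc = 1.0
    for x in prefix:
        acc *= x
        out.append(acc)
    return out


def jacobian(f: Callable[[Sequence[float]], List[float]],
             x: Sequence[float], h: float = 1e-6) -> List[List[float]]:
    """Approximate the Jacobian of f at x by central differences.
    Entry [i][j] estimates the partial derivative of output i by input j."""
    columns = []
    for j in range(len(x)):
        up = list(x)
        up[j] += h
        down = list(x)
        down[j] -= h
        columns.append([(a - b) / (2 * h) for a, b in zip(f(up), f(down))])
    return [list(row) for row in zip(*columns)]  # transpose columns to rows


sigma = [2.0, 3.0, 5.0]
J3 = jacobian(truncation, sigma)      # approximates J(T_2)(sigma_{0:2})
J2 = jacobian(truncation, sigma[:2])  # approximates J(T_1)(sigma_{0:1})

for row in J3:
    print([round(v, 4) for v in row])
```

Entries above the diagonal come out as exactly zero, since an output never sees a later input, and `J2` agrees with the upper left corner of `J3` up to rounding.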

### 3.3 Examples

Next, we use this definition of derivative to find the causal derivatives of some basic functions from Rutten’s stream calculus.

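We show the pointwise sum stream function $+ : \mathbb{R}^\omega \times \mathbb{R}^\omega \to \mathbb{R}^\omega$ is its own derivative at every point $(\sigma, \tau)$. Note $U_k(+)(\sigma_0, \tau_0, \ldots, \sigma_k, \tau_k) = \sigma_k + \tau_k$, so $J(U_k(+))(\sigma_0, \tau_0, \ldots, \sigma_k, \tau_k) = \begin{bmatrix} 0 & \cdots & 0 & 1 & 1 \end{bmatrix}$. This is the matrix representation of $U_k(+)$ itself, so $D^*(+)(\sigma, \tau) = +$ or, in other notation, $D^*(+)(\sigma, \tau)(\Delta\sigma, \Delta\tau) = \Delta\sigma + \Delta\tau$ for any $\Delta\sigma, \Delta\tau$.

This argument can be repeated for all pointwise sum functions on $(\mathbb{R}^n)^\omega$, replacing the “1” blocks in the Jacobian above with $n \times n$ identity blocks.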

Since the derivative of any constant is , the derivative of any constant sequence must necessarily be the zero sequence. In stream calculus, there are two important constant sequences defined corecursively: defined by and for all and defined by and . Written out as sequences, and .

.

Next, we consider the Cauchy sequence product. Under the correspondence between sequences and formal power series , the Cauchy product is the sequence operation corresponding to the (Cauchy) product of formal power series. This operation is coalgebraically characterized in Rutten  as the unique function satisfying and . For our purposes, the explicit definition is more useful: .

We compute the derivative of the Cauchy product.

Notice that multiplying this matrix by (an initial segment) of a small change sequence yields

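$$J(U_k(\times))(\sigma_0, \tau_0, \ldots, \sigma_k, \tau_k)(\Delta\sigma_0, \Delta\tau_0, \ldots, \Delta\sigma_k, \Delta\tau_k) = \sum_{i=0}^{k} \Delta\sigma_i \cdot \tau_{k-i} + \sum_{i=0}^{k} \sigma_i \cdot \Delta\tau_{k-i}$$

Therefore, $D^*(\times)(\sigma, \tau)(\Delta\sigma, \Delta\tau) = \Delta\sigma \times \tau + \sigma \times \Delta\tau$.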
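This formula can be checked on finite prefixes. The sketch below assumes the explicit convolution form of the Cauchy product, $(\sigma \times \tau)_k = \sum_{i=0}^{k} \sigma_i \cdot \tau_{k-i}$, which is what the displayed Jacobian action implies; `cauchy` and `d_cauchy` are names chosen for illustration. Because the product is bilinear, the first-order term can be isolated exactly.

```python
from typing import List, Sequence


def cauchy(sigma: Sequence[int], tau: Sequence[int]) -> List[int]:
    """First entries of the Cauchy (convolution) product of two prefixes."""
    n = min(len(sigma), len(tau))
    return [sum(sigma[i] * tau[k - i] for i in range(k + 1)) for k in range(n)]


def d_cauchy(sigma, tau, d_sigma, d_tau) -> List[int]:
    """Derivative of the Cauchy product at (sigma, tau), applied to the
    change (d_sigma, d_tau): d_sigma x tau + sigma x d_tau."""
    left = cauchy(d_sigma, tau)
    right = cauchy(sigma, d_tau)
    return [a + b for a, b in zip(left, right)]


sigma, tau = [1, 2, 3], [4, 5, 6]
d_sigma, d_tau = [1, 0, 2], [0, 1, 1]
eps = 2  # integer step so that every comparison below is exact

moved = cauchy([s + eps * d for s, d in zip(sigma, d_sigma)],
               [t + eps * d for t, d in zip(tau, d_tau)])
change = [a - b for a, b in zip(moved, cauchy(sigma, tau))]

# Bilinearity: change = eps * derivative + eps^2 * (d_sigma x d_tau).
first_order = [eps * v for v in d_cauchy(sigma, tau, d_sigma, d_tau)]
second_order = [eps * eps * v for v in cauchy(d_sigma, d_tau)]
print(change == [a + b for a, b in zip(first_order, second_order)])  # True
```

The only discrepancy between the true change and the derivative is the quadratic term, as expected for a product.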

Another sequence product considered in the stream calculus is the Hadamard product, also called the pointwise product. Defined coalgebraically, the Hadamard product is the unique binary operation defined by and . This has a similar derivative to the Cauchy product: .

Note that these derivatives make sense without any reference to properties of the sequences used. We are not aware of a way to realize this derivative as an instance of a notion of derivative known in analysis. The most obvious notion to try is a Fréchet derivative induced by a norm on the space of sequences. However, all norms we know on these spaces, including -norms and -geometric norms for , restrict the space of sequences to various extents.

## 4 Rules of causal differentiation

Just as it is impractical to compute all derivatives from the definition in undergraduate calculus, it is also impractical to compute causal derivatives directly from the definition. To ease this burden, one typically proves various “rules” of differentiation which provide compositional recipes for finding derivatives. That is our task in this section.

There are at least two good reasons to hope a priori that the standard rules of differentiation might hold for causal derivatives. First, causal derivatives were defined to agree with standard derivatives in their finite approximants. Since these approximant derivatives satisfy these rules, we might hope that they hold over the limiting process. Second, smooth causal functions form a Cartesian differential category, as was shown in . The theory of Cartesian differential categories includes as axioms or theorems abstract versions of the chain rule, sum rule, etc. However, neither of these reasons is immediately sufficient, so we must provide independent justification.

### 4.1 Basic rules and their consequences

We begin by stating some rules familiar from undergraduate calculus.

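[causal chain rule] Suppose $f : (\mathbb{R}^n)^\omega \to (\mathbb{R}^m)^\omega$ and $g : (\mathbb{R}^m)^\omega \to (\mathbb{R}^\ell)^\omega$ are causal functions. Suppose further $f$ is differentiable at $\sigma$ and $g$ is differentiable at $f(\sigma)$. Then $g \circ f$ is differentiable at $\sigma$ and its derivative is $D^*(g \circ f)(\sigma) = D^*g(f(\sigma)) \circ D^*f(\sigma)$.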

###### Proof.

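Let $f_k = T_k(f)$, $g_k = T_k(g)$, and $h_k = T_k(g \circ f)$. We know $h_k = g_k \circ f_k$. We show the stringwise approximants of $D^*(g \circ f)(\sigma)$ and $D^*g(f(\sigma)) \circ D^*f(\sigma)$ match.

$$\begin{aligned} T_k(D^*(g \circ f)(\sigma)) &= J(h_k)(\sigma_{0:k}) = J(g_k \circ f_k)(\sigma_{0:k}) \\ &= J(g_k)(f_k(\sigma_{0:k})) \times J(f_k)(\sigma_{0:k}) && (\ast) \\ &= J(g_k)(f(\sigma)_{0:k}) \times J(f_k)(\sigma_{0:k}) \\ &= T_k(D^*g(f(\sigma))) \circ T_k(D^*f(\sigma)) = T_k(D^*g(f(\sigma)) \circ D^*f(\sigma)) \end{aligned}$$

where the starred line is by the classical chain rule. ∎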

Since we have already overloaded $\times$ for both Cauchy stream product and matrix product, we use $\parallel$ for the parallel composition of functions, where the parallel composition of $f : A \to B$ and $h : C \to D$ is defined by $(f \parallel h)(a, c) = (f(a), h(c))$ for $a \in A$ and $c \in C$. We do not know of a standard name for this rule, but in multivariable calculus there is a rule $J(f \parallel h)(x, y) = Jf(x) \parallel Jh(y)$, which we shall call the parallel rule. There is a similar rule for causal derivatives we describe next.

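[causal parallel rule] Suppose $f$ and $h$ are causal functions, and that they are differentiable at $\sigma$ and $\tau$, respectively. Then $f \parallel h$ is differentiable at $(\sigma, \tau)$ and its derivative is $D^*(f \parallel h)(\sigma, \tau) = D^*f(\sigma) \parallel D^*h(\tau)$.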

###### Proof.

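The stringwise approximants of $D^*(f \parallel h)(\sigma, \tau)$ and $D^*f(\sigma) \parallel D^*h(\tau)$ match:

$$\begin{aligned} T_k(D^*(f \parallel h)(\sigma, \tau)) &= J(T_k(f \parallel h))(\sigma_{0:k}, \tau_{0:k}) = J(T_k(f) \parallel T_k(h))(\sigma_{0:k}, \tau_{0:k}) \\ &= J(T_k(f))(\sigma_{0:k}) \parallel J(T_k(h))(\tau_{0:k}) && (\ast) \\ &= T_k(D^*f(\sigma)) \parallel T_k(D^*h(\tau)) = T_k(D^*f(\sigma) \parallel D^*h(\tau)) \end{aligned}$$

where the starred line is by the classical parallel rule. ∎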

[causal linearity] If $f$ is a linear causal function, it is differentiable at every $\sigma$ and its derivative is $D^*f(\sigma) = f$.

These three results are the fundamental properties of causal differentiation we will be using. Many other standard rules are consequences of these. For example, we can derive a sum rule from these properties.

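The sum of two causal maps $f, g : (\mathbb{R}^n)^\omega \to (\mathbb{R}^m)^\omega$ is defined to be $f + g = + \circ (f \parallel g) \circ \Delta_{(\mathbb{R}^n)^\omega}$, where $\Delta_{(\mathbb{R}^n)^\omega}$ is the sequence duplication map.

[causal sum rule] If $f$ and $g$ as in Definition 4.1 are both differentiable at $\sigma$, so is their sum and its derivative is $D^*(f + g)(\sigma) = D^*f(\sigma) + D^*g(\sigma)$.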

###### Proof.

Using the properties above, we find

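$$\begin{aligned} D^*(f + g)(\sigma) &= D^*(+ \circ (f \parallel g) \circ \Delta_{(\mathbb{R}^n)^\omega})(\sigma) && \text{(sum of maps def'n)} \\ &= D^*(+)(((f \parallel g) \circ \Delta_{(\mathbb{R}^n)^\omega})(\sigma)) \circ D^*((f \parallel g) \circ \Delta_{(\mathbb{R}^n)^\omega})(\sigma) && \text{(causal chain rule)} \\ &= + \circ D^*((f \parallel g) \circ \Delta_{(\mathbb{R}^n)^\omega})(\sigma) && \text{(linearity of } + \text{)} \\ &= + \circ D^*(f \parallel g)(\Delta_{(\mathbb{R}^n)^\omega}(\sigma)) \circ D^*(\Delta_{(\mathbb{R}^n)^\omega})(\sigma) && \text{(causal chain rule)} \\ &= + \circ D^*(f \parallel g)(\sigma, \sigma) \circ \Delta_{(\mathbb{R}^n)^\omega} && \text{(def'n \& linearity of } \Delta \text{)} \\ &= + \circ (D^*f(\sigma) \parallel D^*g(\sigma)) \circ \Delta_{(\mathbb{R}^n)^\omega} && \text{(causal parallel rule)} \\ &= D^*f(\sigma) + D^*g(\sigma) && \text{(sum of maps def'n)} \end{aligned}$$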

as desired. ∎

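For functions $f, g : (\mathbb{R}^n)^\omega \to \mathbb{R}^\omega$, we can define their Cauchy and Hadamard products $f \times g$ and $f \odot g$ with the pattern of Definition 4.1 and prove two product rules using the derivatives of the binary operations $\times$ and $\odot$ we computed earlier.

[causal product rules] If $f, g : (\mathbb{R}^n)^\omega \to \mathbb{R}^\omega$ are causal functions differentiable at $\sigma$, so are their Cauchy and Hadamard products, and their derivatives are

$$\begin{aligned} D^*(f \times g)(\sigma)(\Delta\sigma) &= D^*f(\sigma)(\Delta\sigma) \times g(\sigma) + f(\sigma) \times D^*g(\sigma)(\Delta\sigma) \\ D^*(f \odot g)(\sigma)(\Delta\sigma) &= D^*f(\sigma)(\Delta\sigma) \odot g(\sigma) + f(\sigma) \odot D^*g(\sigma)(\Delta\sigma) \end{aligned}$$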

A typical point of confusion in undergraduate calculus is the role of constants: sometimes they are treated like elements of the underlying vector space and sometimes like functions which always return that vector. In our calculus, a constant can similarly sometimes mean a fixed sequence picked out by or the composition of this map after a discarding map . We have described the derivative of a constant element in Example 3.3, now we treat constant maps.

[causal constant rule] The derivative of is . If is a constant map, its derivative is the constant map .

[causal constant multiple rule] If is a constant function and is any other causal function differentiable at , so is and its derivative is .

###### Proof.

Combine the causal product rule and the causal constant rule. ∎

### 4.2 Implicit causal differentiation

We have seen the standard rules presented in the last section are useful as computational shortcuts, just as they are in undergraduate calculus. In the causal calculus they turn out to be perhaps even more crucial, since some differentiable causal functions do not have simple closed forms, so trying to find their derivative from the definition is extremely difficult.

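The stream inverse is the first partial causal function we will consider. This operation is defined on $\sigma \in \mathbb{R}^\omega$ such that $\sigma_0 \neq 0$ with the unbounded-order recurrence relation

$$[\sigma^{-1}]_k = \begin{cases} \dfrac{1}{\sigma_0} & \text{if } k = 0 \\[2ex] -\dfrac{1}{\sigma_0} \cdot \displaystyle\sum_{i=0}^{k-1} \left(\sigma_{k-i} \cdot [\sigma^{-1}]_i\right) & \text{if } k > 0. \end{cases}$$

Reasoning about this function in terms of its components is extraordinarily difficult since each component is defined in terms of all the preceding components. However, there is a useful fact from Rutten which we can use to find the derivative of this operation at all $\sigma$ where it is defined: $\sigma \times \sigma^{-1} = [1]$, where $[1] = (1, 0, 0, \ldots)$ is the unit of the Cauchy product.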

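[causal reciprocal rule] The partial function $(\cdot)^{-1} : \mathbb{R}^\omega \rightharpoonup \mathbb{R}^\omega$ is differentiable at all $\sigma$ such that $\sigma_0 \neq 0$, and its derivative is

$$(D^*(\cdot)^{-1})(\sigma)(\Delta\sigma) = [-1] \times \sigma^{-1} \times \sigma^{-1} \times \Delta\sigma$$

###### Proof.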

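Since $\sigma \times \sigma^{-1} = [1]$, their derivatives must also be equal. In particular:

$$[0] = D^*[1] = D^*(\sigma \times \sigma^{-1})(\Delta\sigma) = \sigma \times (D^*(\cdot)^{-1})(\sigma)(\Delta\sigma) + \Delta\sigma \times (\sigma^{-1})$$

using the causal product rule. Solving this equation for $(D^*(\cdot)^{-1})(\sigma)(\Delta\sigma)$ yields

$$(D^*(\cdot)^{-1})(\sigma)(\Delta\sigma) = [-1] \times \sigma^{-1} \times \sigma^{-1} \times \Delta\sigma$$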

where we are implicitly using many of the identities established in . ∎
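The argument can be replayed on finite prefixes with exact arithmetic. The sketch below is a minimal illustration, assuming the Cauchy product is the convolution $(\sigma \times \tau)_k = \sum_{i=0}^{k} \sigma_i \cdot \tau_{k-i}$ and computing the inverse entry by entry from the requirement that $\sigma \times \sigma^{-1}$ is $(1, 0, 0, \ldots)$; the names `cauchy`, `inverse`, and `d_inverse` are hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence


def cauchy(sigma: Sequence[Fraction], tau: Sequence[Fraction]) -> List[Fraction]:
    """First entries of the Cauchy (convolution) product of two prefixes."""
    n = min(len(sigma), len(tau))
    return [sum(sigma[i] * tau[k - i] for i in range(k + 1)) for k in range(n)]


def inverse(sigma: Sequence[Fraction]) -> List[Fraction]:
    """Prefix of the stream inverse; requires sigma[0] != 0."""
    if sigma[0] == 0:
        raise ZeroDivisionError("stream inverse needs a nonzero head")
    inv = [Fraction(1) / sigma[0]]
    for k in range(1, len(sigma)):
        # Entry k is chosen so that entry k of sigma x inv is zero.
        s = sum(sigma[k - i] * inv[i] for i in range(k))
        inv.append(-s / sigma[0])
    return inv


def d_inverse(sigma: Sequence[Fraction], d_sigma: Sequence[Fraction]) -> List[Fraction]:
    """Reciprocal rule: [-1] x inv x inv x d_sigma, on a finite prefix."""
    inv = inverse(sigma)
    return [-v for v in cauchy(cauchy(inv, inv), d_sigma)]


sigma = [Fraction(2), Fraction(1), Fraction(3)]
d_sigma = [Fraction(1), Fraction(-1), Fraction(2)]
inv = inverse(sigma)

# The equation solved in the proof: sigma x D(inv) + d_sigma x inv = 0.
residual = [a + b for a, b in zip(cauchy(sigma, d_inverse(sigma, d_sigma)),
                                  cauchy(d_sigma, inv))]
print(cauchy(sigma, inv), residual)
```

Using `Fraction` keeps every comparison exact, so the residual is identically zero on the prefix rather than merely small.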

When adopting the conventions that and , this rule looks quite like the usual rule for the derivative of the reciprocal function: .

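[causal quotient rule] If $f, g$ are causal functions differentiable at $\sigma$ and $g(\sigma)_0 \neq 0$, then $f / g$ is also differentiable at $\sigma$ and its derivative is

$$\frac{D^*f(\sigma)(\Delta\sigma) \times g(\sigma) + [-1] \times f(\sigma) \times D^*g(\sigma)(\Delta\sigma)}{g(\sigma)^2}.$$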

### 4.3 The recurrence rule

So far, causal differential calculus is rather similar to traditional differential calculus. There are two different product rules corresponding to two different products. We were forced to use an implicit differentiation trick to find the derivative of the reciprocal function, but in the end we found a familiar result. However, next we state a rule with no traditional analogue.

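[causal recurrence rule] Let $g : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ be differentiable (everywhere) and $i \in \mathbb{R}^m$. Then $\mathrm{rec}_i(g) : (\mathbb{R}^n)^\omega \to (\mathbb{R}^m)^\omega$ is differentiable (everywhere) as a causal function and, writing $\tau = \mathrm{rec}_i(g)(\sigma)$, its derivative $\Delta\tau = D^*[\mathrm{rec}_i(g)](\sigma)(\Delta\sigma)$ satisfies the following recurrence:

$$\begin{cases} \tau_{k+1} = g(\sigma_{k+1}, \tau_k) \ \text{ after } \ \tau_0 = g(\sigma_0, i) \\ \Delta\tau_{k+1} = Jg(\sigma_{k+1}, \tau_k)(\Delta\sigma_{k+1}, \Delta\tau_k) \ \text{ after } \ \Delta\tau_0 = Jg(\sigma_0, i)(\Delta\sigma_0, 0_{\mathbb{R}^m}) \end{cases}$$

###### Proof.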

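We check $U_k([D^*\mathrm{rec}_i(g)](\sigma))(\Delta\sigma_{0:k}) = \Delta\tau_k$ by induction on $k$. To simplify our notation, we write $u_k = U_k(\mathrm{rec}_i(g))$. The base case is easy:

$$\begin{aligned} U_0([D^*\mathrm{rec}_i(g)](\sigma))(\Delta\sigma_0) &= J(U_0(\mathrm{rec}_i(g)))(\sigma_0)(\Delta\sigma_0) \\ &= J(\lambda x.\, g(x, i))(\sigma_0)(\Delta\sigma_0) = Jg(\sigma_0, i)(\Delta\sigma_0, 0_{\mathbb{R}^m}) \end{aligned}$$

The induction step uses the fact that $u_k = g \circ \langle \pi_k, u_{k-1} \circ \overline{\pi}_k \rangle$.

$$\begin{aligned} U_k([D^*\mathrm{rec}_i(g)](\sigma))(\Delta\sigma_{0:k}) &= Ju_k(\sigma_{0:k})(\Delta\sigma_{0:k}) \\ &= [Jg(\sigma_k, \tau_{k-1}) \circ \langle J\pi_k(\sigma_{0:k}), J(u_{k-1} \circ \overline{\pi}_k)(\sigma_{0:k}) \rangle](\Delta\sigma_{0:k}) \\ &= [Jg(\sigma_k, \tau_{k-1}) \circ \langle \pi_k, Ju_{k-1}(\sigma_{0:k-1}) \circ \overline{\pi}_k \rangle](\Delta\sigma_{0:k}) \\ &= Jg(\sigma_k, \tau_{k-1})(\Delta\sigma_k, Ju_{k-1}(\sigma_{0:k-1})(\Delta\sigma_{0:k-1})) \\ &= Jg(\sigma_k, \tau_{k-1})(\Delta\sigma_k, \Delta\tau_{k-1}) \end{aligned}$$

where $\pi_k$ is the projection onto the last element of a list and $\overline{\pi}_k$ is the map discarding the last element of a list. ∎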

Degenerate recurrences, which do not refer to previous values generated by the recurrence, are a special instance of this rule.

[causal map rule] Let be a differentiable function. Then is differentiable as a causal function, and its derivative is .

To illustrate the recurrence rule, we revisit the running product function, introduced in Example 2.3, and compute its derivative.

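The unary running product function was defined to be $\mathrm{rec}_1(g)$ where $g$ is binary multiplication of reals. In approximant form, $U_k(\mathrm{rec}_1(g))(\sigma_{0:k}) = \prod_{i=0}^{k} \sigma_i$. We compute a recurrence for the derivative of this function using the recurrence rule.

Since $g$ is binary multiplication, $Jg(x, y) = \begin{bmatrix} y & x \end{bmatrix}$. By the recurrence rule, $\Delta\tau = D^*\mathrm{rec}_1(g)(\sigma)(\Delta\sigma)$ satisfies the recurrence

$$\begin{cases} \tau_{k+1} = \sigma_{k+1} \cdot \tau_k \ \text{ after } \ \tau_0 = \sigma_0 \\ \Delta\tau_{k+1} = \Delta\sigma_{k+1} \cdot \tau_k + \sigma_{k+1} \cdot \Delta\tau_k \ \text{ after } \ \Delta\tau_0 = \Delta\sigma_0 \end{cases}$$

Note that a direct computation of the derivative of this function is available since we have a simple form for its pointwise approximants. Directly from the definition we would get

$$\Delta\tau_k = U_k(D^*\mathrm{rec}_1(g)(\sigma))(\Delta\sigma_{0:k}) = \sum_{i=0}^{k} \prod_{j=0}^{k} \rho_{ij}$$

where $\rho_{ij}$ is $\Delta\sigma_j$ if $i = j$ and $\sigma_j$ otherwise.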
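Both routes to the derivative of the running product can be compared on a short prefix. The sketch below assumes scalar streams and models the Jacobian action of multiplication as the function $(x, y, \Delta x, \Delta y) \mapsto y \cdot \Delta x + x \cdot \Delta y$; `rec_with_derivative`, `j_mul`, and `direct` are illustrative names, not from the text.

```python
from typing import Callable, List, Sequence, Tuple


def rec_with_derivative(
    g: Callable[[float, float], float],
    jg: Callable[[float, float, float, float], float],
    init: float,
    sigma: Sequence[float],
    d_sigma: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Run the recurrence for tau and, alongside it, the recurrence for the
    derivative d_tau given by the recurrence rule (scalar case)."""
    tau: List[float] = []
    d_tau: List[float] = []
    prev, d_prev = init, 0.0  # the initial value is constant, so its change is 0
    for x, dx in zip(sigma, d_sigma):
        d_cur = jg(x, prev, dx, d_prev)  # Jg at (sigma_k, tau_{k-1})
        cur = g(x, prev)
        tau.append(cur)
        d_tau.append(d_cur)
        prev, d_prev = cur, d_cur
    return tau, d_tau


def mul(x: float, y: float) -> float:
    return x * y


def j_mul(x: float, y: float, dx: float, dy: float) -> float:
    """Jacobian of multiplication at (x, y) applied to (dx, dy)."""
    return y * dx + x * dy


def direct(sigma: Sequence[float], d_sigma: Sequence[float], k: int) -> float:
    """Direct formula: sum over i of the product in which factor i is
    replaced by its change d_sigma[i]."""
    total = 0.0
    for i in range(k + 1):
        term = 1.0
        for j in range(k + 1):
            term *= d_sigma[j] if i == j else sigma[j]
        total += term
    return total


sigma, d_sigma = [2.0, 3.0, 5.0], [1.0, 1.0, 1.0]
tau, d_tau = rec_with_derivative(mul, j_mul, 1.0, sigma, d_sigma)
print(tau, d_tau)  # [2.0, 6.0, 30.0] [1.0, 5.0, 31.0]
```

The recurrence needs one Jacobian application per step, whereas the direct formula recomputes a full product for every index $i$.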

Used naively, this formula results in