# Borel Kernels and their Approximation, Categorically

This paper introduces a categorical framework to study the exact and approximate semantics of probabilistic programs. We construct a dagger symmetric monoidal category of Borel kernels where the dagger-structure is given by Bayesian inversion. We show functorial bridges between this category and categories of Banach lattices which formalize the move from kernel-based semantics to predicate transformer (backward) or state transformer (forward) semantics. These bridges are related by natural transformations, and we show in particular that the Radon-Nikodym and Riesz representation theorems - two pillars of probability theory - define natural transformations. With the mathematical infrastructure in place, we present a generic and endogenous approach to approximating kernels on standard Borel spaces which exploits the involutive structure of our category of kernels. The approximation can be formulated in several equivalent ways by using the functorial bridges and natural transformations described above. Finally, we show that for sensible discretization schemes, every Borel kernel can be approximated by kernels on finite spaces, and that these approximations converge for a natural choice of topology. We illustrate the theory by showing two examples of how approximation can effectively be used in practice: Bayesian inference and the Kleene star operation of ProbNetKAT.


## 1. Introduction

Finding a good category in which to study probabilistic programs is a subject of active research (Staton et al., 2016; Kozen, 2016; Clerc et al., 2017; Staton, 2017). In this paper we present a dagger symmetric monoidal category of kernels whose dagger-structure is given by Bayesian inversion. The advantages of this new category are two-fold.

Firstly, the most important new construct introduced by probabilistic programming, viz. Bayesian inversion, is interpreted completely straightforwardly by the †-operation which is native to our category. In particular we never leave the world of kernels and we therefore do not require any normalization construct. Consider for example the following simple Bayesian inference problem in Anglican (Wood et al., 2014):

    (defquery example
      (let [x (sample (normal 0 1))]
        (observe (normal x 1) 0.5)
        (> x 1)))

The semantics of this program is built easily and compositionally in our category:

• The second line builds a Borel space equipped with a normally distributed probability measure – an object of our category.

• The (normal x 1) instruction builds a Borel kernel – a morphism in our category.

• The observe statement builds the Bayesian inverse of the kernel – again a morphism in our †-category.

• Finally, the kernel is evaluated, i.e. the denotation of the program above is the Bayesian inverse kernel evaluated at the observed value 0.5.

The functoriality of † ensures compositionality.
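For this particular program the Bayesian inverse is available in closed form, so the denotation can be checked numerically. The following is a minimal sketch, not part of the paper's development: it assumes the standard conjugate-normal update, treats the second argument of `normal` as a standard deviation (immaterial here, since it equals 1), and the helper names `normal_cdf` and `posterior_params` are hypothetical.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def posterior_params(prior_mean, prior_var, lik_var, y):
    """Conjugate normal update (assumed, standard result).

    Prior x ~ N(prior_mean, prior_var), likelihood y | x ~ N(x, lik_var).
    Returns the mean and variance of x given y.
    """
    post_var = 1.0 / (1.0 / prior_var + 1.0 / lik_var)
    post_mean = post_var * (prior_mean / prior_var + y / lik_var)
    return post_mean, post_var

# Prior (normal 0 1), likelihood (normal x 1), observation 0.5.
post_mean, post_var = posterior_params(0.0, 1.0, 1.0, 0.5)

# The query returns (> x 1): posterior probability of the event x > 1.
prob_x_gt_1 = 1.0 - normal_cdf(1.0, post_mean, math.sqrt(post_var))
print(post_mean, post_var, prob_x_gt_1)
```

The posterior here has mean 0.25 and variance 0.5; the last quantity is the posterior mass of the half-line above 1.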

Secondly, since Bayesian inference problems are in general very hard to compute (although the one given above has an analytical solution), it makes sense to seek approximate solutions, i.e. approximate denotations of probabilistic programs. As we will show, our category of kernels comes equipped with a generic and endogenous approximating scheme which relies on its involutive structure and on the structure of standard Borel spaces. Moreover, this approximation scheme can be shown to converge, for any choice of kernel, in a natural choice of topology.

#### Main contributions.

1. We build a category Krn of Borel kernels (§2) and we show how two kernels which agree almost everywhere can be identified under a categorical quotient operation. This technical construction is what allows us to define Bayesian inversion as an involutive functor, denoted †. This is a key technical improvement on (Clerc et al., 2017) where the †-structure (suggested to us by Chris Heunen) was hinted at but was not functorial. We show that Krn is a dagger symmetric monoidal category.

2. We introduce the category of Banach lattices and σ-order continuous positive operators as well as the Köthe dual functor (§3). These will play a central role in studying convergence of our approximation schemes.

3. We provide the first (to the best of our knowledge) categorical understanding of the Radon-Nikodym and the Riesz representation theorems. These arise as natural transformations between two functors relating kernels and Banach lattices (§4).

4. We show how the †-structure of Krn can be exploited to approximate kernels by averaging (§5). Due to an important structural feature of the category of standard Borel spaces (Th. 1), every kernel in Krn can be approximated by kernels on finite spaces.

5. We show a natural class of approximation schemes where the sequence of approximating kernels converges to the kernel to be approximated. The notion of convergence is given naturally by moving to the category of Banach lattices and considering convergence in the Strong Operator Topology (§6).

6. We apply our theory of kernel approximations to two practical applications (§7). First, we show how Bayesian inference can be performed approximately by showing that the †-operation commutes with taking approximations. Secondly, we consider the case of ProbNetKAT, a language developed in (Foster et al., 2016; Smolka et al., 2017) to probabilistically reason about networks. ProbNetKAT includes a Kleene star operator with a complex semantics which has proved hard to approximate. We show that the Kleene star can be approximated, and that the approximation converges.

All the proofs can be found in the Appendix.

#### Related work.

Quasi-Borel sets have recently been proposed as a semantic framework for higher-order probabilistic programs in (Staton et al., 2016). The main differences with our approach are: (i) unlike (Staton et al., 2016; Staton, 2017) we never leave the realm of kernels, and in particular we never need to worry about normalization. This makes the interpretation of observe statements, i.e. of Bayesian inversion, simpler and more natural. However, (ii) unlike the quasi-Borel sets of (Staton et al., 2016), our category is not Cartesian closed. We therefore cannot give a semantics to all higher-order programs. This shortcoming is partly mitigated by the fact that the category of Polish spaces, on which our category ultimately rests, does have access to many function spaces, in particular all the spaces of functions whose domain is locally compact. We can thus in principle provide a semantics to higher-order programs, provided that λ-abstraction is restricted to locally compact spaces like the reals and the integers, although this will not be investigated in this paper.

The approximation of probabilistic kernels has been a topic of investigation in theoretical computer science for nearly twenty years (see e.g. (Desharnais et al., 2000; Danos et al., 2003; Desharnais et al., 2004; Chaput et al., 2014)), and for much longer in the mathematical literature (e.g. (Choo-Whan, 1972)). Our results build on the formalism developed in (Chaput et al., 2014) with the following differences: (i) we can approximate kernels, their associated stochastic operator (backward predicate transformer), or their associated Markov operator (forward state transformer) with equivalent ease, and move freely across the three formalisms. (ii) Given a kernel $f: X ⇾ Y$, we can define its approximation $f_{p,q}: X' ⇾ Y'$ along any quotients $p: X \to X'$ of $X$ and $q: Y \to Y'$ of $Y$ as in (Chaput et al., 2014), but we can also ‘internalize’ the approximation as a kernel $f^{p,q}: X ⇾ Y$ of the original type. Morally $f_{p,q}$ and $f^{p,q}$ are the same approximation, but the second approximant, being of the same type as the original kernel, can be compared with it. In particular it becomes possible to study the convergence of ever finer approximations, which we do in Section 6. Finally, (iii) we opt to work with Banach lattices rather than the normed cones of (Selinger, 2004; Chaput et al., 2014) because it allows us to formulate the operator side of the theory very naturally, and it connects to a large body of classic mathematical results (Aliprantis and Border, 1999; Zaanen, 2012) which have been used in the semantics of probabilistic programs as far back as Kozen’s seminal (Kozen, 1981).

## 2. A category of Borel kernels

In (Clerc et al., 2017) the first three authors presented a category of Borel kernels similar in spirit to the construction of this section, but with a major shortcoming. As we will shortly see, our category of Borel kernels can be equipped with an involutive functor – a dagger operation in the terminology of (Selinger, 2007) – which captures the notion of Bayesian inversion and is absolutely crucial to everything that follows. In (Clerc et al., 2017) this operation had merely been identified as a map, i.e. not even as a functor. In this section we show that Bayesian inversion does indeed define a †-structure on a more sophisticated – but measure-theoretically very natural – category of kernels.

### 2.1. Standard Borel spaces and the Giry monad

A standard Borel space – or SB space for short – is a measurable space $(X, \mathcal{S})$ for which there exists a Polish topology $\mathcal{T}$ on $X$ whose Borel sets are the elements of $\mathcal{S}$, i.e. such that $\mathcal{S} = \sigma(\mathcal{T})$ (see e.g. (Kechris, 1995) for an overview). Let us write $\mathbf{SB}$ for the category of standard Borel spaces and measurable maps. One key structural feature of $\mathbf{SB}$ is the following:

###### Theorem 1.

Every object of $\mathbf{SB}$ is a limit of a countable co-directed diagram of finite spaces.

The Giry monad was originally defined in two variants (Giry, 1981):

• As an endofunctor $G_{\mathbf{Pol}}$ of $\mathbf{Pol}$, the category of Polish spaces, one sets $G_{\mathbf{Pol}}(X)$ to be the space of Borel probability measures over $X$ together with the weak topology. This space is Polish (Kechris, 1995, Th 17.23), and the Portmanteau Theorem (Kechris, 1995, Th 17.20) gives multiple characterizations of the weak topology.

• As an endofunctor $G_{\mathbf{Meas}}$ of $\mathbf{Meas}$, the category of measurable spaces, one sets $G_{\mathbf{Meas}}(X)$ to be the set of probability measures on $X$ together with the initial σ-algebra for the evaluation maps $\mu \mapsto \mu(A)$, where $A$ ranges over the measurable subsets of $X$.

In both cases the Giry monad is defined on an arrow $f: X \to Y$ as the map which sends a measure $\mu$ on $X$ to the pushforward measure $f_*\mu$ on $Y$, defined as $f_*\mu(B) = \mu(f^{-1}(B))$ for $B$ a measurable subset of $Y$.

We want to define the Giry monad on the category of standard Borel spaces (and measurable maps), and the two versions of the Giry monad described above offer us natural ways to do this: given an SB space we can either apply $G_{\mathbf{Pol}}$ to an underlying Polish space and take the associated standard Borel space, or directly apply $G_{\mathbf{Meas}}$. Fortunately, the two methods agree.

###### Theorem 2 ((Kechris, 1995), Th 17.24).

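Let $B$ denote the functor sending a Polish space to its associated SB-space and leaving morphisms unchanged, then

$$G_{\mathbf{Meas}} \circ B = B \circ G_{\mathbf{Pol}}.$$

We define the Giry monad $G$ on SB spaces to be the endofunctor defined by either of the two equivalent constructions above. The monadic data of $G$ is given at each SB space $X$ by the unit $\delta_X: X \to GX$, $x \mapsto \delta_x$, where $\delta_x$ is the Dirac measure at $x$, and the multiplication $m_X: G^2X \to GX$, $m_X(\mathbb{P})(A) = \int_{GX} \rho(A)\, d\mathbb{P}(\rho)$. We refer the reader to (Giry, 1981) for proofs that $\delta_X$ and $m_X$ are measurable.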

### 2.2. The construction of Krn

Let us denote by $\mathcal{K}\ell$ the Kleisli category associated with the Giry monad $G$. We denote Kleisli arrows, i.e. Markov kernels, by $X ⇾ Y$, and we call such an arrow deterministic if it can be factorized as an ordinary measurable function $X \to Y$ followed by the unit of $G$. Kleisli composition is denoted by $\bullet$. The category $1 \downarrow \mathcal{K}\ell$ has arrows $\mu: 1 ⇾ X$ as objects, where $1$ is the one point SB space (the terminal object in $\mathbf{SB}$). An arrow from $\mu: 1 ⇾ X$ to $\nu: 1 ⇾ Y$ is a $\mathcal{K}\ell$ arrow $f: X ⇾ Y$ such that $\nu = f \bullet \mu$, i.e. such that $\nu(B) = \int_X f(x)(B)\, d\mu$ for any measurable subset $B$ of $Y$. This situation will be denoted in short by $f: (X,\mu) ⇾ (Y,\nu)$, and we will call a pair $(X,\mu)$ a measured SB space.
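On finite spaces these definitions reduce to matrix arithmetic, which can make the condition relating the measures on domain and codomain concrete. The sketch below is an illustration under that finiteness assumption only: a kernel is a row-stochastic matrix, a measure is a list of weights, and the names `compose` and `push` are hypothetical.

```python
def compose(g, f):
    """Kleisli composite 'g after f' of finite kernels (row-stochastic matrices)."""
    return [[sum(row[y] * g[y][z] for y in range(len(g)))
             for z in range(len(g[0]))] for row in f]

def push(f, mu):
    """The measure obtained by composing the kernel f with the measure mu."""
    return [sum(mu[x] * f[x][y] for x in range(len(mu)))
            for y in range(len(f[0]))]

mu = [0.5, 0.3, 0.2]                         # a measure on a 3-point space X
f = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]     # a kernel from X to a 2-point space Y
g = [[1.0, 0.0, 0.0], [0.25, 0.25, 0.5]]     # a kernel from Y to a 3-point space Z

nu = push(f, mu)            # f is an arrow from (X, mu) to (Y, nu) exactly for this nu
gf = compose(g, f)          # composite kernel from X to Z

# Pushing mu through the composite agrees with pushing it through f, then g.
two_step = push(g, nu)
one_step = push(gf, mu)
```

Here `nu` is the only measure on the codomain for which `f` is measure-preserving in the above sense, and the agreement of `one_step` with `two_step` is the finite shadow of associativity of Kleisli composition.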

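We want to construct a quotient of $1 \downarrow \mathcal{K}\ell$, such that two arrows are identified if they disagree only on a null set w.r.t. the measure on their domain. For $f, g: (X,\mu) ⇾ (Y,\nu)$, we define $N(f,g) = \{x \in X \mid f(x) \neq g(x)\}$.

###### Lemma 3.

$N(f,g)$ is a measurable set.

We now define a relation $\sim$ on the hom-sets of $1 \downarrow \mathcal{K}\ell$ by saying that for any two arrows $f, g: (X,\mu) ⇾ (Y,\nu)$, $f \sim g$ if $\mu(N(f,g)) = 0$. This clearly defines an equivalence relation on each hom-set. In order to perform the quotient of the category modulo $\sim$, we need to check that it is compatible with composition.

###### Lemma 4.

If $f \sim f'$ and $g \sim g'$, then $g \bullet f \sim g' \bullet f'$.

###### Definition 5.

Let $\mathbf{Krn}$ be the category obtained by quotienting the hom-sets of $1 \downarrow \mathcal{K}\ell$ by $\sim$.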

The following Theorem is of great practical use and generalizes the well-known result for deterministic arrows.

###### Theorem 6 (Change of Variables in Krn).

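Let $f: (X,\mu) ⇾ (Y,\nu)$ be a $\mathbf{Krn}$-morphism. For any measurable function $\phi: Y \to \mathbb{R}$, if $\phi$ is $\nu$-integrable, then $\phi \bullet f$ is $\mu$-integrable and

$$\int_Y \phi\, d\nu = \int_X \phi \bullet f\, d\mu.$$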
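The change of variables formula can be checked by hand on finite spaces, where integrals are weighted sums. This is a small sketch under that assumption; the function names `pull_back` and `integrate` are hypothetical.

```python
def pull_back(phi, f):
    """The function x |-> integral of phi against the measure f(x)."""
    return [sum(row[y] * phi[y] for y in range(len(phi))) for row in f]

def integrate(phi, m):
    """Integral of a function against a measure on a finite space."""
    return sum(p * w for p, w in zip(phi, m))

mu = [0.5, 0.3, 0.2]                         # measure on X
f = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]     # kernel from X to Y
# nu is the image of mu under f, so that f goes from (X, mu) to (Y, nu).
nu = [sum(mu[x] * f[x][y] for x in range(3)) for y in range(2)]

phi = [2.0, -1.0]                            # an integrable function on Y
lhs = integrate(phi, nu)                     # integral of phi against nu
rhs = integrate(pull_back(phi, f), mu)       # integral of phi . f against mu
```

Both sides evaluate the same double sum, once grouped by points of the codomain and once by points of the domain.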

#### The symmetric monoidal structure of Krn

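The tensor product $\otimes$ is defined on a pair of objects by the Cartesian product and the product of measures, i.e. $(X,\mu) \otimes (Y,\nu) = (X \times Y, \mu \otimes \nu)$. On pairs of morphisms $f: (X,\mu) ⇾ (Y,\nu)$ and $f': (X',\mu') ⇾ (Y',\nu')$ it is defined by $(f \otimes f')(x,x') = f(x) \otimes f'(x')$. The unitors, associator and braiding transformations are given by the obvious bijections.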

### 2.3. The dagger structure of Krn

$\mathbf{Krn}$ has an extremely powerful inversion principle:

###### Theorem 7 (Measure Disintegration Theorem, (Kechris, 1995), 17.35).

Let $f: (X,\mu) ⇾ (Y,\nu)$ be a deterministic $\mathbf{Krn}$-morphism. There exists a unique morphism $f^\dagger_\mu: (Y,\nu) ⇾ (X,\mu)$ such that

$$f \bullet f^\dagger_\mu = \mathrm{id}_{(Y,\nu)}. \tag{1}$$

The kernel $f^\dagger_\mu$ is called the disintegration of $\mu$ along $f$. As our notation suggests, the disintegration depends fundamentally on the measure $\mu$ over the domain; however, we will omit this subscript when there is no ambiguity. The following lemma relates disintegrations to conditional expectations.

###### Lemma 8 ((Dahlqvist et al., 2016b)).

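Let $f: (X,\mu) ⇾ (Y,\nu)$ be a deterministic $\mathbf{Krn}$-morphism, and let $\phi: X \to \mathbb{R}$ be measurable, then $\mu$-a.e.

$$\phi \bullet f^\dagger \bullet f = \mathbb{E}[\phi \mid \sigma(f)]$$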

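We can extend the definition of $\dagger$ to any $\mathbf{Krn}$-morphism $f$ in a functorial way, although $f^\dagger$ will not in general be a right inverse to $f$. The construction of $f^\dagger$ is detailed in (Clerc et al., 2017), but let us briefly recall how it works. The category $\mathbf{SB}$ has products which are built in the same way as in $\mathbf{Meas}$ via the product of σ-algebras (unlike the category of kernels, which does not have products). Given any kernel $f: (X,\mu) ⇾ (Y,\nu)$, we can canonically construct a probability measure $\gamma_f$ on the product $X \times Y$ of SB-spaces by defining it on the rectangles $A \times B$ of $X \times Y$ as

$$\gamma_f(A \times B) = \int_{x \in X} 1_A(x) \cdot f(x)(B)\, d\mu. \tag{2}$$

Equivalently, $\gamma_f = (\mathrm{id}_X \otimes f) \bullet \Delta_X \bullet \mu$, where $\Delta_X: X \to X \times X$ is the diagonal map. Letting $\pi_X: X \times Y \to X$ and $\pi_Y: X \times Y \to Y$ be the canonical projections, we observe that $\pi_X \bullet \gamma_f = \mu$ and $\pi_Y \bullet \gamma_f = \nu$: in other words, $\gamma_f$ is a coupling of $\mu$ and $\nu$. The disintegration of $\gamma_f$ along $\pi_Y$ is a kernel $\pi_Y^\dagger: (Y,\nu) ⇾ (X \times Y, \gamma_f)$. Finally we define:

$$f^\dagger = \pi_X \bullet \pi_Y^\dagger. \tag{3}$$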

The following diagram sums up the situation:

where is explicitly given by . The following property characterizes the action of $\dagger$ on $\mathbf{Krn}$-morphisms:

###### Theorem 9.

For all $f: (X,\mu) ⇾ (Y,\nu)$, $f^\dagger: (Y,\nu) ⇾ (X,\mu)$ is the unique morphism satisfying, for all measurable sets $A \subseteq X$ and $B \subseteq Y$, the following equation:

$$\int_{x \in X} 1_A(x) \cdot f(x)(B)\, d\mu = \int_{y \in Y} f^\dagger(y)(A) \cdot 1_B(y)\, d\nu \tag{4}$$

In view of Eq. (4), we will call $f^\dagger$ the Bayesian inversion of $f$, and refer to $\dagger$ as the Bayesian inversion operation on $\mathbf{Krn}$. It will be crucial throughout the rest of this paper. It is important to see that $f^\dagger$ absolutely depends on the choice of $\mu$ and not only on $f$ seen as a function. We can now improve on (Clerc et al., 2017) and show that $\dagger$ is indeed a †-operation in the strict categorical meaning of the term.

###### Theorem 10.

$\mathbf{Krn}$ is a dagger symmetric monoidal category, with $\dagger$ given by Bayesian inversion.
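On finite spaces Eq. (4) pins the Bayesian inverse down pointwise as the familiar Bayes rule, which makes the involutive nature of the operation easy to check. The sketch below assumes finite spaces and a pushforward measure with full support (so no quotient by null sets is needed); `push` and `bayes_inverse` are hypothetical names.

```python
def push(f, mu):
    """Image of the measure mu under the finite kernel f."""
    return [sum(mu[x] * f[x][y] for x in range(len(mu)))
            for y in range(len(f[0]))]

def bayes_inverse(f, mu):
    """Finite Bayesian inverse: weight of x under the inverse at y is
    mu(x) * f(x)(y) / nu(y). Assumes nu(y) > 0 for every y."""
    nu = push(f, mu)
    return [[mu[x] * f[x][y] / nu[y] for x in range(len(mu))]
            for y in range(len(nu))]

mu = [0.5, 0.3, 0.2]
f = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
nu = push(f, mu)

f_dag = bayes_inverse(f, mu)             # kernel from (Y, nu) to (X, mu)
f_dag_dag = bayes_inverse(f_dag, nu)     # inverting twice

# Eq. (4) on singletons A = {x}, B = {y}: both sides are the joint weight.
eq4_gap = max(abs(mu[x] * f[x][y] - nu[y] * f_dag[y][x])
              for x in range(3) for y in range(2))
```

Inverting twice returns the original matrix here because `mu` charges every point; with null points the equality would only hold up to the almost-everywhere identification built into the category.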

## 3. Banach lattices

It is well-known that kernels can alternatively be seen as predicate – i.e. real-valued function – transformers, or as state – i.e. probability measure – transformers. The latter perspective was adopted by Kozen in (Kozen, 1981) to describe the denotational semantics of probabilistic programs (without conditioning). We shall see in this section and the next that the predicate and state transformer perspectives are dual to one another in the category of Banach lattices, a framework incidentally also used in (Kozen, 1981). For an introduction to the theory of Banach lattices we refer the reader to e.g. (Aliprantis and Border, 1999; Zaanen, 2012).

An ordered real vector space $V$ is a real vector space together with a partial order $\leq$ which is compatible with the linear structure in the sense that for all $u, v, w \in V$ and all $\lambda \in \mathbb{R}^+$

$$u \leq v \Rightarrow u + w \leq v + w \quad\text{and}\quad u \leq v \Rightarrow \lambda u \leq \lambda v.$$

An ordered vector space is called a Riesz space if the poset structure forms a lattice. A vector $v$ in a Riesz space is called positive if $0 \leq v$, and its absolute value is defined as $|v| = v \vee (-v)$. A Riesz space $V$ is σ-order complete if every non-empty countable subset of $V$ which is order bounded has a supremum.

A normed Riesz space is a Riesz space $V$ equipped with a lattice norm, i.e. a norm $\|\cdot\|: V \to \mathbb{R}^+$ such that:

$$|v| \leq |w| \text{ implies } \|v\| \leq \|w\|. \tag{5}$$

A normed Riesz space is called a Banach lattice if it is (norm-) complete, i.e. if every Cauchy sequence (for the norm $\|\cdot\|$) has a limit in $V$.

###### Example 1.

For each measured space $(X,\mu)$ – and in particular $\mathbf{Krn}$-objects – and each $1 \leq p \leq \infty$, the space $L^p(X,\mu)$ is a Riesz space with the pointwise order. When it is equipped with the usual $L^p$-norm, it is a Banach lattice. This fact is often referred to as the Riesz-Fischer theorem (see (Aliprantis and Border, 1999, Th 13.5)). We will say that $p, q$ are Hölder conjugate if either of the following conditions holds: (i) $1 < p, q < \infty$ and $\frac{1}{p} + \frac{1}{q} = 1$, or (ii) $p = 1$ and $q = \infty$, or (iii) $p = \infty$ and $q = 1$.

###### Theorem 2 (Lemma 16.1 and Theorem 16.2 of (Zaanen, 2012)).

Every Banach lattice is σ-order complete.

There are two very natural modes of ‘convergence’ in a Banach lattice: order convergence and norm convergence. The latter is well-known, the former less so. An order bounded sequence $(v_n)$ in a σ-order complete Riesz space (and thus in a Banach lattice) converges in order to $v$ if either of the following equivalent conditions holds:

$$v = \liminf_n v_n := \bigvee_n \bigwedge_{n \leq m} v_m, \qquad v = \limsup_n v_n := \bigwedge_n \bigvee_{n \leq m} v_m.$$

For a monotone increasing sequence $(v_n)$, this definition simplifies to $v = \bigvee_n v_n$, which is often written $v_n \uparrow v$.

In a general σ-order complete Riesz space, order and norm convergence are disjoint concepts, i.e. neither implies the other (see (Zaanen, 2012, Ex. 15.2) for two counter-examples). However if a sequence converges both in order and in norm then the limits are the same (see (Zaanen, 2012, Th. 15.4)). Moreover, for monotone sequences norm convergence implies order convergence:

###### Proposition 3 ((Zaanen, 2012) Theorem 15.3).

If $(v_n)$ is an increasing sequence in a normed Riesz space and if $v_n$ converges to $v$ in norm (notation $v_n \to v$), then $v_n \uparrow v$.

In a Banach lattice we have the following stronger property.

###### Proposition 4 (Lemma 16.1 and Theorem 16.2 of (Zaanen, 2012)).

If $(v_n)$ is a sequence of positive vectors in a Banach lattice such that $\sum_n \|v_n\|$ converges, then $\bigvee_n v_n$ exists and $\|\bigvee_n v_n\| \leq \sum_n \|v_n\|$.

It can also happen that order convergence implies norm convergence. A lattice norm on a Riesz space is called σ-order continuous if $v_n \downarrow 0$ ($(v_n)$ is a decreasing sequence whose infimum is 0) implies $\|v_n\| \downarrow 0$.

###### Example 5.

For $1 \leq p < \infty$, the $L^p$-norm is σ-order continuous, and thus order convergence and norm convergence coincide. However, for $p = \infty$ this is not the case as the following simple example shows. Consider, e.g., the sequence of essentially bounded functions $v_n = 1_{[0,1/n]}$ on $[0,1]$ with the Lebesgue measure: it is decreasing for the order on $L^\infty$ with the constant function $0$ as its infimum, i.e. $v_n \downarrow 0$. However $\|v_n\|_\infty = 1$ for all $n$.

Many types of morphisms between Banach lattices are considered in the literature but most are at least linear and positive, that is to say they send positive vectors to positive vectors. From now on, we will assume that all morphisms are positive (linear) operators. Other than that, we will only mention two additional properties, corresponding to the two modes of convergence which we have examined. The first notion is very well-known: a linear operator $T: V \to W$ between normed vector spaces is called norm-bounded if there exists $C > 0$ such that $\|Tv\| \leq C\|v\|$ for every $v \in V$. The following result is familiar:

###### Theorem 6.

An operator between normed vector spaces is norm-bounded iff it is continuous.

Thus norm-bounded operators preserve norm-convergence. The corresponding order-convergence concept is defined as follows: an operator $T: V \to W$ between σ-order complete Riesz spaces is said to be σ-order continuous if whenever $v_n \uparrow v$, $Tv_n \uparrow Tv$. It follows that we can consider two types of dual spaces on a Banach lattice $V$: on the one hand we can consider the norm-dual:

$$V^* = \{f: V \to \mathbb{R} \mid f \text{ is norm-continuous}\}$$

and the σ-order-dual:

$$V^\sigma = \{f: V \to \mathbb{R} \mid f \text{ is } \sigma\text{-order continuous}\}$$

The latter is sometimes known as the Köthe dual of $V$ (see (Dieudonné, 1951; Zaanen, 2012)). The two types of duals coincide for a large class of Banach spaces of interest to us.

###### Theorem 7.

If a Banach lattice $V$ admits a strictly positive linear functional and has a σ-order-continuous norm, then $V^* = V^\sigma$.

###### Example 8.

The result above can directly be applied to our running example: given a measured space $(X,\mu)$ and an integer $1 \leq p < \infty$, the Lebesgue integral provides a strictly positive functional on $L^p(X,\mu)$, and we already know from Example 5 that $L^p(X,\mu)$ has a σ-order-continuous norm. It follows that

$$L^p(X,\mu)^* = L^p(X,\mu)^\sigma$$

Moreover, it is well-known that if $p, q$ are Hölder conjugate and $1 < p < \infty$, then $L^p(X,\mu)^* = L^q(X,\mu)$, and thus $L^p(X,\mu)^\sigma = L^q(X,\mu)$. It is also known that $L^1(X,\mu)^* = L^\infty(X,\mu)$, and thus $L^1(X,\mu)^\sigma = L^\infty(X,\mu)$.

However Theorem 7 does not hold for $L^\infty(X,\mu)$ since the $L^\infty$-norm is not σ-order continuous, as was shown in Example 5. It is well-known that $L^\infty(X,\mu)^* \neq L^1(X,\mu)$, and in fact $L^\infty(X,\mu)^*$ can be concretely described as the Banach lattice of charges (i.e. finitely additive finite signed measures) on $X$ which are absolutely continuous w.r.t. $\mu$ (see (Dunford et al., 1971, IV.8.16)). However, as is shown in e.g. (Zaanen, 2012; Chaput et al., 2014)

$$L^\infty(X,\mu)^\sigma = L^1(X,\mu) \tag{6}$$

As Examples 5 and 8 show, the $(-)^\sigma$ operation brings a lot of symmetry to the relationship between $L^p$-spaces since

$$L^p(X,\mu)^\sigma = L^q(X,\mu)$$

for any Hölder conjugate pair $p, q$. For this reason we will consider the category $\mathbf{BL}_\sigma$ whose objects are Banach lattices and whose morphisms are σ-order continuous positive operators. Note that the Köthe dual of a Banach lattice is a Banach lattice, and it easily follows that $(-)^\sigma$ in fact defines a contravariant functor on $\mathbf{BL}_\sigma$ which acts on morphisms by pre-composition. As we will now see, $\mathbf{BL}_\sigma$ is the category in which predicate and state transformers are most naturally defined.

## 4. From Borel kernels to Banach lattices

#### The functors Sp and Tp.

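For $1 \leq p \leq \infty$, the operation which associates to a $\mathbf{Krn}$-object $(X,\mu)$ the space $L^p(X,\mu)$ can be thought of as either a contravariant or a covariant functor. We define the functors $S_p: \mathbf{Krn}^{op} \to \mathbf{BL}_\sigma$ as expected on objects, and on $\mathbf{Krn}$-morphisms $f: (X,\mu) ⇾ (Y,\nu)$ via the well-known ‘predicate transformer’ perspective:

$$S_p(f): L^p(Y,\nu) \to L^p(X,\mu), \qquad \phi \mapsto \lambda x. \int_Y \phi\, df(x) = \phi \bullet f$$

For a proof that this defines a functor see (Clerc et al., 2017). We define the covariant functors $T_p: \mathbf{Krn} \to \mathbf{BL}_\sigma$ as $T_p = S_p \circ \dagger$.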
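In finite dimensions the predicate transformer of a kernel is just the action of its stochastic matrix on column vectors, and contravariance amounts to the order reversal in a matrix product. The following sketch illustrates this under the assumption of finite spaces; `compose` and `S` are hypothetical names.

```python
def compose(g, f):
    """Kleisli composite 'g after f' of finite kernels (row-stochastic matrices)."""
    return [[sum(row[y] * g[y][z] for y in range(len(g)))
             for z in range(len(g[0]))] for row in f]

def S(f):
    """Predicate transformer of f: phi |-> (x |-> integral of phi against f(x))."""
    return lambda phi: [sum(row[y] * phi[y] for y in range(len(phi)))
                        for row in f]

f = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]     # kernel from X (3 points) to Y (2 points)
g = [[1.0, 0.0, 0.0], [0.25, 0.25, 0.5]]     # kernel from Y to Z (3 points)
phi = [3.0, -1.0, 2.0]                       # a predicate on Z

lhs = S(compose(g, f))(phi)      # transformer of the composite kernel
rhs = S(f)(S(g)(phi))            # transformers composed in the reverse order
```

The predicate travels backwards, from the codomain of `g` to the domain of `f`, which is why the two transformers are applied in the opposite order to the kernels.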

#### The functor M≪⋅.

An ideal of a Riesz space $V$ is a sub-vector space $U \subseteq V$ with the property that if $|u| \leq |v|$ and $v \in U$ then $u \in U$. An ideal $U$ is called a band when for every subset $D \subseteq U$, if $\bigvee D$ exists in $V$, then it also belongs to $U$. Every band in a Banach lattice is itself a Banach lattice. Of particular importance is the band $B_v$ generated by a singleton $\{v\}$, which can be described explicitly as

$$B_v = \{w \in V \mid (|w| \wedge n|v|) \uparrow |w|\}$$

###### Example 1.

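Let $X$ be an SB-space and let $M(X)$ denote the set of measures of bounded variation on $X$. It can be shown (Aliprantis and Border, 1999, Th 10.56) that $M(X)$ is a Banach lattice. The linear structure on $M(X)$ is as expected, the Riesz space structure is given by

$$(\mu \vee \nu)(A) = \sup\{\mu(B) + \nu(A \setminus B) \mid B \text{ measurable}, B \subseteq A\}$$

and the dual definition for the meet operation. The norm is given by the total variation, i.e. $\|\mu\| = |\mu|(X)$.

Given $\mu \in M(X)$, the band $B_\mu$ generated by $\mu$ is just the set of measures of bounded variation which are absolutely continuous w.r.t. $\mu$. In particular $B_\mu$ is a Banach lattice.

We can now define the functor $M_{\ll\cdot}: \mathbf{Krn} \to \mathbf{BL}_\sigma$ by:

$$\begin{cases} M_{\ll\cdot}(X,\mu) := B_\mu \\ M_{\ll\cdot}f: M_{\ll\cdot}(X,\mu) \to M_{\ll\cdot}(Y,\nu), \quad \rho \mapsto f \bullet \rho \end{cases}$$

We will usually write $M_{\ll\cdot}(X,\mu)$ as $M_{\ll\mu}(X)$.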

###### Proposition 2.

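Let $f: (X,\mu) ⇾ (Y,\nu)$ be a $\mathbf{Krn}$ arrow. Let $\rho$ be a finite measure on $X$ such that $\rho \ll \mu$. Then $f \bullet \rho \ll \nu$, and thus $M_{\ll\cdot}$ defines a functor.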

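We now present a first pair of natural transformations which will establish a natural isomorphism between the functors $M_{\ll\cdot}$ and $T_1$. First, we define the Radon-Nikodym transformation $rn$ at each $\mathbf{Krn}$-object $(X,\mu)$ by the map

$$rn_{(X,\mu)}: M_{\ll\mu}(X) \to L^1(X,\mu), \qquad rn_{(X,\mu)}(\rho) = \frac{d\rho}{d\mu}$$

where $\frac{d\rho}{d\mu}$ is of course the Radon-Nikodym derivative of $\rho$ w.r.t. $\mu$. The fact that this transformation defines a positive operator between Banach lattices is simply a restatement of the usual Radon-Nikodym theorem (Dunford et al., 1971, III.10.7.), combined with the well-known linearity property of the Radon-Nikodym derivative. To see that it is also σ-order-continuous, consider a monotone sequence $(\rho_n)$ converging in order to $\rho$ in $M_{\ll\mu}(X)$. This means that for any measurable set $A$ of $X$, $\rho_n(A) \uparrow \rho(A)$. Since $\left(\frac{d\rho_n}{d\mu}\right)$ is bounded in $L^1$-norm the function $\bigvee_n \frac{d\rho_n}{d\mu}$ exists and is simply the pointwise limit $\lim_n \frac{d\rho_n}{d\mu}$. It now follows from the monotone convergence theorem (MCT) that

$$\rho(A) = \lim_n \rho_n(A) = \lim_n \int_A \frac{d\rho_n}{d\mu}\, d\mu = \int_A \lim_n \frac{d\rho_n}{d\mu}\, d\mu,$$

in other words, $rn_{(X,\mu)}(\rho_n) \uparrow rn_{(X,\mu)}(\rho)$ and $rn_{(X,\mu)}$ is well-defined. That $rn$ is also natural has – to our knowledge – never been published.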

###### Theorem 3.

The Radon-Nikodym transformation is natural.

Secondly, we define the Measure Representation transformation $mr$ at each $\mathbf{Krn}$-object $(X,\mu)$ by the map $mr_{(X,\mu)}: L^1(X,\mu) \to M_{\ll\mu}(X)$ defined as

$$mr_{(X,\mu)}(f)(B_X) = \int_{B_X} f\, d\mu$$

This is a very well-known construction in measure theory, and the fact that $mr_{(X,\mu)}$ is a σ-order continuous operator between Banach lattices is immediate from the linearity of integrals and the MCT.

###### Theorem 4.

The Measure Representation transformation is natural.
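On a finite space both transformations are elementary: the Radon-Nikodym derivative is a pointwise ratio of weights and the Measure Representation multiplies a density back by the reference measure. The sketch below checks that they are mutually inverse and tests one naturality square. It assumes finite spaces with fully supported measures, and it assumes that the covariant functor acts on a kernel as the predicate transformer of its Bayesian inverse; all function names are hypothetical.

```python
def push(f, m):
    """Image of the measure m under the finite kernel f."""
    return [sum(m[x] * f[x][y] for x in range(len(m)))
            for y in range(len(f[0]))]

def bayes_inverse(f, mu):
    """Finite Bayesian inverse of f with respect to mu (assumes full support)."""
    nu = push(f, mu)
    return [[mu[x] * f[x][y] / nu[y] for x in range(len(mu))]
            for y in range(len(nu))]

def rn(rho, mu):
    """Radon-Nikodym derivative of rho w.r.t. mu: a pointwise ratio."""
    return [r / m for r, m in zip(rho, mu)]

def mr(phi, mu):
    """Measure represented by the density phi w.r.t. mu."""
    return [p * m for p, m in zip(phi, mu)]

mu = [0.5, 0.3, 0.2]
f = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
nu = push(f, mu)
rho = [0.1, 0.6, 0.3]                 # a measure absolutely continuous w.r.t. mu

round_trip = mr(rn(rho, mu), mu)      # should give rho back

# Naturality square: push rho forward then differentiate, versus
# differentiate then transport the density along the Bayesian inverse.
f_dag = bayes_inverse(f, mu)
density = rn(rho, mu)
lhs = rn(push(f, rho), nu)
rhs = [sum(f_dag[y][x] * density[x] for x in range(len(mu)))
       for y in range(len(nu))]
```

The two sides of the square agree because the weight `mu[x]` introduced by the Bayesian inverse cancels the division by `mu[x]` in the density.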

#### Riesz representations are natural.

We now present a second pair of natural transformations which will establish a natural isomorphism between $(-)^\sigma \circ S_\infty$ and $M_{\ll\cdot}$. First, we define the Riesz Representation transformation $rr$ at each $\mathbf{Krn}$-object $(X,\mu)$ by the map $rr_{(X,\mu)}: L^\infty(X,\mu)^\sigma \to M_{\ll\mu}(X)$ defined as

$$rr_{(X,\mu)}(F)(B_X) = F(1_{B_X})$$

This construction is key to a whole collection of results in functional analysis commonly known as Riesz Representation Theorems (see (Aliprantis and Border, 1999) Chapter 14 for an overview). One can readily check that the Riesz Representation transformation is well-defined: $rr_{(X,\mu)}(F)(\emptyset) = F(0) = 0$ and the σ-additivity of $rr_{(X,\mu)}(F)$ follows from the σ-order-continuity of $F$. To see that $rr_{(X,\mu)}(F) \ll \mu$, assume that $\mu(B_X) = 0$, then clearly $1_{B_X} = 0$ $\mu$-a.e., i.e. $1_{B_X} = 0$ in $L^\infty(X,\mu)$, and thus $F(1_{B_X}) = 0$.

###### Theorem 5.

The Riesz Representation transformation is natural.

Finally, we define the Functional Representation transformation $fr$ at each $\mathbf{Krn}$-object $(X,\mu)$ by the map $fr_{(X,\mu)}: M_{\ll\mu}(X) \to L^\infty(X,\mu)^\sigma$ defined by

$$fr_{(X,\mu)}(\rho)(\phi) = \int_X \phi\, d\rho$$

This construction is also completely standard in measure theory, although it has never to our knowledge been seen as a natural transformation.

###### Theorem 6.

The Functional Representation transformation is well-defined, i.e. $fr_{(X,\mu)}$ is a σ-order continuous positive operator, and is natural.

#### Natural Isomorphisms

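We have now defined the following four natural transformations:

$$rn: M_{\ll\cdot} \to T_1, \qquad mr: T_1 \to M_{\ll\cdot}, \qquad rr: (-)^\sigma \circ S_\infty \to M_{\ll\cdot}, \qquad fr: M_{\ll\cdot} \to (-)^\sigma \circ S_\infty.$$

In fact, both pairs form natural isomorphisms, and these can be restricted to arbitrary Hölder conjugate pairs $p, q$.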

###### Theorem 7.

$rn$ and $mr$ are inverses of one another, in particular there exists a natural isomorphism between $T_1$ and $M_{\ll\cdot}$.

###### Theorem 8.

$rr$ and $fr$ are inverses of each other, in particular there exists a natural isomorphism between $(-)^\sigma \circ S_\infty$ and $M_{\ll\cdot}$.

We can now conclude that the isomorphism proved in Theorem 6 of (Clerc et al., 2017) is in fact natural.

###### Corollary 9.

There exists a natural isomorphism between $T_1$ and $(-)^\sigma \circ S_\infty$.

We can in fact restrict this result to any Hölder conjugate pair $p, q$:

###### Theorem 10.

For $p$ with Hölder conjugate $q$, the natural isomorphism of Corollary 9 restricts to a natural transformation between $(-)^\sigma \circ S_p$ and $T_q$.

The correspondences between the various categories and functors discussed in this section are summarized as follows:

 (7)

## 5. Approximations

In this section we develop a scheme for approximating kernels which follows naturally from the †-structure of $\mathbf{Krn}$. Consider $f: (X,\mu) ⇾ (Y,\nu)$ and a pair of deterministic maps $p: X \to X'$ and $q: Y \to Y'$ (typically these maps coarsen the spaces $X$ and $Y$).

 (8)

The †-structure of $\mathbf{Krn}$ allows us to define the new kernels

$$f^{p,q} := q^\dagger_\nu \bullet q \bullet f \bullet p^\dagger_\mu \bullet p : X ⇾ Y \tag{9}$$

$$f_{p,q} := q \bullet f \bullet p^\dagger_\mu : X' ⇾ Y' \tag{10}$$

The superscript notation is meant to indicate that the approximation lives ‘upstairs’ in Diagram (8), and conversely for the subscripts. Intuitively, $f^{p,q}$ and $f_{p,q}$ take the average of $f$ over the fibres given by $p, q$ according to $\mu$ and $\nu$ (see Section 7 for concrete calculations). The advantage of (10) is that we can approximate a kernel on a huge space by a kernel on a, say, finite one. The advantage of (9) is that, although it is more complicated, it is morally equivalent and has the same type as $f$, which means that we can compare it to $f$.

A very simple consequence of our definition is that Bayesian inversion commutes with approximations. We shall use this in §7.1 to perform approximate Bayesian inference.

###### Theorem 1.

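Let $f: (X,\mu) ⇾ (Y,\nu)$, let $p: X \to X'$ and $q: Y \to Y'$ be a pair of deterministic maps, then

$$(f^\dagger)^{q,p} = (f^{p,q})^\dagger \quad\text{and}\quad (f^\dagger)_{q,p} = (f_{p,q})^\dagger$$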
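The second identity can be verified directly on finite spaces, where every kernel is a stochastic matrix and the disintegration of a deterministic map is again a Bayes-rule computation. The sketch below assumes finite spaces, fully supported measures and fibres of positive mass; the helper names (`deterministic`, `approx_down`, and so on) are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def compose(g, f):
    """Kleisli composite 'g after f' for row-stochastic matrices."""
    return matmul(f, g)

def push(f, m):
    return [sum(m[x] * f[x][y] for x in range(len(m)))
            for y in range(len(f[0]))]

def bayes_inverse(f, mu):
    nu = push(f, mu)
    return [[mu[x] * f[x][y] / nu[y] for x in range(len(mu))]
            for y in range(len(nu))]

def deterministic(p, size):
    """Kernel of a function given as a list of target indices."""
    return [[1.0 if p[x] == j else 0.0 for j in range(size)]
            for x in range(len(p))]

def approx_down(k, m, dom_map, cod_map):
    """Coarsened kernel as in (10): cod_map . k . (dom_map)^dagger."""
    return compose(cod_map, compose(k, bayes_inverse(dom_map, m)))

mu = [0.1, 0.2, 0.3, 0.4]
f = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.25, 0.25, 0.5]]
nu = push(f, mu)
P = deterministic([0, 0, 1, 1], 2)    # coarsening of the 4-point domain
Q = deterministic([0, 1, 1], 2)       # coarsening of the 3-point codomain

lhs = approx_down(bayes_inverse(f, mu), nu, Q, P)            # approximate the inverse
rhs = bayes_inverse(approx_down(f, mu, P, Q), push(P, mu))   # invert the approximation
```

Both sides collapse to the joint weight of a pair of cells divided by the mass of the observed cell, which is why the order of the two operations does not matter.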

In practice we will often consider endo-kernels $f: X ⇾ X$ with a single coarsening map $p: X \to X'$ to a finite space. In this case (9) simplifies greatly.

###### Proposition 2.

Under the situation described above

$$f^p := p^\dagger_\nu \bullet p \bullet f \bullet p^\dagger_\mu \bullet p = f \bullet p^\dagger_\mu \bullet p \tag{11}$$

In the case covered by Proposition 2, the interpretation of $f^p$ is very natural: for each $x \in X$ the measure $f(x)$ is approximated by its average over the fibre to which $x$ belongs, conditioned on being in the fibre. For fibres with strictly positive $\mu$-probability, this is simply

$$f^p(x)(A) = \frac{\int_{y \in p^{-1}(p(x))} f(y)(A)\, d\mu}{\mu(p^{-1}(p(x)))}$$

However (11) also covers the case of $\mu$-null fibres. Note also that in the case where $f^p = f$, the map $p$ corresponds to what is known as a strong functional bisimulation for $f$.
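The averaging formula for fibres of positive mass translates directly into code on a finite space. The sketch below is an illustration under that assumption (finite space, every fibre of positive mass); `average_over_fibres` is a hypothetical name, and the coarsening map is given as a list of cell labels.

```python
def push(f, m):
    """Image of the measure m under the finite kernel f."""
    return [sum(m[x] * f[x][y] for x in range(len(m)))
            for y in range(len(f[0]))]

def average_over_fibres(f, mu, p):
    """Replace each row f(x) by the mu-weighted average of the rows
    of all points lying in the same fibre of p as x."""
    out = []
    for x in range(len(mu)):
        fibre = [y for y in range(len(mu)) if p[y] == p[x]]
        mass = sum(mu[y] for y in fibre)
        out.append([sum(mu[y] * f[y][a] for y in fibre) / mass
                    for a in range(len(f[0]))])
    return out

mu = [0.1, 0.2, 0.3, 0.4]
f = [[0.4, 0.3, 0.2, 0.1],
     [0.1, 0.1, 0.4, 0.4],
     [0.25, 0.25, 0.25, 0.25],
     [0.0, 0.5, 0.0, 0.5]]
p = [0, 0, 1, 1]                       # two cells: {0, 1} and {2, 3}

fp = average_over_fibres(f, mu, p)     # the approximant of the same type as f
fpp = average_over_fibres(fp, mu, p)   # averaging again changes nothing
```

The approximant is constant on each cell, sends `mu` to the same measure as `f` does, and is a fixed point of the averaging step.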

#### Approximating is non-expansive.

It is well-known that conditional expectations are non-expansive and we know from Lemma 8 that pre-composing by $p^\dagger_\mu \bullet p$ as in (11) amounts to conditioning. The following lemma is an easy consequence.

###### Lemma 3.

Let $f: X ⇾ X$ and $q: X \to X'$ be a deterministic quotient, then for all $\phi \in L^p(X,\mu)$ and $1 \leq p \leq \infty$

$$\|S_p f^q(\phi)\|_p \leq \|S_p f(\phi)\|_p$$

#### Compositionality of approximations

In the case where we wish to approximate a composite kernel , it might be convenient, for modularity reasons, to approximate and separately. This does not entail any loss of information provided the quotient maps are hemi-bisimulations, in the following sense. Let be deterministic quotients and let be composable kernels. We say that is a left hemi-bisimulation for if , and conversely that it is a right hemi-bisimulation for if holds. In either case, one can verify using Theorems 7 and 1 that approximation commutes with composition, i.e. that .

#### Discretization schemes

We will use (10) and (11) to build sequences of arbitrarily good approximations of kernels. For this we introduce the following terminology.

###### Definition 4.

We define a discretization scheme for an SB-space $X$ to be a countable co-directed diagram (ccd) $(X_i)_{i \in I}$ of finite spaces for which $X$ is a cone (not necessarily a limit).

If $(X_i)_{i \in I}$ is a discretization scheme of $X$ and $p_i: X \to X_i$ are the maps making $X$ a cone, then it follows from the definition that $\sigma(p_i) \subseteq \sigma(p_j)$ if $i \leq j$, where $\sigma(p_i)$ is the σ-algebra generated by $p_i$. For each $i$ the finite quotient $p_i$ defines a measurable partition of $X$ whose disjoint components we will call cells.

By Theorem 1 every SB-space has a discretization scheme for which it is not just a cone but a limit.

In practice we will work with discretization schemes linearly ordered by $\mathbb{N}$. In this case the sequence $(\sigma(p_n))_{n \in \mathbb{N}}$ defines what probabilists call a filtration and we will denote the approximation given by (11) simply by $f^n$.

## 6. Convergence

We now turn to the question of convergence of approximations. There appears to be little literature on the subject of the convergence of approximations of Markov kernels. One rare reference is (Choo-Whan, 1972). Via the functor $S_1$ defined above in Sections 3 and 4 we can seek a topology in terms of the operators associated to a sequence of kernels. Indeed, following (Choo-Whan, 1972), we will prove convergence results for the Strong Operator Topology (SOT).

###### Definition 1.

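We will say that a sequence of kernels $(f_n)$ converges to $f$ in strong operator topology, and write $f_n \xrightarrow{s} f$, if $S_1 f_n$ converges to $S_1 f$ in the strong operator topology, i.e. if for every integrable $\phi$

$$\lim_{n \to \infty} \|S_1 f_n(\phi) - S_1 f(\phi)\|_1 = 0$$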

#### Proving convergence.

We start with the following key lemma which is a consequence of Lévy’s upward convergence Theorem ((Williams, 1991, Th. 14.2)) .

###### Lemma 2.

Let be a -morphism and let be a discretization scheme such that for the Borel σ-algebra of we have

$$\mathcal{B}_X = \sigma\Big(\bigcup_n \sigma(p_n)\Big)$$

and let be measurable, then for

$$\lim_{n\to\infty} f_n(x)(A) = f(x)(A)$$

for -almost every . Moreover,

$$\lim_{n\to\infty} \left\| \mathsf{S}_1 f_n(1_A) - \mathsf{S}_1 f(1_A) \right\|_1 = 0$$

###### Theorem 3 (Convergence of Approximations Theorem).

Under the conditions of Lemma 2, for -almost every

$$\lim_{n\to\infty} f_n(x)(A) = f(x)(A)$$

for all Borel subsets . Moreover,

$$\lim_{n\to\infty} \left\| \mathsf{S}_1 f_n(\phi) - \mathsf{S}_1 f(\phi) \right\|_1 = 0$$

for any . In other words .

Note that operators of the shape obtained from a discretization scheme are finite rank operators. Thus we have in fact also obtained a theorem for approximating stochastic operators by stochastic operators of finite rank in the SOT. In general, we cannot hope for convergence in the stronger norm topology, since the identity operator, which is stochastic, is a limit of operators of finite rank in the norm topology iff the space is finite dimensional.

Note also that the various relationships established in Section 4 allow us to move from an approximation of a kernel to an approximation of the corresponding Markov operator. Since a discretization scheme making will also make , it follows from Theorem 7 that we get a finite rank approximation of the Markov operator .
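The martingale mechanism behind Lemma 2 can be illustrated numerically. The sketch below is not the kernel approximation itself: it only computes the conditional expectation of a test function with respect to the dyadic partitions of [0, 1) under Lebesgue measure, and measures the L1 distance to the original function. The test function, the dyadic scheme and the midpoint quadrature are all assumptions chosen for illustration.

```python
def l1_error(phi, n, samples_per_cell=200):
    """L1 distance between phi and its cell-wise average on the 2**n dyadic cells of [0, 1).

    Integrals are approximated with a midpoint rule inside each cell.
    """
    cells = 2 ** n
    width = 1.0 / cells
    step = width / samples_per_cell
    error = 0.0
    for k in range(cells):
        xs = [k * width + (j + 0.5) * step for j in range(samples_per_cell)]
        values = [phi(x) for x in xs]
        mean = sum(values) / samples_per_cell  # conditional expectation on cell k
        error += sum(abs(v - mean) for v in values) * step
    return error


def phi(x):
    """Hypothetical test function."""
    return x * x


errors = [l1_error(phi, n) for n in range(1, 7)]
```

For this smooth test function the error roughly halves with each refinement, in line with the L1 convergence asserted above; no rate is claimed in general.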

## 7. Applications

### 7.1. Approximate Bayesian Inference

Consider again the inference problem from the introduction. There one needed to invert with prior . We can use Theorem 1 to see how our approximate Bayesian inverse compares to the exact solution which in this simple case is known to be . To do this, we use a doubly indexed discretization scheme:

$$q_{mn} : \mathbb{R} \to 2 \times m \times n + 2$$

defining a window of width centred at divided in equal intervals; with the remaining intervals and each sent to a point (hence the above).

Since all classes induced by have positive -mass, approximants can be computed simply as:

$$f_{m,n}([k])([l]) = \mu[k]^{-1} \int_{x \in [k]} \mathcal{N}(x,1)([l]) \, d\mu$$

where , range over classes of . The corresponding stochastic matrices are shown in Fig. 3 and 3 for and respectively.

Since these approximants are finite, their Bayesian inverse can be computed directly by Bayes theorem (i.e. taking the adjoint of the stochastic matrices):

$$f_{m,n}^{\dagger}([l])([k]) = \frac{\mu[k] \cdot f_{m,n}([k])([l])}{\nu[l]} \tag{12}$$

with . Commutation of inversion and approximation guarantees that the converge to .
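The computation of the approximants and of their inverses via (12) can be sketched as follows. This is a minimal illustration under stated assumptions: the prior is taken to be a standard normal (a hypothetical choice made here so that the example runs), the window is [-m, m] cut into cells of width 1/n with two tail cells, the tail cells are truncated for quadrature, and the marginal ν is computed as the pushforward of the prior through the approximant. All function names are hypothetical.

```python
import math


def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prior_density(x):
    """Density of the assumed prior, a standard normal (hypothetical choice)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cell_edges(m, n):
    """Edges of the 2*m*n + 2 cells: two tails and 2*m*n intervals of width 1/n."""
    inner = [-m + i / n for i in range(2 * m * n + 1)]
    return [-math.inf] + inner + [math.inf]


def approximant(m, n, quad=50, tail=8.0):
    """Return the prior masses mu[k] and the stochastic matrix f[k][l]."""
    edges = cell_edges(m, n)
    cells = list(zip(edges[:-1], edges[1:]))
    mu, f = [], []
    for a, b in cells:
        mu.append(Phi(b) - Phi(a))
        # Midpoint quadrature over the cell; tail cells are truncated.
        lo, hi = max(a, -m - tail), min(b, m + tail)
        step = (hi - lo) / quad
        xs = [lo + (j + 0.5) * step for j in range(quad)]
        ws = [prior_density(x) for x in xs]
        total = sum(ws)
        # N(x, 1)([c, d)) = Phi(d - x) - Phi(c - x), averaged against the prior.
        f.append([
            sum(w * (Phi(d - x) - Phi(c - x)) for w, x in zip(ws, xs)) / total
            for c, d in cells
        ])
    return mu, f


def bayes_inverse(mu, f):
    """Apply (12): inv[l][k] = mu[k] * f[k][l] / nu[l], with nu the marginal."""
    size = len(mu)
    nu = [sum(mu[k] * f[k][l] for k in range(size)) for l in range(size)]
    inv = [[mu[k] * f[k][l] / nu[l] for k in range(size)] for l in range(size)]
    return nu, inv


mu, f = approximant(2, 2)
nu, inv = bayes_inverse(mu, f)
```

Both `f` and `inv` are row-stochastic matrices indexed by cells, so each row of `inv` is a discretized posterior over the cells of the parameter space given an observed cell.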

Indeed, Fig. 1 shows the Lebesgue density of for (in dashed blue) and (dashed red). The latter approximant is already hardly distinguishable from the exact solution (solid black).

It must be emphasized that this example is meant only as an illustration and does not constitute a universal solution to the irreducibly hard (not even computable in general (Ackerman et al., 2011)) problem of performing Bayesian inversion. Also, not all quotients are equally convenient: what makes the approach computationally tractable is that the fibres are easily described and the measure is conveniently evaluated on such fibres.

### 7.2. Approximating the Kleene star of ProbNetKAT

ProbNetKAT (Foster et al., 2016; Smolka et al., 2017) is a probabilistic network specification language extending Kleene Algebras with Tests (Kozen, 1997) with network primitives and a binary probabilistic choice operator . For the purpose of the example shown here we do not need to introduce the full syntax and semantics of ProbNetKAT; rather, we focus on a single ProbNetKAT program, which we call $\mathsf{cantor}$ and which is given by:

$$\mathsf{cantor} := p ; (\mathsf{dup} ; p)^{*} \quad \text{where} \quad p := \pi_0! \oplus_{\frac{1}{2}} \pi_1! \tag{13}$$

The program acts on sets of finite sequences of 0 and 1, which can be thought of as packet histories. We will write for the set of all packet histories and for the set of histories of length at most . A ProbNetKAT program is always interpreted as a kernel . Programs with both and have proved to be quite complex since the earliest development of the language. As we will describe, denotes a continuous distribution, and hence having a way to approximate it is crucial for practical uses of the language. The denotation of on a single sequence is:

$$\llbracket \pi_0! \rrbracket(\{(a_0,\dots,a_n)\}) = \delta_{\{(0,a_1,\dots,a_n)\}}$$

in other words overwrites the first entry in the sequence with . Similarly, overwrites the first entry with . This semantics is extended to sets of sequences in the obvious way by taking direct images. The semantics of is thus:

$$\llbracket p \rrbracket(a) = 0.5\,\delta_{\llbracket \pi_0! \rrbracket(a)} + 0.5\,\delta_{\llbracket \pi_1! \rrbracket(a)}$$

The denotation of is given on singleton histories by

$$\llbracket \mathsf{dup} \rrbracket(\{(a_0,\dots,a_n)\}) = \delta_{\{(a_0,a_0,\dots,a_n)\}}$$

i.e. shifts the history to the right and duplicates the first entry. Again, this is extended to sets of histories by taking direct images. The sequential composition operator is interpreted by Kleisli composition.
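The semantics of the primitives described so far can be sketched on finite data as follows. This is an illustrative model only, not the semantics of the full language: a history is represented as a non-empty tuple of bits, a set of histories as a `frozenset`, and a finitely supported distribution as a dictionary from sets of histories to probabilities. The function names are hypothetical, and the Kleene star is deliberately left out.

```python
from collections import defaultdict


def overwrite_first(bit):
    """Model of pi_bit!: overwrite the first entry of every history with `bit`."""
    def program(histories):
        return {frozenset((bit,) + h[1:] for h in histories): 1.0}
    return program


def dup(histories):
    """Model of dup: shift every history to the right, duplicating its first entry."""
    return {frozenset((h[0],) + h for h in histories): 1.0}


def choice(r, f, g):
    """Probabilistic choice: run f with probability r and g with probability 1 - r."""
    def program(histories):
        out = defaultdict(float)
        for result, prob in f(histories).items():
            out[result] += r * prob
        for result, prob in g(histories).items():
            out[result] += (1.0 - r) * prob
        return dict(out)
    return program


def seq(f, g):
    """Sequential composition, interpreted as Kleisli composition."""
    def program(histories):
        out = defaultdict(float)
        for mid, p_mid in f(histories).items():
            for result, p_res in g(mid).items():
                out[result] += p_mid * p_res
        return dict(out)
    return program


p = choice(0.5, overwrite_first(0), overwrite_first(1))
one_round = seq(seq(p, dup), p)  # the program p ; dup ; p
dist = one_round(frozenset({(0,)}))
```

On the input `{(0,)}` the program `p ; dup ; p` yields the four singleton sets of two-bit histories, each with probability 0.25; each further round of `dup ; p` adds one more fair bit.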

The interpretation of the Kleene star is more involved, and we describe it here categorically. To avoid any confusion we will not use Kleisli arrows in this construction, i.e. all kernels will be explicitly typed as kernels. Note first that the infinite product can be defined as the limit of the ccd given by the maps dropping the last component. By Bochner’s theorem (Dahlqvist et al., 2016a) this also holds of . Next, consider any program . We turn into a cone for the diagram with limit via the inductively defined maps:

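$$a_1 = \eta \otimes \llbracket r \rrbracket \bullet \Delta_1 : 2^{H} \to G(2^{H} \times 2^{H}) \tag{14}$$

$$a_n = a_{n-1} \otimes \llbracket r \rrbracket \bullet \Delta_n : (2^{H})^{n} \to G((2^{H})^{n} \times 2^{H}) \tag{15}$$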

where is the map copying the last entry. It is easy to check , and the diagram described by the morphisms makes a cone for . There must therefore exist a unique morphism

$$\llbracket r \rrbracket^{\infty} : 2^{H} \to G((2^{H})^{\infty}).$$

For each input, this kernel builds a distribution on the sample paths of the discrete-time stochastic processes associated with and this input. We now define

$$\llbracket r^{*} \rrbracket := G\big(\textstyle\bigcup\big) \circ \llbracket r \rrbracket^{\infty}$$

where is the map taking infinitary unions. Since the definition above makes sense for any kernel on , we will overload the Kleene star and put . Given the input , a sample path of