# The Geometry of Bayesian Programming

We give a geometry of interaction model for a typed lambda-calculus endowed with operators for sampling from a continuous uniform distribution and soft conditioning, namely a paradigmatic calculus for higher-order Bayesian programming. The model is based on the category of measurable spaces and partial measurable functions, and is proved adequate with respect to both a distribution-based and a sampling-based operational semantics.

11/26/2020


## 1 Introduction

Randomisation provides the most efficient algorithmic solutions, at least in practice, in many different contexts. A typical example is primality testing, where the Miller-Rabin test [Miller1976, Rabin1980] remains the preferred choice even though deterministic polynomial-time algorithms have been available for many years now [AKS2002]. Probability theory can be exploited even more fundamentally in programming, by way of so-called probabilistic (or, more specifically, Bayesian) programming, as popularized by languages like, among others, ANGLICAN [WMM2014] or CHURCH [GMRBT08]. This has stimulated research about probabilistic programming languages and their semantics [jones1990, DH2002, EPT2018], together with type systems [DLG2017, BDL2018], equivalence methodologies [DLSA2014, CDL2014], and verification techniques [SABGGH2019].
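To make the opening example concrete, here is a minimal sketch of the Miller-Rabin test in Python; the round count `k` is a parameter of our choosing, controlling the error probability, and is not prescribed by any particular presentation of the algorithm.

```python
import random

def miller_rabin(n, k=20):
    """Probabilistic primality test: returns False if n is composite,
    True if n is probably prime (error probability at most 4**-k)."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    # Write n - 1 as 2**r * d with d odd.
    r, d = 0, n - 1
    while d % 2 == 0:
        r += 1
        d //= 2
    for _ in range(k):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True
```

Each round either finds a witness of compositeness or not; a composite number passes a round with probability at most 1/4, which is why a handful of rounds suffices in practice.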

Giving a satisfactory denotational semantics to higher-order functional languages is already problematic in the presence of probabilistic choice [jones1990, JT1998], and becomes even more challenging when continuous distributions and scoring are present. Recently, quasi-Borel spaces [hksy2017] have been proposed as a way to give semantics to calculi with all these features, and only very recently [VKS2019] this framework has been shown to be adaptable to a fully-fledged calculus for probabilistic programming, in which continuous distributions and soft-conditioning are present. Probabilistic coherent spaces [DE2011] are fully abstract [EPT2018] for λ-calculi with discrete probabilistic choice, and can, with some effort, be adapted to calculi with sampling from continuous distributions [EPT2018POPL], although without scoring.

A research path which has been studied only marginally, so far, consists in giving semantics to Bayesian higher-order programming languages through interactive forms of semantics, e.g. game semantics [HO2000, AJM2000] or the geometry of interaction [girard1989]. One of the very first models for higher-order calculi with discrete probabilistic choice was in fact a game model, proved fully abstract for a probabilistic calculus with global ground references [DH2002]. After more than ten years, a parallel form of Geometry of Interaction (GoI) and some game models have been introduced for λ-calculi with probabilistic choice [DLFVY2017, CCPW2018, CP2018], but in all these cases only discrete probabilistic choice can be handled, with the exception of a recent work on concurrent games and continuous distributions [PW2018].

In this paper, we will report on some results about GoI models of higher-order Bayesian languages. The distinguishing features of the introduced GoI model can be summarised as follows:

• Simplicity. The category on which the model is defined is the one of measurable spaces and partial measurable functions, so it is completely standard from a measure-theoretic perspective.

• Expressivity. As is well-known, the GoI construction [jsv, ahs2002] makes it possible to give semantics to calculi featuring higher-order functions and recursion. Indeed, our GoI model can be proved adequate for PCFSS, a fully-fledged calculus for probabilistic programming.

• Flexibility. The model we present is quite flexible, in the sense of being able to reflect the operational behaviour of programs as captured by both the distribution-based and the sampling-based semantics.

• Intuitiveness. GoI visualises the structure of programs in terms of graphs, from which dependencies between subprograms can be analysed. Adequacy of our model thus provides a diagrammatic reasoning principle for observational equivalence of PCFSS programs.

This paper’s contributions, besides the model’s definition, are two adequacy results which precisely relate our GoI model to the operational semantics, as expressed (following [bdlgs2016]) in both the distribution and the sampling styles. As a corollary of our adequacy results, we show that the distribution induced by the sampling-based operational semantics coincides with the distribution-based operational semantics.

### 1.1 Turning Measurable Spaces into a GoI Model

Before entering into the details of our model, it is worthwhile to give some hints about how the proposed model is obtained, and why it differs from similar GoI models from the literature.

The proposed model stems from the thread of work on so-called memoryful geometry of interaction [hmh2014, mhh2016]. The underlying idea of this paper is precisely the same: program execution is modelled as an interaction between the program and its environment, and memoisation takes place inside the program as a result of the interaction.

In the previous work on memoryful GoI by the second author with Hasuo and Muroya, the goal consisted in modelling a λ-calculus with algebraic effects. Starting from a monad together with some algebraic effects, they gave an adequate GoI model for such a calculus, applicable to a wide range of algebraic effects. In principle, then, their recipe could be applicable to PCFSS, since the sampling-based operational semantics enables us to see scoring and sampling as algebraic effects acting on global states. However, that recipe would not work for PCFSS: the category of measurable spaces, on which we need to work because we want adequacy for the distribution-based semantics, is not cartesian closed, and we thus cannot define a state monad by way of the exponential.

In this paper, we sidestep this issue by a series of translations, to be described in Section 4 below. Instead of looking for a state monad on the category of measurable spaces, we embed the latter into a category of Mealy machines (Section 5) and use a state monad on that category. This is doable because the latter is a compact closed category given by the construction of [ahs2002]. The use of such compact closed categories (or, more generally, of traced monoidal categories) is the way GoI models capture higher-order functions.

### 1.2 Outline

The rest of the paper is organised as follows. After giving some necessary measure-theoretic preliminaries in Section 2 below, we introduce in Section 3 the language PCFSS, together with the two kinds of operational semantics we were referring to above. In Section 4, we introduce our GoI model informally, while in Section 5 a more rigorous treatment of the involved concepts is given, together with the adequacy results. We then discuss an alternative way of giving a GoI semantics to PCFSS based on s-finite kernels, and we conclude in the final section.

## 2 Measure-Theoretic Preliminaries

We recall some basic notions in measure theory that will be needed in the following. We also fix some useful notation. For more about measure theory, see standard textbooks such as [billingsley1986].

A σ-algebra on a set X is a family Σ_X of subsets of X such that X ∈ Σ_X; if A ∈ Σ_X, then the complement X∖A is in Σ_X; and for any countable family {A_i}_{i∈I} ⊆ Σ_X, the intersection ⋂_{i∈I} A_i is in Σ_X. A measurable space X is a set |X| equipped with a σ-algebra Σ_X on |X|. We often confuse a measurable space with its underlying set |X|. For example, we simply write x ∈ X instead of x ∈ |X|. For measurable spaces X and Y, we say that a partial function f from X to Y (in this paper, the same arrow notation is used for both partial functions and total functions) is measurable when for all A ∈ Σ_Y, the inverse image

 f−1(A) = {x∈X : f(x) is defined and is an element of A}

is in Σ_X. A measurable function from X to Y is a totally defined partial measurable function. A (partial) measurable function f from X to Y is invertible when there is a measurable function g from Y to X such that g ∘ f and f ∘ g are identities. In this case, we say that f is an isomorphism from X to Y and say that X is isomorphic to Y.

We denote a singleton set by {∗}, and we regard the latter as a measurable space by endowing it with the trivial σ-algebra. We also regard the empty set ∅ as a measurable space in the obvious way. In this paper, N denotes the measurable space of all non-negative integers equipped with the σ-algebra consisting of all subsets of N, and R denotes the measurable space of all real numbers equipped with the σ-algebra Σ_R consisting of Borel sets, that is, the least σ-algebra that contains all open subsets of R. By the definition of Σ_R, a function f from R to R is measurable whenever f−1(U) ∈ Σ_R for all open subsets U ⊆ R. Therefore, all continuous functions on R are measurable.

When A is a subset of the underlying set of a measurable space X, we can equip A with the σ-algebra Σ_A = {B ∩ A : B ∈ Σ_X}. This way, we regard the unit interval and the set of all non-negative real numbers as measurable spaces, and indicate them as follows:

 R[0,1]={a∈R:0≤a≤1},R≥0={a∈R:a≥0}

For measurable spaces X and Y, we define the product measurable space X × Y and the coproduct measurable space X + Y by

 |X×Y| =|X|×|Y|, |X+Y| ={(∙,x):x∈X}∪{(∘,y):y∈Y}

where the underlying -algebras are:

 ΣX×Y =the least σ-algebra such that A×B∈ΣX×Y for all A∈ΣX and B∈ΣY, ΣX+Y ={{∙}×A∪{∘}×B:A∈ΣX and B∈ΣY}.

We assume that × has higher precedence than +, i.e., we write X × Y + Z for (X × Y) + Z. In this paper, we always regard the finite power R^n as the n-fold product measurable space. It is well-known that the σ-algebra Σ_{R^n} is the set of all Borel sets of R^n, i.e., it is the least σ-algebra that contains all open subsets of R^n. Partial measurable functions are closed under composition, products and coproducts.

Let X be a measurable space. A measure on X is a function μ from Σ_X to R≥0 ∪ {∞}, that is, the set of all non-negative real numbers extended with infinity, such that

• μ(∅) = 0; and

• for any mutually disjoint countable family {A_n}_{n∈N} ⊆ Σ_X, we have μ(⋃_{n∈N} A_n) = Σ_{n∈N} μ(A_n).

We say that a measure μ on X is finite when μ(X) < ∞, and that it is σ-finite if |X| = ⋃_{n∈N} A_n for some family {A_n}_{n∈N} ⊆ Σ_X satisfying μ(A_n) < ∞ for all n ∈ N.

For a measurable space X, we write ∅_X for the measure on X given by ∅_X(A) = 0 for all A ∈ Σ_X. If μ is a measure on a measurable space X, then for any non-negative real number a, the function a μ : A ↦ a μ(A) is also a measure on X. The Borel measure μBorel on R^n is the unique measure that satisfies

 μBorel([a1,b1]×⋯×[an,bn])=∏1≤i≤n|ai−bi|.

The Borel measure on a measurable subset of R^n is defined by restriction. For a measurable function f from R^n to R and a measurable subset X ⊆ R^n, we denote the integral of f with respect to the Borel measure restricted to X by

 ∫Xf(u)du.

For a measurable space X and an element x ∈ X, the Dirac measure δx on X is given by

 δx(A)=[x∈A]={1,if x∈A;0,if x∉A.

The square bracket notation on the right-hand side is called Iverson’s bracket. In general, for a proposition P, we have [P] = 1 when P is true and [P] = 0 when P is false.
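To fix intuitions, discrete measures can be sketched as functions from measurable sets (represented here as predicates) to non-negative reals; the names `iverson`, `dirac` and `scale` are ours, introduced purely for illustration.

```python
def iverson(p):
    """Iverson bracket: [P] = 1 if P holds, 0 otherwise."""
    return 1.0 if p else 0.0

def dirac(x):
    """Dirac measure at x: delta_x(A) = [x in A], with A given as a predicate."""
    return lambda A: iverson(A(x))

def scale(a, mu):
    """For non-negative a, the function A -> a * mu(A) is again a measure."""
    return lambda A: a * mu(A)
```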

###### Proposition 2.1.

For all σ-finite measures μ on a measurable space X and ν on a measurable space Y, there is a unique measure μ ⊗ ν on X × Y such that (μ ⊗ ν)(A × B) = μ(A) ν(B) for all A ∈ Σ_X and B ∈ Σ_Y.

The measure μ ⊗ ν is called the product measure of μ and ν. For example, the Borel measure on R^{n+m} is the product measure of the Borel measures on R^n and R^m.

Finally, let us recall the notion of a kernel, which is a well-known concept in the theory of stochastic processes. For measurable spaces X and Y, a kernel from X to Y is a function k : |X| × Σ_Y → R≥0 ∪ {∞} such that for any x ∈ X, the function k(x, −) is a measure on Y, and for any A ∈ Σ_Y, the function k(−, A) is measurable. Notions of finite and σ-finite kernels can be naturally given, following the eponymous constraints on measures. Those kernels which can be expressed as the sum of countably many finite kernels are said to be s-finite [staton2017]. We use kernels to give semantics to our probabilistic programming language, to be defined in the next section.
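A kernel can likewise be sketched as a two-argument function that is a measure in its second argument for each fixed first argument. The following toy Bernoulli kernel (our own illustrative example, not part of the paper's development) makes the two conditions visible:

```python
def coin_kernel(x, A):
    """Toy kernel: given a parameter x in [0, 1], the measure k(x, -)
    puts mass x on 1.0 and mass 1 - x on 0.0 (a Bernoulli kernel).
    For each fixed x this is a probability (hence finite) measure;
    for each fixed A, the map x -> k(x, A) is measurable."""
    mass = 0.0
    if A(1.0):
        mass += x
    if A(0.0):
        mass += 1.0 - x
    return mass
```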

## 3 Syntax and Operational Semantics

### 3.1 Syntax and Type System

Our language PCFSS for higher-order Bayesian programming can be seen as Plotkin's PCF endowed with real numbers, measurable functions, sampling from the uniform distribution on [0,1], and soft-conditioning. We first define types A, B, values V, W and terms M, N as follows:

 A, B ::= Unit ∣ Real ∣ A → B
 V, W ::= skip ∣ x ∣ λxA.M ∣ ra ∣ fixA,B(f, x, M)
 M, N ::= V ∣ V W ∣ let x be M in N ∣ ifz(V, M, N) ∣ F(V1, …, V|F|) ∣ sample ∣ score(V)

Here, x varies over a countably infinite set of variable symbols, and a varies over the set of all real numbers. Each function identifier F is associated with a measurable function funF from R^{|F|} to R, where |F| is the arity of F. For terms M and V, we write M{V/x} for the capture-avoiding substitution of V for x in M.

Terms in PCFSS are restricted to A-normal forms, in order to make some of the arguments on our semantics simpler. This restriction is harmless for the language's expressive power, thanks to the presence of let-bindings. For example, term application M N can be recovered as let x be M in (let y be N in x y).

The term constructor score(−) and the constant sample enable probabilistic programming in PCFSS. Evaluation of score(ra) has the effect of multiplying the weight of the current probabilistic branch by |a|, this way enabling a form of soft-conditioning. The constant sample generates a real number randomly drawn from the uniform distribution on [0,1]. Only one sampling mechanism is sufficient because we can model sampling from other standard distributions by composing sample with measurable functions [wcgc2018].
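The claim that a single uniform sampler suffices can be illustrated by inverse-transform sampling: composing a uniform draw with suitable measurable functions yields other standard distributions. A sketch (the function names are ours):

```python
import math

def exponential_from_uniform(u, rate=1.0):
    """Inverse CDF of the exponential distribution applied to u ~ U[0, 1)."""
    return -math.log(1.0 - u) / rate

def standard_normal_pair(u1, u2):
    """Box-Muller transform: two independent standard normals
    from two independent uniform draws u1, u2 in [0, 1)."""
    r = math.sqrt(-2.0 * math.log(1.0 - u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)
```

Both functions are measurable maps, so feeding them outputs of sample stays within the semantic framework of Section 2.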

Terms can be typed in a natural way. A context Δ is a finite sequence consisting of pairs of a variable and a type such that every variable appears in Δ at most once. A type judgement is a triple Δ ⊢ M : A consisting of a context Δ, a term M and a type A. We say that a type judgement is derivable when we can derive it from the typing rules in Figure 1. Here, the type of sample is Real because sample returns a real number, while the type of score(V) is Unit because the purpose of scoring is its side effect.

In the sequel, we only consider derivable type judgements and typable closed terms, that is, closed terms M such that ⊢ M : A is derivable for some type A.

### 3.2 Distribution-Based Operational Semantics

We define the distribution-based operational semantics following [bdlgs2016], where, however, a σ-algebra on the set of terms is necessary so as to define evaluation results of terms as distributions (i.e. measures) over values. In this paper, we only consider evaluation of terms of type Real and avoid introducing σ-algebras on sets of closed terms, thus greatly simplifying the overall development.

Distribution-based operational semantics is a function that sends a closed term M of type Real to a measure on R. Because of the presence of score, the measure may not be a probability measure, i.e., its total mass may be larger than 1; the idea of distribution-based operational semantics is precisely that of associating each closed term of type Real with a measure over R.

As common in call-by-value programming languages, evaluation is defined by way of evaluation contexts:

 E[−] ::= [−] ∣ let x be E[−] in M.

The distribution-based operational semantics of PCFSS is a family of binary relations ⇒n between closed terms of type Real and measures on R, inductively defined by the evaluation rules in Figure 2, where the evaluation rule for sample is inspired by the one in [staton2017]. The binary relation in the precondition of the third rule in Figure 2 is called deterministic reduction and is defined as follows as a relation on closed terms:

 (λxA.M)V red⟶ M{V/x}, let x be V in M red⟶ M{V/x}, fixA,B(f,x,M)V red⟶ M{fixA,B(f,x,M)/f, V/x}, ifz(r0,M,N) red⟶ M, ifz(ra,M,N) red⟶ N (a≠0), F(ra,…,rb) red⟶ rfunF(a,…,b).
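The deterministic reduction rules can be mirrored by a one-step function on a toy abstract syntax; the tuple-based constructors below are our own illustration, not the paper's formal syntax:

```python
# Toy AST: ("real", a), ("skip",), ("var", x), ("lam", x, body),
# ("app", f, arg), ("ifz", cond, then, else).

def subst(t, x, v):
    """Capture-avoiding substitution M{v/x}, assuming v is a closed value."""
    tag = t[0]
    if tag == "var":
        return v if t[1] == x else t
    if tag == "lam":
        # Do not substitute under a binder that shadows x.
        return t if t[1] == x else ("lam", t[1], subst(t[2], x, v))
    if tag == "app":
        return ("app", subst(t[1], x, v), subst(t[2], x, v))
    if tag == "ifz":
        return ("ifz",) + tuple(subst(s, x, v) for s in t[1:])
    return t  # constants: real, skip

def step(t):
    """One deterministic reduction step at the root, or None if no redex."""
    tag = t[0]
    if tag == "app" and t[1][0] == "lam":   # (lambda x. M) V  ->  M{V/x}
        return subst(t[1][2], t[1][1], t[2])
    if tag == "ifz" and t[1][0] == "real":  # ifz(r_a, M, N) -> M if a = 0 else N
        return t[2] if t[1][1] == 0 else t[3]
    return None
```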

The last evaluation rule in Figure 2 makes sense because the measure appearing in its precondition arises from a kernel from R^m to R:

###### Lemma 3.1.

For any n ∈ N and for any term

 x1:Real,…,xm:Real⊢M:Real,

there is a finite kernel k from R^m to R such that for any u ∈ R^m and for any measure μ on R,

 M{ra1/x1,…,ram/xm}⇒nμ⟺μ=k(u,−)

where u = (a1, …, am).

###### Proof.

Let Δ be a context of the form x1 : Real, …, xm : Real. In this proof, for a finite sequence u = (a1, …, am) of real numbers and for a term M, we denote

 M{ra1/x1,…,ram/xm}

by M{ru/Δ}. We prove the statement by induction on n. (Base case) Let k be the kernel from R^m to R given by

 k(u,A)=0.

Then for any u ∈ R^m,

 M{ra1/x1,…,ram/xm}⇒0μ⟺μ=∅R⟺μ=k(u,−).

(Induction step) We define redexes R by

 R ::= score(V) ∣ sample ∣ (λxA.M)V ∣ fixA,B(f,x,M)V ∣ let x be V in M ∣ ifz(V,M,N) ∣ F(V1,…,V|F|)

We note that the values V in the above BNF can be variables. By induction on the size of the type derivation, we can show that every term is either a value or of the form E[R] for some evaluation context E and some redex R. Given a term M with Δ ⊢ M : Real, we prove the induction step by case analysis on the shape of M.

• If M is a value, then M is either a variable or a constant ra. When M is a variable xi, we have

 xi{ra1/x1,…,ram/xm}≡rai⇒n+1μ⟺μ=δai.

When M is a constant ra, we have

 ra{ra1/x1,…,ram/xm}≡ra⇒n+1μ⟺μ=δa.

Both k and h, given by

 k((a1,…,am),A)=δai(A),h((a1,…,am),A)=δa(A)

are kernels from R^m to R.

• If M is of the form E[sample], then by the induction hypothesis, there is a kernel k from R^{m+1} to R such that for any (a1, …, am, a) ∈ R^{m+1},

 E[y]{ru/(Δ,y:Real)}⇒nμ⟺μ=k(u,−).

We define a kernel h from R^m to R by

 h((a1,…,am),A)=∫R[0,1]k((a1,…,am,a),A)da.

This is a kernel because if f : R^{m+1} → R≥0 is a non-negative measurable function, then

 (b,…,c)↦∫Rf(a,b,…,c)da

is measurable; see [billingsley1986, Theorem 18.3]. Then, for any u ∈ R^m,

 E[sample]{ru/Δ}⇒n+1μ ⟺μ=∫R[0,1]k((a1,…,am,a),−)da ⟺μ=h(u,−).
• If M is of the form E[score(xi)] for some 1 ≤ i ≤ m, then by the induction hypothesis, there is a kernel k from R^m to R such that for any u ∈ R^m,

 E[skip]{ru/Δ}⇒nμ⟺μ=k(u,−).

We define a kernel h from R^m to R by

 h((a1,…,am),A)=|ai|k((a1,…,am),A).

Then, for any u ∈ R^m,

 E[score(xi)]{ru/Δ}⇒n+1μ ⟺E[skip]{ru/Δ}⇒nν and μ=|ai|ν ⟺μ=h(u,−).
• If M is of the form E[score(ra)] for some a ∈ R, then by the induction hypothesis, there is a kernel k from R^m to R such that for any u ∈ R^m,

 E[skip]{ru/Δ}⇒nμ⟺μ=k(u,−).

We define a kernel h from R^m to R by

 h((a1,…,am),A)=|a|k((a1,…,am),A).

Then, for any u ∈ R^m,

 E[score(ra)]{ru/Δ}⇒n+1μ ⟺E[skip]{ru/Δ}⇒nν and μ=|a|ν ⟺μ=h(u,−).
• If M is of the form E[(λxA.N)V], then by the induction hypothesis, there is a kernel k from R^m to R such that for all u ∈ R^m,

 E[N{V/x}]{ru/Δ}⇒nμ⟺μ=k(u,−).

Hence,

 E[(λxA.N)V]{ru/Δ}⇒n+1μ ⟺E[N{V/x}]{ru/Δ}⇒nμ ⟺μ=k(u,−).
• If M is of the form E[fixA,B(f,x,N)V], then by the induction hypothesis, there is a kernel k from R^m to R such that for all u ∈ R^m,

 E[N{fixA,B(f,x,N)/f, V/x}]{ru/Δ}⇒nμ⟺μ=k(u,−).

Hence,

 E[fixA,B(f,x,N)V]{ru/Δ}⇒n+1μ ⟺E[N{fixA,B(f,x,N)/f, V/x}]{ru/Δ}⇒nμ ⟺μ=k(u,−).
• If M is of the form E[F(V1,…,V|F|)], then each Vi is equal to either a variable or a constant. For simplicity, we suppose that |F| = 2, V1 = xi and V2 = ra. By the induction hypothesis, there is a kernel k from R^{m+1} to R such that for all (u, b) ∈ R^{m+1},

 E[y]{ru/(Δ,y:Real)}⇒nμ⟺μ=k(u,−).

We define a kernel h from R^m to R by

 h((a1,…,am),A)=k((a1,…,am,funF(ai,a)),A).

Then, for any u ∈ R^m,

 E[F(xi,ra)]{ru/Δ}⇒n+1μ ⟺E[y]{ru/Δ, rfunF(ai,a)/y}⇒nμ ⟺μ=k((u,funF(ai,a)),−)=h(u,−).
• If M is of the form E[let x be V in N], then by the induction hypothesis, there is a kernel k from R^m to R such that for all u ∈ R^m,

 E[N{V/x}]{ru/Δ}⇒nμ⟺μ=k(u,−).

Hence,

 E[letxbeVinN]{ru/Δ}⇒n+1μ ⟺E[N{V/x}]{ru/Δ}⇒nμ ⟺μ=k(u,−).
• If M is of the form E[ifz(xi,N,L)] for some 1 ≤ i ≤ m, then by the induction hypothesis, there are kernels k and k′ from R^m to R such that for any u ∈ R^m,

 E[N]{ru/Δ}⇒nμ ⟺μ=k(u,−), E[L]{ru/Δ}⇒nμ ⟺μ=k′(u,−).

We define a kernel h from R^m to R by

 h(u,A)={k(u,A), if ai=0; k′(u,A), if ai≠0, where u=(a1,…,am).

Then, for any u ∈ R^m,

 E[ifz(xi,N,L)]{ru/Δ}⇒n+1μ ⟺(E[N]{ru/Δ}⇒nμ and ai=0) or (E[L]{ru/Δ}⇒nμ and ai≠0) ⟺μ=h(u,−).
• If M is of the form E[ifz(r0,N,L)], then by the induction hypothesis, there is a kernel k from R^m to R such that for any u ∈ R^m,

 E[N]{ru/Δ}⇒nμ⟺μ=k(u,−).

Hence,

 E[ifz(r0,N,L)]{ru/Δ}⇒n+1μ ⟺E[N]{ru/Δ}⇒nμ ⟺μ=k(u,−).
• If M is of the form E[ifz(ra,N,L)] for some non-zero real number a, then by the induction hypothesis, there is a kernel k from R^m to R such that for any u ∈ R^m,

 E[L]{ru/Δ}⇒nμ⟺μ=k(u,−).

Hence,

 E[ifz(ra,N,L)]{ru/Δ}⇒n+1μ ⟺E[L]{ru/Δ}⇒nμ ⟺μ=k(u,−).

Lemma 3.1 implies that each relation ⇒n can be seen as a function from the set of closed terms of type Real to the set of measures on R.

The step-indexed distribution-based operational semantics approximates the evaluation of closed terms by restricting the number of reduction steps. Thus, the limit of the step-indexed distribution-based operational semantics represents the “true” result of evaluating the underlying term.

###### Definition 3.1.

For a closed term M of type Real and a measure μ on R, we write M ⇒ μ when there is a family {μn}n∈N of measures on R such that M ⇒n μn for every n ∈ N and, for all A ∈ Σ_R,

 μ(A)=supn∈Nμn(A).

The binary relation ⇒ is a function from the set of closed terms of type Real to the set of measures on R. This follows from Lemma 3.1 together with the fact that the family of measures {μn}n∈N on R such that M ⇒n μn forms an ascending chain with respect to the pointwise order. Moreover, it can be proved that for any term x1 : Real, …, xm : Real ⊢ M : Real, the function sending (u, A) to μ(A), where M{ru/Δ} ⇒ μ, is an s-finite kernel.

### 3.3 Sampling-Based Operational Semantics

PCFSS can be endowed with another form of operational semantics, closer in spirit to inference algorithms, called the sampling-based operational semantics. The way we formulate it is deeply inspired by the one in [bdlgs2016].

The idea behind sampling-based operational semantics is to give the evaluation result of each probabilistic branch somehow independently. We specify each probabilistic branch by two parameters: one is a sequence of random draws, which will be consumed by sample; the other is a likelihood measure called weight, which will be modified by score.

###### Definition 3.2.

A configuration is a triple (M, w, s) consisting of a closed term M, a real number w called the configuration's weight, and a finite sequence s of real numbers in [0,1], called its trace.

Below, we write ε for the empty sequence. For a real number a and a finite sequence s consisting of real numbers, we write a :: s for the finite sequence obtained by putting a at the head of s. In Figure 3, we give the evaluation rules of the sampling-based operational semantics, where red⟶ is the deterministic reduction relation introduced in the previous section. We denote the reflexive and transitive closure of the one-step evaluation relation accordingly. Intuitively, a configuration (M, 1, s) evaluating to (ra, w, ε) means that by evaluating M, we get the real number a with weight w, consuming all the random draws in s.
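The role of weights and traces can be illustrated on a drastically simplified, straight-line fragment: evaluation consumes one draw from the trace at each sample and multiplies the weight at each score. The representation below is our own sketch, not the formal system of Figure 3:

```python
def run(program, trace):
    """Evaluate a straight-line list of instructions against a trace.
    Instructions: ("sample",) pops the next random draw;
    ("score", a) multiplies the weight by |a|.
    Returns (last sampled value, final weight) and requires the
    trace to be consumed exactly, as in the sampling-based rules."""
    weight, value = 1.0, None
    draws = list(trace)
    for instr in program:
        if instr[0] == "sample":
            value = draws.pop(0)
        elif instr[0] == "score":
            weight *= abs(instr[1])
    assert not draws, "trace must be fully consumed"
    return value, weight
```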

## 4 Towards Mealy Machine Semantics

In this section, we give some intuitions about our GoI model, which we also call Mealy machine semantics. Giving Mealy machine semantics for PCFSS requires translating PCFSS into the linear λ-calculus. This is because GoI is a semantics for linear logic, and is thus tailored to calculi in which terms are treated as resources. Schematically, Mealy machine semantics translates terms of PCFSS into Mealy machines in the following way.

 PCFSS
  —(1) Moggi's translation→ Moggi's meta-language + sample + score
  —(2) Girard translation→ the linear λ-calculus + sample + score
  —(3)→ proof structures + sample + score
  —(4)→ Mealy machines

In Section 4.1, we explain the first three steps. The last step deserves to be explained in more detail, which we do in Section 4.2. For the sake of simplicity, we ignore the translation of conditional branching and the fixed-point operator.
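Since the target of the translation is a Mealy machine, it may help to recall the notion concretely: an internal state together with a transition function sending (state, input) to (new state, output). A minimal sketch (the class and example are ours):

```python
class Mealy:
    """A Mealy machine: an initial state and a transition function
    delta(state, input) -> (new_state, output)."""
    def __init__(self, init, delta):
        self.state = init
        self.delta = delta

    def step(self, x):
        """Feed one input, update the state, and return the output."""
        self.state, out = self.delta(self.state, x)
        return out

# Example: a machine that outputs the running sum of its inputs.
adder = Mealy(0, lambda s, x: (s + x, s + x))
```

The interactive reading of GoI fits this shape: each token fed to the machine is a question or answer, and the internal state is where memoisation lives.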

### 4.1 From PCFSS to Proof Structures

#### 4.1.1 Moggi’s Translation

In the first step, we translate PCFSS into an extension of Moggi's meta-language by Moggi's translation [moggi1991]. Here, in order to translate scoring and sampling in PCFSS, we equip Moggi's meta-language with base types Unit and Real and the following terms:

 Δ⊢ra:Real (a∈R),  Δ⊢score(M):TUnit (whenever Δ⊢M:Real),  Δ⊢sample:TReal

where T is the monad of Moggi's meta-language. Any type A of PCFSS is translated into the type A♯ defined as follows:

 Unit♯=Unit,Real♯=Real,(A→B)♯=A♯→TB♯.

Terms score(V) and sample in PCFSS are translated into the corresponding terms of Moggi's meta-language. See [moggi1991] for more detail about Moggi's translation.
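The type part of the translation is structural and can be transcribed directly; the tuple encoding of types below is our own illustrative representation:

```python
def sharp(ty):
    """Moggi translation of types: Unit# = Unit, Real# = Real,
    (A -> B)# = A# -> T B#.  Types are "Unit", "Real", or
    ("arrow", dom, cod); monad application is ("T", ty)."""
    if ty in ("Unit", "Real"):
        return ty
    if isinstance(ty, tuple) and ty[0] == "arrow":
        return ("arrow", sharp(ty[1]), ("T", sharp(ty[2])))
    raise ValueError("unknown type")
```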

#### 4.1.2 Girard Translation

We next translate the extended Moggi's meta-language into an extension of the linear λ-calculus, by way of the so-called Girard translation [girard1987]. Types are given by

 A, B ::= Unit ∣ Real ∣ State ∣ A⊥ ∣ A⊗B ∣ A℘B ∣ !A

where Unit, Real and State are base types, and terms are generated by the standard term constructors of the linear λ-calculus, plus the following rules:

 Δ⊢ra:Real (a∈R),  Δ⊢score(M):State⊸State⊗!Unit (whenever Δ⊢M:!Real),  Δ⊢sample:State⊸State⊗!Real

(as customary in linear logic, A ⊸ B is an abbreviation of A⊥ ℘ B). These typing rules are derived from the following translation (−)♭ of types of the extended Moggi's meta-language into types of the extended linear λ-calculus:

 Unit♭=Unit, Real♭=Real, (A→B)♭=!A♭⊸B♭, (TA)♭=State⊸State⊗!A♭.

The definition of (TA)♭ is motivated by the following categorical observation: let L be the syntactic category of the extended linear λ-calculus, which is a symmetric monoidal closed category endowed with a comonad ! subject to certain coherence conditions (see e.g. [hs2003]), and let L! be the coKleisli category of the comonad !. Then, by composing the adjunction between L and L! with the state monad State ⊸ State ⊗ (−) on L, we obtain a monad T on L!:

which sends an object A to State ⊸ State ⊗ !A. This use of the state monad is motivated by the sampling-based operational semantics: we can regard PCFSS as a call-by-value λ-calculus with global states consisting of pairs of a non-negative real number and a finite sequence of real numbers, and we can regard score and sample as effectful operations interacting with those states.
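The reading of score and sample as state transformers on pairs of a weight and a trace can be sketched directly; this is a set-theoretic toy model of the categorical observation above, with names of our choosing:

```python
def sample_op(state):
    """Consume the head of the trace; return it with the new state."""
    weight, trace = state
    return trace[0], (weight, trace[1:])

def score_op(a, state):
    """Multiply the weight of the current branch by |a|."""
    weight, trace = state
    return None, (abs(a) * weight, trace)

def bind(m, f):
    """Kleisli composition for the state monad: m maps a state s
    to a pair (result, new state); f maps a result to such a map."""
    return lambda s: (lambda a_s: f(a_s[0])(a_s[1]))(m(s))
```

For instance, `bind(sample_op, lambda a: lambda s: score_op(a, s))` draws a value and then scores by it, threading the state through both operations.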

#### 4.1.3 The Third Step

We translate terms in the extended linear λ-calculus into an extension of proof structures [lafont1995], which are graphical presentations of type derivation trees of linear λ-terms. We can also understand proof structures as string diagrams for compact closed categories [selinger2011]. Operators of the pure linear λ-calculus can be translated as usual [lafont1995]. For example, type derivation trees

are translated into proof structures

respectively, where the nodes labelled with M and N are the proof structures associated to the type derivations of M and N. Terms of the form ra, score(M) and sample require new kinds of nodes:


This is not a direct adaptation of the typing rules for score and sample in the linear λ-calculus, but the correspondence can be recovered by way of multiplicatives: