The pervasive role of stochastic models in a variety of domains (such as machine learning, natural language processing, and verification) has prompted a vast body of research on probabilistic programming languages; such a language supports at least discrete distributions, by providing an operator which models sampling. In particular, the functional style of probabilistic programming, pioneered by, attracts increasing interest because it allows for higher-order computation and offers a level of abstraction well suited to dealing with mathematical objects. Early work [18, 25, 22, 27, 23] has evolved into a growing body of software development and theoretical research. In this context, the λ-calculus has often been used as a core language.
In order to model higher-order probabilistic computation, a natural approach is to take the λ-calculus as the general paradigm, and to enrich it with a probabilistic construct. The simplest and most concrete way to do so ([24, 8, 12]) is to equip the untyped λ-calculus with an operator ⊕ which models flipping a fair coin. This suffices for universality, as proved in , in the sense that the calculus is sound and complete with respect to computable probability distributions. The resulting calculus is however non-confluent, as was observed early on (see  for an analysis). We revisit the issue in Example 1. The problem with confluence is handled in the literature by fixing a deterministic reduction strategy, typically the leftmost-outermost strategy. This is unsatisfactory for both theoretical and practical reasons, as we discuss later.
In this paper, we propose a more general point of view. Our goal is a foundational calculus which plays the same role as the λ-calculus does for deterministic computation. More precisely, taking the point of view propounded by Plotkin in , we discriminate between a calculus and a programming language. The former defines the reduction rules, independently of any reduction strategy, and enjoys confluence and standardization; the latter is specified by a deterministic strategy (an abstract machine). Standardization is what relates the two: the programming language implements the standard strategy associated with the calculus. Indeed, standardization implies the existence of a strategy (the standard strategy) which is guaranteed to reach the result, if it exists.
In this spirit, we consider a probabilistic calculus to be characterized by a specific calling mechanism; the reduction is otherwise constrained only by the need to discriminate between duplicating a function which samples from a distribution, and duplicating the result of sampling. Think of tossing a coin and duplicating the result, versus tossing the coin twice; this is indeed the issue at the core of confluence failure, as the following examples (adapted from [9, 8]) show.
Example 1 (Confluence).
Let us consider the untyped λ-calculus extended with a binary operator ⊕ which models fair, binary probabilistic choice: M ⊕ N reduces to either M or N, each with probability 1/2. Intuitively, the result of evaluating a probabilistic term is a distribution on its possible values.
Consider the term M = DN, where D = λx.(x XOR x) and N = T ⊕ F; XOR is the standard construct for exclusive or, and T, F are terms which code the boolean values.
– If we first reduce N, we obtain DT or DF, with equal probability 1/2. Either way, M evaluates to F, i.e. F with probability 1.
– If we reduce the outermost β-redex first, M reduces to N XOR N, and the term evaluates to the distribution {T^(1/2), F^(1/2)}.
The two resulting distributions are not even comparable.
The same phenomenon appears even if we restrict ourselves to call-by-value. Consider for example the reductions of the analogous term, with D as in (1.). We obtain the same two different distributions as above.
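The two evaluation orders of Example 1 can be simulated concretely. The sketch below (in Python, with booleans standing in for the terms T and F, and exhaustive enumeration instead of actual sampling; all names are ours, not part of the calculus) computes the two output distributions and shows they are incomparable.

```python
from fractions import Fraction
from itertools import product

def xor(a, b):
    return a != b

fair_coin = [(True, Fraction(1, 2)), (False, Fraction(1, 2))]

# Order 1: reduce the probabilistic choice first, then duplicate the result.
sample_first = {}
for v, p in fair_coin:
    r = xor(v, v)                     # the sampled value is used twice
    sample_first[r] = sample_first.get(r, Fraction(0)) + p

# Order 2: fire the beta-redex first, duplicating the sampler itself.
dup_first = {}
for (v1, p1), (v2, p2) in product(fair_coin, repeat=2):
    r = xor(v1, v2)                   # two independent coin flips
    dup_first[r] = dup_first.get(r, Fraction(0)) + p1 * p2

assert sample_first == {False: Fraction(1)}
assert dup_first == {False: Fraction(1, 2), True: Fraction(1, 2)}
```

Neither distribution is pointwise below the other, matching the observation that the two results are not even comparable.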
In this paper, we define two probabilistic λ-calculi, respectively based on the call-by-value (CbV) and call-by-name (CbN) calling mechanisms. Both enjoy confluence and standardization, in an extended sense: indeed, we revisit these two fundamental notions to take into account the asymptotic behaviour of terms. The common root of the two calculi is a further calculus based on Linear Logic, which is an extension of Simpson's linear λ-calculus , and which allows us to develop a unified, modular approach.
Content and Contributions
In Section IV, we introduce a call-by-value calculus as a probabilistic extension of the call-by-value λ-calculus of Plotkin (where β-reduction fires only when the argument is a value, i.e. either a variable or a λ-abstraction). We choose to study call-by-value in detail for two main reasons. First, it is the mechanism most relevant to probabilistic programming (most of the abstract languages we cited are call-by-value, but so are real-world stochastic languages such as Church ). Second, call-by-value is a mechanism in which dealing with functions, and duplication of functions, is clean and intuitive, which allows us to address the issue at the core of confluence failure. The definition of value (in particular, a probabilistic choice is not a value), together with a suitable restriction of the evaluation context for the probabilistic choice, allows us to recover key results: confluence and a form of standardization (Section V). Recall that, in the classical λ-calculus, standardization means that there is a strategy which is complete for all reduction sequences, i.e., for every reduction sequence from M to N there is a standard reduction sequence from M to N. A standard reduction sequence with the same property exists here as well. An unexpected result is that strategies which are complete in the classical case are not so here, notably the leftmost strategy.
In Section VI we study the asymptotic behavior of terms. Our leading question is how the asymptotic behaviours of different sequences starting from the same term compare. We first analyze whether, and in which sense, confluence implies that the result of a probabilistically terminating computation is unique. We formalize the notion of asymptotic result via limit distributions, and establish that there is a unique maximal one.
In Section VII we address the question of how to find such a greatest limit distribution, a question which arises from the fact that evaluation in our calculus is non-deterministic, and different sequences may terminate with different probabilities. With this aim, we extend the notion of standardization to limits; this extension is non-trivial, and demands the development of new, sophisticated proof methods.
We prove that the new notion of standardization supplies a family of complete reduction strategies which are guaranteed to reach the unique maximal result. Remarkably, we are able to show that, when evaluating programs, i.e., closed terms, this family does include the leftmost strategy. As we have already observed, this is the deterministic strategy which is typically adopted in the literature, in either its call-by-value ([18, 7]) or its call-by-name version ([24, 12]), but without any completeness result with respect to probabilistic computation. Our result offers an “a posteriori” justification for its use!
The study of the CbV calculus allows us to develop a crisp approach, which we are then able to use in the study of different probabilistic calculi. Because the issue of duplication is central, it is natural to expect a benefit from the fine control over copies which is provided by Linear Logic. In Section IX we use our tools to introduce and study a probabilistic linear λ-calculus. The linear calculus provides not only a finer control over duplication, but also a modular approach to confluence and standardization, which allows us to formalize a call-by-name version of our calculus in Section X. We prove that it enjoys properties analogous to those of the call-by-value calculus, in particular confluence and standardization.
The idea of extending the λ-calculus with a probabilistic construct is not new; without any ambition to be exhaustive, let us cite [22, 27] and [24, 12, 8, 5]. In all these cases, a specific reduction strategy is fixed; they are indeed languages, not calculi, according to Plotkin's distinction.
Confluence failure in the probabilistic case is well-analyzed in , which then studies the operational semantics of both a CbV and CbN language, and the translation between the two.
The issue of confluence appears every time the λ-calculus is extended with a choice effect: quantum, algebraic, non-deterministic. Confluence for an algebraic calculus is dealt with in  for call-by-value, and in  for call-by-name. In the quantum case we would like to cite [7, 6, 10], which use Simpson's calculus . The ways of framing the same problem in different settings are naturally related, and we were inspired by them.
To our knowledge, the only proposal of a probabilistic λ-calculus in which the reduction is independent of a specific strategy is for call-by-name, namely the calculus of , in the line of work of the differential  and algebraic  λ-calculi. The focus in  is essentially semantical, as the author wants to study an equational theory for the λ-calculus, based on an extension of Böhm trees.  develops results which in their essence are similar to those we obtain for call-by-name in Sec. X, in particular confluence and standardization, even if his calculus –which internalizes the probabilistic behavior– is quite different from ours, and so are the proof techniques.
II Background and Motivational Observations
In this section, we first review -in a non-technical way- the specificities of probabilistic programs, and how they differ from classical ones. We then focus on some motivational observations which are relevant to our work. First, we give an example of features which are lost if a programming language is characterized by a strategy which is not rooted in a more general calculus. Then, we illustrate some of the issues which appear when we study a general calculus instead of a specific reduction strategy. Addressing these issues will lead us to develop new notions and tools in the paper.
II-A Classical vs. Probabilistic Programs
A classical program defines a deterministic input-output relation; it terminates (on a given input) or it does not; if it terminates, it takes finitely many steps to do so. Instead, a probabilistic program generates a probability distribution over possible outputs; it terminates (on a given input) with a certain probability; it may take infinitely many steps even when termination has probability 1.
A probabilistic program is a stochastic model. The intuition is that the program is executed, and random choices are made by sampling; this process defines a distribution over all its possible outputs. Even if the termination probability is 1 (almost-sure termination), that degree of certitude is typically not reached in any finite number of steps, but appears as a limit. A standard example is a term which reduces either to its normal form or to itself, each with probability 1/2. After n steps, the term reduces to the normal form with probability 1 − (1/2)^n. Only at the limit does this computation terminate with probability 1.
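The geometric behaviour just described can be made concrete. The following sketch (our own illustration; the term itself is not modelled, only the probabilities) computes the termination probability after n steps and shows that it never reaches 1 for finite n.

```python
from fractions import Fraction

def termination_probability(n):
    """Probability of having reached the normal form within n steps,
    when each step reaches it or restarts, each with probability 1/2."""
    return 1 - Fraction(1, 2) ** n

probs = [termination_probability(n) for n in range(1, 6)]
assert probs == [Fraction(1, 2), Fraction(3, 4), Fraction(7, 8),
                 Fraction(15, 16), Fraction(31, 32)]
# No finite number of steps gives probability 1:
# almost-sure termination holds only in the limit.
assert all(p < 1 for p in probs)
```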
Probabilistic vs. Quantitative
II-B Confluence of the calculus is relevant to programming
Functional languages have their foundation in the λ-calculus and its properties, and such properties (notably, confluence and standardization) have theoretical and practical implications. A strength of classical functional languages -which is assuming growing importance- is that they are inherently parallel (we refer e.g. to  for a discussion of deterministic parallel programming): every sub-expression can be evaluated in parallel, because referential transparency abstracts over the execution order; still, we can reason about, test, and debug a program using a sequential model, because the result of the calculus is independent of the evaluation order. Not forcing a sequential strategy impacts the implementation of the language, but also the conception of programs. As advocated by Harper , the parallelism of functional languages exposes the “dependency structure of the computation by not introducing any dependencies that are not forced on us by the nature of the computation itself."
This feature of functional languages is rooted in the confluence of the λ-calculus, and is an example of what is lost in the probabilistic setting if we give up either confluence or the possibility of non-deterministic evaluation.
Ii-C The result of probabilistic computation
A ground for our approach is the distinction between calculus and language. Some of the issues which we will need to address do not appear when working with probabilistic languages, because they are based on a simplification of the λ-calculus. Programming languages only evaluate programs, i.e., closed terms (terms without free variables). A striking simplification comes from another crucial restriction, weak evaluation, which does not evaluate function bodies (the scope of λ-abstractions). In weak call-by-value (the basis of the ML/CAML family of probabilistic languages), values are normal forms.
What is the result of a probabilistic computation is well understood only in the case of programming languages: the result of a program is a distribution on its possible outcomes, which are normal forms w.r.t. a chosen strategy. In the literature of probabilistic λ-calculus, two main deterministic strategies have been studied: the weak left strategy in CbV  and the head strategy in CbN , whose normal forms are respectively the closed values and the head normal forms.
When considering a calculus instead of a language, the identity between normal forms and results does not hold anymore, with important consequences in the definition of limit distributions. We investigate this issue in Sec. VI. The approach we develop is general and uniform to all our calculi.
III Technical Preliminaries
We review basic notions on discrete probability and rewriting which we use throughout the paper. We assume that the reader has some familiarity with the λ-calculus.
III-A Basics on Discrete Probability
A discrete probability space is given by a pair (Ω, μ), where Ω is a countable set and μ is a discrete probability distribution on Ω, i.e. a function from Ω to [0, 1] such that Σ_{ω∈Ω} μ(ω) = 1. In this case, a probability measure is assigned to any subset A ⊆ Ω as μ(A) = Σ_{ω∈A} μ(ω). In the language of probability theory, a subset of Ω is called an event.
Let (Ω, μ) be as above. Any function F : Ω → Δ, where Δ is another countable set, induces a probability distribution μ_F on Δ by composition: μ_F(d) = μ(F⁻¹(d)) = Σ_{F(ω)=d} μ(ω). In the language of probability theory, F is called a discrete random variable on Ω.
Example 2 (Die).
Consider tossing a die once. The space of possible outcomes is the set Ω = {1, 2, 3, 4, 5, 6}. The probability measure of each outcome is 1/6. The event
“result is odd" is the subset A = {1, 3, 5}, whose probability measure is μ(A) = 1/2.
Let B = {even, odd} be a set with two elements, and F the obvious function from Ω to B. F induces a distribution μ_F on B, with μ_F(even) = 1/2 and μ_F(odd) = 1/2.
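For concreteness, the die example can be checked mechanically. The sketch below (names are ours) computes the measure of the event and the distribution induced by the parity random variable.

```python
from fractions import Fraction

# The probability space of one die toss: each face has measure 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
assert sum(die.values()) == 1

# The event "result is odd" and its measure.
odd = {1, 3, 5}
assert sum(die[f] for f in odd) == Fraction(1, 2)

# The random variable parity : {1..6} -> {"even","odd"} induces a
# distribution on the two-element set by composition.
def parity(face):
    return "odd" if face % 2 == 1 else "even"

induced = {}
for face, p in die.items():
    induced[parity(face)] = induced.get(parity(face), Fraction(0)) + p
assert induced == {"odd": Fraction(1, 2), "even": Fraction(1, 2)}
```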
III-B Subdistributions and Multidistributions
Given a countable set Ω, a function ρ : Ω → [0, 1] is a probability subdistribution if Σ_{ω∈Ω} ρ(ω) ≤ 1. With a slight abuse of language, we will use the term distribution also for subdistributions. Subdistributions allow us to deal with partial results and non-successful computations.
Order: subdistributions are equipped with the standard pointwise order on functions: ρ ≤ σ if ρ(ω) ≤ σ(ω) for each ω ∈ Ω.
Support: The support of ρ is supp(ρ) = {ω ∈ Ω : ρ(ω) > 0}.
Representation: We represent a distribution by explicitly indicating the support, and (as a superscript) the probability assigned to each element by ρ. We write ρ = {ω₁^(p₁), …, ωₙ^(pₙ)} if ρ(ωᵢ) = pᵢ and ρ(ω) = 0 otherwise.
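These definitions translate directly into code. The following sketch (representing subdistributions as Python dicts, with missing keys counting as probability 0; all names are ours) implements the pointwise order and the support.

```python
from fractions import Fraction

def is_subdistribution(rho):
    """rho maps elements to probabilities; the total mass may be below 1."""
    return sum(rho.values()) <= 1

def leq(rho, sigma):
    """The pointwise order on subdistributions."""
    return all(p <= sigma.get(x, Fraction(0)) for x, p in rho.items())

def support(rho):
    return {x for x, p in rho.items() if p > 0}

# A partial result: half of the probability mass is still unaccounted for.
rho = {"T": Fraction(1, 4), "F": Fraction(1, 4)}
sigma = {"T": Fraction(1, 2), "F": Fraction(1, 2)}
assert is_subdistribution(rho) and is_subdistribution(sigma)
assert leq(rho, sigma) and not leq(sigma, rho)
assert support(rho) == {"T", "F"}
```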
To syntactically represent the global evolution of a probabilistic system, we rely on the notion of multidistribution .
A multiset is a (finite) list of elements, modulo reordering; e.g. [a, b, a] = [a, a, b] is a multiset with three elements. Let X be a countable set and m a multiset of pairs of the form pM, with p ∈ (0, 1] and M ∈ X. We call m = [pᵢMᵢ : i ∈ I] (where the index set I ranges over the elements of the list) a multidistribution on X if Σ_{i∈I} pᵢ ≤ 1.
We write the multidistribution [1M] simply as M. The sum of multidistributions is denoted by +, and it is the concatenation of lists. The product of a scalar q and a multidistribution m is defined pointwise: q·[p₁M₁, …, pₙMₙ] = [qp₁M₁, …, qpₙMₙ].
Intuitively, a multidistribution is a syntactical representation of a discrete probability space where to each element of the space is associated a term of X and a probability measure. To the multidistribution m = [pᵢMᵢ : i ∈ I] we associate a probability distribution μ_m as follows: μ_m(M) = Σ_{Mᵢ=M} pᵢ,
and we call μ_m the probability distribution associated to m.
Example 3 (Distribution vs. multidistribution).
If m = [½M, ½M], then μ_m = {M¹}. Please observe the difference between distribution and multidistribution: if m′ = [1M], then μ_{m′} = μ_m, but m′ ≠ m.
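The operations on multidistributions, and the difference with the associated distribution, can be illustrated as follows (multidistributions as Python lists of (probability, term) pairs; a sketch, with names of our own choosing).

```python
from fractions import Fraction
from collections import defaultdict

# A multidistribution is a finite list of (probability, term) pairs,
# taken modulo reordering, with total probability at most 1.
def msum(m1, m2):
    """Sum of multidistributions: concatenation of lists."""
    return m1 + m2

def mscale(q, m):
    """Product of a scalar and a multidistribution, pointwise."""
    return [(q * p, t) for (p, t) in m]

def associated_distribution(m):
    """Collapse a multidistribution to its associated distribution."""
    d = defaultdict(lambda: Fraction(0))
    for p, t in m:
        d[t] += p
    return dict(d)

half = Fraction(1, 2)
m = msum(mscale(half, [(Fraction(1), "M")]), mscale(half, [(Fraction(1), "M")]))
# The multidistribution keeps the two copies of M apart ...
assert m == [(half, "M"), (half, "M")]
# ... while the associated distribution merges them.
assert associated_distribution(m) == {"M": Fraction(1)}
assert m != [(Fraction(1), "M")]
```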
III-D Binary relations (notations and basic definitions)
Let → be a binary relation on a set X. We denote by →* its reflexive and transitive closure, and by ≡ the reflexive, symmetric and transitive closure of →. If M ∈ X, we write M ↛ if there is no N such that M → N; in this case, M is in →-normal form. Figures convention: as is standard, in the figures solid arrows are universally quantified, dashed arrows are existentially quantified.
Confluence and Commutation
Let →₁, →₂ be binary relations on X. The relations →₁ and →₂ commute if t →₁* r and t →₂* s imply that there is u such that r →₂* u and s →₁* u; they diamond-commute (◇-commute) if t →₁ r and t →₂ s imply that there is u such that r →₂ u and s →₁ u. A relation is confluent (resp. diamond) if it commutes (resp. ◇-commutes) with itself. It is well known that ◇-commutation implies commutation, and that the diamond property implies confluence.
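Diamond-commutation is easy to check mechanically on a finite toy rewrite system. The sketch below (a hypothetical abstract system of our own, not one of the calculi of the paper) implements the definition literally, with relations given as dicts mapping each element to the set of its one-step reducts.

```python
def diamond_commute(r1, r2):
    """Check: a ->1 b and a ->2 c imply some d with b ->2 d and c ->1 d."""
    for a in set(r1) | set(r2):
        for b in r1.get(a, set()):
            for c in r2.get(a, set()):
                if not (r2.get(b, set()) & r1.get(c, set())):
                    return False
    return True

# A square a ->1 b, a ->2 c that closes on d.
r1 = {"a": {"b"}, "c": {"d"}}
r2 = {"a": {"c"}, "b": {"d"}}
assert diamond_commute(r1, r2)

# Removing the closing reductions breaks the diamond.
assert not diamond_commute(r1, {"a": {"c"}})
```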
IV Call-by-Value calculus
In this section we define a CbV probabilistic λ-calculus.
IV-A Syntax
IV-A1 The language
Terms and values are generated respectively by the grammars:
M, N ::= x | λx.M | MN | M ⊕ N (terms)
V ::= x | λx.M (values)
where x ranges over a countable set of variables. Free variables are defined as usual. M{N/x} denotes the term obtained by the capture-avoiding substitution of N for each free occurrence of x in M.
Contexts (C) and surface contexts (S) are generated by the grammars:
C ::= ⟨⟩ | λx.C | CM | MC | C ⊕ M | M ⊕ C
S ::= ⟨⟩ | SM | MS
where ⟨⟩ denotes the hole of the term context. Given a context C, we denote by C⟨M⟩ the term obtained from C by filling the hole with M, allowing the capture of free variables. All surface contexts are contexts. Since the hole will be filled with a redex, surface contexts formalize the fact that the redex (the hole) is in the scope neither of a λ-abstraction nor of a ⊕.
We also consider the set of multidistributions on terms (Sec. III-B).
Observe that, usually, a reduction step is given by the closure under context of the reduction rules. However, to define a reduction from term to term is not informative enough here, because we still have to account for the probability. The meaning of the probabilistic choice M ⊕ N is that this term reduces to either M or N, each with probability 1/2. There are various ways to formalize this fact; here, we use multidistributions.
Reduction Rules and Steps
The reduction rules on the terms of are defined in Fig. 1.
We lift the reduction relation on terms to a relation on multidistributions, as defined in Fig. 3. Observe that the lifted relation is reflexive.
We define in the same way the lifting of any relation to a binary relation on . In particular, we lift to .
A -sequence (or reduction sequence) from is a sequence such that (). We write to indicate that there is a finite sequence from to , and for an infinite sequence.
We write for the transitive, reflexive and symmetric closure of ; abusing the notation, we will write for .
Given , a term is in normal form if there is no such that . We also write . We denote by the set of the normal forms of .
It is immediate to check that all closed normal forms are values; however, a value is not necessarily a normal form.
IV-A3 Full Lifting
The definition of lifting allows us to apply a reduction step to any number of terms in the multidistribution. If no term is reduced, the multidistribution reduces to itself (the relation is reflexive). Another important case is when all terms for which a reduction step is possible are indeed reduced. This notion of full reduction is defined as follows.
Obviously, every full reduction step is also a lifted reduction step. As for the case of lifting, the notion of full lifting can be extended to any reduction relation.
Full lifting plays an important role in Section VII.
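A minimal sketch of full lifting, on a toy reduction consisting only of the ⊕ rule (the representation and all names are ours): every term of the multidistribution that has a redex is reduced, and probabilities multiply along the way.

```python
from fractions import Fraction

# Toy terms: a probabilistic choice is the tuple ("oplus", M, N);
# any other value is treated as already in normal form.
def step(term):
    """One reduction of a term, returned as a multidistribution
    (a list of (probability, term) pairs)."""
    if isinstance(term, tuple) and term[0] == "oplus":
        _, m, n = term
        return [(Fraction(1, 2), m), (Fraction(1, 2), n)]
    return [(Fraction(1), term)]        # no redex: lifting is reflexive

def full_lift(mdist):
    """Full lifting: reduce every term of the multidistribution
    for which a reduction step is possible."""
    out = []
    for p, t in mdist:
        out.extend((p * q, u) for q, u in step(t))
    return out

m0 = [(Fraction(1), ("oplus", ("oplus", "T", "F"), "F"))]
m1 = full_lift(m0)
m2 = full_lift(m1)
assert m1 == [(Fraction(1, 2), ("oplus", "T", "F")), (Fraction(1, 2), "F")]
assert m2 == [(Fraction(1, 4), "T"), (Fraction(1, 4), "F"), (Fraction(1, 2), "F")]
```

Note that the total probability mass is preserved at every step, as in the calculus.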
IV-B The calculus and the λ-calculus
A comparison with the λ-calculus is in order.
Our calculus is a conservative extension of the λ-calculus. A translation can be defined as follows, where z is a fresh variable which is used by no term:
The translation is injective (distinct terms have distinct translations) and preserves values.
Proposition 4 (Simulation).
The translation is sound and complete. Let .
implies there is a (unique) , with and .
IV-C Discussion (Surface Contexts)
The notion of surface context which we have defined is familiar in the setting of the λ-calculus: it corresponds to weak evaluation, which we discussed in Sec. II-C.
In the classical λ-calculus, β-reduction is unrestricted. Closing the reduction rules under surface contexts expresses the fact that a redex is reduced neither under a λ-abstraction nor in the scope of a ⊕. The former is fundamental to confluence: it means that a function which samples from a distribution can be duplicated, but we cannot pre-evaluate the sampling. The latter is a technical simplification, which we adopt to avoid unessential burdens with associativity. Requiring no reduction in the scope of ⊕ is very similar to allowing no reduction in the branches of an if-then-else.
V Confluence and Standardization
We prove that the calculus is confluent. We modularize the proof using the Hindley-Rosen lemma. The notions of commutation and ◇-commutation which we use are reviewed in Sec. III-D.
Let →₁ and →₂ be binary relations on the same set X. Their union is confluent if both →₁ and →₂ are confluent, and →₁ and →₂ commute.
The following criterion allows us to work pointwise when proving commutation and confluence of binary relations on multidistributions.
Lemma 5 (Pointwise Criterion).
Let two relations on terms be given, and consider their liftings (as defined in Sec. IV-A2). Property (*) below implies that the liftings ◇-commute.
(*) If and then s.t. and .
We prove that (**): and imply that there exists such that and .
Let . By definition of lifting, for each , we have and , with and . It is easily checked that, for each , there exists s.t. and . If either or uses reflexivity (rule ), it is immediate to obtain . Otherwise, is given by property (*). Hence satisfies (**).
The reduction is confluent.
Assume and . We first observe that if , then and are respectively of the shape , , with and . By Prop. 4, we can project such reduction sequences on the λ-calculus, obtaining that for each , and . Since the CbV λ-calculus is confluent, there are such that and . By Prop. 4.2, for each there is a unique such that , which concludes the proof. ∎
We prove that the reduction is diamond, i.e., the reduction diagram closes in one step.
The reduction is diamond.
We prove that if and , then there exists such that and . The claim then follows by Lemma 5, by taking .
Let , and . By definition of surface contexts, the two redexes do not overlap: is a subterm of and is a subterm of . Hence we can reduce those redexes in and , obtaining the same .
We prove commutation of and by proving a stronger property: they -commute.
The reductions and -commute.
By Lemma 5, we only need to prove that if and , then there exists such that and .
The proof is by induction on . Cases and are not possible given the hypothesis.
Case . is the only possible -redex. Assume the -redex is inside (the other case is similar), and that , . It is immediate that satisfies the claim.
Case . cannot have the form because neither nor could contain a -redex.
Assume that the -redex is inside , and the -redex inside . We have (with ), (with ). It is immediate that satisfies the claim. The symmetric case is similar.
Assume that both redexes are inside . Let us write as . Assume , , therefore and . We use the inductive hypothesis on to obtain such that , , . We conclude that for , it holds that and .
The reduction is confluent.
Let us say that a multidistribution is a normal-forms multidistribution if all its terms are normal forms. An immediate consequence of confluence is the following:
The normal-forms multidistribution to which a given multidistribution reduces, if any, is unique.
While immediate, the content of Cor. 10 is hardly useful, for two reasons. First, we know that probabilistic termination is not necessarily reached in a finite number of steps; the relevant notion is not that of reducing to a normal-forms multidistribution in finitely many steps, but rather that of a distribution which is defined as the limit of the reduction sequence. Second, in the Plotkin CbV calculus the result of computation is formalized by the notion of value, and considering normal forms as values is unsound (, page 135). In Section VI-B we introduce a suitable notion of limit distribution, and study the implications of confluence on it.
V-B A Standardization Property
In this section, we first introduce surface and left reduction as strategies for our calculus. In the setting of the CbV λ-calculus, the former corresponds to weak reduction, the latter to the standard strategy originally defined in . We then establish a standardization result, namely that every finite reduction sequence can be reorganized into a sequence in which all surface reductions are performed first. A counterexample shows that, in our calculus, a standardization result using left reduction fails.
V-B1 Surface and Left Reduction
Recall that in the λ-calculus, a deterministic strategy defines a function from terms to redexes, associating to every term the next redex to be reduced. More generally, we call reduction strategy for a reduction relation → any relation →s such that →s ⊆ →. The notion of strategy can easily be formalized through the notion of context. With this in mind, let us consider surface and left contexts.
Surface contexts have been defined in Sec. IV-A1.
Left contexts are defined by the following grammar:
Note that in particular a left context is a surface context.
We call surface reduction, denoted by (with lifting ) and left reduction, denoted by (with lifting ), the closure of the reduction rules in Fig. 1 under surface contexts and left contexts, respectively. It is clear that . Observe that .
A reduction step is deep, written , if it is not a surface step. A reduction step is internal (written ) if it is not a left step. Observe that .
() Let , where . Then and ; instead, , .
() Let . Then and , while and
Intuitively, left reduction chooses the leftmost of the surface redexes. More precisely, this is the case for closed terms (for example, the term has a -step, but no -step).
Surface Normal Forms: We denote by the set of surface normal forms. We observe that all values are surface normal forms (but the converse does not hold). The situation is different if we restrict ourselves to closed terms; in fact the following result holds, which is easy to check.
If a term is closed, the following three statements are equivalent:
is a -normal form;
is a -normal form;
is a value.
V-B2 Finitary Surface Standardization
The next theorem proves a standardization result, in the sense that every finite reduction sequence can be (partially) reordered into a sequence of surface steps followed by a sequence of deep steps.
Theorem 13 (Finitary Surface Standardization).
In our calculus, if a multidistribution reduces to another in finitely many steps, then there exists an intermediate multidistribution reachable from the first by surface steps, from which the second is reachable by deep steps.
Finitary Left Standardization does not hold
The following statement is false for our calculus.
“If a multidistribution reduces to another in finitely many steps, then there exists an intermediate multidistribution reachable from the first by left steps, from which the second is reachable by internal steps."
Example 14 (Counter-example).
Let us consider the following sequence, where and . . If we anticipate the reduction of , we have , from which we cannot reach . Observe that the sequence is already surface-standard!
VI Asymptotic Evaluation
The specificity of probabilistic computation is to be concerned with asymptotic behavior; the focus is not what happens after a finite number of steps, but when the number of steps tends to infinity. In this section, we study the asymptotic behavior of reduction sequences with respect to evaluation. The intuition is that a reduction sequence defines a distribution on the possible outcomes of the program. We first clarify what the outcome of evaluating a probabilistic term is, and then formalize the idea of result “at the limit" with the notion of limit distribution (Def. 19). In Sec. VI-B we investigate how the asymptotic results of different sequences starting from the same term compare.
Recall that to any multidistribution on terms is associated a probability distribution, as described in Sec. III-C. We assume the following letter convention: given a multidistribution, we denote the associated distribution by the corresponding Greek letter. If m₀ ⇒ m₁ ⇒ ⋯ is a reduction sequence, then ⟨μₙ⟩ is the sequence of associated distributions.
VI-A Probabilistic Evaluation
VI-A1 To be valuable
In the CbV λ-calculus, the key property of a term is to be valuable, i.e., to reduce to a value. To be valuable is a yes/no property, whose probabilistic analogue is the probability of reducing to a value. If a distribution describes the result of a computation step, the probability that such a result is a value is simply the probability of the event “the result is a value". Since the set of values is closed under reduction, the following property holds:
If and , then , with , and .
Let m₀ ⇒ m₁ ⇒ ⋯ be a reduction sequence, and ⟨μₙ⟩ the sequence of associated distributions. The sequence of reals given by the probability that each μₙ assigns to values is nondecreasing and bounded, because of Fact 15. Therefore the limit exists, and is the supremum. This fact justifies the following definition.
The sequence evaluates with probability p if p is the limit defined above.
A term is p-valuable if p is the greatest probability with which a sequence from it can evaluate.
Let and .
Consider the term where . Then . Since , is -valuable.
Consider the term , where . Then it is immediate that is -valuable.
Let , so that is a divergent term, and let . Then is -valuable.
VI-A2 Result of a CbV computation
The notion of being p-valuable allows for a simple definition, but it is too coarse. Consider Example 16; both points (1.) and (2.) give examples of p-valuable terms. However, in (1.) the probability is concentrated on a single value, while in (2.) two different values have equal probability 1/2. Observe that the two are different normal forms, and are not equivalent. To discriminate between them, we need a finer notion of evaluation. Since the calculus is CbV, the result “at the limit" is intuitively a distribution on the possible values that the term can reach. Some care is needed though, as the following example shows.
Consider Plotkin’s CbV λ-calculus. Let ; the term has the following reduction: . We obtain a reduction sequence where , . Each is a value, but there is no "final" one in which the reduction ends. Transposing this to our calculus, let , . The reduction sequence is -valuable, but the distribution on values is different at every step. In other words, the sequence has no limit. Observe however that all the values are -equivalent.
VI-A3 Observations and Limit distribution
Example 17 motivates the approach that we develop now: the result of probabilistic evaluation is not a distribution on values, but a distribution on some events of interest. In the case of our calculus, the most informative events are equivalence classes of values.
We first introduce the notion of observation, and then that of limit distribution.
A set of observations for is a set such that , and if then .
Given , has probability (similarly to the event “the result is odd" in Example 2).
It follows immediately from the definition that, given a reduction sequence, for each observation the corresponding sequence of probabilities is nondecreasing and bounded, and therefore has a limit, namely its supremum. This allows us to define a distribution on the set of observations.
Let be a set of observations. The sequence defines a distribution , where , for each .
We call such a distribution the limit distribution of the sequence. Letter convention: bold Greek letters denote limit distributions.
The sequence converges to (or evaluates to) the limit distribution , written
If has a sequence which converges to , we write
Given , we denote by the set of all limit distributions of . If has a greatest element, we indicate it by .
If is clear from the context, we omit the index which specifies it, and simply write , , .
The notion of limit distribution formalizes the result of evaluating a probabilistic term, once we choose the set of observations which interest us. In Sec. VI-B we prove that confluence implies that the set of limit distributions of a term has a unique maximal element.
Sets of Observations for