Paradoxes of Probabilistic Programming

01/09/2021 ∙ by Jules Jacobs, et al.

Probabilistic programming languages allow programmers to write down conditional probability distributions that represent statistical and machine learning models as programs that use observe statements. These programs are run by accumulating likelihood at each observe statement, and using the likelihood to steer random choices and weigh results with inference algorithms such as importance sampling or MCMC. We argue that naive likelihood accumulation does not give desirable semantics and leads to paradoxes when an observe statement is used to condition on a measure-zero event, particularly when the observe statement is executed conditionally on random data. We show that the paradoxes disappear if we explicitly model measure-zero events as a limit of positive measure events, and that we can execute this type of probabilistic program by accumulating infinitesimal probabilities rather than probability densities. Our extension improves probabilistic programming languages as an executable notation for probability distributions by making them better behaved and more expressive, by allowing the programmer to be explicit about which limit is intended when conditioning on an event of measure zero.


1. Introduction

Probabilistic programming languages such as Stan (Carpenter_stan:a), Church (GoodmanMRBT08), and Anglican (WoodMM14) allow programmers to express probabilistic models in statistics and machine learning in a structured way, and to run these models with generic inference algorithms such as importance sampling, Metropolis-Hastings, SMC, and HMC. At its core, a probabilistic programming language is a notation for probability distributions that looks much like normal programming with calls to random number generators, but with an additional observe construct.

There are two views on probabilistic programming. The pragmatist says that probabilistic programs are a convenient way to write down a likelihood function, and the purist says that probabilistic programs are a notation for structured probabilistic models. The pragmatist interprets an observe statement as “soft conditioning”, or imperatively multiplying the likelihood function by some factor. The purist interprets an observe statement as true probabilistic conditioning in the sense of conditional distributions. The pragmatist may also want to write a probabilistic program to compute the likelihood function of a conditional distribution, but the pragmatist is not surprised that there are nonsensical probabilistic programs that do not express any sensible statistical model. After all, if one writes down an arbitrary likelihood function then it will probably not correspond to a sensible, structured, non-trivial statistical model. The pragmatist blames the programmer for writing nonsensical programs, just as it would have been the fault of the programmer if they had written down the same likelihood function manually. The purist, on the other hand, insists that any probabilistic program corresponds to a structured statistical model, and that each observe statement in a probabilistic program has a probabilistic interpretation whose composition results in the statistical model. We will show that the current state of affairs is not satisfactory for the purist, and we will show how to make probabilistic programming languages satisfactory in this respect.

The difficulties with conditioning in probabilistic programs can be traced back to a foundational issue in probability theory. When the event B being conditioned on has nonzero probability, the conditional distribution is defined as:

  P(A | B) = P(A ∩ B) / P(B)

However, this formula for conditional probability is undefined when P(B) = 0, since then also P(A ∩ B) = 0 and the fraction 0/0 is undefined. In probabilistic programming we often wish to condition on events with probability 0, such as “X = x”, where X is a continuous random variable. There are several methods to condition on measure-zero events. For continuous distributions that have probability density functions, we can replace the probabilities in the above formula with probability densities, which are (usually) nonzero even if the probability is zero. For more complicated situations, we can use the Radon–Nikodym derivative or disintegration (Chang93conditioningas; ShanR17; DahlqvistK20; ackerman_freer_roy_2017).

A general method for conditioning on measure-zero events is to define a sequence of events B_ε parameterized by a number ε > 0 such that B_ε in some sense converges to B in the limit ε → 0, but P(B_ε) > 0 for all ε > 0. We then define the conditional distribution to be the limit of P(A | B_ε):

  P(A | B) = lim_{ε→0} P(A ∩ B_ε) / P(B_ε)

In the book Probability Theory: The Logic of Science (jaynes03), E.T. Jaynes explains that conditioning on measure-zero events is inherently ambiguous, because it depends not just on the event itself but also on the limiting operation we choose:

Yet although the sequences and tend to the same limit “”, the conditional densities [ and ] tend to different limits. As we see from this, merely to specify “” without any qualifications is ambiguous; it tells us to pass to a measure-zero limit, but does not tell us which of any number of limits is intended. […] Whenever we have a probability density on one space and we wish to generate from it one on a subspace of measure zero, the only safe procedure is to pass to an explicitly defined limit by a process like [ and ]. In general, the final result will and must depend on which limiting operation was specified. This is extremely counter-intuitive at first hearing; yet it becomes obvious when the reason for it is understood.

The other methods implicitly make the choice of limit for us. Conditioning on events of measure zero using those methods can lead to paradoxes such as the Borel–Kolmogorov paradox, even in the simplest case when probability density functions exist. Paradoxes occur because seemingly unimportant restatements of the problem, such as using a different parameterization for the variables, can affect the choice of B_ε that those methods make, and thus change the value of the limit. Consider the following probabilistic program:

  h = rand(Normal(1.7, 0.5))
  if rand(Bernoulli(0.5))
     observe(Normal(h, 0.1), 2.0)
  end

We first sample a value (say, a person’s height) from a prior normally distributed around 1.7 meters, and then with probability 0.5 we observe a measurement normally distributed around the height to be 2.0. We ran this program in Anglican with importance sampling, and obtained the following expectation values for h: 1.812, 1.814, 1.823, 1.813, 1.806 (10000 samples each). Suppose that we had measured the height in centimeters instead of meters:

  h = rand(Normal(170, 50))
  if rand(Bernoulli(0.5))
     observe(Normal(h, 10), 200)
  end

We might naively expect this program to produce roughly the same output as the previous program, but multiplied by a factor of 100 to account for the conversion of meters to centimeters. Instead, we get 170.1, 170.4, 171.5, 170.2, 169.4. This behavior happens because, even though the units of the program appear to be correct, the calculations that importance sampling does to estimate the expectation value involve arithmetic with inconsistent units (in this case, adding a quantity with units of inverse length to a dimensionless quantity). The issue is not particular to Anglican or importance sampling, but due to the interaction of stochastic branching with the way the likelihood is calculated with probability densities; other algorithms (paige-nips-2014; tolpin-ecml-2015) have the same behavior. In fact, formal semantics based on likelihood accumulation, such as the commutative semantics (10.1007/978-3-662-54434-1_32) and the semantics based on Quasi-Borel spaces (HeunenKSY17), also perform arithmetic with inconsistent units for this example. Lexicographic likelihood weighting (pmlr-v80-wu18f) does give the right answer for this example (many thanks to Alex Lew for pointing this out), but still exhibits unit anomalies for other examples described in Section 3.

Unit errors in a programming language’s implementation or semantics may seem like a very serious issue, but we do not believe that this is a show-stopper in practice, because practitioners can always take the pragmatist view and avoid writing such programs. Although we consider this to be an important foundational issue, it does not invalidate existing work on probabilistic programming.

It is known that conditionals can be problematic. Some inference algorithms, like SMC, will make assumptions that exclude observe inside conditionals. For example, (probprogfrankwood) mentions the following when describing SMC:

Each breakpoint needs to occur at an expression that is evaluated in every execution of a program. In particular, this means that breakpoints should not be associated with expressions inside branches of if expressions. […] An alternative design, which is often used in practice, is to simply break at every observe and assert that each sample has halted at the same point at run time.

If the design where breakpoints happen at every observe is used, then the run-time assertion that every sample has halted at the same point will fail for programs that have observes inside conditionals, so such programs are still disallowed with SMC. Languages such as Stan, which do not have or do not allow stochastic branching, also do not suffer from the preceding example. In Section 3 we will show that the problem is not limited to conditionals; there are programs that do not have conditionals but nevertheless have paradoxical behavior. Furthermore, we show that the standard method of likelihood accumulation for implementing probabilistic programming languages can sometimes obtain an answer that disagrees with the purist’s exact value for P(A | B) even if P(B) is nonzero, due to a confusion between probabilities and probability densities.

We identify three types of paradoxes that affect probabilistic programming languages that allow dynamically conditioning on events of measure zero. These paradoxes are based on the idea that it should not matter which parameter scale we use for variables. It shouldn’t matter whether we use meters or centimeters to measure height, but it also shouldn’t matter whether we use energy density or decibels to measure sound intensity. The change from centimeters to meters involves a linear parameter transformation (multiplication by a constant factor), whereas the change from energy density to decibels involves a nonlinear (logarithmic) parameter transformation. We give several example programs that show that the output of a probabilistic program can depend on the parameter scale used when we condition on events of measure zero.

Following Jaynes’ advice, we extend the language with notation for explicitly choosing which limit we mean in an observe statement. We give an implementation of likelihood accumulation using infinitesimal probabilities instead of probability densities, and show that this does not suffer from the three types of paradoxes. Infinitesimal probabilities give meaning to conditioning on measure-zero events in terms of a limit of events of strictly positive measure. Since events of strictly positive measure are unproblematic, paradoxes can no longer occur.

Furthermore, we add explicit language support for parameter transformations. This is only soundly possible due to the introduction of infinitesimal probabilities. We show that introducing a parameter transformation in an observe statement does not change the behavior of the probabilistic program. That is, we show that in our language, observe(D,I) has the same behavior as observe(D’,I’) where D’,I’ is D,I in a different parameter scale.


Our contributions are the following.

  • We identify a problem with existing probabilistic programming languages, in which likelihood accumulation with probability densities can result in three different types of paradoxes when conditioning on a measure-zero event. The three paradoxes violate the principle that the output of a program should not depend on the parameter scale used (Section 3).

  • We analyze the event that probabilistic programs with observe statements condition on, taking the paradox-free discrete case as a guide, in order to determine what observe ought to mean in the continuous case (Section 2).

  • We propose a change to probabilistic programming languages to avoid the paradoxes of the continuous measure-zero case, by changing the observe construct to condition on measure-zero events as an explicit limit of positive-measure events (Sections 4 and 5), and we provide

    • a method for computing the limit by accumulating infinitesimal probabilities instead of probability densities, which we use to implement the adjusted observe construct,

    • a theorem that shows that infinitesimal probabilities correctly compute the limit of the interval-based conditioning as the interval widths tend to zero, ensuring that programs that use observe on measure-zero events are paradox free,

    • a translation from the existing observe construct to our new observe construct, which gives the same output if the original program was non-paradoxical,

    • language support for parameter transformations, which we use to show that the meaning of programs in our language is stable under parameter transformations,

    • an implementation of our language as an embedded DSL in Julia (measurezeroartifact) (Section 6).

2. On the Event that Observe Conditions On

Different probabilistic programming languages have different variants of the observe statement. Perhaps its simplest variant, observe(b), takes a boolean b and conditions on that boolean being true. For instance, if we throw two dice and want to condition on the sum of the dice being 8, we can use the following probabilistic program, in pseudocode:

  function twoDice()
     x = rand(DiscreteUniform(1,6))
     y = rand(DiscreteUniform(1,6))
     observe(x + y == 8)
     return x
  end

The program twoDice represents the conditional distribution of x given x + y == 8, where x and y are uniformly distributed numbers from 1 to 6. We wrap the program in a function and use the return value to specify the variable x whose distribution we are interested in. Anglican has a defquery construct analogous to the function definition that we use here.

Probabilistic programming languages allow us to sample from the distribution specified by the probabilistic program and compute expectation values. The simplest method to implement observe is rejection sampling (vonNeumann1951; GoodmanMRBT08): we start a trial by running the program from the beginning, drawing random samples with rand, and upon encountering observe(x + y == 8) we test the condition, and if the condition is not satisfied we reject the current trial and restart the program from the beginning hoping for better luck next time. If all observes in a trial are satisfied, then we reach the return statement and obtain a sample for x. We estimate expectation values by averaging multiple samples.
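As a concrete (and deliberately naive) illustration, the following sketch implements rejection sampling for the twoDice program as plain Julia, assuming the Distributions.jl package; it is our own illustration, not the paper’s artifact.

  # Rejection sampling for twoDice: restart the trial whenever the observe fails.
  using Distributions, Statistics

  function two_dice_rejection()
     while true                           # keep starting new trials
        x = rand(DiscreteUniform(1, 6))
        y = rand(DiscreteUniform(1, 6))
        if x + y == 8                     # observe(x + y == 8)
           return x                       # trial accepted
        end                               # otherwise: reject and retry
     end
  end

  samples = [two_dice_rejection() for _ in 1:10_000]
  println(mean(samples))                  # ≈ E[x | x + y = 8] = 4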

What makes probabilistic programs such an expressive notation for probability distributions is that we have access to the full power of a programming language, such as control flow and higher order functions (HeunenKSY17). The following example generates two random dice throws x and y, and a random boolean b, and uses an observe statement inside control flow to condition on the sum of the dice throws being 8 only if b = true:

  x = rand(DiscreteUniform(1,6))
  y = rand(DiscreteUniform(1,6))
  b = rand(Bernoulli(0.5))
  if b
     observe(x + y == 8)
  end
  return x

This code expresses the conditional probability distribution of x given the event E, where x, y, and b are distributed according to the given distributions, and E is the event (b = true and x + y = 8) or (b = false). That is, a trial is successful if b = false or if x + y = 8.

In general, a probabilistic program conditions on the event that the tests of all observe statements that are executed succeed. A bit more formally, we have an underlying probability space Ω and we think of an element ω ∈ Ω as the “random seed” that determines the outcome of all rand calls (it is sufficient to take Ω = [0,1]; a real number contains an infinite amount of information, sufficient to determine the outcome of an arbitrary number of rand calls, even if those calls are sampling from continuous distributions). The execution trace of the program is completely determined by the choice of ω. For some subset E ⊆ Ω, the tests of all the observe calls that are executed in the trace succeed. This is the event that a probabilistic program conditions on. Rejection sampling gives an intuitive semantics for the observe statement:

For a boolean b, the statement observe(b) means that we only continue with the current trial if b = true. If b = false we reject the current trial.

Unfortunately, rejection sampling can be highly inefficient when used to run a probabilistic program. If we use 1000-sided dice instead of 6-sided dice, the probability that the sum is a particular fixed value is very small, so most trials will be rejected and it may take a long time to obtain a successful sample. Probabilistic programming languages therefore have a construct observe(D,x) that means observe(rand(D) == x), but can be handled by more efficient methods such as importance sampling or Markov chain Monte Carlo (MCMC). The previous example can be written using this type of observe as follows:

  x = rand(DiscreteUniform(1,6))
  b = rand(Bernoulli(0.5))
  if b
     observe(DiscreteUniform(1,6), 8 - x)
  end
  return x

This relies on the fact that x + y == 8 is equivalent to y == 8 - x. The intuitive semantics of observe(D,x) is as follows:

For discrete distributions D, the statement observe(D,x) means that we sample from D and only continue with the current trial if the sampled value is equal to x.

This variant of observe can be implemented more efficiently than rejection sampling. We keep track of the weight of the current trial that represents the probability that the trial is still active (i.e. the probability that it was not yet rejected). An observe(D,x) statement will multiply the weight of the current trial by the probability P(D,x) that a sample from D is equal to x:

For discrete distributions D, the statement observe(D,x) gets executed as weight *= P(D,x), where P(D,x) is the probability of x in D.

The output of a trial of a probabilistic program is now a weighted sample: a pair of a random value x and a weight. Weighted samples can be used to compute expectation values as weighted averages (this is called importance sampling; more advanced MCMC methods can use the weight to make intelligent choices for what to return from rand calls, whereas importance sampling uses a random number generator for rand calls; we focus on importance sampling because it is the simplest method beyond rejection sampling). Estimating an expectation value using importance sampling will usually converge faster than rejection sampling, because importance sampling’s observe will deterministically weigh the trial by the probability P(D,x) rather than randomly rejecting the trial with probability 1 - P(D,x). If P(D,x) = 0.01 then rejection sampling would reject 99% of trials, which is obviously very inefficient. It is important to note that multiplying weight *= P(D,x) is the optimized implementation of observe, and we may still semantically think of it as rejecting the trial if sample(D) != x.
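To make the weight-accumulation view concrete, here is a minimal likelihood-weighting (importance sampling) sketch for the conditional dice example, again assuming Distributions.jl; the names and structure are our own.

  # Likelihood weighting: each trial returns a value together with its weight.
  using Distributions

  function dice_trial()
     weight = 1.0
     x = rand(DiscreteUniform(1, 6))
     b = rand(Bernoulli(0.5))
     if b
        # observe(DiscreteUniform(1,6), 8 - x): multiply by P(D, 8 - x)
        weight *= pdf(DiscreteUniform(1, 6), 8 - x)
     end
     return (x, weight)
  end

  trials = [dice_trial() for _ in 1:100_000]
  estimate = sum(w * x for (x, w) in trials) / sum(w for (_, w) in trials)
  println(estimate)   # weighted-average estimate of E[x]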

If the distribution D is a continuous distribution, then the probability that a sample from D is equal to any particular value x becomes zero, so rejection sampling will reject 100% of trials; it becomes infinitely inefficient. This is not surprising, because on the probability theory side, the event that we are now conditioning on has measure zero. Importance sampling, on the other hand, continues to work in some cases, provided we replace probabilities with probability densities:

For continuous distributions D, the statement observe(D,x) gets executed as weight *= pdf(D,x), where pdf(D,x) is the probability density of x in D.

For instance, if we want to compute E[x] where x and y are distributed according to Normal(2,3) distributions, conditioned on their sum being 8, we can use the following probabilistic program:

  x = rand(Normal(2,3))
  observe(Normal(2,3), 8 - x)
  return x

This allows us to draw (weighted) samples from the distribution of x given x + y = 8, where x and y are distributed according to Normal(2,3). Unfortunately, as we shall see in the next section, unlike the discrete case, we do not in general have a probabilistic interpretation for observe(D,x) on continuous distributions D when control flow is involved, and we can get paradoxical behavior even if control flow is not involved.

3. Three Types of Paradoxes

We identify three types of paradoxes. The first two involve control flow, where we either execute observe on different variables in different control flow paths, or an altogether different number of observes in different control flow paths. The third paradox is a variant of the Borel–Kolmogorov paradox and involves non-linear parameter transformations.

3.1. Paradox of Type 1: Different Variables Observed in Different Control Flow Paths

Consider the following probabilistic program:

  h = rand(Normal(1.7, 0.5))
  w = rand(Normal(70, 30))
  if rand(Bernoulli(0.5))
     observe(Normal(h, 0.1), 2.0)
  else
     observe(Normal(w, 5), 90)
  end
  bmi = w / h^2

We sample a person’s height and weight from a prior, and then we observe a measurement of the height or of the weight depending on the outcome of a coin flip. Finally, we calculate the BMI, and want to compute its average. If m_h is the measurement sampled from Normal(h, 0.1), m_w is the measurement sampled from Normal(w, 5), and b is the boolean sampled from Bernoulli(0.5), then the event that this program conditions on is (b = true and m_h = 2.0) or (b = false and m_w = 90). This event has measure zero.

Just like the program in the introduction, this program exhibits surprising behavior when we change from meters to centimeters: even after adjusting the BMI formula to account for the change of units, the estimated expectation value for bmi still changes. Why does this happen?

The call to observe(D,x) is implemented as multiplying the weight by the probability density of x in D. Importance sampling runs the program many times, and calculates the estimate for bmi as a weighted average. Thus the program above effectively gets translated as follows by the implementation:

  weight = 1
  h = rand(Normal(1.7, 0.5))
  w = rand(Normal(70, 30))
  if rand(Bernoulli(0.5))
     weight *= pdf(Normal(h, 0.1), 2.0)
  else
     weight *= pdf(Normal(w, 5), 90)
  end
  bmi = w / h^2

Where pdf(Normal(μ,σ), x) is the probability density function of the normal distribution:

  pdf(Normal(μ,σ), x) = (1 / (σ·√(2π))) · exp(−(x − μ)² / (2σ²))

Importance sampling runs this program n times, obtaining a sequence of pairs (bmi_1, weight_1), …, (bmi_n, weight_n).
It estimates E[bmi] with a weighted average:

  E[bmi] ≈ (∑_i weight_i · bmi_i) / (∑_i weight_i)

The problem that causes this estimate to change if we change the units of h is that the formula adds quantities with inconsistent units: some of the weights have unit m⁻¹ (inverse length) and some have unit kg⁻¹ (inverse mass).

It might be surprising that the weights have units at all, but consider that if we have a probability distribution over values of unit m, then the probability density function has units m⁻¹. The formula for the normal pdf shows this in the factor 1/(σ·√(2π)) in front of the (unitless) exponential, which has a unit because σ has a unit.

The call pdf(Normal(h, 0.1), 2.0) has units m⁻¹ and the call pdf(Normal(w, 5), 90) has units kg⁻¹, and thus the variable weight has units m⁻¹ or kg⁻¹ depending on the outcome of the coin flip. The weighted average estimate for E[bmi] adds weights of different runs together, which means that it adds values of unit m⁻¹ to values of unit kg⁻¹. This manifests itself in the estimate changing depending on whether we use m or cm: computations that do arithmetic with inconsistent units may give different results depending on the units used. This calls into question whether this estimate is meaningful, since the estimate depends on whether we measure a value in m or cm, or in kg or g, which arguably should not matter at all.
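The unit dependence can be observed numerically. The sketch below (our own illustration, assuming Distributions.jl) runs the density-based weighting for the BMI program once with heights in meters and once in centimeters, reusing the same random seed so that the sampled latent values correspond exactly; the two estimates nevertheless disagree, for the reasons given above.

  # Density-based weighting for the BMI example, in meters vs. centimeters.
  using Distributions, Random

  function bmi_estimate(; meters::Bool, n = 100_000)
     num, den = 0.0, 0.0
     for _ in 1:n
        weight = 1.0
        h = meters ? rand(Normal(1.7, 0.5)) : rand(Normal(170, 50))
        w = rand(Normal(70, 30))
        if rand(Bernoulli(0.5))
           # weight has units 1/m or 1/cm in this branch ...
           weight *= meters ? pdf(Normal(h, 0.1), 2.0) : pdf(Normal(h, 10), 200)
        else
           # ... and units 1/kg in this branch
           weight *= pdf(Normal(w, 5), 90)
        end
        bmi = meters ? w / h^2 : w / (h / 100)^2    # same quantity in both scales
        num += weight * bmi
        den += weight
     end
     return num / den
  end

  Random.seed!(1); println(bmi_estimate(meters = true))
  Random.seed!(1); println(bmi_estimate(meters = false))   # differs from the line above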

The reader might now object that conditionally executed observe statements are always wrong, and probabilistic programs that use them should be rejected as erroneous. However, in the discrete case there are no unit errors, because in that case the weight gets multiplied by a probability rather than a probability density, and probabilities are unitless. Furthermore, in the preceding section we have seen that conditionally executed observe statements have a rejection sampling interpretation in the discrete case. This gives the programs a probabilistic meaning in terms of conditional distributions, even if the discrete observe statements are inside conditionals. The event that is being conditioned on involves the boolean conditions of the control flow. Ideally we would therefore not want to blame the programmer for using conditionals, but change the implementation of observe on continuous variables so that the program is meaningful in the same way that the analogous program on discrete variables is meaningful.

3.2. Paradox of Type 2: Different Number of Observes in Different Control Flow Paths

Let us analyze the program from the introduction:

  h = rand(Normal(1.7, 0.5))
  if rand(Bernoulli(0.5))
     observe(Normal(h, 0.1), 2.0)
  end
  return h

This program exhibits unit anomalies for the same reason: some of the weights have units and some have no units, and adding those leads to the surprising behavior. Rather than taking this behavior as a given, let us analyze what this program ought to do, if we reason by analogy to the discrete case.

This program has the same structure as the dice program from Section 2, the difference being that we now use a normal distribution instead of a discrete uniform distribution. By analogy to that discrete case, the event being conditioned on is (b = true and m = 2.0) or (b = false), where b is the boolean sampled from Bernoulli(0.5) and m is the measurement from Normal(h, 0.1).

Surprisingly, this event does not have measure zero! The event (b = true and m = 2.0) has measure zero, but the event (b = false) has measure 1/2, so the entire event has measure 1/2. We can therefore unambiguously apply the definition of conditional probability P(A | B) = P(A ∩ B)/P(B). Our probability space is generated by h ~ Normal(1.7, 0.5), m ~ Normal(h, 0.1), and b ~ Bernoulli(0.5). The posterior places all its mass on b = false, so the marginal posterior for h is simply the prior Normal(1.7, 0.5). That is, the whole if statement with the observe ought to have no effect.

We can understand this intuitively in terms of rejection sampling: if the sampled boolean is true, then the observe statement will reject the current trial with probability 1, because the probability of sampling exactly 2.0 from a normal distribution is zero. Hence if b = true then the trial will almost surely get rejected, whereas if b = false the trial will not get rejected. The successful trials with b = true are negligibly rare, so even though the expectation of h is affected in those trials, they do not contribute to the final expectation value; only trials with b = false do.

As an aside: if we added an extra unconditional observe(Normal(h, 0.1), 1.9) to the program, then the whole event would have measure zero, but nevertheless, trials with b = false would dominate over trials with b = true, relatively speaking. In general, the control flow path with the smallest number of continuous observes dominates. If there are multiple control flow paths with the minimal number of observes, but also control flow paths with a larger number of observes, we may have a paradox of mixed type 1 & 2.

This reasoning would imply that the if statement and the observe statement are irrelevant; the program ought to be equivalent to return rand(Normal(1.7, 0.5)). If this still seems strange, consider the following discrete analogue:

  h = rand(Binomial(10000, 0.5))
  if rand(Bernoulli(0.5))
     observe(Binomial(10000, 0.9), h)
  end
  return h

That is, we first sample h between 0 and 10000 according to a binomial distribution, and then with probability 0.5 we observe that h is equal to a number sampled from another binomial distribution that also gives a number between 0 and 10000. Since that second binomial distribution is highly biased toward numbers close to 9000, we might expect the average value of h to lie significantly higher than 5000. This is not the case. The rejection sampling interpretation tells us that most of the trials where the coin flipped true will be rejected, because the sample from Binomial(10000, 0.9) is almost never equal to h. Thus, although those samples have an average significantly above 5000, almost all of the successful trials will be trials where the coin flipped false, and thus the expected value of h will lie very close to 5000.

Since we know that rejection sampling agrees with importance sampling in expectation, importance sampling will also compute an estimate for the expectation value of h that lies very close to 5000. The further we increase the number 10000, the stronger this effect becomes, because the probability that the second sample is equal to h further decreases. In the continuous case this probability becomes 0, so the successful samples will almost surely be from trials where the coin flipped to false. Therefore the average value of h in the continuous case should indeed be 1.7, unaffected by the if statement and the observe.

Another way to express this point is that in the discrete case importance sampling, rejection sampling, and the exact value given by the conditional expectation are all in agreement, even if conditionals are involved. On the other hand, in the continuous case, importance sampling with probability densities gives a different answer than rejection sampling and the exact value given by the conditional expectation (the latter two are equal to each other; both give 1.7).

The reader may insist that the semantics of the program is defined to be weight accumulation with probability densities, that is, the semantics of the program is defined to correspond to

  weight = 1
  h = rand(Normal(1.7, 0.5))
  if rand(Bernoulli(0.5))
     weight *= pdf(Normal(h, 0.1), 2.0)
  end
  return h

We can only appeal to external principles to argue against this, such as unit consistency, analogy with the discrete case, the probabilistic interpretation of observe, and the rejection sampling interpretation of observe, but the reader may choose to lay those principles aside and take this implementation of observe as the semantics of observe. We do hope to eventually convince this reader that a different implementation of observe that does abide by these principles could be interesting. Although our semantics will differ from the standard one, it will agree with lexicographic likelihood weighting (pmlr-v80-wu18f) for this example, which does not exhibit this particular paradox.

3.3. Paradox of Type 3: Non-Linear Parameter Transformations

Consider the problem of conditioning on x = y given x ~ Normal(10,5) and y ~ Normal(15,5), and computing the expectation E[exp(x)]. Written as a probabilistic program,

  x = rand(Normal(10,5))
  observe(Normal(15,5),x)
  return exp(x)

In a physical situation, x and y might be values measured in decibels and exp(x) and exp(y) may be (relative) energy densities. We could change parameters to a = exp(x) and b = exp(y). Then a ~ LogNormal(10,5) and b ~ LogNormal(15,5). Since the event x = y is the same as the event a = b, we might naively expect the program to be equivalent to:

  a = rand(LogNormal(10,5))
  observe(LogNormal(15,5),a)
  return a

This is not the case. The two programs give different expectation values. Compared to type 1 & 2 paradoxes, this type 3 paradox shows that the subtlety is not restricted to programs that have control flow or to distributions that are not continuous; the normal and lognormal distributions are perfectly smooth.

This paradox is closely related to the Borel–Kolmogorov paradox. Another variant of the original Borel–Kolmogorov paradox is directly expressible in Hakaru (ShanR17), but not in Anglican or Stan. Hakaru allows the programmer to condition on a measure-zero condition such as x + y == 8 directly, without having to manually invert the relationship to y == 8 - x, and performs symbolic manipulation to do exact Bayesian inference. Hakaru allows a single such observe at the very end of a program, which allows it to sidestep the previous paradoxes related to control flow. The semantics of the single observe is defined by disintegration, which means that the semantics of a Hakaru program depends on the form of the observed expression. That is, if we take another expression with the same solution set, the output may change. The programmer can use this mechanism to control which event they want to condition on. Our version of the paradox shows that the subtlety of conditioning on measure-zero events is not restricted to programs that use that type of disintegration.

4. Avoiding Events of Measure Zero with Intervals

Unit anomalies cannot occur with discrete distributions, because in the discrete case we only deal with probabilities and not with probability densities. Recall that for discrete probability distributions D, an observe statement observe(D,x) gets executed as weight *= P(D,x), where P(D,x) is the probability of x in the distribution D. Probabilities have no units, so the weight variable stays unitless and the weighted average is always unit correct if the probabilistic program is unit correct, even if observe statements get executed conditionally. Furthermore, in the discrete case we have a probabilistic and rejection sampling interpretation of observe, and we may view weight accumulation as an optimization to compute the same expectation values as rejection sampling, but more efficiently. We wish to extend these good properties to the continuous case.

The reason that the discrete case causes no trouble is not that D is discrete per se. The reason it causes no trouble is that P(D,x) is a probability rather than a probability density. In the continuous case the probability that rand(D) == x is zero, and that is why it was necessary to use probability densities. However, even in the continuous case, the probability that a sample from D lies in some interval is generally nonzero. We shall therefore change the observe statement to observe(D,I) where I is an interval, which conditions on the event rand(D) ∈ I. In the discrete case we can allow I to be a singleton set, but in the continuous case we insist that I is an interval of nonzero width.

We have the following rejection sampling interpretation for observe(D,I):

For continuous or discrete distributions D, the statement observe(D,I) means that we sample from D and only continue with the current trial if the sampled value lies in I.

And the following operational semantics for observe(D,I):

For continuous or discrete distributions D, the statement observe(D,I) gets executed as weight *= P(D,I) where P(D,I) is the probability that a value sampled from D lies in I.

Let I = [a, b]. We can calculate P(D, I) = cdf(D, b) − cdf(D, a) using the cumulative distribution function cdf. This probability allows us to update the weight of the trial. For instance, a call observe(Normal(2.0,0.1), [a,b]) can be executed as weight *= normalcdf(2.0,0.1,b) - normalcdf(2.0,0.1,a), where normalcdf(μ,σ,x) is the cumulative distribution function of the normal distribution.

Notice how this change from probability densities to probabilities prevents unit anomalies: if we change the variables from meters to centimeters, then we must write observe(Normal(200,10), [100*a,100*b]), which gets executed as weight *= normalcdf(200,10,100*b) - normalcdf(200,10,100*a). We introduced a factor 100 to convert a and b from meters to centimeters. This conversion ensures that the result of the program remains unchanged, because normalcdf(200,10,100*x) = normalcdf(2.0,0.1,x) for all x. Hence the computed weight will be exactly the same whether we work with meters or centimeters. On the other hand, for the probability density function it is not the case that normalpdf(200,10,100*x) = normalpdf(2.0,0.1,x). It is precisely this lack of invariance that causes unit anomalies with probability densities.
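As a quick check of this invariance, the following sketch (ours, assuming Distributions.jl) computes the interval probability with cdf differences in both unit systems, and contrasts it with the probability density, which does change under rescaling.

  # Interval probability P(D, [a,b]) via the cdf, and its invariance under rescaling.
  using Distributions

  prob(D, a, b) = cdf(D, b) - cdf(D, a)

  a, b = 1.95, 2.05                                    # interval in meters
  p_m  = prob(Normal(2.0, 0.1), a, b)                  # weight in meters
  p_cm = prob(Normal(200, 10), 100 * a, 100 * b)       # same observation in centimeters
  println((p_m, p_cm))                                 # identical (up to rounding)

  println(pdf(Normal(2.0, 0.1), 2.0))                  # density in 1/m ...
  println(pdf(Normal(200, 10), 200.0))                 # ... vs density in 1/cm: differs by 100x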

4.1. Conditioning on Measure Zero Events as a Limit of Positive Measure Events

We can approximate the old observe(D,x) behavior with observe(D,I) by choosing I to be a very small interval of width w around x (taking w to be a small number, such as w = 0.0001). This has two important advantages over observe(D,x):

  1. We no longer get unit anomalies or other paradoxes; if we change the units of x, we must also change the units of w, which keeps the weight the same.

  2. Unlike for observe(D,x), we have an unambiguous probabilistic and rejection sampling interpretation of observe(D,I) for intervals of nonzero width, because the event being conditioned on has nonzero measure.

However, the number w = 0.0001 is rather arbitrary. We would like to let w → 0 and recover the functionality of observe(D,x) to condition on an exact value. With sufficiently small w we can get arbitrarily close, but we can never recover its behavior exactly.

We therefore parameterize probabilistic programs by a dimensionless parameter eps. The BMI example then becomes:

  function bmi_example(eps)
     h = rand(Normal(170, 50))
     w = rand(Normal(70, 30))
     if rand(Bernoulli(0.5))
       observe(Normal(200, 10), (h, A*eps))
     else
        observe(Normal(90, 5), (w, B*eps))
     end
     return w / h^2
  end

Since eps is dimensionless, we cannot simply use eps as the width of the intervals: because h is in cm, the width of the interval around h has to be in cm, and the width of the interval around w has to be in kg. We are forced to introduce a constant A with units cm and a constant B with units kg that multiply eps in the widths of the intervals in the observes.

We could now run importance sampling on bmi_example(eps) for n=10000 trials for eps=0.1, eps=0.01, eps=0.001 and so on, to see what value it converges to. If we run each of these independently, then the rand calls will give different results, so there will be different randomness in each of these, and it may be difficult to see the convergence. In order to address this, we can run the program with different values of eps but with the same random seed for the random number generator. This will make the outcomes of the rand calls the same regardless of the value of eps. In fact, for a given random seed, the result of running importance sampling for a given number of trials will be a deterministic function f(seed,eps) of the random seed and eps.

If we assume that the program uses eps only in the widths of the intervals, and not in the rest of the program, then for a fixed seed, the function f(seed, eps) will be a function of eps of a specific form, because importance sampling computes

  f(seed, eps) = (∑_i weight_i(eps) · x_i) / (∑_i weight_i(eps))

In this fraction, the weights weight_i(eps) are a function of eps, but the returned values x_i are independent of eps if eps only occurs inside the widths of intervals. Since the weight gets multiplied by P(D,I) on each observe(D,I), each weight_i(eps) is of a very specific form:

  weight_i(eps) = c_i · ∏_j ( cdf(D_j, x_j + w_j·eps/2) − cdf(D_j, x_j − w_j·eps/2) )

where the constant c_i contains all the probabilities accumulated from observes that did not involve eps, multiplied by a product of probabilities from observes that did involve eps. Since the cdfs are known functions, we could, in principle, determine the precise function weight_i(eps) and hence f(seed, eps) for any given seed. We could then, in principle, compute the exact limit of this function as eps → 0 with a computer algebra system. This is, of course, impractical. The next section shows that we can compute the limit efficiently by doing arithmetic with infinitesimal numbers.
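The following sketch (our own, assuming Distributions.jl, and taking A = 1 cm and B = 1 kg) illustrates this methodology: it evaluates the deterministic function f(seed, eps) for a sequence of shrinking finite widths with a fixed random seed.

  # Running the same seeded program with shrinking interval widths (finite eps).
  using Distributions, Random

  function bmi_trial(eps)                      # A = 1 cm, B = 1 kg
     weight = 1.0
     h = rand(Normal(170, 50))
     w = rand(Normal(70, 30))
     if rand(Bernoulli(0.5))
        weight *= cdf(Normal(200, 10), h + eps/2) - cdf(Normal(200, 10), h - eps/2)
     else
        weight *= cdf(Normal(90, 5), w + eps/2) - cdf(Normal(90, 5), w - eps/2)
     end
     return (w / h^2, weight)
  end

  function f(seed, eps; n = 100_000)           # deterministic in (seed, eps)
     Random.seed!(seed)
     trials = [bmi_trial(eps) for _ in 1:n]
     return sum(w * b for (b, w) in trials) / sum(w for (_, w) in trials)
  end

  for eps in (1e-1, 1e-2, 1e-3, 1e-4)
     println(f(1, eps))                        # converges as eps → 0
  end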

5. Using Infinitesimal Numbers to Handle Measure-Zero Observations

In order to recover the behavior of the old observe(D,x) using observe(D,I) with an interval I, we want to take the limit of an infinitesimally small interval around x. We accomplish this using symbolic infinitesimal numbers of the form r·ε^n, where r ∈ ℝ and n ∈ ℤ. (In the philosophy literature there has been work on using non-standard analysis and other number systems to handle probability 0 events, see (pedersen) and (however) and references therein.) We allow n < 0, so that these numbers can also represent “infinitely large” numbers as well as “infinitesimally small” numbers. We will not make use of this possibility, but it makes the definitions and proofs more general and more uniform. (These infinitesimal numbers may be viewed as the leading terms of Laurent series. This bears some resemblance to the dual numbers used in automatic differentiation, which represent the constant and linear terms of the Taylor series. In our case we only have the first nonzero term of the Laurent series, but the order of the term is allowed to vary.)

Definition 5.1.

An infinitesimal number is a pair (r, n) with r ∈ ℝ and n ∈ ℤ, which we write as r·ε^n. (The exponent of ε will play the same role as the number of densities in lexicographic likelihood weighting (pmlr-v80-wu18f).)

The infinitesimals of the form r·ε^0 correspond to the real numbers.

Definition 5.2.

Addition, subtraction, multiplication, and division on these infinitesimal numbers are defined as follows:

  r1·ε^n1 + r2·ε^n2 = (r1 + r2)·ε^n1   if n1 = n2
  r1·ε^n1 + r2·ε^n2 = r1·ε^n1          if n1 < n2
  r1·ε^n1 + r2·ε^n2 = r2·ε^n2          if n1 > n2

  r1·ε^n1 − r2·ε^n2 = r1·ε^n1 + (−r2)·ε^n2

  (r1·ε^n1) · (r2·ε^n2) = (r1·r2)·ε^(n1+n2)

  (r1·ε^n1) / (r2·ε^n2) = (r1/r2)·ε^(n1−n2)   if r2 ≠ 0

Like ordinary division, division of infinitesimals is a partial function, which is undefined if the denominator is exactly zero.

These rules may be intuitively understood by thinking of ε as a very small positive number; e.g. if n1 < n2 then r2·ε^n2 will be negligible compared to r1·ε^n1, which is why we define the sum to be r1·ε^n1 in that case, keeping only the lowest order term.
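These rules are straightforward to implement. Below is a minimal sketch (ours, not the artifact accompanying the paper) of infinitesimal numbers r·ε^n as a Julia struct with the leading-term arithmetic of Definition 5.2.

  # Leading-term infinitesimal arithmetic: the value r*ε^n is stored as the pair (r, n).
  struct Infinitesimal
     r::Float64   # coefficient
     n::Int       # exponent of ε (n ≥ 1: infinitesimal, n = 0: real, n < 0: infinite)
  end
  Infinitesimal(x::Real) = Infinitesimal(float(x), 0)   # embed the reals
  const ε = Infinitesimal(1.0, 1)

  function Base.:+(a::Infinitesimal, b::Infinitesimal)
     a.n == b.n && return Infinitesimal(a.r + b.r, a.n)
     return a.n < b.n ? a : b                  # keep only the lowest-order term
  end
  Base.:-(a::Infinitesimal) = Infinitesimal(-a.r, a.n)
  Base.:-(a::Infinitesimal, b::Infinitesimal) = a + (-b)
  Base.:*(a::Infinitesimal, b::Infinitesimal) = Infinitesimal(a.r * b.r, a.n + b.n)
  function Base.:/(a::Infinitesimal, b::Infinitesimal)
     b.r == 0 && error("division by an exact zero")
     return Infinitesimal(a.r / b.r, a.n - b.n)
  end

  # (2ε + 3) * ε keeps only the leading term 3ε:
  println((Infinitesimal(2.0, 1) + Infinitesimal(3)) * ε)   # Infinitesimal(3.0, 1)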

We represent intervals as midpoint-width pairs (x, w), where the width w may be an infinitesimal number.

Definition 5.3.

If D is a continuous distribution, we compute the probability that a sample from D lies in the interval (x, w) as:

(1)   P(D, (x, w)) = cdf(D, x + w/2) − cdf(D, x − w/2)   if the width w is an ordinary real number
      P(D, (x, r·ε^n)) = pdf(D, x) · r·ε^n               if the width r·ε^n is infinitesimal (n ≥ 1)

Where cdf and pdf are the cumulative distribution and probability density functions, respectively.

Note that the two cases agree in the sense that if w is very small, then cdf(D, x + w/2) − cdf(D, x − w/2) ≈ pdf(D, x)·w.

Definition 5.4.

We say that E is a “probability expression” in the variable ε if E is defined using the operations +, −, ·, /, real constants, and terms of the form P(D, (x, w·ε)), where x and w are constants and D is a probability distribution with differentiable cdf.

We can view E as a function E(δ) from reals to reals (on the domain on which it is defined, that is, excluding points where division by zero happens), or as a function from infinitesimals to infinitesimals by re-interpreting the operations in infinitesimal arithmetic. The value of E on the symbolic infinitesimal ε tells us something about the limiting behavior of E(δ) near zero:

Theorem 5.5.

If E is a probability expression, and if evaluation of E(ε) with infinitesimal arithmetic is not undefined, and E(ε) = r·ε^n, then lim_{δ↓0} E(δ)/δ^n = r.

Note that the theorem only tells us something if E(ε) evaluates to some r·ε^n with infinitesimal arithmetic. If evaluating E(ε) results in division by zero, then the theorem does not give any information. In fact, the converse of the theorem does not hold: it may be that lim_{δ↓0} E(δ)/δ^n = r but evaluating E(ε) results in division by zero.

Proof.

By induction on the structure of the expression.
We know that evaluation of E(ε) did not result in division by zero, and that E(ε) = r·ε^n. We need to show that lim_{δ↓0} E(δ)/δ^n = r.

  • If E is a constant c, then E(ε) = c·ε^0, and indeed lim_{δ↓0} E(δ)/δ^0 = c.

  • If E = P(D, (x, w·ε)). Now E(ε) = pdf(D, x)·w·ε^1, and since the cdf of D is differentiable,

    lim_{δ↓0} E(δ)/δ = lim_{δ↓0} (cdf(D, x + w·δ/2) − cdf(D, x − w·δ/2))/δ = pdf(D, x)·w

  • If E = E1 + E2. Since evaluation of E(ε) did not result in division by zero, neither did evaluation of the subexpressions E1(ε) and E2(ε), so E1(ε) = r1·ε^n1 and E2(ε) = r2·ε^n2 for some r1, r2, n1, n2. Therefore, by the induction hypothesis we have lim_{δ↓0} E1(δ)/δ^n1 = r1 and lim_{δ↓0} E2(δ)/δ^n2 = r2.

    • Case n1 = n2: Now E(ε) = (r1 + r2)·ε^n1, and we have

      lim_{δ↓0} E(δ)/δ^n1 = lim_{δ↓0} E1(δ)/δ^n1 + lim_{δ↓0} E2(δ)/δ^n1 = r1 + r2

    • Case n1 < n2: Now E(ε) = r1·ε^n1, and since n2 − n1 > 0 we have

      lim_{δ↓0} E2(δ)/δ^n1 = lim_{δ↓0} (E2(δ)/δ^n2)·δ^(n2−n1) = r2 · 0 = 0

      Therefore lim_{δ↓0} E(δ)/δ^n1 = r1 + 0 = r1.

    • Case n1 > n2. Analogous to the previous case.

  • If E = E1 − E2. Analogous to the case for addition.

  • If E = E1 · E2. Since evaluation of E(ε) did not result in division by zero, neither did evaluation of the subexpressions E1(ε) and E2(ε), so E1(ε) = r1·ε^n1 and E2(ε) = r2·ε^n2 for some r1, r2, n1, n2. Therefore, by the induction hypothesis we have lim_{δ↓0} E1(δ)/δ^n1 = r1 and lim_{δ↓0} E2(δ)/δ^n2 = r2. Then E(ε) = (r1·r2)·ε^(n1+n2) and

    lim_{δ↓0} E(δ)/δ^(n1+n2) = (lim_{δ↓0} E1(δ)/δ^n1) · (lim_{δ↓0} E2(δ)/δ^n2) = r1·r2

  • If E = E1 / E2. Since evaluation of E(ε) did not result in division by zero, neither did evaluation of the subexpressions E1(ε) and E2(ε), so E1(ε) = r1·ε^n1 and E2(ε) = r2·ε^n2 for some r1, r2, n1, n2. Therefore, by the induction hypothesis we have lim_{δ↓0} E1(δ)/δ^n1 = r1 and lim_{δ↓0} E2(δ)/δ^n2 = r2. By the assumption that no division by exactly zero occurred in the evaluation of E(ε), we have r2 ≠ 0. Then E(ε) = (r1/r2)·ε^(n1−n2) and

    lim_{δ↓0} E(δ)/δ^(n1−n2) = (lim_{δ↓0} E1(δ)/δ^n1) / (lim_{δ↓0} E2(δ)/δ^n2) = r1/r2

This finishes the proof. ∎

Some subtleties of limits and infinitesimals

In order to reason about infinitesimals one must first choose a function E(δ) of which one wishes to learn something about the limit as δ → 0. Thinking about infinitesimal arithmetic independently of such a function leads to confusion. Furthermore, the result of evaluating E(ε) depends not just on E as a function on real numbers, but also on the arithmetic expression used for computing E. Consider for example the two expressions

  E1(δ) = δ        and        E2(δ) = (δ + 1) − 1

As functions on real numbers, E1 = E2, but nevertheless, with infinitesimal arithmetic their results differ:

  E1(ε) = 1·ε^1        and        E2(ε) = (1·ε^1 + 1·ε^0) − 1·ε^0 = 1·ε^0 − 1·ε^0 = 0·ε^0

Applying the theorem to these results gives the following limits for E1 and E2:

  lim_{δ↓0} E1(δ)/δ^1 = 1        and        lim_{δ↓0} E2(δ)/δ^0 = 0

Both of these limits are correct, but this example shows that which limit the theorem says something about may depend on how the function is computed. The limit for E1 gives more information than the limit for E2; the limit for E2 is conservative and doesn’t tell us as much as the limit for E1 does. Fortunately, this won’t be a problem for our use case: we intend to apply the theorem to the weighted average of importance sampling, where the probabilities may be infinitesimal numbers. In this case the powers of ε in the numerator and denominator are always the same, so the final result will always have power ε^0, and the theorem will then tell us about the ordinary limit as δ → 0.

Another subtlety is that the converse of the theorem does not hold. It is possible that lim_{δ↓0} E(δ)/δ^n = r, but evaluation of E(ε) with infinitesimal arithmetic results in division by exactly zero. An example is E(δ) = δ/((δ + 1) − 1). We have E(δ) = 1 for all δ ≠ 0, so lim_{δ↓0} E(δ)/δ^0 = 1, but when evaluating E(ε), division by zero occurs, because we have the evaluation sequence:

  ε/((ε + 1) − 1) = ε/(1·ε^0 − 1·ε^0) = 1·ε^1 / 0·ε^0, which is undefined.

If we used full Laurent series as our representation for infinitesimal numbers, then we would potentially be able to compute more limits, even some of those where exact cancellation happens in a denominator. Keeping only the first term is sufficient for our purposes, and more efficient, because our infinitesimal numbers are pairs (r, n) of a real (or floating point) number r and an integer n, whereas Laurent series are infinite sequences of real numbers.

The lemmas about computing limits have the form “For all f and g, if lim_{δ↓0} f(δ) = a, and lim_{δ↓0} g(δ) = b, and b ≠ 0, then lim_{δ↓0} f(δ)/g(δ) = a/b”. It is not true in general that lim_{δ↓0} f(δ)/g(δ) = (lim_{δ↓0} f(δ)) / (lim_{δ↓0} g(δ)). It is possible that the limit on the left hand side exists, even when the limits on the right hand side fail to exist, or when the right hand side is 0/0. Therefore, in order to apply these theorems about limits, we must know that the right hand side is not undefined, prior to applying such a lemma. In the proof above, the existence of the limits follows from the induction hypothesis, and that the denominator is nonzero follows from the assumption that division by zero does not occur. This is why we must assume that no division by exactly zero occurs in the evaluation of E(ε) with infinitesimal arithmetic, and it is also why the converse of the theorem does not hold.

5.1. Intervals of Infinitesimal Width Make Paradoxes Disappear

The proposed observe construct allows finite width intervals observe(D,(a,w)) where w is an expression that returns a number, as well as infinitesimal width intervals, as in observe(D,(a,w*eps)) where w is some expression that returns a number and eps is the symbolic infinitesimal 1·ε^1. It is possible to allow higher powers of eps to occur directly in the source program, and it is possible to allow eps to occur in other places than in widths of intervals, but for conceptual simplicity we shall assume it doesn’t, and that observe is always of one of those two forms. That is, we will assume that eps is only used in order to translate exact conditioning observe(D,x) to observe(D,(x,w*eps)).

We translate the example from the introduction as follows:

  h = rand(Normal(170, 50))
  if rand(Bernoulli(0.5))
     observe(Normal(200, 10), (h,w*eps))
  end

Where the pair (h,w*eps) represents an interval of width w*eps centered around h, in order to condition on the observation being “exactly h”.

Let us now investigate the meaning of this program according to the rejection sampling interpretation of observe. Assuming the coin flip results in true, we reject the trial if the sample from Normal(200, 10) does not fall in the interval (h, w*eps). If the coin flip results in false, we always accept the trial. If we let eps → 0 then the probability of rejecting the trial goes to 1 if the coin flips to true, so almost all successful trials will be those where the coin flipped to false. Therefore the expected value of h converges to 170 as eps → 0, and the expected value of running this program should be 170.

We translate the BMI example as follows:

  h = rand(Normal(170, 50))
  w = rand(Normal(70, 30))
  if rand(Bernoulli(0.5))
     observe(Normal(200, 10), (h, A*eps))
  else
     observe(Normal(90, 5), (w, B*eps))
  end
  bmi = w / h^2

Where A and B are constants with units cm and kg, respectively. The units force us to introduce these constants: since (h, A*eps) represents an interval centered at h (in cm), the width A*eps must also be a quantity in cm. If we change the units of h or w, we also need to change the units of A or B. If we change the units of h and A from centimeters to meters, the numerical values of h and A will both get multiplied by 0.01. This additional factor for A*eps, which cannot be provided in the original non-interval type of observe(D,x) statement, is what makes this program behave consistently under change of units.

Both branches of the if statement contain observes with intervals of infinitesimal width, so with rejection sampling both branches will be rejected with probability 1, regardless of the outcome of the coin flip. We must therefore interpret the example with eps tending to 0, but not being exactly 0. If we chose A to be 1 meter and B to be 1 kg, and change B to be 1000 kg, then the observe in the else branch is 1000x more likely to succeed compared to before, because the width of the interval goes from 1*eps to 1000*eps. If we made this change then most of the successful trials would be trials where the coin flipped to false. Thus even in the infinitesimal case, the relative sizes of the intervals matter a great deal. The relative sizes of the intervals are an essential part of the probabilistic program, and omitting them will inevitably lead to unit anomalies, because changing units also requires resizing the intervals by a corresponding amount (by 1000 in case we change from kg to g). If we do not resize the intervals, that changes the relative rejection rates of the branches, or the relative weights of the trials, and thus the estimated expectation value of bmi. As Jaynes notes, conditioning on measure-zero events is ambiguous; even though in the limit eps → 0 the intervals (w,1*eps) and (w,1000*eps) both tend to the singleton set {w}, relative to the interval (h,A*eps) it matters which of these limits is intended, and the final result will and must depend on which limit was specified.

We translate the third example as follows:

  x = rand(Normal(10,5))
  observe(Normal(15,5), (x,eps))
  return exp(x)

After a parameter transformation from x to exp(x) we get the following program:

  exp_x = rand(LogNormal(10,5))
  observe(LogNormal(15,5), (exp_x,exp_x*eps))
  return exp_x

Note that the width of the interval is now exp_x*eps and not simply eps. In general, if we apply a differentiable function f to a small interval of width w around x, we obtain an interval of width approximately f'(x)*w around f(x). If we take the exponential of an interval of small width eps around x, we get an interval of width exp(x)*eps around exp(x), not an interval of width eps around exp(x). Both of these programs should give the same estimate for the expectation value of exp(x), so that infinitesimal width intervals allow us to correctly express non-linear parameter transformations without running into Borel–Kolmogorov-type paradoxes.
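A quick numerical check (ours, assuming Distributions.jl) of this bookkeeping: the per-unit-ε weights of the two programs coincide once the extra factor exp(x) is included.

  using Distributions
  x = 3.0
  println(pdf(Normal(15, 5), x))                     # weight per unit ε in the first program
  println(pdf(LogNormal(15, 5), exp(x)) * exp(x))    # same value in the transformed program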

5.2. On the Statistical Meaning of Conditioning With Intervals and “Soft Conditioning”

It is debatable whether conditioning on small but finite width intervals is preferable to conditioning on measure zero events. Real measurement devices do not measure values to infinite precision. If a measurement device displays 45.88, we might take that to mean an observation in an interval of width 0.01 around 45.88. The measurement may in addition measure the true value x plus some Normal(0,sigma) distributed noise rather than the true value x. In this case it might be appropriate to use observe(Normal(x,sigma), (45.88, 0.01)). The finite precision of the device and its noisy measurement are in principle two independent causes of uncertainty. The rejection sampling interpretation of this program is that we first sample a value from Normal(x,sigma) and then continue with the current trial if this value lies in the interval of width 0.01 around 45.88, which matches the two sources of uncertainty. An argument for using infinitesimal width intervals is that observe on a finite interval requires the evaluation of the distribution’s CDF, which is usually more complicated and expensive to compute than the distribution’s PDF.

The term “soft conditioning” is sometimes used for observe(D,x) statements, particularly when the distribution D is the normal distribution. This term can be interpreted as an alternative to the rejection sampling interpretation in several ways:

  1. Rather than conditioning on x being exactly y, we instead condition on x being “roughly” y.

  2. The statement observe(D,x) means that we continue with the current trial with probability pdf(D,x) and reject it otherwise.

We argue that neither of these interpretations is completely satisfactory. For (1) it is unclear what the precise probabilistic meaning of conditioning on x being “roughly” y is. One possible precise meaning of that statement is that we reject the trial if the difference between x and y is too large, and continue otherwise, but this is not what a statement such as observe(Normal(y,0.01), x) does. Rather, it weighs trials where x is close to y higher, and smoothly decreases the weight as the distance between x and y gets larger. It may seem that (2) makes this idea precise, but unfortunately pdf(D,x) is not a probability but a probability density, and can even have units or be larger than 1. Furthermore, the statement “continue with the current trial with probability pdf(D,x)” seems to have nothing to do with the distribution D as a probability distribution, and instead seems to be a statement that suggests that the statistical model is a biased coin flip rather than drawing a sample from D. Indeed, under our rejection sampling interpretation, if one wants to have a program whose statistical model is about coin flips, one can use the program observe(Bernoulli(f(x)), true). That program does mean “flip a biased coin with heads probability f(x) and continue with the current trial if the coin landed heads”. This makes sense for any function f(x), provided the function gives us a probability in the range [0,1]. A function shaped like the PDF of the normal distribution does have a bump-like shape around y, so it is possible to use such a function for f, if one makes sure that its scale and height are such that it is unitless and everywhere at most 1 (note that this normalization is not the same as the normalization that makes its integral sum to 1).

We therefore suggest sticking with the rejection sampling interpretation of observe statements, and suggest that a statistician who wants to do “soft conditioning” in the senses (1) and (2) write their probabilistic program using observe(Bernoulli(f(x)), true), where f is a function of the desired soft shape, rather than observe(D,x) where the PDF of D has that shape.
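As a sketch of that suggestion (our own illustration, assuming Distributions.jl), a unitless bump function bounded by 1 can serve as the success probability of the Bernoulli observe:

  # "Soft conditioning" expressed as a coin flip, as suggested above.
  using Distributions

  bump(x, y, s) = exp(-((x - y) / s)^2 / 2)   # unitless if (x - y)/s is, equals 1 at x == y

  function soft_trial()
     weight = 1.0
     x = rand(Normal(10, 5))
     # observe(Bernoulli(bump(x, 15, 5)), true) gets executed as weight *= bump(x, 15, 5)
     weight *= bump(x, 15.0, 5.0)
     return (x, weight)
  end

  trials = [soft_trial() for _ in 1:100_000]
  println(sum(w * x for (x, w) in trials) / sum(w for (_, w) in trials))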

5.3. Importance Sampling with Infinitesimal Probabilities

To do importance sampling for programs with infinitesimal width intervals we need to change almost nothing. We execute a call observe(D,I) as weight *= P(D,I), where P(D,I) has been defined in (1). Since P(D,I) returns an infinitesimal number if the width of I is infinitesimal, the computed weight variable will now contain a symbolic infinitesimal number r·ε^n (where n is allowed to be 0), rather than a real number. It will accumulate the product of some number of ordinary probabilities (for observe on discrete distributions or continuous distributions with an interval of finite width) and a number of infinitesimal probabilities (for observe on continuous distributions with intervals of infinitesimal width).

We now simply evaluate the estimate for the expectation value E[x] using the usual weighted average formula, with infinitesimal arithmetic:

(2)   E[x] ≈ (∑_i weight_i · x_i) / (∑_i weight_i)

In the denominator we are adding numbers of the form r·ε^n. Only the numbers with the minimum value of n matter; the others are infinitesimally small compared to those, and do not get taken into account due to the definition of + on infinitesimal numbers. The same holds for the numerator: the values associated with weights that are infinitesimally smaller do not get taken into account (an optimized implementation could reject a trial as soon as its weight becomes infinitesimally smaller than the current sum of accumulated weights, since those trials will never contribute to the estimate of E[x]). Therefore the fraction has the form

  (r1·ε^n) / (r2·ε^n) = (r1/r2)·ε^0

that is, the infinitesimal factors cancel out in the estimate for E[x], and we obtain a non-infinitesimal result.
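A small self-contained sketch (ours) of this procedure for the program from the introduction, storing each weight r·ε^n simply as the pair (r, n) and keeping only the trials of minimal order when forming the weighted average:

  # Importance sampling with infinitesimal weights for the introduction example.
  using Distributions

  function trial()
     r, n = 1.0, 0                          # weight = 1 = 1*ε^0
     h = rand(Normal(1.7, 0.5))
     if rand(Bernoulli(0.5))
        # observe(Normal(2.0, 0.1), (h, 1*eps)): weight *= pdf(Normal(2.0,0.1), h) * 1 * ε
        r *= pdf(Normal(2.0, 0.1), h)
        n += 1
     end
     return (h, r, n)
  end

  trials = [trial() for _ in 1:100_000]
  nmin = minimum(t[3] for t in trials)      # the lowest power of ε dominates
  num = sum(r * h for (h, r, n) in trials if n == nmin)
  den = sum(r     for (h, r, n) in trials if n == nmin)
  println(num / den)                        # ≈ 1.7: the observe branch is infinitesimally down-weighted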

We shall now suppose that the symbolic infinitesimal eps only occurs in the width of intervals in observe(D,(x,r*eps)) calls, and not, for instance, in the return value of the probabilistic program. In this case, the estimate (2) of E[x] satisfies the conditions of Theorem 5.5. The calculated estimate may be viewed as a probability expression in eps (Definition 5.4), and since it evaluates to some r·ε^0, the theorem implies that the limit of the estimate as the interval widths go to zero is r. Therefore the estimate calculated by importance sampling with infinitesimal arithmetic indeed agrees with taking the limit eps → 0. Figure 1 shows three example probabilistic programs that are parameterized by the interval width. The blue lines show several runs of the probabilistic program as a function of the interval width, and the orange line shows the result when taking the width to be infinitesimal. Taking the width to be exactly 0 results in division by zero in the weighted average, but taking it to be infinitesimal correctly computes the limit: the blue lines converge to the orange lines as the width goes to 0.

  function example1(width)
    h = rand(Normal(1.70, 0.2))
    w = rand(Normal(70, 30))
    if rand(Bernoulli(0.5))
      observe(Normal(2.0,0.1),
        Interval(h,10*width))
    else
      observe(Normal(90,5),
        Interval(w,width))
    end
    return w / h^2
  end
  function example2(width)
    h = rand(Normal(1.7,0.5))
    if rand(Bernoulli(0.5))
      observe(Normal(2.0,0.1),
        Interval(h,width))
    end
    return h
  end
  function example3(width)
    x = rand(Normal(10,5))
    observe(Normal(15,5),
      Interval(x,width))
    return x
  end
Figure 1. Three example programs evaluated with finite width intervals with width going to zero (blue curves) and with infinitesimal width (orange curves). The finite width result correctly converges to the infinitesimal result in the limit of zero width.

5.4. The Correspondence Between Observe on Points and Observe on Intervals

We may take a program written using observe(D,x) with exact conditioning on points, and convert it to our language by replacing such calls with observe(D,(x,w*eps)), where w is some constant to make the units correct. For programs that exhibit a paradox of type 2 by executing a different number of observes depending on the outcome of calls to rand, the computed expectation values will change. However, for programs that always execute the same number of observe calls, regardless of the outcome of rand calls, the computed expectation values will not be affected by this translation. To see this, note that a call to observe(D,x) will multiply weight *= pdf(D,x), whereas observe(D,(x,w*eps)) will multiply weight *= pdf(D,x)*w*eps. Thus if the observe calls are the same in all trials, the only difference is that weight will contain an extra factor of w*eps for each observe in all trials. The net result is that both the numerator and denominator in the weighted average get multiplied by the same factor, which has no effect. Thus this translation is conservative with respect to the old semantics, in the sense that it does not change the result that already well-behaved probabilistic programs compute.

5.5. Parameter Transformations as a Language Feature

The three paradoxes we identified all have to do with parameter transformations. We explicitly add parameter transformations as a language feature. A parameter transformation f allows us to transform a probability distribution D to f(D), such that sampling from f(D) is the same as sampling from D and then applying the function f to the result. In order to ensure that the distribution f(D) has a probability density function we require f to be continuously differentiable. We can also use a parameter transformation to transform an interval I to f(I), which contains all the numbers f(x) for x in I. In order to ensure that the transformed interval is again an interval, we require that f is monotone, that is, whenever x < y we also have f(x) < f(y). In this case, f’s action on an interval is simple: f([a,b]) = [f(a), f(b)].

Definition 5.6.

A parameter transformation is a continuously differentiable function f : A → B with f'(x) > 0 for all x in A, where A and B are intervals representing its domain and range.

A strictly monotone function has an inverse on its range, so a parameter transformation f : A → B has an inverse f⁻¹ : B → A with (f⁻¹)'(y) = 1/f'(f⁻¹(y)) > 0, so the inverse of a parameter transformation is again a parameter transformation.

Example 5.7.

The function f(x) = exp(x) is a parameter transformation from the real line to the positive reals. The function g(x) = 100·x is a parameter transformation from the real line to itself.

The transformation f can be used to convert (relative) decibels to (relative) energy density as in Section 3.3, and g can be used to convert meters to centimeters.

Probability distributions need to support 3 operations: random sampling with rand(D), computing the CDF with cdf(D,x), and computing the PDF with pdf(D,x). We define these operations for the transformed distribution f(D).

Definition 5.8.

Given a continuous probability distribution D and a parameter transformation f, we define the operations:

  rand(f(D)) = f(rand(D))
  cdf(f(D), x) = cdf(D, f⁻¹(x))
  pdf(f(D), x) = pdf(D, f⁻¹(x)) / f'(f⁻¹(x))

This definition matches how probability distributions transform in probability theory. Our implementation represents a parameter transformation as the 4-tuple of functions (f, f⁻¹, f', (f⁻¹)'), so that we have access to the inverse and derivative.
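A minimal sketch (ours; the names are not the paper’s) of these three operations for a transformed distribution, carrying the transformation together with its inverse and derivative, and checked against Distributions.jl’s LogNormal for f = exp:

  # A transformed distribution f(D), following Definition 5.8 (sketch only).
  using Distributions

  struct Transformed{D}
     dist::D
     f::Function        # the parameter transformation
     finv::Function     # its inverse
     fprime::Function   # its derivative
  end

  Base.rand(t::Transformed) = t.f(rand(t.dist))
  Distributions.cdf(t::Transformed, x) = cdf(t.dist, t.finv(x))
  Distributions.pdf(t::Transformed, x) = pdf(t.dist, t.finv(x)) / t.fprime(t.finv(x))

  # exp as a parameter transformation: the transformed Normal is a LogNormal.
  t = Transformed(Normal(10, 5), exp, log, exp)
  println(pdf(t, 3.0), "  ", pdf(LogNormal(10, 5), 3.0))   # agree
  println(cdf(t, 3.0), "  ", cdf(LogNormal(10, 5), 3.0))   # agree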

Definition 5.9.

Given an interval I = (x, w) with midpoint x and non-infinitesimal width w, we let a = x − w/2 and b = x + w/2 and define:

  f(I) = ( (f(a) + f(b))/2 , f(b) − f(a) )

This performs a parameter transformation on an interval represented as a midpoint-width pair. If the width is infinitesimal, we need a different rule.

Definition 5.10.

Given an interval I = (x, r·ε^n) with midpoint x and infinitesimal width r·ε^n, we define f(I):

  f(I) = ( f(x) , f'(x)·r·ε^n )

This performs a parameter transformation on an infinitesimal width interval, which gets transformed to an interval whose width is larger by a factor f'(x). The key lemma about parameter transformations is that they do not affect the value of the (possibly infinitesimal) probability of a (possibly infinitesimal) interval.

Lemma 5.11.

Let f be a parameter transformation, D a distribution, and I an interval. Then P(f(D), f(I)) = P(D, I), where P is the probability function defined at (1).

Proof.

We distinguish non-infinitesimal intervals from infinitesimal intervals.

  • If I = (x, w) is non-infinitesimal, then by Definition (1):

    P(D, I) = cdf(D, b) − cdf(D, a), where a = x − w/2 and b = x + w/2.

    For f(I) we have, where a' = f(a) and b' = f(b):

    f(I) = ( (a' + b')/2 , b' − a' ),

    and by (1):

    P(f(D), f(I)) = cdf(f(D), b') − cdf(f(D), a') = cdf(D, f⁻¹(f(b))) − cdf(D, f⁻¹(f(a))) = cdf(D, b) − cdf(D, a) = P(D, I).

  • If I is infinitesimal (