Applied Measure Theory for Probabilistic Modeling

10/01/2021
by   Chad Scherrer, et al.

Probabilistic programming and statistical computing are vibrant areas in the development of the Julia programming language, but the underlying infrastructure dramatically predates recent developments. The goal of MeasureTheory.jl is to provide Julia with the right vocabulary and tools for these tasks. In the package we introduce a well-chosen set of notions from the foundations of probability together with powerful combinators and transforms, and we give a gentle introduction to these concepts in this article. The task is foremost achieved by recognizing the measure as the central object. This enables us to develop a proper concept of densities as objects relating measures with each other. As densities provide a local perspective on measures, they are the key to efficient implementations. The need to preserve this computationally important locality leads to the new notion of a locally dominated measure, which solves the so-called base measure problem and makes working with densities and distributions in Julia easier and more flexible.


1 Why measures?

Distributions are an insufficient abstraction for probabilistic modeling.

Let's first consider Bayesian modeling. In the posterior density of the parameter θ given the observation or data x,

    p(θ ∣ x) = p(x ∣ θ) p(θ) / p(x) ,

the denominator p(x) is called the model evidence. Computing it efficiently is often difficult or infeasible.

Some sampling algorithms circumvent this problem and only require knowing the density up to a constant factor.

We then can work with the “unnormalized posterior density” instead, but we want to distinguish it from a proper probability density, because that difference determines which algorithms are available.

When modeling with a distributions library, it's common to begin with a Distribution object, but then carry around the unnormalized posterior as a plain function. Fortunately, the meaning of this function is usually clear from context. But structurally, this representation is now divorced from its meaning: it's no longer a Distribution object, so the tools from the original library can no longer help. This structural perspective is one we will take often.
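To make this concrete, here is a minimal sketch (using Distributions.jl; the variable names and values are purely illustrative): once we multiply a prior density by a likelihood, what we are left with is a plain Julia closure rather than a Distribution.

    using Distributions

    prior = Normal(0.0, 1.0)                    # a Distribution object
    loglik(θ, x) = logpdf(Normal(θ, 1.0), x)    # log-likelihood of one observation

    x_obs = 0.7
    unnorm_logpost(θ) = logpdf(prior, θ) + loglik(θ, x_obs)   # just a closure

    unnorm_logpost(0.3)   # we can still evaluate it,
    # but it is no longer a Distribution, so pdf, cdf, rand, and the rest of
    # that API no longer apply to it.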

Thus, starting with elements of our class of interest, a simple and common operation has led us to something outside of this class. This is analogous to the way polynomials over the reals lead to the complex numbers. If "the shortest path between two truths in the real domain passes through the complex domain" (Hadamard), then the same argument can be made for measures as a connection between distributions.

As a second example, people working in Bayesian modeling sometimes use improper priors. Whatever one's position on the merits of this, we think it's reasonable to make this approach possible. But an improper prior does not integrate to one, so it's not a distribution. Also, there's an elegant correspondence between frequentist and Bayesian methods, where regularization corresponds to a prior. But limiting ourselves to distributions restrains us from fully realizing this, because the corresponding "prior" might be improper or even flat, as in maximum likelihood estimation.

A final and very different concern is the structure of most distribution libraries, in which distributions are classified primarily according to whether they are “discrete” or “continuous”. Such systems often lack facilities to use these in combination; some go so far as to encode the distinction in the type system, making them fundamentally incompatible.

But “discrete vs continuous” is a false dichotomy. For example, in three dimensions we can (and often do) work with distributions over points, lines, planes, or the entire space, or over spaces like a simplex or the surface of a sphere. These can be combined in rich ways, for example as a spike and slab prior for sparse modeling, or as a parameterized subspace for low-rank modeling.

We can address all of these points by extending the system we work with. Instead of distributions, our primary class of interest is measures, with distributions as a special case.

Of course, this special case is particularly useful. The point is not to disregard distributions, but to change our focus to one allowing a richer calculus for reasoning. Adapting Hadamard's argument: the shortest path between two distributional truths passes through non-distributional measures.

Contributions

Our work is novel in several ways. MeasureTheory.jl has

  1. Explicitly represented base measures with the same sophistication as the rest of the system, in particular more than just “discrete or continuous”;

  2. A local approach for determining absolute continuity, which is usually a global characteristic;

  3. Multiple parameterizations for a given measure, without a proliferation of constructors; and

  4. Normalization and support constraints held separately from the data-dependent computation, allowing for greater efficiency.

Some parts of our approach can be seen in existing systems, though these are still far from universal:

  • A rich set of combinators for building new measures from existing ones;

  • Flexible type constraints, for example allowing measures with symbolic parameters;

  • Light-weight measure construction, replacing a common assumption that once constructed, a measure will be used many times. This is especially important for probabilistic programming applications.

2 What are measures?

We’ll now describe some foundations to help the reader get a deeper understanding of our approach, in particular what measures and probability distributions are and how they relate to notions such as the probability of something happening in the real world or the notion of volume of the space we are living in.

Throughout this discussion, it's important to keep in mind that for us the measure-theoretic abstractions are only a means to an end. (Footnote 1: The MathOverflow discussion https://mathoverflow.net/q/11591 covers books on measure theory; M.S. likes [Shiryaev1996] as a starting point and uses [Elstrodt2011] and [Bauer_Heinz1982-01-01] as references.) As we anticipate interest in connecting with the actual theory, we give some pointers to more technical details in the footnotes.

Our primary interest, and the goal of MeasureTheory, is to offer support for applied probabilistic modeling.

The package name is owed to the fact that the word "measure" alone is too vague.

Given a space E of possible outcomes and a set of subsets of E called events, a probability distribution P assigns each event a non-negative quantity, called the probability or probability mass.

Likewise, a measure assigns each measurable set a non-negative quantity, also called a “mass”.

The space E could be {1, …, 6}, suitable to model a six-sided die, or might be ℝ³, the 3-dimensional Euclidean space, suitable to model "volume", to give two examples.

One further such space that deserves mention is traditionally denoted Ω. This is an abstract space connected to the real-world notion of probability, which we denote as ℙ.

Computationally, Ω can be considered to be the set of possible states of a random number generator rng::AbstractRNG, which is the source of computational randomness.

Ω has a special role of tying random quantities a program produces and their distributions together, through the notion of random variables, which are functions X: Ω → E from Ω taking values in spaces E, such as {1, …, 6} or ℝ³. (Footnote 2: Additionally, one requires a random variable to be a measurable function, that is, one for which the inverse image of an event of E is also an event of Ω. This is a similar, but much weaker, requirement than the definition of a continuous function ("the inverse image of an open set is also open"). The package Omega.jl [tavares2019language] uses random variables derived from Ω as a core principle.)

From a computational perspective, a Julia function taking only an rng::AbstractRNG argument, making a number of calls to rand(rng), and returning a value is a random variable X. A simple example is X(rng=Random.GLOBAL_RNG) = rand(rng, 1:6). While mathematically X is a function from Ω, the argument ω is often hidden, much like the function argument rng of its computational counterpart—not a coincidence.

Each random variable X is tied to its probability law P_X, the distribution assigning probability to the events {X ∈ A} for each (measurable) set A ⊂ E,

    P_X(A) = ℙ(X ∈ A) .

In that sense, classical distribution packages restrict themselves very much to the task of providing a catalogue of useful probability laws, and a simple mechanism to provide a random draw (a random variable) with that law, as the function rand(rng, D). In the example,

    P_X(A) = |A| / 6 ,

where |A| denotes the number of elements of A.

In MeasureTheory, measures have abstract super-type AbstractMeasure.

Kolmogorov’s axioms

We now introduce Kolmogorov’s axioms, which describe laws that characterize probability distributions, and, with one exception, also measures.

Measures and probability distributions are both required to obey the axiom that the (probability) mass of sets obtained as a union of disjoint component sets (Footnote 3: Precisely: the union of a sequence of disjoint component sets.) equals the sum of the (probability) masses of the components. So if sets A and B are not disjoint, the mass of the union is computed from the mass of the components using inclusion-exclusion. For probabilities this is

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B) .

Similarly, for a measure μ,

    μ(A ∪ B) = μ(A) + μ(B) − μ(A ∩ B) .

In the case where A and B are disjoint, this reduces to

    μ(A ∪ B) = μ(A) + μ(B) ,

a property called additivity. (Footnote 4: The axioms also require a more general form of this to hold. If A₁, A₂, … are pairwise disjoint (that is, no two have any common elements), then it must be true that

    P(A₁ ∪ A₂ ∪ ⋯) = P(A₁) + P(A₂) + ⋯ ,

and similarly for more general measures, replacing P with μ. This extension of additivity to the countably infinite case is called σ-additivity.)

The reader has encountered many measures before, whether or not they were given that name. For example, counting measure gives the number of elements of a set, and the Lebesgue measure gives length, area, volume, etc. in Euclidean space (CountingMeasure and LebesgueMeasure in the package).

So far our characterizations of measures and probability distributions are functionally identical. Indeed, the one distinguishing feature is the law of unit measure, which only for distributions requires that

    P(E) = 1 ,

and also ℙ(Ω) = 1. Thus Ω can be considered the event "that anything at all will happen".

By this axiom, probability mass is a proportion, a quantity between 0 and 1. Only this axiom, which for measures is replaced by the weaker axiom μ(∅) = 0, sets the two apart. To reinforce that a distribution is also a measure, we'll sometimes refer to it as a probability measure.

This close relationship between measures and probability measures is illustrated by the Lebesgue measure on ℝ. The measure of the full space is ∞. But restricting to the unit interval [0, 1] gives measure 1. This restricted measure is a probability distribution—the uniform distribution on [0, 1]—describing the law of the Julia function rand(), which gives random numbers in the interval [0, 1).

We have not yet stated for which sets Kolmogorov’s axioms have to hold for something to be called a measure or a probability:

σ-algebras and how to like them

A (probability) measure on E need not assign a (probability) mass to every possible subset A ⊂ E, but only to each event (for a probability measure) or measurable set (more generally). Statements we make about sets should be understood as restricted to this set-of-sets, which can be different from one measure to the next.

Now, Kolmogorov's axioms require a given measure to be defined for E itself, and they also require that this set of measurable sets is closed under the operations which the axioms allow (complements and unions of sets or sequences of sets). This imposes an algebraic structure on the measurable sets, called a σ-algebra, which is exactly that: a set of sets (subsets of E, in particular) closed under complementation and countable unions and intersections.

This raises the question: Why not just use the power set, the set of all subsets, which is, after thinking about it, such a σ-algebra?

A key observation is that having defined probability mass for a number of relatively simple sets, for example on all intervals, Kolmogorov's axioms give a recipe to compute the probability mass of more complicated sets, and with those, even more in new applications of the axioms, and so on. One says the σ-algebra is generated by these sets. Indeed, Distributions.jl does not give means to compute probability for arbitrary subsets of floating point numbers, but gives a way of computing probability mass just for intervals of the form (−∞, x], through the function cdf.

In fact, it is possible for small enough spaces to assign meaningful probabilities to all subsets of the space, but impossible for any space containing a continuum such as the real line. Just representing a single arbitrary subset of the Float64 range, the computational abstraction of the real line, would require one bit per representable value, 2⁶⁴ bits ≈ 2.3 × 10¹⁸ bytes, a staggering 2,306 petabytes of memory. This is not practical.

But even mathematical measure theory needs to steer clear of the set of all subsets of uncountable spaces, because it turns out that not even volume can be properly defined for all subsets of Euclidean space. (Footnote 5: Another thing to keep in mind is that a function being measurable requires that "the inverse image of a measurable set must be measurable". So allowing more sets to be measurable in the codomain requires either allowing more measurable sets in the domain, or restricting the set of measurable functions.)

Accordingly, as of this writing, MeasureTheory does not have first-class σ-algebras, but rather considers them to be implicit to a given measure. Having said that, σ-algebras do have a role to play in applied measure theory, but we give only a pointer here. The practical importance of σ-algebras here lies in the fact that they also encode available information. Observe that each event corresponds to a question, with a probabilistic answer. For example, an event such as {X ∈ [a, b]} corresponds to the question of whether the random variable X is in the interval [a, b], and there is a probability that this will be the case. Assume we know whether some such questions A₁, A₂, … are answered (each with yes or no).

Then we also know the answer to the complement A₁ᶜ (no or yes), and we always know that E has probability 1 and the corresponding question is answered by 'yes'. Likewise, if both A₁ and A₂ can be answered given what we know, we also know the answer to A₁ ∩ A₂. That means these known events form a σ-algebra, and in particular for each random variable X there is a σ-algebra associated with X, which encodes what else we know knowing X, and how we should assign (conditional) probabilities to other events knowing X.

This is a very relevant question for probabilistic programming and causal inference, and with kernels (Section 6) MeasureTheory does provide tools to reason about it that are computationally appropriate, because they are local.

Densities

A measure is often described in terms of its density. Before getting to a more technical discussion, we hope a physical analogy can help build some intuition. Imagine a wooden board with some knots in it, where we might be interested in the mass of some two-dimensional cut-out shape. This mass depends not only on the shape but also on its location and orientation, in particular the inclusion of knots. In this way we can start with a measure (here Lebesgue measure) and use a density (the physical density of the wood) to construct a new measure (the mass of any given cut-out shape).

Of course, we do this all the time with distributions, building a continuous distribution in terms of a probability density function (pdf) over Lebesgue measure, or a discrete distribution in terms of a probability mass function (pmf), which is just a density over counting measure.

Somewhat more formally, a probability density p(x) in Euclidean space is a local ratio of the probability assigned to an infinitesimally small volume (or area) around each point x, relative to that volume itself:

    p(x) ≈ P(B_ε(x)) / vol(B_ε(x))   for small ε,

where B_ε(x) is a ball of radius ε around x.

“Volume” is not a distribution, but is in fact the Lebesgue measure described above. Discrete distributions can be expressed similarly using counting measure. There are certainly more events (sets) than outcomes (elements), so a probability density gives a local and parsimonious description of a distribution.

More generally, for measures μ and ν (and an absolute continuity condition, to be discussed), the density dμ/dν is the analogous local ratio of μ-mass to ν-mass around each point.

As we see, density only makes sense relative to some base measure.

Limiting ourselves to distributions makes this awkward to even discuss, but allowing measures as first-class objects means we can make this characterization more explicit, and thus more flexible. What sets MeasureTheory apart from most libraries is that we don’t sweep this under the rug, but rather address it head-on.

In place of Lebesgue or counting measure, any measure we can express can play the role of ν above. This becomes crucial in high-dimensional spaces, where working with other reference measures, such as a product of normal distributions, becomes a numerical requirement. (Mathematically, there is no infinite-dimensional Lebesgue measure, so numerically, even high-dimensional Lebesgue measures can be problematic. ZigZagBoomerang.jl [https://doi.org/10.5281/zenodo.3931118], for example, allows expressing a target density with respect to a product of normal distributions using the Boomerang sampler [pmlr-v119-bierkens20a] for this reason.)
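As a minimal sketch of how this looks in code (the two- and three-argument logdensity functions are described in Sections 3 and 4; Normal appears later in this paper, and Poisson is assumed here purely for illustration):

    using MeasureTheory

    # log-density of a standard normal relative to its base measure (Lebesgue):
    logdensity(Normal(μ=0.0, σ=1.0), 0.3)

    # log-density of a Poisson relative to counting measure:
    logdensity(Poisson(λ=2.0), 3)

    # the same normal measured against a different reference measure:
    logdensity(Normal(μ=0.0, σ=1.0), Normal(μ=1.0, σ=2.0), 0.3)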

3 Locally dominated measures

One cannot expect to express a measure μ putting positive mass on a set A relative to a second measure ν, if nothing is there to compare, that is, if μ(A) > 0 but ν(A) = 0. You can't "make something from nothing".

Given measures μ and ν defined on a common space E, we say μ is dominated by ν if ν(A) = 0 implies μ(A) = 0 (or equivalently, μ(A) > 0 implies ν(A) > 0) for every measurable A. This is denoted by μ ≪ ν, and is sometimes equivalently read as "μ is absolutely continuous with respect to ν".

The Radon-Nikodym theorem states that μ ≪ ν holds if and only if there is a density function f, often written dμ/dν, with the property that for every measurable A,

    μ(A) = ∫_A f dν .

The concept of absolute continuity is very useful for formal manipulations. However, the global nature of testing μ ≪ ν would require every support to be represented in a way that allows efficient computation of this relation. In particular, in applications based on Markov chain Monte Carlo, such global information about the measures at hand is not available.

Even with such a capability, this approach would have its problems. It seems likely a user getting an error in response to requesting a density would often respond by restricting measures accordingly until the request can be fulfilled.

Because of this, we instead define μ to be locally dominated by ν near x, written μ ≪ₓ ν, if there is some neighborhood U of x such that ν(U) > 0 (so it's not a degenerate case) and μ ≪ ν on U (absolute continuity between the restricted measures). A local density can be defined similarly. In MeasureTheory, the density and logdensity functions work in exactly these terms. For the remainder of this paper, "(log-)density" will always refer to the local (log-)density, and "local" will often be taken as understood.

Density computations are typically done in log space. For example, if we're interested in dμ/dν(x), we'll typically instead compute logdensity(μ, ν, x). Here and below, we'll often write the math in terms of densities and the code in terms of log-densities, despite the unfortunate inconsistency between the two.

There are several benefits to working in log-space. Results are much less likely to overflow or underflow, and the many products and exponents become sums and products, respectively, which are much more efficient to compute and to differentiate.

Working locally and in log-space also gives us a convenient antisymmetry,

logdensity(μ, ν, x) == -logdensity(ν, μ, x)

In particular, logdensity(μ, ν, x) takes special values in some common cases:

  • -Inf if the local density of μ at x is zero while that of ν is finite and nonzero;

  • Inf in the reverse situation;

  • NaN if both local densities vanish at x.

4 Base measures and density recursion

To this point, we’ve described the implementation in terms of the three-argument logdensity(μ, ν, x). But implementing things directly in these terms would require a definition for every pair of possible measures, which is of course intractable.

We address this by instead assigning every measure a base measure. This is defined locally, with default method

basemeasure(μ::AbstractMeasure, x) = basemeasure(μ)

This allows users to define a base measure specific to a given neighborhood if they prefer, or to define a global base measure when that suffices.

The log-density is then fundamentally expressed in terms of the two-argument logdensity(μ, x), which gives the log-density with respect to this base measure. Thus we have

logdensity(μ, x) == logdensity(μ, basemeasure(μ, x), x)

This leaves us with the question of how to compute the three-argument logdensity(μ, ν, x) for general μ and ν. We implement this recursively, in an approach we refer to as density recursion.

If α is the base measure of μ and β is the base measure of ν, then

    log (dμ/dν)(x) = log (dμ/dα)(x) + log (dα/dβ)(x) − log (dν/dβ)(x) .

In the implementation, logdensity(μ, α, x) and logdensity(ν, β, x) can be computed directly using the two-argument logdensity. There may be a specific method for logdensity(α, β, x); if not, it will require a recursion step.

In this process, we may reach a measure that serves as its own base measure. Such measures are primitive, for example Lebesgue or counting measure. If logdensity is called on two primitive measures with no explicit method, an error is thrown.
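The following is a schematic sketch of the recursion (simplified for exposition, not the package's actual implementation; isprimitive is an assumed helper meaning "is its own base measure", and in the package specific methods for particular pairs of measures terminate the recursion rather than an error):

    function logdensity3(μ, ν, x)
        μ === ν && return 0.0                   # any measure relative to itself
        if isprimitive(μ) && isprimitive(ν)
            error("no method relating these primitive measures")
        end
        α = basemeasure(μ, x)
        β = basemeasure(ν, x)
        # two-argument logdensity is relative to the respective base measure
        return logdensity(μ, x) - logdensity(ν, x) + logdensity3(α, β, x)
    end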

The above serves as a high-level specification of the computational process. The low-level implementation will differ in some ways as we continue to tune it for performance.

5 Parameterized measures

A common challenge in building a library of distributions is the choice of parameterizations. For example, in Stan [Stan], a negative binomial distribution is parameterized by a mean μ and a "dispersion" ϕ, where

    p(y ∣ μ, ϕ) = binomial(y + ϕ − 1, y) (μ / (μ + ϕ))^y (ϕ / (μ + ϕ))^ϕ .

In the Julia package Distributions.jl [Distributions.jl-2019], this is instead given by

    p(y ∣ r, p) = binomial(y + r − 1, y) p^r (1 − p)^y .

These are equivalent (let r = ϕ and p = ϕ / (μ + ϕ)), and the inconsistency alone gives some evidence that it might be reasonable to prefer one or the other depending on the circumstances. Yet most libraries only allow one or the other (or yet another alternative) as the negative binomial distribution. Other parameterizations require entirely different names.

In MeasureTheory, our approach avoids this problem. A parameterized measure is defined by a struct of the appropriate type with a named tuple par field. For example,

struct NegativeBinomial{N,T} <: ParameterizedMeasure{N}
    par :: NamedTuple{N,T}
end

We can then write

    NegativeBinomial(r=10, p=0.75)

or

    NegativeBinomial(μ=10, ϕ=3)

Calls to rand, logdensity, etc then delegate to methods according to the appropriate names. We use KeywordCalls.jl [KeywordCalls.jl], so all names are resolved statically at compile time.
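To illustrate how the named-tuple type parameter enables this kind of name-based dispatch, here is a sketch of a method for the (r, p) parameterization. It writes out a complete log-mass directly, rather than the package's actual split into data-dependent and other terms (see Section 12); loggamma is from SpecialFunctions.jl.

    using SpecialFunctions: loggamma

    # Dispatch on the parameter names (:r, :p) via the first type parameter.
    function logdensity(d::NegativeBinomial{(:r, :p)}, y)
        (r, p) = (d.par.r, d.par.p)
        return loggamma(y + r) - loggamma(r) - loggamma(y + 1) +
               r * log(p) + y * log1p(-p)
    end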

6 Kernels

A kernel is a (measurable (Footnote 6: That is, the function x ↦ κ(x)(A) must be measurable for every fixed measurable set A.)) function that returns a measure. Equivalently, it represents a family of measures parameterized by its argument. Writing ℳ(Y) for "measures on Y", we can write this as

    κ : X → ℳ(Y) .

A prominent application is a conditional distribution. In this case κ is further restricted to be a Markov kernel,

    κ : X → 𝒫(Y) ,

where 𝒫(Y) represents "probability measures on Y". If x and y are random variables taking values in X and Y, the kernel defines a probability measure assigning every measurable set A the probability

    P(y ∈ A ∣ x) = κ(x)(A) .

In MeasureTheory, a kernel is represented as either a Julia function or a Kernel object. The latter pairs a measure constructor with a mapping into its parameter space, and makes the functional relationship between the argument of κ and the returned measure transparent.

A Markov kernel, in particular, corresponds to the distribution of a parametrized random variable, a function with random outcomes (or to "mechanisms" in causal inference, for example in [https://doi.org/10.5281/zenodo.1005091]). For example, in MeasureTheory the kernel

    κ(x) = Normal(μ = x, σ² = x)

can be expressed as

    κ = kernel(Normal) do x
        (μ = x, σ² = x)
    end

or equivalently

    κ = kernel(Normal; μ = identity, σ = sqrt)

corresponding to the random function

    f(rng, x) = x + √x * randn(rng)

In this example, working in terms of a kernel allows us to make the linear relationship between the argument x and mean(κ(x)) transparent.
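A brief usage sketch (the numbers are illustrative):

    ν = κ(4.0)           # the measure Normal(μ = 4.0, σ = 2.0)
    rand(ν)              # draw from it
    logdensity(ν, 3.5)   # evaluate its log-density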

This could of course also be expressed as a parameterized measure. The difference is that kernels are lighter weight to build and have all the dynamism of functions, while the static nature of parameterized measures can make it easier to express some optimizations.

Finally, kernels are key to expressing likelihoods, which we can use to formulate Bayesian conjugacy properties. We’ll discuss this further in Section 9.

7 Markov chains

Some kernels map elements of a state space to measures on the same space.

In this case, repeated application of the kernel "walks" through the state space, yielding the law of a Markov chain. We support this in MeasureTheory through the Chain combinator. This takes a kernel (or function) and a starting measure, and can be used with Julia's convenient do notation. For example, we can write a Gaussian random walk as

    mc = Chain(Normal(μ=0.0)) do x Normal(μ=x) end

A random sample from a Chain is an infinite sequence expressed as an iterator, e.g.

    julia> r = rand(mc);

    julia> x = Iterators.take(r, 3) |> collect
    3-element Vector{Any}:
     -0.4931543737034523
     -0.5661895116186417
     -1.3286977670590228

Thus, for example, we can evaluate mc on the x we just computed:

    julia> logdensity(mc, x)
    -0.4149771036439342

Crucially, the call to rand builds a struct containing a random seed, to ensure that repeated realizations of r will be identical.

To give us more control over the sampling procedure, Chain is implemented using dynamic iterators from DynamicIterators.jl [DynamicIterators.jl].

The appropriate σ-algebra for infinite sequences requires some care. A Markov chain on E has a σ-algebra generated by sets of sequences with finitely many constraints. (Footnote 7: That is, the smallest σ-algebra containing all sets of sequences which can be written as {x : xᵢ ∈ Aᵢ for i ∈ I} using collections of measurable sets Aᵢ indexed by finite index sets I, called cylinder sets.)

8 Densities and integrals

In working with measures, two computations that come up quite often are (Radon-Nikodym) differentiation and (Lebesgue) integration. It’s often important to be able to do things like

  1. For measures μ and ν, obtain the density dμ/dν as a function;

  2. Given a measure ν and a function f, obtain a measure μ with dμ/dν = f.

Note that (1) is different from our current density function, because no point x is specified. Computationally, the result is a closure.

In MeasureTheory, we write these as

f = ∂(μ, ν) and μ = ∫(f, ν) .

This allows the Radon-Nikodym theorem to be expressed as

f == ∂(∫(f, ν), ν) and μ == ∫(∂(μ, ν), ν) .

Because we often want to work in log-space instead, we also include (and tend to prefer) the functions log∂ and ∫exp, for which we would write

ℓ = log∂(μ, ν) and μ = ∫exp(ℓ, ν) ,

which follow a similar law,

f == log∂(∫exp(f, ν), ν) and μ == ∫exp(log∂(μ, ν), ν)

Finally, in terms of density and logdensity, we have

               ∂(μ, ν)(x) == density(μ, ν, x)
            log∂(μ, ν)(x) == logdensity(μ, ν, x)
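A small sketch of these identities in code (Normal as elsewhere in this paper; the particular measures and points are illustrative):

    μ = Normal(μ=0.0, σ=1.0)
    ν = Normal(μ=1.0, σ=2.0)

    ℓ = log∂(μ, ν)                    # a function of x
    ℓ(0.3) ≈ logdensity(μ, ν, 0.3)    # true

    μ2 = ∫exp(ℓ, ν)                   # reassembled from its log-density
    logdensity(μ2, ν, 0.3) ≈ ℓ(0.3)   # true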

9 Pointwise products and likelihoods

For a parameter θ and data x, suppose we have a prior π(θ) (momentarily assuming a Lebesgue base measure) and likelihood ℒ(θ) = p(x ∣ θ). Then Bayes's Law gives the posterior density

    p(θ ∣ x) = π(θ) ℒ(θ) / ∫ π(θ′) ℒ(θ′) dθ′ .

In practice, we rarely know the normalization factor ∫ π(θ′) ℒ(θ′) dθ′, so we often work in terms of the unnormalized posterior density

    π(θ) ℒ(θ) .

Because this does not include the normalization, it does not integrate to one, and so it's not a probability density function. But there is no such problem with it as a density for a measure; the measure defined in this way has the same base measure as that of the prior.

This pointwise product in probability often describes the fusion of information, in this case the fusion of information from the prior and from the observations via the likelihood, but it also shows up in related situations, such as sensor fusion in signal processing.

More generally, this "prior times likelihood" can be considered as an operation that takes a measure and a function, and returns a new measure. Given a measure μ with base measure α, and a function ℓ defined on the support of μ, the pointwise product μ ⊙ ℓ is a measure with the same base measure α and density

    d(μ ⊙ ℓ)/dα (x) = ℓ(x) · (dμ/dα)(x) .

In the special case where ℓ is a likelihood, it's important to keep track of whether it's represented in log space. For this, we have a Likelihood type that "knows" about the measure it was derived from and adapts according to whether the user calls density or logdensity. The general method

    ⊙(::AbstractMeasure, ::Likelihood)

defaults to constructing a PointwiseProductMeasure

to hold the prior and likelihood, but we’ll be extending this with more efficient methods for cases with conjugate priors.

In the absence of conjugacy, the extra structure of this suspended form is convenient, for example to build approximations to a given measure of interest.

The above Likelihood type is specified using a kernel and observed data. Though it is not an AbstractMeasure, we include a logdensity method for computational convenience. The core of the implementation is then

    struct Likelihood{F,S,X}
        k::Kernel{F,S}
        x::X
    end

    function logdensity(ℓ::Likelihood, p)
        return logdensity(ℓ.k(p), ℓ.x)
    end
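A usage sketch of "prior times likelihood" with this type (the pointwise-product operator is written ⊙ as above; the kernel and numbers are illustrative):

    κ = kernel(Normal; μ=identity, σ=one)   # observation kernel x ∣ θ ~ Normal(θ, 1)
    lik = Likelihood(κ, 2.5)                # likelihood of the observation x = 2.5
    post = Normal(μ=0.0, σ=1.0) ⊙ lik       # unnormalized posterior measure over θ
    logdensity(post, 0.7)                   # evaluated at θ = 0.7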

10 Product and power measures

Given measures μ on X and ν on Y, the (independent) product measure μ ⊗ ν is a measure on X × Y. Just as Lebesgue measure on ℝ is generated by defining its value on intervals, the product measure is generated by defining

    (μ ⊗ ν)(A × B) = μ(A) ν(B)

for any measurable A and B. MeasureTheory code uses exactly this notation, μ ⊗ ν. This can be extended recursively to products of any finite number of measures.

If μ is defined in terms of a base measure α, and likewise ν over β, the product measure has base measure α ⊗ β and density

    d(μ ⊗ ν)/d(α ⊗ β) (x, y) = (dμ/dα)(x) · (dν/dβ)(y) .

A special case of this is when μ = ν. In MeasureTheory we refer to this as a power measure, written μ^2 or μ^n for higher dimensions. More generally, the second argument can be a Tuple, so for example μ^(2,3) extends as a product measure over matrices.
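For example (a sketch; Normal as elsewhere in this paper):

    μ = Normal(μ=0.0, σ=1.0)
    d = μ^3              # iid product measure on ℝ³
    x = rand(d)          # a draw with three components
    logdensity(d, x)     # sum of the three component log-densities
    m = μ^(2, 3)         # extends over 2×3 matrices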

A second special case is when we have a collection of values together with a kernel. For this we use the For combinator. So in a regression model, we might express the response as coming from

    For(1:n) do j
        Normal(β * x[j], σ)
    end

This is exactly the product measure

    Normal(β x₁, σ) ⊗ Normal(β x₂, σ) ⊗ ⋯ ⊗ Normal(β xₙ, σ) .

More generally, suppose we have a measure μ on a space X and a kernel κ : X → ℳ(Y). Then we can define a measure μ ⊗ κ on X × Y as follows.

For a given x, let μ have base measure α and let κ(x) have base measure β. Then, similarly to the independent product, we have

    d(μ ⊗ κ)/d(α ⊗ β) (x, y) = (dμ/dα)(x) · (dκ(x)/dβ)(y) .

Because the types are always available to help us distinguish this from the independent case, it’s safe to use the same notation, μ ⊗ κ.

Note that this defines a function

    ⊗ : ℳ(X) × (X → ℳ(Y)) → ℳ(X × Y) .

In functional programming, this (together with some laws) is called a monadic bind [ramsey2002stochastic].

11 Superposition and mixtures

A superposition is the measure-theoretic analog of a mixture model. For measures μ and ν with base measures α and β, the superposition, written μ + ν, is a measure with base measure α + β and density

    d(μ + ν)/d(α + β)(x) = (dμ/dα)(x) · (dα/d(α + β))(x) + (dν/dβ)(x) · (dβ/d(α + β))(x) .

Both of the factors dα/d(α + β) and dβ/d(α + β) can be written in terms of the ratio dα/dβ and its inverse, which allows that ratio to be computed once and then re-used. Also note that in the special case where α = β, this reduces to ((dμ/dα)(x) + (dν/dα)(x)) / 2.

An important special case of superposition occurs when μ and ν are finite measures on some space E, and μ(E) + ν(E) = 1. This is equivalent to a convex combination of probability densities, called a mixture.

Conveniently, a product of measures distributes over superposition. That is, if μ₁ and μ₂ are measures on a common space X and ν₁ and ν₂ are measures on Y, then

    (μ₁ + μ₂) ⊗ ν₁ = (μ₁ ⊗ ν₁) + (μ₂ ⊗ ν₁)

and

    μ₁ ⊗ (ν₁ + ν₂) = (μ₁ ⊗ ν₁) + (μ₁ ⊗ ν₂) .

A very common special case of superposition is a spike and slab prior, a mixture of a Dirac measure (a point mass) with a continuous measure. This is useful for sparse Bayesian modeling, as implemented in the sparse ZigZag sampler, described in [arxiv2103.08478] and implemented in [https://doi.org/10.5281/zenodo.3931118] using MeasureTheory.
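As an illustrative sketch (assuming a Dirac point-mass constructor and scalar weighting of measures; the exact weighting syntax may differ in the package):

    # A spike-and-slab prior: half the mass exactly at zero, half spread out.
    spike_and_slab = 0.5 * Dirac(0.0) + 0.5 * Normal(μ=0.0, σ=1.0)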

12 Density decomposition

Especially for probability measures, it's common for the log-density with respect to Lebesgue or counting measure to have several types of terms. Given an observation x and any relevant parameters θ, there are typically

  • Terms that are data-dependent, each involving some nontrivial function of x (and possibly also of the parameters),

  • Terms that are parameter-dependent, and

  • Terms that are constant.

In addition, it's common to have an "argument check" to be sure x is in the support of the distribution.

Depending on the application, we can often ignore some of these. For example, we may know x is in the support by construction, or from a previous check. In these cases, any use of resources to check arguments is wasteful.

There are many cases where it's important for performance to have constant and parameter-dependent terms separated from data-dependent ones. Suppose μ is a measure with log-density of the form

    ℓ(x) = f(θ, x) + g(θ) + C ,

where C is constant in both θ and x. If we instead observe an iid product of n observations x₁, …, xₙ, the log-density is

    ∑ᵢ ℓ(xᵢ) = ∑ᵢ f(θ, xᵢ) + n (g(θ) + C) .

This final form can be much more efficient, because it reduces repeated computations of g(θ) + C to one.
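A plain-Julia sketch of the saving for an iid normal sample (no package API; f, g, and C are as above):

    # naive: recompute the parameter-dependent and constant terms for every xᵢ
    logdensity_naive(xs, m, s) =
        sum(-0.5 * ((x - m) / s)^2 - log(s) - 0.5 * log(2π) for x in xs)

    # split: data-dependent sum, plus n copies of the remaining terms
    function logdensity_split(xs, m, s)
        f = sum(-0.5 * ((x - m) / s)^2 for x in xs)   # data-dependent terms
        g_plus_C = -log(s) - 0.5 * log(2π)            # parameter-dependent + constant
        return f + length(xs) * g_plus_C
    end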

In some cases it's important to avoid computing g(θ) + C at all. For correlation matrices, it's common to use the LKJ prior [LEWANDOWSKI20091989].

Rather than work with correlation matrices directly, it's convenient to work in terms of the Cholesky decomposition. For this purpose, MeasureTheory includes an LKJCholesky measure. This is typically used as a prior with fixed parameters k and η, which give the dimensionality and "concentration" of the measure. The relative cost of normalization (which is irrelevant for MCMC) can be computed as

function relative_normcost(k, η)
    d = LKJCholesky(k, η)
    L = rand(d).L
    f_cost = @belapsed logdensity($d, $L)             # data-dependent terms
    g_plus_C_cost = @belapsed Dists.lkj_logc0($k, $η) # normalization terms
    return g_plus_C_cost / (g_plus_C_cost + f_cost)   # @belapsed is from BenchmarkTools.jl
end

So for example, in 10 dimensions with η = 2.0 this gives

    julia> relative_normcost(10, 2.0)
    0.76428528899935

That is, 76% of the time is spent in normalization. If our application doesn’t need it, three-fourths of the computation time is simply wasted.

For these reasons, we break the representation of the log-density into several pieces (now additive terms in log-space):

  1. Constant terms, C.

  2. Parameter-dependent terms, g(θ).

  3. Data-dependent terms, f(θ, x).

The two-argument logdensity then computes only the data-dependent terms, with the constant and parameter-dependent terms pushed to the base measure. This makes it easy to defer computation of these terms until they are required.

13 Affine transforms

A particularly expressive way to build new measures in terms of existing ones is through a pushforward. Given a measure μ and a function f defined on its support, the pushforward of μ through f is a measure f_*μ defined by

    (f_*μ)(A) = μ(f⁻¹(A)) .

In the following sections, we first discuss this in the context of probability measures before turning to the more general case.

Forward parameterization

Starting with a d-dimensional multivariate random variable z, an affine transform

    f(z) = σ z + b

is a "linear transform with a shift". We can use this to define a new random variable x, as

    x = σ z + b ,

with b and σ of the appropriate dimensions. (Footnote 8: We would typically use μ in place of b, if not for the unfortunate potential confusion with μ as a name for a measure.) If z ∼ 𝒩(0, I), then x has mean

    E[x] = b

and variance matrix

    Var(x) = σ σᵀ .

Note that if Σ = σ σᵀ, we get Var(x) = Σ, so we can arrive at a given positive semidefinite Σ by taking σ to be its lower Cholesky factor. Also, in the special case of a one-dimensional Gaussian, this gives the familiar x = b + σ z.

We call this the forward parameterization because it's especially convenient for sampling, sometimes referred to as "running the model forward". Unfortunately, the cost of this convenience is a relatively awkward expression for the density. In this case, we start with x and need to solve for z, finally adjusting according to the determinant of the transformation:

    p_x(x) = p_z(z) / |det σ| = p_z(σ⁻¹ (x − b)) / |det σ| ,

where det σ is the determinant of the square matrix σ, and the second equality comes from solving x = σ z + b for z, which gives z = σ⁻¹ (x − b). More generally (when the transform is not expressed as a matrix), this role is played by the determinant of the Jacobian, det J_f.

This requires solving a linear system. Even with σ being lower-triangular, this involves division operations and the allocation of a temporary vector for storage of z.

In many cases, we prefer the density (or log-density, really) to be fast to evaluate. This leads us to a kind of dual approach to the above.

Inverse parameterization

An alternative parameterization of a multivariate Gaussian is in terms of its precision matrix, Λ = Σ⁻¹. Similarly to above, we'll write ω for the lower Cholesky factor of Λ, so Λ = ω ωᵀ. The parameterization is then specified by

    z = ωᵀ (x − b) .

Solving for z is of course now very simple, and the density becomes

    p_x(x) = p_z(ωᵀ (x − b)) · |det ω| .

In exchange, forward sampling becomes awkward,

    x = b + (ωᵀ)⁻¹ z .

Generalization

Given an injective map f and measures μ ≪ ν, the pushforward has density

    d(f_*μ)/d(f_*ν) (y) = (dμ/dν)(f⁻¹(y)) .

Note that there’s no Jacobian to be found! This is because the measure and base measure are either both transformed, or both not. In fact, the Jacobian only comes into play when we compute “across the transform”, and even then it’s not in every case.

Changing our notation slightly, the general case is

    d(f_*μ)/dν (y) = (d(f_*α)/dν)(y) · (dμ/dα)(f⁻¹(y)) ,

where α is the base measure of μ. This decomposes the problem into two subproblems. First we must compute d(f_*α)/dν. This plays the role of the determinant of the Jacobian, but is specific to the base measure α. In particular, α might be a discrete measure, in which case this factor is one. Finally, we compute (dμ/dα)(f⁻¹(y)), which is just the (pre-transformation) density, more familiar from previous discussion as dμ/dα.

For the Lebesgue case, if the Jacobian is not square but, say, n × k for k < n, the resulting measure will be embedded into a k-dimensional affine subspace of ℝⁿ. This can be convenient for low-rank modeling, which can be important for high-dimensional data. If σ has QR decomposition σ = QR, we can use Q in place of σ or ω above, since the columns of Q are orthonormal (so it's a change of basis and does not "stretch" the space). Our implementation is a variation of this that's equivalent but more efficient to compute.

14 Extensions

Despite MeasureTheory.jl being a very new package, there is already active work to build on it and to extend it.

PointProcesses.jl [Dalle_PointProcesses.jl] defines point processes, which in particular require the concept of a random measure.

ManifoldMeasures.jl [manifoldmeasures] implements measures on manifolds, using Hausdorff measure as the base measure.

MultivariateMeasures.jl [multivariatemeasures] gives high-performance implementations of logdensity for multivariate measures, using LoopVectorization.jl [loopvectorization].

Soss.jl[scherrer2020soss] is a probabilistic programming language that has recently adopted MeasureTheory.jl as a foundation. In particular, every Soss Model is an instance of AbstractMeasure, and has another Soss model as its base measure.

As mentioned in Section 11, ZigZagBoomerang.jl [https://doi.org/10.5281/zenodo.3931118] allows sampling with a spike and slab prior for sparse Bayesian inference and makes use of the freedom to choose appropriate reference measures. These models can be expressed using Soss.

Mitosis.jl [arxiv2010.03509] uses MeasureTheory.jl to represent Bayesian networks via Markov kernels and defines transformations on those.

15 Related work

Narayanan et al. [narayanan2016probabilistic] describe Hakaru, a system for Bayesian modeling using measures. Here, a measure is a functional (an integrator, mapping functions to real numbers) represented as a program. Hakaru's combinators are then expressed as compilers taking programs as inputs.

Radul and Alexeev [Radul2020] describe the base measure problem of losing track of a base measure when applying a transformation, and suggest standardizing around Hausdorff measure as a solution. This problem doesn’t arise for us, because the base measure is always taken into account.

Borgström et al. [Borgstr_m_2013] describe the Fun system in F# in terms of measure transformer semantics, but discuss only finite measures.

In Julia [bezanson2017julia], the Distributions.jl package [Distributions.jl-2019] is very popular for computations on distributions. The drawbacks of Distributions.jl are essentially those described in the first few sections of this paper. Current advantages over MeasureTheory are the extensive range of distributions it implements and its popularity and familiarity to many Julia users.

MeasureTheory.jl currently has Distributions.jl as a dependency, and uses it as a fall-back for many computations. For convenience, we also re-export the Distributions module under the handle Dists.

16 Conclusion

We have introduced the concepts and implementation of MeasureTheory.jl. This package is very new, so we expect there will be some changes as it matures. For this reason we have limited our discussion to aspects of the implementation we believe are relatively stable.

We hope this work can become a common foundation for probabilistic modeling in Julia. In particular, we believe this approach is especially well-suited for use in probabilistic programming, for which Julia has such a robust and active community.

We welcome discussion and community involvement with this package, as well as additional extensions to those we have described.

References