1 Why measures?
Distributions are an insufficient abstraction for probabilistic modeling.
Let’s first consider Bayesian modeling. In the posterior density of the parameter θ given the observation or data x,

p(θ | x) = p(x | θ) p(θ) / p(x),

the denominator p(x) is called the model evidence. Computing it efficiently is often difficult or infeasible.
Some sampling algorithms circumvent this problem and only require knowing the density up to a constant factor.
We can then work with the “unnormalized posterior density” instead, but we want to distinguish it from a proper probability density, because that difference determines which algorithms are available.
In developments like this using distributions, it’s common to begin in terms of a distribution, but then carry around the unnormalized posterior as a bare function. Fortunately, the meaning of this function is usually clear from context. But structurally, this representation is now divorced from its meaning: it’s no longer a Distribution object, so the tools from the original library can no longer help. This is a perspective we will take often.
Thus, starting with elements of our class of interest, a simple and common operation has led us to something outside of this class. This is analogous to the way polynomials over the reals lead to the complex numbers. If “the shortest path between two truths in the real domain passes through the complex domain” (Hadamard), then the same argument can be made for measures as connection between distributions.
As a second example, people working in Bayesian modeling sometimes use improper priors. Whatever one’s position on the merits of this, we think it’s reasonable to make this approach possible. But an improper prior does not integrate to one, so it’s not a distribution. Also, there’s an elegant correspondence between frequentist and Bayesian methods, where regularization corresponds to a prior. But limiting ourselves to distributions restrains us from fully realizing this, because the corresponding “prior” might be improper or even flat, as in maximum likelihood estimation.
A final and very different concern is the structure of most distribution libraries, in which distributions are classified primarily according to whether they are “discrete” or “continuous”. Such systems often lack facilities to use these in combination; some go so far as to encode the distinction in the type system, making them fundamentally incompatible.
But “discrete vs continuous” is a false dichotomy. For example, in three dimensions we can (and often do) work with distributions over points, lines, planes, or the entire space, or over spaces like a simplex or the surface of a sphere. These can be combined in rich ways, for example as a spike and slab prior for sparse modeling, or as a parameterized subspace for low-rank modeling.
We can address all of these points by extending the system we work with. Instead of distributions, our primary class of interest is measures, with distributions as a special case.
Of course, this special case is particularly useful. The point is not to disregard distributions, but to change our focus to one allowing a richer calculus for reasoning. Adapting Hadamard’s argument, “The shortest path between two distributional truths passes through nondistributional measures.”
Contributions
Our work is novel in several ways. MeasureTheory.jl has

- Explicitly represented base measures with the same sophistication as the rest of the system, in particular more than just “discrete or continuous”;

- A local approach for determining absolute continuity, which is usually a global characteristic;

- Multiple parameterizations for a given measure, without a proliferation of constructors; and

- Normalization and support constraints held separately from the data-dependent computation, allowing for greater efficiency.
Some parts of our approach can be seen in existing systems, though these are still far from universal:

- A rich set of combinators for building new measures from existing ones;

- Flexible type constraints, for example allowing measures with symbolic parameters;

- Lightweight measure construction, replacing a common assumption that once constructed, a measure will be used many times. This is especially important for probabilistic programming applications.
2 What are measures?
We’ll now describe some foundations to help the reader gain a deeper understanding of our approach: in particular, what measures and probability distributions are, and how they relate to notions such as the probability of something happening in the real world, or the volume of the space we live in.
Throughout this discussion, it’s important to keep in mind that for us the measure-theoretic abstractions are only a means to an end.^{1} As we anticipate interest in connecting with the actual theory, we give some pointers to more technical details in the footnotes.

^{1} The MathOverflow discussion https://mathoverflow.net/q/11591 covers books on measure theory; M.S. likes [Shiryaev1996] as a starting point and uses [Elstrodt2011] and [Bauer_Heinz19820101] as references.
Our primary interest, and the goal of MeasureTheory, is to offer support for applied probabilistic modeling. The package name reflects the fact that the word measure alone is too vague.
Given a space X of possible outcomes and a set of subsets of X called events, a probability distribution P assigns each event A a nonnegative quantity P[A], called the probability or probability mass. Likewise, a measure μ assigns each measurable set A a nonnegative quantity μ[A], also called a “mass”. The space X could be the space {1, …, 6} suitable to model a six-sided die, or might be the 3-dimensional Euclidean space ℝ³ suitable to model “volume”, to give two examples.
One further such space that deserves mention is traditionally denoted Ω. This is an abstract space connected to the real-world notion of probability, which we denote as ℙ. Computationally, Ω can be considered to be the set of possible states of a random number generator rng::AbstractRNG, which is the source of computational randomness.
Ω has a special role of tying together the random quantities a program produces and their distributions, through the notion of random variables, which are functions X: Ω → E from Ω taking values in spaces E, such as ℝ or ℝ³.^{2} The package Omega.jl [tavares2019language] uses random variables derived from Ω as a core principle.

^{2} Additionally, one requires a random variable to be a measurable function, that is, one for which the inverse image of an event of E is also an event of Ω. This is a similar, but much weaker, requirement than the definition of a continuous function (“the inverse image of an open set is also open”).
From a computational perspective, a Julia function taking only an rng::AbstractRNG argument, making a number of calls to rand(rng), and returning a value is a random variable. A simple example is X(rng=Random.GLOBAL_RNG) = rand(rng, 1:6).
While mathematically X is a function from Ω, the argument ω ∈ Ω is often hidden, like the function argument rng of its computational counterpart—not a coincidence.
Each random variable X is tied to its probability law P^X, the distribution assigning probability P^X[A] = ℙ[X ∈ A] to the events {X ∈ A} for each (measurable) set A.
In that sense, classical distribution packages restrict themselves very much to the task of providing a catalogue of useful probability laws, and a simple mechanism to provide a random draw (a random variable) with that law, as the function rand(rng, D).
In the example, P^X[A] = #A / 6, where #A denotes the number of elements of A.
In MeasureTheory, measures have abstract supertype AbstractMeasure.
Kolmogorov’s axioms
We now introduce Kolmogorov’s axioms, which describe laws that characterize probability distributions, and, with one exception, also measures.
Measures and probability distributions are both required to obey the axiom that the (probability) mass of a set obtained as the union of disjoint component sets^{3} equals the sum of the (probability) masses of the components. So if sets A and B are not disjoint, the mass of the union is computed from the mass of the components using inclusion–exclusion. For probabilities this is

P[A ∪ B] = P[A] + P[B] − P[A ∩ B].

Similarly for a measure μ,

μ[A ∪ B] = μ[A] + μ[B] − μ[A ∩ B].

In the case where A and B are disjoint, this reduces to

μ[A ∪ B] = μ[A] + μ[B],

a property called additivity.^{4}

^{3} Precisely: the union of a sequence of disjoint component sets.
^{4} The axioms also require a more general form of this to hold. If A₁, A₂, … are pairwise disjoint (that is, no two have any common elements), then it must be true that μ[⋃ᵢ Aᵢ] = Σᵢ μ[Aᵢ].
The reader has encountered many measures before, whether or not they were given that name. For example, the counting measure gives the number of elements of a set, and the Lebesgue measure gives length, area, volume, etc. in Euclidean space (CountingMeasure and LebesgueMeasure in the package).
So far our characterizations of measures and probability distributions are functionally identical. Indeed, the one distinguishing feature is the law of unit measure, which only for distributions requires that

P[X] = 1,

and also ℙ[Ω] = 1. Thus Ω can be considered the event “that anything at all will happen”. By this axiom, probability mass is a proportion, a quantity between 0 and 1. Only this axiom, which for measures is replaced by the weaker axiom μ[∅] = 0, sets the two apart. To reinforce that a distribution is also a measure, we’ll sometimes refer to it as a probability measure.
This close relationship between measures and probability measures is illustrated by the Lebesgue measure λ on ℝ. The measure of the full space is λ[ℝ] = ∞. But restricting λ to the unit interval [0, 1] gives a measure with total mass 1. This restricted measure is a probability distribution—the uniform distribution, describing the law of the Julia function rand() giving random numbers in the interval [0, 1).
We have not yet stated for which sets A Kolmogorov’s axioms have to hold for something to be called a measure or a probability:
σ-algebras and how to like them
A (probability) measure on X need not assign a (probability) mass to every possible subset A ⊂ X, but only to each event for a probability measure, or measurable set more generally. Statements we make about sets should be understood as restricted to this set of sets, which can be different from one measure to the next.
Now, Kolmogorov’s axioms require a given measure to be defined for the full space, and they also require that this set of measurable sets is closed under the operations which the axioms allow (complements and unions of sets or sequences of sets). This imposes an algebraic structure on the measurable sets, called a σ-algebra, which is exactly that: a set of sets (of events, in particular) closed under complementation and countable unions and intersections.
This raises the question: Why not just use the power set, the “set of all subsets”, which is, after thinking about it, such a σ-algebra?
A key observation is that having defined probability mass for a number of relatively simple sets, for example on all intervals, Kolmogorov’s axioms give a recipe to compute the probability mass of more complicated sets, and with those, even more in new applications of the axioms, and so on. One says the σ-algebra is generated by these sets.
Indeed, Distributions.jl does not give means to compute probability for arbitrary subsets of the floating point numbers, but gives a way of computing probability mass just of intervals of the form (−∞, x], through the function cdf.
In fact, for small enough spaces it is possible to assign meaningful probabilities to all subsets of the space, but this is impossible for any space containing a continuum such as the real line. Just representing a single arbitrary subset of the Float64 range, the computational abstraction of the real line, would require a staggering 2,306 petabytes of memory. This is not practical.
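The memory figure follows from a quick back-of-envelope calculation, which we can check directly (one indicator bit per possible Float64 bit pattern):

```julia
# An arbitrary subset of all 2^64 Float64 bit patterns needs one bit per value.
bits = 2.0^64                 # one indicator bit for each possible Float64
bytes = bits / 8              # = 2^61 bytes
petabytes = bytes / 1e15      # decimal petabytes

@assert round(Int, petabytes) == 2306   # ≈ 2,306 PB, as claimed
```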
But even mathematical measure theory needs to steer clear of the set of all subsets of uncountable spaces, because it turns out that not even volume can be properly defined for all subsets of Euclidean space.^{5}

^{5} Another thing to keep in mind is that, for a function to be measurable, “the inverse image of a measurable set must be measurable”. So allowing more sets to be measurable in the codomain requires either allowing more measurable sets in the domain, or restricting the set of measurable functions.
Accordingly, as of this writing, MeasureTheory does not have first-class σ-algebras, but rather considers them to be implicit to a given measure. Having said that, σ-algebras do have a role to play in applied measure theory, but we give only a pointer here.
The practical importance of σ-algebras here lies in the fact that they also encode available information. Observe that each event corresponds to a question, with a probabilistic answer. For example, an event {X ∈ [a, b]} corresponds to the question whether the random variable X is in the interval [a, b], and there is a probability that this will be the case. Assume we know whether some such questions A₁, A₂, … are answered (each with yes or no). Then we also know the answer to each complement (no or yes), and we always know that Ω has probability 1 and the corresponding question is answered by “yes”. Likewise, if both A and B can be answered given what we know, we also know the answer to A ∩ B. That means these known events form a σ-algebra, and in particular for each random variable X there is a σ-algebra associated with X which encodes what else we know knowing X, and how we should assign (conditional) probabilities to other events knowing X.
This is a very relevant question for probabilistic programming tasks and causal inference, and with kernels MeasureTheory does provide tools to reason about it which are computationally appropriate, because they are local.
Densities
A measure is often described in terms of its density. Before getting to a more technical discussion, we hope a physical analogy can help build some intuition. Imagine a wooden board with some knots in it, where we might be interested in the mass of some twodimensional cutout shape. This mass depends not only on the shape but also on its location and orientation, in particular the inclusion of knots. In this way we can start with a measure (here Lebesgue measure) and use a density (the physical density of the wood) to construct a new measure (the mass of any given cutout shape).
Of course, we do this all the time with distributions, building a continuous distribution in terms of a probability density function (pdf) over Lebesgue measure, or a discrete distribution in terms of a probability mass function (pmf), which is just a density over counting measure.
Somewhat more formally, a probability density p in Euclidean space is a local ratio of the probability assigned to an infinitesimally small volume (or area) around each point x, relative to that volume itself:

p(x) = lim_{r→0} P[Bᵣ(x)] / Vol[Bᵣ(x)],

where Bᵣ(x) is the ball of radius r around x.
“Volume” is not a distribution, but is in fact the Lebesgue measure described above. Discrete distributions can be expressed similarly using counting measure. There are certainly more events (sets) than outcomes (elements), so a probability density gives a local and parsimonious description of a distribution.
More generally, for measures μ and ν (and an absolute continuity condition, to be discussed), the density f = dμ/dν is characterized by

μ[A] = ∫_A f dν

for every measurable A. As we see, a density only makes sense relative to some base measure.
Limiting ourselves to distributions makes this awkward to even discuss, but allowing measures as first-class objects means we can make this characterization more explicit, and thus more flexible. What sets MeasureTheory apart from most libraries is that we don’t sweep this under the rug, but rather address it head-on.
In place of Lebesgue or counting measure, any measure we can express can play the role of ν above. This becomes crucial in high-dimensional spaces, where working with other reference measures such as a product of normal distributions becomes a numerical requirement. (Mathematically, there is no infinite-dimensional Lebesgue measure, so numerically, even high-dimensional Lebesgue measures can be problematic. [https://doi.org/10.5281/zenodo.3931118] for example allows one to express a target density with respect to a product of normal distributions using the Boomerang sampler [pmlr-v119-bierkens20a] for this reason.)

3 Locally dominated measures
One cannot expect to express a measure μ putting positive mass on a set A relative to a second measure ν, if nothing is there to compare, that is, if μ[A] > 0 but ν[A] = 0. You can’t “make something from nothing”.
Given measures μ and ν defined on a common space X, we say μ is dominated by ν if ν[A] = 0 implies μ[A] = 0 (or equivalently, μ[A] > 0 implies ν[A] > 0) for every measurable A. This is denoted by μ ≪ ν, and is sometimes equivalently read as “μ is absolutely continuous with respect to ν”.
The Radon–Nikodym Theorem states that μ ≪ ν holds exactly when there is a density function f, often written dμ/dν, with the property that for every measurable A,

μ[A] = ∫_A f dν.
The concept of absolute continuity is very useful for formal manipulations. However, the global nature of testing μ ≪ ν would require every support to be represented in a way that allows efficient computation of this relation. In particular, in applications based on Markov chain Monte Carlo, such global information about the measures at hand is not available.
Even with such a capability, this approach would have its problems. It seems likely a user getting an error in response to requesting a density would often respond by restricting measures accordingly until the request can be fulfilled.
Because of this, we instead define μ to be locally dominated by ν near x if there is some neighborhood U of x such that ν[U] > 0 (so it’s not a degenerate case) and μ|U ≪ ν|U (absolute continuity between the restricted measures). A local density can be defined similarly. In MeasureTheory, the density and logdensity functions work in exactly these terms. For the remainder of this paper, “(log)density” will always refer to the local (log)density, and “local” will often be taken as understood.
Density computations are typically done in log space. For example, if we’re interested in the density dμ/dν at x, we’ll typically instead compute logdensity(μ, ν, x). Here and below, we’ll often write the math in terms of densities and the code in terms of log-densities, despite the unfortunate inconsistency between the two.
There are several benefits to working in log space. Results are much less likely to overflow or underflow, and the many products and exponents become sums and products, respectively, which are much more efficient to compute and to differentiate.
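The underflow point is easy to demonstrate with plain floating point arithmetic (a standalone illustration, not package code):

```julia
# A product of many small density values underflows to 0.0 in Float64,
# while the equivalent sum of logs stays finite and exact.
vals = fill(1e-20, 100)          # e.g. 100 tiny likelihood factors

direct   = prod(vals)            # 1e-2000 underflows to 0.0
logspace = sum(log, vals)        # finite: 100 * log(1e-20)

@assert direct == 0.0
@assert isfinite(logspace)
@assert logspace ≈ 100 * log(1e-20)
```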
Working locally and in log space also gives us a convenient antisymmetry,

logdensity(μ, ν, x) == -logdensity(ν, μ, x)
In particular, logdensity(μ, ν, x) takes special values in some common cases. Thinking of the local density informally as a ratio of masses near x:

    μ near x   | ν near x   | logdensity(μ, ν, x)
    -----------|------------|--------------------
    positive   | positive   | (finite)
    zero       | positive   | -Inf
    positive   | zero       | Inf
    zero       | zero       | NaN
4 Base measures and density recursion
To this point, we’ve described the implementation in terms of the three-argument logdensity(μ, ν, x). But implementing things directly in these terms would require a definition for every pair of possible measures, which is of course intractable.
We address this by instead assigning every measure a base measure. This is defined locally, with default method

basemeasure(μ::AbstractMeasure, x) = basemeasure(μ)

This allows users to define a base measure specific to a given neighborhood if they prefer, or to define a global base measure when that suffices.
The log-density is then fundamentally expressed in terms of the two-argument logdensity(μ, x), which gives the log-density with respect to this base measure. Thus we have

logdensity(μ, x) == logdensity(μ, basemeasure(μ, x), x)
This leaves us with the question of how to compute the three-argument logdensity(μ, ν, x) for general μ and ν. We implement this recursively, in an approach we refer to as density recursion. If α is the base measure of μ and β is the base measure of ν, then

logdensity(μ, ν, x) == logdensity(μ, α, x) + logdensity(α, β, x) - logdensity(ν, β, x)
In the implementation, logdensity(μ, α, x) and logdensity(ν, β, x) can be computed directly using the two-argument logdensity. There may be a specific method for logdensity(α, β, x); if not, it will require a recursion step.
In this process, we may reach a measure that serves as its own base measure. Such measures are primitive, for example Lebesgue or counting measure. If logdensity is called on two primitive measures with no explicit method, an error is thrown.
The above serves as a high-level specification of the computational process. The low-level implementation will differ in some ways as we continue to tune it for performance.
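A minimal standalone sketch may make the recursion concrete. This is illustrative only (the `Measure`, `Lebesgue`, `StdNormal`, and `Scaled` names here are toy stand-ins, not the package’s types, and the error path for mismatched primitives is omitted):

```julia
# Toy density recursion: every measure has a base measure; the two-argument
# logdensity is taken with respect to that base, and the three-argument
# version recurses through base measures until a common measure is reached.
abstract type Measure end

struct Lebesgue <: Measure end        # primitive: serves as its own base
struct StdNormal <: Measure end       # base measure: Lebesgue
struct Scaled <: Measure              # constant density c over StdNormal
    c::Float64
end

basemeasure(::Lebesgue)  = Lebesgue()
basemeasure(::StdNormal) = Lebesgue()
basemeasure(::Scaled)    = StdNormal()

logdensity(::Lebesgue, x)   = 0.0                      # w.r.t. itself
logdensity(::StdNormal, x)  = -x^2 / 2 - log(2π) / 2   # w.r.t. Lebesgue
logdensity(m::Scaled, x)    = log(m.c)                 # w.r.t. StdNormal

# three-argument form: recurse through the base measures α and β
function logdensity(μ::Measure, ν::Measure, x)
    μ === ν && return 0.0
    α, β = basemeasure(μ), basemeasure(ν)
    return logdensity(μ, x) - logdensity(ν, x) + logdensity(α, β, x)
end

# density of Scaled(2.0) w.r.t. Lebesgue: log(2) plus the normal log-pdf
@assert logdensity(Scaled(2.0), Lebesgue(), 1.0) ≈ log(2.0) - 1/2 - log(2π)/2
# the antisymmetry from the previous section holds as well
@assert logdensity(Lebesgue(), Scaled(2.0), 1.0) ≈ -logdensity(Scaled(2.0), Lebesgue(), 1.0)
```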
5 Parameterized measures
A common challenge in building a library of distributions is the choice of parameterizations. For example, in Stan [Stan], a negative binomial distribution is parameterized by α and β, where

P[X = n] = binomial(n + α − 1, n) (β / (β + 1))^α (1 / (β + 1))^n.

In the Julia package Distributions.jl [Distributions.jl2019], this is instead given by

P[X = n] = binomial(n + r − 1, n) p^r (1 − p)^n.

These are equivalent (let r = α and p = β / (β + 1)), and the inconsistency alone gives some evidence that it might be reasonable to prefer one or the other depending on the circumstances. Yet most libraries only allow one or the other (or yet another alternative) as the negative binomial distribution. Other parameterizations require entirely different names.
In MeasureTheory, our approach avoids this problem. A parameterized measure is defined by a struct of the appropriate type with a named tuple par field. For example,

struct NegativeBinomial{N,T} <: ParameterizedMeasure{N}
    par :: NamedTuple{N,T}
end

We can then write

NegativeBinomial(r=10, p=0.75)

or

NegativeBinomial(α=10, β=3)
Calls to rand, logdensity, etc. then delegate to methods according to the appropriate names. We use KeywordCalls.jl [KeywordCalls.jl], so all names are resolved statically at compile time.
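The idea of dispatching on parameter names can be sketched in a few lines of plain Julia. This is a simplified illustration (a runtime-ordered NamedTuple rather than KeywordCalls.jl’s static name resolution, and `NegBin`/`prob` are hypothetical names):

```julia
# Two parameterizations of one measure, distinguished by NamedTuple names.
struct NegBin{N,T}
    par::NamedTuple{N,T}
end
NegBin(; kwargs...) = NegBin((; kwargs...))

# success probability under either parameterization, using p = β / (1 + β)
prob(d::NegBin{(:r, :p)}) = d.par.p
prob(d::NegBin{(:α, :β)}) = d.par.β / (1 + d.par.β)

@assert prob(NegBin(r = 10, p = 0.75)) == 0.75
@assert prob(NegBin(α = 10, β = 3))    == 0.75   # same measure, other names
```

Note that this naive version is sensitive to keyword order, which is one of the problems KeywordCalls.jl solves.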
6 Kernels
A kernel is a (measurable^{6}) function κ that returns a measure. Equivalently, it represents a family of measures parameterized by its argument. Writing M(Y) for “measures on Y”, we can write this as

κ: X → M(Y).

^{6} That is, the function x ↦ κ(x)[B] must be measurable for every fixed measurable B.
A prominent application is a conditional distribution. In this case κ is further restricted to be a Markov kernel,

κ: X → P(Y),

where P(Y) represents “probability measures on Y”. If ξ and η are random variables taking values in X and Y, such a kernel defines the conditional distribution of η given ξ, assigning every measurable B the probability

ℙ[η ∈ B | ξ = x] = κ(x)[B].
In MeasureTheory, a kernel is represented as either a Julia function or a Kernel object. The latter pairs a measure constructor with a mapping into its parameter space, and makes the functional relationship between the argument of κ and the returned measure transparent.
A Markov kernel, in particular, corresponds to the distribution of a parametrized random variable, a function with random outcomes (or to “mechanisms” in causal inference, for example in [https://doi.org/10.5281/zenodo.1005091]). For example in MeasureTheory, the kernel

κ(x) = Normal(μ = x, σ = √x)

can be expressed as

κ = kernel(Normal) do x
    (μ = x, σ = √x)
end

or equivalently

κ = kernel(Normal; μ = identity, σ = sqrt)

corresponding to the random function

f(rng, x) = x + √x * randn(rng)
In this example, working in terms of a kernel allows us to make the linear relationship between the argument x and mean(κ(x)) transparent.
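Stripped of the package machinery, a kernel is just a function from a value to (the parameters of) a measure. A standalone sketch of the example above, with the returned measure represented as a bare NamedTuple of parameters (illustrative names, not the package API):

```julia
using Random

# A kernel as a plain function: x ↦ parameters of Normal(μ = x, σ = √x).
κ(x) = (μ = x, σ = sqrt(x))

# The corresponding random function ("running the kernel forward").
f(rng, x) = κ(x).μ + κ(x).σ * randn(rng)

@assert κ(4.0) == (μ = 4.0, σ = 2.0)           # mean is linear in x
@assert f(MersenneTwister(1), 4.0) isa Float64  # a draw from κ(4.0)
```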
This could of course also be expressed as a parameterized measure. The difference is that kernels are lighter weight to build and have all the dynamism of functions, while the static nature of parameterized measures can make it easier to express some optimizations.
Finally, kernels are key to expressing likelihoods, which we can use to formulate Bayesian conjugacy properties. We’ll discuss this further in Section 9.
7 Markov chains
Some kernels map elements of a state space S to measures on the same space. In this case, repeated application of the kernel “walks” through S, yielding the law of a Markov chain. We support this in MeasureTheory through the Chain combinator. This takes a kernel (or function) and a starting measure, and can be used with Julia’s convenient do notation. For example, we can write a Gaussian random walk as
mc = Chain(Normal(μ=0.0)) do x
    Normal(μ=x)
end
A random sample from a Chain is an infinite sequence expressed as an iterator, e.g.

julia> r = rand(mc);

julia> x = Iterators.take(r, 3) |> collect
3-element Vector{Any}:
 0.4931543737034523
 0.5661895116186417
 1.3286977670590228
Thus, for example, we can evaluate mc on the x we just computed:

julia> logdensity(mc, x)
0.4149771036439342
Crucially, the call to rand builds a struct containing a random seed, to ensure that repeated realizations of r will be identical.
To give us more control over the sampling procedure, Chain is implemented using dynamic iterators from DynamicIterators.jl [DynamicIterators.jl].
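The seed-storing trick can be sketched without the package: keep the seed rather than the realized values, and rebuild the RNG on each traversal so repeated realizations agree (a toy Gaussian random walk; `WalkSample` and `take_walk` are illustrative names):

```julia
using Random

# A "sample" from an infinite random walk, stored as just a seed.
struct WalkSample
    seed::Int
end

# Realize the first n steps of the walk x_{k+1} ~ Normal(μ = x_k).
function take_walk(s::WalkSample, n)
    rng = MersenneTwister(s.seed)   # fresh RNG from the stored seed
    xs = Float64[]
    x = 0.0
    for _ in 1:n
        x += randn(rng)             # Gaussian increment around current state
        push!(xs, x)
    end
    return xs
end

r = WalkSample(42)
@assert take_walk(r, 3) == take_walk(r, 3)        # realizations are identical
@assert take_walk(r, 5)[1:3] == take_walk(r, 3)   # longer prefixes agree too
```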
The appropriate σ-algebra for infinite sequences requires some care. A Markov chain on S has σ-algebra generated by sets of sequences with finitely many constraints.^{7}

^{7} That is, the smallest σ-algebra containing all sets of sequences which can be written using collections of measurable sets indexed by finite index sets, called cylinder sets.
8 Densities and integrals
In working with measures, two computations that come up quite often are (Radon–Nikodym) differentiation and (Lebesgue) integration. It’s often important to be able to do things like

1. For measures μ and ν, obtain the density f = dμ/dν;

2. Given a measure ν and a function f, obtain a measure μ with

   μ[A] = ∫_A f dν.
Note that (1) is different from our current density function, because no x is specified. Computationally, this is called a closure.
In MeasureTheory, we write these as f = ∂(μ, ν) and μ = ∫(f, ν). This allows the Radon–Nikodym theorem to be expressed as

f == ∂(∫(f, ν), ν)

and

μ == ∫(∂(μ, ν), ν)
Because we often want to work instead in log space, we also include (and tend to prefer) the functions log∂ and ∫exp, for which we would write ℓ = log∂(μ, ν) and μ = ∫exp(ℓ, ν), which follow a similar law,

ℓ == log∂(∫exp(ℓ, ν), ν)

and

μ == ∫exp(log∂(μ, ν), ν)
Finally, in terms of density and logdensity, we have

∂(μ, ν)(x) == density(μ, ν, x)
log∂(μ, ν)(x) == logdensity(μ, ν, x)
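The round-trip law has a simple computational reading: ∫exp attaches a log-density to a base measure, and log∂ reads it back off. A toy standalone model (the `Meas` struct is an illustrative stand-in, not the package’s representation, and the sketch only handles the case where the base measures match directly):

```julia
# Represent a measure as a base measure plus a log-density over that base.
struct Meas{B,F}
    base::B      # base measure (here `nothing` marks a primitive stand-in)
    ℓ::F         # log-density with respect to `base`
end

# ∫exp(ℓ, ν): build the measure with log-density ℓ over ν
∫exp(ℓ, ν) = Meas(ν, ℓ)

# log∂(μ, ν): recover the log-density of μ with respect to ν
log∂(μ::Meas, ν) = μ.base === ν ? μ.ℓ : error("general case needs recursion")

lebesgue = Meas(nothing, x -> 0.0)     # primitive stand-in for Lebesgue
ℓ(x) = -x^2 / 2                        # an unnormalized Gaussian log-density
μ = ∫exp(ℓ, lebesgue)

# the law ℓ == log∂(∫exp(ℓ, ν), ν), checked pointwise
@assert log∂(μ, lebesgue)(1.5) == ℓ(1.5)
```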
9 Pointwise products and likelihoods
For a parameter θ and data x, suppose we have a prior π(θ) (momentarily assuming Lebesgue base measure) and likelihood p(x | θ). Then Bayes’s Law gives the posterior density

p(θ | x) = p(x | θ) π(θ) / p(x).

In practice, we rarely know the normalization factor p(x), so we often work in terms of the unnormalized posterior density,

p(θ | x) ∝ p(x | θ) π(θ).

Because this does not include the normalization, it does not integrate to one, and so it’s not a probability density function. But there is no such problem with taking it as a density for a measure; the measure defined in this way has the same base measure as that of the prior.
This pointwise product in probability often describes the fusion of information, in this case the fusion of information from the prior and from the observations via the likelihood, but it also shows up in related situations, such as sensor fusion in signal processing.
More generally, this “prior times likelihood” can be considered as an operation that takes a measure and a function, and returns a new measure. Given a measure μ with base measure ν, and a function ℓ defined on the support of μ, the pointwise product μ ⊙ ℓ is a measure with

d(μ ⊙ ℓ)/dν (x) = ℓ(x) · dμ/dν (x).
In the special case where ℓ is a likelihood, it’s important to keep track of whether it’s represented in log space. For this, we have a Likelihood type that “knows” about the measure it was derived from and adapts according to whether the user calls density or logdensity. The general method

⊙(::AbstractMeasure, ::Likelihood)

defaults to constructing a PointwiseProductMeasure to hold the prior and likelihood, but we’ll be extending this with more efficient methods for cases with conjugate priors.
In the absence of conjugacy, the extra structure of this suspended form is convenient, for example to build approximations to a given measure of interest.
The above Likelihood type is specified using a kernel and observed data. Though it is not an AbstractMeasure, we include a logdensity method for computational convenience. The core of the implementation is then

struct Likelihood{F,S,X}
    k::Kernel{F,S}
    x::X
end

function logdensity(ℓ::Likelihood, p)
    return logdensity(ℓ.k(p), ℓ.x)
end
10 Product and power measures
Given measures μ on X and ν on Y, the (independent) product measure μ ⊗ ν is a measure on the product space X × Y. Just as Lebesgue measure on ℝ is generated by defining its value on intervals, the product measure is generated by defining

(μ ⊗ ν)[A × B] = μ[A] · ν[B]

for any measurable A and B. MeasureTheory code uses exactly this notation, μ ⊗ ν. This can be extended recursively to products of any finite number of measures.
If μ is defined in terms of a base measure α, and likewise ν over β, the product measure has base measure α ⊗ β and density

d(μ ⊗ ν)/d(α ⊗ β) (x, y) = dμ/dα (x) · dν/dβ (y).
A special case of this is when μ = ν. In MeasureTheory we refer to this as a power measure, written μ^2, or μ^n for higher dimensions. More generally, the second argument can be a Tuple, so for example μ^(2,3) extends μ as a product measure over matrices.
A second special case is when we have a collection of values together with a kernel. For this we use the For combinator. So in a regression model, we might express the response as coming from

For(1:n) do j
    Normal(β * x[j], σ)
end

This is exactly the product measure

⊗_{j=1}^n Normal(β * x[j], σ).
More generally, suppose we have a measure μ on a space X and a kernel κ: X → M(Y). Then we can define a measure μ ⊗ κ on X × Y as follows. For a given x, let μ have base measure α and κ(x) have base measure β. Then similarly to the independent product, we have

d(μ ⊗ κ)/d(α ⊗ β) (x, y) = dμ/dα (x) · dκ(x)/dβ (y).

Because the types are always available to help us distinguish this from the independent case, it’s safe to use the same notation, μ ⊗ κ.
Note that this defines a function

⊗: M(X) × (X → M(Y)) → M(X × Y).

In functional programming, this (together with some laws) is called a monadic bind [ramsey2002stochastic].
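Computationally, the key fact about product measures is that their log-density is a sum of component log-densities. A standalone sketch of a For-style product (the helper names here are illustrative; `logdensity_normal` is the full normal log-pdf, where the package’s two-argument form would drop the constant terms):

```julia
# Log-density of Normal(m, s) at x, with respect to Lebesgue measure.
logdensity_normal(m, s, x) = -0.5 * ((x - m) / s)^2 - log(s) - 0.5 * log(2π)

# For-style product over a collection of means: sum of component log-densities.
logdensity_for(ms, s, xs) =
    sum(logdensity_normal(m, s, x) for (m, x) in zip(ms, xs))

xs = [0.1, -0.3, 0.7]
ms = [0.0, 0.0, 1.0]
total = logdensity_for(ms, 1.0, xs)

# hand-computed value: -(0.005 + 0.045 + 0.045) - 3 * log(2π)/2
@assert isapprox(total, -2.8518156; atol = 1e-6)
```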
11 Superposition and mixtures
A superposition is the measure-theoretic analog of a mixture model. For measures μ = ∫(f, α) and ν = ∫(g, β), the superposition, written μ + ν, is a measure with base measure α + β and density

d(μ + ν)/d(α + β) (x) = f(x) · dα/d(α + β) (x) + g(x) · dβ/d(α + β) (x)
                      = (f(x) + g(x) · dβ/dα (x)) · (1 + dβ/dα (x))⁻¹.

Using the inverse in this final line allows dβ/dα (x) to be computed once and then reused. Also note that in the special case where α = β, this reduces to (f + g) / 2.
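In log space, adding two densities requires exponentiating log-densities, which risks overflow and underflow; the standard remedy is the log-sum-exp trick. A standalone sketch for the case where both components share the same base measure, so the superposition density is the average (f + g) / 2 of the component densities (`logaddexp` and `superpose_logdensity` are illustrative names):

```julia
# Numerically stable log(exp(a) + exp(b)).
logaddexp(a, b) = (m = max(a, b); m == -Inf ? -Inf : m + log(exp(a - m) + exp(b - m)))

# Log-density of the superposition of two components with a shared base
# measure, given their log-densities ℓf and ℓg at a point.
superpose_logdensity(ℓf, ℓg) = logaddexp(ℓf, ℓg) - log(2)

@assert superpose_logdensity(log(0.3), log(0.5)) ≈ log((0.3 + 0.5) / 2)
@assert superpose_logdensity(-1000.0, -1001.0) > -Inf   # naive exp would underflow
```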
An important special case of superposition occurs when μ and ν are finite measures on some space X, and μ[X] + ν[X] = 1. This is equivalent to a convex combination of probability densities, called a mixture.
Conveniently, a product of measures distributes over superposition. That is, if μ and ν are measures on a common space X, and α and β are measures on Y, then

(μ + ν) ⊗ α = μ ⊗ α + ν ⊗ α

and

μ ⊗ (α + β) = μ ⊗ α + μ ⊗ β.
A very common special case of superposition is a spike and slab prior, a mixture of a Dirac measure (a point mass) with a continuous measure.
This is useful for sparse Bayesian modeling, as implemented in the sparse ZigZag sampler, described in [arxiv2103.08478] and implemented in [https://doi.org/10.5281/zenodo.3931118] using MeasureTheory.
12 Density decomposition
Especially for probability measures, it’s common for the log-density with respect to Lebesgue or counting measure to have several types of terms. Given an observation x and any relevant parameters θ, there are typically

1. Terms that are data-dependent, each involving some nontrivial function of x (and possibly also of the parameters),

2. Terms that are parameter-dependent, and

3. Terms that are constant.
In addition, it’s common to have an “argument check” to be sure x is in the support of the distribution.
Depending on the application, we can often ignore some of these. For example, we may know x is in the support by construction, or from a previous check. In these cases, any use of resources to check arguments is wasteful.
There are many cases where it’s important for performance to have constant and parameter-dependent terms separated from data-dependent ones. Suppose μ is a measure with log-density of the form

ℓ(x) = f(x) + g(θ) + C.

If we instead observe an iid product of n observations x = (x₁, …, xₙ), the log-density is

ℓ(x) = Σᵢ [f(xᵢ) + g(θ) + C] = Σᵢ f(xᵢ) + n (g(θ) + C).

This final form can be much more efficient, because it reduces repeated computations of g(θ) + C to one.
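The factorization is easy to check numerically. A standalone sketch for a Normal(θ, 1) log-density, with illustrative helper names (`naive`, `factored`):

```julia
# Decompose the Normal(θ, 1) log-density as ℓ(x) = f(x, θ) + g(θ) + C.
f(x, θ) = -0.5 * (x - θ)^2     # data-dependent term
g(θ) = 0.0                     # parameter-dependent term (trivial when σ = 1)
C = -0.5 * log(2π)             # constant normalization term

# iid log-density computed two ways: term-by-term, and with g(θ) + C hoisted
naive(xs, θ)    = sum(f(x, θ) + g(θ) + C for x in xs)
factored(xs, θ) = sum(f(x, θ) for x in xs) + length(xs) * (g(θ) + C)

xs = randn(100)
@assert naive(xs, 0.3) ≈ factored(xs, 0.3)   # same value, fewer g and C evaluations
```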
In some cases it’s important to avoid computing the normalization at all. For correlation matrices, it’s common to use the LKJ prior [LEWANDOWSKI20091989].
Rather than work with correlation matrices directly, it’s convenient to work in terms of the Cholesky decomposition. For this purpose, MeasureTheory includes an LKJCholesky measure. This is typically used as a prior with fixed parameters k and η, which give the dimensionality and “concentration” of the measure. The relative cost of normalization (which is irrelevant for MCMC) can be computed as
function relative_normcost(k, η)
    μ = LKJCholesky(k, η)
    L = rand(μ).L
    f_cost = @belapsed logdensity($μ, $L)
    g_plus_C_cost = @belapsed Dists.lkj_logc0($k, $η)
    return g_plus_C_cost / (g_plus_C_cost + f_cost)
end
So for example, in 10 dimensions with η = 2.0 this gives

julia> relative_normcost(10, 2.0)
0.76428528899935
That is, 76% of the time is spent in normalization. If our application doesn’t need it, three-fourths of the computation time is simply wasted.
For these reasons, we break the representation of the log-density into several pieces (now additive terms in log-space):

Constant terms, C.

Parameter-dependent terms, g(θ).

Data-dependent terms, f(x).
The two-argument logdensity
then computes only the data-dependent terms, with the constant and parameter-dependent terms pushed to the base measure. This makes it easy to defer computation of these terms until they are required.
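This deferral can be sketched with a toy structure (ours, not the package’s actual types): each measure carries only its data-dependent term, and the remaining terms live further down the base-measure chain:

```julia
# Toy sketch: a "measure" stores a data-dependent log-density term and a
# base measure; the g(θ) + C terms are pushed into the base.
struct ToyMeasure
    f::Function                        # data-dependent term
    base::Union{ToyMeasure,Nothing}    # base measure, if any
end

# Two-argument form: only the data-dependent terms of this measure.
logdensity(m::ToyMeasure, x) = m.f(x)

# The full log-density accumulates terms down the base-measure chain,
# so the extra terms are computed only when actually requested.
logdensity_full(m::ToyMeasure, x) =
    m.base === nothing ? m.f(x) : m.f(x) + logdensity_full(m.base, x)
```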
13 Affine transforms
A particularly expressive way to build new measures in terms of existing ones is through a pushforward. Given a measure μ and a function f defined on its support, the pushforward of μ through f is a measure f∗μ defined by
(f∗μ)(S) = μ(f⁻¹(S)) .
In the following sections, we first discuss this in the context of probability measures before turning to the more general case.
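For intuition (our toy example, not package code), a pushforward of a discrete measure simply relocates mass: if μ puts weight w on a point z, then f∗μ puts weight w on f(z):

```julia
# Discrete measure as a point => weight dictionary; f is injective on the
# support {1, 2}, so no mass ever merges.
μ = Dict(1 => 0.2, 2 => 0.8)
f(z) = z^2
fμ = Dict(f(z) => w for (z, w) in μ)   # pushforward: same weights, moved points
```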
Forward parameterization
Starting with a k-dimensional multivariate random variable z, an affine transform
𝒜(z) = σz + b
is a “linear transform with a shift”. We can use this to define a new random variable x, as
x = 𝒜(z) = σz + b ,
with σ and b of the appropriate dimensions.⁸ (⁸We would typically use μ in place of b, if not for the unfortunate potential confusion with μ as a name for a measure.) If z ∼ Normal()^k, then x has mean b
and variance matrix
V(x) = σσᵀ .
Note that if Σ = σσᵀ, we get V(x) = Σ, so we can arrive at a given positive semidefinite Σ by taking σ to be its lower Cholesky factor. Also, in the special case of a one-dimensional Gaussian, this gives the familiar x = μ + σz.
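For example (a small numeric check with an arbitrary Σ of our choosing), the lower Cholesky factor recovers a target covariance exactly:

```julia
using LinearAlgebra

Σ = [4.0 2.0; 2.0 3.0]     # target positive definite covariance
σ = cholesky(Σ).L          # lower-triangular factor, so σσᵀ = Σ
```

Using this σ in x = σz + b yields a variable with covariance exactly Σ.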
We call this the forward parameterization because it’s especially convenient for sampling, sometimes referred to as “running the model forward”. Unfortunately, the cost of this convenience is a relatively awkward expression for the density. In this case, we start with x and need to solve x = σz + b for z, finally adjusting according to the determinant of the transformation:
p_x(x) = p(z) / |σ| = p(σ⁻¹(x − b)) / |σ| ,
where |σ| is the determinant of the square matrix σ, and the second equality comes from solving x = σz + b for z, which gives z = σ⁻¹(x − b). More generally (when the transform is not expressed as a matrix), this role is played by the determinant of the Jacobian, |J|.
This requires solving a linear system. Even with σ being lower-triangular, this involves k division operations and the allocation of a temporary vector for storage of z.
In many cases, we prefer the density (or logdensity, really) to be fast to evaluate. This leads us to a kind of dual approach to the above.
Inverse parameterization
An alternative parameterization of a multivariate Gaussian is in terms of its precision matrix, Λ = Σ⁻¹. Similarly to above, we’ll write λ for the lower Cholesky factor of Λ, so Λ = λλᵀ. The parameterization is then specified by
x = λ⁻ᵀ z + b .
Solving for z is of course now very simple, z = λᵀ(x − b), and the density becomes
p_x(x) = |λ| p(λᵀ(x − b)) .
In exchange, forward sampling becomes awkward: computing x = λ⁻ᵀz + b requires a triangular solve.
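The two parameterizations can be checked against each other numerically (a sketch with arbitrary values of our choosing; the two z vectors differ by an orthogonal transform, but the Gaussian quadratic form (x − b)ᵀΣ⁻¹(x − b) agrees):

```julia
using LinearAlgebra

Σ = [4.0 2.0; 2.0 3.0]
b = [1.0, -1.0]
x = [0.3, 2.0]

σ = cholesky(Σ).L                    # forward:  x = σz + b  ⇒  z = σ \ (x - b)
λ = cholesky(Hermitian(inv(Σ))).L    # inverse:  z = λᵀ(x - b), a plain multiply

z_fwd = σ \ (x - b)                  # triangular solve, allocates a temporary
z_inv = λ' * (x - b)                 # no solve needed
```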
Generalization
Given an injective map f and measures μ and ν on its domain, the pushforward f∗μ has density
d(f∗μ)/d(f∗ν) (y) = dμ/dν (f⁻¹(y)) .
Note that there’s no Jacobian to be found! This is because the measure and base measure are either both transformed, or both not. In fact, the Jacobian only comes into play when we compute “across the transform”, and even then not in every case.
Changing our notation slightly, the general case is
d(f∗μ)/dν (y) = d(f∗ν)/dν (y) · dμ/dν (f⁻¹(y)) .
This decomposes the problem into two subproblems. First we must compute d(f∗ν)/dν. This plays the role of the determinant of the Jacobian, but is specific to the base measure ν. In particular, ν might be a discrete measure, in which case this factor is one. Finally, we compute dμ/dν (f⁻¹(y)), which is just the (pre-transformation) density, more familiar from previous discussion as p(z).
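As a scalar sanity check (our own example, with f(z) = az + b and ν taken to be Lebesgue measure), the two factors recover the usual change-of-variables result:

```julia
# dμ/dν: standard normal density with respect to Lebesgue measure
p(z) = exp(-z^2 / 2) / sqrt(2π)

a, b = 2.0, 1.0
finv(y) = (y - b) / a        # f⁻¹ for f(z) = a*z + b
jacfactor = 1 / abs(a)       # d(f∗ν)/dν; constant here since f is affine

# Density of the pushforward f∗μ with respect to ν: the product of the two
# subproblem results.
density(y) = p(finv(y)) * jacfactor
```

This agrees with the density of a normal with mean b and standard deviation a.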
For the Lebesgue case, if the Jacobian is not square but, say, σ ∈ ℝ^(n×k) for n > k, the resulting measure will be embedded into a k-dimensional affine subspace of
ℝⁿ. This can be convenient for low-rank modeling, which can be important for high-dimensional data. If σ
has QR decomposition
σ = QR, we can use Q in place of σ or λ above, since the columns of Q are orthonormal (so it’s a change of basis and does not “stretch” the space). Our implementation is a variation of this that’s equivalent but more efficient to compute.

14 Extensions
Despite MeasureTheory.jl
being a very new package, there is already active work to build on it and to extend it.
PointProcesses.jl
[Dalle_PointProcesses.jl] defines point processes, in particular requiring the concept of a random measure.
ManifoldMeasures.jl
[manifoldmeasures] implements measures on manifolds, using Hausdorff measure as the base measure.
MultivariateMeasures.jl
[multivariatemeasures] gives high-performance implementations of logdensity
for multivariate measures, using LoopVectorization.jl
[loopvectorization].
Soss.jl
[scherrer2020soss] is a probabilistic programming language that has recently adopted MeasureTheory.jl
as a foundation. In particular, every Soss Model
is an instance of AbstractMeasure
, and has another Soss model as its base measure.
As mentioned in Section 11, ZigZagBoomerang.jl
[https://doi.org/10.5281/zenodo.3931118]
allows sampling with a spike-and-slab prior for sparse Bayesian inference and makes use of the freedom to choose appropriate reference measures. These models can be expressed using Soss.
Mitosis.jl
[arxiv2010.03509] uses MeasureTheory.jl
to represent Bayesian networks via Markov kernels and defines transformations on those.
15 Related work
Narayanan et al. [narayanan2016probabilistic] describe Hakaru
, a system for Bayesian modeling using measures. Here, a measure is a functional
represented as a program. Hakaru’s combinators are then expressed as compilers taking programs as inputs.
Radul and Alexeev [Radul2020] describe the base measure problem of losing track of a base measure when applying a transformation, and suggest standardizing around Hausdorff measure as a solution. This problem doesn’t arise for us, because the base measure is always taken into account.
Borgström et al. [Borgstr_m_2013] describe the Fun
system in F# in terms of measure transformer semantics, but discuss only finite measures.
In Julia [bezanson2017julia], the Distributions.jl
package [Distributions.jl2019] is very popular for computations on distributions. The drawbacks of Distributions.jl
are essentially those described in the first few sections of this paper.
Current advantages over MeasureTheory
are the extensive range of distributions it implements and its popularity and familiarity to many Julia users.
MeasureTheory.jl
currently has Distributions.jl
as a dependency, and uses it as a fallback for many computations. For convenience, we also reexport the Distributions
module under the handle Dists
.
16 Conclusion
We have introduced the concepts and implementation of MeasureTheory.jl
. This package is very new, so we expect there will be some changes as it matures. For this reason we have limited our discussion to aspects of the implementation we believe are relatively stable.
We hope this work can become a common foundation for probabilistic modeling in Julia. In particular, we believe this approach is especially well-suited for use in probabilistic programming, for which Julia has such a robust and active community.
We welcome discussion and community involvement with this package, as well as additional extensions to those we have described.