Gaussian (multivariate normal) distributions are a small but expressive family of probability distributions with a variety of applications, from ridge regression and Kálmán filters to Gaussian process regression. An important feature is that Gaussians are self-conjugate, that is, conditional distributions of Gaussians are themselves Gaussian. For example, in a regression problem, we may put Gaussian priors on the regression coefficients (see footnote 1) and let . We can then make noisy or exact observations for datapoints and obtain Gaussian posteriors over both the regression coefficients and the predicted value.
Footnote 1: Throughout, the second parameter of a normal distribution denotes the variance (and not the standard deviation), for consistency with the covariance matrices in the multivariate case. We make no further mention of standard deviations.
An important question is then how to model complete absence of information in such a setting, say over coefficients or data. Conditioning on such absent information should have no effect. However, no Gaussian distribution is ever completely uninformative, because it is integrable and introduces a bias towards its mean. Intuitively, the uninformative prior over the real line should be the limit of $\mathcal{N}(0, \sigma^2)$ for $\sigma^2 \to \infty$. However, this limit cannot be taken in any ordinary sense because it assigns measure $0$ to every bounded set (it converges weakly to the zero measure, which is not a probability measure). The Lebesgue measure is another candidate; however, it is not a probability measure because it fails to be normalized. Nonetheless, so-called improper priors like the Lebesgue measure are frequently used in density computations, because after conditioning, they may result in a well-defined and normalized posterior (e.g. [gelman]). We prefer to give a formal account of this phenomenon which completely avoids non-normalization:
In Section 2, we present a theory of extended Gaussian distributions, which faithfully include ordinary Gaussian distributions while adding in uninformative priors. Extended Gaussian variables can be added, scaled and conditioned just like ordinary Gaussians. We generalize the natural duality between precision and covariance for ordinary Gaussians to the extended case (Section 2.2).
In Section 3, we generalize extended Gaussian distributions to a category of extended Gaussian maps between vector spaces. This category combines features from Gaussian probability as well as nondeterminism, which is modelled by linear relations. This is surprising because probability and nondeterminism are difficult to combine in general [nogo]. We show that it is an instance of a general construction: For a suitable type of ‘decoration’, we define the categories of decorated linear maps and of decorated linear relations, from which linear relations, affine relations, ordinary Gaussians and extended Gaussians are obtained for different choices of decoration. This joint formalism lets us elegantly establish relationships between those categories, such as the support of a Gaussian, which is an affine subspace (Section 3.3).
Because extended Gaussians are inherently not measure-theoretic, we will use the language of categorical probability theory [cho_jacobs, fritz] to talk about random variables, distributions and conditioning in an abstract and unified way. In technical terms, our main theorem is that the category of extended Gaussian maps is a Markov category which has all conditionals (Theorem 3.4).
Recent applications of categorical probability theory include [bart:licsmultisets, jacobs:hypergeo]. The categorical framework has a canonical counterpart in the compositional semantics of probabilistic programs [staton:commutative_semantics, dariothesis]. In [2021compositional], we treat exact conditioning of Gaussian random variables from a programming-languages viewpoint. The study of uninformative priors is very natural in this context, because it asks for a unit for the conditioning operation. We will discuss this more in Section 4 but take care to keep the developments of Section 2 self-contained and avoid categorical language there.
2 Extended Gaussian Distributions
We begin with an informal introduction to extended Gaussians, before defining them formally in Section 2.1. In a nutshell, an extended Gaussian on a vector space is a Gaussian distribution on a quotient of that space. For an overview of linear algebra terminology, see Section 6.1.
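Representations: A Gaussian distribution on $\mathbb{R}^n$ can be written uniquely as $\mathcal{N}(\mu, \Sigma)$ where $\mu \in \mathbb{R}^n$ is its mean and $\Sigma \in \mathbb{R}^{n \times n}$ is a symmetric positive-semidefinite matrix called its covariance matrix. The support of $\mathcal{N}(\mu, \Sigma)$ is the affine subspace $\mu + S$ where $S = \mathrm{im}(\Sigma)$ is the column space (image) of $\Sigma$. Gaussian distributions transform as follows under linear maps: If $A \in \mathbb{R}^{m \times n}$, then
$$A \cdot \mathcal{N}(\mu, \Sigma) = \mathcal{N}(A\mu, A \Sigma A^T)$$
is its pushforward Gaussian distribution on $\mathbb{R}^m$. For an introduction to Gaussian probability see e.g. [alexthesis].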
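The pushforward rule for the mean and covariance under a linear map can be checked on small examples. The following is a minimal sketch in plain Python with exact rational arithmetic; the helper names (`matmul`, `transpose`, `pushforward`) and the two example maps are illustrative assumptions and do not come from the text.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def pushforward(A, mean, cov):
    """Mean and covariance of the image of N(mean, cov) under x -> A x."""
    new_mean = [row[0] for row in matmul(A, [[m] for m in mean])]
    new_cov = matmul(matmul(A, cov), transpose(A))
    return new_mean, new_cov

# Sum of two independent standard normal variables: A = (1 1).
A = [[F(1), F(1)]]
sum_mean, sum_cov = pushforward(A, [F(0), F(0)], [[F(1), F(0)], [F(0), F(1)]])

# Copying x -> (x, x): the result has a singular covariance matrix
# whose column space is the diagonal.
C = [[F(1)], [F(1)]]
copy_mean, copy_cov = pushforward(C, [F(3)], [[F(2)]])
```

The second example illustrates why singular covariance matrices arise naturally: copying a variable concentrates the joint distribution on the diagonal.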
We now wish to add certain forms of limits to Gaussian distributions, which allow us to express ignorance along a certain subspace . An extended Gaussian distribution on is an entity which can be written as
where is a vector subspace called the locus of nondeterminism. Extended Gaussian distributions transform as where is the image subspace. The intuition is that nondeterministic (uninformative) noise is present in the distribution along the subspace . For example, the extended Gaussian distribution (which we’ll simply write as ) expresses a ‘uniform distribution’ over the real line; it is the distribution of a point about which we have no information.
Translation invariance and non-uniqueness: An important consequence of the uninformativeness is that the representation (1) can no longer be unique. For example, the uniform distribution on should be translation invariant, because adding a constant to an unknown point results in an unknown point. We can thus, for example, expect the following three representations of extended Gaussians to denote the same distribution:
Let us consider a more complex example involving two correlated random variables:
Let be independent normal variables, to which we add some uniform noise along the diagonal ; that is we let and define
The extended Gaussian vector can be decomposed into a sum (see Figure 1)
where the first summand lies in and the second one takes values in the complement . By translation invariance, the first contribution gets absorbed by . The remaining contribution has variance
in the second coordinate. We can therefore conclude that the following are both valid representations of the joint distribution of :
Conditionals: Unlike , the variables in Example 2 are no longer independent, because they are coupled via . In fact, one can show that given has the conditional distribution
The expression appearing in the conditional is an example of an extended Gaussian map, which we consider systematically in Section 3. In Theorem 3.4, we prove that conditional distributions of extended Gaussians are again extended Gaussian. We now proceed to define extended Gaussians formally.
2.1 Definition of Extended Gaussians
For a first definition of extended Gaussians, we start from the representatives in (1) and quotient by an equivalence relation for when they shall denote the same distribution: [Preliminary Definition] An extended Gaussian distribution on is an equivalence class of pairs where is a vector subspace and is a Gaussian distribution on . We identify two such pairs if and only if and the pushforward distributions and agree for some (equivalently any) direct sum decomposition , where is the projection endomorphism onto .
In Example 2, we considered the particular complement of . The projection onto is given by the matrix
From this we obtain the desired equality (2) by verifying that
The choice of complement is in no way canonical. A cleaner, high-level definition goes as follows: An extended Gaussian distribution on with locus of nondeterminism is an ordinary Gaussian distribution on the quotient space (Definition 2.1). The preliminary definition 2.1 is then obtained by identifying with the complement . Because the choice of complement is non-canonical, it is preferable to work with the quotient directly. The price of the abstract definition is that we must re-introduce Gaussian distributions in a coordinate-free way, on more general vector spaces than just . This development is well-known; we refer to e.g. [alexthesis, Section 1] for an overview and recall the essentials below.
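The equivalence test of the preliminary definition can be made concrete for a fixed locus of nondeterminism. The sketch below assumes the setting of Example 2 with standard normal variables, the diagonal spanned by (1, 1) as locus of nondeterminism, and the second coordinate axis as the chosen complement; these concrete choices, the matrix `P` and the function names are assumptions made for illustration.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

# Assumed decomposition: (x1, x2) = x1 * (1, 1) + (0, x2 - x1).
# P projects onto the complement {0} x R along the diagonal.
P = [[F(0), F(0)], [F(-1), F(1)]]

def project(mean, cov):
    """Push N(mean, cov) forward along the projection P."""
    new_mean = [row[0] for row in matmul(P, [[m] for m in mean])]
    new_cov = matmul(matmul(P, cov), transpose(P))
    return new_mean, new_cov

def same_extended_gaussian(rep1, rep2):
    """Compare two representatives that share the diagonal as locus."""
    return project(*rep1) == project(*rep2)

identity = [[1, 0], [0, 1]]
rep_a = ([0, 0], identity)             # two independent standard normals
rep_b = ([0, 0], [[0, 0], [0, 2]])     # variance 2 in the second coordinate
rep_c = ([5, 5], identity)             # rep_a translated along the diagonal
rep_d = ([0, 1], identity)             # rep_a translated off the diagonal
```

Under these assumptions, `rep_a`, `rep_b` and `rep_c` describe the same extended Gaussian, while `rep_d` does not, because its mean differs in the complement.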
All vector spaces are henceforth assumed finite-dimensional. We write for the dual space of , consisting of all linear maps . By a form on , we mean a symmetric bilinear map . The kernel of is the set
We call nondegenerate if . Note that can be curried as with . The notation is consistent in that , and is nondegenerate if and only if is an isomorphism. If has kernel , then the quotient form is well-defined and nondegenerate.
A form is called positive semidefinite if for all , and positive definite if for all . A positive semidefinite form is positive definite if and only if it is nondegenerate.
The correct coordinate-free version of the covariance matrix is that of a covariance form on the dual space. Given a random variable on the space , and linear functions , we compute the covariance
This expression is symmetric, bilinear and positive semidefinite. A Gaussian distribution is fully determined by its mean and covariance form, which motivates the following definition:
A Gaussian distribution on a vector space is a pair written of a mean and a positive semidefinite form . If is a linear map, the Gaussian distribution pushes forward to where .
We can now give the following concise definition of extended Gaussians. An extended Gaussian distribution on a vector space is a pair where is a vector subspace and is a Gaussian distribution on . Every linear map induces a linear map , and we take the pushforward of to be .
2.2 Precision and Duality
We will show that our definition of extended Gaussians fits into an elegant duality between forms on a space and its dual. This lets us convert between two equivalent representations, using a covariance form or a precision form, which are convenient for different purposes:
Probability distributions and probability densities are dual to each other. Distributions naturally push forward, and consequently the covariance form must be defined on the dual space . On the other hand, density functions pull back. If the covariance matrix is nonsingular, the Gaussian distribution has a Lebesgue density
where is called the precision matrix. If is singular, will only admit a density on its support , and the precision is only defined on that subspace. The coordinate-free formulation of the precision-covariance correspondence is given by the following duality statement (where for simplicity we assume centered distributions, that is mean zero):
The following pieces of data are equivalent for every vector space
a subspace and a nondegenerate form
Recall that the evaluation pairing induces a duality between the subspaces of and called annihilators, which we here denote . For subspaces and , the subspaces are defined as
Taking annihilators is order-reversing and involutive; we list further properties under Proposition 6.1.
In Theorem 2.2, the subspace is taken to be the annihilator of the kernel , that is . This recovers the familiar support for covariance matrices. We may think of the form as taking the value infinity outside of (which corresponds to a vanishing density under (3)).
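In matrix terms, and only for a nonsingular covariance matrix, the precision-covariance correspondence is matrix inversion, and the precision determines the density up to normalization through the quadratic form in the exponent. The following sketch illustrates this for an assumed 2×2 example; the function names and the chosen covariance matrix are hypothetical.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse_2x2(m):
    """Invert a nonsingular 2x2 matrix with exact arithmetic."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular covariance: precision only exists on the support")
    return [[d / det, -b / det], [-c / det, a / det]]

def log_density_unnormalized(x, mean, precision):
    """Return -1/2 (x - mean)^T Omega (x - mean), the exponent of the density."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    quad = sum(d[i] * precision[i][j] * d[j] for i in range(2) for j in range(2))
    return -quad / 2

cov = [[F(2), F(1)], [F(1), F(2)]]
precision = inverse_2x2(cov)
value = log_density_unnormalized([F(1), F(0)], [F(0), F(0)], precision)
```

For a singular covariance matrix the inverse does not exist, which is the situation the duality statement handles by restricting the precision to the support.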
Extended Gaussians admit a generalized and in fact more symmetric version of this correspondence: The following pieces of data are equivalent for every vector space
pairs of a subspace and
pairs of a subspace and
Given , let and define and .
Form the nondegenerate quotient form . Its currying is an isomorphism. Making use of the canonical isomorphisms described in Proposition 6.1,
we obtain an iso , which is the same thing as a bilinear form with kernel .
Conversely, given , let and define and . Turn into an iso , then reading the diagram (5) backwards uniquely defines the iso and hence the form with kernel .
The constructions are clearly inverses to each other. It is furthermore easy to see that the correspondence takes positive semidefinite forms to positive semidefinite forms.
A (centered) extended Gaussian thus has a covariance representation which is a pair with and a positive semidefinite form on , and a precision representation with and positive semidefinite on . The covariance representation is convenient for computing pushforwards, while the precision representation generalizes density functions and is useful for conditioning. Note that the locus of nondeterminism of an extended Gaussian is , that is, unlike for ordinary Gaussians, vanishing precision is now allowed! The proof of Theorem 2.2 is reminiscent of the construction of the Moore-Penrose pseudoinverse, whose relevance to Gaussian probability is well-known (e.g. [lauritzen]).
We define the uniform distribution on to be the unique extended Gaussian distribution whose locus of nondeterminism is all of . Its precision representation is the zero form ; its covariance representation is the zero form on the trivial subspace .
3 A Category of Extended Gaussian maps
It is convenient to generalize Gaussian distributions on to Gaussian maps , which are linear functions with Gaussian noise, informally written , where is linear and is a Gaussian distribution on (independent of ). This allows us to discuss distributions, linear maps and conditionals within a single formalism.
Gaussian maps can be composed in sequence and in parallel, where the noise is pushed forward and accumulated appropriately. Formally, we define a symmetric monoidal category (due to [fritz, Ch. 6]) as follows: Objects are vector spaces , and morphisms are Gaussian maps
between them. On objects, the tensor is the cartesian product, and the categorical structure is given by
where denotes the product distribution of Gaussians. Furthermore, we can copy and discard information using the linear maps and . This gives the structure of a Markov category [fritz], that is a categorical model of probability theory. We can recover Gaussian distributions as Gaussian maps out of the trivial vector space.
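Sequential composition of Gaussian maps can be sketched in matrix form. The code below assumes the usual rule that the noise of the first map is pushed forward along the second map and then added to the independent noise of the second map; the triple representation `(matrix, noise_mean, noise_cov)` and all function names are illustrative assumptions rather than notation from the text.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def apply(a, v):
    """Apply a matrix to a vector."""
    return [sum(r * x for r, x in zip(row, v)) for row in a]

def compose(g, f):
    """Composite 'g after f' of Gaussian maps x -> A x + N(mean, cov)."""
    A, mu, Sigma = f
    B, nu, T = g
    pushed = matmul(matmul(B, Sigma), transpose(B))   # B Sigma B^T
    mean = [p + q for p, q in zip(apply(B, mu), nu)]  # B mu + nu
    cov = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(pushed, T)]
    return matmul(B, A), mean, cov

f = ([[1]], [0], [[1]])          # x -> x + N(0, 1)
g = ([[2]], [1], [[3]])          # y -> 2y + N(1, 3)
identity = ([[1]], [0], [[0]])   # noiseless identity map
gf = compose(g, f)
```

In this example the composite is the map sending x to 2x plus noise with mean 1 and variance 7: the unit variance of `f` is scaled by 2² and added to the variance 3 of `g`.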
In this section, we show that the construction of arises as an instance of a general notion of decorated linear map . We then give a recipe to combine decorated linear maps with nondeterminism to obtain decorated linear relations. This subsumes our earlier construction of extended Gaussian distributions, as well as serving as a definition for extended Gaussian maps (Definition 3.2).
In Section 3.3, we use the generality of this construction to establish the following diagram of identity-on-objects functors between Markov categories, where the bottom row consists of decorated linear maps and the top row of decorated linear relations:
In all categories of this diagram, the objects are finite-dimensional vector spaces . Recall that a linear relation between vector spaces is a relation that is also a vector subspace; an affine relation is a relation that is also an affine subspace (see Section 6.1). The remaining Markov categories in the diagram are defined as
: vector spaces and linear functions
: vector spaces and affine-linear functions
: vector spaces and left-total linear relations
: vector spaces and left-total affine relations
The functor collapses Gaussian distributions to their supports, which are affine subspaces.
3.1 Decorated Linear Maps
Let be a functor from the category of vector spaces into the category of commutative monoids and monoid homomorphisms. We think of elements as “decorations” for linear maps into , and thus call a decoration functor.
We define a category of -decorated linear maps as follows:
Objects are vector spaces
Morphisms are pairs where is a linear map and
Composition is defined as follows: if , define
Note that addition takes place in the commutative monoid .
There is a faithful inclusion sending to . The functor which forgets the decoration is an opfibration; a decorated map is opcartesian if and only if is a unit in the monoid . In fact, is precisely the (op-)Grothendieck construction for seen as a functor .
We argue that has the structure of a symmetric monoidal category with the tensor on objects. For this, we first observe that is automatically lax monoidal; to see this, we define natural maps given as follows: For , let where are the biproduct inclusions. We can now define the tensor of decorated maps as . The monoidal category is in general not cartesian; it does however inherit copy and delete maps from . The category is a Markov category if and only if deleting is natural, i.e. .
We reconstruct the bottom row of (6) for the following decoration functors:
For , is equivalent to .
For , is equivalent to . A map consists of a pair with linear and .
Define to be the set of covariance forms. This is a commutative monoid under pointwise addition, and is functorial via . Let , then is precisely .
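The composition of decorated linear maps can be sketched generically. The code assumes the composition rule of the Grothendieck construction, in which the decoration of the first map is transported along the second map and then added to the decoration of the second map; the parameters `act` (functorial action) and `add` (monoid addition), and all other names, are hypothetical. Two instances are shown: vectors as decorations, giving affine maps, and covariance matrices as decorations, giving Gaussian maps with centered noise.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def compose_decorated(g, f, act, add):
    """Composite 'g after f' of decorated maps (matrix, decoration)."""
    (G, t), (Fm, s) = g, f
    return matmul(G, Fm), add(t, act(G, s))

# Instance 1: decorations are vectors (offsets of affine maps).
def act_vec(A, v):
    return [sum(r * x for r, x in zip(row, v)) for row in A]

def add_vec(u, v):
    return [p + q for p, q in zip(u, v)]

# Instance 2: decorations are covariance matrices.
def act_cov(A, S):
    return matmul(matmul(A, S), transpose(A))

def add_cov(S, T):
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(S, T)]

affine = compose_decorated(([[3]], [4]), ([[2]], [1]), act_vec, add_vec)
gaussian = compose_decorated(([[3]], [[4]]), ([[2]], [[1]]), act_cov, add_cov)
```

The affine instance composes x ↦ 2x + 1 with y ↦ 3y + 4; the Gaussian instance scales the variance 1 by 3² before adding the variance 4.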
3.2 Decorated Linear Relations
Given a decoration functor , we define -decorated linear relations by maintaining a locus of nondeterminism similar to Definition 2.1.
We define a category as follows:
objects are vector spaces
morphisms in are triples where is a vector subspace, is a linear map and .
Intuitively, the subspace represents the direction of complete ignorance, so we only decorate the quotient.
Composition of and is slightly more involved: We first define the composite subspace as , which is well-defined. The composite vanishes on and so descends to . We define the composite as
To understand the name ‘decorated linear relation’, we consider the case where we obtain that is equivalent to . The key observation is the following:
To give a left-total linear relation is to give a subspace and a linear function to the quotient.
The correspondence is fully spelled out in Proposition 6.1. This means we can think of a left-total relation as a linear function with nondeterministic noise along some subspace . This is similar to a decorated linear map, however the choice of the linear map is no longer unique, so further quotienting is required, as we discuss now:
As a quotient: We can demystify the composition of by first constructing an auxiliary category and then quotienting it by a congruence. We first consider the decoration functor defined by . Each is a commutative monoid under Minkowski addition , and the functorial action for is direct image , which is a monoid homomorphism. Now we consider the category where morphisms are by definition triples with and , and composition is . This has all the data for and explains why the composite subspace is formed the way it is.
However some pieces of data need to be identified if they agree on the quotient by : We define an equivalence relation if are the same subspace , and and where is the quotient map.
The relation is a congruence relation, and is the quotient of under . is symmetric monoidal and inherits the copy and delete maps from . It is a Markov category if . See Appendix (Section 6.3).
We define the Markov category to be .
3.3 Relationships between the Constructions
The constructions and are themselves functorial:
Let be decoration functors and a natural transformation. Then we have induced monoidal identity-on-objects functors
which preserve copy and delete structure. This proposition accounts for all functors in the diagram (6) except for . For every decoration functor , we obtain an inclusion functor by choosing locus of nondeterminism , that is forming the composite
From a decorated linear relation, we can extract its locus of nondeterminism by forgetting the decoration via the following composite:
Recall that the support of a Gaussian distribution on is the affine space . This construction can be extended functorially to as follows:
We have a functor which takes the Gaussian noise to its support. Concretely on , . We construct as the composite
where the natural transformation takes the covariance form to the annihilator of its kernel. Naturality is nontrivial and can be shown using the Cholesky decomposition, which makes use of the positive semidefiniteness of .
Our presentation of extended Gaussians is not measure-theoretic. Categorical probability theory [fritz, cho_jacobs, dariothesis] is a general language of probability which allows one to precisely state concepts such as determinism, independence and support in an abstract way without relying on a measure-theoretic formalism. An important notion is that of a conditional distribution [fritz, Section 11], which means we can break the sampling of a joint distribution into two parts: We first sample from the marginal, and then sample dependently on , . Conditional distributions are extended to conditionals of arbitrary morphisms in [fritz, Definition 11.5], which is a morphism satisfying the equation in the appropriate categorical sense.
Conditionals in exist and are given by the usual conditional distributions [fritz, 11.8]. also has conditionals, which are essentially given by re-ordering a relation to a relation . This may not be left-total yet, but any linear relation can be extended to a left-total one outside of its domain.
We will now show that extended Gaussians have conditionals too: We proceed by picking a convenient complement to the locus of nondeterminism , which lets us reduce the proof to the existence of conditionals in and separately. In fact, this strategy works for general decorated linear relations:
If has conditionals, so does . Let be represented modulo by with . By Lemma 6.2, there exists a complement of such that is a complement of in . We then take to be the projection endomorphisms, and replace the representative by and by without affecting .
Now we consider and find a conditional . This means we can obtain as
Similarly we can use conditionals in to find a linear function and a subspace such that can be obtained as
Thus a joint sample can be obtained as follows
Because we have chosen such that , we can extract the values of from via the projections as and . We can thus read off a conditional for by combining the two individual conditionals, namely
has all conditionals.
For example, the procedure of Theorem 3.4 recovers Example 2 systematically: In order to find a conditional distribution for the joint distribution on , we choose the particular complement of the diagonal and obtain the desired decomposition
Conditioning on equality: In the context of Gaussian probability, we can condition two random variables to be exactly equal [2021compositional] by introducing an auxiliary variable for their difference, and computing the conditional distribution . For example, if are independent standard normal variables, then the conditional distribution is . Note that the variance of has shrunk after conditioning, because is itself concentrated around and thus induces a stronger concentration in .
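The shrinking variance can be verified with the standard formula for conditioning a bivariate Gaussian. This sketch takes two independent standard normal variables, forms the joint distribution of the first variable and the difference, and conditions on the difference being zero; the function names and the matrix `A` are illustrative assumptions.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def condition_on_second(mean, cov, z):
    """Mean and variance of the first coordinate given the second equals z."""
    gain = cov[0][1] / cov[1][1]
    cond_mean = mean[0] + gain * (z - mean[1])
    cond_var = cov[0][0] - gain * cov[1][0]
    return cond_mean, cond_var

# (X, Y) -> (X, X - Y) applied to two independent standard normals.
A = [[F(1), F(0)], [F(1), F(-1)]]
joint_cov = matmul(matmul(A, [[F(1), F(0)], [F(0), F(1)]]), transpose(A))

cond_mean, cond_var = condition_on_second([F(0), F(0)], joint_cov, F(0))
```

The conditional variance is 1/2, half of the prior variance, which is the concentration effect described above.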
We can now formally show that our uniform extended Gaussians (Definition 2.2) are really uninformative in the sense that conditioning on equality with a uniform variable does not change the prior:
For every extended Gaussian prior on , if and , then still has distribution . We introduce an auxiliary random variable and show that the following two joint distributions over are equal:
The right-hand side lets us read off the conditional on immediately. Conditioning on equality now means setting , after which we obtain .
4 Outlook and Discussion
We have defined a sound mathematical model for reasoning about Gaussian distributions together with uninformative priors, and gained a new and generalized perspective on linear relations. We gave a construction to combine probability and nondeterminism on vector spaces, which is generally not possible in a seamless way [weakdist, nogo].
The original motivation for this work comes from programming language theory and the semantics of probabilistic programming [2021compositional]. While in the statistics literature, noisy observations are seen as the primary operation of interest, our focus on singular covariances is natural from a logical or programming perspective: Copying and conditioning are fundamental operations, however they do lead to highly singular distributions. In [gaussianinfer], we defined a programming language for Gaussian probability with a first-class exact conditioning construct. In this setting, the existence of a uniform distribution is simply asking for a unit for the conditioning operation. Further connections between probability and logic arise both through the analogy with unification in logic programming [staton:predicate_logic] and the relationship to linear relations (Section 3.3). Linear relations have been used in graphical reasoning and signal-flow diagrams [bsz, baez:props, baez:control].
We believe that solves the outstanding characterization of in [2021compositional]. The category can express both logical (solving linear equations) and probabilistic computation (conditioning Gaussians). We believe that it is a hypergraph category [hypergraphcats], with conditioning as multiplication and uniform distributions as units. [coecke:inference] have suggested hypergraph categories as the right setting for inference, and celebrated algorithms such as message-passing inference can be formulated fully abstractly there [morton:inference]. We believe that categorical probability theory is a compelling language to go beyond measure-theoretic foundations and to talk about statistical models and conditioning in a high-level way. The systematic connection with probabilistic programming languages is elaborated in [dariothesis].
We hope that extended Gaussians will also be useful outside of theoretical computer science. Here we have introduced them primarily from an algebraic and category-theoretic viewpoint and shown a duality theorem (Theorem 2.2) linking them to certain types of quadratic forms. We would like to explore further connections to statistics [quotients] and functional analysis: An important step is to topologize the homsets of and characterize in which ways Gaussian distributions can converge to extended Gaussians. Ideally this would exhibit the construction of extended Gaussians as a form of topological completion. The aspect of considering improper priors as limits of normalized ones is treated in [akaike, approximations].
It also seems interesting to consider extended Gaussians under the ‘principle of transformation groups’ (e.g. [jaynes:priors]). In [2021compositional, VI], we remark that is essentially presented as a PROP by the invariance of the standard normal distribution under the orthogonal group . We expect the uniform distributions to be presented by invariance under all of .
It has been useful to discuss this work with many people. Particular thanks go to Tobias Fritz, Bart Jacobs, Dusko Pavlovic, Sam Staton and Alexander Terenin.
6.1 Glossary: Linear Algebra
All vector spaces are assumed finite dimensional. For vector subspaces , their Minkowski sum is the subspace . If furthermore , we call their sum a direct sum and write . A complement of is a subspace such that . An affine subspace is a subset of the form for some and a (unique) vector subspace . The space is called a coset of and the cosets of organize into the quotient vector space .
An affine-linear map between vector spaces is a map of the form for some linear function and . Vector spaces and affine-linear maps form a category .
A linear relation is a relation which is also a vector subspace of . An affine relation is a relation which is also an affine subspace of . We write . A relation is left-total if for all .
Linear relations, affine relations and left-total relations are closed under the usual composition of relations. We denote by and the categories whose objects are vector spaces, and morphisms are left-total linear and affine relations respectively. Those categories are Markov categories (left-totality must be assumed for discarding to be natural).
The following characterization underlies Definition 3.2: Every left-total linear relation can be written as a ‘linear map with nondeterministic noise’ .
Let be a left-total linear relation. Then
is a vector subspace of
is a coset of for every
the assignment is a well-defined linear map
every linear map is of that form for a unique left-total linear relation
For 1, consider (by assumption nonempty), then by linearity of
so is a vector subspace. For 2, we can find some and wish to show that . Indeed if then so , hence . Conversely for all we have so . This completes the proof that is a coset. For 3, the previous point shows that the map is a well-defined map . It remains to show it is linear. That is, if and then . This follows immediately from the linearity of . For the last point 4, given a linear map we construct the relation
which is left-total because . To see that is linear, let meaning and for representatives of . Linearity of means that is a representative of . Thus
Taking annihilators is order-reversing and involutive
If , then and we have a canonical isomorphism
and similarly for , we have
If and , then
If , we have a canonical isomorphism
Standard. An explicit description of the canonical iso (7) is given as follows.
We define as follows. If , then is a function such that . The restriction thus descends to the quotient , and we let . To check this is well-defined, notice that the kernel of consists of those such that , that is .
We define as follows. An element is a function with . Find any extension of to a linear function (such an extension exists because is a retract of ). Then still , so . It remains to show that the choice of extension does not matter in the quotient . Indeed if is another extension, then , hence .
The proof of the existence of conditionals in proceeds by picking a good complement to the locus of nondeterminism as follows:
Let be a vector subspace, and let be its projection. Then there exists a complement of such that is a complement of . We give an explicit construction, where in fact we can choose to be a cartesian product of subspaces . Define
We argue that if and , then . First we prove that : Indeed, if for , then , but that implies . So we know , i.e. . Thus .
It remains to show that we can write every as with and .
We can write with and .
We claim that there exists a such that . Because , there exists some such that . We now decompose for . By definition of , we have , so .
Write with and define and . Then we have and , and as desired
6.3 Composition and Congruence
For the construction of , it remains to check that the relation is a monoidal congruence on . Recall that if and only if and where is the quotient map.
Let meaning and . Then clearly also .
Congruence: Let , and be given for , and assume that and . We need to show that