 # On the information-theoretic structure of distributed measurements

The internal structure of a measuring device, which depends on what its components are and how they are organized, determines how it categorizes its inputs. This paper presents a geometric approach to studying the internal structure of measurements performed by distributed systems such as probabilistic cellular automata. It constructs the quale, a family of sections of a suitably defined presheaf, whose elements correspond to the measurements performed by all subsystems of a distributed system. Using the quale we quantify (i) the information generated by a measurement; (ii) the extent to which a measurement is context-dependent; and (iii) whether a measurement is decomposable into independent submeasurements, which turns out to be equivalent to context-dependence. Finally, we show that only indecomposable measurements are more informative than the sum of their submeasurements.


## 1 Introduction

Any classical physical system (by which we simply mean any deterministic function) can be taken as a measuring apparatus or input/output device. For example, a thermometer takes inputs from the atmosphere and outputs numbers on a digital display. The thermometer categorizes inputs by temperature and is blind to, say, differences in air pressure.

Classical measurements are formalized as follows:

###### Definition 1.

Given a classical physical system with state space , a measuring device is a function . The output is the reading and the pre-image is the measurement.

From this point of view a thermometer and a barometer are two functions, and , mapping the state space of configurations (positions and momenta) of atmospheric particles to real numbers. When the thermometer outputs , it specifies that the atmospheric configuration was in the pre-image which, assuming the thermometer perfectly measures temperature, is exactly characterized as atmospheric configurations with temperature . Similarly, the pre-images generated by the barometer group atmospheric configurations by pressure.

The classical definition of measurement takes a thermometer as a monolithic object described by a single function from atmospheric configurations to real numbers. The internal structure of the thermometer – that it is composed of countless atoms and molecules arranged in an extremely specific manner – is swept under the carpet (or, rather, into the function).

This paper investigates the structure of measurements performed by distributed systems. We do so by adapting Definition 1 to a large class of systems that contains networks of Boolean functions , Conway’s game of life [13, 10] and Hopfield networks [16, 4] as special cases.

Our motivation comes from prior work investigating information processing in discrete neural networks [7, 8]. The brain can be thought of as an enormously complicated measuring device mapping sensory states and prior brain states to subsequent brain states. Analyzing the functional dependencies implicit in cortical computations reduces to analyzing how the measurements performed by the brain are composed out of submeasurements by subdevices such as individual neurons and neuronal assemblies. The cortex is of particular interest since it seemingly effortlessly integrates diverse contextual data into a unified gestalt that determines behavior. The measurements performed by different neurons appear to interact in such a way that they generate more information jointly than separately. To improve our understanding of how the cortex integrates information we need a formal language for analyzing how context affects measurements in distributed systems.

As a first step in this direction, we develop methods for analyzing the geometry of measurements performed by functions with overlapping domains. We propose, roughly speaking, to study context-dependence in terms of the geometry of intersecting pre-images. However, since we wish to work with both probabilistic and deterministic systems, things are a bit more complicated.

We sketch the contents of the paper. Section §2 lays the groundwork by introducing the category of stochastic maps. Our goal is to study finite set valued functions and conditional probability distributions on finite sets. However, rather than work with sets, functions and conditional distributions, we prefer to study stochastic maps (Markov matrices) between function spaces on sets. We therefore introduce the faithful functor $V$ taking functions on sets to Markov matrices:

$$[f: X \to Y] \mapsto [Vf: VX \to VY],$$

where $VX$ is the space of functions from $X$ to $\mathbb{R}$. Conditional probability distributions can also be represented using stochastic maps.

Working with linear operators instead of set-valued functions is convenient for two reasons. First, it unifies the deterministic and probabilistic cases in a single language. Second, the dual of a stochastic map provides a symmetric treatment of functions and their corresponding inverse image functions. Recall that the inverse of a function $f: X \to Y$ is $f^{-1}: Y \to 2^X$, which takes values in the powerset $2^X$ of $X$, rather than $X$ itself. Dualizing a stochastic map flips the domain and range of the original map, without introducing any new objects:

 (1)

see Corollary 2.

Section §3 introduces distributed dynamical systems. These extend probabilistic cellular automata by replacing cells (space coordinates) with occasions (spacetime coordinates: cell at time ). Inspired by [15, 2], we treat distributed systems as collections of stochastic maps between function spaces so that processes (stochastic maps) take center stage, rather than their outputs. Although the setting is abstract, it has the advantage that it is scalable: using a coarse-graining procedure introduced in  we can analyze distributed systems at any spatiotemporal granularity.

Distributed dynamical systems provide a rich class of toy universes. However, since these toy universes do not contain conscious observers we confront Bell’s problem: “What exactly qualifies some physical [system] to play the role of ‘measurer’?” In our setting, where we do not have to worry about collapsing wave-functions or the distinction between macroscopic and microscopic processes, the solution is simple: every physical system plays the role of measurer. More precisely, we track measurers via the category $\mathrm{Sys}_D$ of subsystems of $D$. Each subsystem $C$ is equipped with a mechanism $m_C$ which is constructed by gluing together the mechanisms of the occasions in $C$ and averaging over extrinsic noise.

Measuring devices are typically analyzed by varying their inputs and observing the effect on their outputs. By contrast this paper fixes the output and varies the device over all its subdevices to obtain a family of submeasurements parametrized by all subsystems in $\mathrm{Sys}_D$. The internal structure of the measurement performed by $D$ is then studied by comparing submeasurements.

We keep track of submeasurements by observing that they are sections of a suitably defined presheaf. Sheaf theory provides a powerful machinery for analyzing relationships between objects and subobjects, which we adapt to our setting by introducing the structure presheaf $F_D$, a contravariant functor from $\mathrm{Sys}_D$ to the category $\mathrm{Meas}_D$ of measuring devices on $D$. Importantly, $F_D$ is not a sheaf: although the gluing axiom holds, uniqueness fails, see Theorem 4. This is because the restriction operator in $\mathrm{Meas}_D$ is (essentially) marginalization, and of course there are infinitely many joint distributions $p(x, y)$ that yield marginals $p(x)$ and $p(y)$.

Section §4 adapts Definition 1 to distributed systems and introduces the simplest quantity associated with a measurement: effective information, which quantifies its precision, see Proposition 5. Crucially, effective information is context-dependent – it is computed relative to a baseline which may be completely uninformative (the so-called null system) or provided by a subsystem.

Finally entanglement, introduced in §5, quantifies the obstruction (in bits) to decomposing a measurement into independent submeasurements. It turns out, see discussion after Theorem 10, that entanglement quantifies the extent to which a measurement is context-dependent – the extent to which contextual information provided by one submeasurement is useful in understanding another. Theorem 9 shows that a measurement is more precise than the sum of its submeasurements only if entanglement is non-zero. Precision is thus inextricably bound to context-dependence and indecomposability. The failure of unique descent is thus a feature, not a bug, since it provides “elbow room” to build measuring devices that are not products of subdevices.

Space constraints prevent us from providing concrete examples; the interested reader can find these in [7, 8, 6]. Our running examples are the deterministic set-valued functions

$$f: X \to Y \quad \text{and} \quad g: X \times Y \to Z$$

which we use to illustrate the concepts as they are developed.

## 2 Stochastic maps

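Any conditional distribution $p(y|x)$ on finite sets $X$ and $Y$ can be represented as a matrix as follows. Let $VX$ denote the vector space of real valued functions on $X$ and similarly for $Y$. $VX$ is equipped with Dirac basis $\{\delta_x \mid x \in X\}$, where

$$\delta_x(x') = \begin{cases} 1 & \text{if } x = x' \\ 0 & \text{else.} \end{cases}$$

Given a conditional distribution $p(y|x)$ construct matrix $m_p$ with entry $p(y|x)$ in column $x$ and row $y$. Matrix $m_p$ is stochastic: it has nonnegative entries and its columns sum to 1. Alternatively, given a stochastic matrix $m: VX \to VY$, we can recover the conditional distribution $p_m(y|x)$. The Dirac basis induces Euclidean metric

$$\langle \bullet | \bullet \rangle: VX \otimes VX \to \mathbb{R}: \quad \Big\langle \sum_x \alpha_x \delta_x \,\Big|\, \sum_x \beta_x \delta_x \Big\rangle = \sum_x \alpha_x \beta_x \tag{2}$$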

which identifies vector spaces with their duals . Let .

###### Definition 2.

The category of stochastic maps has function spaces for objects and stochastic matrices with respect to Dirac bases for arrows. We identify of with using the Dirac basis without further comment below.

###### Definition 3.

The dual of surjective stochastic map is the composition , where is the unique map making diagram

commute. Precomposing with renormalizes its columns to sum to 1. (If the map is not surjective, i.e. if one of the rows has all zero entries, then the renormalization is not well-defined.) The stochastic dual of a stochastic transform is stochastic; further, if is stochastic then .

Category is described in terms of braid-like generators and relations in . A more general, but also more complicated, category of conditional distributions was introduced by Giry , see .

###### Example 1 (deterministic functions).

Let be the category of finite sets. Define faithful functor taking set to and function to stochastic map . It is easy to see that and .

We introduce special notation for commonly used functions:

• Set inclusion. For any inclusion of sets, let denote the corresponding stochastic map. Two important examples are

• Point inclusion. Given define .

• Diagonal map. Inclusion induces .

• Terminal map. Let denote the terminal map induced by .

• Projection. Let denote the projection induced by .

###### Proposition 1 (dual is Bayes over uniform distribution).

The dual of a stochastic map applies Bayes rule to compute the posterior distribution using the uniform probability distribution.

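Proof: The uniform distribution is the dual $\omega^\natural$ of the terminal map $\omega$. It assigns equal probability to all of $X$’s elements, and can be characterized as the maximally uninformative distribution. Let $m: VX \to VY$. The normalized transpose is

$$m^\natural(\delta_y) = \sum_x \frac{p_m(y|x)}{\sum_{x'} p_m(y|x')}\,\delta_x = \sum_x \frac{p_m(y|x) \cdot p_{\omega^\natural}(x)}{\sum_{x'} p_m(y|x')\, p_{\omega^\natural}(x')}\,\delta_x = \sum_x p_m(x|y) \cdot \delta_x. \quad \blacksquare$$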
###### Remark 1.

Note that . Dirac’s bra-ket notation must be used with care since stochastic matrices are not necessarily symmetric .

###### Corollary 2 (preimages).

The dual of stochastic map $Vf: VX \to VY$ is conditional distribution

$$p_{Vf}(x|y) = \begin{cases} \dfrac{1}{|f^{-1}(y)|} & \text{if } f(x) = y \\ 0 & \text{else.} \end{cases} \tag{3}$$

Proof: By the proof of Proposition 1

$$(Vf)^\natural(\delta_y) = \frac{1}{|f^{-1}(y)|} \sum_{\{x \,|\, f(x) = y\}} \delta_x. \quad \blacksquare$$

The support of $p_{Vf}(\cdot\,|y)$ is $f^{-1}(y)$. Elements in the support are assigned equal probability, thereby treating them as an undifferentiated list. Dual $(Vf)^\natural$ thus generalizes the inverse image $f^{-1}$. Conveniently however, the dual simply flips the domain and range of $Vf$, whereas the inverse image maps to powerset $2^X$, an entirely new object.
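A minimal sketch of Corollary 2 in Python may help fix ideas. It builds the stochastic matrix of a deterministic function and its normalized transpose, using exact fractions. The helper names `stochastic_map` and `dual`, the dict-of-dicts matrix layout, and the parity example are illustrative assumptions, not notation from the paper.

```python
from fractions import Fraction

def stochastic_map(f, X, Y):
    """Matrix of Vf stored as m[y][x] = p(y|x): column x, row y."""
    return {y: {x: Fraction(int(f(x) == y)) for x in X} for y in Y}

def dual(m, X, Y):
    """Normalized transpose: d[x][y] = m[y][x] / sum over x' of m[y][x']."""
    d = {x: {} for x in X}
    for y in Y:
        row_sum = sum(m[y][x] for x in X)
        if row_sum == 0:
            # An all-zero row means the map is not surjective: dual undefined.
            raise ValueError("row of zeros: renormalization is not well-defined")
        for x in X:
            d[x][y] = m[y][x] / row_sum
    return d

# Hypothetical example: parity of a number in {0, 1, 2, 3}.
X = [0, 1, 2, 3]
Y = [0, 1]
m = stochastic_map(lambda x: x % 2, X, Y)
posterior = dual(m, X, Y)  # posterior[x][y] = p(x|y)
```

Here `posterior[x][0]` is 1/2 for the even inputs and 0 for the odd ones, i.e. the uniform distribution on the preimage of 0.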

###### Corollary 3 (marginalization with respect to uniform distribution).

Precomposing $m: V(X \times Y) \to VZ$ with the dual $\pi_X^\natural$ to the projection $\pi_X: V(X \times Y) \to VX$ marginalizes over the uniform distribution on $Y$.

Proof: By Corollary 2 we have $\pi_X^\natural(\delta_x) = \frac{1}{|Y|} \sum_{y \in Y} \delta_{(x,y)}$. It follows immediately that

$$p_{m \circ \pi_X^\natural}(z|x) = \frac{1}{|Y|} \sum_{y \in Y} p_m(z|x,y). \quad \blacksquare$$

Precomposing with $\pi_X^\natural$ treats inputs from $Y$ as extrinsic noise. Although duals can be defined so that they implement Bayes’ rule with respect to other probability distributions, this paper restricts attention to the simplest possible renormalization of columns, Definition 3. The uniform distribution is convenient since it uses minimal prior knowledge (it depends only on the number of elements in the set) to generalize pre-images to the stochastic case, Corollary 2.
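The marginalization in Corollary 3 can be sketched as follows. The function name `marginalize_uniform`, the callable interface `p_m(z, x, y)`, and the choice of AND as the underlying mechanism are assumptions made for illustration only.

```python
from fractions import Fraction

def marginalize_uniform(p_m, X, Y, Z):
    """Return p(z|x) = (1/|Y|) * sum over y of p_m(z|x,y), keyed by (z, x)."""
    return {(z, x): sum(p_m(z, x, y) for y in Y) / len(Y)
            for z in Z for x in X}

def p_and(z, x, y):
    """Deterministic mechanism z = x AND y written as a conditional distribution."""
    return Fraction(int((x & y) == z))

bits = [0, 1]
noisy = marginalize_uniform(p_and, bits, bits, bits)
```

With $Y$ treated as extrinsic noise, the input $x = 0$ still forces $z = 0$, while $x = 1$ yields either output with probability 1/2.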

## 3 Distributed dynamical systems

Probabilistic cellular automata provide useful toy models of a wide range of physical and biological systems. A cellular automaton consists of a collection of cells, each equipped with a mechanism whose output depends on the prior outputs of its neighbors. Two important examples are

###### Example 2 (Conway’s game of life).

The cellular automaton is a grid of deterministic cells with outputs $\{0, 1\}$. A cell outputs 1 at time $t$ iff: (i) three of its neighbors outputted 1s at time $t-1$ or (ii) it and two neighbors outputted 1s at $t-1$. Remarkably, a sufficiently large game of life grid can implement any deterministic computation.
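The update rule of Example 2 can be sketched in a few lines of Python. The function name `life_step`, the list-of-lists grid, and the convention that cells outside the finite grid output 0 are assumptions of this sketch.

```python
def life_step(grid):
    """One synchronous update of the game of life on a finite grid.

    Cells outside the grid are treated as outputting 0 (an assumption).
    """
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(r, c):
        return sum(grid[r + dr][c + dc]
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)
                   and 0 <= r + dr < rows and 0 <= c + dc < cols)

    new_grid = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            n = live_neighbors(r, c)
            # (i) exactly three live neighbors, or (ii) alive with two live neighbors
            new_row.append(1 if n == 3 or (grid[r][c] == 1 and n == 2) else 0)
        new_grid.append(new_row)
    return new_grid

blinker = [[0, 0, 0],
           [1, 1, 1],
           [0, 0, 0]]
rotated = life_step(blinker)
```

The horizontal bar becomes a vertical one after one step and returns to itself after two.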

###### Example 3 (Hopfield networks).

These are probabilistic cellular automata [16, 4], again with outputs $\{0, 1\}$. Cell $n_k$ fires with probability proportional to

$$p(n_{k,t} = 1 \,|\, n_{\bullet,t-1}) \propto \exp\left[\frac{1}{T} \sum_{j \to k} \alpha_{jk} \cdot n_{j,t-1}\right].$$

Temperature $T$ controls network stochasticity. Attractors are embedded into a network by setting the connectivity matrix as .
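A small sketch of the firing rule in Example 3 follows. The text only gives the probability up to proportionality, so this sketch assumes the competing weight for output 0 is $\exp(0) = 1$, which yields a logistic function. That normalization, the function name `fire_probability` and the argument layout are assumptions, not taken from the paper.

```python
import math

def fire_probability(weights_into_k, prev_outputs, temperature):
    """P(cell k outputs 1), assuming weight exp(field / T) for 1 and exp(0) for 0."""
    field = sum(a * n for a, n in zip(weights_into_k, prev_outputs))
    # Normalizing exp(field/T) against exp(0) gives the logistic function.
    return 1.0 / (1.0 + math.exp(-field / temperature))

balanced = fire_probability([1.0, -1.0], [1, 1], 1.0)  # inputs cancel
cold = fire_probability([2.0], [1], 1.0)               # low temperature
hot = fire_probability([2.0], [1], 10.0)               # high temperature
```

Raising the temperature pulls the firing probability toward 1/2, which is the sense in which $T$ controls stochasticity.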

It is useful to take a finer perspective on cellular automata by decomposing them into spacetime coordinates or occasions. An occasion $v_l = n_{i,t}$ is a cell $n_i$ at a time point $t$. Two occasions are linked $v_k \to v_l$ if there is a connection from $v_k$’s cell to $v_l$’s (because they are neighbors or the same cell) and their time coordinates are $t-1$ and $t$ respectively for some $t$, so occasions form a directed graph. More generally:

###### Definition 4.

A distributed dynamical system consists of the following data:

1. Directed graph. A graph with a finite set of vertices or occasions and edges .

2. Alphabets. Each vertex $v_l$ has finite alphabet $A_l$ of outputs and finite alphabet $S_l$ of inputs, where $S_l$ is the product of the output alphabets of the sources of $v_l$.

3. Mechanisms. Each vertex $v_l$ is equipped with stochastic map $m_l: VS_l \to VA_l$.

Figure 1: Mapping a cellular automaton to a distributed dynamical system.

Taking any cellular automaton over a finite time interval and initializing the mechanisms at the initial time with fixed values (initial conditions) or probability distributions (noise sources) yields a distributed dynamical system, see Fig. 1. Each cell of the original automaton corresponds to a series of occasions in the distributed dynamical system, one per time step.

Cells with memory – i.e. whose outputs depend on their neighbors’ outputs over multiple time steps – receive inputs from occasions more than one time step in the past. If a cell’s mechanism changes (learns) over time then different mechanisms are assigned to the cell’s occasions at different time points.
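The unrolling of a cellular automaton into occasions can be sketched as below, for the memoryless case where each cell reads only the previous time step. The function name `occasions_graph`, the `(cell, time)` tuple encoding and the four-cell ring are hypothetical choices for illustration.

```python
def occasions_graph(num_cells, num_steps, neighbors):
    """Unroll a cellular automaton into a directed graph of occasions.

    An occasion is a (cell, time) pair. There is an edge (j, t-1) -> (k, t)
    whenever cell j is among neighbors(k), which may include k itself.
    """
    vertices = [(c, t) for t in range(num_steps) for c in range(num_cells)]
    edges = [((j, t - 1), (k, t))
             for t in range(1, num_steps)
             for k in range(num_cells)
             for j in neighbors(k)]
    return vertices, edges

def ring(k):
    """Hypothetical ring of 4 cells: left neighbor, the cell itself, right neighbor."""
    return [(k - 1) % 4, k, (k + 1) % 4]

vertices, edges = occasions_graph(4, 3, ring)
```

Every edge points one step forward in time, so the resulting graph is acyclic even though the underlying ring of cells is not.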

The sections below investigate the compositional structure of measurements: how they are built out of submeasurements. Technology for tracking subsystems and submeasurements is therefore necessary. We introduce two closely related categories:

###### Definition 5.

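The category of subsystems $\mathrm{Sys}_D$ of $D$ is a Boolean lattice with objects given by sets $C$ of ordered pairs of vertices and arrows given by inclusions $i_{12}: C_1 \hookrightarrow C_2$. The initial and terminal objects are the empty set $\bot = \emptyset$ and the set of all ordered pairs of vertices.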

###### Remark 2.

Subsystems are defined as ordered pairs of vertices, rather than subgraphs of the directed graph of . Pairs of occasions that are not connected by edges are ineffective; they do not contribute to the information-processing performed by the system. We include them in the formalism precisely to make their lack of contribution explicit, see Remark 3.

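Let $\mathrm{src}(C)$ denote the set of source occasions of the pairs in $C$ and similarly $\mathrm{trg}(C)$ for target occasions. Set the input alphabet $S^C$ of $C$ as the product of the output alphabets of its source occasions and similarly the output alphabet $A^C$ of $C$ as the product of the output alphabets of its target occasions.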

###### Definition 6.

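The category of measuring devices $\mathrm{Meas}_D$ on $D$ has objects $\mathrm{Hom}(VA^C, VS^C)$ for $C \in \mathrm{Sys}_D$. For $C_1 \subset C_2$ define arrow

$$r_{21}: \mathrm{Hom}(VA^{C_2}, VS^{C_2}) \to \mathrm{Hom}(VA^{C_1}, VS^{C_1}): \quad T \mapsto \left[VA^{C_1} \xrightarrow{\pi_A^\natural} VA^{C_2} \xrightarrow{T} VS^{C_2} \xrightarrow{\pi_S} VS^{C_1}\right],$$

where $\pi_A$ and $\pi_S$ are shorthands for projections as in Example 1.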

The reason for naming $\mathrm{Meas}_D$ the category of measuring devices will become clear in §4 below. The two categories are connected by contravariant functor $F_D$:

###### Theorem 4 (structure presheaf).

The structure presheaf $F_D$ taking

$$F_D: \mathrm{Sys}_D^{op} \to \mathrm{Meas}_D: \quad C \mapsto \mathrm{Hom}(VA^C, VS^C) \quad \text{and} \quad i_{12} \mapsto r_{21}$$

satisfies the gluing axiom but has non-unique descent.

Proof: Functor $F_D$ is trivially a presheaf since it is contravariant. It is an interesting presheaf because the gluing axiom holds.

For gluing we need to show that for any collection of subsystems and sections such that for all , there exists section such that for all . This reduces to finding a conditional distribution that causes diagram

in to commute. The vertices are conditional distributions and the arrows are marginalizations, so rewrite as

where and similarly for the vertical arrow. It is easy to see that

$$p(x,y,z|u,v,w) := \frac{p(x,y|u,w)\, p(x,z|v,w)}{p(x|w)}$$

satisfies the requirement.

For $F_D$ to be a sheaf it would also have to satisfy unique descent: the section satisfying the gluing axiom must not only exist for any collection with compatible restrictions but must also be unique. Descent in $F_D$ is not unique because there are many distributions satisfying the requirement above: strictly speaking $r_{21}$ is a marginalization operator rather than restriction. For example, there are many distributions $p(y,z)$ that marginalize to give $p(y)$ and $p(z)$ besides the product distribution $p(y)p(z)$.
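Both halves of the argument can be checked on a toy case. The sketch below drops the conditioning variables $u, v, w$, which is a simplification and not the setting of the proof, and uses hypothetical helper names `marginal` and `glue`. It glues two distributions that agree on their overlap, then exhibits two different joints with identical marginals.

```python
from fractions import Fraction

def marginal(joint, keep):
    """Marginalize a dict {outcome tuple: probability} onto the positions in keep."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0) + prob
    return out

def glue(p_xy, p_xz):
    """Glue via p(x,y,z) = p(x,y) p(x,z) / p(x); inputs must share the marginal p(x)."""
    p_x = marginal(p_xy, [0])
    if p_x != marginal(p_xz, [0]):
        raise ValueError("sections disagree on the overlap")
    return {(x, y, z): pxy * pxz / p_x[(x,)]
            for (x, y), pxy in p_xy.items()
            for (x2, z), pxz in p_xz.items() if x2 == x}

half, quarter = Fraction(1, 2), Fraction(1, 4)
p_xy = {(0, 0): half, (1, 1): half}
p_xz = {(0, 0): quarter, (0, 1): quarter, (1, 0): half}
glued = glue(p_xy, p_xz)

# Non-unique descent: two joints on (y, z) with the same uniform marginals.
independent = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
correlated = {(0, 0): half, (1, 1): half}
```

The glued distribution restricts to both inputs, while `independent` and `correlated` are distinct sections with the same restrictions.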

The structure presheaf depends on the graph structure and alphabets; mechanisms play no role. We now construct a family of sections of $F_D$ using the mechanisms of $D$’s occasions. Specifically, given a subsystem $C$, we show how to glue its occasions’ mechanisms together to form joint mechanism $m_C$. The mechanism of the entire system is recovered as a special case.

In general, subsystem $C$ is not isolated: it receives inputs along edges contained in $D$ but not in $C$. Inputs along these edges cannot be assigned a fixed value since in general there is no preferred element of an alphabet. They also cannot be ignored since each mechanism $m_l$ is defined as receiving inputs from all its sources. Nevertheless, the mechanism of $C$ should depend on $C$ alone. We therefore treat edges not in $C$ as sources of extrinsic noise by marginalizing with respect to the uniform distribution as in Corollary 3.

For each vertex let . We then have projection . Define

$$m_{C_l} := \left[VS^{C_l} \xrightarrow{\pi_l^\natural} VS_l \xrightarrow{m_l} VA_l\right]. \tag{4}$$

It follows immediately that $C$ is itself a distributed dynamical system defined by its graph, whose alphabets are inherited from $D$ and whose mechanisms are constructed by marginalizing.

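Next, we tensor the mechanisms of individual occasions and glue them together using the diagonal map $\iota_\Delta$. The diagonal map used here (which is surjective in the sense that all rows contain non-zero entries) generalizes the diagonal map introduced in §2 and removes redundancies in $\bigotimes_{v_l \in \mathrm{trg}(C)} VS^{C_l}$, which may, for example, include the same source alphabets many times in different factors.

Let mechanism $m_C$ be

$$m_C := \left[VS^C \xrightarrow{\iota_\Delta} \bigotimes_{v_l \in \mathrm{trg}(C)} VS^{C_l} \xrightarrow{\bigotimes_{v_l \in \mathrm{trg}(C)} m_{C_l}} VA^C\right]. \tag{5}$$

The dual of $m_C$ is

$$m_C^\natural := \left[VA^C \to VS^C\right]. \tag{6}$$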

Finally, we find that we have constructed a family of sections of $F_D$:

###### Definition 7.

The quale $q_D$ is the family of sections of $F_D$ constructed in Eqs. (4), (5) and (6)

$$q_D := \left\{ m_C^\natural \in F(C) = \mathrm{Hom}(VA^C, VS^C) \;\middle|\; C \in \mathrm{Sys}_D \right\}.$$

The construction used to glue together the mechanism of the entire system can also be used to construct the mechanism of any subsystem, which provides a window – the quale – into the compositional structure of distributed processes.

## 4 Measurement

This section adapts Definition 1 to distributed stochastic systems. The first step is to replace elements of state space with stochastic maps , or equivalently probability distributions on , which are the system’s inputs. Individual elements of correspond to Dirac distributions.

Second, replace function with mechanism . Since we are interested in the compositional structure of measurements we also consider submechanisms . However, comparing mechanisms requires that they have the same domain and range, so we extend to the entire system as follows

 (7)

We refer to the extension as by abuse of notation. We extend mechanisms implicitly whenever necessary without further comment. Extending mechanisms in this way maps the quale into a cloud of points in labeled by objects in .

In the special case of the initial object , define

###### Remark 3.

Subsystems differing by non-existent edges (Remark 2) are mapped to the same mechanism by this construction, thus making the fact that the edges do not exist explicit within the formalism.

Composing an input with a submechanism yields an output , which is a probability distribution on . We are now in a position to define

###### Definition 8.

A measuring device is the dual $m_C^\natural$ to the mechanism of a subsystem. An output is a stochastic map $d_{out}: \mathbb{R} \to VA^D$. A measurement is a composition $m_C^\natural \circ d_{out}$.

Recall that stochastic maps of the form correspond to probability distributions on . Outputs as defined above are thus probability distributions on , the output alphabet of . Individual elements of are recovered as Dirac vectors: .

###### Definition 9.

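The effective information generated by $C_1$ in the context of subsystem $C_2 \subset C_1$ is

$$ei(m_{C_2} \to m_{C_1}, d_{out}) := H\left[m_{C_1}^\natural \circ d_{out} \,\Big\|\, m_{C_2}^\natural \circ d_{out}\right]. \tag{8}$$

The null context, corresponding to the empty subsystem $\bot = \emptyset$, is a special case where $m_{C_2}^\natural \circ d_{out}$ is replaced by the uniform distribution on $S^D$. To simplify notation define

$$ei(m_C, d_{out}) := ei(m_\bot \to m_C, d_{out}).$$

Here, $H[p \| q] = \sum_i p_i \log_2 \frac{p_i}{q_i}$ is the Kullback-Leibler divergence or relative entropy. Eq. (8) expands as

$$ei(m_{C_2} \to m_{C_1}, d_{out}) = \sum_{s \in S^D} \big\langle m_{C_1}^\natural \circ d_{out} \,\big|\, \delta_s \big\rangle \cdot \log_2 \frac{\big\langle m_{C_1}^\natural \circ d_{out} \,\big|\, \delta_s \big\rangle}{\big\langle m_{C_2}^\natural \circ d_{out} \,\big|\, \delta_s \big\rangle}. \tag{9}$$

When $d_{out} = \delta_a$ for some $a \in A^D$ we have

$$ei(m_{C_2} \to m_{C_1}, \delta_a) = \sum_{s \in S^D} p_{m_{C_1}}(s|a) \cdot \log_2 \frac{p_{m_{C_1}}(s|a)}{p_{m_{C_2}}(s|a)}. \tag{10}$$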

Definition 8 requires some unpacking. To relate it to the classical notion of measurement, Definition 1, we consider system $D$ with two vertices $v_X \to v_Y$, where the alphabets of $v_X$ and $v_Y$ are the sets $X$ and $Y$ respectively, and the mechanism of $v_Y$ is $Vf$. In other words, system $D$ corresponds to a single deterministic function $f: X \to Y$.

###### Proposition 5 (classical measurement).

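The measurement $(Vf)^\natural \circ \delta_y$ performed when deterministic function $f: X \to Y$ outputs $y$ is equivalent to the preimage $f^{-1}(y)$. Effective information is $ei(Vf, \delta_y) = \log_2 \frac{|X|}{|f^{-1}(y)|}$.

Proof: By Corollary 2 measurement $(Vf)^\natural \circ \delta_y$ is conditional distribution

$$p_{Vf}(x|y) = \begin{cases} \dfrac{1}{|f^{-1}(y)|} & \text{if } f(x) = y \\ 0 & \text{else,} \end{cases}$$

which generalizes the preimage. Effective information follows immediately, since the relative entropy between the uniform distribution on $f^{-1}(y)$ and the uniform distribution on $X$ is $\log_2 \frac{|X|}{|f^{-1}(y)|}$.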

Effective information can be interpreted as quantifying a measurement’s precision. It is high if few inputs cause $f$ to output $y$ out of many – i.e. $f^{-1}(y)$ has few elements relative to $|X|$ – and conversely is low if many inputs cause $f$ to output $y$ – i.e. if the output is relatively insensitive to changes in the input. Precise measurements say a lot about what the input could have been and conversely for vague measurements with low $ei$.
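As a rough illustration of Proposition 5, the sketch below computes effective information for a deterministic function directly from the size of the preimage and cross-checks it against the Kullback-Leibler divergence from the uniform distribution. The function names and the mod-4 example are hypothetical.

```python
import math

def effective_information(f, X, y):
    """ei = log2(|X| / |f^{-1}(y)|) for a deterministic function f on finite X."""
    preimage = [x for x in X if f(x) == y]
    if not preimage:
        raise ValueError("y is never output, so the measurement is undefined")
    return math.log2(len(X) / len(preimage))

def kl_from_uniform(f, X, y):
    """Relative entropy between uniform-on-preimage and uniform-on-X, in bits."""
    preimage = [x for x in X if f(x) == y]
    p = 1.0 / len(preimage)   # posterior mass on each preimage element
    q = 1.0 / len(X)          # null-context (uniform) mass
    return sum(p * math.log2(p / q) for _ in preimage)

X = list(range(8))
precise = effective_information(lambda x: x % 4, X, 1)  # preimage {1, 5}
vague = effective_information(lambda x: 0, X, 0)        # preimage is all of X
```

Here the mod-4 reading rules out six of eight inputs and generates 2 bits, while the constant function rules out nothing and generates 0 bits.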

The point of this paper is to develop techniques for studying measurements constructed out of two or more functions. We therefore present computations for the simplest case, distributed system , in considerable detail. Let be the graph

with obvious assignments of alphabets and the mechanism of as . To make the formulas more readable let , and . We then obtain lattice

 (11)

The remainder of this section and most of the next analyzes measurements in the lattice.

###### Proposition 6 (partial measurement).

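The measurement performed on $X$ when $g$ outputs $z$, treating $Y$ as extrinsic noise, is conditional distribution

$$p(x|z) = \begin{cases} \dfrac{|g^{-1}_{x \times Y}(z)|}{|g^{-1}(z)|} & \text{if } g(x,y) = z \text{ for some } y \in Y \\ 0 & \text{else,} \end{cases} \tag{12}$$

where $g^{-1}_{x \times Y}(z) := g^{-1}(z) \cap (\{x\} \times Y)$. The effective information generated by the partial measurement is

$$ei(m_{X\bullet}, \delta_z) = \log_2 |X| + \sum_{x \in X} p(x|z) \cdot \log_2 p(x|z). \tag{13}$$

Proof: Treating $Y$ as a source of extrinsic noise yields $m_{X\bullet} = Vg \circ \pi_X^\natural: VX \to VZ$ which takes $\delta_x \mapsto \frac{1}{|Y|} \sum_{y \in Y} \delta_{g(x,y)}$. The dual is

$$m_{X\bullet}^\natural = \pi_{XY,X} \circ (Vg)^\natural: \quad \delta_z \mapsto \sum_{x \in X} \frac{|g^{-1}_{x \times Y}(z)|}{|g^{-1}(z)|} \cdot \delta_x.$$

The computation of effective information follows immediately.

A partial measurement is precise if the preimage $g^{-1}(z)$ has small or empty intersection with $\{x\} \times Y$ for most $x$, and large intersection for few $x$.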
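Proposition 6 can be explored numerically with a short sketch. The function names are hypothetical, and the Boolean functions AND and XOR are chosen here purely for illustration; terms with $p(x|z) = 0$ are skipped, following the usual convention $0 \log 0 = 0$.

```python
import math
from itertools import product

def partial_measurement(g, X, Y, z):
    """p(x|z): fraction of the preimage of z that lies in the slice {x} x Y."""
    preimage = [(x, y) for x, y in product(X, Y) if g(x, y) == z]
    if not preimage:
        raise ValueError("z is never output")
    return {x: sum(1 for (a, _) in preimage if a == x) / len(preimage) for x in X}

def partial_ei(g, X, Y, z):
    """Eq. (13): log2|X| + sum over x of p(x|z) log2 p(x|z)."""
    p = partial_measurement(g, X, Y, z)
    return math.log2(len(X)) + sum(q * math.log2(q) for q in p.values() if q > 0)

bits = [0, 1]
and_gate = lambda x, y: x & y
xor_gate = lambda x, y: x ^ y

ei_and = partial_ei(and_gate, bits, bits, 1)  # output 1 pins down x = 1
ei_xor = partial_ei(xor_gate, bits, bits, 1)  # output 1 says nothing about x alone
```

For AND the partial measurement of $X$ is maximally precise (1 bit), whereas for XOR it is completely uninformative once $Y$ is treated as noise.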

Propositions 5 and 6 compute effective information of a measurement relative to the null context provided by complete ignorance (the uniform distribution). We can also compute the effective information generated by a measurement in the context of a submeasurement:

###### Proposition 7 (relative measurement).

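The information generated by measurement $m_{XY}^\natural \circ \delta_z$ in the context of the partial measurement $m_{X\bullet}^\natural \circ \delta_z$, where $Y$ is unobserved noise, is

$$ei(m_{X\bullet} \to m_{XY}, \delta_z) = \sum_{x \in X} \frac{|g^{-1}_{x \times Y}(z)|}{|g^{-1}(z)|} \log_2 \frac{|Y|}{|g^{-1}_{x \times Y}(z)|}. \tag{14}$$

Proof: Applying Propositions 5 and 6 obtains

$$ei(m_{X\bullet} \to m_{XY}, \delta_z) = \sum_{(x,y) \in g^{-1}(z)} \frac{1}{|g^{-1}(z)|} \log_2 \left[ \frac{1}{|g^{-1}(z)|} \cdot \frac{|g^{-1}(z)| \cdot |Y|}{|g^{-1}_{x \times Y}(z)|} \right]$$

which simplifies to the desired expression.

To interpret the result decompose $g: X \times Y \to Z$ into a family of functions $\{g_x: Y \to Z\}$ labeled by elements of $X$, where $g_x(y) := g(x,y)$. The precision of the measurement performed by $g_x$ is $\log_2 \frac{|Y|}{|g_x^{-1}(z)|}$. It follows that the precision of the relative measurement, Eq. (14), is the expected precision of the measurements performed by family $\{g_x\}$ taken with respect to the probability distribution generated by the noisy measurement.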

In the special case of relative precision is simply the difference of the precision of the larger and smaller subsystems:

###### Corollary 8 (comparing measurements).
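$$ei(m_{X\bullet} \to m_{XY}, \delta_z) = ei(m_{XY}, \delta_z) - ei(m_{X\bullet}, \delta_z)$$

Proof: Applying Propositions 5, 6, 7 and simplifying obtains

$$\begin{aligned}
ei(m_{XY}, \delta_z) - ei(m_{X\bullet}, \delta_z)
&= \log_2 \frac{|X| \cdot |Y|}{|g^{-1}(z)|} - \sum_x \frac{|g^{-1}_{x \times Y}(z)|}{|g^{-1}(z)|} \log_2 \frac{|X| \cdot |g^{-1}_{x \times Y}(z)|}{|g^{-1}(z)|} \\
&= \log_2 \frac{|Y|}{|g^{-1}(z)|} + \sum_{(x,y) \in g^{-1}(z)} \frac{1}{|g^{-1}(z)|} \log_2 \frac{|g^{-1}(z)|}{|g^{-1}_{x \times Y}(z)|} \\
&= ei(m_{X\bullet} \to m_{XY}, \delta_z). \quad \blacksquare
\end{aligned}$$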
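The identity in Corollary 8 is easy to sanity-check on a small example. The sketch below evaluates Eq. (14) and the two null-context quantities for AND with output 0. All function names are hypothetical and the code assumes $z$ lies in the image of $g$.

```python
import math
from itertools import product

def slices(g, X, Y, z):
    """Return |g^{-1}(z)| and the slice sizes |g^{-1}_{x x Y}(z)| for each x."""
    preimage = [(x, y) for x, y in product(X, Y) if g(x, y) == z]
    by_x = {x: sum(1 for (a, _) in preimage if a == x) for x in X}
    return len(preimage), by_x

def ei_full(g, X, Y, z):
    """Proposition 5 applied to g: log2(|X||Y| / |g^{-1}(z)|)."""
    n, _ = slices(g, X, Y, z)
    return math.log2(len(X) * len(Y) / n)

def ei_partial(g, X, Y, z):
    """Eq. (13), skipping empty slices."""
    n, by_x = slices(g, X, Y, z)
    return math.log2(len(X)) + sum((k / n) * math.log2(k / n)
                                   for k in by_x.values() if k)

def ei_relative(g, X, Y, z):
    """Eq. (14), skipping empty slices."""
    n, by_x = slices(g, X, Y, z)
    return sum((k / n) * math.log2(len(Y) / k) for k in by_x.values() if k)

bits = [0, 1]
and_gate = lambda x, y: x & y
lhs = ei_relative(and_gate, bits, bits, 0)
rhs = ei_full(and_gate, bits, bits, 0) - ei_partial(and_gate, bits, bits, 0)
```

Both sides evaluate to 1/3 of a bit for this example, up to floating point error.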

## 5 Entanglement

The proof of Theorem 4 showed the structure presheaf has non-unique descent, reflecting the fact that measuring devices do not necessarily reduce to products of subdevices. Similarly, as we will see, measurements do not in general decompose into independent submeasurements. Entanglement, $\gamma$, quantifies how far a measurement diverges in bits from the product of its submeasurements. It turns out that $\gamma > 0$ is necessary for a system to generate more information than the sum of its components: non-unique descent thus provides “room at the top” to build systems that perform more precise measurements collectively than the sum of their components.

Entanglement has no direct relation to quantum entanglement. The name was chosen because of a formal resemblance between the two quantities, see Supplementary Information of .

###### Definition 10.

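Entanglement over partition $P$ of the source occasions of $D$ into $m$ parts is

$$\gamma(m_D, P, d_{out}) = H\left[ m_D^\natural \circ d_{out} \,\Big\|\, \bigotimes_{j=1}^{m} \pi_j \circ m_j^\natural \circ d_{out} \right]$$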

where and .

Projecting via marginalizes onto the subspace . Entanglement thus compares the measurement performed by the entire system with submeasurements over the decomposition of the source occasions into partition .

###### Theorem 9 (effective information decomposes additively when entanglement is zero).
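$$\gamma(m_D, P, d_{out}) = 0 \implies ei(m_D, d_{out}) = \sum_{j=1}^{m} ei(m_j, d_{out}).$$

Proof: Follows from the observations that (i) $H[p \| q] = 0$ if and only if $p = q$; (ii) $H[p_1 \otimes p_2 \| q_1 \otimes q_2] = H[p_1 \| q_1] + H[p_2 \| q_2]$; and (iii) the uniform distribution on $S^D$ is a tensor of uniform distributions on subsystems of $D$.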

The theorem shows the relationship between effective information and entanglement. If a system generates more information “than it should” (meaning, more than the sum of its subsystems), then the measurements it generates are entangled. Alternatively, only indecomposable measurements can be more precise than the sum of their submeasurements.

We conclude with some detailed computations for , Diagram (11). Let .

###### Theorem 10 (entanglement and effective information for g:X×Y→Z).
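$$\begin{aligned}
\gamma(m_{XY}, P, \delta_z)
&= \sum_{(x,y) \in g^{-1}(z)} \frac{1}{|g^{-1}(z)|} \log_2 \frac{|g^{-1}(z)|}{|g^{-1}_{x \times Y}(z)| \cdot |g^{-1}_{X \times y}(z)|} \\
&= ei(m_{XY}, \delta_z) - ei(m_{X\bullet}, \delta_z) - ei(m_{\bullet Y}, \delta_z).
\end{aligned}$$

Proof: The first equality follows from Propositions 5 and 6

$$\gamma(m_{XY}, P, \delta_z) = \sum_{(x,y) \in g^{-1}(z)} \frac{1}{|g^{-1}(z)|} \log_2 \left[ \frac{1}{|g^{-1}(z)|} \cdot \frac{|g^{-1}(z)|}{|g^{-1}_{x \times Y}(z)|} \cdot \frac{|g^{-1}(z)|}{|g^{-1}_{X \times y}(z)|} \right].$$

From the same propositions it follows that $ei(m_{XY}, \delta_z) - ei(m_{X\bullet}, \delta_z) - ei(m_{\bullet Y}, \delta_z)$ equals

$$\begin{aligned}
&\log_2 \frac{|X| \cdot |Y|}{|g^{-1}(z)|} - \sum_x \frac{|g^{-1}_{x \times Y}(z)|}{|g^{-1}(z)|} \log_2 \frac{|X| \cdot |g^{-1}_{x \times Y}(z)|}{|g^{-1}(z)|} - \sum_y \frac{|g^{-1}_{X \times y}(z)|}{|g^{-1}(z)|} \log_2 \frac{|Y| \cdot |g^{-1}_{X \times y}(z)|}{|g^{-1}(z)|} \\
&= \log_2 \frac{1}{|g^{-1}(z)|} - \sum_{(x,y) \in g^{-1}(z)} \frac{1}{|g^{-1}(z)|} \cdot \log_2 \left[ \frac{|g^{-1}_{X \times y}(z)|}{|g^{-1}(z)|} \cdot \frac{|g^{-1}_{x \times Y}(z)|}{|g^{-1}(z)|} \right].
\end{aligned}$$

Entanglement quantifies how far the size of the pre-image $g^{-1}(z)$ deviates from the sizes of its $\{x\} \times Y$ and $X \times \{y\}$ slices as $x$ and $y$ are varied.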
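Both expressions in Theorem 10 can be evaluated on small Boolean functions. The following sketch uses hypothetical function names, takes AND and XOR as illustrative choices of $g$, and assumes $z$ lies in the image of $g$.

```python
import math
from itertools import product

def preimage_counts(g, X, Y, z):
    """Preimage of z together with the sizes of its row and column slices."""
    pre = [(x, y) for x, y in product(X, Y) if g(x, y) == z]
    row = {x: sum(1 for (a, _) in pre if a == x) for x in X}  # |g^{-1}_{x x Y}(z)|
    col = {y: sum(1 for (_, b) in pre if b == y) for y in Y}  # |g^{-1}_{X x y}(z)|
    return pre, row, col

def entanglement(g, X, Y, z):
    """First expression of Theorem 10."""
    pre, row, col = preimage_counts(g, X, Y, z)
    n = len(pre)
    return sum(math.log2(n / (row[x] * col[y])) for x, y in pre) / n

def ei_difference(g, X, Y, z):
    """Second expression: ei(m_XY) - ei(m_X.) - ei(m_.Y)."""
    pre, row, col = preimage_counts(g, X, Y, z)
    n = len(pre)
    ei_xy = math.log2(len(X) * len(Y) / n)
    ei_x = math.log2(len(X)) + sum((k / n) * math.log2(k / n) for k in row.values() if k)
    ei_y = math.log2(len(Y)) + sum((k / n) * math.log2(k / n) for k in col.values() if k)
    return ei_xy - ei_x - ei_y

bits = [0, 1]
and_gate = lambda x, y: x & y
xor_gate = lambda x, y: x ^ y

gamma_xor = entanglement(xor_gate, bits, bits, 1)
gamma_and = entanglement(and_gate, bits, bits, 1)
```

In this example XOR outputting 1 is entangled by a full bit, since neither input alone is informative, while AND outputting 1 has zero entanglement because the measurement factorizes into $x = 1$ and $y = 1$.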

By Corollary 8 entanglement also equals