Unification of field theory and maximum entropy methods for learning probability densities

11/19/2014
by   Justin B. Kinney, et al.
Cold Spring Harbor Laboratory

The need to estimate smooth probability distributions (a.k.a. probability densities) from finite sampled data is ubiquitous in science. Many approaches to this problem have been described, but none is yet regarded as providing a definitive solution. Maximum entropy estimation and Bayesian field theory are two such approaches. Both have origins in statistical physics, but the relationship between them has remained unclear. Here I unify these two methods by showing that every maximum entropy density estimate can be recovered in the infinite smoothness limit of an appropriate Bayesian field theory. I also show that Bayesian field theory estimation can be performed without imposing any boundary conditions on candidate densities, and that the infinite smoothness limit of these theories recovers the most common types of maximum entropy estimates. Bayesian field theory is thus seen to provide a natural test of the validity of the maximum entropy null hypothesis. Bayesian field theory also returns a lower entropy density estimate when the maximum entropy hypothesis is falsified. The computations necessary for this approach can be performed rapidly for one-dimensional data, and software for doing this is provided. Based on these results, I argue that Bayesian field theory is poised to provide a definitive solution to the density estimation problem in one dimension.


I Introduction

Research in nearly all fields of science routinely calls for the estimation of smooth probability densities from finite sampled data Silverman (1986); Scott (1992). Indeed, the presence of histograms in a large fraction of the scientific literature attests to this need. But the problem of how to go beyond a histogram and recover a smooth probability distribution has yet to find a definitive solution, even in one dimension.

The reader might find this state of affairs surprising. Many different methods for estimating smooth probability densities are well known and commonly used. One of the most popular methods is kernel density estimation Silverman (1986); Scott (1992). Kernel density estimation is easy to carry out, but this approach has little theoretical justification and there is no consensus on certain basic aspects of its use, such as how to choose a kernel width or how to treat data points near a boundary Eggermont and LaRiccia (2001). Bayesian inference of a Gaussian mixture model is another common method. This approach, however, requires that one assume an explicit functional form of the density that one wishes to learn.
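To make the bandwidth issue concrete, here is a minimal kernel density estimate in Python (assuming NumPy and SciPy; purely illustrative and not part of the method developed below). Two defensible bandwidth choices already give noticeably different estimates from the same sample.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100)     # a toy one-dimensional sample

kde_default = gaussian_kde(data)                    # bandwidth set by Scott's rule
kde_narrow = gaussian_kde(data, bw_method=0.1)      # a manually chosen, narrower kernel

x = np.linspace(-4.0, 4.0, 200)
print(np.max(np.abs(kde_default(x) - kde_narrow(x))))   # the two estimates visibly disagree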

Concepts from statistical physics have given rise to two alternative approaches to the density estimation problem: maximum entropy (MaxEnt) Jaynes (1957); Mead and Papanicolaou (1984) and Bayesian field theory Bialek et al. (1996); Nemenman and Bialek (2002); Kinney (2014); Enßlin et al. (2009); Lemm (2003); Holy (1997); Periwal (1997); Aida (1999); Schmidt (2000). Each of these approaches has a firm but distinct theoretical basis. MaxEnt derives from the principle of maximum entropy as described by Jaynes in 1957 Jaynes (1957). Bayesian field theory, which is also referred to as “information field theory” in some of the literature Enßlin et al. (2009), instead uses the standard methods of Bayesian inference together with priors that weight possible densities according to an explicit measure of smoothness, without requiring a particular functional form. Perhaps because the principles underlying these two methods are different, the relationship between these approaches has remained unclear.

MaxEnt density estimation is carried out as follows. One first uses sampled data to estimate values for a chosen set of moments, e.g., mean and variance. Typically, all moments up to some specified order are selected Mead and Papanicolaou (1984); Ormoneit and White (1999). The probability density that matches these moments while having the maximum possible entropy is then adopted as one’s estimate. All other information in the data is discarded. One can therefore think of the MaxEnt estimate as a null hypothesis reflecting the assumption that there is no useful information in the data beyond the values of the specified moments Good (1963).
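As a concrete illustration of this procedure (a sketch under assumed conventions, not the software accompanying this paper), the following Python snippet exploits the fact that the MaxEnt density matching moments up to order K has the exponential-family form Q(x) proportional to exp(sum_k a_k x^k), and solves for the coefficients a_k so that the grid moments of Q match the sample moments. The function name and grid conventions are illustrative choices.

import numpy as np
from scipy.optimize import fsolve

def maxent_1d(data, K=2, n_grid=400):
    """MaxEnt density on a grid, matched to the first K sample moments."""
    xs = np.linspace(data.min() - 1.0, data.max() + 1.0, n_grid)
    h = xs[1] - xs[0]
    powers = np.vstack([xs ** k for k in range(1, K + 1)])            # shape (K, n_grid)
    target = np.array([np.mean(data ** k) for k in range(1, K + 1)])  # sample moments

    def moment_mismatch(a):
        w = np.exp(a @ powers)               # unnormalized exponential-family density
        Q = w / (w.sum() * h)                # normalize on the grid
        return powers @ Q * h - target       # grid moments minus sample moments

    a = fsolve(moment_mismatch, np.zeros(K))
    w = np.exp(a @ powers)
    return xs, w / (w.sum() * h)

rng = np.random.default_rng(0)
xs, Q = maxent_1d(rng.normal(size=100), K=2)   # K = 2 yields a (truncated) Gaussian shape

For higher orders or badly scaled data the root finder may need a better starting point; the point here is only the structure of the estimate, not numerical robustness.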

In the Bayesian field theory approach, one first defines a prior on the space of continuous probability densities. This prior is formulated using a scalar field theory that favors smooth probability densities over rugged ones. The data are then used to compute a Bayesian posterior, and from this one identifies the maximum a posteriori (MAP) density estimate. Simple field theory priors require that one assume an explicit smoothness length scale. However, an optimal value for this length scale can be learned from the data in a natural way if one instead adopts a prior formed from a scale-free mixture of these simple field theories Bialek et al. (1996); Nemenman and Bialek (2002); Kinney (2014). Scale-free Bayesian field theories thus provide a way to estimate probability densities without having to specify any tunable parameters.

One problem with the field theory priors that have been considered for this purpose thus far Bialek et al. (1996); Nemenman and Bialek (2002); Kinney (2014) is that they impose boundary conditions on candidate densities. This assumption of boundary conditions is standard practice in physics; it greatly aids analytic calculations and is often well-motivated by physical reasoning. In the density estimation context, however, such boundary conditions limit the types of data sets for which such field theory priors would be appropriate. MaxEnt, by contrast, does not impose any boundary conditions on the density estimates it provides.

Here I describe a class of Bayesian field theory priors that have no boundary conditions. These priors yield MAP density estimates that exactly match the first few moments of the data. In the infinite smoothness limit, such MAP estimates become identical to MaxEnt estimates constrained by these same moments. More generally, I show that a MaxEnt density estimate matched to any moments of the data can be recovered from Bayesian field theory in the infinite smoothness limit; one need only choose a field theory prior that defines “smoothness” appropriately.

This unification of Bayesian field theory and MaxEnt density estimation further suggests a natural way to test the validity of the MaxEnt hypothesis against one’s data. If Bayesian field theory identifies the infinite smoothness limit as optimal for one’s data set, the MaxEnt hypothesis is validated. If instead the optimal smoothness length scale is finite, the MaxEnt hypothesis is rejected in favor of a nonparametric density estimate that matches the same moments of the data but has lower entropy.

This paper is structured as follows. Section II describes the derivation of an action that governs the posterior probability of densities under a specific class of Bayesian field theories. Section III describes how the MAP density, which minimizes this action, can be uniquely derived without assuming any boundary conditions. A differential operator I call the “bilateral Laplacian” plays a central role in eliminating these boundary conditions. Section IV shows that such MAP density estimates reduce to MaxEnt estimates in the infinite smoothness limit. Section V derives an expression for a quantity, the “evidence ratio,” that allows one to select the optimal length scale given the data. The behavior of this evidence ratio at large length scales is shown to be characterized by a single coefficient, the sign of which provides a novel analytic test of the MaxEnt assumption. Section VI formalizes a discrete-space representation of this Bayesian field theory inference procedure. In addition to being essential for the computational implementation of this method, this discrete representation greatly clarifies why no boundary conditions are required to derive the MAP density when one makes use of the bilateral Laplacian. Section VII describes how to compute the MAP density (to a specified precision) at all length scales. Section VIII illustrates this density estimation approach on simulated data sets. A summary and discussion are provided in Section IX.

Detailed derivations of various results from Sections II through VI are provided in Appendices A-D. Appendix E presents details of a predictor-corrector homotopy algorithm that allows the density estimation computations described in this paper to be carried out. An open source software implementation of this algorithm for one-dimensional density estimation is provided (available at https://github.com/jbkinney/14_maxent). Finally, Appendix F gives an expanded discussion of how Bayesian field theory relates to earlier work in statistics on “maximum penalized likelihood” Silverman (1982); Eggermont and LaRiccia (2001); Gu (2013).

II Bayesian field theory

The main results of this paper are elaborated in the context of one-dimensional density estimation. Many of our results are readily extended to higher dimensions, however, at least in principle. This issue is discussed in more detail later on.

Suppose we are given data points sampled from a smooth probability density that is confined to an interval of fixed length. Our goal is to estimate this density from the data. Following Kinney (2014), we first represent each candidate density in terms of a real field via

(1)

This parametrization ensures that each candidate density is positive and normalized (integrals over the data variable are restricted to the interval of interest). Next we adopt a field theory prior on the field. Specifically, we consider priors of the form

(2)

where

(3)

is the “action” corresponding to this prior and

(4)

is the associated partition function. The first parameter appearing in the action is a real length scale below which fluctuations in the field are strongly damped. The second parameter, the order of the derivative appearing in the action, reflects a fundamental choice in how we define “smoothness.” In this paper we consider arbitrary positive integer values of this order, for reasons that will become clear. Note, however, that previous work has explored the consequences of using non-integer values Nemenman and Bialek (2002).

As shown in Appendix A, this choice of prior allows us to compute an exact posterior probability over candidate densities. We find that

(5)

where

(6)

is a nonlinear action,

(7)

is the corresponding partition function, and

(8)

is the raw data density.

The derivation of Eq. (6) is somewhat subtle. In particular, the action gives a posterior probability over the field that is not related to the corresponding prior via Bayes’s rule. However, upon marginalizing over the constant component of the field, one finds that the resulting posterior over densities is indeed related to its prior via Bayes’s rule. This latter fact is sufficient to justify the use of Eq. (6) in what follows. See Appendix A for details.
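The parametrization at the heart of this construction is easy to state in grid-based code. The following sketch (assumed conventions, including the common sign choice Q proportional to exp(-phi); not the software accompanying this paper) shows that any real field yields a candidate density that is automatically positive and normalized.

import numpy as np

def field_to_density(phi, h):
    """Map a real field on a grid with spacing h to a positive, normalized density."""
    w = np.exp(-phi)            # positive for any real phi
    return w / (w.sum() * h)    # normalized so that sum(Q) * h = 1

phi = np.random.default_rng(1).normal(size=80)   # an arbitrary candidate field
h = 0.05
Q = field_to_density(phi, h)
print(bool(np.all(Q > 0)), bool(np.isclose(Q.sum() * h, 1.0)))   # True True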

III Eliminating boundary conditions

The MAP field is defined as the field that minimizes the action. To obtain a differential equation for this field, previous work Bialek et al. (1996); Nemenman and Bialek (2002); Kinney (2014) imposed periodic boundary conditions on candidate fields and used integration by parts to derive

(9)

With the periodic boundary conditions in place, this differential equation has a unique solution. However, imposing these boundary conditions amounts to assuming that the field takes the same value at both ends of the interval. It is not hard to imagine data sets for which this assumption would be problematic.

It is true, of course, that Eq. (9) requires boundary conditions in order to have a unique solution. The reason boundary conditions are needed is the appearance of the standard high-order Laplacian operator. However, we assumed boundary conditions on the field in order to derive Eq. (9) in the first place. It has therefore not been established that boundary conditions are required for the action to have a unique minimum.

In fact, the action has a unique minimum without the imposition of any boundary conditions on the field. The boundary conditions assumed in previous work Bialek et al. (1996); Nemenman and Bialek (2002); Kinney (2014) are therefore unnecessary. Indeed, from Eq. (6) alone we can derive a differential equation that uniquely specifies the MAP field.

We start by rewriting the action as

(10)

where the differential operator is defined by the requirement that

(11)

for any two fields. In what follows we refer to this operator as the “bilateral Laplacian” of the corresponding order. Note that the bilateral Laplacian is a positive semi-definite operator, since

(12)

for every real field.

We now prove that the MAP field is unique by showing that the action is a strictly convex function of the field. Consider the change in the action upon an infinitesimal perturbation of the field in the direction of a normalized test field. The action will change by an amount

Because the bilateral Laplacian is positive semi-definite, the second-order term in this expansion is bounded from below by a strictly positive quantity. The Hessian of the action is therefore positive definite at every field configuration, establishing the strict convexity of the action and thus the uniqueness of the MAP field.
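The structure of this argument can be checked numerically with a schematic sketch (the exact prefactors of the action are not reproduced here): a positive semi-definite kinetic matrix, which by itself has a nontrivial kernel, becomes strictly positive definite once an everywhere-positive diagonal term, playing the role of the exponential term in the action, is added.

import numpy as np

rng = np.random.default_rng(2)
G = 30
D = (np.eye(G, k=1) - np.eye(G))[:G - 1]            # first-difference matrix
kinetic = D.T @ D                                    # positive semi-definite; kernel = constants
diagonal = np.diag(np.exp(-rng.normal(size=G)))      # e^{-phi} > 0 at every grid point

hessian = kinetic + diagonal
print(np.linalg.eigvalsh(kinetic).min() > -1e-10)    # smallest kinetic eigenvalue is ~0
print(np.linalg.eigvalsh(hessian).min() > 0.0)       # full Hessian is strictly positive definite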

Requiring that the action be stationary gives the following differential equation for the MAP field:

(14)

From the argument above we see that this differential equation, unlike Eq. (9), has a unique solution without the imposition of any boundary conditions on the field.

This lack of a need for boundary conditions in Eq. (14), despite the need for boundary conditions in Eq. (9), is due to a fundamental difference between the standard Laplacian and the bilateral Laplacian. This difference occurs only at the boundaries of the interval. Roughly speaking, the bilateral Laplacian is well-defined at both ends of the interval, whereas the standard Laplacian is not. This point will be clarified in Section VI, when we formulate our Bayesian field theory approach on a finite set of grid points.

In the interior of the interval, however, the bilateral Laplacian is identical to the standard Laplacian. To see this, we integrate Eq. (11) and use integration by parts to derive

The second term on the right-hand side vanishes if the test function is chosen so that it and its derivatives vanish at both ends of the interval. The values of such test functions within the interior of the interval are unconstrained, and so

(16)

IV Connection to maximum entropy

From its definition in Eq. (11), we see that the bilateral Laplacian is symmetric and real. This operator is therefore Hermitian and possesses a complete set of orthonormal eigenvectors with corresponding real eigenvalues. See Appendix B for a discussion of the spectrum of the bilateral Laplacian.

The kernel of the bilateral Laplacian is particularly relevant to the density estimation problem. A field is in this kernel if and only if

(17)

From this we see that the kernel of the bilateral Laplacian is equal to the space of polynomials of degree less than the order of the operator.

In particular, the constant function is in the kernel for every positive integer order. As a result, multiplying Eq. (14) on the left by unity and integrating gives a normalization identity. The MAP density, which is defined in terms of the MAP field by Eq. (1), is thereby seen to have the simplified form,

(18)

If we multiply Eq. (14) on the left by other low-order polynomials and integrate, we further find that

(19)

Therefore, at every length scale, the first few moments of the MAP density exactly match those of the data.

In the infinite smoothness limit, the MAP field is restricted to the kernel of the bilateral Laplacian. The corresponding density thus has the form

(20)

where the values of the coefficients are determined by Eqs. (18) and (19). This limiting density is therefore identical to the MaxEnt density that matches the first few moments of the data Mead and Papanicolaou (1984).

In the opposite limit of vanishing length scale, the kinetic term in Eq. (14) vanishes. Setting the length scale to zero thus gives

(21)

We therefore see that this limiting MAP density is simply the “histogram” of the data, i.e., the normalized sum of delta functions centered at each data point. When we formulate our inference procedure on a grid in Section VI, we will see that this density indeed becomes a bona fide histogram with bins defined by our choice of grid.

The set of MAP densities thus forms a one-parameter “MAP curve” in the space of probability densities, extending from the data histogram at vanishing length scale to the MaxEnt density in the infinite smoothness limit. Every density along this MAP curve exactly matches the first few moments of the data.

More generally, the MaxEnt density estimate constrained to match any set of moments can be recovered in the infinite smoothness limit of an appropriate Bayesian field theory. To see this, consider a MaxEnt estimate chosen to satisfy the generalized moment-matching criteria

(22)

for some set of user-specified functions. A Bayesian field theory that recovers this MaxEnt estimate in the infinite smoothness limit can be readily constructed by using a prior defined by the action

(23)

where the smoothness parameter is positive and the operator appearing in the action is a positive semidefinite operator whose kernel is spanned by the specified functions together with the constant function. The posterior probability will then be governed by the action

(24)

Following the same line of reasoning as above, we find that the MAP density, corresponding to the field that minimizes this action, will satisfy

(25)

regardless of the value of the smoothness parameter. In the infinite smoothness limit, the MaxEnt density will be recovered, i.e.,

(26)

where the coefficients are determined by the constraints in Eq. (25).
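One simple way to realize such a positive semidefinite operator numerically, offered as a hedged sketch rather than the construction used in the paper (which calls for a differential operator; an orthogonal projector is used here instead), is to take the identity minus the projector onto the span of the chosen functions. The result is positive semi-definite and has exactly the prescribed kernel.

import numpy as np

def psd_with_kernel(basis_columns):
    """Return I - P, where P is the orthogonal projector onto span(basis_columns)."""
    Qmat, _ = np.linalg.qr(basis_columns)       # orthonormal basis of the desired kernel
    return np.eye(basis_columns.shape[0]) - Qmat @ Qmat.T

xs = np.linspace(-1.0, 1.0, 50)
basis = np.column_stack([np.ones_like(xs), np.sin(np.pi * xs), xs ** 2])  # example functions
Delta = psd_with_kernel(basis)
print(bool(np.allclose(Delta @ basis, 0.0)))       # the chosen functions lie in (and span) the kernel
print(np.linalg.eigvalsh(Delta).min() > -1e-10)    # positive semi-definite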

V Choosing the length scale

To determine the optimal value of the length scale, we compute the probability of the data as a function of the length scale. This quantity, commonly called the “evidence,” forms the basis for Bayesian model selection Bialek et al. (1996); Nemenman and Bialek (2002); MacKay (2003); Balasubramanian (1997).

For the problem at hand, the evidence vanishes regardless of the data; the reason is that the field theory prior is improper (see Appendix C). However, the evidence ratio, i.e., the evidence at a finite length scale relative to its value in the infinite smoothness limit, is finite. Using a Laplace approximation, which is valid when the number of data points is large, we find that

(27)

Here the subscripts “row” and “ker” indicate restriction to the row space and kernel of the bilateral Laplacian, respectively; the terms inside the determinants stand for diagonal matrices, with the indicated values arrayed along the main diagonal and zeros everywhere else. See Appendix C for the derivation of Eq. (27).
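The general shape of such a Laplace-approximation computation can be sketched generically (this is the textbook Gaussian-integral formula for a convex action on a finite grid, not a reproduction of Eq. (27), and it omits the prior partition function that appears in the evidence ratio).

import numpy as np

def laplace_log_evidence(action_at_map, hessian_at_map):
    """Approximate log of the integral of exp(-S) by expanding S to second order at its minimum."""
    G = hessian_at_map.shape[0]
    sign, logdet = np.linalg.slogdet(hessian_at_map)
    assert sign > 0, "Hessian must be positive definite at the MAP point"
    return -action_at_map + 0.5 * G * np.log(2.0 * np.pi) - 0.5 * logdet

# For a purely quadratic action S = 0.5 * phi.T A phi, the approximation is exact:
A = np.diag([1.0, 2.0, 4.0])
print(np.isclose(laplace_log_evidence(0.0, A),
                 0.5 * 3 * np.log(2.0 * np.pi) - 0.5 * np.log(8.0)))   # True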

By construction, the evidence ratio approaches unity in the limit of large length scales. Whether this limiting value is approached from above or below is relevant to the question of whether the infinite smoothness limit is optimal, and thus whether the MaxEnt hypothesis is consistent with the data. Using perturbation theory about this limit, we find that

(28)

where the coefficient is given by (see the Supplemental Material for a derivation of Eq. (29))

(29)

Here, the indexed quantities denote the eigenvalues and eigenfunctions of the bilateral Laplacian, listed in order of increasing eigenvalue. The eigenfunctions are normalized, and within the degenerate zero-eigenvalue subspace they are oriented so that certain overlap coefficients are positive real numbers.

Eq. (29) provides a plug-in formula that can be used to assess the validity of the MaxEnt hypothesis. If this coefficient is positive, there is guaranteed to be a finite length scale whose evidence ratio exceeds that of the infinite smoothness limit, and the MaxEnt estimate is guaranteed to be non-optimal. On the other hand, if the coefficient is negative, then the infinite smoothness limit is a local optimum that may or may not be a global optimum as well. Numerical computation of the evidence ratio over all length scales is thus needed to resolve whether the MaxEnt hypothesis provides the best explanation of the data in hand.

Figure 1: (Color) Illustration of the predictor-corrector homotopy algorithm. (a) The MAP curve (brown) is approximated using a finite set of densities. First the MAP density at an intermediate length scale is computed. A predictor-corrector algorithm is then used to extend the set of MAP densities outward, to both larger and smaller length scales. These length scales are chosen so that neighboring MAP densities lie within a specified geodesic distance of each other. (b) Each step has two parts. First, a predictor step (magenta) computes a new length scale and an approximation to the corresponding MAP density. A series of corrector steps (orange) then converges to that MAP density.
Figure 2: (Color) Density estimation without boundary conditions using the field theory prior. (a) N = 100 data points were drawn from a simulated density (black) and binned at the grid points. The resulting histogram (gray) is shown along with the field theory density estimate (orange) and the corresponding MaxEnt estimate (blue). (b) The heat map shows the densities interpolating between the MaxEnt density in the infinite smoothness limit and the data histogram at vanishing length scale. (c) The log evidence ratio is shown as a function of length scale. (d) The differential entropy Cover and Thomas (2006) is shown as a function of length scale; the entropy of the MaxEnt estimate is indicated by the dashed line. Dotted lines in (b-d) mark the optimal length scale.
Figure 3: (Color) The optimal estimated density for any particular data set might or might not have maximum entropy. Panels (a-c) show three different choices for the true density (black), along with a histogram (gray) of data points binned at the grid points. In each panel, the true density was chosen to be the sum of two equally weighted normal distributions separated by a distance of (a) 0, (b) 2.5, or (c) 5. Panels (d-f) show the evidence ratio curves computed for 100 data sets respectively drawn from the densities in (a-c). Blue curves indicate data sets for which the MaxEnt estimate was optimal; orange curves indicate a finite optimal length scale. Titles in (d-f) give the percentage of data sets for which the MaxEnt estimate was found to be optimal. The heat maps shown in panels (g-i) report the number of simulated data sets for which the coefficient of Eq. (29) was positive or negative and for which the MaxEnt density was or was not recovered.

VI Discrete space representation

In this section we retrace the entire analysis above in the discrete representation, i.e., with the continuous interval replaced by an evenly spaced set of grid points. This discrete representation is necessary for the computational implementation of our field-theoretic density estimation method. Happily, our main findings above are seen to hold exactly upon discretization. This discrete representation also sheds light on how the bilateral Laplacian differs from the standard Laplacian and why this difference eliminates the need for boundary conditions.

We consider grid points evenly spaced throughout the interval. Specifically, we place the grid points at

(30)

where the grid spacing is set by the interval length and the number of grid points. In moving to this discrete representation, functions on the interval become finite-dimensional vectors with elements indexed by grid point. For instance, the field becomes a vector of field values at the grid points. Integrals become sums, i.e.,

(31)

and path integrals over fields become finite-dimensional integrals over the vector elements, i.e.,

(32)

We denote differential operators in this discrete representation with a grid subscript. The derivative operator becomes a rectangular finite-difference matrix that has fewer rows than columns. For instance, at the lowest order one obtains the matrix

(33)

Similarly, the standard Laplacian becomes a rectangular matrix built by composing derivative matrices. For example, a small grid yields the Laplacian matrix

(34)

Because rows are eliminated upon application of the standard Laplacian, the discrete version of Eq. (9) provides fewer equations than there are unknown elements of the field vector. Additional constraints, typically provided in the form of boundary conditions, are thus needed to obtain a unique solution.

By contrast, the bilateral Laplacian is represented by a square matrix, formed from the derivative matrix multiplied on the left by its own transpose. Indeed, for the same small grid we recover the square bilateral Laplacian matrix

(35)

The middle two rows of the bilateral Laplacian matrix match those of the standard Laplacian matrix, reflecting the equivalence of bilateral and standard Laplacians in the interior of the interval. However, the bilateral Laplacian matrix contains six additional rows, three at either end. These are sufficient to specify a unique solution for the 8 elements of the field vector. More generally, the discrete version of Eq. (14) provides as many equations as there are unknown elements of the field vector, and is therefore able to specify a unique solution without the imposition of any boundary conditions.
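These matrices are easy to construct explicitly. The sketch below (illustrative discretization conventions, not code copied from the paper's software) builds the derivative matrix by composing first differences, forms the standard and bilateral Laplacians from it, and checks the three facts used above: the standard Laplacian matrix has too few rows, the interior rows of the two operators agree up to an overall sign that depends on how the standard Laplacian is defined, and low-order polynomials lie in the kernel of the bilateral Laplacian.

import numpy as np

def diff_matrix(n, h):
    """(n-1) x n matrix of first differences, with rows [-1, 1] / h."""
    return (np.eye(n, k=1) - np.eye(n))[:n - 1] / h

def derivative_op(G, h, alpha):
    """alpha-order derivative: a (G - alpha) x G matrix built by composing first differences."""
    D = np.eye(G)
    for k in range(alpha):
        D = diff_matrix(G - k, h) @ D
    return D

G, h, alpha = 8, 1.0, 3
D_alpha = derivative_op(G, h, alpha)

bilateral = D_alpha.T @ D_alpha                               # square G x G matrix
standard = derivative_op(G - alpha, h, alpha) @ D_alpha       # (G - 2*alpha) x G matrix

print(standard.shape, bilateral.shape)                        # (2, 8) (8, 8)
print(bool(np.allclose(bilateral[alpha:G - alpha], (-1) ** alpha * standard)))  # interior rows agree
grid = np.arange(G, dtype=float)
print(all(np.allclose(bilateral @ grid ** k, 0.0) for k in range(alpha)))  # low-order polynomials in kernel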

Using the bilateral Laplacian, we readily define a discretized version of the prior by adopting

(36)

This leads to the posterior action

(38)

where the data histogram value at each grid point is the fraction of data points discretized to that grid point, divided by the bin width.
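A minimal sketch of this discretized raw data density (illustrative conventions, with the bounding box chosen to contain all of the data):

import numpy as np

def raw_data_density(data, xs):
    """Fraction of data points in each grid bin, divided by the bin width."""
    h = xs[1] - xs[0]
    edges = np.concatenate([xs - h / 2.0, [xs[-1] + h / 2.0]])
    counts, _ = np.histogram(data, bins=edges)
    return counts / (len(data) * h)

data = np.random.default_rng(3).normal(size=100)
xs = np.linspace(data.min() - 0.5, data.max() + 0.5, 100)
R = raw_data_density(data, xs)
print(bool(np.isclose(R.sum() * (xs[1] - xs[0]), 1.0)))   # integrates to one, like a density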

The corresponding equation of motion is

(39)

The kernel of the discrete bilateral Laplacian is spanned by vectors of low-order polynomial form. The analogous moment-matching behavior therefore holds exactly in the discrete representation, i.e.,

(40)

where the MAP density vector is related to the MAP field vector via Eq. (18). In the infinite smoothness limit, the MAP density again has the analogous form

(41)

where the coefficients are chosen to satisfy Eq. (40). Thus, the connection to the MaxEnt density estimate remains intact upon discretization.

The derivation of the evidence ratio in Eq. (27) follows through without modification. The only change is that the functional determinants now become determinants of finite-dimensional matrices. The derivation of the coefficient in Eq. (29) also follows in a similar manner; the only change to Eq. (29) is that the index now runs over a finite rather than an infinite set of eigenvalues.

VII Computing density estimates

To compute density estimates using this field theory approach, we work in the discrete representation described in the previous section. First the user specifies the number of grid points as well as a bounding box for the data. MAP densities are then computed at a finite set of length scales, as illustrated in Fig. 1a. This “string of beads” approximation to the MAP curve allows us to evaluate the evidence ratio over all length scales and, to finite precision, identify the length scale that maximizes it.

This approximation of the MAP curve is computed using a predictor-corrector homotopy algorithm Allgower and Georg (1990). An overview of this algorithm is now given. Please refer to Appendix E for algorithm details. I note that this algorithm provides more transparent precision bounds on the computed densities than does the previously reported algorithm of Kinney (2014).

First, an intermediate length scale is chosen and the corresponding MAP density is computed. This density serves as the starting point from which to trace the MAP curve toward both larger and smaller length scales (Fig. 1a). The algorithm then proceeds in both directions, stepping from length scale to length scale and updating the MAP density at each step.

During each step, the subsequent length scale is chosen so that the corresponding MAP density is sufficiently similar to the MAP density at the preceding length scale. Specifically, in stepping from one length scale to the next, the algorithm chooses the new length scale so that the geodesic distance (see Skilling (2007); Kinney (2014)) between the two corresponding MAP densities matches a user-specified tolerance, i.e.,

(42)

This tolerance was held fixed for the computations described below and shown in Figs. 2 and 3. Stepping in the decreasing direction is terminated at a sufficiently small length scale; similarly, stepping in the increasing direction is terminated at a sufficiently large length scale. The MaxEnt density is computed at the start of the algorithm, essentially as described by Ormoneit and White (1999).
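For concreteness, one standard notion of geodesic distance between two gridded densities is the Fisher-Rao distance, i.e., twice the Bhattacharyya angle; the sketch below uses that convention, which may differ in detail from the convention of the cited references.

import numpy as np

def geodesic_distance(Q1, Q2, h):
    """Fisher-Rao geodesic distance (in radians) between two densities on a grid with spacing h."""
    overlap = np.sum(np.sqrt(Q1 * Q2)) * h          # Bhattacharyya coefficient, at most 1
    return 2.0 * np.arccos(np.clip(overlap, 0.0, 1.0))

h = 0.25
Q1 = np.array([4.0, 0.0, 0.0, 0.0])     # all probability mass in the first bin
Q2 = np.array([0.0, 0.0, 0.0, 4.0])     # all probability mass in the last bin
print(geodesic_distance(Q1, Q1, h), geodesic_distance(Q1, Q2, h))   # 0.0 and pi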

Each step along the MAP curve is accomplished in two parts (Fig. 1b). Given the MAP density at the current length scale, a “predictor step” is used to compute both the next length scale and an approximation to the corresponding MAP density. The repeated application of a “corrector step” is then used to compute a series of densities that converges to that MAP density.

If the numerics are properly implemented, this predictor-corrector algorithm is guaranteed to identify the correct MAP density at each of the chosen length scales. This is because the action is strictly convex in the field and therefore has a unique minimum (as was shown in Section III). The distance criterion in Eq. (42) further ensures that the stepping procedure does not drastically overstep the optimal length scale. It is also worth noting that, because the bilateral Laplacian matrix is sparse, these predictor and corrector steps can be sped up by using sparse matrix methods.

VIII Example analyses

Fig. 2 provides an illustrated example of this density estimation procedure in action. Starting from a set of sampled data (Fig. 2a, gray), the homotopy algorithm computes the MAP density at a set of length scales spanning the relevant range (Fig. 2b). The evidence ratio is then computed at each of these chosen length scales using Eq. (27), and the length scale that maximizes it is identified (Fig. 2c). The corresponding MAP density is then reported as the best estimate of the underlying density (Fig. 2a, orange). If one further wishes to report “error bars” on this estimate, other plausible densities can be drawn from the posterior as described in Kinney (2014).

The optimal length scale may or may not be infinite. If it is infinite, then the reported estimate is the MaxEnt estimate that matches the first few moments of the data. On the other hand, if it is finite, as in Fig. 2, then the reported estimate will have lower entropy than the MaxEnt estimate (Fig. 2d) while still exactly matching the same moments of the data. This reduced entropy reflects the use of additional information in the data beyond the first few moments. It should be noted that the optimal length scale is never zero, due to a vanishing Occam factor in this limit.
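The differential entropy used for this comparison is straightforward to evaluate on a grid; here is a short sketch (in nats, with zero-probability bins skipped):

import numpy as np

def differential_entropy(Q, h):
    """Differential entropy, in nats, of a density evaluated on a grid with spacing h."""
    Q = np.asarray(Q, dtype=float)
    nz = Q > 0
    return -np.sum(Q[nz] * np.log(Q[nz])) * h

# Example: a uniform density on an interval of length 2 has entropy log(2).
h = 0.01
Q_uniform = np.full(200, 0.5)
print(bool(np.isclose(differential_entropy(Q_uniform, h), np.log(2.0))))   # True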

The density estimation procedure proposed in this paper thus provides an automatic test of the MaxEnt hypothesis. It can therefore be used to test whether the true density has a hypothesized functional form. For example, by matching the mean and variance we can test whether our data were drawn from a Gaussian distribution. This is demonstrated in Fig. 3. In these tests, when data were indeed drawn from a Gaussian density, the MaxEnt estimate was recovered for a large fraction of the data sets (Figs. 3a and 3d). By contrast, when data were drawn from a mixture of two Gaussians, the fraction of data sets yielding the MaxEnt estimate decreased sharply as the separation between the two Gaussians was increased (Figs. 3b, 3c, 3e, and 3f). In a similar manner, this density estimation approach can be used to test other functional forms for the density, either by using the bilateral Laplacian of a different order, or by replacing the bilateral Laplacian with a differential operator having a kernel spanned by other functions whose expectation values one wishes to match to the data.

The method used to select the length scale, both here and in previous work Bialek et al. (1996); Nemenman and Bialek (2002), is sometimes referred to as “empirical Bayes”: one chooses the length scale that maximizes the evidence. By contrast, Kinney (2014) used a fully Bayesian approach, positing a Jeffreys prior on the length scale and then choosing the length scale that maximizes the resulting posterior. It can reasonably be argued that the empirical Bayes method adopted here is less sensible than the fully Bayesian approach. However, in the fully Bayesian approach the assumed prior on the length scale obscures the behavior of the evidence ratio at large length scales. This large-length-scale behavior is nontrivial and potentially useful.

As shown in Section V, the behavior of the evidence ratio in the large-length-scale limit is governed by the coefficient defined in Eq. (29). The sign of this coefficient can therefore be used to assess the MaxEnt hypothesis without having to compute the evidence ratio at every length scale. This suggestion is supported by the simulations shown in Fig. 3. Here, the sign of the coefficient (positive or negative) performed well as a proxy for whether the MaxEnt estimate was recovered (no or yes, respectively) from a full computation of the MAP curve; see Figs. 3g, 3h, and 3i (the one discordant trial reported in Fig. 3g is due to difficulties with the numerics in the large-length-scale limit). These results suggest that this coefficient, for which Eq. (29) provides an analytic expression, might allow an expedient test of the MaxEnt hypothesis when computation of the entire MAP curve is less feasible, e.g., in higher dimensions.

IX Summary and discussion

Bialek et al. Bialek et al. (1996) showed that probability density estimation can be formulated as a Bayesian inference problem using field theory priors. Among other results, Bialek et al. (1996) derived the action in Eq. (6) and showed how to use a Laplace approximation of the evidence to select the optimal smoothness length scale (see Appendix F for a discussion of earlier related work on the “maximum penalized likelihood” formulation of the density estimation problem and how it relates to the Bayesian field theory approach). However, periodic boundary conditions were imposed on candidate densities in order to transform the standard Laplacian into a Hermitian operator. The MaxEnt density estimate typically violates these boundary conditions, and as a result the ability of Bayesian field theory to subsume MaxEnt density estimation went unrecognized in Bialek et al. (1996) and in follow-up studies Nemenman and Bialek (2002); Kinney (2014).

Here we have seen that boundary conditions on candidate densities are unnecessary. The bilateral Laplacian, defined in Eq. (11), is a Hermitian operator that imposes no boundary conditions on functions in its domain, yet is equivalent to the standard Laplacian in the interior of the interval of interest. Using the bilateral Laplacian of various orders to define field theory priors, we recovered standard MaxEnt density estimates in cases where the smoothness length scale was infinite. We also obtained a novel criterion for judging the appropriateness of the MaxEnt hypothesis on individual data sets.

More generally, Bayesian field theories can be constructed for any set of moment-matching constraints. One need only replace the bilateral Laplacian in the above equations with a differential operator that has a kernel spanned by the functions whose mean values one wishes to match to the data. The resulting field theory will subsume the corresponding MaxEnt hypothesis, and thereby allow one to judge the validity of that hypothesis. Analogous approaches for estimating discrete probability distributions can be formulated by replacing integrals over the continuous variable with sums over states.

The elimination of boundary conditions removes a considerable point of concern with using Bayesian field theory for estimating probability densities. As demonstrated here and in Kinney (2014), the necessary computations are readily carried out in one dimension. One issue not investigated here – the validity of the large-data-set approximation used to compute the evidence ratio – can likely be addressed by using Feynman diagrams in the manner of Enßlin et al. (2009).

In the author’s opinion, the problem of how to choose an appropriate prior appears to be the primary issue standing in the way of a definitive practical solution to the density estimation problem in 1D. It is not hard to imagine different situations that would call for qualitatively different priors, but understanding which situations call for which priors will require further study. The author is optimistic, however, that the variety of priors needed to address most of the 1D density estimation problems typically encountered might not be that large.

This field theory approach to density estimation readily generalizes to higher dimensions – at least in principle. Additional care is required in order to construct field theories that do not produce ultraviolet divergences Bialek et al. (1996), and the best way to do this remains unclear. The need for a very large number of grid points also presents a substantial practical challenge. Grid-free methods, such as those used by Good and Gaskins (1971); Holy (1997), may provide a way forward.

Acknowledgements.
I thank Gurinder Atwal, Curtis Callan, William Bialek, and Vijay Kumar for helpful discussions. Support for this work was provided by the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory.

Appendix A Derivation of the action

Our derivation of the action in Eq. (6) of the main text is essentially that used in Kinney (2014). This derivation is not entirely straightforward, however, and its details have yet to be reported.

The prior, which is defined by the action in Eq. (3), is improper because the differential operator has a finite-dimensional kernel; see Section III. To avoid unnecessary mathematical difficulties, we can render the prior proper by considering a regularized form of the action

(43)

where the regularization parameter is an infinitesimal positive number. All quantities of interest, of course, should be evaluated in the limit in which this parameter vanishes.

The corresponding prior over the field is

(44)

Here we have decomposed the field using

(45)

where the two components are the constant Fourier component of the field and the remaining non-constant component, respectively. The likelihood of the data given the field is

(46)

Using the identity

(47)

which holds for any positive numbers, we find that the likelihood can be expressed as

(48)

The probability of the field and the data together is therefore given by

(49)

where the action is that of Eq. (6).

The “evidence” – i.e., the probability of the data given the length scale – is therefore given by

(50)

where the partition function is that of Eq. (7). The posterior distribution over densities is then given by Bayes’s rule:

(51)
(52)

This motivates us to define the posterior distribution over the field as

(53)

This definition of the field posterior is consistent with the density posterior obtained in Eq. (52), in that

(54)

However, Eq. (53) violates Bayes’s rule, since

(55)

This is not a problem, however, since the field itself is not directly observable; only the density is observable. Put another way, Eq. (53) violates Bayes’s rule only in the way that it specifies the behavior of the constant component of the field. This constant component, however, has no effect on the density.

Note that all of the above calculations have been exact; no large-data-set approximation was used. This contrasts with prior work Bialek et al. (1996); Nemenman and Bialek (2002), which used a Laplace approximation, valid only for large data sets, to derive the formula for the posterior. Also note that the regularization parameter has vanished from the formulas for the posterior distributions. However, this parameter still appears in the formula for the evidence, both explicitly and implicitly.

Appendix B Spectrum of the bilateral Laplacian

Figure 4: Spectrum of the bilateral Laplacian. (a) The first three Legendre polynomials provide an orthonormal basis for the kernel of the operator; they are plotted with decreasing opacity. (b) All other eigenfunctions are nontrivial linear combinations of sinusoidal and exponential factors. This behavior is illustrated by four such eigenfunctions, shown with decreasing opacity. (c) The rank-ordered eigenvalues of the bilateral Laplacian lie at or below those of the standard Laplacian with any choice of boundary conditions. Shown are the eigenvalues of the bilateral Laplacian (black squares), together with the eigenvalues of the standard Laplacian with periodic, Dirichlet, or Neumann boundary conditions (dark, medium, and light gray circles, respectively).

In the continuum limit, the bilateral Laplacian remains manifestly Hermitian and therefore possesses a complete orthonormal basis of eigenfunctions. We now consider the spectrum of this operator. In what follows we make use of the bra-ket notation of quantum mechanics, primarily as a notational convenience. For any two functions and any Hermitian operator, we define

(56)

Note the convention of dividing by the interval length in the inner product integral. This allows us to take inner products without altering units.

From

(57)

we see that the bilateral Laplacian is positive semi-definite. Equality in Eq. (57) obtains if and only if the function is a polynomial of degree less than the order of the operator; such polynomials therefore comprise the null space of the bilateral Laplacian.

More generally, any solution to the eigenvalue equation implies a corresponding integral identity for every test function. Integrating this by parts gives

(59)

Considering test functions whose low-order derivatives all vanish at the boundary, we see that in the bulk of the interval,

(60)

Any such eigenfunction must therefore be an eigenfunction of the standard Laplacian as well. Moreover, all boundary terms in Eq. (59) must vanish, because the values of the eigenfunction and its derivatives at the boundary are independent of its integral against any function supported in the interior. The eigenfunction must therefore have the boundary behavior

(61)

Note in particular that this behavior is exhibited by the low-order polynomials that comprise the kernel of the operator. On the other hand, an eigenfunction with a nonzero eigenvalue will consist of terms that are sinusoidal or exponential in character, with coefficients chosen so that the boundary behavior in Eq. (61) is satisfied. Typically such eigenfunctions will not be purely sinusoidal or purely exponential, but rather will exhibit a nontrivial combination of sinusoidal and exponential behavior (see Fig. 4b).

It should be emphasized that the boundary behavior of the eigenfunctions (Eq. (61)) is not a boundary condition that all functions in the domain of the bilateral Laplacian must satisfy. Specifically, although any well-behaved function can be expressed in the eigenbasis via

(62)

for some set of coefficients, one will typically find that

(63)

because the sum on the right-hand side of Eq. (63) will not be well-defined. The reason for this is that the convergence criterion for Eq. (62) is weaker than that for Eq. (63). Therefore, even though the right-hand side of Eq. (62) will converge for any particular function, the right-hand side of Eq. (63) typically will not.

The ordered eigenvalues of the bilateral Laplacian provide a lower bound for the eigenvalues of the standard Laplacian with any set of boundary conditions. To see this, note that we can define a positive semi-definite Hermitian operator whose kernel is spanned by all functions satisfying a set of specified boundary conditions. Denoting the standard Laplacian with these boundary conditions accordingly, we can express it in terms of the bilateral Laplacian as

(64)

where

(65)

In the appropriate limit, a prior defined using the operator in Eq. (64) in place of the bilateral Laplacian will restrict candidate fields to those that satisfy the specified boundary conditions.
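This eigenvalue ordering is easy to check numerically in the simplest case. The sketch below is for the lowest-order operator; imposing Dirichlet boundary conditions here amounts to restricting the matrix to interior grid points, a standard discrete construction rather than one taken from the paper.

import numpy as np

G, h = 60, 1.0
D = (np.eye(G, k=1) - np.eye(G))[:G - 1] / h
bilateral = D.T @ D                        # no boundary conditions imposed

dirichlet = bilateral[1:-1, 1:-1]          # standard Laplacian with the field fixed to zero at both ends

ev_bi = np.sort(np.linalg.eigvalsh(bilateral))
ev_dir = np.sort(np.linalg.eigvalsh(dirichlet))
print(bool(np.all(ev_bi[:ev_dir.size] <= ev_dir + 1e-10)))   # bilateral eigenvalues lie at or below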

From first-order perturbation theory, we know that each eigenvalue of the boundary-conditioned operator is related to the corresponding eigenvalue of the bilateral Laplacian by

(66)

Therefore, the