# Wave function representation of probability distributions

Orthogonal decomposition of the square root of a probability density function in the Hermite basis is a useful low-dimensional parameterization of continuous probability distributions over the reals. This representation is formally similar to the representation of quantum mechanical states as wave functions, whose squared modulus is a probability density.

## Authors

Madeleine B. Thompson


## 1 Motivation

Empirical Data Platform (EDP) is a cloud platform for data analysis. For modeling regression residuals, it requires a simple, natural, low-dimensional family of smooth probability distributions over the reals. This family must be able to represent a wide variety of real-world data without tuning or customization.

With one free parameter, the natural distribution is a Dirac delta. With two free parameters, the natural distribution is a Gaussian. With three or more free parameters, I know of no consensus. The Pearson distribution[12, §20, p. 381] is one general family, but it does not have a clean parameterization. Johnson's distribution[6] and the generalized lambda distribution[13] are both four-parameter families; they have more flexibility than normal distributions do, but they do not support the multimodal distributions necessary for modeling residuals in EDP. Fleishman[2] uses a four-term polynomial sum of Gaussians, which does not allow for asymmetric distributions. Gentle[3, p. 193–196] has further discussion.

So, we still require a class of distributions that can handle skewed data with several modes, preferably without the complexity and computational cost that come with nonparametric methods like kernel density estimation.

## 2 Details

Suppose we have a random variable $X$ with a density function $f$. Assume $X$ has been transformed so that its mean is approximately zero and its variance is approximately one-half.

*Figure 1: The first several basis functions defined by equation 1. Only $h_0$ is always positive, so while these can be used as a basis for the square root of a density, they can't be used as the basis for an untransformed density.*

Let's find a function $\psi$ such that $f = \psi^2$, where $\psi$ is a sum of orthogonal components. In a quantum mechanics context, $\psi$ would be an amplitude. Here, we use the Hermite basis, since it makes Gaussians come out nicely. Let $H_n$ be the physicists' Hermite polynomial of degree $n$.[1, table 22.12, p. 801] Define $h_n$ as:

$$h_n(x) = \frac{H_n(x)\,e^{-x^2/2}}{\left(\sqrt{\pi}\,2^n\,n!\right)^{1/2}} \qquad (1)$$

The first several of these are shown in figure 1. This normalization of the Hermite polynomials is orthonormal:

$$\int_{-\infty}^{\infty} h_n(x)\,h_m(x)\,dx = \delta_{nm} \qquad (2)$$
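These basis functions are easy to evaluate with standard numerical libraries, and the orthonormality in equation 2 can be checked numerically. A sketch in Python with NumPy (the helper name `hermite_fn` is mine, not from the paper):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermgauss, hermval

def hermite_fn(n, x):
    """h_n(x) = H_n(x) exp(-x^2/2) / (sqrt(pi) 2^n n!)^(1/2), equation 1."""
    coef = np.zeros(n + 1)
    coef[n] = 1.0  # picks out the physicists' Hermite polynomial H_n
    norm = np.sqrt(np.sqrt(np.pi) * 2.0**n * factorial(n))
    return hermval(x, coef) * np.exp(-x**2 / 2.0) / norm

# h_n(x) h_m(x) = e^{-x^2} * (polynomial), so Gauss-Hermite quadrature
# integrates it exactly given enough nodes. The Gram matrix of the first
# six basis functions should be the identity (equation 2):
nodes, wts = hermgauss(32)
gram = np.array([[np.sum(wts * np.exp(nodes**2)
                         * hermite_fn(n, nodes) * hermite_fn(m, nodes))
                  for m in range(6)] for n in range(6)])
```

Note the `np.exp(nodes**2)` factor: Gauss–Hermite quadrature integrates against the weight $e^{-x^2}$, so the weight already present in $h_n h_m$ must be divided back out.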

Pick a maximum degree $K$. (EDP uses a fixed default.) Define a vector $w \in \mathbb{R}^{K+1}$ by:

$$w_k = \int_{-\infty}^{\infty} h_k(x)\,\sqrt{f(x)}\,dx \qquad (3)$$
$$\text{where } k \in \{0,\dots,K\} \qquad (4)$$

Observe that the square root of the $N(0, \tfrac{1}{2})$ density is:

 √f(x) = ⎷1√2π(1/2)e−x2(2)(1/2) (5) =1√√πe−x2/2 (6) =h0(x) (7)

So, $N(0, \tfrac{1}{2})$ is represented exactly by $w = (1, 0, \dots, 0)$.

Next, define a degree-$K$ approximation $\hat{f}$ to $f$:

$$\hat{f}(x) = \left(\sum_{k=0}^{K} w_k\,h_k(x)\right)^2 \qquad (8)$$

$f$ is a density, so it must be in $L^1$. Therefore, $\sqrt{f}$ is in $L^2$. Since the Hermite basis is complete for $L^2(\mathbb{R})$, $\hat{f}$ converges in $L^1$ to $f$ as $K \to \infty$. If $f$ is symmetric, then $w_k$ is zero for odd $k$. The sum $\sum_{k=0}^{K} w_k^2$ converges to one as $K \to \infty$; the partial sum can be used as a check of how well $\hat{f}$ matches $f$.

Now consider the quantum harmonic oscillator with its quadratic potential. The coefficients $w_k$ are the amplitudes in the Hamiltonian eigenbasis for the state whose initial wave function is $\sqrt{f}$.[5, p. 56]

## 3 Estimating the coefficients

The formula for $w_k$, equation 3, contains an integral. If we knew $f$, it would be easy to compute this numerically using adaptive Simpson's method[8, ch. 6] or Gauss–Hermite quadrature.[1, eqn. 25.4.46]
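For a known density, the quadrature route can be sketched as follows (Python; `coefficients` and `hermite_fn` are illustrative names, not EDP's API):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermgauss, hermval

def hermite_fn(k, x):
    """h_k(x) from equation 1."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    norm = np.sqrt(np.sqrt(np.pi) * 2.0**k * factorial(k))
    return hermval(x, coef) * np.exp(-x**2 / 2.0) / norm

def coefficients(sqrt_f, K, n_nodes=64):
    """w_k = integral of h_k(x) sqrt(f(x)) dx, equation 3.

    Gauss-Hermite quadrature integrates against the weight e^{-x^2},
    so the integrand is rewritten as e^{-x^2} * [h_k(x) sqrt(f(x)) e^{x^2}].
    """
    nodes, wts = hermgauss(n_nodes)
    g = sqrt_f(nodes) * np.exp(nodes**2)
    return np.array([np.sum(wts * hermite_fn(k, nodes) * g)
                     for k in range(K + 1)])

# Sanity check: for f = N(0, 1/2), sqrt(f) = h_0, so w should be (1, 0, 0, ...).
w = coefficients(lambda x: np.exp(-x**2 / 2.0) / np.pi**0.25, K=4)
```

For densities whose square root decays much faster than $e^{-x^2/2}$, the rewritten integrand can misbehave at the outermost nodes, so the node count deserves some care.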

However, we would like to model regression residuals, so we need to be able to fit $w$ from an i.i.d. sample. We can do this by computing the MLE of $w$ on a transformed space. Consider the unit $K$-sphere in $\mathbb{R}^{K+1}$ (noting the zero-based indices):

$$w_0^2 + \cdots + w_K^2 = 1 \qquad (9)$$

This is the set of admissible $w$. Outside this sphere, the probability density would integrate to more than one. For any $w$ on the surface of the sphere, consider the line through $w$ and the pole $(1, 0, \dots, 0)$. Define $P$ to be the mapping from $w$ to the point on that line where the line intersects the plane whose first coordinate is zero. This is a stereographic projection of $w$.[14, ch. 2] By dropping the coordinate with index 0 (which is always zero), we have transformed $w$, which is constrained to be on the surface of a sphere, to $\gamma = P(w) \in \mathbb{R}^K$, which has one fewer dimension but is unconstrained. Every value of $\gamma$ on this plane except for the origin maps to a unique $w$ on the unit sphere. The pre-image of the origin could be considered to be either $(1, 0, \dots, 0)$ or $(-1, 0, \dots, 0)$. I choose $(-1, 0, \dots, 0)$ for continuity, though they both lead to the same probability density.

This is the algebraic form of the transform:

$$[P(w)]_k = \frac{w_k}{1 - w_0} \qquad (10)$$
$$k \in \{1,\dots,K\} \qquad (11)$$

This is its inverse:

$$[P^{-1}(\gamma)]_k = \begin{cases} \dfrac{S^2 - 1}{S^2 + 1} & \text{if } k = 0 \\[1.5ex] \dfrac{2\gamma_k}{S^2 + 1} & \text{if } 1 \le k \le K \end{cases} \qquad (12)$$
$$\text{where } S^2 = \sum_{k=1}^{K} \gamma_k^2 \qquad (13)$$
$$\text{and } k \in \{0,\dots,K\} \qquad (14)$$
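In code, the projection and its inverse are a few lines each (a sketch; the function names are mine):

```python
import numpy as np

def p_forward(w):
    """Equation 10: stereographic projection of w (a point on the unit
    K-sphere) from the pole (1, 0, ..., 0), with coordinate 0 dropped."""
    return w[1:] / (1.0 - w[0])

def p_inverse(gamma):
    """Equation 12: map an unconstrained gamma back onto the unit sphere."""
    s2 = np.dot(gamma, gamma)  # equation 13
    return np.concatenate(([s2 - 1.0], 2.0 * gamma)) / (s2 + 1.0)
```

A round trip `p_inverse(p_forward(w))` recovers any `w` on the sphere other than the pole itself, and `p_inverse` always lands exactly on the sphere, so an optimizer can roam freely over all of $\mathbb{R}^K$.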

With these definitions, given an i.i.d. sample $x_1, \dots, x_n$, the log likelihood at $\gamma$ is:

$$\ell(\gamma) = \sum_{i=1}^{n} \log\left(\sum_{k=0}^{K} [P^{-1}(\gamma)]_k\,h_k(x_i)\right)^{2} \qquad (15)$$

I have found L-BFGS[7], as implemented by libLBFGS[11], to work well for finding the MLE of $\gamma$, even for sample sizes as small as two. I expect that Gibbs sampling or Hamiltonian Monte Carlo[10] would work well for drawing $\gamma$ from a Bayesian posterior.
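A minimal end-to-end fit can be sketched with SciPy's L-BFGS-B standing in for libLBFGS (all function names below are illustrative, not EDP's):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermval
from scipy.optimize import minimize

def hermite_fn(k, x):
    """h_k(x) from equation 1."""
    coef = np.zeros(k + 1)
    coef[k] = 1.0
    norm = np.sqrt(np.sqrt(np.pi) * 2.0**k * factorial(k))
    return hermval(x, coef) * np.exp(-x**2 / 2.0) / norm

def p_inverse(gamma):
    """Equation 12: map unconstrained gamma onto the unit sphere."""
    s2 = np.dot(gamma, gamma)
    return np.concatenate(([s2 - 1.0], 2.0 * gamma)) / (s2 + 1.0)

def neg_log_likelihood(gamma, x):
    """Negative of equation 15."""
    w = p_inverse(gamma)
    amp = sum(w[k] * hermite_fn(k, x) for k in range(len(w)))
    return -np.sum(np.log(np.maximum(amp**2, 1e-300)))  # guard log(0)

def fit_mle(x, K=4):
    """MLE of w from an i.i.d. sample, assumed already normalized to
    mean ~0 and variance ~1/2. Returns the coefficient vector w."""
    res = minimize(neg_log_likelihood, np.zeros(K), args=(x,),
                   method="L-BFGS-B")
    return p_inverse(res.x)
```

For a sample drawn from $N(0, \tfrac{1}{2})$, the fit concentrates nearly all weight on $h_0$, i.e. $w_0^2$ near one. The overall sign of $w$ is unidentified, since only its square enters the density.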

## 4 Moments, entropy, and sampling

Because the density $\hat{f}$ is a polynomial times $e^{-x^2}$, Gauss–Hermite quadrature[1, eqn. 25.4.46] can be used to exactly (up to floating point roundoff) compute any moment in a number of primitive floating point operations that depends only on $K$ (no expensive transcendental functions). Non-polynomial integrands, such as the entropy integral $-\int \hat{f}(x) \log \hat{f}(x)\,dx$, are not exact when computed this way, but this method is still more accurate than Monte Carlo estimation for a comparable amount of computation time.
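The moment computation can be sketched like so (Python; the node count follows from Gauss–Hermite quadrature's polynomial exactness, and the function name is mine):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermgauss, hermval

def moment(w, m):
    """E[X^m] under f-hat = (sum_k w_k h_k)^2, equation 8.

    f-hat(x) = e^{-x^2} P(x)^2 for a degree-K polynomial P, so the
    integrand x^m e^{-x^2} P(x)^2 is integrated exactly by
    Gauss-Hermite quadrature with at least (m + 2K)/2 + 1 nodes."""
    K = len(w) - 1
    nodes, wts = hermgauss((m + 2 * K) // 2 + 1)
    P = np.zeros_like(nodes)
    for k in range(K + 1):
        coef = np.zeros(k + 1)
        coef[k] = 1.0  # physicists' Hermite polynomial H_k
        norm = np.sqrt(np.sqrt(np.pi) * 2.0**k * factorial(k))
        P += w[k] * hermval(nodes, coef) / norm
    return np.sum(wts * nodes**m * P**2)
```

For $w = (1, 0, \dots, 0)$, i.e. the $N(0, \tfrac{1}{2})$ density, this gives 1 for $m = 0$, 0 for $m = 1$, and $\tfrac{1}{2}$ for $m = 2$.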

EDP is currently using a univariate slice sampler[9] to sample from $\hat{f}$. Since we normalize $X$ to have variance one-half before fitting $w$, EDP can use a fixed initial slice width, 4.0. Because the derivative of the density estimate (equation 8) is simple, I plan to switch to Adaptive Rejection Metropolis Sampling (ARMS)[4] if EDP ever needs a faster sampler.
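Such a sampler can be sketched as follows (Python; a stepping-out slice sampler in the style of Neal[9], with the fixed initial width 4.0 mentioned above; this is an illustration, not EDP's implementation):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermval

def density(w, x):
    """f-hat(x) = (sum_k w_k h_k(x))^2, equation 8."""
    amp = 0.0
    for k in range(len(w)):
        coef = np.zeros(k + 1)
        coef[k] = 1.0
        norm = np.sqrt(np.sqrt(np.pi) * 2.0**k * factorial(k))
        amp += w[k] * hermval(x, coef) * np.exp(-x**2 / 2.0) / norm
    return amp**2

def slice_sample(w, n, width=4.0, x0=0.0, seed=None):
    """Univariate slice sampling with stepping out and shrinkage."""
    rng = np.random.default_rng(seed)
    xs, x = [], x0
    for _ in range(n):
        y = rng.uniform(0.0, density(w, x))  # auxiliary height under f-hat
        left = x - rng.uniform(0.0, width)   # randomly positioned interval
        right = left + width
        while density(w, left) > y:          # step out to cover the slice
            left -= width
        while density(w, right) > y:
            right += width
        while True:                          # sample, shrinking on rejection
            x1 = rng.uniform(left, right)
            if density(w, x1) > y:
                x = x1
                break
            if x1 < x:
                left = x1
            else:
                right = x1
        xs.append(x)
    return np.array(xs)
```

With $w = (1, 0, \dots, 0)$ this draws from $N(0, \tfrac{1}{2})$, so the sample mean and variance should come out near 0 and 0.5.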

## 5 Examples

Figures 2 through 5 are examples of distributions I used for testing the match between the true density and the wave function representation. In each figure, the left plot shows the amplitude, and the right plot shows its square, the density. The dashed lines are the true amplitude $\sqrt{f}$ and density $f$, and the solid lines are the amplitude estimate and the density estimate obtained by computing the MLE of $\gamma$.

*Figure 2: This Student t with four degrees of freedom does not have moments of order greater than 3, but there is no issue fitting the tenth-degree Hermite expansion. Observe that since the MLE optimizes fit in squared amplitudes, most discrepancy between $\sqrt{f(x)}$ and the amplitude estimate in the left plot is in places where both are close to zero.*

*Figure 3: The uniform distribution is difficult to fit because its density is discontinuous. Notice that even with a degree-40 representation and a sample size of 2500, the fit is not good. We observe the Gibbs phenomenon, just as we would in the more familiar case of a Fourier expansion of a square wave. The fit does improve with increasing degree (in the sense of total variation distance), but the normalization constant in $h_k$ (equation 1) is $O(\sqrt{2^k k!})$, which causes floating point errors for $K$ larger than around 20.*

*Figure 4: This plot of a bimodal normal mixture shows the limitations of low-degree polynomials in representing separated modes. Raising the degree from 10 to 12 or 14 substantially improves the fit. Larger sample sizes fail to make a large difference, except to pin down the exact relative heights of the modes.*

*Figure 5: This Beta(3,5) distribution demonstrates that the wave function method can fit skewed distributions.*

## 6 Discussion

The wave function representation of continuous probability densities is a practical solution to the need for a general class of well-behaved probability densities. It can represent any smooth density, yet is resistant to over-fitting. Unlike, for example, kernel density estimation, it involves no tuning parameters. Also, unlike kernel density estimation, it has nicely shaped tails. Coefficients can be fit quickly with off-the-shelf methods. As a result, the wave function representation has been effective in a production data analysis system for modeling a wide variety of user-uploaded data.

This paper discusses only unconditional densities with support on the real line. The extension to densities on different spaces, using different bases, is straightforward. For example, the Legendre polynomials[1, ch. 8, p. 333] could be used in place of the Hermite polynomials for modeling distributions on finite closed intervals of the real line. Relatedly, I wonder whether the solutions to Schrödinger's equation for simple potentials other than the harmonic oscillator's could yield classes of probability distributions useful outside quantum physics. Finally, I hope to explore the possibility of fitting conditional models, where the coefficients are functions of a predictor, for heteroskedastic regressions.

R code for fitting wave functions to distributions is available in the “wavefunction” package on CRAN.

## References

•  Milton Abramowitz and Irene Stegun, editors. Handbook of Mathematical Functions, tenth printing. National Bureau of Standards, 1972.
•  Allen Fleishman. A method for simulating non-normal distributions. Psychometrika, 43(4):521–532, 1978.
•  James Gentle. Random Number Generation and Monte Carlo Methods, 2nd ed. Springer, 2003.
•  W. R. Gilks, N. G. Best, and K. K. C. Tan. Adaptive rejection Metropolis sampling within Gibbs sampling. Journal of the Royal Statistical Society, Series C, 44(4):455–472, 1995.
•  David Griffiths. Introduction to Quantum Mechanics, 2nd ed. Pearson Prentice Hall, 2004.
•  N. L. Johnson. Systems of frequency curves generated by methods of translation. Biometrika, 36(1/2):149–176, 1949.
•  Dong Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989.
•  Webb Miller. The Engineering of Numerical Software. Prentice–Hall, 1984.
•  Radford Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.
•  Radford Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall / CRC, 2011. URL: https://arxiv.org/abs/1206.1901.
•  Naoaki Okazaki. libLBFGS: a library of limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS). Accessed Dec. 11, 2017.
•  Karl Pearson. X. Contributions to the mathematical theory of evolution.—II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 186:343–414, 1895.
•  John Ramberg and Bruce Schmeiser. An approximate method for generating asymmetric random variables. Communications of the ACM, 17(2):78–92, 1974.
•  Michael Spivak. A Comprehensive Introduction to Differential Geometry, Vol. 1, 3rd ed. Publish or Perish, 1999.
•  Madeleine B. Thompson. wavefunction: Wave Function Representation of Real Distributions, 2018. R package version 1.0.0.
•  Wikipedia. Stereographic projection: Other formulations and generalizations—Wikipedia, the free encyclopedia. Accessed Dec. 11, 2017.