An important problem in statistics and applied mathematics is the solution to a so-called Fredholm equation of the first kind
, i.e., given a probability density functionon and a non-negative kernel which is a probability density function in for each , find the probability density function such that
Such equations have a wide range of applications across a variety of fields, including signal processing, physics, and statistics. See Ramm (2005) for a general introduction, and Vangel (1992) for statistical applications. We will also consider problems where the density and non-negativity constraints are relaxed.
Given the importance of this problem, it should be no surprise that there is a substantial literature on the theoretical and computational aspects of solving (1); see, for example, Corduneanu (1994); Groetsch (2007); Hansen (1999); Morozov (1984); Wing (1990). Existence of a unique solution to (1) is a relevant question, though we will not speak directly on this issue here. One important case for which there is a well-developed existence theory is when the operator is compact, a consequence of square-integrability of with respect to
. In this case, there exists a singular value systemfor which and are orthogonal bases in and , respectively, and
Then can be expressed as for some , and it follows that
which exists and is square-integrable if . Of course, the above formula for
The classic iterative algorithm for finding is given by
Convergence properties of (2) are detailed in Landweber (1951). An issue here is that, under the constraint that and are density functions, there is no guarantee that the sequence of approximations coming from (2) are density functions.
An alternative is to apply a discretization–based method. That is, specify grid points and approximate the original problem (1) via a discrete system
where the are weighting coefficients for a quadrature formula. One then solves the linear system of equations to get an approximation for . See for example, Phillips (1962). Of course, there is no guarantee here either that the solution will be a density.
As an alternative to the additive updates in (2), in this paper we focus on a multiplicative version, described in Section 2, that can guarantee the sequence of approximations are density functions. Moreover, the theoretical convergence analysis of this multiplicative algorithm turns out to be rather straightforward for the case where there exists a unique solution to equation (1). For the general case, where a solution may not exist, the asymptotic behavior of the algorithm can still be determined, and, in Section 3, we provide a characterization of this algorithm as an alternating minimization algorithm, and employ the powerful tools developed in Csiszzár and Tusnády (1984) to study its convergence. The remainder of the paper focuses on three applications of this algorithm. The first, in Section 4, is estimating a smooth mixing density based on samples from the density in (1). The second, in Section 5, is computing the density of the first passage time for Brownian motion hitting an upper boundary, where the boundary need not be concave. The third, in Section 6, is solving general Fredholm equations where the density function constraints are relaxed. Finally, some concluding remarks are given in Section 7.
2 The algorithm
This algorithm was also studied in Shyamalkumr (1996). First note that, if is well-defined, then it follows immediately from the multiplicative structure of the algorithm, and Fubini’s theorem, that is also a density for . This overcomes the difficulty faced with using (2). Second, for a comparison with (2), we see that (3) operates in roughly the same way, but on the log–scale:
This logarithmic version will be helpful when we discuss convergence in Section 3.
The sequence (3) also has a Bayesian interpretation. Indeed, if, at iteration , we had seen an observation from the distribution with density , then the Bayes update of “prior” to “posterior” would be given by
But without such an , yet knowing comes from , the natural choice now is to use the “average update”
which is exactly (3).
For a quick proof–of–concept, consider a Pareto density , supported on , with . It is easy to check that is a gamma mixture of exponentials, i.e., , where is an exponential density and is a gamma density. We can apply algorithm (3) to solve the above equation for . Figure 1 shows a plot of the estimate from (3), with , based on 200 iterations and a half-Cauchy starting density, along with the true gamma density . Clearly, the approximation is quite accurate over most of the support.
Before formally addressing the convergence properties of (3), it will help to provide some intuition as to why it should work. The argument presented in Vardi and Lee (1993) proceeds by considering i.i.d. samples from the distribution with density . Replacing the in (3) with , where is the empirical distribution based on , gives the algorithm
It turns out that this is precisely an EM algorithm to compute the nonparametric maximum likelihood estimator of , see e.g., Laird (1978). This connection to likelihood–based estimation gives the algorithm (5) some justification for a fixed sample . The dependence on a particular sample can be removed by allowing
and applying the law of large numbers to get convergence of (5) to (3), hence motivation for the latter. However, despite the identification of (5) as an EM algorithm, no formal convergence proof has been given that the iterates in (5) converge to the nonparametric maximum likelihood estimator, but see Chae et al. (2017).
The above argument giving intuitive support for algorithm (3) is not fully satisfactory because it does not give any indication that the algorithm will converge to a solution of (1). For a more satisfactory argument, note that, if the algorithm converges to a limit , then we must have
is the Kullback–Leibler divergence and the infimum is over all densities of the formfor some probability measure on . So, if there exists a solution to (1), then the limit would have to be one of them. Even if there is no solution to (1), will be such that the corresponding is closest to in the Kullback–Leibler sense. To make this argument fully rigorous, we need to establish that algorithm (3) does indeed converge.
3 Convergence properties
In Section 3.4 of Vardi and Lee (1993), the authors consider (1), but with some restrictions. The first is that is a finite measure, i.e., point masses on a finite number of atoms, and the second is that the solution is piecewise constant. Our arguments here do not require such restrictions.
By Jensen’s inequality, the last term is lower bounded by , the Kullback–Leibler divergence of from . Since is non–negative, we can deduce that
This implies that is a non–negative and non–increasing sequence, hence has a limit, say, , which, in turn, implies strongly in the or sense. Therefore, in agreement with Landweber (1951) for the additive algorithm (2), we have that is converging to a set where for all . Of course, if is a unique solution to (1) then strongly. A similar argument for convergence of (3) is presented in Shyamalkumr (1996), with some generalizations.
On the other hand, if (1) does not have a solution, then, as discussed above, we expect that in algorithm (3) will converge to a density with limit such that the corresponding mixture satisfies (6). This can be proved by considering (3) as an alternating minimization procedure; see, for example, Csiszár (1975); Csiszzár and Tusnády (1984); Dykstra (1985).
In the following, define the joint densities on ,
For an initial solution , let be obtained via (3). Assume there exists a sequence of densities such that
and for some , where and
Then, decreases to .
Proof. Let be the set of all bivariate densities on with -marginal , i.e., such that . Similarly, let be the set of all bivariate densities on such that for some density . Note that and in (7) satisfy and .
We first claim that and are obtained by the alternating minimization procedure of Csiszzár and Tusnády (1984) with the objective function , where and ranges over and , respectively. To see this, note that for every ; the inequality holding since and are marginal densities of and , respectively. It follows that . Also, note that
where does not depend on . The last integral is maximized at , so . Since and are convex, we have
by Theorem 3 in Csiszzár and Tusnády (1984), with . We have , and and the assumption for some implies that . Since , we conclude that . ∎
Two remarks are in order. First, the integrability condition for some is not especially strong. For example, assume there exists which minimizes and has a density . Such a minimizer exists under a mild identifiability condition; see Lemma 3.1 of Kleijn and van der Vaart (2006). Then for some implies the required integrability condition, where
Second, if is compact and the minimizer is unique, every subsequence of has a further subsequence weakly converging to , so converges to .
It is also worth noting that the monotonicity property of from the theorem can be used to define a stopping rule. For example, one could terminate algorithm (3) when itself and/or the difference falls below a certain user–specified tolerance.
4 Smooth mixing density estimation
Suppose we observe as i.i.d. from a distribution with density as in (1), and we wish to estimate a smooth version of . The nonparametric maximum likelihood estimator of is known to be discrete, so is not fully satisfactory for our purposes. One idea for a smooth estimate of is to maximize a penalized likelihood. The proposal in Liu et al. (2009) is to define a penalized log–likelihood function
where is a smoothing parameter, and determines the mixing density
The right-most integral in the expression for measures the curvature of , so maximizing will encourage solutions which are “less curved,” i.e., more smooth. An EM algorithm is available to produce an estimate of and, hence, of . However, each iteration requires solving a non–trivial functional differential equation.
Here we propose a more direct approach, namely, to use algorithm (3) with the true density
replaced by, say, a kernel density estimate of. For some , define the kernel estimate
where is the standard normal density function. The bandwidth can be selected in a variety of ways. One option is the traditional approach (Sheather and Jones (1991)) of minimizing asymptotic mean integrated square error, which gives of order . Another idea is to get the nonparametric maximum likelihood estimator of , define the corresponding mixture
and then choose the bandwidth , where is, say, the distance. Finally, our proposal is to estimate by running algorithm (3) with replaced by . Since is smooth, so too will be our estimate of .
As a quick real–data illustration, consider the well–known galaxy data set, available in the MASS package in R (R Core Team (2015)). We estimate the mixture density with the Gaussian kernel method described above, using the default settings in density. Then we estimate the mixing density via the procedure just described above. In this case, we used 25 iterations of (3) and the results are displayed in Figure 2. Panel (a) shows from (3), and Panel (b) shows the data histogram, the kernel density estimator (solid), and the mixture corresponding to . Both densities in Panel (b) fit the data well, and the fact that the two are virtually indistinguishable suggests that the density in Panel (a) is indeed a solution to the mixture equation.
Next, for further investigation into the performance of the proposed mixing density estimation procedure, we present some examples with simulated data. We start by considering a deconvolution problem, where is the normal density with mean. Two mixing densities on the unit interval [0,1] are illustrated;
With a sample of size , we first obtained a kernel density estimator , using the Gaussian kernel, with bandwidth as in Sheather and Jones (1991), and then ran (3). As suggested above, monotonicity of suggests a stopping rule, and we terminated the algorithm when . The number of iterations used were for and for . The estimates of the mixing and corresponding mixture densities are depicted in Figure 3. As can be seen in the right figures, the and are indistinguishable; that is, are effectively zero after a few iterations.
Next, we consider scale mixtures of normal distributions where
is the centered normal density with variance. For the mixing density, two well–known distributions on the positive real line, the inverse-gamma and gamma densities, are considered, i.e.,
With a sample of size , the estimator of the mixing density is obtained as previously. The number of iterations are for and for , and the estimates of and are illustrated in Figure 4. Although the density estimates, and , are close to the true , there are deviations between and , in particular for the inverse–gamma case. This is mainly due to the ill–posedness of the problem.
5 First passage time for Brownian motion
The second example is computing the stopping time density for Brownian motion hitting a boundary function. For example, since Breiman (1966) there has been substantial interest in the first passage time distribution of Brownian motion passing a square root boundary. To set the scene, denote as standard Brownian motion started at 0, let be a continuous function with , the boundary function, and define
Then Peskir et al. (2002) provides an equation satisfied by the density of , in particular,
are the probability measures/cumulative distribution functions corresponding toand , respectively. This leads to a Volterra equation and Peskir et al. (2002) demonstrates solutions when is constant and linear. When is increasing and concave and then Peskir et al. (2002) proves that
Clearly meets the requirements for this result.
On the other hand, we can solve using (3) after setting up a suitable Fredholm equation for increasing and bounded by , but not necessarily concave. Define, for any , the martingale
With with , and , we have that is a martingale and, from the optional stopping theorem, that .
Now it is easy to show that , which is bounded and, hence, from the dominated convergence theorem,
Effectively, we are also using the result that almost surely. Hence,
which after some algebra yields the Fredholm equation
where is the normal density with mean and variance , constrained on , and
constructs a Fredholm equation using Girsanov’s theorem, as we do, this is quite different and each equation depends on the type of boundary, whereas ours is always based on the half–normal and exponential distributions.
Here we demonstrate the algorithm for solving the Fredholm equation for the hitting time density with a square root boundary; specifically with and . We start with a grid of points , where and . The starting density is and at each iteration we evaluate . Due to the ease of sampling from an exponential density, we evaluate
using Monte Carlo methods, i.e.,
where the are iid from the exponential distribution with mean , and we take . The is evaluated using the trapezoidal rule using the values, so
where . Consequently, we have numerically,
6 General Fredholm equation
Algorithm (3) can be applied to solve general Fredholm equations of the first kind, where and are not necessarily probability density functions. Assume first that and in (1) are non-negative functions, but not necessarily probability densities. Then equation (1) can be rewritten as
where and . Thus, the update (3) can be replaced with
which was also considered in Vardi and Lee (1993). Here it is assumed that and are both finite.
Next, assume that the solution and may not be necessarily non-negative functions. In this case, instead of the original equation (1), we solve an equivalent non-negative equation
where , , and is a constant to be specified. Since is non-negative, so are both and for a sufficiently large . As illustrated below, the value of rarely affects the convergence rate in practice. Therefore, can be chosen as a very large number.
Finally, assume that is not necessarily non-negative, and write , where . This case can also be solved by transforming the original equation to a non-negative one. For convenience, assume that , then the original equation (1) can be written as
and therefore, by adding the last two display equations, we have
then we have the Fredholm equation
with a non-negative kernel, which can be solved as previously described.
Note that and may have discontinuities; but this does not cause any problem once (12) has the unique solution. If there is another solution to (12), say , and (11) is not satisfied, the restriction of on [0,1] may not be a solution of the original equation. In this case we may apply another decomposition of . In many examples, however, the simple approach (12) works well.
For illustration, consider the equation (1), where is the normal density with mean and standard deviation , and has both positive and negative components. In particular, we consider two examples
where is the density of Beta distribution. In both cases, a transformed equation (10) is solved with and the results are illustrated in Figure 6. It can be easily seen that the true solution and are nearly the same. Different values of ( with ) have been tried, and in any case, the algorithm has been stopped in 10 iterations yielding nearly the same solution.
Next, we consider a general kernel with and
where . The equations are solved by setting , and . For both examples, the algorithm stopped in 5 iterations. Results are illustrated in Figure 7.
In this paper, we focused on an algorithm for solving Fredholm integral equations of the first kind, its properties, and some applications. For the mixing density estimation application described in Section 4, we did not address the question of whether the estimate based on plugging in a kernel density estimator for in (3) would be consistent in the statistical estimation sense. The predictive recursion method of Newton (2002) can also quickly produce a smooth estimator of the mixing density, and it was shown in Tokdar et al. (2009) and Martin et al. (2009) that the estimator is consistent, but non-standard arguments are needed because of its dependence on the data order. We are optimistic that the estimator described in Section 4, through the simple formula (3) for the updates, and the well-known behavior of the kernel density estimator, can have even stronger convergence properties than those demonstrated for predictive recursion.
In Section 5 we were able to extend the class of boundary function for which the hitting time density can be solved using a novel Fredholm equation. Future work would consider upper and lower boundaries. Section 6 we were able to extend the basic algorithm, which was set up for density functions, to non–negative functions.
- Rcpp code:
It contains code to perform the iterative algorithm proposed in this paper and numerical experiments.
- Breiman (1966) Breiman, L. (1966). First exit times from a square root boundary. In Proceedings of 5 Berkeley Symposium on Mathematical Statistics & Probability, volume 2, pages 9–16.
- Chae et al. (2017) Chae, M., Martin, R., and Walker, S. G. (2017). Fast nonparametric near–maximum likelihood estimation of a mixing density. In Preparation.
- Corduneanu (1994) Corduneanu, C. (1994). Integral Equations and Applications. Cambridge University Press, Cambridge.
Csiszár, I. (1975).
-divergence geometry of probability distributions and minimization problems.Annals of Probability, 3:146–158.
- Csiszzár and Tusnády (1984) Csiszzár, I. and Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statistics & Decisions, Supplemental Issue No. 1, pages 205–237.
- Dykstra (1985) Dykstra, R. L. (1985). An iterative procedure for obtaining -projections onto the intersection of convex sets. Annals of Probability, 13:975–984.
- Groetsch (2007) Groetsch, C. (2007). Integral equations of the first kind, inverse problems and regularization. 73:1–32.
- Hansen (1999) Hansen, P. C. (1999). Linear Intergal Equations. 2nd Edition, Volume 82 of Applied Mathematical Sciences. Springer–Verlag, NY.
Kleijn and van der Vaart (2006)
Kleijn, B. J. and van der Vaart, A. W. (2006).
Misspecification in infinite-dimensional Bayesian statistics.Annals of Statistics, 34:837–877.
- Laird (1978) Laird, N. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association, 73:805–811.
- Landweber (1951) Landweber, L. (1951). An iteration formula for Fredholm integral equations of the first kind. American Journal of Mathematics, 73:615–624.
- Liu et al. (2009) Liu, L., Levine, M., and Zhu, Y. (2009). A functional EM algorithm for mixing density estimation via nonparametric penalized likelihood maximization. Journal of Computational and Graphical Statistics, 18:481–504.
- Martin et al. (2009) Martin, R., Tokdar, S. T., et al. (2009). Asymptotic properties of predictive recursion: robustness and rate of convergence. Electronic Journal of Statistics, 3:1455–1472.
- Morozov (1984) Morozov, V. A. (1984). Methods of Solving Incorrectly Posed Problems. Springer–Verlag, NY.
- Newton (2002) Newton, M. A. (2002). On a nonparametric recursive estimator of the mixing distribution. Sankhyā A, 64:306–322.
- Patilea (2001) Patilea, V. (2001). Convex models, MLE and misspecification. Annals of Statistics, 29:94–123.
- Peskir et al. (2002) Peskir, G. et al. (2002). On integral equations arising in the first-passage problem for Brownian motion. Journal Integral Equations and Applications, 14:397–423.
- Phillips (1962) Phillips, D. L. (1962). A technique for the numerical solution of certain integral equations of the first kind. Journal of the Association for Computing Machinery, 9:84–97.
- R Core Team (2015) R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
- Ramm (2005) Ramm, A. G. (2005). Inverse Problems. Springer–Verlag, NY.
- Sheather and Jones (1991) Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B, 53:683–690.
- Shyamalkumr (1996) Shyamalkumr, N. D. (1996). Cyclic projections and its applications in statistics. Technical Report #96–24, Department of Statistics, Purdue University.
- Tokdar et al. (2009) Tokdar, S. T., Martin, R., and Ghosh, J. K. (2009). Consistency of a recursive estimate of mixing distributions. Annals of Statistics, 37:2502–2522.
- Valov (2009) Valov, A. V. (2009). First passage times: Integral equations, randomization and analytical approximations. PhD Thesis, Department of Statistics, University of Toronto.
- Vangel (1992) Vangel, M. (1992). Iterative algorithms for integral equations of the first kind with applications to statistics. Technical report, PHD Thesis, Harvard University. Technical Report ONR-C-12.
- Vardi and Lee (1993) Vardi, Y. and Lee, D. (1993). From image deblurring to optimal investments: Maximum likelihood solutions for positive linear inverse problems. Journal of the Royal Statistical Society, Series B, 55:569–612.
- Wing (1990) Wing, G. M. (1990). A Primer on Integral Equations of the First Kind: The Problem of Deconvolution and Unfolding. SIAM, Philadelphia.