1 Introduction
The problem of estimating a probability density function (PDF) associated with a given set of samples is of major relevance in a variety of mathematical and statistical applications; see, e.g., [1, 5, 6, 9, 11, 13, 18, 25, 29]. Histograms are perhaps the most popular means used in practice for this purpose. A histogram is a piecewise-constant approximation of an unknown PDF that is based on the subdivision of the sample space into subdomains that are commonly referred to as bins. For simplicity, consider the case in which all bins have equal volume. The value of the histogram on any bin is given by the number of samples that lie within that bin along with a global scaling applied to ensure that the histogram has unitary integral. As an example, assume that a bounded one-dimensional sample domain $\Gamma$ has been discretized into a set of nonoverlapping, covering bins (i.e., intervals) $\{B_j\}_{j=1}^J$ of length $\delta$. Assume also that one has in hand $N$ samples $y_1, \ldots, y_N$ of an unknown PDF $f$. For every $y \in B_j$, the value of the histogram function $f_{\rm hist}$ on $B_j$ is given by
(1)  $\displaystyle f_{\rm hist}(y) = \frac{1}{N\delta} \sum_{i=1}^{N} \mathbb{1}_{B_j}(y_i) \qquad \text{for } y \in B_j,$
where $\mathbb{1}_{B_j}$ denotes the indicator function for $B_j$. The above formula easily generalizes to higher-dimensional sample domains. Although the histogram is probably the easiest means for estimating a PDF with regard to implementation, it suffers from some limitations. For instance, being only a piecewise-constant approximation, it is discontinuous across bin boundaries; in addition, the use of piecewise-constant approximations severely limits the discretization accuracy one can achieve.
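As a concrete illustration of (1), the following minimal sketch (in Python with NumPy; the function name and interface are illustrative, not part of the method described here) builds an equal-width histogram estimate on an interval $[a, b]$:

```python
import numpy as np

def histogram_pdf(samples, a, b, n_bins):
    """Piecewise-constant (histogram) PDF estimate on [a, b].

    Each bin value counts the samples falling in that bin, scaled by
    1/(N * delta) so that the estimate integrates to one.
    """
    samples = np.asarray(samples, dtype=float)
    N = samples.size
    delta = (b - a) / n_bins
    # Bin index of each sample; clip so that samples at b land in the last bin.
    idx = np.clip(((samples - a) / delta).astype(int), 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    return counts / (N * delta)  # value of the histogram on each bin
```

Because the returned values are constant on each bin, summing them times the bin length recovers a unitary integral, but the estimate remains discontinuous across bin boundaries.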
An alternative method capable of overcoming the differentiability issue is kernel density estimation (KDE), for which the PDF is approximated by a sum of smooth kernel functions centered at the samples so that a desired smoothness of the approximation can be obtained [18]. Considering again a one-dimensional sample domain $\Gamma$, the KDE approximation is defined as
(2)  $\displaystyle f_{\rm KDE}(y) = \frac{1}{N\sigma} \sum_{i=1}^{N} K\Big(\frac{y - y_i}{\sigma}\Big),$
where $\sigma$, which is referred to as the bandwidth, usually governs the decay rate of the kernel $K$ as $|y - y_i|$ increases. KDE approximations have been shown to be effective in a variety of applications; see, e.g., [13, 20, 27, 28]. However, despite possessing smoothness, KDE has some limitations. For example, the accuracy of the KDE approximation depends on the choice of kernel, and the choice of the bandwidth also strongly affects the accuracy [17, 26]. More importantly, the KDE method does not scale well with the size of the sample data set, i.e., for a given $y$, the evaluation of the KDE approximation (2) requires $N$ kernel evaluations, so evaluating (2) clearly becomes more expensive as $N$ grows. The method based on spline smoothing given in [16] has features that are similar to our approach and is also scalable with respect to the sample size. Our method presents several advantages compared to that of [16]. First, to determine the coefficients, the solution of a linear system is not required. Second, our method does not involve a smoothing factor, so there is no need to tune the method with respect to such a factor. Third, no special constraints have to be introduced to ensure positivity of the PDF approximation. We note that in [23], sparse-grid sampling is substituted for the Monte Carlo sampling in the method of [16], allowing for consideration of larger dimensions. Of course, sparse-grid sampling can be substituted for Monte Carlo sampling in our approach as well.
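To make the per-evaluation cost of (2) explicit, here is a small sketch of a KDE with a Gaussian kernel (one common choice; no particular kernel is prescribed above). Note that every evaluation point requires a sum over all $N$ samples:

```python
import numpy as np

def kde(samples, y, bandwidth):
    """Gaussian-kernel KDE of Eq. (2), evaluated at the points y.

    Each evaluation point costs a sum over all N samples, which is the
    O(N)-per-evaluation expense noted in the text.
    """
    samples = np.asarray(samples, dtype=float)
    y = np.atleast_1d(np.asarray(y, dtype=float))
    # Pairwise scaled distances (len(y) x N), then the kernel sum.
    u = (y[:, None] - samples[None, :]) / bandwidth
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (samples.size * bandwidth)
```

The resulting estimate is as smooth as the kernel, but its accuracy depends on both the kernel and the bandwidth.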
In our approach, the unknown PDF is approximated by a piecewise-polynomial function, specifically a piecewise-linear polynomial, that is defined, as are histograms, with respect to a subdivision of the sample domain into bins. The approach is consistent in the sense that the approximate PDF converges to the exact PDF with respect to the $L^2$ norm as the number of samples tends to infinity and the volume of the bins tends to zero.
The paper is structured as follows. In Section 2, the mathematical foundation of our approach is laid out; there, it is shown analytically that the proposed approximation satisfies several requirements needed for it to be considered as a PDF estimator. A numerical investigation of the accuracy and computational costs incurred by our approach is then provided. First, in Section 3, our method is tested and validated using sample sets associated with different types of known PDFs so that exact errors can be determined. Then, in Section 4, the method is applied to the estimation of unknown PDFs of outputs of interest associated with the solution of a stochastic partial differential equation. Finally, concluding remarks and future work are discussed in Section 5.
2 Piecewise-linear polynomial approximations of PDFs
Let $\mathbf{y} = (y_1, \ldots, y_d)$ denote a multivariate random variable belonging to a closed, bounded parameter domain $\Gamma$ which, for simplicity, we assume is a polytope in $\mathbb{R}^d$. The probability density function (PDF) $f(\mathbf{y})$ corresponding to $\mathbf{y}$ is not known. However, we assume that we have in hand a data set of samples $\mathbf{y}_i$, $i = 1, \ldots, N$, of $\mathbf{y}$. The goal is to construct an estimator for $f$ using the given data set of samples. To this end, we use an approximating space that is popular in the finite element community [7, 10]. Let $\mathcal{T}_\delta$ denote a covering, nonoverlapping subdivision of the sample domain $\Gamma$ into bins $\{B_j\}_{j=1}^J$.² Here, $\delta$ parametrizes the subdivision and may be taken as, e.g., the largest diameter of any of the bins $B_j$. The bins are usually chosen to be simplices or hyperquadrilaterals; for example, if $d = 2$, they would be triangles or quadrilaterals. It is also assumed that the faces of the bins are either also complete faces of abutting bins or part of the boundary of $\Gamma$. From a practical point of view, our considerations are limited to relatively small $d$ because the number of bins grows rapidly with $d$. Detailed discussion about the subdivisions we use can be found in, e.g., [7, 10].

²In the partial differential equation (PDE) setting, what we refer to as bins are often referred to as grid cells, finite elements, or finite volumes. We instead refer to the subdomains as bins because that is the notation in common use for histograms, to which we compare our new approach. Furthermore, in Section 4, we also use finite element grids for spatial discretization of partial differential equations, so that using the term "bins" for parameter-domain subdivisions helps us differentiate between subdivisions of parameter and spatial domains. For the same reason, we use $\delta$ instead of $h$ to parametrize parameter bin sizes because $h$ is in common use to parametrize spatial grid sizes.
Let $\{\mathbf{z}_k\}_{k=1}^K$ denote the set of nodes, i.e., vertices, of $\mathcal{T}_\delta$, with $K$ denoting the number of nodes of $\mathcal{T}_\delta$. Note that we also have $K = O(\delta^{-d})$. Based on the subdivision $\mathcal{T}_\delta$, we define the space of continuous piecewise polynomials
$$V_\delta = \big\{ v \in C(\Gamma) \,:\, v|_{B_j} \in P(B_j) \ \ \forall\, B_j \in \mathcal{T}_\delta \big\},$$
where, for simplicial elements, $P(B_j)$ denotes the space of multivariate linear polynomials and, for hyperquadrilateral elements, $P(B_j)$ denotes the space of multilinear polynomials, e.g., bilinear and trilinear polynomials in two and three dimensions, respectively.
A basis $\{\phi_k\}_{k=1}^K$ for $V_\delta$ is given by, for $k = 1, \ldots, K$,
$$\phi_k(\mathbf{z}_{k'}) = \delta_{kk'}, \qquad k' = 1, \ldots, K,$$
where $\delta_{kk'}$ denotes the Kronecker delta function. In detail, $\phi_k$ denotes the continuous piecewise-linear or piecewise-multilinear Lagrangian FEM basis function corresponding to the node $\mathbf{z}_k$, i.e., we have that, for $k = 1, \ldots, K$,

– for simplicial bins, $\phi_k$ is a linear function on each bin $B_j$, $j = 1, \ldots, J$;

– for hyperquadrilateral bins, $\phi_k$ is a multilinear function on each bin $B_j$, $j = 1, \ldots, J$, e.g., for $d = 1, 2, 3$, a linear, bilinear, or trilinear function, respectively;

– $\phi_k$ is continuous on $\Gamma$;

– $\phi_k = 1$ at the $k$th node of the subdivision $\mathcal{T}_\delta$; and

– if $k' \ne k$, $\phi_k = 0$ at the $k'$th node of the subdivision $\mathcal{T}_\delta$.
For $k = 1, \ldots, K$, let $\omega_k = \operatorname{supp}(\phi_k)$ and let $|\omega_k|$ denote its volume; note that $\omega_k$ consists of the union of the bins having the node $\mathbf{z}_k$ as one of their vertices. Thus, the basis functions have compact support with respect to $\Gamma$. An illustration of the basis functions in one dimension is given in Figure 1. We further let $m_k$, for $k = 1, \ldots, K$, denote the number of bins in $\omega_k$, i.e., the number of bins that share the vertex $\mathbf{z}_k$.
Note that the approximating space $V_\delta$ and the basis $\{\phi_k\}_{k=1}^K$ are in common use for the finite element discretization of partial differential equations. Details about the geometric subdivision $\mathcal{T}_\delta$, the approximation space $V_\delta$, and the basis functions and their properties may be found in, e.g., [7, 10].
Below we make use of two properties of the basis $\{\phi_k\}_{k=1}^K$. First, we have the well-known partition-of-unity relation
(3)  $\displaystyle \sum_{k=1}^{K} \phi_k(\mathbf{y}) = 1 \qquad \forall\, \mathbf{y} \in \Gamma.$
We also have that
(4)  $\displaystyle \int_{\Gamma} \phi_k(\mathbf{y})\, d\mathbf{y} = \sum_{B_j \subset \omega_k} \int_{B_j} \phi_k(\mathbf{y})\, d\mathbf{y} > 0, \qquad k = 1, \ldots, K.$
Note that, in general, $\int_\Gamma \phi_k(\mathbf{y})\, d\mathbf{y}$ is proportional to $\delta^d$.
2.1 The piecewise-linear approximation of a PDF
Given the sample values $\{\mathbf{y}_i\}_{i=1}^N$ in $\Gamma$, we define the approximation $f_{N,\delta}$ of the unknown PDF $f$ given by
(5)  $\displaystyle f_{N,\delta}(\mathbf{y}) = \sum_{k=1}^{K} c_k \phi_k(\mathbf{y}) \qquad \text{with} \qquad c_k = \frac{\sum_{i=1}^{N} \phi_k(\mathbf{y}_i)}{N \int_{\Gamma} \phi_k(\mathbf{y})\, d\mathbf{y}}.$
Note that only the samples $\mathbf{y}_i \in \omega_k$, i.e., only the samples in the support of the basis function $\phi_k$, are used to determine $c_k$.
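A minimal one-dimensional sketch of this construction follows (Python/NumPy; the interface is illustrative), assuming a uniform grid of $n$ bins on $[a, b]$, hat basis functions, and coefficients computed as sample averages of each hat function normalized by its integral so that the estimate integrates to one:

```python
import numpy as np

def pl_density_1d(samples, a, b, n):
    """Piecewise-linear PDF estimate on [a, b] with n bins of size delta.

    Returns the grid nodes and the nodal coefficients c_k of the
    hat-function expansion, with
        c_k = sum_i phi_k(y_i) / (N * integral of phi_k).
    """
    samples = np.asarray(samples, dtype=float)
    N = samples.size
    delta = (b - a) / n
    nodes = a + delta * np.arange(n + 1)
    # Hat-function values: phi_k(y) = max(0, 1 - |y - z_k| / delta).
    phi = np.clip(1.0 - np.abs(samples[None, :] - nodes[:, None]) / delta,
                  0.0, None)
    sums = phi.sum(axis=1)
    # Integral of phi_k: delta at interior nodes, delta/2 at the two endpoints.
    w = np.full(n + 1, delta)
    w[0] = w[-1] = 0.5 * delta
    return nodes, sums / (N * w)

def pl_eval(nodes, coeffs, y):
    """Evaluate the piecewise-linear estimate at points y by interpolation."""
    return np.interp(y, nodes, coeffs)
```

Because the hat functions form a partition of unity, the trapezoidal integral of the returned nodal values is exactly one whenever all samples lie in $[a, b]$, and the coefficients are nonnegative by construction.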
Of course, the approximate PDF (5) should be a PDF in its own right. That it is indeed a PDF is shown in the following lemma.
Lemma 1
$f_{N,\delta}(\mathbf{y}) \ge 0$ for all $\mathbf{y} \in \Gamma$ and $\int_\Gamma f_{N,\delta}(\mathbf{y})\, d\mathbf{y} = 1$.
For simplicity, from now on we only focus on bins consisting of $d$-dimensional hypercubes of side $\delta$. The next lemma is useful for proving the convergence of our approximation.
Lemma 2
Let $f \in C^1(\Gamma)$ with bounded first derivatives and let $E[\phi_k]$ denote the expectation of $\phi_k(\mathbf{y})$ with respect to $f$. Then
(7)  $\displaystyle \bigg| \frac{E[\phi_k]}{\int_\Gamma \phi_k(\mathbf{y})\, d\mathbf{y}} - f(\mathbf{z}_k) \bigg| \le C \delta^{q},$
where the constant $C$ does not depend on either $\delta$ or $k$, and $q$ is a positive integer. If the second derivatives of $f$ are bounded by some positive constant, then $q = 2$. Otherwise $q = 1$.
Proof 2
Let $\chi$ be the characteristic function of $[-1, 1]$ and define $T(t) = (1 - |t|)\,\chi(t)$. Let $K(\mathbf{t}) = \prod_{j=1}^{d} T(t_j)$ be defined in the usual tensor-product fashion. Assuming $\mathbf{z}_k$ is an interior node, we have $\phi_k(\mathbf{y}) = K\big((\mathbf{y} - \mathbf{z}_k)/\delta\big)$ for all $\mathbf{y} \in \Gamma$. Then
(8)  $\displaystyle \frac{1}{N \delta^d} \sum_{i=1}^{N} K\Big( \frac{\mathbf{y}_i - \mathbf{z}_k}{\delta} \Big)$
is the value of a kernel density estimator of $f$ evaluated at $\mathbf{z}_k$, with the function $K$ as a kernel. Using a standard argument for the bias of kernel density estimators we have that
(9)  $\displaystyle \bigg| E\Big[ \frac{\phi_k}{\delta^d} \Big] - f(\mathbf{z}_k) \bigg| = \bigg| \int \big( f(\mathbf{z}_k + \delta \mathbf{t}) - f(\mathbf{z}_k) \big) K(\mathbf{t})\, d\mathbf{t} \bigg|$
(10)  $\displaystyle \le \delta \max_{\Gamma} \|\nabla f\| \int \|\mathbf{t}\|\, K(\mathbf{t})\, d\mathbf{t}.$
The above inequality proves the result for $q = 1$. Thanks to the symmetry of $K$, if the second derivatives of $f$ are bounded by some positive constant, then the first-order term in the Taylor expansion of $f(\mathbf{z}_k + \delta \mathbf{t})$ integrates to zero, hence the result follows with $q = 2$.
The next theorem shows that the approximate PDF obtained with our method converges to the exact PDF with respect to the $L^2(\Gamma)$ norm.
Theorem 1
Let $\Gamma$ be a polytope in $\mathbb{R}^d$ and let $f \in C^1(\Gamma) \cap H^2(\Gamma)$ with bounded first derivatives. If $f_{N,\delta}$ is the approximation of $f$ given in Eq. (5), then
$$E\big[ \| f - f_{N,\delta} \|_{L^2(\Gamma)} \big] \le C \Big( \delta^2 + \delta^{q} + \frac{1}{\sqrt{N \delta^d}} \Big).$$
Moreover, if the second derivatives of $f$ are bounded by some positive constant, then
$$E\big[ \| f - f_{N,\delta} \|_{L^2(\Gamma)} \big] \le C \Big( \delta^2 + \frac{1}{\sqrt{N \delta^d}} \Big),$$
where $C$ is a constant that does not depend on $\delta$ or $N$.
Proof 3
Let $I_\delta f$ denote the finite element nodal interpolant of $f$; then
(11)  $\displaystyle \| f - f_{N,\delta} \|_{L^2(\Gamma)} \le \| f - I_\delta f \|_{L^2(\Gamma)} + \| I_\delta f - f_{N,\delta} \|_{L^2(\Gamma)}$
(12)  $\displaystyle \le C \delta^2\, | f |_{H^2(\Gamma)} + \| I_\delta f - f_{N,\delta} \|_{L^2(\Gamma)},$
where $C$ is a constant that does not depend on $\delta$ [7]. Considering the second term in the above inequality, we have
(13)  $\displaystyle \| I_\delta f - f_{N,\delta} \|_{L^2(\Gamma)} = \Big\| \sum_{k=1}^{K} \big( f(\mathbf{z}_k) - c_k \big) \phi_k \Big\|_{L^2(\Gamma)}$
(14)  $\displaystyle \le \Big\| \sum_{k=1}^{K} \big( f(\mathbf{z}_k) - E[c_k] \big) \phi_k \Big\|_{L^2(\Gamma)} + \Big\| \sum_{k=1}^{K} \big( E[c_k] - c_k \big) \phi_k \Big\|_{L^2(\Gamma)}$
(15)  $\displaystyle \le C \delta^{q} + \Big\| \sum_{k=1}^{K} \big( E[c_k] - c_k \big) \phi_k \Big\|_{L^2(\Gamma)}.$
The last inequality is obtained using Lemma 2 and Eq. (3). Considering that the samples $\mathbf{y}_i$ are independent, we have
(16)  $\displaystyle E\big[ (c_k - E[c_k])^2 \big] = \operatorname{Var}(c_k) \le \frac{C_k}{N \delta^d},$
where $C_k \le C$ for all $k = 1, \ldots, K$, and $C$ depends on $f$ but not on $\delta$ or $N$. Hence
(17)  $\displaystyle E\big[ \| f - f_{N,\delta} \|_{L^2(\Gamma)} \big] \le C \Big( \delta^2 + \delta^{q} + \frac{1}{\sqrt{N \delta^d}} \Big),$
so the first result is obtained. If the second derivatives of $f$ are bounded by some positive constant, then $q = 2$ in Eq. (17), so the second result also follows.
We note that the numerical examples considered below show that convergence can be obtained even for cases where the PDF does not have the regularity assumed in Theorem 1, i.e., even when the PDF is not differentiable or even continuous.
2.2 Numerical illustrations
In Section 3, we validate our approach by approximating known joint PDFs. Of course, in comparing approximations to an exact known PDF, we pretend that we have no or very little knowledge about the latter except that we have available $N$ samples $\mathbf{y}_i$, $i = 1, \ldots, N$, drawn from the PDF. Comparing with known PDFs allows us to precisely determine errors in the approximation of the PDF determined using our method. Then, in Section 4, we use our method to approximate the PDFs of outputs of interest associated with the solution of a stochastic partial differential equation; in that case, the PDF is not known. All computations were performed on a Dell Inspiron 15, 5000 series laptop with an Intel(R) Core(TM) i3-4030U CPU @ 1.90 GHz and 8 GB of RAM.
Note that in all the numerical examples, $\Gamma$ denotes a sampling domain, i.e., all samples lie within $\Gamma$. For most cases, $\Gamma$ is also the support domain for the PDF. However, we also consider the case in which the support of the PDF is not known beforehand so that the sampling domain is merely assumed to contain, but not be the same as, the support domain.
For simplicity, the sample space $\Gamma$ is assumed to be a bounded $d$-dimensional box. We subdivide the parameter domain into a congruent set of bins consisting of $d$-dimensional hypercubes of side $\delta$, where $n$ denotes the number of intervals in the subdivision of $\Gamma$ along each of the coordinate directions. We then have that the number of bins is given by $n^d$ and the number of nodes is given by $(n+1)^d$. Our new method (5) can be easily extended to more general parameter domains, including ones of infinite extent, and to more general meshing strategies, but at the cost of substantially more complex notations. Likewise, for simplicity, we assume throughout that the components of the random variable $\mathbf{y}$ are independently distributed so that the joint PDFs are given as the product of univariate PDFs; our method can also be extended to cases in which the components of $\mathbf{y}$ are correlated.
3 Validation through comparisons with known PDFs
In this section, we assume that we have available $N$ samples $\{\mathbf{y}_i\}_{i=1}^N$ drawn from a known PDF $f$. The error incurred by any approximation $\widetilde{f}$ of the exact PDF is measured by
(18)  $\displaystyle \epsilon = \| f - \widetilde{f}\, \|_{L^2(\Gamma)}.$
In particular, we use this error measure for our approximation $f_{N,\delta}$ defined in (5).
The accuracy of approximations of a PDF, be they by histograms or by our method, depends on both $N$ (the number of samples available) and $\delta$ (the length of the bin edges). Thus, $N$ and $\delta$ should be related to each other in such a way that errors due to sampling and bin size are commensurate with each other. Thus, if the bin-size errors are of $O(\delta^\alpha)$ and the sampling error is of $O(N^{-1/2})$, we set
(19)  $\displaystyle N = C_N\, \delta^{-2\alpha}.$
Thus, once $C_N$ and $\alpha$ are specified, one can choose $\delta$ (or, equivalently, $n$) and the value of $N$ is determined by (19), or vice versa. Clearly, $N$ increases as $\delta$ decreases.
For most of the convergence rate illustrations given below, we
choose $C_N = 1$, so that $N = \delta^{-2\alpha}$ and $N^{-1/2} = \delta^{\alpha}$.  (20)
Note that neither (19) nor (20) depends on the dimension $d$ of the parameter domain but, of course, the numbers of bins and of nodes do. If the variance of the PDF is large, one may want to increase the size of $N$ by multiplying the right-hand side of (19) by the variance.

The computation of the coefficients $c_k$ defined in (5) may be costly if $N$ is large and, consequently, $\delta$ is small. To improve the computational efficiency, we evaluate a basis function $\phi_k$ at a sample point only if the point is within the support $\omega_k$ of $\phi_k$. However, the determination of the bin $B_j$ such that $\mathbf{y}_i \in B_j$ may be expensive in case of large $N$ and large numbers of bins. For this task, we employ the efficient particle locating algorithm described in [8].
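The algorithm of [8] handles general grids; for the congruent hypercube bins used here, bin location reduces to integer index arithmetic, as the following sketch (illustrative interface, not the algorithm of [8]) shows:

```python
import numpy as np

def locate_bins(samples, a, delta, n):
    """Multi-index of the bin containing each sample on a uniform grid.

    samples: (N, d) array of sample points;
    a: (d,) lower corner of the box; delta: bin side; n: bins per direction.
    For congruent hypercube bins the search is pure index arithmetic,
    so no tree or cell-list structure is needed.
    """
    idx = ((np.asarray(samples, dtype=float) - a) / delta).astype(int)
    # Samples lying exactly on the upper faces are assigned to the last bin.
    return np.clip(idx, 0, n - 1)
```

Once a sample's bin is known, the sample contributes only to the $2^d$ basis functions associated with that bin's vertices, which is how the cost of computing the $c_k$ is kept proportional to $N$.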
3.1 A smooth PDF with known support
For the first example we consider, we ignore the fact that we know the exact PDF we are trying to approximate, but assume we know the support of the PDF so the sampling domain is also the support domain. We also use this example to illustrate that the number of Monte Carlo samples needed is independent of the dimension of the parameter space.
We set the sampling domain $\Gamma$ to be a $d$-dimensional box centered at the origin, so that $\Gamma$ is also the support domain, and assume that the components $y_j$ of the random vector $\mathbf{y}$ are independently and identically distributed according to a truncated standard Gaussian PDF so that the joint PDF is given by
(21)  $\displaystyle f(\mathbf{y}) = C_G \prod_{j=1}^{d} e^{-y_j^2/2}.$
The scaling factor $C_G$ is introduced to ensure that we indeed have a PDF, i.e., that the integral of $f$ over $\Gamma$ is unity. Note that because the standard deviation of the underlying standard Gaussian PDF is unity, the values of the truncated Gaussian distribution (21) near the faces of the box are very small, so the results of this example are given to a precision such that they would not change if one considers instead the (nontruncated) standard Gaussian distribution. Also note that because the second moment of the standard Gaussian distribution is unity, the absolute error (18) is also very close to the error relative to the given PDF.

Before we use the formula (20) to relate $N$ and $\delta$, we first separately examine the convergence rates with respect to $\delta$ and $N$. To this end, to illustrate the convergence with respect to $\delta$, we set
(22)  $N$ fixed and sufficiently large,
so that the error due to sampling is relatively negligible. For illustrating the convergence with respect to $N$, we set
(23)  $\delta$ fixed and sufficiently small,
so that the error due to the bin size is relatively negligible. The plots in Figure 2 illustrate the second-order convergence rate with respect to $\delta$ and the half-order convergence rate with respect to $N$, as predicted in Theorem 1.
We now turn to relating $N$ and $\delta$ using the formula (20). We consider the multivariate truncated standard Gaussian PDF (21) for several values of $d$. Plots of the error vs. both $\delta$ and $N$ are given in Figure 3, from which we observe, in all cases, the second-order convergence rate with respect to $\delta$ and the half-order convergence rate with respect to $N$. We also observe that the errors and the number of samples used are largely independent of the value of $d$. A visual comparison of the exact truncated standard Gaussian distribution (21) and its approximation (5) for the bivariate case is given in Figure 4.
Computational costs are reported in Table 1 in which, for each $d$, we choose $\delta$ and then use (20) to determine $N$. Reading vertically for each $d$, we see the increase in computational costs due to the decrease in $\delta$ and the related increase in $N$. Reading horizontally, so that $\delta$ and $N$ are fixed, the increase in costs is due to the increasing number of bins and nodes as $d$ increases. We note that our method is amenable to highly scalable parallelization not only as $\delta$ decreases and $N$ increases, but also as $d$ increases so that, through parallelization, our method may prove to be useful in dimensions higher than those considered here. When developing a parallel implementation of our method, using a particle locating algorithm such as that of [8] to locate a given node that is shared by several processors would be crucial to realizing the gains in efficiency due to parallelization.
3.2 A smooth PDF with unknown support
Still considering a known PDF, we now consider a case for which we not only pretend we do not know the PDF, but also we do not know its support. Specifically, we consider the uniform distribution on an interval $[a, b]$. A univariate distribution suffices for the discussions of this case; multivariate distributions can be handled by the obvious extensions of what is said here about the univariate case. We assume that we know that the support of the known PDF lies within a larger interval $[A, B] \supset [a, b]$. Of course, we may be mistaken about this so that once we examine the sample set, we may observe that some of the samples fall outside of $[A, B]$. In this case we can enlarge the interval until we observe that the interval spanned by the smallest to largest sample values is contained within the new $[A, B]$.

We first simply assume that we have determined, either through external knowledge or by the process just described, that the support of the PDF we are considering lies somewhere within the interval $[A, B]$. Not knowing the true support, we not only sample in the larger interval (so that here the sampling domain is $\Gamma = [A, B]$), but we also build the approximate PDF with respect to $[A, B]$. We remark that a uniform distribution provides a stern test when the support of the distribution is not known because that distribution is as large at the boundaries of its support as it is in the interior. Distributions that are small near the boundaries of their support, e.g., the truncated Gaussian distribution of Section 3.1, would yield considerably smaller errors and better convergence rates compared to what are obtained for the uniform distribution. Choosing $\alpha = 2$ in (20), we obtain the errors plotted in Figure 5. Clearly, the convergence rates are nowhere near optimal. Of course, the reason for this is that by building the approximation with respect to $[A, B]$, we are not approximating the uniform distribution on $[a, b]$, but instead we are approximating the discontinuous distribution that equals $1/(b - a)$ on $[a, b]$ and zero on the remainder of $[A, B]$.
For comparison purposes we provide, in Figure 6, results for the case where we use the support interval $[a, b]$ for both sampling and for approximation construction. Because now the PDF is smooth, in fact constant, throughout the interval in which the approximation is constructed, we obtain optimal convergence rates.
One can improve on the results of Figure 5, even if one does not know the support of the PDF one is trying to approximate, by taking advantage of the fact that the samples obtained necessarily have to belong to the support of the PDF and therefore provide an estimate for that support. For instance, for the example we are considering, one could proceed as follows.

1. For a chosen $N$, sample over $[A, B]$.

2. Determine the minimum and maximum values $y_{\min}$ and $y_{\max}$, respectively, of the sample set $\{y_i\}_{i=1}^N$.

3. Choose the number of bins $n$ and set $\delta = (y_{\max} - y_{\min})/n$.

4. Build the approximation over the interval $[y_{\min}, y_{\max}]$ with a bin size $\delta$.
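The support-estimation steps above can be sketched as follows (Python/NumPy; the function name and return values are illustrative):

```python
import numpy as np

def estimate_support_and_bins(samples, n):
    """Estimate the support interval from the sample extremes (step 2),
    then define n equal bins over the estimated interval (steps 3-4).

    Returns the estimated endpoints, the bin size delta, and the grid
    nodes over [y_min, y_max] on which the approximation is built.
    """
    samples = np.asarray(samples, dtype=float)
    y_min, y_max = samples.min(), samples.max()
    delta = (y_max - y_min) / n
    nodes = y_min + delta * np.arange(n + 1)
    return y_min, y_max, delta, nodes
```

The piecewise-linear approximation is then built on these nodes exactly as before; since $[y_{\min}, y_{\max}]$ lies inside the true support, the PDF being approximated is continuous on the approximation interval.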
It is reasonable to expect that as $N$ increases, the interval $[y_{\min}, y_{\max}]$ becomes a better approximation to the true support interval $[a, b]$. Figure 7 illustrates the convergence of $[y_{\min}, y_{\max}]$ to $[a, b]$. Note that because $[y_{\min}, y_{\max}] \subseteq [a, b]$, the exact PDF is continuous within $[y_{\min}, y_{\max}]$. Thus, it is also reasonable to expect that, because the approximate PDF is built with respect to an interval which is contained within the support of the exact PDF, there will be an improvement in the accuracy of that approximation compared to that reported in Figure 5 and, in particular, that as one increases $N$, so that $\delta$ decreases and $n$ increases, better rates of convergence will be obtained. Figure 8 corresponds to the application of this procedure and shows the substantially smaller errors and substantially higher convergence rates compared to those reported in Figure 5.
A visual comparison of the approximations obtained using the smallest-$\delta$/largest-$N$ pairing corresponding to Figures 5, 6, and 8 is given in Figure 9. The defects resulting from the use of the interval $[A, B]$ for constructing the approximation of a uniform PDF that has support on the interval $[a, b]$ are clearly evident. On the other hand, using the support-interval estimation process outlined above results in a visually identical approximation to that obtained using the correct support interval $[a, b]$. Note that for the smallest value of $\delta$, we have that $y_{\min}$ approximates $a$ and $y_{\max}$ approximates $b$ to seven decimal places.
3.3 A nonsmooth PDF
We next consider the approximation of a nonsmooth PDF. Specifically, we consider the centered truncated Laplace distribution
(24)  $\displaystyle f(y) = C_L\, e^{-|y|/\beta}$
over $\Gamma$, where $C_L$ is a scaling factor that ensures a unitary integral of the PDF over $\Gamma$ and $\beta > 0$ denotes the scale parameter. Here, the support domain and sampling domain are the same. This distribution is merely continuous, i.e., its derivative is discontinuous at $y = 0$, so one cannot expect optimally accurate approximations. However, as illustrated in Figure 10, the approximation does seem to converge, but at a lower rate with respect to $\delta$ and at the optimal rate with respect to $N$. The latter is not surprising because Monte Carlo sampling is largely impervious to the smoothness, or lack thereof, of the function being approximated.
Whenever there is any information about the smoothness of the PDF, one can choose an appropriate value of $\alpha$ in (19). Alternately, possibly through a preliminary investigation, one can estimate the convergence rate of the approximation (5). In the case of the Laplace distribution, which is continuous but not continuously differentiable, one cannot expect a convergence rate greater than one. Selecting $\alpha = 1$ in (20) to relate $N$ and $\delta$, we obtain the results given in Figure 11, which depicts rates somewhat worse than we should perhaps expect.
The Laplace distribution, although not globally $C^1$, is piecewise smooth, with failure of smoothness occurring only at the symmetry point of the distribution. For example, for the particular case of the centered distribution (24), the distribution is smooth for $y < 0$ and for $y > 0$. Thus, in general, one could build two separate, optimally accurate approximations, one to the right of the symmetry point and the other to the left of that point. Of course, doing so requires knowledge of where that point is located. If this information is not available, then one can estimate the location of that point by a process analogous to what we described in Section 3.2 for distributions whose support is not known a priori. Such a process can be extended to distributions with multiple points at which smoothness is compromised.
3.4 Bivariate mixed PDF
We now consider a bivariate PDF in which the random variables $y_1$ and $y_2$ are independently distributed according to different PDFs. Specifically, $y_1$ is distributed according to a truncated Gaussian distribution with zero mean and standard deviation $\sigma_1 \ne 1$, whereas $y_2$ is distributed according to a truncated standard Gaussian. The joint PDF is then given by
(25)  $\displaystyle f(y_1, y_2) = C_M\, e^{-y_1^2/(2\sigma_1^2)}\, e^{-y_2^2/2},$
where the scaling factor $C_M$ plays the same role as $C_G$ in (21). Results for this case are shown in Figure 12, where we observe optimal convergence rates with respect to both $\delta$ and $N$. Visual evidence of the accuracy of our approach is given in Figure 13, which shows the approximation of the exact PDF (25) and zoom-ins of the exact and approximate PDFs. Computational times are very similar to those for $d = 2$ in Table 1, so they are not provided here.
4 Application to unknown PDFs associated with a stochastic PDE
In this section, we consider the construction of approximations of the PDF of outputs of interest that depend on the solution of a stochastic PDE. In general, such PDFs are unknown a priori.
The boundary value problem considered is the stochastic Poisson problem
(26)  $\displaystyle -\nabla \cdot \big( a(\mathbf{x}, \mathbf{y})\, \nabla u(\mathbf{x}, \mathbf{y}) \big) = g(\mathbf{x}) \ \text{ in } D \times \Gamma, \qquad u(\mathbf{x}, \mathbf{y}) = 0 \ \text{ on } \partial D \times \Gamma,$
where $D$ denotes a spatial domain with boundary $\partial D$ and $\Gamma$ is the sample space for the input random vector $\mathbf{y}$, which we assume is distributed according to a known input joint PDF $\rho(\mathbf{y})$.
For the coefficient function $a(\mathbf{x}, \mathbf{y})$, we assume that there exists a positive constant $a_{\min}$ such that $a(\mathbf{x}, \mathbf{y}) \ge a_{\min}$ almost surely on $D$ for all $\mathbf{y} \in \Gamma$. We also assume that $a$ is measurable with respect to $\mathbf{y}$. It is then known that the system (26) is well posed almost surely for $\mathbf{y} \in \Gamma$; see, e.g., [2, 15, 21, 22] for details.
Stochastic Galerkin approximation of the solution of the PDE
We assume that we have in hand an approximation $u_{h,p}$ of the solution $u$ of the system (26). Specifically, spatial approximation is effected via a piecewise-quadratic finite element method [7, 10]. Because this aspect of our algorithm is completely standard, we do not give further details about how we effect spatial approximation. For stochastic approximation, i.e., for approximation with respect to the parameter domain $\Gamma$, we employ a spectral method. Specifically, we approximate using global orthogonal polynomials, where orthogonality is with respect to $\Gamma$ and the known PDF $\rho(\mathbf{y})$. Thus, if $\{\psi_m(\mathbf{x})\}_{m=1}^{M}$ denotes a basis for the finite element space used for spatial approximation and $\{\pi_\ell(\mathbf{y})\}_{\ell=1}^{L}$ denotes a basis for the spectral space used for approximation with respect to $\mathbf{y}$, the stochastic Galerkin approximation has the form
(27)  $\displaystyle u_{h,p}(\mathbf{x}, \mathbf{y}) = \sum_{m=1}^{M} \sum_{\ell=1}^{L} c_{m\ell}\, \psi_m(\mathbf{x})\, \pi_\ell(\mathbf{y}).$
Note that once the SGM approximation (27) is constructed, it may be evaluated at any point $\mathbf{x} \in D$ and for any parameter vector $\mathbf{y} \in \Gamma$. Having chosen the types of spatial and stochastic approximations we use, the approximate solution $u_{h,p}$, i.e., the set of coefficients $c_{m\ell}$, is determined by a stochastic Galerkin method, i.e., we determine an approximation to the Galerkin projection of the exact solution with respect to both the spatial domain $D$ and the parameter domain $\Gamma$. This approach is well documented so that we do not dwell on it any further; one may consult, e.g., [3, 4, 9, 14, 15], for details. Note that once the surrogate (27) for the solution of the PDE is built, it can be used through direct evaluation to cheaply determine an approximation of the solution of the PDE for any $\mathbf{x}$ and any $\mathbf{y}$ instead of having to do a new approximate PDE solve for each new choice of $\mathbf{x}$ and $\mathbf{y}$.
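Evaluating a separable surrogate of the form (27) amounts to two small dot products; a hedged sketch follows (Python/NumPy, with hypothetical basis-evaluation callables standing in for whatever finite element and spectral bases are in use):

```python
import numpy as np

def eval_surrogate(c, spatial_basis, stochastic_basis, x, y):
    """Evaluate u(x, y) = sum_{m,l} c[m, l] * psi_m(x) * pi_l(y).

    spatial_basis(x) and stochastic_basis(y) return the vectors of basis
    values at x and y; evaluating the surrogate is then just two dot
    products, with no new PDE solve required.
    """
    psi = np.asarray(spatial_basis(x))       # shape (M,)
    pi = np.asarray(stochastic_basis(y))     # shape (L,)
    return psi @ np.asarray(c) @ pi
```

This cheap evaluation is what makes it practical to generate the very large numbers of output-of-interest samples needed by the density estimator of Section 2.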
The error in $u_{h,p}$ depends on the grid-size parameter $h$ used for spatial approximation and the degree $p$ of the orthogonal polynomials used for approximation with respect to the input parameter vector $\mathbf{y}$. In practice, these parameters should be chosen so that the two errors introduced are commensurate. However, here, because our focus is on stochastic approximation and because throughout we use the same finite element method for spatial approximation, we use a small enough spatial grid size so that the error due to spatial approximation is, for all practical purposes, negligible compared to the errors due to stochastic approximation.
Outputs of interest depending on the solution of the PDE
In our context, outputs of interest are spatially independent functionals of the solution of the system (26). Here, we consider the two specific functionals