Asymptotic properties and approximation of Bayesian logspline density estimators for communication-free parallel methods

10/25/2017
by   Alexey Miroshnikov, et al.

In this article we perform an asymptotic analysis of Bayesian parallel density estimators which are based on the logspline density estimation presented in Stone and Koo (1986). The parallel estimator we introduce is in the spirit of the kernel density estimator presented by Neiswanger, Wang and Xing (2015). We provide a numerical procedure that produces the density estimator itself in place of the sampling algorithm. We derive an error bound for the mean integrated squared error of the full data posterior estimator and investigate the parameters that arise from the logspline density estimation and the numerical approximation procedure. Our investigation leads to a choice of parameters for which the error bound scales appropriately.


1 Introduction

The recent advances in data science and big data research have brought challenges in analyzing large data sets in full. These massive data sets may be too large to read into a computer’s memory in full, and they may be distributed across different machines. In addition, processing them in full can be prohibitively time-consuming. To alleviate these difficulties, many parallel computing methods have recently been developed. One such approach partitions large data sets into subsets, where each subset is analyzed on a separate machine using parallel Markov chain Monte Carlo (MCMC) methods

[8, 9, 10]; here, communication between machines is required for each MCMC iteration, increasing computation time.

Due to the limitations of methods requiring communication between machines, a number of alternative communication-free parallel MCMC methods have been developed for Bayesian analysis of big data [5, 6]. In these approaches, Bayesian MCMC analysis is performed on each subset independently, and the subset posterior samples are combined to estimate the full data posterior distribution. Neiswanger, Wang and Xing [5] introduced a parallel kernel density estimator that first approximates each subset posterior density and then estimates the full data posterior by multiplying the subset posterior estimators together. The authors of [5] show that their estimator is asymptotically exact; they then develop an algorithm that generates samples from the distribution approximating the full data posterior estimator. Although the estimator is asymptotically exact, the algorithm of [5] does not perform well for posteriors of non-Gaussian shape. This under-performance is attributed to the way the subset posterior densities are constructed, which produces near-Gaussian posteriors even when the true underlying distribution is non-Gaussian. Another limitation of the method of Neiswanger, Wang and Xing is its behavior in high-dimensional parameter spaces, since the method becomes impractical to carry out as the number of model parameters grows.

Miroshnikov and Conlon [6] introduced a new approach for parallel MCMC that addresses the limitations of [5]. Their method performs well for non-Gaussian posterior distributions and analyzes densities only marginally for each parameter, so that the size of the parameter space is not a limitation. The authors use logspline density estimation for each subset posterior, and the subsets are combined by a direct numerical product of the subset posterior estimates. Note, however, that unlike [5] this technique does not produce joint posterior estimates.
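
To make the construction concrete, the following sketch illustrates the communication-free strategy in a few lines of Python. It is only an illustration: the subset samples, grid, and number of subsets are made up, and a Gaussian kernel density estimator stands in for the logspline estimator analyzed in this paper.

# Minimal sketch of the communication-free combination of [5, 6]: estimate each subset
# posterior from its own samples, then combine by a pointwise product on a common grid.
# gaussian_kde is a stand-in for the logspline estimator; all values are illustrative.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical scalar-parameter MCMC samples from M = 4 subset posteriors.
subset_samples = [rng.normal(loc=0.5, scale=1.0, size=5000) for _ in range(4)]

# Step 1: estimate each subset posterior density independently (no communication).
subset_estimators = [gaussian_kde(s) for s in subset_samples]

# Step 2: evaluate the product of the subset estimates on a common parameter grid.
theta = np.linspace(-4.0, 5.0, 1001)
product = np.ones_like(theta)
for estimator in subset_estimators:
    product *= estimator(theta)

# `product` is an unnormalized estimate of the full-data posterior; its normalization
# is the subject of the numerical procedure discussed below.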

The estimator introduced in [6] follows the ideas of Neiswanger et al. [5]. Specifically, let be the likelihood of the full data given the parameter . We partition x into disjoint subsets , with . For each subset we draw samples whose distribution is given by the subset posterior density . Given the prior , and assuming that the data subsets are independent of each other, the posterior density, see [5], is expressed by

(1)
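
For reference, the subset-posterior factorization used in Neiswanger, Wang and Xing [5] can be written as follows; the notation (M subsets, prior p(θ) raised to the power 1/M in each factor) is the standard notation of [5] and may differ slightly from the symbols used in this paper.

% Factorization of the full-data posterior into subset posteriors, in the style of [5].
\[
  p(\theta \mid x)
  \;\propto\; p(\theta) \prod_{i=1}^{M} p\bigl(x^{(i)} \mid \theta\bigr)
  \;=\; \prod_{i=1}^{M} p(\theta)^{1/M}\, p\bigl(x^{(i)} \mid \theta\bigr)
  \;\propto\; \prod_{i=1}^{M} p_i\bigl(\theta \mid x^{(i)}\bigr),
  \qquad
  p_i\bigl(\theta \mid x^{(i)}\bigr) \propto p(\theta)^{1/M}\, p\bigl(x^{(i)} \mid \theta\bigr).
\]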

In our work, we investigate the properties of the estimator , defined in [6], that has the form

(2)

where is the logspline density estimator of and where we have suppressed the dependence on the data .

The estimated product of the subset posterior densities is, in general, unnormalized. This motivates us to define the normalization constant for the estimated product . Thus, the normalized density , one of the main points of interest in our work, is given by

Computing the normalization constant analytically is a difficult task since the subset posterior densities are not explicitly calculated, with the exception of a finite number of points where . By taking the product of these values for each we obtain the value of . This allows us to numerically approximate the unnormalized product by using Lagrange interpolation polynomials. This approximation is denoted by . We then approximate the constant by numerically integrating . The resulting approximation of the normalization constant is denoted by and is given by

The newly defined density acts as the estimator for the full-data posterior .
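
The normalization step just described can be sketched numerically as follows. The example is self-contained and uses toy stand-ins: scipy normal densities play the role of the subset posterior estimators, an equispaced grid plays the role of the spline-knot coordinates at which their values are available, and the barycentric form is used for the Lagrange interpolating polynomial.

# Sketch of the numerical normalization: take the product of the subset estimates at the
# grid points, interpolate it with a (barycentric) Lagrange polynomial, and integrate
# the interpolant numerically to approximate the normalization constant.
import numpy as np
from scipy.stats import norm
from scipy.interpolate import BarycentricInterpolator
from scipy.integrate import trapezoid

# Toy subset posterior estimators, available only through their values at the grid.
subset_densities = [norm(loc=m, scale=1.0).pdf for m in (0.3, 0.5, 0.6, 0.4)]
knots = np.linspace(-2.0, 3.0, 9)
product_at_knots = np.prod([f(knots) for f in subset_densities], axis=0)

# Lagrange interpolant of the unnormalized product through the knot values.
interpolant = BarycentricInterpolator(knots, product_at_knots)

# Approximate the normalization constant by numerically integrating the interpolant.
fine_grid = np.linspace(knots[0], knots[-1], 2001)
c_hat = trapezoid(interpolant(fine_grid), fine_grid)

# Normalized numerical estimator of the full-data posterior on the fine grid.
posterior_estimate = interpolant(fine_grid) / c_hat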

In this paper, we establish error estimates between the three densities via the mean integrated squared error, or MISE, defined for two functions f and g as

(3)   MISE(f, g) = E[ ∫ (f(t) − g(t))^2 dt ].
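
Since (3) involves an expectation over the randomness of the estimator, it can be approximated in practice by averaging the integrated squared error over repeated simulated samples. The following sketch does this for a toy target density; the target, sample sizes, and the use of a Gaussian kernel estimator in place of the logspline estimator are all illustrative.

# Monte Carlo approximation of MISE: average the integrated squared error between the
# density estimate and the true density over independent replications of the sample.
import numpy as np
from scipy.stats import norm, gaussian_kde
from scipy.integrate import trapezoid

rng = np.random.default_rng(1)
grid = np.linspace(-5.0, 5.0, 1001)
true_density = norm(loc=0.0, scale=1.0).pdf(grid)

def integrated_squared_error(sample):
    estimate = gaussian_kde(sample)(grid)
    return trapezoid((estimate - true_density) ** 2, grid)

# Average over independent replications to approximate the expectation in (3).
n_samples, n_replications = 2000, 50
mise_estimate = np.mean([
    integrated_squared_error(rng.normal(size=n_samples)) for _ in range(n_replications)
])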

Thus, our work involves two types of approximations: 1) the construction of using logspline density estimators and 2) the construction of the interpolation polynomial . The methodology of logspline density estimation was introduced in [2] and corresponding error estimates between the estimator and the density it is approximating are presented in [3, 4]. These error estimates depend on three factors: i) the number of samples drawn from the subset posterior density, ii) the number of knots used to create the -order B-splines, and iii) the step-size of those knots, which we denote by .

In our work we estimate the MISE between the functions and by adapting the estimation techniques introduced in [3, 4]. We then utilize this analysis to establish a similar estimate for the normalized densities and ,

where and is the number of continuous derivatives of . Notice that the expression contains two terms: the first depends on the number of samples and the number of knots, and the other depends on the placement of the spline knots. Both terms converge to zero, and for MISE to scale optimally both terms must converge at the same rate. To this end, we choose and each to be functions of the vector and to scale appropriately with the norm . This simplifies the above estimate to

where the parameter is related to the convergence of the logspline density estimators.

The estimate for MISE between and is obtained in a similar way by utilizing Lagrange interpolation error bounds, as described in [7]. This error depends on two factors: i) the step-size of the grid points chosen to construct the polynomial, where the grid points correspond to the coordinates discussed earlier, and ii) the degree of the Lagrange polynomial. The estimate obtained is also shown to hold for the normalized densities and .

where is the minimal distance between the spline knots and is chosen to asymptotically scale with the norm of the vector of samples N, see Section 2.
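
The dependence on the grid step-size and the polynomial degree comes from the classical Lagrange interpolation remainder, recalled below for reference; the exact constants and notation in [7] and in the estimate above may differ.

% Classical remainder bound: for f with q+1 continuous derivatives and interpolation
% nodes t_0 < ... < t_q with spacing at most h, the degree-q interpolant P_q satisfies
\[
  \bigl| f(t) - P_q(t) \bigr|
  \;\le\; \frac{\max_{s} \bigl| f^{(q+1)}(s) \bigr|}{(q+1)!}
          \prod_{j=0}^{q} \bigl| t - t_j \bigr|,
  \qquad t_0 \le t \le t_q,
\]
% and the product of the node distances is bounded by a constant (depending on q)
% times h^{q+1}, so the interpolation error is of order h^{q+1} for fixed q.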

We then combine both estimates to obtain a bound for MISE for the densities and . We obtain

In order for MISE to scale optimally, the two terms in the sum must converge to zero at the same rate. As before with the distance between and , we choose to scale appropriately with the norm of the vector N. This leads to the optimal error bound for the distance between the estimator and the density ,

(4)

The paper is arranged as follows. In Section 2 we set the notation and hypotheses that form the foundation of the analysis. In Section 3 we derive an asymptotic expansion for the MISE of the non-normalized estimator, which is central to the analysis performed in subsequent sections. There we also analyze the MISE for the full data posterior density estimator . In Section 4, we perform the analysis for the numerical estimator . In Section 5 we present our simulation experiments and discuss the results. Finally, in the appendix we provide supplementary lemmas and theorems employed in Section 3 and Section 4.

2 Notation and hypotheses

For the convenience of the reader we collect in this section all hypotheses and results relevant to our analysis and present the notation that is utilized throughout the article.

  1. Motivated by the form of the posterior density in Neiswanger et al. [5]

    we consider the probability density function of the form

    (5)

    where we assume that have compact support on the interval .

  2. For each is a probability density function. We consider the estimator of in the form

    (H2-a)

    and for each is the logspline density estimator of the probability density that has the form

    (H2-b)

    We also consider the additional estimators of as defined in (71) and

    Here are independent identically distributed random variables and is the logspline density estimate introduced in Definition (37), with number of knots and B-splines of order .

    (6)

    where .

The mean integrated squared error of the estimator of the product is defined by

(7)

where we use the notation .

We assume that the probability density functions satisfy the following hypotheses:

  1. The number of samples for each subset is parameterized by a governing parameter as follows:

    such that for all

    (8)

    Note that .

  2. For each for some fixed in . For the number of knots for each are parameterized by as follows:

    (9)

    where is the number of knots for B-splines on the interval and thus

    and we require

  3. For the knots , we write

    (10)
  4. For each , and density there exists such that

    (11)
  5. Let denote the -norm on . For defined as in H1, there exists such that

    (12)
  6. For each subset , the B-splines are created by choosing a uniform knot sequence (see the construction sketch following this list). Thus,

    (13)

    Let

    (14)

    We assume that scale in a similar way to the number of samples, i.e.,

    where is the same as in hypothesis (H6).
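
The uniform-knot construction of the last hypothesis can be sketched with scipy as follows; the interval, spline order, and number of knots are illustrative and are not the values required by the hypotheses above.

# Sketch of a B-spline basis on a uniform knot sequence over a compact interval.
import numpy as np
from scipy.interpolate import BSpline

a, b = 0.0, 1.0          # compact support interval of the density
order = 4                # B-spline order (degree + 1); here cubic splines
degree = order - 1
n_interior = 10          # number of knots strictly inside (a, b)

# Uniform knot sequence: equally spaced knots on [a, b], extended by `degree` extra
# knots on each side so that the basis covers the whole interval.
h = (b - a) / (n_interior + 1)
knots = a + h * np.arange(-degree, n_interior + 2 + degree)

# One basis element per window of (order + 1) consecutive knots; evaluating all of them
# on a grid gives the design matrix used by logspline-type estimators.
grid = np.linspace(a, b, 201)
basis = np.column_stack([
    BSpline.basis_element(knots[i:i + order + 1], extrapolate=False)(grid)
    for i in range(len(knots) - order)
])
basis = np.nan_to_num(basis)  # basis_element returns NaN outside each element's support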

3 Analysis of MISE for

3.1 Error analysis for unnormalized estimator

Suppose we are given a data set x partitioned into disjoint subsets . We are interested in the subset posterior densities . For each such density we apply the preceding analysis. Let and be the corresponding logspline estimators as defined in (70) and (71), respectively. By definition of , it is equal to the logspline density estimate on , where is the set defined in (69) for .

Definition 1.

For , let be the set defined in (6). We then set

which is the set on which the maximizer of the log-likelihood exists for each data subset, and thus all logspline density estimators exist.

Lemma 2.

Suppose the conditions in (H3) and (H4) hold. Given the previous definition, we have that

Proof.

By Theorem 53 we have that

and the result follows by taking to infinity. ∎

Since the probability of the set on which the estimators exist for all tends to 1, it is natural to carry out our analysis for a conditional MISE on the set . From a practical standpoint, we will never encounter the set on which the maximizer of the log-likelihood does not exist.

At this point we state a bound for , which will be essential in our analysis of MISE.

Lemma 3.

Suppose the hypotheses (H1)-(H7) hold and that we are restricted to the sample subspace . We then have the following:

  • There exists a positive constant such that

  • There exists a positive constant such that

  • Using the bounds from and we have

Proof.
  • The bound can be shown by writing

    and then applying Theorem 56. For each there will be an appearing in the bound and we can take .

  • Similarly to part , we can write

    and then we apply Lemma 47. For each there will be constants and appearing and we can take .

  • To see why this is true, we write

    If then

    If then

    where the last step is justified by the fact that . This implies

    and then we apply the bounds from the previous two parts.

This leads us directly to the theorem for the conditional MISE of the unnormalized densities and .

Theorem 4.

(Conditional MISE for unnormalized and )
Assume the conditions (H1)-(H7) hold. Given we have

(15)

where are as in Lemma 3.

In addition, if (H8) holds, then MISE scales optimally with regard to the number of samples,

(16)
Proof.

By definition of the conditional MISE and Lemma 3, we have

which implies (15). Next, if (H8) holds, then (16) follows directly. ∎

Remark 5.

It is interesting to note how the number of knots, their placement, and the number of samples all play a role in the above bound. If we want the estimator to be accurate, all of the parameters and must be chosen appropriately.

3.2 Analysis for renormalization constant

We now consider the error that arises in MISE when one renormalizes the product of the estimators so that it becomes a probability density. The renormalization can affect the error since and are rescaled. We define the renormalization constant and its estimator to be

(17)

Therefore, we are interested in analyzing

We first state the following lemma for and .

Lemma 6.

Let and be defined as in (6). Suppose that (H8) holds and we are restricted to the sample subspace . Then we have

(18)
Proof.

By definition of and , we have

where the inequality is justified by Lemma 3(c). Dividing by , the result then follows from hypothesis (H8). ∎

The above lemma thus tells us that, when restricted to the sample subspace , the space on which the logspline density estimators are all defined, the renormalization constant of the product of the estimators approximates the true renormalization constant .

Knowing now how scales, we can begin analyzing on the sample subspace. To make the analysis slightly easier, however, we introduce a new functional, called . As we will show, this functional is asymptotically equivalent to MISE, which allows us to see how MISE scales without having to analyze it directly.

Definition 7.

Suppose and hypotheses (H1)-(H2) hold. Given the sample subspace we define the functional

(19)
Proposition 8.

The functional is asymptotically equivalent to MISE on , in the sense that

(20)
Proof.

Notice that can be written as

and thus by Lemma 6

which then implies the result. ∎

We conclude our analysis with the next theorem, which states how MISE scales for the renormalized estimators.

Theorem 9.

Let . Assume the conditions (H1)-(H8) hold. Then

(21)
Proof.

We will do the work for and the result will follow from Proposition 8. Notice that can be written as below. Also, let

We now determine how each of the , scales. For , by Lemma 6 we have

for we have from (H8)

and for we have from Lemmas 3(c) and 6

Thus, by hypotheses (H7)-(H8)

4 Numerical Error

So far we have estimated the error that arises between the unknown density and the full-data estimator . However, in practice it is difficult to evaluate the renormalization constant

defined in (17). The difficulty is due to the process of generating MCMC samples, as a result of which is not explicitly known. To circumvent this issue, our idea is to approximate the integral above numerically. To accomplish this, we interpolate using Lagrange polynomials. This procedure leads to the construction of an interpolant estimator , which we then integrate numerically. We then normalize and use it as a density estimator for . Unfortunately, estimating the error of this kind of approximation for an arbitrary grid of Lagrange interpolation points, chosen independently of the set of B-spline knots, imposes a stringent condition on the smoothness of the B-splines we employ. It turns out that we have to utilize B-splines of order at least . For this reason we consider using Lagrange polynomials of order which satisfy .
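
A minimal sketch of this numerical procedure is given below: a piecewise Lagrange interpolant of fixed degree q is built from the values of a density at grid points and then integrated numerically. The target function, grid, and degree are illustrative and are not tied to the specific B-spline and polynomial orders required above.

# Piecewise Lagrange interpolation of degree q on consecutive groups of grid points,
# followed by numerical integration of the interpolant (approximate normalization).
import numpy as np
from scipy.stats import norm
from scipy.interpolate import BarycentricInterpolator
from scipy.integrate import trapezoid

target = norm(loc=0.5, scale=0.5).pdf   # stands in for the unnormalized product
grid = np.linspace(-2.0, 3.0, 21)       # points where the estimate is available
values = target(grid)
q = 4                                   # degree of each local Lagrange polynomial

def piecewise_lagrange(x):
    # Evaluate, on each group of q+1 consecutive grid points, the local Lagrange
    # interpolant (in barycentric form) and patch the pieces together.
    out = np.empty_like(x)
    for start in range(0, len(grid) - 1, q):
        stop = min(start + q, len(grid) - 1)
        local = BarycentricInterpolator(grid[start:stop + 1], values[start:stop + 1])
        mask = (x >= grid[start]) & (x <= grid[stop])
        out[mask] = local(x[mask])
    return out

fine = np.linspace(grid[0], grid[-1], 4001)
c_hat = trapezoid(piecewise_lagrange(fine), fine)   # approximate normalization constant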

4.1 Interpolation of an estimator: preliminaries

We remind the reader of the model we deal with throughout our work. We recall that the (marginal) posterior of the parameter (which is a component of a multidimensional parameter ), given the data

partitioned into disjoint sets ,