Recent advances in data science and big data research have brought challenges in analyzing large data sets in full. These massive data sets may be too large to read into a computer’s memory, and they may be located on different machines. In addition, processing these data sets can take a long time. To alleviate these difficulties, many parallel computing methods have recently been developed. One such approach partitions large data sets into subsets, where each subset is analyzed on a separate machine using parallel Markov chain Monte Carlo (MCMC) methods [8, 9, 10]; here, communication between machines is required for each MCMC iteration, increasing computation time.
Due to the limitations of methods requiring communication between machines, a number of alternative communication-free parallel MCMC methods have been developed for Bayesian analysis of big data [5, 6]. For these approaches, Bayesian MCMC analysis is performed on each subset independently, and the subset posterior samples are combined to estimate the full data posterior distributions. Neiswanger, Wang and Xing introduced a parallel kernel density estimator that first approximates each subset posterior density and then estimates the full data posterior by multiplying together the subset posterior estimators. These authors show that the estimator they use is asymptotically exact; they then develop an algorithm that generates samples from the posterior distribution approximating the full data posterior estimator. Though the estimator is asymptotically exact, their algorithm does not perform well for posteriors that have non-Gaussian shape. This under-performance is attributed to the method of construction of the subset posterior densities; this method produces near-Gaussian posteriors even if the true underlying distribution is non-Gaussian. Another limitation of the method of Neiswanger, Wang and Xing arises in high-dimensional parameter spaces, since the method becomes impractical to carry out when the number of model parameters increases.
Miroshnikov and Conlon introduced a new approach for parallel MCMC that addresses the limitations of the method of Neiswanger, Wang and Xing. Their method performs well for non-Gaussian posterior distributions and only analyzes densities marginally for each parameter, so that the size of the parameter space is not a limitation. The authors use logspline density estimation for each subset posterior, and the subsets are combined by a direct numeric product of the subset posterior estimates. However, note that this technique does not produce joint posterior estimates, as the method of Neiswanger, Wang and Xing does.
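The estimator introduced by Miroshnikov and Conlon follows the ideas of Neiswanger et al. Specifically, let $p(\mathbf{x}|\theta)$ be the likelihood of the full data $\mathbf{x}$ given the parameter $\theta$. We partition $\mathbf{x}$ into $M$ disjoint subsets $\mathbf{x}_m$, with $m \in \{1, 2, \dots, M\}$. For each subset we draw $N_m$ samples $\theta^m_1, \theta^m_2, \dots, \theta^m_{N_m}$ whose distribution is given by the subset posterior density $p(\theta|\mathbf{x}_m)$. Given the prior $p(\theta)$ and the data sets $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_M$, and assuming that they are independent of each other, the posterior density, see Neiswanger et al., is expressed by
$$p(\theta|\mathbf{x}) \propto p(\theta)\prod_{m=1}^{M} p(\mathbf{x}_m|\theta) = \prod_{m=1}^{M} p_m(\theta) =: p^*(\theta), \qquad p_m(\theta) := p(\theta|\mathbf{x}_m) \propto p(\mathbf{x}_m|\theta)\,p(\theta)^{1/M}.$$
In our work, we investigate the properties of the estimator $\hat{p}(\theta|\mathbf{x})$, defined by Miroshnikov and Conlon, that has the form
$$\hat{p}(\theta|\mathbf{x}) \propto \prod_{m=1}^{M} \hat{p}_m(\theta) =: \hat{p}^*(\theta),$$
where $\hat{p}_m(\theta)$ is the logspline density estimator of $p_m(\theta)$ and where we suppressed the information about the data $\mathbf{x}$.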
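The estimated product $\hat{p}^*(\theta) = \prod_{m=1}^{M}\hat{p}_m(\theta)$ of the subset posterior densities is, in general, unnormalized. This motivates us to define the normalization constant $\hat{c}$ for the estimated product $\hat{p}^*$. Thus, the normalized density $\hat{p}$, one of the main points of interest in our work, is given by
$$\hat{p}(\theta) = \hat{c}^{-1}\,\hat{p}^*(\theta), \qquad \hat{c} = \int \hat{p}^*(\theta)\,d\theta.$$
Computing the normalization constant analytically is a difficult task since the subset posterior densities are not explicitly calculated, with the exception of a finite number of points $(\theta_i, \hat{p}_m(\theta_i))$, where $i \in \{1, \dots, n\}$. By taking the product of these values for each $i$ we obtain the value of $\hat{p}^*(\theta_i)$. This allows us to numerically approximate the unnormalized product $\hat{p}^*$ by using Lagrange interpolation polynomials. This approximation is denoted by $\tilde{p}^*$. Then we approximate the constant $\hat{c}$ by numerically integrating $\tilde{p}^*$. The approximation of the normalization constant is denoted by $\tilde{c}$, and the resulting normalized density is given by
$$\tilde{p}(\theta) = \tilde{c}^{-1}\,\tilde{p}^*(\theta), \qquad \tilde{c} = \int \tilde{p}^*(\theta)\,d\theta.$$
The newly defined density $\tilde{p}$ acts as the estimator for the full-data posterior $p$.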
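The construction described above can be illustrated with a minimal Python sketch. It makes several assumptions that are not part of the original text: the subset densities are Gaussian stand-ins rather than logspline estimators, the grid is uniform, the interpolant is piecewise quadratic, and all function names are hypothetical. The composite Simpson rule is used because it is the exact integral of the piecewise quadratic Lagrange interpolant through equally spaced nodes.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2); a stand-in for a subset posterior estimate."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def product_on_grid(subset_densities, grid):
    """Pointwise product of the subset density values at each grid point."""
    return [math.prod(f(t) for f in subset_densities) for t in grid]


def simpson_integral(values, h):
    """Composite Simpson rule on equally spaced nodes with step h.

    This equals the exact integral of the piecewise quadratic Lagrange
    interpolant through the nodes (an even number of intervals is required).
    """
    n = len(values) - 1
    if n < 2 or n % 2 != 0:
        raise ValueError("need an even number of intervals")
    odd = sum(values[1:n:2])
    even = sum(values[2:n:2])
    return h / 3.0 * (values[0] + 4.0 * odd + 2.0 * even + values[n])


# Two hypothetical subset posteriors (assumption: Gaussian stand-ins).
subsets = [lambda t: normal_pdf(t, 0.0, 1.0), lambda t: normal_pdf(t, 1.0, 1.0)]

a, b, n = -8.0, 9.0, 340           # uniform grid on [a, b] with step 0.05
h = (b - a) / n
grid = [a + i * h for i in range(n + 1)]

p_star = product_on_grid(subsets, grid)   # unnormalized product at the grid points
c_tilde = simpson_integral(p_star, h)     # approximate normalization constant
p_tilde = [v / c_tilde for v in p_star]   # normalized estimator at the grid points
```

For these stand-ins the product is proportional to a Gaussian density with mean 0.5 and variance 0.5, which gives a closed-form value to compare the numerical constant against.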
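In this paper, we establish error estimates between the three densities (the full-data posterior, the normalized logspline-based estimator and its interpolated approximation) via the mean integrated squared error (MISE), defined for two functions $f$ and $g$ as
$$\mathrm{MISE}(f, g) = \mathbb{E}\int \big(f(\theta) - g(\theta)\big)^2\,d\theta.$$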
Thus, our work involves two types of approximations: 1) the construction of using logspline density estimators and 2) the construction of the interpolation polynomial . The methodology of logspline density estimation was introduced in  and corresponding error estimates between the estimator and the density it is approximating are presented in [3, 4]. These error estimates depend on three factors: i) the number of samples drawn from the subset posterior density, ii) the number of knots used to create the -order B-splines, and iii) the step-size of those knots, which we denote by .
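To make the role of the number of samples concrete, the following Python sketch estimates a MISE by Monte Carlo. It is only an illustration under stated assumptions: the estimator is a Gaussian fitted by sample mean and standard deviation (not a logspline estimator), the true density is standard normal, and the function names are hypothetical.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrated_squared_error(f, g, a, b, n=400):
    """Trapezoidal approximation of the integral of (f - g)^2 over [a, b]."""
    h = (b - a) / n
    vals = [(f(a + i * h) - g(a + i * h)) ** 2 for i in range(n + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


def monte_carlo_mise(sample_size, replications=50, seed=0):
    """Average the integrated squared error over independent replications.

    Assumption: the estimator is a fitted Gaussian, used here only as a
    simple stand-in for a density estimator built from samples.
    """
    rng = random.Random(seed)
    truth = lambda t: normal_pdf(t, 0.0, 1.0)
    total = 0.0
    for _ in range(replications):
        draws = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        mu = statistics.fmean(draws)
        sigma = statistics.stdev(draws)
        estimate = lambda t, mu=mu, sigma=sigma: normal_pdf(t, mu, sigma)
        total += integrated_squared_error(estimate, truth, -8.0, 8.0)
    return total / replications


mise_small = monte_carlo_mise(50)     # few samples per replication
mise_large = monte_carlo_mise(2000)   # many samples per replication
```

With more samples per replication the averaged error is expected to shrink, which is the qualitative behavior the error estimates quantify.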
In our work we estimate the MISE between the functions and by adapting the estimation techniques introduced in [3, 4]. We then utilize this analysis to establish a similar estimate for the normalized densities and ,
where and is the number of continuous derivatives of . Notice that the exponential contains two terms, where the first depends on the number of samples and the number of knots and the other depends on the placement of the spline knots. Both terms converge to zero and for MISE to scale optimally both terms must converge at the same rate. To this end, we choose and each
to be functions of the vector $\mathbf{N}$ of sample sizes and to scale appropriately with the norm $\|\mathbf{N}\|$. This simplifies the above estimate to
where the parameter is related to the convergence of the logspline density estimators.
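The balancing argument used here can be illustrated on a generic model bound. The form below, its constants $C_1, C_2$ and the exponent $q$ are hypothetical and are not the estimate of this paper; they only show why the two terms should converge at the same rate.

```latex
% Hypothetical model bound: a sampling term and a knot-spacing term.
\[
  E(N,h) \;\le\; \frac{C_1}{N h} + C_2\, h^{2q}, \qquad q > 0 .
\]
% Balancing the two terms:
\[
  \frac{1}{N h} = h^{2q}
  \;\Longleftrightarrow\;
  h = N^{-\frac{1}{2q+1}},
  \qquad\text{so that}\qquad
  E(N,h) = O\!\left(N^{-\frac{2q}{2q+1}}\right).
\]
% Any other choice of h lets one of the two terms decay more slowly
% and therefore dominate the bound.
```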
The estimate for MISE between and is obtained in a similar way by utilizing Lagrange interpolation error bounds, as described in . This error depends on two factors: i) the step-size of the grid points chosen to construct the polynomial, where the grid points correspond to the coordinates discussed earlier, and ii) the degree of the Lagrange polynomial. The estimate obtained is also shown to hold for the normalized densities and .
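The dependence of the interpolation error on the grid step-size and on the polynomial degree can be checked numerically. The sketch below is an illustration under assumptions not taken from the text: a Gaussian-shaped test function, a uniform grid, piecewise interpolation on consecutive blocks of grid intervals, and hypothetical function names. For degree 2, halving the step-size should reduce the maximal error by a factor close to $2^3 = 8$.

```python
import math


def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


def piecewise_lagrange(f, a, b, intervals, degree):
    """Interpolate f on [a, b] by Lagrange polynomials of the given degree,
    one polynomial per block of `degree` consecutive grid intervals."""
    if intervals % degree != 0:
        raise ValueError("intervals must be a multiple of degree")
    h = (b - a) / intervals
    nodes = [a + i * h for i in range(intervals + 1)]
    values = [f(t) for t in nodes]
    blocks = intervals // degree

    def interpolant(x):
        # Locate the block containing x, clamped to the valid range.
        block = min(max(int((x - a) / (degree * h)), 0), blocks - 1)
        lo = block * degree
        return lagrange_eval(nodes[lo:lo + degree + 1],
                             values[lo:lo + degree + 1], x)

    return interpolant


def max_error(f, a, b, intervals, degree, samples=4001):
    """Maximal absolute interpolation error on a fine evaluation mesh."""
    g = piecewise_lagrange(f, a, b, intervals, degree)
    step = (b - a) / (samples - 1)
    return max(abs(f(a + k * step) - g(a + k * step)) for k in range(samples))


f = lambda t: math.exp(-0.5 * t * t)         # smooth test function (assumption)
err_h = max_error(f, -3.0, 3.0, 24, 2)       # step-size 0.25
err_half = max_error(f, -3.0, 3.0, 48, 2)    # step-size 0.125
ratio = err_h / err_half
```

The observed ratio reflects the usual $O(h^{l+1})$ behavior of degree-$l$ Lagrange interpolation for sufficiently smooth functions.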
where is the minimal distance between the spline knots and is chosen to asymptotically scale with the norm of the vector of samples N, see Section 2.
We then combine both estimates to obtain a bound for MISE for the densities and . We obtain
In order for MISE to scale optimally the two terms in the sum must converge to zero at the same rate. As before with the distance between and , we choose to scale appropriately with the norm of the vector N. This leads to the optimum error bound for the distance between the estimator and the density ,
The paper is arranged as follows. In Section 2 we set notation and hypotheses that form the foundation of the analysis. In Section 3 we derive an asymptotic expansion for MISE of the non-normalized estimator, which is central to the analysis performed in subsequent sections. There we also carry out the analysis of MISE for the full data set posterior density estimator . In Section 4, we perform the analysis for the numerical estimator . In Section 5 we present our simulated experiments and discuss the results. Finally, in the appendix we provide supplementary lemmas and theorems employed in Section 3 and Section 4.
2 Notation and hypotheses
For the convenience of the reader we collect in this section all hypotheses and results relevant to our analysis and present the notation that is utilized throughout the article.
For each is a probability density function. We consider the estimator of in the form
and for each is the logspline density estimator of the probability density that has the form
We also consider the additional estimators of as defined in (71) and
are independent, identically distributed random variables and is the logspline density estimate introduced in Definition (37) with number of knots and the order of the B-splines is .
The mean integrated square error of the estimator of the product is defined by
where we use the notation .
We assume that the probability density functions satisfy the following hypotheses:
The number of samples for each subset is parameterized by a governing parameter as follows:
such that for all
Note that .
For each for some fixed in . For the number of knots for each are parameterized by as follows:
where is the number of knots for B-splines on the interval and thus
and we require
For the knots , we write
For each , and density there exists such that
Let denote the -norm on . For defined as in H1, there exists such that
For each subset , the B-splines are created by choosing a uniform knot sequence. Thus,
We assume that scale in a similar way to the number of samples, i.e.,
where is the same as in hypothesis (H6).
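The uniform knot sequence assumed in the hypotheses above can be sketched as follows. The interval endpoints, the number of knots and the function name are hypothetical choices for illustration; the only point is that, for a uniform sequence, the step-size is determined by the interval length and the number of knots.

```python
def uniform_knots(a, b, num_knots):
    """Return equally spaced knots on [a, b] and their common step-size."""
    if num_knots < 2:
        raise ValueError("need at least two knots")
    h = (b - a) / (num_knots - 1)
    return [a + i * h for i in range(num_knots)], h


# Hypothetical example: 11 knots on [0, 1].
knots, h = uniform_knots(0.0, 1.0, 11)
spacings = [knots[i + 1] - knots[i] for i in range(len(knots) - 1)]
```

Because all spacings coincide, the minimal and maximal distances between neighboring knots are the same quantity, so a single step-size controls the knot placement.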
3 Analysis of MISE for
3.1 Error analysis for unnormalized estimator
Suppose we are given a data set x and it is partitioned into disjoint subsets . We are interested in the subset posterior densities . For each such density we apply the analysis from before. Let and be the corresponding logspline estimators as defined in (70) and (71) respectively. By definition of , that is equal to the logspline density estimate on , where is the set defined in (69) for .
For , let be the set defined in (6). We then set
which is the set where the maximizer for the log-likelihood exists given each data subset and thus all logspline density estimators exist.
By Theorem 53 we have that
and the result follows by taking to infinity. ∎
Since the probability of the set where the estimators exist for all tends to 1, it makes sense to carry out our analysis for a conditional MISE on the set . From a practical standpoint, one does not expect to encounter the set where the maximizer of the log-likelihood does not exist.
At this point, we state a bound for which will be essential in our analysis of MISE.
The bound can be shown by writing
and then applying Theorem 56. For each there will be an appearing in the bound and we can take .
Similar to part we can write
and then we apply Lemma 47. For each there will be constants and appearing and we can take .
To see why this is true, we write
where the last step is justified by the fact that . This implies
and then we apply the bounds from the previous two parts.
This leads us directly to the theorem for the conditional MISE of the unnormalized densities and .
where are as in Lemma 3.
In addition, if (H8) holds, then MISE scales optimally with respect to the number of samples,
It is interesting to note how the number of knots, their placement and the number of samples all play a role in the above bound. For the estimate to be accurate, all of the parameters and must be chosen appropriately.
3.2 Analysis for renormalization constant
We will now consider the error that arises for MISE when one renormalizes the product of the estimators so it can be a probability density. The renormalization can affect the error since and are rescaled. We define the renormalization constant and its estimator to be
Therefore, we are interested in analyzing
We first state the following lemma for and .
The above lemma suggests that, when restricted to the sample subspace , the space where the logspline density estimators , are all defined, the renormalization constant of the product of the estimators approximates the true renormalization constant .
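The statement that the estimated renormalization constant approximates the true one can be illustrated numerically. The sketch below is not the logspline setting of the lemma: it assumes two Gaussian subset densities, replaces each estimator by a Gaussian fitted from samples, and uses the closed-form integral of a product of two Gaussian densities. All names and parameter values are hypothetical.

```python
import math
import random
import statistics


def product_constant(mu1, var1, mu2, var2):
    """Integral over the real line of the product of two Gaussian densities.

    Closed form: the N(0, var1 + var2) density evaluated at mu1 - mu2.
    """
    s = var1 + var2
    return math.exp(-0.5 * (mu1 - mu2) ** 2 / s) / math.sqrt(2.0 * math.pi * s)


def estimated_constant(n, rng):
    """Renormalization constant of the product of two fitted Gaussians,
    each fitted from n samples (a stand-in for the subset estimators)."""
    d1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    d2 = [rng.gauss(1.0, 1.0) for _ in range(n)]
    return product_constant(statistics.fmean(d1), statistics.variance(d1),
                            statistics.fmean(d2), statistics.variance(d2))


rng = random.Random(1)
c_true = product_constant(0.0, 1.0, 1.0, 1.0)
rel_errors = {n: abs(estimated_constant(n, rng) / c_true - 1.0)
              for n in (100, 20000)}
```

For large sample sizes the relative error is expected to be small, mirroring the convergence of the estimated constant to the true constant.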
Knowing now how scales we can start analyzing on the sample subspace. However, to make the analysis slightly easier we introduce a new functional, called . This new functional is asymptotically equivalent to MISE as we will show, thus providing us with the means to view how MISE scales without having to directly analyze it.
The functional is asymptotically equivalent to MISE on , in the sense that
Notice that can be written as
and thus by Lemma 6
which then implies the result. ∎
We conclude our analysis with the next theorem, which states how MISE scales for the renormalized estimators.
4 Numerical Error
So far we have estimated the error that arises between the unknown density and the full-data estimator . However, in practice it is difficult to evaluate the renormalization constant
defined in (17). The difficulty is due to the process of generating MCMC samples, and thus is not explicitly known. To circumvent this issue, we approximate the integral above numerically. To accomplish this, we interpolate using Lagrange polynomials. This procedure leads to the construction of an interpolant estimator , which we then integrate numerically. We then normalize and use that as a density estimator for . Unfortunately, estimating the error of this approximation for an arbitrary grid of Lagrange interpolation points, chosen independently of the set of knots for the B-splines, imposes a stringent condition on the smoothness of the B-splines we incorporate. It turns out that we have to utilize B-splines of order at least . For this reason we consider Lagrange polynomials of order which satisfy .
4.1 Interpolation of an estimator: preliminaries
We remind the reader of the model we deal with throughout our work. We recall that the (marginal) posterior of the parameter (which is a component of a multidimensional parameter ) given the data
partitioned into disjoint sets ,