Decompounding discrete distributions: A non-parametric Bayesian approach

03/26/2019 · Shota Gugushvili et al. · Universität Potsdam and Delft University of Technology

Suppose that a compound Poisson process is observed discretely in time and assume that its jump distribution is supported on the set of natural numbers. In this paper we propose a non-parametric Bayesian approach to estimate the intensity of the underlying Poisson process and the distribution of the jumps. We provide an MCMC scheme for obtaining samples from the posterior. We apply our method to both simulated and real data examples, and compare its performance with the frequentist plug-in estimator proposed by Buchmann and Grübel. On the theoretical side, we study the posterior from the frequentist point of view and prove that, as the sample size $n \to \infty$, it contracts around the 'true', data-generating parameters at rate $1/\sqrt{n}$, up to a logarithmic factor.


1. Introduction

1.1. Problem formulation

Let $N = (N_t)_{t \geq 0}$ be a Poisson process with a constant intensity $\lambda > 0$, and let $Y_1, Y_2, \ldots$ be a sequence of independent random variables, each with distribution $F$, that are also independent of $N$. By definition, a compound Poisson process (abbreviated CPP) is

$$X_t = \sum_{j=1}^{N_t} Y_j, \quad t \geq 0, \tag{1}$$

where here and below the sum over an empty index set is understood to be equal to zero. In particular, $X_0 = 0$. The CPP constitutes a classical model in, e.g., risk theory; see Embrechts et al. (1997).

Assume that the process $X$ is observed at discrete times $0 = t_0 < t_1 < \cdots < t_n$, where the grid $\{t_i\}$ is not necessarily uniform. Based on the observations $X_{t_1}, \ldots, X_{t_n}$, our goal is to estimate the jump size distribution $F$ and the intensity $\lambda$. We specifically restrict our attention to the case where $F$ is a discrete distribution supported on $\mathbb{N} = \{1, 2, \ldots\}$, and we will write $f = (f_k)_{k \in \mathbb{N}}$ for the probability mass function corresponding to $F$, where $f_k = F(\{k\})$. A similar notation will be used for any other discrete law. The distribution $F$ is called the base distribution. Abusing terminology, we will at times identify it with the corresponding probability mass function $f$. The assumption that $F$ has no atom at zero is made for identifiability: otherwise this atom gets confounded with $\lambda$, which precludes consistent estimation of the intensity. For a discussion of applications of this CPP model in risk theory, see Zhang et al. (2014).

Define the increments $Z_i = X_{t_i} - X_{t_{i-1}}$, $i = 1, \ldots, n$. Then $Z_1, \ldots, Z_n$ is a sequence of independent random variables. When the observation times are uniformly spaced, $t_i = i\Delta$ for some $\Delta > 0$, the $Z_i$ in fact have a common distribution, namely that of $X_\Delta$. As the increments carry as much information as the observations $X_{t_1}, \ldots, X_{t_n}$ do, we can base our estimation procedure directly on $Z_1, \ldots, Z_n$. Since summing up the jumps $Y_j$ amounts to compounding their distributions, the inverse problem of recovering $F$ and $\lambda$ from the increments can be referred to as decompounding; see Buchmann and Grübel (2003).
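To fix ideas, the observation scheme is straightforward to simulate. Below is a minimal Julia sketch (assuming the Distributions.jl package; the function name simulate_increments is ours, not taken from the paper's code):

```julia
using Distributions, Random

# Simulate the increments Z_i of a discrete compound Poisson process
# observed on a (possibly non-uniform) time grid t_1 < t_2 < ... < t_n.
# `f` is the pmf of the jump sizes on {1, ..., M}; `λ` is the intensity.
function simulate_increments(rng, λ, f, t)
    Δ = diff(vcat(0.0, t))             # inter-observation times Δ_i
    jumps = Categorical(f)             # jump equals k with probability f[k]
    map(Δ) do δ
        m = rand(rng, Poisson(λ * δ))  # number of jumps in the interval
        m == 0 ? 0 : sum(rand(rng, jumps, m))
    end
end

rng = MersenneTwister(1)
Z = simulate_increments(rng, 1.0, fill(0.2, 5), collect(1.0:500.0))
```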

There are two natural ways to parametrise the CPP model: either in terms of the pair $(\lambda, F)$, or in terms of the Lévy measure $\nu$ of the process $X$; see Sato (2013). The relationship between the two is $\nu = \lambda F$, so that $\nu_k = \lambda f_k$, $\lambda = \sum_k \nu_k$ and $f_k = \nu_k / \lambda$. Inferential conclusions in one parametrisation can be easily translated into inferential conclusions in the other. However, for our specific statistical approach the Lévy measure parametrisation turns out to be more advantageous from the computational point of view.

1.2. Approach and results

In this paper, we take a non-parametric Bayesian approach to estimation of the Lévy measure $\nu$ of $X$. See Ghosal and van der Vaart (2017) and Müller et al. (2015) for modern expositions of Bayesian non-parametrics. A case for non-parametric Bayesian methods has already been made elsewhere in the literature, and will not be repeated here. On the practical side, we implement our procedure via the Gibbs sampler and data augmentation, and show that it performs well under various simulation setups. On the theoretical side, we establish its consistency and derive the corresponding posterior contraction rate, which can be thought of as an analogue of the convergence rate of a frequentist estimator (see Ghosal and van der Vaart (2017)). The posterior contraction rate, up to a practically insignificant logarithmic factor, turns out to be $1/\sqrt{n}$, which is the optimal rate for non-parametric estimation of cumulative distribution functions. Our contribution thus bridges practical and theoretical aspects of Bayesian non-parametrics.

1.3. Related literature

To provide better motivation for our model and approach, in this subsection we briefly survey the existing literature. A Bayesian approach to non-parametric inference for Lévy processes is a very recent and emerging topic, with references limited at the moment to Belomestny et al. (2018), Gugushvili et al. (2015), Gugushvili et al. (2018) and Nickl and Söhl (2017). These deal exclusively with the case where the Lévy measure is absolutely continuous with respect to the Lebesgue measure. At least from the computational point of view, these works are of no help in our present context.

Related frequentist papers for CPP models with discrete base distributions are Buchmann and Grübel (2003) and Buchmann and Grübel (2004), which, after earlier contributions dating from the previous century, in fact revived interest in non-parametric techniques for Lévy processes. To estimate the base distribution, Buchmann and Grübel (2003) employ a frequentist plug-in approach relying on the Panjer recursion (i.e., an empirical estimate of the distribution of the increments is plugged into the Panjer recursion equations to yield an estimate of the base distribution; see below on the Panjer recursion). The drawback is that the parameter estimates are not guaranteed to be non-negative. Buchmann and Grübel (2004) fix this problem by truncation and renormalisation. This works, but looks artificial. As noted in Buchmann and Grübel (2004), in practice the latter approach breaks down if no zero values are observed among the increments. Buchmann and Grübel (2004) establish weak convergence of their modified estimator, but on the downside its asymptotic distribution is too unwieldy to yield confidence statements. Most importantly, the plug-in approaches in Buchmann and Grübel (2003) and Buchmann and Grübel (2004) do not allow obvious generalisations to non-uniformly spaced observation times. In Lindo et al. (2018), another frequentist estimator of the jump measure is introduced, which is obtained via the steepest descent technique as a solution to an optimisation problem over the cone of positive measures. The emphasis in Lindo et al. (2018) is on numerical aspects; again, no obvious generalisation to the case of non-uniform observation times is available.

Finally, some important, predominantly theoretical references on inference for Lévy processes are Comte and Genon-Catalot (2011), Duval and Hoffmann (2011), Kappus (2014), Neumann and Reiß (2009), Nickl and Reiß (2012), van Es et al. (2007) and Trabs (2015). We refer to Belomestny et al. (2015), Coca (2018a), Coca (2018b) and Duval and Mariucci (2017) for more extensive literature surveys.

1.4. Outline

The rest of the paper is organised as follows: in Section 2 we introduce our approach and describe an algorithm for drawing from the posterior distribution. In Sections 3 and 4 we study its performance on synthetic and real examples. Section 5 is devoted to the examination of asymptotic frequentist properties of our procedure. An outlook on our results is given in Section 6. Finally, in Appendix A technical lemmas used in the proofs of Section 5 are collected.

1.5. Notation

For two sequences $(a_n)$ and $(b_n)$ of positive real numbers, the notation $a_n \lesssim b_n$ (or $b_n \gtrsim a_n$) means that there exists a constant $C > 0$, independent of $n$, such that $a_n \leq C b_n$. We write $a_n \asymp b_n$ if both $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ hold. We denote a prior (possibly depending on the sample size $n$) by $\Pi$. The corresponding posterior measure is denoted by $\Pi(\cdot \mid Z_1, \ldots, Z_n)$.

The Gamma distribution with shape parameter $\alpha > 0$ and rate parameter $\beta > 0$ is denoted by $\mathrm{Gamma}(\alpha, \beta)$. Its density is given by $x \mapsto \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$, $x > 0$, where $\Gamma$ is the Gamma function. The inverse Gamma distribution with shape parameter $\alpha > 0$ and scale parameter $\beta > 0$ is denoted by $\mathrm{IG}(\alpha, \beta)$. Its density is $x \mapsto \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{-\alpha - 1} e^{-\beta/x}$, $x > 0$. We use the notation $\mathrm{Exp}(\mu)$ for an exponential distribution with mean $\mu$.

2. Algorithm for drawing from the posterior

A Bayesian statistical approach combines the likelihood and the prior on the parameter of interest through Bayes' formula. We start by specifying the prior. As far as the likelihood is concerned, although explicit, it is computationally intractable for non-parametric inference in CPP models. We will circumvent the latter problem by means of data augmentation, as detailed below.

2.1. Prior

We define a prior on the vector $\nu = (\nu_1, \ldots, \nu_M)$ through a hierarchical specification:

$$\eta \sim \mathrm{IG}(a, b), \qquad \nu_1, \ldots, \nu_M \mid \eta \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(\eta).$$

Note that the (fixed) hyperparameters $a$ and $b$ are denoted by Latin letters.

The hyperparameter $M$ incorporates our a priori opinion on the support of the Lévy measure $\nu$, or equivalently, the base measure $F$. In applications, the support of $F$ may be unknown, which necessitates the use of a large $M$, e.g. $M = \max_i Z_i$; this latter is the maximal value suggested by the data at hand. Nevertheless, we may simultaneously expect that the 'true', data-generating $\nu$ assigns its full mass to a proper, perhaps even small subset of $\{1, \ldots, M\}$. In other words, $\nu$ may form a sparse sequence, with many components equal to zero. In fact, there are at least two plausible explanations for the occurrence of a large increment in the data: either a few large jumps $Y_j$ occurred, which points towards a large right endpoint of the support of $F$, or the increment is predominantly formed of many small jumps, which in turn indicates that the intensity $\lambda$ of the Poisson arrival process may be large. To achieve accurate estimation results, a prior should take a possible sparsity of $\nu$ into account. This is precisely the reason for our hierarchical definition of the prior: a small $\eta$ encourages a priori shrinkage of the components of $\nu$ towards zero.
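For concreteness, a draw from this prior is cheap to generate; the following minimal Julia sketch assumes the Exp-IG hierarchy written above, with illustrative (not the paper's) hyperparameter values:

```julia
using Distributions

# One draw from the hierarchical prior: η ~ IG(a, b), then
# ν_1, ..., ν_M i.i.d. exponential with mean η (small η shrinks ν to zero).
function draw_prior(M; a = 3.0, b = 1.0)    # a, b are illustrative values
    η = rand(InverseGamma(a, b))
    ν = rand(Exponential(η), M)  # Exponential(θ) has mean θ in Distributions.jl
    return ν, η
end

ν, η = draw_prior(10)
```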

2.2. Data augmentation

Assume temporarily that $t_i = i\Delta$, and write $q = (q_k)_{k \geq 0}$ for the probability mass function of the increments. Then $Z_1, \ldots, Z_n$ have the distribution

$$q = e^{-\lambda\Delta} \sum_{m=0}^{\infty} \frac{(\lambda\Delta)^m}{m!} f^{*m}, \tag{2}$$

with $*$ denoting convolution (and $f^{*0}$ the point mass at zero). The compounding mapping $f \mapsto q$ can be expressed explicitly via the Panjer recursion (see Panjer (1981)):

$$q_0 = e^{-\lambda\Delta}, \qquad q_k = \frac{\lambda\Delta}{k} \sum_{j=1}^{k} j f_j q_{k-j}, \quad k \geq 1.$$

This recursion can be inverted to give the inverse mapping $q \mapsto f$ via

$$f_k = \frac{q_k}{\lambda\Delta\, q_0} - \frac{1}{k q_0} \sum_{j=1}^{k-1} j f_j q_{k-j}, \quad k \geq 1.$$
In view of (2), the likelihood in the CPP model is explicit. Nevertheless, an attempt to use (2) or the Panjer recursion directly in posterior computations results in a numerically intractable procedure. Equally important is the fact that a Panjer-recursion-based approach would not apply to non-uniformly spaced observation times. Therefore, instead of (2) and the Panjer recursion, we will employ data augmentation; see Tanner and Wong (1987). We switch back to the case where the $t_i$ are not necessarily uniformly spaced. The details of our procedure are as follows: when the process is observed continuously over the time interval $[0, T]$, $T = t_n$, so that our observations are a full sample path of the CPP, the likelihood is tractable and is proportional to

$$\prod_{k=1}^{M} \nu_k^{N_k} e^{-T \nu_k};$$

see Shreve (2004), p. 498. Here

$$N_k = \#\{ j \leq N_T : Y_j = k \}, \tag{3}$$

i.e. the total number of jumps of size $k$ on $[0, T]$. Then the prior from Subsection 2.1 leads to conjugate posterior computations. In fact, the full conditionals are

$$\nu_k \mid \eta, \text{data} \sim \mathrm{Gamma}\left(N_k + 1,\ T + \frac{1}{\eta}\right), \quad k = 1, \ldots, M, \qquad \eta \mid \nu \sim \mathrm{IG}\Big(a + M,\ b + \sum_{k=1}^{M} \nu_k\Big).$$

Therefore, the Gibbs sampler for posterior inference on $(\nu, \eta)$ can be implemented. The Gibbs sampler cycles through the above conditionals a large number of times, generating approximate (dependent) samples from the posterior. See, e.g., Gelfand and Smith (1990) and Section 24.5 in Wasserman (2004) on the Gibbs sampler and its use in Bayesian statistics.
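One sweep of this Gibbs sampler amounts to the following Julia sketch, assuming the conditionals displayed above (Gamma(shape, scale) is the parameterisation used by Distributions.jl, so the rate $T + 1/\eta$ enters as its reciprocal):

```julia
using Distributions

# One Gibbs sweep for (ν, η) given the jump counts N[k] over [0, T]:
# ν_k | rest ~ Gamma(N_k + 1, rate = T + 1/η),
# η   | rest ~ IG(a + M, b + Σ_k ν_k).
function gibbs_sweep!(ν, η, N, T, a, b)
    M = length(N)
    for k in 1:M
        ν[k] = rand(Gamma(N[k] + 1, 1 / (T + 1 / η)))  # scale = 1/rate
    end
    η = rand(InverseGamma(a + M, b + sum(ν)))
    return η
end
```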

As we do not observe the process continuously, we will combine the above with the data augmentation device. First note that we have

$$X_t = \sum_{k=1}^{M} k\, N_t^{k},$$

where $N^1, \ldots, N^M$ are independent Poisson processes, and $N^k$ has intensity $\nu_k = \lambda f_k$ for $k = 1, \ldots, M$; see Corollary 11.3.4 in Shreve (2004). Furthermore, for $N_k$ as in (3) we trivially have $N_k = N_T^{k}$. Data augmentation iterates the following two steps (a code skeleton is given after the list):

  1. Draw the jump counts of each size in each observation interval, conditional on the data $Z_1, \ldots, Z_n$ and the parameter $\nu$.

  2. Draw $(\nu, \eta)$ conditional on the augmented data.

Once the algorithm has been run long enough, this gives approximate (dependent) samples from the posterior of $(\nu, \eta)$. We already know how to deal with step 2; it remains to handle step 1.
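Schematically, the sampler therefore looks as follows; this is a sketch in which draw_counts! stands for the Metropolis-Hastings update of the imputed jump counts developed below, and gibbs_sweep! for the conjugate update above:

```julia
# Skeleton of the data augmentation sampler. counts[i, k] holds the imputed
# number of size-k jumps in the i-th observation interval.
function run_sampler!(counts, ν, η, Z, Δ, T, a, b; iters = 10_000)
    for it in 1:iters
        draw_counts!(counts, Z, Δ, ν)       # step 1: impute the jump counts
        N = vec(sum(counts, dims = 1))      # total number of jumps of each size
        η = gibbs_sweep!(ν, η, N, T, a, b)  # step 2: conjugate update of (ν, η)
    end
    return ν, η
end
```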

Thus, keeping $\nu$ fixed, for each $i$ we want to compute the conditional distribution of the vector of jump counts $n^i = (n_1^i, \ldots, n_M^i)$ in the $i$th observation interval given $Z_i$, and furthermore, we want to be able to simulate from this distribution. In turn, this will immediately allow us to simulate the augmented data conditional on the observations. Now, with $\mathrm{P}_{\nu}$ referring to probability under the parameter $\nu$, it holds that

$$\mathrm{P}_{\nu}\left( n^i = (n_1, \ldots, n_M) \mid Z_i = z \right) \propto \prod_{k=1}^{M} \frac{(\nu_k \Delta_i)^{n_k}}{n_k!}, \qquad \sum_{k=1}^{M} k n_k = z,$$

where $\Delta_i = t_i - t_{i-1}$. Knowledge of the normalising constant will not be needed in our approach.
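As a small illustration of this formula (ours, not from the paper): take $M = 3$ and $z = 3$. The solutions of $\sum_k k n_k = 3$ are $(3, 0, 0)$, $(1, 1, 0)$ and $(0, 0, 1)$, with conditional probabilities proportional to

$$\frac{(\nu_1 \Delta_i)^3}{3!}, \qquad (\nu_1 \Delta_i)(\nu_2 \Delta_i), \qquad \nu_3 \Delta_i,$$

respectively.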

In general, simulation from a discrete multivariate distribution is non-trivial; some general options are discussed in Devroye (1986), Chapter XI, Section 1.5, but these are unlikely to work easily for a large $M$. We will take an alternative route and use the Metropolis-Hastings algorithm; see, e.g., Section 24.4 in Wasserman (2004). We start by observing that for a fixed $z$, the support of the above conditional distribution is precisely the set $\mathcal{S}_z$ of non-negative integer solutions of the Diophantine equation $\sum_{k=1}^{M} k n_k = z$. The R package nilde (see Pya Arnqvist et al. (2018)) implements an algorithm from Voinov and Nikulin (1997) that finds all such solutions for given integers $z$ and $M$. By Markovianity of the process $X$, we can simulate the vectors $n^i$ independently for each $i$. If $Z_i = 0$ or $Z_i = 1$, there is only one solution to the Diophantine equation: the trivial solution $(0, \ldots, 0)$ in the first case, and $(1, 0, \ldots, 0)$ in the second; for such $Z_i$, no simulation is required, as $n^i$ is known explicitly. We thus only need to consider each value $z \geq 2$ in turn, and design a Metropolis-Hastings move on the set $\mathcal{S}_z$ of the corresponding solutions. Fix once and for all an ordering of the elements of $\mathcal{S}_z$ (this could be, e.g., lexicographic, or reverse lexicographic); we use the notation $|\mathcal{S}_z|$ for the cardinality of $\mathcal{S}_z$. Let the current state of the chain correspond to the $r$th element of $\mathcal{S}_z$. A proposal $\tilde{r}$ is obtained as follows:

  1. If $1 < r < |\mathcal{S}_z|$, draw $\tilde{r}$ uniformly at random among the elements of $\{r - 1, r + 1\}$.

  2. If $r = 1$, draw $\tilde{r}$ uniformly at random among the elements of $\{1, 2\}$.

  3. If $r = |\mathcal{S}_z|$, draw $\tilde{r}$ uniformly at random among the elements of $\{|\mathcal{S}_z| - 1, |\mathcal{S}_z|\}$.

Occasionally, one may want to propose another type of move too:

  1. Draw $\tilde{r}$ uniformly at random from $\{1, \ldots, |\mathcal{S}_z|\}$.

The two proposals lead to reversible moves, and one may also alternate them with probabilities $\rho$ and $1 - \rho$. The logarithm of the acceptance probability of a move from the solution $n$ (the $r$th element of $\mathcal{S}_z$) to $\tilde{n}$ (the $\tilde{r}$th element) is computed as

$$\log A = \sum_{k=1}^{M} \left\{ (\tilde{n}_k - n_k) \log(\nu_k \Delta_i) + \log(n_k!) - \log(\tilde{n}_k!) \right\}.$$

The move is accepted if $\log U \leq \log A$ for an independently generated uniform random variate $U$ on $[0, 1]$, in which case the current state of the chain is reset to $\tilde{n}$. Otherwise the chain stays in $n$.
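The following Julia sketch illustrates both ingredients: a recursive enumeration of $\mathcal{S}_z$ (used here in place of the R package nilde) and one Metropolis-Hastings step with the reflecting local proposal. Function names are ours, and the SpecialFunctions.jl package is assumed for logfactorial:

```julia
using SpecialFunctions   # provides logfactorial

# All non-negative integer solutions (n_1, ..., n_M) of Σ_k k·n_k = z,
# found by recursing over the number n_M of jumps of the largest size M.
function solutions(z::Int, M::Int)
    M == 1 && return [[z]]
    sols = Vector{Vector{Int}}()
    for nM in 0:div(z, M), s in solutions(z - M * nM, M - 1)
        push!(sols, vcat(s, nM))
    end
    return sols
end

# Unnormalised log-probability of a solution n, for interval length Δ.
logtarget(n, ν, Δ) =
    sum(n[k] * log(ν[k] * Δ) - logfactorial(n[k]) for k in eachindex(n))

# One Metropolis-Hastings step on the index r into S = solutions(z, M),
# using the symmetric local proposal (assumes length(S) ≥ 2; the cases
# z = 0, 1 need no simulation).
function mh_step(r, S, ν, Δ)
    if r == 1
        rprop = rand([1, 2])
    elseif r == length(S)
        rprop = rand([length(S) - 1, length(S)])
    else
        rprop = rand([r - 1, r + 1])
    end
    logA = logtarget(S[rprop], ν, Δ) - logtarget(S[r], ν, Δ)
    return log(rand()) ≤ logA ? rprop : r
end
```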

3. Simulation examples

In this section, we test the performance of our approach in a range of representative simulation examples. For benchmarking, we use the frequentist plug-in estimator from Buchmann and Grübel (2004). Two real data examples are given in Section 4. Unless otherwise stated, we used fixed values of the hyperparameters $a$ and $b$ in our prior specification. As can be seen from the update formulae for the Gibbs sampler, as long as $b$ is not taken too large, its precise value is not very influential on the posterior, given a reasonable sample size. The chosen value of $a$ ensures that the update step for $\eta$ has finite variance. At each step of updating the imputed data for increment size $z$, we chose with a fixed probability to propose uniformly from all solutions to the Diophantine equation (for that particular value of $z$).

We implemented our procedure in Julia; see Bezanson et al. (2017). The code and datasets for replication of our examples are available on GitHub (https://github.com/fmeulen/Bdd) and Zenodo; see Gugushvili et al. (2019).

3.1. Uniform base distribution

This simulation example follows, with some extensions, the one in Buchmann and Grübel (2004). We let the base distribution $f$ be a discrete uniform distribution, and simulated data according to the following settings:

  • (a): equidistant observation times;

  • (b): equidistant observation times, with the data under (a) augmented with extra observations;

  • (c): non-uniformly spaced observation times.

We set $M = \max_i Z_i$, the maximal observed increment; its value differed across the simulated data under settings (a), (b) and (c). The Gibbs sampler was run for a large number of iterations, of which the first portion was discarded as burn-in. From the remaining samples, the posterior mean and the lower and upper percentiles forming pointwise credible intervals were computed for each coefficient $\nu_k$. The results for the first several coefficients are shown in Figure 1. For comparison, the estimator from Buchmann and Grübel (2004) is also included in the figure.

Figure 1. Simulation example from Section 3.1. In each panel, the horizontal axis indexes the coefficients $\nu_k$. The orange balls denote the true values, the black triangles the Buchmann-Grübel estimator. The blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) credible intervals. The settings corresponding to (a), (b) and (c) are explained in the main text. Note the differences in vertical scale across the panels.

For setting (b), traceplots of a thinned chain for a couple of the coefficients are shown in Figure 2.

Figure 2. Traceplots for the simulation example from Section 3.1 under setting (b). The posterior samples were thinned, with a fixed subsampling interval. The displayed results are for three of the parameters $\nu_k$.

We measure the error of an estimate $\hat{\nu}$ by the summed absolute deviations $\sum_k |\hat{\nu}_k - \nu_{0,k}|$ from the true values. The errors are reported in Table 1. In all settings, for these particular realisations of the simulated data, the Bayesian procedure outperformed the truncated estimator from Buchmann and Grübel (2004). For setting (c), the latter is completely off, as was to be expected, given that it is derived under the assumption of equidistant observation times. An advantage of the Bayesian procedure is the built-in measure of uncertainty, namely the credible intervals for the $\nu_k$. On the other hand, for the Buchmann-Grübel estimator it is hardly possible to derive confidence intervals, since the limiting distribution of the estimator is fairly complicated.

Table 1. Estimation errors of the Buchmann-Grübel estimator and of the posterior mean under the simulation settings (a), (b) and (c) from Section 3.1. See the text for details.

3.2. Geometric base distribution

The setup of this synthetic data example likewise follows that in Buchmann and Grübel (2004). Assume $f$ is a geometric distribution with parameter $p \in (0, 1)$, i.e. $f_k = p(1 - p)^{k-1}$ for $k = 1, 2, \ldots$ In particular, the support of $f$ is no longer finite.
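In terms of the Lévy measure (our rewriting), this example reads

$$\nu_k = \lambda f_k = \lambda\, p (1 - p)^{k - 1}, \qquad k = 1, 2, \ldots,$$

so that, in contrast to Section 3.1, $\nu$ has infinitely many non-zero components; cf. Remark 1 in Section 5.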

We consider two simulation setups:

  • (a): increments generated under a larger value of the parameter $p$;

  • (b): increments generated under a smaller value of the parameter $p$.

We set $M = \max_i Z_i$ and ran the sampler according to the settings of Section 3.1. The results for both scenarios (a) and (b) are reported in Figure 3. In Table 2 we also report the estimation errors in one simulation run. For this example and these generated data, the Bayesian procedure gives less precise point estimates than the Buchmann-Grübel method. Note that the estimation error under setting (a) is smaller than that under setting (b). This appears intuitive, as a smaller value of $p$ corresponds to a larger mean of the base distribution; the increments are then typically larger and admit many more possible decompositions into jumps, which renders the decompounding problem more difficult. However, this argument is hard to formalise.

Table 2. Estimation errors of the Buchmann-Grübel estimator and of the posterior mean under the simulation settings (a) and (b) from Section 3.2. See the text for details.

Figure 3. Simulation example from Section 3.2. Settings (a) and (b) correspond to the two true jump distributions described in the main text. The horizontal axis indexes the coefficients $\nu_k$. The orange balls denote the true values, the black triangles the Buchmann-Grübel estimator. The blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) credible intervals.

3.3. Monte Carlo study

For a more thorough comparison of the Buchmann-Grübel estimator and our Bayesian method, we performed a small Monte Carlo experiment. We considered two settings:

  1. The setting from Section 3.1 with . We took .

  2. The setting from Section 3.2 with . We took .

In both cases we assumed for all . The number of Monte Carlo repetitions was taken equal to . We took MCMC iterations and discarded the first half of these as burn-in. In Figure 4 we give a graphical display of the results by means of boxplots of the errors. Here, as before, if the true values are denoted by and the estimate within a particular simulation run by , the error is defined by (we truncated the infinite summation to ). The results agree with our earlier findings, in that there is no clear “winner” in the comparison. Note that for the setting (ii) we considered both and in the prior specification. Both values give similar performance of the Bayesian method. This provides insight into sensitivity of our results with respect to the prior specification. A minor difference between the middle and rightmost panels of Figure 4 may be attributed to Monte Carlo error: the simulated datasets on which these panels are based on are not the same. Note that our prior promotes sparsity, and in that respect it is not surprising it does better when the true data-generating Lévy measure is ‘sparse’.

Figure 4. Monte Carlo study from Subsection 3.3 comparing the Buchmann-Grübel estimator and the Bayesian method proposed in this paper. In this figure "bg" refers to the Buchmann-Grübel estimator, while "bayesmedian" and "bayesmean" refer to the Bayesian method, where either the posterior median or the posterior mean was used as a point estimator for each $\nu_k$. The leftmost panel corresponds to setting (i), whereas the other two panels correspond to setting (ii). In the latter, two values of the hyperparameter $a$ were used in the prior specification.

3.4. Computing time

In terms of computational effort, the time it takes to evaluate the Buchmann-Grübel estimator is negligible compared to that of our algorithm for sampling from the posterior. This is not surprising, as the frequentist estimator relies on a plug-in approach, whereas in our case an approximation to the posterior is obtained by MCMC simulation. However, if proper uncertainty quantification is desired, the Bayesian method is advantageous in that it does not solely produce a point estimate.

Note that the proposed MCMC scheme requires determination of the solutions to the Diophantine equation for all unique values in the observation set. For moderate values of $z$ this is rather quick, but for large values of $z$ the computing time increases exponentially, as does the amount of allocated memory. The computing time of each Metropolis-Hastings step is then very small, but we potentially need a very large number of iterations to reach stationarity. The latter is caused firstly by the fact that at a particular iteration, our proposals for the imputed jump counts do not take into account the current value of $\nu$; secondly, the size of the state space that needs to be explored increases exponentially with $z$.
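The blow-up is easy to quantify: for $M \geq z$, the cardinality $|\mathcal{S}_z|$ equals the number of integer partitions of $z$, which grows like $\exp(\pi\sqrt{2z/3})$ by the Hardy-Ramanujan asymptotics (e.g., there are 42 partitions of 10, but 37338 of 40). A quick check with the solutions function sketched in Section 2.2:

```julia
# Count the solutions of Σ_k k·n_k = z with all jump sizes up to M = z;
# this equals the number of integer partitions of z.
for z in (10, 20, 30, 40)
    println(z, " => ", length(solutions(z, z)))
end
```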

4. Real data examples

4.1. Horse kick data

To further illustrate our procedure, we will use the von Bortkewitsch data on the number of soldiers in the Prussian cavalry killed by horse kicks (available by year and by cavalry corps); this example was also employed in Buchmann and Grübel (2003). Each observation is an integer between 0 and 4, giving the number of deaths for a given year and a given cavalry corps, with overall counts reported in Table 3. The data are extracted from the table on p. 25 in von Bortkewitsch (1898). Note that von Bortkewitsch corrects for the fact that the Guards and the I, VI and XI cavalry corps of the Prussian army had a different organisation from other units, and justifiably omits the corresponding counts from consideration.

Von Bortkewitsch demonstrated that the Poisson distribution fits the horse kick data remarkably well. Assuming instead that the observations follow a compound Poisson distribution is a stretch of the imagination, as that would correspond to a horse running amok and possibly killing more than one soldier in one go. Nevertheless, this example provides a good sanity check for our statistical procedure.

Deaths   0    1    2    3    4
Counts   144  91   32   11   2

Table 3. Data on the number of soldiers in the Prussian cavalry killed by horse kicks. See the table on p. 25 in von Bortkewitsch (1898).

The estimation results are graphically depicted in Figure 5. Clearly, point estimates of both methods are in agreement and lend support to the Poisson model for this dataset.

Figure 5. Estimation for the horse kick data from Subsection 4.1. The horizontal axis indexes the coefficients $\nu_k$. The black triangles denote the Buchmann-Grübel estimator, the blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) credible intervals.

4.2. Plant data

Our second real example is the one used in Buchmann and Grübel (2004). Consider the data in Table 4, taken from Evans (1953). The data were collected as follows: an area was divided into plots of equal size, and in each plot the number of plants was counted. The second row of Table 4 gives the counts of plots containing a given number of plants; thus, the first entry is the number of plots that contained no plants, the second the number of plots that contained one plant, and so on. It is customary in the ecological literature to model such count data as i.i.d. realisations from a compound Poisson distribution. Thus, e.g., Neyman (1939) advocated the use of a Poisson base distribution in this context; another option here is a geometric base distribution. Given the existence of several distinct modelling possibilities, performing an exploratory non-parametric analysis appears to be a sensible strategy.

The estimation results are graphically depicted in Figure 6. There are some small differences between the posterior mean and the Buchmann-Grübel estimate, but overall they are very similar.

Table 4. Plant population data from Evans (1953): counts of plots containing a given number of plants.
Figure 6. Estimation results for the plant data from Subsection 4.2. The horizontal axis indexes the coefficients $\nu_k$. The black triangles denote the Buchmann-Grübel estimates, the blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) credible intervals.

5. Frequentist asymptotics

In this section we assume that the observation times are equidistant: $t_i = i\Delta$ for a fixed $\Delta > 0$. To evaluate our Bayesian method from a theoretical point of view, we will verify that it is consistent, and we will establish the rate at which the posterior contracts around the 'true', data-generating Lévy measure $\nu_0$; see Ghosal and van der Vaart (2017) for a thorough treatment of Bayesian asymptotics from the frequentist point of view. From now on the subscript $0$ in various quantities will refer to the data-generating distribution.

Our strategy consists in proving that the posterior contraction rate for $\nu$, given the sample $Z_1, \ldots, Z_n$, can be derived from the posterior contraction rate for the compound distribution $q$ given $Z_1, \ldots, Z_n$, which is mathematically easier, since $Z_1, \ldots, Z_n$ is a sequence of independent and identically distributed random variables with distribution $q$. We thereby effectively avoid dealing directly with the inverse nature of the problem of estimating $\nu$.

The prior we consider in this section is defined as follows:

  • Endow the rate $\lambda$ of the Poisson process with a Gamma prior distribution.

  • Independently, endow the vector $(f_1, \ldots, f_M)$ with a Dirichlet distribution with parameter $(a_1, \ldots, a_M)$.

  • Set a priori $f_k = 0$ for all $k > M$.

This is a somewhat simplified version of the prior we used in Section 2, which allows us to concentrate on the essential features of the problem, without cluttering the analysis with extra and unenlightening technicalities. Also remember the well-known relationship between the Gamma and Dirichlet distributions: if $W_1, \ldots, W_M$ are independent Gamma distributed random variables, $W_k \sim \mathrm{Gamma}(a_k, 1)$, then for $W = W_1 + \cdots + W_M$, the vector $(W_1/W, \ldots, W_M/W)$ follows the Dirichlet distribution with parameter $(a_1, \ldots, a_M)$; furthermore, we have that $W \sim \mathrm{Gamma}(a_1 + \cdots + a_M, 1)$, and $(W_1/W, \ldots, W_M/W)$ and $W$ are independent.
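This relationship is easy to verify by simulation; a quick Julia sketch (assuming Distributions.jl):

```julia
using Distributions, Random

# Dirichlet via normalised independent Gammas: with W_k ~ Gamma(a_k, 1) and
# W = Σ_k W_k, the vector (W_1/W, ..., W_M/W) is Dirichlet(a_1, ..., a_M),
# W is Gamma(Σ_k a_k, 1), and the two are independent.
rng = MersenneTwister(1)
a = [0.5, 1.0, 2.0]
W = [rand(rng, Gamma(ak, 1.0)) for ak in a]
x = W ./ sum(W)   # a Dirichlet(a) draw
S = sum(W)        # a Gamma(sum(a), 1) draw, independent of x
```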

In our asymptotic setting, we will make $M$ dependent on $n$ and let $M = M_n \to \infty$ at a suitable rate as $n \to \infty$.

Recall that we write $\nu$ for $\lambda F$. Let $\mathcal{F}$ denote the collection of all probability measures supported on $\mathbb{N}$.

Theorem 1.

Suppose there exists a constant $\underline{a} > 0$, such that the parameters of the Dirichlet distribution satisfy $a_k \geq \underline{a}$ for all $k$. Suppose $\nu_0 = \lambda_0 F_0$ with $\lambda_0 > 0$ and that $F_0$ has compact support. Then, for a sufficiently large constant $C > 0$,

$$\Pi\left( \nu : \|\nu - \nu_0\|_1 > C \sqrt{\frac{\log n}{n}} \;\Big|\; Z_1, \ldots, Z_n \right) \to 0$$

in $P_0$-probability, as $n \to \infty$.

Remark 1.

Note that since the support of $F_0$ is not assumed to be known, our CPP model remains genuinely non-parametric. The assumption of compact support of $F_0$ does not cover the simulation example of Section 3.2. However, its relaxation appears to pose very difficult technical challenges and is not attempted in this work.

The remainder of this section is devoted to the proof of Theorem 1.

5.1. Basic posterior inequality via the stability estimate

A key step in the proof of Theorem 1 is the stability estimate in Equation (5) below, which bounds the total variation distance between two Lévy measures in terms of the total variation distance between the corresponding compound distributions.

In principle, it is conceivable that the Panjer recursion should allow one to bound probability distances between the Lévy measures via distances between the corresponding compound distributions; we call such a bound a stability estimate. Nevertheless, explicit as the equations of the Panjer recursion are, they are still somewhat unwieldy for that purpose. Hence we will use another inversion formula from Buchmann and Grübel (2003), which will lead to the stability estimate we are after.

First we introduce some notation, and also recall a few useful facts summarised in Buchmann and Grübel (2003). The space of absolutely summable sequences is defined as $\ell_1 = \{ a = (a_k)_{k \geq 0} : \sum_k |a_k| < \infty \}$, with norm given by $\|a\|_1 = \sum_k |a_k|$. For probability vectors $p$ and $\tilde{p}$, the norm $\|p - \tilde{p}\|_1$ is (twice) the total variation distance between $p$ and $\tilde{p}$. For any $a, b \in \ell_1$, we have the inequality

$$\|a * b\|_1 \leq \|a\|_1 \|b\|_1, \tag{4}$$

where $a * b$ denotes the convolution of $a$ and $b$. We define the exponential mapping $\mathcal{E}$ from $\ell_1$ into $\ell_1$ via

$$\mathcal{E}(a) = \sum_{m=0}^{\infty} \frac{a^{*m}}{m!}.$$

The exponential has the following two useful properties:

$$\mathcal{E}(a + b) = \mathcal{E}(a) * \mathcal{E}(b) \qquad \text{and} \qquad \|\mathcal{E}(a)\|_1 \leq e^{\|a\|_1}.$$

We define the sequence $\delta \in \ell_1$ such that $\delta_0 = 1$ and all its other entries are equal to zero; it is the identity for convolution, and $a^{*0} = \delta$. Then, using the above properties of the exponential, we can write the compounding mapping in (2) concisely in terms of convolutions of infinite sequences: $q = \mathcal{E}(\lambda\Delta(f - \delta))$. Its convolution inverse, i.e. the sequence $q^{-1}$ such that $q * q^{-1} = \delta$, is given by $q^{-1} = \mathcal{E}(\lambda\Delta(\delta - f))$. Note that $\|q^{-1}\|_1 \leq e^{2\lambda\Delta}$. Recursive expressions for these mappings were given in Subsection 2.2.
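Numerically, the representation $q = \mathcal{E}(\lambda\Delta(f - \delta))$ can be checked against the Panjer recursion of Subsection 2.2 by truncating the exponential series; a sketch (ours):

```julia
# Convolution of two sequences supported on {0, 1, ..., K}, truncated at K.
function conv(a, b)
    K = length(a) - 1
    c = zeros(K + 1)
    for i in 0:K, j in 0:(K - i)
        c[i + j + 1] += a[i + 1] * b[j + 1]
    end
    return c
end

# Truncated convolution exponential E(a) = Σ_m a^{*m} / m!.
function conv_exp(a, nterms)
    K = length(a) - 1
    result = zeros(K + 1); result[1] = 1.0   # a^{*0} = δ
    term = copy(result)
    for m in 1:nterms
        term = conv(term, a) ./ m
        result .+= term
    end
    return result
end

f = fill(0.2, 5); λΔ = 1.0; K = 15
a = λΔ .* vcat(0.0, f, zeros(K - 5))   # the sequence λΔ·f, padded; no atom at 0
qhat = conv_exp(a, 60) .* exp(-λΔ)     # e^{-λΔ}·E(λΔ f) = E(λΔ(f - δ))
# qhat ≈ panjer(f, λΔ, K)
```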

Lemma 1.

Let $q$ and $\tilde{q}$ correspond to the pairs $(\lambda, f)$ and $(\tilde{\lambda}, \tilde{f})$, respectively, with associated Lévy measures $\nu$ and $\tilde{\nu}$. Then, in accordance with the notation introduced above and provided that $\lambda, \tilde{\lambda} \leq \bar{\lambda}$, it holds that

$$\|\nu - \tilde{\nu}\|_1 \leq \frac{e^{2\bar{\lambda}\Delta}}{\Delta}\, \|q - \tilde{q}\|_1. \tag{5}$$
Proof.

The result is a direct consequence of Lemma 3 in Buchmann and Grübel (2003), which expresses the difference $\lambda\Delta(f - \delta) - \tilde{\lambda}\Delta(\tilde{f} - \delta)$ in terms of $q - \tilde{q}$ convolved with sequences whose $\ell_1$-norms are controlled by $e^{2\bar{\lambda}\Delta}$, whenever $\lambda, \tilde{\lambda} \leq \bar{\lambda}$. Taking the $\ell_1$-norm on both sides and some elementary bounding via (4) imply that

$$\|\lambda\Delta(f - \delta) - \tilde{\lambda}\Delta(\tilde{f} - \delta)\|_1 \leq e^{2\bar{\lambda}\Delta} \|q - \tilde{q}\|_1,$$

and thus Equation (5) follows, since $\|\lambda(f - \delta) - \tilde{\lambda}(\tilde{f} - \delta)\|_1 \geq \|\nu - \tilde{\nu}\|_1$. ∎

We will use Equation (5) to establish a key inequality for the posterior measure $\Pi(\cdot \mid Z_1, \ldots, Z_n)$. We recall once again that the subscript $0$ refers to 'true', data-generating quantities.

Proposition 1.

For any prior on $(\lambda, f)$, for any $\varepsilon > 0$ and for any $\bar{\lambda} \geq \lambda_0$, the following posterior inequality holds:

$$\Pi\left( \nu : \|\nu - \nu_0\|_1 > \varepsilon \mid Z_1, \ldots, Z_n \right) \leq \Pi\left( q : \|q - q_0\|_1 > \varepsilon \Delta e^{-2\bar{\lambda}\Delta} \mid Z_1, \ldots, Z_n \right) + \Pi\left( \lambda > \bar{\lambda} \mid Z_1, \ldots, Z_n \right).$$

Proof.

Write the event $\{ \nu : \|\nu - \nu_0\|_1 > \varepsilon \}$ as a union of the sets

$$A_1 = \{ \|\nu - \nu_0\|_1 > \varepsilon,\ \lambda \leq \bar{\lambda} \}$$

and

$$A_2 = \{ \|\nu - \nu_0\|_1 > \varepsilon,\ \lambda > \bar{\lambda} \}.$$

Thanks to Lemma 1, the set $A_1$ is a subset of $\{ q : \|q - q_0\|_1 > \varepsilon \Delta e^{-2\bar{\lambda}\Delta} \}$. The proof is concluded by observing that $A_2$ is a subset of $\{ \lambda > \bar{\lambda} \}$, too, since $\lambda_0 \leq \bar{\lambda}$. ∎

In general, stability estimates like the one in Equation (5) are unknown in the literature on Lévy processes. Consequently, studying Bayesian asymptotics for Lévy models, even in the CPP case, necessitates the use of very intricate arguments under restrictive assumptions; see, e.g., Nickl and Söhl (2017).

5.2. Proof of Theorem 1

The usefulness of Proposition 1 above lies in the fact that the posterior contraction rate in the inverse problem of estimating the Lévy measure $\nu$ from indirect observations can now be deduced from the posterior contraction rate in the direct problem of estimating the compound distribution $q$, which is easier (observe that $\Delta$ is determined by the observation scheme and is therefore fixed in the proofs). The general machinery developed in Ghosal et al. (2000) can be applied to handle the latter problem, and also several inequalities obtained in Gugushvili et al. (2015) are useful in that respect. In particular, we make use of the following inequality, instrumental in controlling the Hellinger distance $h$ between compound distributions,

$$\|q - \tilde{q}\|_1 \leq 2\Delta\, \|\nu - \tilde{\nu}\|_1, \tag{6}$$

cf. Lemma 1 in Gugushvili et al. (2015). To ease our notation, in the sequel we will often write $q$ and $q_0$ instead of the compound distributions corresponding to $(\lambda, f)$ and $(\lambda_0, f_0)$, respectively.

Denote by $K(q_0, q)$ and $V(q_0, q)$ the Kullback-Leibler divergence and its second-moment counterpart,

$$K(q_0, q) = \sum_{k} q_{0,k} \log\frac{q_{0,k}}{q_k}, \qquad V(q_0, q) = \sum_{k} q_{0,k} \log^2\frac{q_{0,k}}{q_k}.$$

Another two inequalities we will use are the following: let $\lambda, \lambda_0 \leq \bar{\lambda}$. Then there exists a positive constant $C = C(\bar{\lambda}, \Delta)$, such that

$$K(q_0, q) \leq C\, h^2(q_0, q), \qquad V(q_0, q) \leq C\, h^2(q_0, q), \tag{7}$$

cf. equations (14) and (15) in Lemma 1 in Gugushvili et al. (2015).

These three inequalities can be obtained by adjusting the arguments used in Gugushvili et al. (2015). However, we have opted to give direct proofs in Lemma 5 in Appendix A, under slightly weaker conditions than those required in Gugushvili et al. (2015).

Our proof of Theorem 1 proceeds via verification of the conditions for posterior contraction in Theorem 2.1 in Ghosal et al. (2000). In our setting, the latter result reads as follows.

Theorem 2.

Assume $Z_1, \ldots, Z_n$ are independent and identically distributed with distribution $q_0$. Let $h$ denote the Hellinger metric on $\mathcal{Q}$, a collection of probability measures with support in $\{0, 1, 2, \ldots\}$. Suppose that for a sequence $\varepsilon_n$ with $\varepsilon_n \to 0$ and $n\varepsilon_n^2 \to \infty$, a constant $C > 0$ and sets $\mathcal{Q}_n \subset \mathcal{Q}$, we have

$$\log N(\varepsilon_n, \mathcal{Q}_n, h) \leq C n \varepsilon_n^2,$$

$$\Pi(\mathcal{Q} \setminus \mathcal{Q}_n) \leq e^{-(C + 4) n \varepsilon_n^2},$$

$$\Pi\left( q : K(q_0, q) \leq \varepsilon_n^2,\ V(q_0, q) \leq \varepsilon_n^2 \right) \geq e^{-C n \varepsilon_n^2}.$$

Then, for a sufficiently large constant $C' > 0$, we have that $\Pi\left( q : h(q, q_0) \geq C' \varepsilon_n \mid Z_1, \ldots, Z_n \right) \to 0$ in $q_0$-probability.

We will now verify the three conditions of this theorem, which we refer to as the entropy condition, the remaining mass condition, and the prior mass condition, respectively. To that end, fix strictly positive sequences $\bar{\lambda}_n \to \infty$ and $\varepsilon_n \to 0$, and define the sieves

$$\mathcal{Q}_n = \left\{ q = q_{\lambda, f} : \lambda \leq \bar{\lambda}_n,\ f \text{ supported on } \{1, \ldots, M_n\} \right\}.$$

5.2.1. Entropy

We start by bounding the entropy of the sieve $\mathcal{Q}_n$ with respect to $h$-balls of radius $\varepsilon_n$.

Lemma 2.

Assume that, as $n \to \infty$,

$$M_n \to \infty. \tag{8}$$

Then

$$\log N(\varepsilon_n, \mathcal{Q}_n, h) \lesssim M_n \log\left( \frac{\Delta \bar{\lambda}_n}{\varepsilon_n^2} \right). \tag{9}$$

Proof.

For $q = q_{\lambda, f}$ and $\tilde{q} = q_{\tilde{\lambda}, \tilde{f}}$ in $\mathcal{Q}_n$,

$$\|\nu - \tilde{\nu}\|_1 \leq \bar{\lambda}_n \|f - \tilde{f}\|_1 + |\lambda - \tilde{\lambda}|.$$

Furthermore, from Section 3.2 in Pollard (2002),

$$h^2(q, \tilde{q}) \leq \frac{1}{2} \|q - \tilde{q}\|_1.$$

Combining the preceding two displays and Equation (6), we get

$$h^2(q, \tilde{q}) \leq \Delta \left( \bar{\lambda}_n \|f - \tilde{f}\|_1 + |\lambda - \tilde{\lambda}| \right).$$

Hence, if

$$\|f - \tilde{f}\|_1 \leq \frac{\varepsilon_n^2}{2\Delta\bar{\lambda}_n} \qquad \text{and} \qquad |\lambda - \tilde{\lambda}| \leq \frac{\varepsilon_n^2}{2\Delta},$$

then the Hellinger distance between $q$ and $\tilde{q}$ is bounded by $\varepsilon_n$. To cover $[0, \bar{\lambda}_n]$, we need at most $\lceil 2\Delta\bar{\lambda}_n / \varepsilon_n^2 \rceil$ intervals of length $\varepsilon_n^2/(2\Delta)$. To cover the discrete distributions with support in $\{1, \ldots, M_n\}$, we need at most

$$\left( \frac{c\, \Delta \bar{\lambda}_n}{\varepsilon_n^2} \right)^{M_n}$$

$\ell_1$-balls of radius $\varepsilon_n^2/(2\Delta\bar{\lambda}_n)$, for a universal constant $c > 0$. Under assumption (8), the logarithm of the number of intervals covering $[0, \bar{\lambda}_n]$ is asymptotically negligible compared to that of the above display and can be omitted. In that case, the number of $\varepsilon_n$-balls that we need to cover $\mathcal{Q}_n$ is of order

$$\left( \frac{c\, \Delta \bar{\lambda}_n}{\varepsilon_n^2} \right)^{M_n}.$$

Taking the logarithm and next a straightforward rearrangement of the terms gives the statement of the lemma. ∎

5.2.2. Remaining prior mass

Now we will derive an inequality for the remaining prior mass.

Lemma 3.

For $\bar{\lambda}_n \to \infty$ with $n$,

$$\Pi(\mathcal{Q} \setminus \mathcal{Q}_n) \lesssim e^{-c \bar{\lambda}_n}$$

for a constant $c > 0$ depending on the Gamma prior for $\lambda$.

Proof.

We have (with a slight abuse of notation)

$$\Pi(\mathcal{Q} \setminus \mathcal{Q}_n) = \Pi(\lambda > \bar{\lambda}_n).$$

Now, the tail of the Gamma prior for $\lambda$ decays exponentially fast, and any polynomial factors in the tail bound can be absorbed into the exponential at the cost of adjusting the constant $c$. The proof is concluded. ∎

5.2.3. Prior mass

Finally, we lower bound the prior mass in a small Kullback-Leibler neighbourhood of the data-generating compound distribution $q_0$. The formulation involves an auxiliary function defined in terms of the constant appearing in the statement of Lemma 6 below.

Lemma 4.

Assume that

  1. there exists a constant $\underline{a} > 0$, such that the parameters of the Dirichlet distribution satisfy $a_k \geq \underline{a}$ for all $k$;

  2. the strictly positive sequences $\varepsilon_n$, $\bar{\lambda}_n$ and $M_n$ satisfy the inequalities