Imprecise Monte Carlo simulation and iterative importance sampling for the estimation of lower previsions

06/27/2018, by Matthias C. M. Troffaes et al., Durham University

We develop a theoretical framework for studying numerical estimation of lower previsions, generally applicable to two-level Monte Carlo methods, importance sampling methods, and a wide range of other sampling methods one might devise. We link consistency of these estimators to Glivenko-Cantelli classes, and for the sub-Gaussian case we show how the correlation structure of this process can be used to bound the bias and prove consistency. We also propose a new upper estimator, which can be used along with the standard lower estimator, in order to provide a simple confidence interval. As a case study of this framework, we then discuss how importance sampling can be exploited to provide accurate numerical estimates of lower previsions. We propose an iterative importance sampling method to drastically improve the performance of imprecise importance sampling. We demonstrate our results on the imprecise Dirichlet model.

1. Introduction

Various sensible approaches to sampling for estimation of lower previsions can be found in the literature. A case study comparing a wide range of techniques, specifically aimed at reliability analysis, can be found in [10].

A first approach is to use two-level Monte Carlo sampling, where one first samples distributions over the (extreme points of the) credal set, and then samples from these distributions. In the context of belief functions, one can also sample random sets, and then evaluate the resulting belief function through optimisation over these sets [8]. A third, more sophisticated, approach consists of importance sampling from a reference distribution, followed by solving an optimisation problem over the importance sampling weights [11, 5, 16].

Two-level Monte Carlo can be rather inefficient, especially if a credal set is high-dimensional or if it has a large number of extreme points. Moreover, two-level Monte Carlo generally only provides a non-conservative solution.

Random set sampling is more efficient, but requires a large number of optimisation problems to be solved (one for each sample), and requires a suitable belief function approximation to be identified if one wants to apply this to arbitrary lower previsions.

Importance sampling in imprecise probability has been studied since the ’90s; see for example [8, 1, 6] for some early works. Importance sampling can be quite effective. For example, [2] have successfully used sensitivity analysis over importance sampling weights with respect to the mean parameter of a normal distribution. In [5], importance sampling is used over both the mean and the variance parameters of a normal distribution using a 2-dimensional grid. A common issue with importance sampling is that estimates can be very poor if the reference distribution is far from the optimal distribution. This has recently been addressed in [15], where an iterative self-normalised importance sampling method is proposed that requires far less computational power compared to standard importance sampling methods for sensitivity analysis, in the sense that far smaller samples can be used, and that far smaller optimisation problems need to be solved. A very similar approach using standard importance sampling was proposed in [4].

A second issue, which has received little attention in the literature, concerns the bias and consistency of these estimates. Two-level Monte Carlo methods and importance sampling methods are essentially constructed as lower envelopes of estimators for precise expectations. To the best of the author’s knowledge, in the context of imprecise probability, the bias and consistency of such lower envelopes have not yet been studied in the literature.

The first purpose of this paper is to develop a theoretical framework for studying numerical estimation of lower previsions. We do so by looking, in essence, at the estimation of the minimum of an unknown function, given that we can ‘simulate’ the function to an arbitrary degree of precision. This framework applies to two-level Monte Carlo methods, importance sampling methods, and a wide range of other sampling methods one might devise. A first contribution of this paper is a link between consistency of such envelopes and a Glivenko-Cantelli class condition. In case our estimators can be described by a sub-Gaussian process (by the central limit theorem, this will often hold if the sample size is taken large enough), we show how the correlation structure of this process can be used to bound the bias. We provide sufficient conditions for consistency, and we also identify situations where consistency fails. We rely heavily on stochastic process theory, and in particular, on Talagrand’s results on the supremum of stochastic processes [13].

Although we obtain theoretical bounds, these bounds are not practical in the sense that they require us to bound a rather tricky functional. A second theoretical contribution of the paper is that we propose a new ‘upper’ estimator, which can be used along with the standard lower envelope estimator, in order to provide a single comprehensive confidence interval. Unfortunately, the consistency of this upper estimator is still an open problem, although it is guaranteed to provide an upper bound (hence its name). However, we can identify one situation under which the upper estimator is unbiased.

A second aim of the paper is to study importance sampling for the estimation of lower previsions. We follow [11, 15, 4], and look specifically at how we can take envelopes over the importance sampling weights directly in order to obtain sampling estimates, without needing to draw large numbers of samples, and without needing to solve large numbers of optimisation problems. Unlike [11], however, we do not just look at Bayesian sensitivity analysis, and admit arbitrary sets of distributions in our theoretical treatment. Also unlike for instance [11, 2, 5, 16, 4], in this paper, we use self-normalised importance sampling instead of standard importance sampling, as it is known that this drastically speeds up calculations [15]. In this paper, we will also show that it ensures coherence of the resulting estimates (regardless of bias).

We then revisit the iterative importance sampling method proposed in [15, 4]. The key novelty of this method is the idea of iteratively changing the importance sampling distribution itself, in order to ensure that the final answer has an effective sample size that is as close as possible to the actual sample size. In this paper, we focus on obtaining proper confidence intervals. We reconfirm that the iterative method requires far less computational power compared to standard importance sampling methods for sensitivity analysis, in the sense that far smaller samples can be used, and that far smaller optimisation problems need to be solved. We also identify conditions under which we can obtain confidence intervals almost instantly.

The paper is structured as follows. In section 2, we study the theory behind lower envelopes of estimators. We study bias, consistency, and two ways to obtain a confidence interval. In section 3, we study importance sampling for the estimation of lower previsions. We briefly review the basic theory, and then we revisit the iterative importance sampling method proposed in [15, 4], from the perspective of the results derived in the preceding section. Some numerical examples demonstrate our approach, using the imprecise Dirichlet model. We conclude in section 4.

2. Imprecise Estimation

The aim of this section is to provide a general theoretical framework for studying the lower envelope of a parameterized set of estimators. The main idea is that we can use such a lower envelope to estimate the minimum of an unknown (or, hard to evaluate) function. The application that we have in mind is one where we have a parameterized set of estimators for the expectation of some quantity, and we wish to estimate the lower expectation. However, the theory developed in this section is not tied to any specific parameterized set of estimators.

2.1. Lower and Upper Estimators for the Minimum of a Function

Let

be a random variable (or, vector of random variables) taking values in some set

. Let be a parameter taking values in some set . Let be an arbitrary function, such that for every ,

is an unbiased estimator for

. In other words, is an unbiased estimator for the function . We explicitly isolate the random part of our estimator by writing it as a function of a random variable . This decomposition is essential later, as it will allow us to construct an upper estimator.

More specifically, we assume that for every fixed , is measurable,

(1)

and is finite (and hopefully quite small).

We assume that the function has a minimum:

(2)

Our aim is to construct an estimator for that minimum.

In case of estimation of lower previsions, consider a gamble , a (weak-*) compact collection of previsions (expectation operators) parameterized by , and a collection of estimators for . With

(3)

we aim to construct estimators for and study the properties of such estimators.

Throughout, we assume that is a compact subset of , and that is continuous in for all . This guarantees that can be minimised over for any value of . Because is separable, there is a countable subset of such that, for all ,

(4)

Consequently, is measurable, and additionally there is a measurable function such that

(5)

The next theorem provides us with a lower and upper estimator for .

Theorem 1.

Assume and are i.i.d. and let

(6)
(7)

Then

(8)

and

(9)
Proof.

For all and ,

(10)

This proves the first inequality.

For the second inequality, note that, for all ,

(11)

Now take the expectation, and then the minimum over , on both sides of this inequality, to arrive at

(12)

For the final inequality, first note that for all , by the independence of and ,

(13)

Consequently, by the law of iterated expectation,

(14)

The estimator is used throughout the literature (see [10] and references therein) as an estimator for lower previsions. In that context, it is not normally noted in the literature that is negatively biased, so in that sense, the above theorem provides a ‘new’ result. We shall see that the bias can be very large in specifically constructed examples. However, provided that our estimator has the form of a sample mean, we will show that is a consistent estimator for if is a Glivenko-Cantelli class. If, additionally, is (approximately) a Gaussian process over —this holds if the estimators satisfy the central limit theorem which will be the case for most estimators used in practice—then we show consistency is satisfied if the process has a finite Talagrand functional, and in that case we can also explicitly bound the bias as a function of this functional. Moreover, if for every realisation of , is a coherent prevision when seen as a function of the gamble , then is guaranteed to be a coherent lower prevision in itself.

The estimator is novel, as far as we know. As we shall see, we cannot yet prove much about it. Its main use is that it allows us to bound the bias of without having to explicitly bound the Talagrand functional, which is a challenging problem for general estimators. Currently, we do not know the conditions under which is consistent in general. Further, it is easy to see that is not coherent in general when seen as a function of . In this paper, we will simply use as a diagnostic for to avoid complicated generic chaining arguments to obtain bounds for the Talagrand functional.

Note that is a positively biased estimator for , for any random variable taking values in , as long as is independent of . So we do not necessarily have to take as in the theorem, although this seems the most obvious choice. The theoretically optimal choice for is to take , and in that case is an unbiased estimator for . Finding this optimal choice amounts to calculating , which is the quantity we are aiming to estimate. Therefore, is not a useful choice.
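To make the construction concrete, here is a minimal numerical sketch of the lower and upper estimators of theorem 1, under purely hypothetical assumptions: the parametrised estimator f(theta, y), the grid standing in for the parameter set, and the sample size are all invented for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parametrised estimator: f(theta, y) is the sample mean of
# sin(theta * Y) + theta^2 over the draws in y, so that T(theta) = theta^2
# and the true minimum over the grid below is 0 (attained at theta = 0).
def f(theta, y):
    return np.mean(np.sin(theta * y) + theta ** 2)

thetas = np.linspace(0.0, 2.0, 51)   # finite grid standing in for the parameter set
N = 10_000                           # sample size per estimate

y = rng.standard_normal(N)           # one realisation of the random input
y_prime = rng.standard_normal(N)     # an independent realisation

# Lower estimator: minimise the estimate over theta using the same draws y.
estimates = np.array([f(t, y) for t in thetas])
theta_star = thetas[np.argmin(estimates)]
lower = estimates.min()              # typically slightly below the true minimum

# Upper estimator: re-evaluate the minimising theta on the independent draws.
upper = f(theta_star, y_prime)

print(f"lower estimate: {lower:.4f}, upper estimate: {upper:.4f}")
```

In this toy setting the lower value typically falls below the true minimum (negative bias), while re-evaluating the selected parameter on fresh draws gives a positively biased counterpart, matching the inequalities in theorem 1.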

2.2. Unbiased Case

The following theorem states a simple condition under which both and estimators are unbiased.

Theorem 2.

If there is a such that for all , then

(15)

and consequently,

(16)
Proof.

Simply note that we may choose for all . Now apply theorem 1. ∎

So, in this case, the lower and upper estimators coincide, and therefore there is no bias. The theorem provides a reason for choosing as our upper estimator. Indeed, if there is a such that , then will identify it. Normally however, there is no such that for all .
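As a quick hedged check of theorem 2 (all names below are invented for illustration), consider an f whose minimiser over theta is the same for every realisation of the random input; in that case the lower and upper estimators select the same parameter and both are unbiased for the minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical f whose minimiser does not depend on y: for every realisation,
# f(., y) is minimised at theta = 1, and T(theta) = (theta - 1)^2 with minimum 0.
def f(theta, y):
    return (theta - 1.0) ** 2 + np.mean(y)

thetas = np.linspace(0.0, 2.0, 101)
y, y_prime = rng.standard_normal(1_000), rng.standard_normal(1_000)

estimates = np.array([f(t, y) for t in thetas])
theta_star = thetas[np.argmin(estimates)]   # always 1.0, whatever y is

lower = estimates.min()         # equals mean(y), so unbiased for the minimum 0
upper = f(theta_star, y_prime)  # equals mean(y_prime), also unbiased for 0
print(f"lower: {lower:.4f}, upper: {upper:.4f}")
```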

2.3. Consistency

A good estimator will allow us to control the error through the sample size. For example, an estimator may take the form of a sample mean:

(17)

where and where the are i.i.d. random variables taking values in . Such estimators are related to so-called empirical processes.

For any fixed , we can make the error arbitrarily small because

(18)

An estimator where the error can be made arbitrarily small is called consistent. More formally [3, Chapter 6]:

Definition 3.

A (sequence of) estimator(s) for is called consistent whenever, for all , we have that

(19)

A natural question is: if is consistent for , will

(20)

be consistent for ? Even if is biased, consistency of would help, because in that case the bias can be made arbitrarily small by increasing . First, we show that is consistent in case is finite, based on a simple union bound. Then, we consider the infinite case, linking consistency to so-called Glivenko-Cantelli classes. Under the additional assumption that is approximately normally distributed, we will also quantify the bias of based on the Talagrand functional.

Theorem 4.

Suppose that, for all , is a consistent estimator for . If is finite, then is a consistent estimator for .

Proof.

Fix . We know that, for all ,

(21)

We need to show that

(22)

Because

(23)

it suffices to show that both terms on the right hand side converge to zero.

Both terms can be easily bounded using Boole’s inequality. Indeed,

(24)
(25)
(26)
(27)

The last expression converges to zero because is a consistent estimator for .

Similarly,

(28)
(29)
(30)
(31)

Again, the last expression converges to zero because is a consistent estimator for . ∎

The case where is not finite is considerably more difficult. A first important insight is that the problem can be linked to so-called Glivenko-Cantelli classes for those estimators that take the form of a sample mean (which covers a large range of estimators).

Definition 5.

[13, p. 272] Consider a sequence , , … of i.i.d. random variables taking values in . A countable set of measurable functions on is called a Glivenko-Cantelli class if

(32)

Remember that is a countable dense subset of whose existence is guaranteed under the assumptions made at the beginning of the paper. We can now state our first main result.

Theorem 6.

If the set of functions is a Glivenko-Cantelli class, then is a consistent estimator for .

Proof.

Fix any . Then, by Markov’s inequality,

(33)

so it suffices to show that converges to zero. Indeed,

(34)
(35)

By the Glivenko-Cantelli class assumption, the right hand side converges to zero. ∎
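The Glivenko-Cantelli condition can be checked empirically in simple cases. The sketch below uses a hypothetical family f(theta, y) = sin(theta * y) with standard normal draws, for which the true expectation is 0 for every theta, and shows the supremum (over a finite grid of parameter values) of the deviation between the empirical mean and the true expectation shrinking as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical family: f(theta, y) = sin(theta * y) with Y standard normal,
# so E[f(theta, Y)] = 0 for every theta by symmetry.
thetas = np.linspace(0.0, 5.0, 100)   # finite stand-in for the countable dense subset

for N in (100, 1_000, 10_000, 50_000):
    y = rng.standard_normal(N)
    emp = np.mean(np.sin(np.outer(thetas, y)), axis=1)  # empirical means, one per theta
    sup_dev = np.max(np.abs(emp))                       # sup over theta of |empirical - true|
    print(f"N = {N:>6}: sup deviation = {sup_dev:.4f}")
```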

2.4. Discrepancy Bounds

For practical reasons, it is useful to theoretically quantify the bias, in order to gain some intuition for how we should design a set of estimators that can achieve a small bias. A range of powerful techniques for evaluating so-called discrepancy bounds is presented in [13, Chapter 9], and these can be readily applied to our problem. However, by the central limit theorem, will approximate a Gaussian process. Consequently, the theory for evaluating the supremum of a Gaussian process [13, Chapter 2] is applicable here too, and as discrepancy bounds for Gaussian processes are much easier to analyse, we will present and apply the key results just for the Gaussian process case here, referring the reader to the literature [13] for further (and considerably more technical) treatment of this fascinating theoretical problem.

For convenience, we introduce:

(36)

We define a pseudo-metric on as follows:

(37)

For any , let denote the diameter of .
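As an illustration, the pseudo-metric can be estimated by plain Monte Carlo. The sketch below assumes the canonical L2 form d(theta1, theta2) = sqrt(E[(f(theta1, Y) - f(theta2, Y))^2]) for the per-sample function of the sample-mean estimator; since the symbols of eq. (37) are not reproduced here, this exact form is an assumption, and the function and grid are invented for illustration.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# Hypothetical per-sample function from the sample-mean form of the estimator.
def f(theta, y):
    return np.sin(theta * y)

# Assumed canonical pseudo-metric (the exact definition in eq. (37) may differ):
#   d(th1, th2) = sqrt(E[(f(th1, Y) - f(th2, Y))^2]),
# estimated by a Monte Carlo average over draws of Y.
def pseudo_metric(th1, th2, y):
    return np.sqrt(np.mean((f(th1, y) - f(th2, y)) ** 2))

thetas = np.linspace(0.0, 2.0, 21)
y = rng.standard_normal(100_000)

distances = [pseudo_metric(a, b, y) for a, b in combinations(thetas, 2)]
print(f"estimated diameter of the parameter grid: {max(distances):.4f}")
```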

We assume that the process satisfies the following increment condition [13, p. 13]:

(38)

This holds if is a Gaussian process. If eq. 38 holds, we will say that is sub-Gaussian.

Definition 7.

[13, p. 25] An admissible sequence is an increasing sequence of partitions of such that the cardinality of is for , and at most for .

For any , denotes the unique element of containing .

Definition 8.

[13, p. 25] For any , define the Talagrand functional as:

(39)

where the infimum is taken over all admissible sequences.

The Talagrand functional is not necessarily finite.

By , we denote the minimal standard deviation of , i.e.

(40)

We can now prove our second main result:

Theorem 9.

Assume . There is a constant such that, if is sub-Gaussian, then, for all ,

(41)

and

(42)

Note that the constant in the above theorem is universal, and bounds for it can be computed directly from Talagrand’s proof [13, see bounds preceding Eq. (2.31)].

Proof.

Because of our continuity assumptions, there is an such that

(43)

Without loss of generality, we can assume that . If not, just add to .

Because is sub-Gaussian, we have the following bound for some constant [13, see Eq. (2.31) and use ]:

(44)

Consequently,

(45)
(46)
(47)
and now using for appropriate choice of and ,
(48)
(49)
where we used the Gaussian tail bound, for standard Gaussian , and also eq. 44; now choose to arrive at
(50)

Finally, use and to arrive at eq. 41. The equality follows from the usual properties of the variance of a sum of i.i.d. variables. The Talagrand functional equality follows if we can show that .

Indeed, first, let , and note that

(51)
(52)
(53)
(54)

After taking expectations on both sides,

(55)
(56)
(57)
(58)

where we used the fact that and are independent for , and that are i.i.d. and have expectation zero. Using the definition of , we conclude that

(59)

as desired.

To see why the inequality in eq. 42 holds, observe that for any non-negative random variable

(60)

implies

(61)

Now choose and as in eq. 45, choose , and apply the inequality established in eq. 49. ∎

We immediately conclude:

Theorem 10.

Assume . If is sub-Gaussian, then is a consistent estimator for whenever the minimal standard deviation and the Talagrand functional are finite.

To establish whether or not is finite, a range of practical lower and upper bounds is provided in [13, Section 2.3 et seq.]. On a very basic intuitive level, we want

(62)
(63)

to be ‘small’. This happens precisely when the estimators and are highly correlated for all and . A simple but important practical example where is ‘too large’ is given next:

Theorem 11.

There is a constant such that, if for some we have that for all , then

(64)

where is the cardinality of .

Proof.

Immediate from [13, Theorem 2.4.1] (the majorizing measure theorem) and [13, Lemma 2.4.1] (Sudakov minoration). ∎

In particular, if is not finite, then under the conditions of the theorem. The condition for all holds, for instance, when all are pairwise independent and . Such an estimator will perform very badly. For example, this tells us that when doing two-level Monte Carlo, one should fix the random seed for every run, in order to ensure that the different runs are maximally correlated (and definitely not independent!).
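The practical advice above, namely to reuse the same random draws for every parameter value so that the runs are maximally correlated, can be illustrated with a small hypothetical experiment: with a shared sample the lower envelope stays close to the true minimum, while recomputing each parameter value with fresh independent samples drags the envelope downwards, which is exactly the unfavourable situation of theorem 11. The model and sample sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical toy model: T(theta) = theta^2, with true minimum 0 at theta = 0,
# estimated by the sample mean of theta^2 plus standard normal noise.
thetas = np.linspace(-1.0, 1.0, 201)
N = 400

# (a) Common random numbers: one shared sample reused for every theta,
# so the estimates are perfectly correlated across parameter values.
shared = rng.standard_normal(N)
lower_shared = min(np.mean(t ** 2 + shared) for t in thetas)

# (b) Independent samples: fresh draws for every theta, so the estimates are
# pairwise independent and the envelope picks up the most negative noise terms.
lower_indep = min(np.mean(t ** 2 + rng.standard_normal(N)) for t in thetas)

print(f"shared sample (correlated runs): {lower_shared:.4f}")
print(f"independent samples per run:     {lower_indep:.4f}")
```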

2.5. Confidence Interval

In cases where we can bound the Talagrand functional, eq. 41 in theorem 9 can be used directly to construct a confidence interval for around . In general, however, bounding the Talagrand functional is a non-obvious procedure. The next theorem provides a much simpler procedure for constructing a confidence interval for , using i.i.d. realisations of and instead of using the Talagrand functional. The price we pay is that we need to repeat our calculation for a sufficient number of realisations of .

Note a crucial difference in notation from section 2.3: there was a vector of random variables . In the theorem below, we need to work with independent realisations of , where might be a vector as before, or something else. Either way, to avoid possible confusion with the components of , we will denote these independent realisations by , , …, , and so on.

To apply the central limit theorem, we assume below that is uniformly bounded, but obviously this can be relaxed in the usual ways [9].

Theorem 12.

Let , …, , , …, be a sequence of i.i.d. realisations of . Define

(65)
(66)

Let and be the sample means of these sequences, and let and be their sample standard deviations. Let denote the usual two-sided critical value of the t-distribution with degrees of freedom at confidence level . Then, provided that ,

(67)

In other words, is an approximate confidence interval for with confidence level (at least) .

Before we proceed with the proof, it is worth making a note about computational efficiency. The slowest part of the computation is normally the optimisation over , i.e. the evaluation of . To find the confidence interval, we need to run evaluations of : one for each in and one for each in . However, instead of using , it will be much faster to use , because for the latter we can recycle the already computed value of . This still produces a valid confidence interval for . However, because it may happen that for some realisations of and , it is not guaranteed that with this modification. Consequently, the resulting confidence interval might be empty. Fortunately, the proof below only relies on the probability of

(68)

being close to zero. Since this is very likely to be the case in practice, we suggest using this faster method, even though it has a minor theoretical flaw, because it will be about twice as fast. It should be avoided only if the resulting confidence intervals are regularly empty.
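A hedged sketch of the resulting procedure is given below, reusing the hypothetical toy model from the earlier sketches. Since the symbols of theorem 12 are not reproduced here, the exact pairing of replications and the confidence level are assumptions, and the sketch uses the straightforward version with fresh independent draws for the upper estimate rather than the faster recycling variant described above: it draws K independent replications, records the lower-envelope estimate and the upper estimate for each, and combines their replication means, sample standard deviations, and a t critical value into an approximate two-sided confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical parametrised estimator from the earlier sketches: T(theta) = theta^2.
def estimate(theta, y):
    return np.mean(np.sin(theta * y) + theta ** 2)

thetas = np.linspace(0.0, 2.0, 51)
N, K, alpha = 2_000, 50, 0.05

lowers, uppers = np.empty(K), np.empty(K)
for k in range(K):
    y, y_prime = rng.standard_normal(N), rng.standard_normal(N)
    vals = np.array([estimate(t, y) for t in thetas])
    theta_star = thetas[np.argmin(vals)]
    lowers[k] = vals.min()                      # lower-envelope estimate
    uppers[k] = estimate(theta_star, y_prime)   # upper estimate on independent draws

t_crit = stats.t.ppf(1 - alpha / 2, df=K - 1)   # two-sided t critical value
ci_low = lowers.mean() - t_crit * lowers.std(ddof=1) / np.sqrt(K)
ci_high = uppers.mean() + t_crit * uppers.std(ddof=1) / np.sqrt(K)
print(f"approximate confidence interval: [{ci_low:.4f}, {ci_high:.4f}]")
```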

Proof.

By theorem 1, and the central limit theorem,

(69)
(70)

where is as before an i.i.d. realisation of . So, because ,

(71)
(72)

We also know that

(73)

and moreover, the intersection of the sets on the left hand side of the above expression must be empty, because it is guaranteed that by eq. 8. Consequently,

(74)

2.6. Confidence Interval for Biased Estimators

We briefly consider the case where is biased. This will allow us to apply our results also to estimators that are only asymptotically unbiased. Specifically, let be any estimator for such that

(75)

for some constant which does not depend on . We then have the following result, extending theorem 12.

Theorem 13.

Let , …, , , …, be a sequence of i.i.d. realisations of . Define

(76)
(77)

Let and be the sample means of these sequences, and let and be their sample standard deviations. Let denote the usual two-sided critical value of the t-distribution with degrees of freedom at confidence level . Then, provided that ,

(78)
Proof.

We know that, for all ,

(79)

and therefore

(80)

since . Consequently,

(81)