1 Introduction
In this paper we consider a mixture importance sampling strategy to estimate the probability that one or more of a set of rare events takes place. The sampler repeatedly chooses a rare event at random, and then samples the system conditionally on that one event taking place. For each such sample, the total number of occuring events is recorded and a certain reciprocal moment of them is used in the estimate.
This method is a special case of an algorithm in Adler et al. (2012) for computing exceedance probabilities of Gaussian random fields. It was used earlier by Shi et al. (2007) and Naiman and Priebe (2001) for extrema of genomic scan statistics. Priebe et al. (2001) used it for extrema of some spatial statistic involving marked point processes. The earliest use we know is Frigessi and Vercellis (1985) who used it to estimate the cardinality of the union of a given list of finite sets. The above cited papers refer to this method as importance sampling. To distinguish it from other samplers, we will call it ALOE for “At Least One rare Event”.
We develop general bounds for the variance of the ALOE importance sampler, and for its coefficient of variation. It has a sampling standard deviation that is no more than some modest multiple of the event probability. This is an especially desirable property in rare event settings. For background on importance sampling of rare events see
L’Ecuyer et al. (2009).Our motivating context is the reliability of the electrical grid when subject to random inputs, such as variable demand by users and variable production, as occurs at wind farms. The rare events describe unacceptably large electrical phase differences at pairs of connected nodes in the grid.
It is common to use a simplified linear direct current (DC) model of the electrical grid, because the equations describing alternating current (AC) are significantly more difficult to work with, and some authors (e.g., Van den Bergh et al. (2014)) find that there is little to be gained from the complexity of an AC model. This DC model is presented in Sauer and Christensen (1984) and Stott et al. (2009). It is also common to model the randomness in the grid as Gaussian, especially over short time horizons.
We make both of these simplifications: linearity and Gaussianity. The probability we consider can then be written
(1) 
Section 2 introduces more notation for problem (1) and develops the ALOE sampler as an especially convenient version of mixture importance sampling. In this setting we can compute the union bound . Theorem 1 proves that the ALOE estimate has variance at most when IID samples are used. This can be much smaller than which arises from sampling the nominal distribution of . Section 3 discusses some further sampling properties of our estimator that hold without the Gaussian assumption. When there are events, the variance of is at most when the system is sampled times. Section 4 compares ALOE to a state of the art code mvtnorm (Genz et al., 2017) for estimating the probability that a multivariate Gaussian of up to variables with arbitrary covariance belongs to a given hyperrectangle. ALOE is simpler and extends to higher dimensions. When we studied rare event cases, ALOE was more accurate. In our examples that are not rare events, mvtnorm was more accurate. Section 5 describes the power system application. Section 6 contains some discussions. The appendix proves Theorem 1 for any set of events, not just those given by a Gaussian distribution. The theorem applies so long as we can sample conditionally on any one event and then determine which other events also occur. We finish this section with some comments and some references.
One common way for rare event sampling to be inaccurate is that we might fail to obtain any points where the rare event happens. That leads to a severe underestimation of the rare event probability. In ALOE, the corresponding problem is the failure to sample any points where two or more of the rare constituent events occur. In that case ALOE will return the union bound as the estimated rare event probability instead of zero. That is also a setting where the union bound is likely to be a good approximation. So ALOE is robust against severe underestimates of the rare event probability. The second common problem for rare event sampling is an extreme value of the likelihood ratio weighting applied to the observations. In ALOE, the largest possible weight is only times as large as the smallest one.
Our sampler is closely related to instanton methods in power systems engineering. See Chertkov, Pan et al. (2011), Chertkov, Stepanov et al. (2011), and Kersulis et al. (2015). Out of all the configurations of random inputs to a system, the most probable one causing the failure is called the instanton. When there are thousands of failure types there are correspondingly thousands of instantons, each one a conditional mode of the distribution of . Our initial thought was to do importance sampling from a mixture of distributions, with each mixture component defined by shifting the Gaussian distribution’s mean to an instanton. By sampling conditionally on an event, ALOE avoids wasting samples outside the failure region. By conditioning instead of shifting, we get better control over the likelihood ratio in the importance sampler.
ALOE is a form of multiple importance sampling. Multiple importance sampling originated in computer graphics (Lafortune and Willems, 1993; Veach and Guibas, 1994). Owen and Zhou (2000) found a useful way to combine it with control variates defined by the mixture components. Elvira et al. (2015a, b) investigate computational efficiency of some mixture importance sampling and weighting strategies.
We do not consider selfnormalized importance sampling (SNIS) in this paper. SNIS is useful in settings where we can compute an unnormalized version of our target density but cannot sample from it efficiently, if at all. SNIS is common in Bayesian applications (Liu, 2001, Chapter 2). For a recent adaptive version of SNIS, see Cornuet et al. (2012). In estimating rare event probabilities, the optimal sampler for SNIS can be shown to place half of its probability in the rare event and half outside the rare event, while the optimal plain IS estimator places all of its probability on the rare event. The best coefficient of variation for rare event estimation under SNIS is asymptotically . Ordinary importance sampling can attain much smaller variances, and so we focus on it for the rare event problem.
2 Gaussian case
For concreteness, we present ALOE first for Gaussian random variables. We let have the standard Gaussian distribution, , deferring general Gaussians to Section 2.1. We are interested in computing the probability that lies outside a polytope . In our motivating applications, the interior of the polytope defines a safe operating region and we assume that is a rare event. For , define halfspaces
where each and , with . Then and we want to find where . The set is convex and not necessarily bounded. Ordinarily , because we are interested in rare events.
The setting is illustrated in Figure 1 for halfspaces. In that example, two of the halfspaces have their conditional modes inside the union of the other halfspaces. One of those halfspaces is entirely included in the union of the others.
Letting , we know that
The right hand side is the union bound which is sometimes very conservative and sometimes quite accurate.
We will need to use some inclusionexclusion formulas, so some notation for these follows. For any , let , so and by convention . We identify the set with the function . Next define for . Let count the number of rare events that happen. For , let give the distribution of . We use for the cardinality of . Our estimand is
(2) 
by inclusionexclusion.
Frigessi and Vercellis (1985) and other papers present the ALOE sampler but do not derive it. Here we motivate it as an especially simple mixture sampler. The mixture components we use are conditional distributions , for
. They have probability density functions
.Let be nonnegative numbers summing to , and . A mixture importance sampling estimate of based on draws is
(3) 
Notice that has conveniently canceled from numerator and denominator. Although the inclusionexclusion formula (2) contains nonzero terms, each summand in the unbiased estimate in (3) can be computed at cost .
We can induce further cancellation in (3) by making constant in . Taking , we get
(4) 
because always holds for . The estimate (4) is a multiplicative adjustment to the union bound . The terms range from to and so we will never get larger than the union bound or smaller than . This is convenient because always holds.
Theorem 1.
Proof.
See the appendix, where this is proved for a general set of events, not necessarily from Gaussian halfspaces. ∎
The upper bound (6) involves the unknown , so it is not available for planning purpose when we want to select . The variance and the coefficient of variation, can both be bounded in terms of known quantities and as follows.
Corollary 1.
Let be given by (4). Then . If then also . Similarly,
(7) 
A rare event estimator has bounded relative error if remains bounded as one takes the limit in a sequence of problems (Asmussen and Glynn, 2007, Chapter VI). The sequence is typically one where the event of interest becomes increasingly rare. Corollary 1 provides a bounded relative error property for ALOE in any sequence of problems where is uniformly bounded.
2.1 General Gaussians
Now suppose that we are given and the halfspaces are defined by . We assume that is nonsingular. If it is not, then we can reduce to a subset of components whose variance is nonsingular, and write the other components as linear functions of this reduced set. We also assume that we can afford to take a matrix square root . Now , and . Then the halfspaces are given by
for . For rare events, we will have . In some of our motivating contexts one must optimize a cost over . Here we remark that changes to change but not .
2.2 Sampling algorithms
We want to sample conditionally on
for a unit vector
and scalar . We can use the following steps:
Sample .

Sample .

Let .

Deliver .
Step 3 generates a truncated Gaussian by inversion.
The algorithm above can be problematic numerically when is close to as it will be for very rare events. For instance, in the R language (R Core Team, 2015), yields and then yields for any . Some of our electrical grid examples have . That is, some of the potential failure modes are virtually impossible.
Because might be quite large, we get better numerical stability by sampling conditionally on and then delivering . The steps are as follows:

Sample .

Sample .

Let .

Let .

Deliver .
Even a very small combined with yields
without any underflow in the R language (R Core Team, 2015). In cases with extremely large we will ordinarily get and then never sample conditionally on the corresponding . We compute step 4 via to avoid a potentially expensive multiplication .
3 Importance sampling properties
As shown in the Appendix, Theorem 1 holds more generally than the Gaussian case. In this more general setting, we have events, , on a common sample space where has probability density . Event has probability . As before, we want where and the union bound is . We assume that . The upper bound only has to be checked if . If , then we know without any sampling.
When we sample, we ensure that at least one rare event takes place every time, by first picking an event with probability proportional to . Then we sample conditionally on and find , the total number of events that occur. This includes and so our sample values always have . The importance sampling estimate averages over independent replicates. As in the prior section, we use
for the probability of exactly events happening. Then the variance of is given by (6).
The optimal importance sampling distribution for estimating is uniform on . Sampling from this distribution would yield an estimate with variance zero. Not surprisingly, we are seldom able to do that in applications. The ALOE sampler takes with probability proportional to , so it has support set .
We think that many applications will have events that rarely cooccur. In that case is nearly constant at for , and the ALOE sampler is close to the optimal importance sampler. Other applications may have a few near duplicated events that cooccur often. One extreme setting has a common cause that triggers all events at once and those events almost never arise outside of that common situation. In that case is again nearly constant on , this time usually equal to , and ALOE is again nearly optimal.
The variance bound from (6) can be conservative. It stems from , when . If is appreciably large then the variance can be meaningfully less than that bound. We can improve the variance bound by using the following lemma.
Lemma 1.
Let be a random variable supported on for . Then
(9) 
with equality if and only if .
Proof.
See the appendix. ∎
Lemma 1 tells us that for , our worst case setting is one where half of the time that one or more events happen, exactly one happens and half of the time, all of them happen. While that is not plausible for Gaussian and large it can indeed happen for combinatorial enumeration problems like those of Frigessi and Vercellis (1985). From Theorem 2 and Lemma 1, we get
(10) 
because
is a probability distribution on
.Sometimes we are interested in the probability of subevents of . Let be supported on and define . We may use ALOE, via
Then by the same arguments used in the Appendix,
If , then . That is, when describes a rare event that can only occur if one or more of the also occur, we can reduce its Monte Carlo variance from to at most , in cases where .
4 Comparisons
Here we consider some numerical examples comparing ALOE to pmvnorm from the R package mvtnorm (Genz et al., 2017). This package can make use of special properties of the Gaussian distribution, and it works in high dimensions.
We begin by describing mvtnorm based on Genz and Bretz (2009) and a personal communication from Alan Genz. The program computes
for , where for , and can be rank deficient. We can use it to compute for via
The code can handle dimensions up to . In our context, that means at most halfspaces. The dimension can be higher. The related pmvt function handles multivariate random variables. The code has three different algorithms in it. One from Genz (2004) handles two and three dimensional semiinfinite regions, one from Miwa et al. (2003) is for dimensions up to and the rest are handled by an algorithm from Genz and Bretz (2009). This latter algorithm uses a number of methods. It uses randomized Korobov lattice rules as described by Cranley and Patterson (1976) for the first dimensions, in conjunction with antithetic sampling. There are usually randomizations. For more than dimensions it applies a method from Niederreiter (1972). There are a series of increasing sample sizes in use, and the method provides an estimated error (standard errors) based on the randomization. The approach is via sequential conditional sampling, after strategically ordering the variables (e.g., putting unconstrained ones first). The R package calls a FORTRAN program for the computation, so it is very fast. We use the default implementation which uses up to quadrature points.
The main finding is that importance sampling is more effective when the polytope of interest is the complement of a rare event. This is not meant to be a criticism of pmvnorm. That code was not specifically designed to compute the complement of a rare event. The comparison is relevant because we are not aware of alternative code tuned for the high dimensional rare event cases that we need, and pmvnorm is a well regarded and widely available general solution, that seemed to us like the best offtheshelf tool.
4.1 Circumscribed polygon
Let be the regular polygon of sides circumscribed around the circle of radius . This polygon is the intersection of where where , for . We want for . Here we know that . Also, the gap between the circle of radius and the circumscribed polygon has area . The bivariate Gaussian density in this gap is at most . Therefore
that is
for large .
For and , we have . The lower bound is about times the upper bound, so we treat the upper bound as exact. Figure 2 shows histograms of simulations of using ALOE and using pvnorm. In this case ALOE is much more accurate. The mean square relative error is about
fold smaller for ALOE than pvnorm. We also see that pvnorm has high positive skewness and the histogram of estimates has most of its mass well below the mean.
Table 1 shows summary results for this problem with different values of . We see that pvnorm is superior when the event is not rare but ALOE is superior for rare events. The large error for pvnorm with
stemmed from a small number of outliers among the
trials.2  1.35  0.000399  9.42 
3  1.11  0.000451  9.24 
4  3.35  0.000549  2.37 
5  3.73  0.000600  1.81 
6  1.52  0.000543  4.39 
7  2.29  0.000559  3.62 
8  1.27  0.000540  1.34 
The upper bound in equation (6) is , from which . For this yields about , which is over times the actual mean squared relative error from Table 1.
It is possible that this example is artificially easy for importance sampling, due to the symmetry. Whichever halfspace we sample, the distribution of overlapping halfspaces for is the same. Two halfspaces differ from by a one degree angle, two differ by a two degree angle and so on. To get a more varied range of overlap patterns, we replaced angles by angles where is the ’th prime among integers up to . There are of them, of which the largest is . With and replications using points in importance sampling, we have variance of equal to . The comparable figure for mvtnorm is . There were a few outliers there including one that was more than times the union bound. The gap between the prime angle polygon and the inscribed circle is larger than the one formed by the full polygon. Pooling all the importance sampling runs leaves an estimate of about for . In this example, we see importance sampling working quite well without symmetry.
4.2 High dimensional halfspaces
The previous example was low dimensional and each of the halfspaces sampled had numerous similar ones, differing in angle by a small number of degrees. Thus was quite a bit smaller than . Here we consider a high dimensional setting where the halfspaces have less overlap.
Two uniform random unit vectors and in are very likely to be nearly orthogonal for large . Then are nearly independent events. For independent events, we would have
To make a rare event, the must be small and then the probability above will be close to the union bound. Theorem 2 predicts good performance for importance sampling here.
For this test sample problems were constructed. The dimensions were chosen by . Then there were constraints chosen with uniform random unit vectors . The threshold was chosen so that of the union bound was , followed by rounding to two significant figures. Then was computed by importance sampling with samples, and by pmvnorm. Figure 3 shows the results. The ALOE sampling value was always very close to the union bound which in turn is essentially equal to what one would see for independent events. The values from pmvnorm were usually too small. By construction the intersection probabilities are quite rare. In importance sampling, % of the simulations had no intersections among trials and the others had only a few intersections. Therefore it is clear that the probabilities should be close to the union bounds here. We also see evidence of strong positive skewness for pmvnorm.
5 Power system infeasibility
5.1 Model
Our power system models are based on a network of nodes, called busses. Some busses put power into the network and others consume power. The edges in the network correspond to power lines between busses. The network is ordinarily sparse, with a small multiple of .
The power production at bus is , with negative values indicating consumption. For some busses, is tightly controlled and deterministic in the relevant time horizon. Other busses have random corresponding, for example, to variable consumption levels, that we treat as independent. Busses corresponding to wind farms have random power production levels with meaningfully large correlations. Our models contain one special bus , called the slack bus, at which the power is . The total power in the system is zero because transmission power losses are ignored in the DC approximation that we use.
The power at all busses can be represented by the vector corresponding to fixed busses, ordinary random busses (including any correlated ones) and the slack bus. There are fixed busses, slack bus and random busses apart from the slack bus. We will use to denote a column vector of ones, and
to denote the identity matrix of size
and similarly for and .The power at bus must satisfy the constraints
(11) 
The vector has a Gaussian distribution, determined entirely by the random components . Therefore in the present context, is the Gaussian random variable from Section 2. The fixed components satisfy and then the slack bus satisfies where and . Because all of the randomness comes from , we will abbreviate to .
The node to node inductances in the network form a Laplacian matrix where if busses and are connected with
(up to rounding). The Laplacian is symmetric and has one eigenvalue of zero for a connected network. It has a pseudoinverse
. We partition and as followsWe also group into three sets of columns via .
The phase at bus is denoted . In our DC approximation of AC power flow, the phases approximately satisfy . Given the power vector , we take
The phase constraints on the network are
(12) 
In our examples, all for a single value such as or .
Let be the incidence matrix. Each edge in the network is represented by one row of with an entry of for one of the busses on that edge and for the other. The phase constraints are componentwise. Now
The constraint that for every edge can be written
Now and , so the constraint on is
(13) 
We also have constraints which can be written
(14) 
Equations (13) and (14) supply constraints on the random vector . We have also the two slack bus constraints and , that is
(15) 
Finally, there are individual constraints on the random busses
componentwise.
When we combine all of the constraints on , we get a matrix with rows and a vector of upper bounds with entries for which the constraints are componentwise. Here those matrices are
These are the linear constraints on . There are phase constraints and there are two constraints for all of the nonfixed busses, including the slack bus. These constraints can be turned into constraints and on a vector as described in Section 2.1.
5.2 Examples
We considered several model electrical grids included in the MATPOWER distribution (Zimmerman et al., 2011). In each case we modeled violations of the phase constraints, and used samples. For some cases we found that, under our model, phase constraint violations were not rare events. In some other cases, the rare event probability was dominated by one single phase condition: . For cases like this there is no need for elaborate computation because we know is within a narrow interval . The interesting cases were of rare events not dominated by a single failure mode. We investigate two of them.
The first is the Polish winter peak grid of 2383 busses. There were random (uncontrolled) busses and phase constraints. We varied as shown in Table 2. For constraint violations are not very rare. At they are quite rare. The estimated coefficient of variation is nearly constant over this range.
se  

The second interesting case is the Pegase 2869 model of Fliscounakis et al. (2013). This has uncontrolled busses and phase constraints. It is described as “power flow for a large part of the European system”. The results are shown in Table 3. We include an unrealistically large bound in that table, to test the limits of our approach. For , the standard error given is zero. One halfspace was sampled times, another was sampled times but in no instance were there two or more phase violations. The estimate reverts to the union bound. Getting doubletons ( among tries is compatible with the true probability of a doubleton being as high as . Even if and then we would have instead of . We return to this issue in the discussion.
se  

In addition to the examples above we investigated IEEE case 14, IEEE case 300, and Pegase 1354, which were all dominated by one failure. We considered a system which included random and correlated wind power generators, but phase failure was not a rare event in that system. Pegase 13659 was too large for our computation. The Laplacian matrix has rows and columns and we use the SVD to compute the generalized inverses we need. Pegase 9241 was large enough to be very slow and it did not have rare failures.
6 Discussion
We have introduced a version of mixture importance sampling for problems with multiple failure modes. The sample values are constrained to have at least one failure and we obtain bounded relative error.
The ALOE importance sampler is more accurate than a state of the art code for computing high dimensional Gaussian probabilities in our rare event setting, but not otherwise.
We have noticed two areas where ALOE can be improved. First, if we never see concurrent rare events, ALOE will return the union bound , with an estimated variance of zero. That variance estimate could be undesirable even when . Because is supported on we can get an interval estimate of by putting a multinomial prior on this support set and using the posterior distribution given the sample.
A second and related issue is that while always occurs, it is possible to get . We have seen this in cases where because one of the dominates all of the others combined. In such cases , and are all very close together and has small relative standard deviation. Improving these two issues is outside the scope of this paper. They are both things that happen in cases where we already have a very good idea of the magnitude of .
In large problems the algebra can potentially be reduced by ignoring the very rarest events and simply adding their probabilities to the estimate. This will provide a mildly conservative bound. There is also the possibility of exploiting many generalized upper and lower bounds on the probability of a union. See for instance the survey by Yang et al. (2014).
Acknowledgments
We thank Alan Genz for providing details of the mvtnorm package. We thank Yanbo Tang and Jeffrey Negrea for noticing that Lemma 1 could be proved by CauchySchwarz, which is shorter than our original proof and also establishes necessity. Thanks also to Bert Zwart, Jose Blanchet and David Siegmund for pointers to the literature. ABO thanks the Center for Nonlinear Studies at Los Alamos for their hospitality while he visited. The work of YM and MC was funded by CNLS and DOE/GMLC 2.0 project: “Emergency Monitoring and controls through new technologies and analytics”. ABO was supported by the NSF under grants DMS1521145 and DMS1407397.
References
 Adler et al. (2012) Adler, R. J., J. H. Blanchet, and J. Liu (2012). Efficient Monte Carlo for high excursions of Gaussian random fields. The Annals of Applied Probability
Comments
There are no comments yet.