1 Introduction
We consider the problem of computing an expectation in the context of rare event simulation using a Monte Carlo approach. It is well known that, in such a framework, the accuracy of the crude Monte Carlo estimator is poor, because only very few samples yield a nonzero value of the quantity of interest. Assume we want to compute $p = \mathbb{E}[g(X)]$, where $g : \mathbb{R}^d \to \mathbb{R}$ is a measurable function and $X$ is a random vector with values in $\mathbb{R}^d$. There are basically two well-known tools to improve the accuracy of the crude Monte Carlo estimator: importance sampling and stratified sampling. As we will see in Section 2, importance sampling can be implemented in a fully automated way, while stratification requires good a priori knowledge of the realization space. As an application problem, we will study the efficient and accurate determination of the yield of an integrated circuit, through electrical circuit-level simulation, under stochastic constraints due to the manufacturing process. Several orders of magnitude of variance reduction can be achieved: for example, the millions or billions of MC runs necessary to estimate a very small probability might be reduced to a few hundred or thousand runs. Section
5 presents quantitative results and speedup factors obtained with our importance sampling algorithm. The performance will be demonstrated on several integrated circuits, from SRAM bitcells to digital standard cells and analog IP blocks, as well as on some small toy circuits. We first give an overview of the semiconductor industry, and of integrated circuits in particular. The notion of yield is explained, along with how it translates into the estimation of very low failure probabilities for some kinds of circuits. We also briefly introduce the concept of analog circuit simulation used in our context.
1.1 EDA market and its segmentation
The past decades saw a considerable evolution of the manufacturing processes of integrated circuits (ICs). The beginning of the twenty-first century is experiencing the same trend, driven by the emergence of connected objects, applications linked to mobility, on-board systems (especially in the automotive domain), the processing of massive data, and medical devices.
For circuit designers, it is necessary to reduce the time needed to put new products on the market, while satisfying the demands for increased performance (lower consumption, reduced weight, size and cost). On top of these end-user constraints come the problems of safety and security (aviation, motor vehicles), and of connectivity and adaptivity to more and more standards: WLAN (Wireless Local Area Network), Bluetooth, WiFi (Wireless Fidelity). The result is an exponential growth in complexity of operation and in size (number of transistors) of ICs. The time to market of a product is a crucial factor in its success. These systems of great complexity have to be designed and validated efficiently; since manufacturing costs are high, multiple attempts cannot be accepted.
The initial design of complex circuits relies in particular on standard cell libraries and blocks of intellectual property (IP) developed by specialized teams or companies. Simulation constitutes the initial step after the specification. From our perspective, we will consider only a limited number of segments of this market: memory circuit simulation and standard cells. The simulation of complete digital blocks (digital SoCs) or large AMS/RF circuits cannot be addressed with our techniques, due to the prohibitive CPU time required for each individual simulation. For our experiments, electrical simulations will be performed with the commercial SPICE simulator Eldo^{®}. SPICE-type simulators (Simulation Program with Integrated Circuit Emphasis) are the industry-standard way to verify circuit operation at the transistor level before manufacturing an integrated circuit.
1.2 Principles of analog circuit simulation
The automatic analysis routines of a SPICE simulation program determine the numerical solution of a mathematical representation of the circuit under validation. After each circuit element has been modeled, the system of equations describing the complete circuit is assembled, with topological constraints imposed by the interconnection of the elements. The equations of the circuit generally form a nonlinear differential algebraic system of the form:
$$F\big(x(t), \dot{x}(t), t\big) = 0 \qquad (1)$$
where $F$ is a nonlinear vector function, $x(t)$ is the vector of circuit variables (e.g. node voltages and charges, or branch currents), and $\dot{x}(t)$ is the derivative of this vector with respect to time $t$. The circuit analysis consists in determining the solution of Eq. (1
) for the following simulation domains: nonlinear quiescent point calculation (DC), time-domain large-signal solution of nonlinear differential algebraic equations (TRAN), linear small-signal frequency-domain analysis (AC), and noise analysis (NOISE). SPICE circuit simulation programs are presented in
Ho et al. (1975), Vlach and Singhal (1993) and McCalla (1987). The Monte Carlo simulation in SPICE programs is performed through multiple model evaluations with randomly selected model process variables. These variables are denoted by $X$ in this paper. It is worth mentioning that the input parameters of the models are in general the various design parameters (geometry of transistors), the process parameters (random fluctuations due to the manufacturing process), and the environmental conditions (voltage conditions and temperatures). The electrical models of all major semiconductor manufacturers contain statistical models of the process parameters, with estimates of central tendency characteristics such as the mean value and standard deviation, and even higher moments such as skewness or kurtosis. In the context of circuit simulation, the problems can live in very high dimension: the number of parameters is in general proportional to the number of transistors. In Section
5.2, some circuit examples will be presented where the dimension of $X$ ranges from a few tens to several thousand components. A circuit performance may be any electrical characteristic extracted from the simulation; it will be denoted by $S$. In our context, the simulator acts as a black box: we do not know anything a priori about the intrinsic nature of the relation between the measured output and the statistical parameters. Typical measures of performance are the time delay for signal propagation across a circuit, the gain of an amplifier, or the oscillation frequency of an oscillator.
1.3 Yield losses and probability of failure
The yield of an Integrated Circuit (IC) plays a major role in the manufacturing cost of an Application-Specific IC (ASIC). This concept typically enters a manufacturing flow called the “Yield Learning Flow”. In such a flow, the yield is defined to be the proportion of the manufactured circuits that satisfy both the design engineers' and the process engineers' specifications. The yield of a manufactured product should increase from the initial design to large-volume production; yield is therefore related to the cost of production. In nanoscale technologies, yield losses can be entirely caused by uncontrolled and unavoidable factors like random dopant fluctuation and line edge roughness (see Agarwal and Nassif (2008)). Each new technology node tends to make the problem more difficult, since the size of the transistors within ICs has shrunk exponentially, following Moore’s Law. Over the past decades, typical MOS transistor channel lengths went from several micrometers (the Intel 8086 CPU launched in 1978 used a 3.2 µm process) to the tens of nanometers found in today’s ICs. This empirical law predicted that the number of transistors would double every 2 years thanks to shrinking transistor dimensions and additional technical improvements. As a consequence, some authors predicted that power consumption per unit area would remain constant, and that computer chip performance would roughly double every 18 months. The leading manufacturers continue to satisfy this empirical trend. In 2020, TSMC had already started mass producing 5 nanometer (nm) chips for companies including Apple, Huawei and Qualcomm. TSMC and Intel both have 2 nm products on their roadmaps, with TSMC scheduling earliest production for 2023 and Intel for 2029. In May 2021, IBM announced the development of the first chip with 2 nm technology.
In practice, one often works with the probability of failure rather than the probability of success (the yield). In some applications, it is much simpler to determine the yield of a single type of circuit which is repeated a very large number of times, say several millions. A typical and important example is Static Random-Access Memory (SRAM) manufacturing yield. The basic block in SRAM circuits is the bitcell; its role is to store one elementary bit of data using latching circuitry (a flip-flop). The probability of failure is defined here to be the probability of a wrong operation (read, write, or retention) occurring in an SRAM circuit. For such problems, the yield may be quantitatively appreciated by considering an explicit value for the acceptable quality limit (AQL). For example, an AQL for 1-Megabit SRAM bitcell circuits could be 100 ppm. If production of one million such circuits is targeted, the manufacturing process cannot allow more than 100 defective bitcells globally. This means that the failure probability of a single bitcell must be at most:
$$p_{\text{fail}} \le 1 - \left(1 - 10^{-4}\right)^{1/2^{20}} \approx 10^{-10}.$$
This relation shows that the yield drops exponentially with the number of elementary blocks, and imposes extremely small values of the failure probability when a large number of repetitions is used.
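The back-of-the-envelope computation above can be sketched in a few lines of Python (a minimal illustration, assuming as in the text a 1-Megabit array, i.e. $2^{20}$ bitcells, and a 100 ppm AQL):

```python
import math

# Illustrative sketch: bound on the per-bitcell failure probability
# implied by a 100 ppm AQL on 1-Megabit (2**20 bitcells) SRAM circuits.
n_cells = 2 ** 20          # bitcells per circuit
aql = 100e-6               # acceptable fraction of defective circuits

# Chip yield is Y = (1 - p_cell)**n_cells; requiring 1 - Y <= aql gives
p_cell_max = 1.0 - (1.0 - aql) ** (1.0 / n_cells)
print(p_cell_max)          # on the order of 1e-10
```

This makes the exponential sensitivity concrete: a 100 ppm target at the chip level forces a per-bitcell failure probability below roughly $10^{-10}$, far out of reach of crude Monte Carlo.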
The practical problem of estimating failure probabilities for a large class of circuits is then translated into a general mathematical framework. The goal is to compute the expectation $p = \mathbb{E}[g(X)]$, where the function $g$ writes as an indicator function $g(x) = \mathbf{1}_{\{S(x) > s\}}$, where $S$ is a circuit performance of interest (evaluated through SPICE simulations) and $s$ is an upper bound specified by the circuit designer. Note also the practical importance of the inverse problem, i.e. the estimation of the quantile associated with a given failure probability $\alpha$, the quantile function being defined as follows:
$$q(\alpha) = \inf\{ s \in \mathbb{R} : F(s) \ge \alpha \} \qquad (2)$$
in terms of the cumulative distribution function $F$ of the output measure $S$. The quantile problem is fundamental in verification flows where circuit designers have to validate the design of a complete library of standard cells. The circuits must satisfy prescribed specifications; for example, the time delay must be less than some upper bound. We will show, through the examples, how this quantile estimator can be obtained in a fully automatic way with our algorithm.
The rest of the paper unfolds as follows. In Section 2, we present importance sampling in a Gaussian framework and further enhance it for very rare events. Then, in Section 3, we apply the methodology developed before to the computation of a default probability and its expected shortfall. In Section 4, we explain how importance sampling and stratified sampling can be coupled to improve the performance of our algorithm on some particularly demanding cases. Finally, Section 5 presents intensive numerical experiments on real electronic circuits.
2 Importance sampling
2.1 General framework
Assume the random vector $X$ has a density function $f$. Consider a family of density functions $(f_\theta)_{\theta \in \Theta}$, parametrized by a finite dimensional parameter $\theta$ and satisfying $f_\theta(x) > 0$ for all $x$ such that $f(x) \neq 0$. Then, we can write
$$\mathbb{E}[g(X)] = \mathbb{E}\left[ g(Y)\, \frac{f(Y)}{f_\theta(Y)} \right] \qquad (3)$$
where $Y$ is a random vector with density function $f_\theta$.
From a practical point of view, one has to find the best density function in order to maximize the accuracy of the Monte Carlo estimator. First, we need to make precise in which sense “best” has to be understood.
As the convergence of the Monte Carlo estimator is governed by the central limit theorem, it is quite natural to try to find the density function $f_\theta$ which minimizes the variance of the estimator. Only the second moment depends on the density function and, moreover, using Eq. (3) again, we obtain
$$v(\theta) = \mathbb{E}\left[ g(X)^2\, \frac{f(X)}{f_\theta(X)} \right].$$
If the optimization problem is well-posed (strongly convex for instance), we can solve this problem directly.
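As a toy illustration of the change of measure in Eq. (3), the sketch below estimates the tail probability $\mathbb{P}(X > 4)$ for $X \sim \mathrm{Exp}(1)$, using a heavier-tailed exponential as instrumental density. The example, the threshold 4, and the instrumental rate 0.25 are our own illustrative choices, not taken from the paper:

```python
import math
import random

random.seed(0)

def crude_mc(n, t=4.0):
    """Crude Monte Carlo estimate of P(X > t) for X ~ Exp(1)."""
    hits = sum(1 for _ in range(n) if random.expovariate(1.0) > t)
    return hits / n

def is_mc(n, t=4.0, lam=0.25):
    """Importance sampling with instrumental density Exp(lam), lam < 1.
    The likelihood ratio is f(y)/f_lam(y) = exp(-(1-lam)*y) / lam."""
    acc = 0.0
    for _ in range(n):
        y = random.expovariate(lam)      # sample from the instrumental law
        if y > t:
            acc += math.exp(-(1.0 - lam) * y) / lam
    return acc / n

p_true = math.exp(-4.0)                  # exact value, about 0.0183
print(crude_mc(100_000), is_mc(100_000), p_true)
```

Both estimators are unbiased; the importance sampling one simply concentrates the samples where the indicator is nonzero, which is the whole point of choosing a good $f_\theta$.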
2.2 The Gaussian case
Assume the density function $f$ stands for the standard Gaussian density in $\mathbb{R}^d$. Consider the family of Gaussian distributions $\mathcal{N}(\theta, I_d)$ with values in $\mathbb{R}^d$, parametrized by their expectations $\theta \in \mathbb{R}^d$. Then, the importance sampling formula Eq. (3) writes as
$$\mathbb{E}[g(G)] = \mathbb{E}\left[ g(G + \theta)\, e^{-\theta \cdot G - |\theta|^2/2} \right] \qquad (4)$$
where $G$ is a standard normal random vector in $\mathbb{R}^d$.
The second moment can also be written in a simplified way:
$$v(\theta) = \mathbb{E}\left[ g(G + \theta)^2\, e^{-2\theta \cdot G - |\theta|^2} \right] = \mathbb{E}\left[ g(G)^2\, e^{-\theta \cdot G + |\theta|^2/2} \right] \qquad (5)$$
where the last equality comes from the application of Eq. (4) with the function $x \mapsto g(x)^2 e^{-\theta \cdot x + |\theta|^2/2}$.
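The mean-shift identity of Eq. (4) can be checked numerically. The sketch below compares a plain estimate of $\mathbb{E}[g(G)]$ with its shifted counterpart, for the arbitrary illustrative choice $g = \mathbf{1}_{\{x > 1\}}$ and $\theta = 1$ in dimension one:

```python
import math
import random

random.seed(1)

theta, s, n = 1.0, 1.0, 200_000
plain = shifted = 0.0
for _ in range(n):
    g_smp = random.gauss(0.0, 1.0)
    # left-hand side of Eq. (4): E[g(G)]
    if g_smp > s:
        plain += 1.0
    # right-hand side: E[g(G + theta) * exp(-theta*G - theta**2/2)]
    if g_smp + theta > s:
        shifted += math.exp(-theta * g_smp - 0.5 * theta ** 2)
plain /= n
shifted /= n
print(plain, shifted)   # both close to P(G > 1), about 0.1587
```

The two sample means agree up to Monte Carlo noise; only their variances differ, which is what the criterion $v(\theta)$ quantifies.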
Proposition 2.1.
Assume that $\mathbb{E}\big[g(G)^2 e^{-\theta \cdot G}\big] < \infty$ for all $\theta \in \mathbb{R}^d$, and that $\mathbb{P}(g(G) \neq 0) > 0$. Then, the function $v$ is infinitely differentiable and strongly convex. Moreover, we can exchange derivation and expectation.
From this proposition, we can write
$$\nabla v(\theta) = \mathbb{E}\left[ g(G)^2\, (\theta - G)\, e^{-\theta \cdot G + |\theta|^2/2} \right] \qquad (6)$$
$$\nabla^2 v(\theta) = \mathbb{E}\left[ g(G)^2 \left( I_d + (\theta - G)(\theta - G)^T \right) e^{-\theta \cdot G + |\theta|^2/2} \right] \qquad (7)$$
Hence, the value $\theta^*$ minimizing the variance is uniquely defined by $\nabla v(\theta^*) = 0$. As the Hessian matrix is tractable at roughly the same cost as the gradient itself, one can implement a Newton algorithm with optimal descent direction to minimize $v$.
2.3 Existing approaches in EDA
Importance Sampling has been used with great success for many decades in multiple scientific fields, including structural safety engineering, aeronautics, communication networks, finance, and nuclear safety. Comparatively, the world of EDA actually lags rather behind. IS with Gaussian mixtures has notably been used in the context of SRAM simulation (see Kanj et al. (2006)).
The mathematics of the basic idea is rather straightforward, although some users have difficulties with the notion of sampling from a modified distribution. The performance of any IS strategy depends heavily on the chosen instrumental distribution, and the implementation must be very thorough: an incorrect choice of the instrumental distribution (as demonstrated in Hocevar et al. (1983)) or a poor implementation can lead to bad performance, and even to an increase of the variance.
Fortunately, at least compared to the field of mechanical engineering, the relation between the process parameters and the output is more often than not only mildly nonlinear. Given the difficulty of the problem, many approaches that yield only an approximation of the probability or quantile have been tried; Machine Learning techniques can help as well. None of these approaches can avoid the curse of dimensionality (see for example McConaghy et al. (2013), Singhee and Rutenbar (2009)), and the validity of the approximated confidence intervals, when they can be estimated, is questionable.
2.4 Algorithm for the mean shift approach in the Gaussian case
We consider the framework detailed in Section 2.2, relying on the variance criterion to determine the optimal parameter $\theta^*$ uniquely defined by $\nabla v(\theta^*) = 0$ thanks to Proposition 2.1. The idea of the algorithm is to replace all the expectations by empirical means and then to apply deterministic optimization techniques.
Algorithm 2.2 (Adaptive Sample Average Monte Carlo).
1. Draw $n$ i.i.d. samples $G_1, \dots, G_n$ following the normal distribution $\mathcal{N}(0, I_d)$.

2. Compute the minimizer $\theta_n$ of the empirical criterion $v_n(\theta) = \frac{1}{n} \sum_{i=1}^n g(G_i)^2\, e^{-\theta \cdot G_i + |\theta|^2/2}$ by solving $\nabla v_n(\theta_n) = 0$ using Newton's method.

3. Draw $N$ i.i.d. samples $\bar{G}_1, \dots, \bar{G}_N$ following the normal distribution $\mathcal{N}(0, I_d)$ and independent of $(G_1, \dots, G_n)$.

4. Compute the expectation by Monte Carlo using this new sample:
$$p_{n,N} = \frac{1}{N} \sum_{j=1}^N g(\bar{G}_j + \theta_n)\, e^{-\theta_n \cdot \bar{G}_j - |\theta_n|^2/2}.$$
The number $n$ of samples used for solving the optimization problem may differ from the number $N$ of samples used in the Monte Carlo procedure; how to balance the number of samples between these two steps has to be further studied. For $n$ large enough, $v_n$ inherits its strong convexity and differentiability from those of $v$.
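A one-dimensional sketch of the two-stage procedure could look as follows. This is our own illustration, not the paper's implementation: the failure indicator $g = \mathbf{1}_{\{x > 3\}}$, the sample sizes, and the Newton tolerances are arbitrary choices; the gradient and Hessian inside the loop are the sample counterparts of Eqs. (6) and (7):

```python
import math
import random

random.seed(11)

def g(x):
    """Failure indicator with illustrative threshold s = 3."""
    return 1.0 if x > 3.0 else 0.0

# Stage 1: n pilot samples, minimize the empirical second moment v_n
n = 20_000
pilot = [random.gauss(0.0, 1.0) for _ in range(n)]
theta = 0.0
for _ in range(50):                          # Newton iterations
    grad = hess = 0.0
    for x in pilot:
        if g(x):
            w = math.exp(-theta * x + 0.5 * theta * theta)
            grad += w * (theta - x)          # empirical version of Eq. (6)
            hess += w * ((theta - x) ** 2 + 1.0)  # empirical Eq. (7)
    if hess == 0.0:
        break
    step = grad / hess
    theta -= step
    if abs(step) < 1e-10:
        break

# Stage 2: N fresh samples, shifted estimator of p = P(G > 3)
N = 100_000
acc = 0.0
for _ in range(N):
    y = random.gauss(0.0, 1.0)
    acc += g(y + theta) * math.exp(-theta * y - 0.5 * theta * theta)
p_hat = acc / N
print(theta, p_hat)    # p is close to P(G > 3), about 1.35e-3
```

For this threshold, the optimal shift lands a little above 3, and the shifted estimator reaches a relative accuracy that crude Monte Carlo would need orders of magnitude more samples to match.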
The Sample Average approximation $\theta_n$ is known to converge to $\theta^*$ and to satisfy a central limit theorem under integrability conditions. We introduce the following condition:
(8) 
Proposition 2.3.
Assume that $\mathbb{E}\big[g(G)^2 e^{-\theta \cdot G}\big] < \infty$ for all $\theta \in \mathbb{R}^d$, and that $\mathbb{P}(g(G) \neq 0) > 0$. Then, $\theta_n$ and $v_n(\theta_n)$ converge a.s. to $\theta^*$ and $v(\theta^*)$ as $n \to \infty$. If, moreover, (8) holds, then $\sqrt{n}\,(\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}(0, \Gamma)$, where $\Gamma = \nabla^2 v(\theta^*)^{-1}\, \mathrm{Cov}\Big(\nabla_\theta \big( g(G)^2\, e^{-\theta \cdot G + |\theta|^2/2} \big)\big|_{\theta = \theta^*}\Big)\, \nabla^2 v(\theta^*)^{-1}.$
We refer to Jourdain and Lelong (2009) and Rubinstein and Shapiro (1993) for a proof of this result.
The convergence of the resulting Monte Carlo estimator is governed by the following result.
Proposition 2.4.
Let $N$ be an increasing function of $n$ tending to infinity with $n$. We assume the assumptions of Proposition 2.3. The following results hold for the estimator $p_{n,N}$ produced by Algorithm 2.2.

(i.) The estimator $p_{n,N}$ is an unbiased and convergent estimator of $p$;

(ii.) If the function $g$ is almost everywhere continuous and, for some $\eta > 0$ and some compact neighborhood $\mathcal{V}$ of $\theta^*$, $\sup_{\theta \in \mathcal{V}} \mathbb{E}\big[|g(G+\theta)|^{2+\eta}\big] < \infty$, then $\sqrt{N}\,(p_{n,N} - p) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, where $\sigma^2 = v(\theta^*) - p^2$.
Proof.
The proof of (i.) can be found in Badouraly Kassim et al. (2015).
We only give the guidelines to prove (ii.).
From the standard central limit theorem, . Therefore, it is sufficient to prove that . Let .
(9) 
We deduce from Proposition 2.3, that . We introduce
Conditionally on ,
is a sum of i.i.d. centered random variables. Then, it is sufficient to monitor differences as
on the set , which is a subset of for large enough . As is almost everywhere continuous, we deduce that the above difference a.s. goes to when tends to infinity. Let , the conditional independence combined with Hölder’s inequality yields that for large enough
From the local boundedness of , we deduce that
which proves the uniform integrability of the family . Then, we deduce that the second term in (9) tends to zero. ∎
An immediate consequence of this result is that the accuracy of a Monte Carlo algorithm based on can be monitored using the standard theory of confidence intervals.
A key issue from a practical and computational point of view is to fix $n$ for a given value of $N$, as $n$ has no impact on the convergence rate but only on how close the empirical variance is to the optimal variance. Hence, we are actually more interested in the convergence of $v_n(\theta_n)$ to $v(\theta^*)$ than in that of $\theta_n$ itself. Monitoring the convergence of $v_n(\theta_n)$ would give us a hint on how to choose $n$.
Proposition 2.5.
Under the assumptions of Proposition 2.3, $\sqrt{n}\,\big(v_n(\theta_n) - v(\theta^*)\big) \xrightarrow{d} \mathcal{N}(0, \varsigma^2)$, where $\varsigma^2 = \mathrm{Var}\big( g(G)^2\, e^{-\theta^* \cdot G + |\theta^*|^2/2} \big).$
Proof.
By the standard central limit theorem, converges in distribution to a normal random variable with mean and variance .
To handle the term , we use the following Taylor expansion
where is a point on the segment joining and . Since converges locally uniformly to (see Jourdain and Lelong (2009)), converges a.s. to . Then, we deduce from the convergence in distribution of (see Proposition 2.3) that goes to zero in probability. ∎
As usual, if we consider an estimator of , we obtain , which can be used in practice to determine the magnitude of . A natural estimator of can be computed online within the minimization algorithm
The convergence of to
ensues from the locally uniform strong law of large numbers for the sequence of random functions
, see Rubinstein and Shapiro (1993, Lemma A1) or Ledoux and Talagrand (1991, Corollary 7.10, page 189).

2.5 Using importance sampling for very rare event simulation
Assume the function $g$ writes as an indicator function $g(x) = \mathbf{1}_{\{S(x) > s\}}$, where the threshold $s$ makes the event almost never occur in practice: $p$ is much smaller than $1/n$. Then, the empirical updating formulae are not well defined and we will never manage to compute the mixture parameters straight away. In such a situation, Kroese et al. (2006) suggest using a ladder with a finite number of levels to reach the level $s$. Consider some levels $s_1 \le s_2 \le \dots \le s_K = s$, where $K$ is a priori unknown but computed online. In this scheme, the samples used at step $k$ are drawn according to the importance sampling distribution obtained from step $k-1$, which ensures that the level $s_k$ is never too rare.
In this section, we focus only on the mean shift approach. The multilevel approach is based on a step-by-step update of the importance density, thus ensuring the targeted event is never too rare. At step $k$, the samples are generated according to the importance sampling density computed at step $k-1$. Before stating the algorithm, we need to extend Eq. (4) to a wider setting.
Consider two $\mathbb{R}^d$-valued normal random vectors $G_\theta$ and $G_{\theta'}$, with densities $f_\theta$ and $f_{\theta'}$ respectively, for some $\theta, \theta' \in \mathbb{R}^d$. Then, from Eq. (3), we get
$$p = \mathbb{E}\left[ g(G_{\theta'})\, \frac{f(G_{\theta'})}{f_{\theta'}(G_{\theta'})} \right] \qquad (10)$$
$$v(\theta) = \mathbb{E}\left[ g(G_{\theta'})^2\, \frac{f(G_{\theta'})^2}{f_{\theta}(G_{\theta'})\, f_{\theta'}(G_{\theta'})} \right] \qquad (11)$$
These two formulae can be made more explicit by expanding the density functions:
$$\frac{f(y)}{f_{\theta'}(y)} = e^{-\theta' \cdot y + |\theta'|^2/2}, \qquad \frac{f(y)^2}{f_{\theta}(y)\, f_{\theta'}(y)} = e^{-(\theta + \theta') \cdot y + (|\theta|^2 + |\theta'|^2)/2}.$$
The Algorithm can be written as
Algorithm 2.6 (Multilevel importance sampling).
Fix $\rho$ in $(0, 1)$. Set $k = 1$ and $\theta_0 = 0$.

1. Draw some samples $(Y_1, \dots, Y_n)$ according to the distribution $\mathcal{N}(\theta_{k-1}, I_d)$.

2. Using the previous samples, compute the empirical $(1-\rho)$-quantile of $(S(Y_1), \dots, S(Y_n))$ and set it to $s_k$, provided it is less than $s$; otherwise set $s_k = s$.

3. Compute $\theta_k$ as the minimizer of the variance criterion
$$v_k(\theta) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{S(Y_i) > s_k\}}\, \frac{f(Y_i)^2}{f_\theta(Y_i)\, f_{\theta_{k-1}}(Y_i)}.$$
Then, set $k \leftarrow k + 1$.

4. If $s_k < s$, return to step 1 and proceed with the next level.

5. Draw some new samples $(\bar{G}_1, \dots, \bar{G}_N)$ following the standard normal distribution and independent of the past. Approximate $p$ by
$$p_N = \frac{1}{N} \sum_{j=1}^N g(\bar{G}_j + \theta_K)\, e^{-\theta_K \cdot \bar{G}_j - |\theta_K|^2/2}.$$
Using intermediate levels has no theoretical impact on the convergence of the algorithm; it is a practical though essential trick to circumvent the difficulty of sampling a rare event. If the levels are chosen so that the events are never too rare, then a fairly small number of samples can be used at each level. We know from Proposition 2.4 that the convergence rate of the final estimator is monitored by $N$, so the computational effort should be put on the last step.
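The multilevel procedure can be sketched in one dimension as follows. This is our own simplified illustration, not the paper's code: the performance is taken to be $S(x) = x$, the target threshold 5, and the level parameter and sample sizes are arbitrary; the level-$k$ criterion is minimized by the same Newton iteration as before, with likelihood-ratio weights bringing the samples back to the original density:

```python
import math
import random

random.seed(2)

s_final = 5.0        # rare threshold: p = P(G > 5), about 2.87e-7
rho = 0.1            # each level keeps the top 10% of samples
n = 20_000

def newton_theta(samples, weights, s, theta0):
    """Minimize sum_i w_i * 1{x_i > s} * exp(-t*x_i + t**2/2) over t."""
    t = theta0
    for _ in range(100):
        grad = hess = 0.0
        for x, w in zip(samples, weights):
            if x > s:
                e = w * math.exp(-t * x + 0.5 * t * t)
                grad += e * (t - x)
                hess += e * ((t - x) ** 2 + 1.0)
        if hess == 0.0:
            return t
        step = grad / hess
        t -= step
        if abs(step) < 1e-10:
            break
    return t

theta = 0.0
while True:
    xs = [theta + random.gauss(0.0, 1.0) for _ in range(n)]       # N(theta, 1)
    ws = [math.exp(-theta * x + 0.5 * theta * theta) for x in xs]  # ratio to N(0, 1)
    level = min(sorted(xs)[int((1.0 - rho) * n)], s_final)         # capped quantile
    theta = newton_theta(xs, ws, level, theta)
    if level >= s_final:
        break

# Final stage: estimate p with fresh shifted samples
N = 100_000
acc = 0.0
for _ in range(N):
    y = random.gauss(0.0, 1.0)
    if y + theta > s_final:
        acc += math.exp(-theta * y - 0.5 * theta * theta)
p_hat = acc / N
print(theta, p_hat)
```

Each pass raises the level by roughly one quantile step, so a handful of levels suffices to reach a $10^{-7}$ event that crude sampling with the same budget would essentially never see.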
Computation of $\theta_k$.
The vector $\theta_k$ is defined as the solution of a minimization problem whose objective is known to be strongly convex. Hence, the solution can be computed using the Newton algorithm with optimal descent direction, by searching for the root of the gradient.
When $n$ goes to infinity, the Hessian matrix converges to
$$\nabla^2 v(\theta) = \mathbb{E}\left[ g(G)^2 \left( I_d + (\theta - G)(\theta - G)^T \right) e^{-\theta \cdot G + |\theta|^2/2} \right],$$
whose smallest eigenvalue is larger than $v(\theta)$, since $I_d + (\theta - G)(\theta - G)^T \succeq I_d$.
Depending on the choice of the levels $s_k$, the lower bound $v(\theta)$ can become quite small, which may lead to a badly performing Newton algorithm: the smallest eigenvalue of the Hessian matrix measures the convexity of the function, and the less convex the function is, the slower the Newton algorithm converges. To circumvent this problem, we suggest considering an alternative characterization of $\theta^*$. The equation $\nabla v(\theta) = 0$ can be rewritten as $\nabla v(\theta) / v(\theta) = 0$.
If we introduce the function $w$ defined by $w(\theta) = \log v(\theta)$, then $\theta^*$ can be seen as the root of $\nabla w$ and we can prove that the Hessian matrix $\nabla^2 w$ is uniformly lower bounded by $I_d$. We refer the reader to Jourdain and Lelong (2009) for more details on this subtle transformation of the original problem to ensure a minimum convexity.
Note that, in practice, one needs to solve a linear system involving the Hessian at each iteration of the Newton algorithm. The Hessian matrix being dense, it is more efficient to use an iterative algorithm such as the conjugate gradient algorithm.
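Since each Newton iteration solves a dense symmetric positive definite system, a textbook conjugate gradient routine suffices. The sketch below, with an arbitrary illustrative 2×2 Hessian, solves $H d = -\nabla v$ for the Newton step $d$ without any external dependency:

```python
def conjugate_gradient(A, b, tol=1e-12, max_iter=200):
    """Solve A x = b for a symmetric positive definite matrix A
    (given as a list of rows), starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                        # residual b - A@x, with x = 0
    p = list(r)                        # first search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs <= tol * tol:            # residual small enough: converged
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Newton step: solve H d = -grad on a small illustrative SPD system
H = [[4.0, 1.0], [1.0, 3.0]]
grad = [1.0, 2.0]
d = conjugate_gradient(H, [-g for g in grad])
print(d)   # exact solution is (-1/11, -7/11)
```

On an SPD system of size $d$, conjugate gradient terminates in at most $d$ iterations in exact arithmetic, and in practice far fewer iterations give a usable Newton direction.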
The number of input variables may also be high. This dimension is directly linked to the number of electronic devices on a given circuit, and specifically to the number of transistors. It does not impact the theoretical convergence properties of the approach; however, as it grows, it becomes difficult to make sufficient progress during the multilevel algorithm. When the simulation budget is limited, we experienced non-convergence (and therefore failure) of the importance sampling algorithm when the number of variables becomes too large. A simple idea is to look for the importance sampling parameter in a subspace, writing $\theta = A\mu$ where $A$ is a matrix with rank smaller than the ambient dimension. In our situation, the matrix $A$ needs to be identified automatically; it is a submatrix of the identity matrix obtained by removing the columns corresponding to the unimportant variables. Our approach falls into the category of sequential deterministic methods: it produces the same subset on a given problem every time, by starting with a single solution (a variable subset) and iteratively adding or removing dimensions until some termination criterion is met. This approach is clearly suboptimal; there is no guarantee of finding the optimal subset (because we do not examine all possible subsets). We made some tests to validate the approach on large scale problems, see Section 5.1.

3 Computing a default probability and its expected shortfall
In this section, we assume that the function $g$ is given by $g(x) = \mathbf{1}_{\{S(x) > s\}}$, where $S : \mathbb{R}^d \to \mathbb{R}$ and $s \in \mathbb{R}$. We consider the problem of estimating both $p = \mathbb{P}(S(G) > s)$ and $\mathbb{E}[S(G) \mid S(G) > s]$, where $G$ is a standard normal random vector in $\mathbb{R}^d$. The latter quantity is called conditional value at risk (CVaR) or expected shortfall, and we denote it by $C$. The CVaR can be seen as a measure of the dispersion of the distribution tail. In particular, a CVaR close to $s$ means that the failing circuits are very concentrated, which justifies the practitioners' habit of picking a single failing circuit at random when trying to analyse the reasons of a failure. We will give a practical example in Section 5.2.2.
The default probability is computed by applying the importance sampling approach developed in Section 2 to its standard Monte Carlo estimator.
Note that $C$ can be written
$$C = \frac{\mathbb{E}\big[ S(G)\, \mathbf{1}_{\{S(G) > s\}} \big]}{\mathbb{P}(S(G) > s)}.$$
Based on Section 2.2, it is natural to consider the importance sampling version of the CVaR: for $\theta \in \mathbb{R}^d$,
$$C = \frac{\mathbb{E}\big[ S(G+\theta)\, \mathbf{1}_{\{S(G+\theta) > s\}}\, e^{-\theta \cdot G - |\theta|^2/2} \big]}{\mathbb{E}\big[ \mathbf{1}_{\{S(G+\theta) > s\}}\, e^{-\theta \cdot G - |\theta|^2/2} \big]} \qquad (12)$$
Note that we could afford different importance sampling parameters for the numerator and the denominator, each being the solution of an optimization problem. Since we aim at computing both $p$ and $C$, we will anyway have to compute the optimal parameter for the denominator. Considering the computational cost of solving such an optimization problem, we believe it is not worth using two different importance sampling parameters in (12).
Then, a standard estimator writes as
$$C_N = \frac{\sum_{j=1}^N S(G_j+\theta)\, \mathbf{1}_{\{S(G_j+\theta) > s\}}\, e^{-\theta \cdot G_j - |\theta|^2/2}}{\sum_{j=1}^N \mathbf{1}_{\{S(G_j+\theta) > s\}}\, e^{-\theta \cdot G_j - |\theta|^2/2}} \qquad (13)$$
Although the estimations of $p$ and $C$ are presented one after the other, in practice they are computed simultaneously using the same runs of the simulator. Thus, the two estimators are actually obtained for the cost of a single one.
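The shared-runs observation can be illustrated in one dimension. This sketch assumes, purely for illustration, $S(x) = x$, a threshold $s = 3$, and a mean shift $\theta = 3$ taken as already computed by the procedure of Section 2:

```python
import math
import random

random.seed(3)

s = 3.0        # failure threshold; illustrative performance S(x) = x
theta = 3.0    # importance-sampling mean shift, assumed precomputed
N = 200_000

num = den = 0.0
for _ in range(N):
    y = random.gauss(0.0, 1.0)
    x = y + theta                      # one sample from N(theta, 1)
    if x > s:
        # likelihood ratio shared by both estimators
        w = math.exp(-theta * y - 0.5 * theta * theta)
        num += x * w                   # numerator of the CVaR estimator
        den += w                       # numerator of the probability estimator

p_hat = den / N          # failure probability P(S > s)
cvar_hat = num / den     # expected shortfall E[S | S > s]
print(p_hat, cvar_hat)
```

A single sweep over the samples feeds both accumulators, so the marginal cost of the CVaR on top of the probability is one multiplication per failing sample.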
It is clear from the strong law of large numbers that $C_N \to C$ a.s. when $N \to \infty$. Due to the random nature of an estimator, it has to come with a confidence interval, which relies on a central limit theorem.
Proposition 3.1.
The following convergence holds:
$$\sqrt{N}\,\big( C_N - C \big) \xrightarrow{d} \mathcal{N}\big(0, \Sigma^2\big) \qquad (14)$$
with
$$\Sigma^2 = \frac{1}{p^2} \Big( \mathrm{Var}(W) - 2\,C\, \mathrm{Cov}(W, Z) + C^2\, \mathrm{Var}(Z) \Big) \qquad (15)$$
$$W = S(G+\theta)\, \mathbf{1}_{\{S(G+\theta) > s\}}\, e^{-\theta \cdot G - |\theta|^2/2}, \qquad Z = \mathbf{1}_{\{S(G+\theta) > s\}}\, e^{-\theta \cdot G - |\theta|^2/2} \qquad (16)$$
Proof.
To keep the notation of the proof as compact as possible, we introduce and . When , we simply drop the index. Similarly, for , we write .
From the standard multidimensional central limit theorem, it is known that
(17) 
where
This convergence in distribution combined with Slutsky’s theorem yields that converges in distribution to a normal random variable with mean and variance
Some comments on Proposition 3.1.
It is interesting to analyse the convergence result stated in Proposition 3.1 in light of the behaviour of another estimator of $C$. Let us introduce
Clearly, this estimator is unbiased. From the standard central limit theorem, we deduce that